Scrapling MCP Server - Project Documentation
Table of Contents
- Project Overview
- Project Scope
- Architecture & Structure
- Key Components
- MCP Tools Reference
- Stealth Levels
- Configuration
- Security Considerations
- Development Guidelines
- Usage Examples
- Appendix: API Reference
1. Project Overview
What is the Scrapling MCP Server?
The Scrapling MCP Server is a Model Context Protocol (MCP) server that provides web scraping capabilities through an integrated stealth-aware scraping engine. Built on top of the FastMCP framework and leveraging the scrapling library, this server exposes powerful web scraping tools that AI agents and applications can invoke through a standardized MCP interface.
The project bridges the gap between AI agents that need to fetch web content and the complex reality of modern web scraping—including anti-bot protections, JavaScript rendering requirements, Cloudflare challenges, and the need for stealthy request patterns.
Scrapling Library
The server leverages Scrapling, an adaptive web scraping framework with 9.1k GitHub stars that provides multiple fetcher types:
| Fetcher | Use Case |
|---|---|
| Fetcher | Fast HTTP requests with TLS fingerprinting and HTTP/3 support |
| DynamicFetcher | Full browser automation using Playwright |
| StealthyFetcher | Advanced anti-bot bypass using Camoufox (modified Firefox) |
| AsyncStealthySession | Concurrent stealth browsing with tab pooling |
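For orientation, here is a minimal sketch of invoking two of these fetchers directly. It follows scrapling's published API, though exact signatures may vary between versions:
from scrapling.fetchers import Fetcher, StealthyFetcher

# Fast HTTP request with TLS fingerprinting (no browser involved)
page = Fetcher.get("https://example.com")
print(page.status, page.css_first("title::text"))

# Camoufox-backed stealth fetch for protected pages
page = StealthyFetcher.fetch("https://example.com", headless=True)
print(page.status)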
Purpose and Goals
The primary purpose of this MCP server is to enable AI agents to:
- Fetch web content reliably from websites with varying levels of anti-bot protection
- Render JavaScript when necessary to access dynamically loaded content
- Bypass common anti-bot measures through configurable stealth settings
- Handle session-based scraping for websites requiring authentication or stateful interactions
- Extract structured data using CSS selectors from scraped pages
The project aims to provide a balance between:
- Ease of use - Simple API for common scraping tasks
- Flexibility - Extensive configuration options for advanced use cases
- Reliability - Built-in retry logic and error handling
- Security - URL validation and safe defaults
Key Features and Capabilities
| Feature | Description |
|---|---|
| JavaScript Rendering | Full browser-based rendering for dynamic content |
| Stealth Modes | Multiple pre-configured stealth levels (Minimal, Standard, Maximum) |
| Cloudflare Support | Automatic Cloudflare challenge detection and solving |
| Session Management | Persistent sessions for stateful scraping |
| Proxy Rotation | Support for proxy lists with automatic rotation |
| Retry Logic | Exponential backoff with configurable retry attempts |
| CSS Extraction | Structured data extraction using CSS selectors |
| URL Validation | Built-in SSRF protection and security checks |
| MCP Integration | Native MCP protocol support for AI agent integration |
| Spider Framework | Scrapy-like API with async callbacks, concurrent crawling, and pause/resume support |
| Adaptive Parsing | Smart element tracking that survives website design changes |
| Camoufox Integration | Modified Firefox browser with stealth patches for maximum anti-detection |
2. Project Scope
What the Project Does
The Scrapling MCP Server provides a collection of MCP tools that allow AI agents to:
- Simple Scraping - Fetch a URL and retrieve HTML content
- Stealth Scraping - Fetch URLs with configurable anti-detection measures
- Session-Based Scraping - Maintain cookies and state across multiple requests
- Structured Extraction - Extract specific data using CSS selectors
- Batch Scraping - Process multiple URLs in sequence
Target Use Cases
The server is designed for the following use cases:
- AI Agent Web Research - Enabling AI agents to gather information from the web
- Data Collection - Automated gathering of publicly available web data
- Content Aggregation - Building datasets from multiple web sources
- Monitoring & Alerting - Watching web pages for changes
- API Alternative - Accessing websites that lack public APIs
Supported Scraping Modes
Simple Mode
- Basic HTTP requests without stealth features
- Fastest performance
- Suitable for well-behaved websites without anti-bot protection
- No JavaScript rendering
Stealth Mode
- Configurable anti-detection features
- User-Agent randomization
- Human-like behavior simulation
- Browser automation with headless Camoufox (a stealth-patched Firefox)
Session-Based Mode
- Persistent cookie storage
- State maintenance across requests
- Authentication handling
- Ideal for authenticated scraping
What's In Scope
- HTTP/HTTPS scraping with JavaScript rendering support
- Stealth configuration with multiple preset levels
- Session management for stateful interactions
- Error handling with automatic retry logic
- URL validation for security
- MCP protocol integration
What's Out of Scope
- Authentication handling - While sessions are supported, credential management is outside scope
- CAPTCHA solving - No built-in CAPTCHA solving capabilities (Cloudflare challenges only)
- Distributed scraping - Single-instance operation
- Data storage - The server fetches and returns data but doesn't persist it
- Legal compliance - Users are responsible for ensuring their scraping activities are legal
3. Architecture & Structure
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ AI Agent / Client │
│ (Claude, GPT, or other MCP clients) │
└─────────────────────────────────┬───────────────────────────────────────┘
│ MCP Protocol (JSON-RPC)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MCP Server (FastMCP) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ MCP Tools Layer │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────┐ │ │
│ │ │ scrape │ │ stealth │ │ session │ │ extract │ │ │
│ │ │ simple │ │ scrape │ │ scrape │ │ structured │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Core Logic Layer │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Stealth │ │ Session │ │ Retry & Error │ │ │
│ │ │ Config │ │ Management │ │ Handling │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
└──────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Scrapling Integration Layer │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ AsyncStealthySession (scrapling library) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ │
│ │ │ Browser │ │ HTTP │ │ Anti-Detection │ │ │
│ │ │ Pool │ │ Client │ │ Features │ │ │
│ │ └──────────────┘ └──────────────┘ └────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
└──────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Target Website │
│ (Any HTTP/HTTPS accessible URL) │
└─────────────────────────────────────────────────────────────────────────┘
Directory Structure
mcp-scraper/
├── .env.example # Example environment configuration
├── .gitignore # Git ignore patterns
├── pyproject.toml # Python project configuration
├── README.md # Project README
├── src/
│ └── mcp_scraper/
│ ├── __init__.py # Package initialization
│ ├── config.py # Configuration classes and settings
│ └── stealth.py # Stealth utilities and scraping logic
└── tests/ # Test directory (to be added)
Key Components and Their Responsibilities
MCP Server (__init__.py)
- Responsibility: Package initialization and exports
- Exports: Settings, StealthConfig, version information
Configuration Module (config.py)
- Responsibility: Define settings and configuration classes
- Components:
  - StealthConfig dataclass - Detailed stealth configuration options
  - Settings class - Environment-based settings using Pydantic
  - StealthProfiles class - Pre-configured stealth profiles
Stealth Module (stealth.py)
- Responsibility: Core scraping logic, session management, and utilities
- Components:
  - StealthConfig class - Stealth configuration with all options
  - StealthLevel enum - Preset stealth levels (MINIMAL, STANDARD, MAXIMUM)
  - scrape_with_retry() - Main scraping function with retry logic
  - get_session() - Session management
  - validate_url() - URL security validation
  - format_response() - Response formatting utility
Data Flow
1. Request Receipt: Client sends MCP request with URL and optional parameters
2. URL Validation: System validates URL for security (SSRF protection)
3. Configuration: Stealth settings are applied based on parameters
4. Session Management: Get or create stealth session
5. Scraping: Execute HTTP request through scrapling engine
6. Response Processing: Format response with requested data
7. Error Handling: Apply retry logic if needed
8. Return: Send formatted response back to client
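The same flow, condensed into a sketch (the function names follow this project's stealth.py; the exact signatures are assumptions):
from mcp_scraper.stealth import (
    validate_url,
    get_standard_stealth,
    scrape_with_retry,
    format_response,
)

async def handle_request(url: str) -> dict:
    if not validate_url(url):  # step 2: SSRF protection
        raise ValueError(f"Unsafe URL rejected: {url}")
    config = get_standard_stealth()  # step 3: apply stealth settings
    # Steps 4-5 and 7: scrape_with_retry acquires the session and retries on failure
    page = await scrape_with_retry(url, config)
    return format_response(page, url)  # steps 6 and 8: format and return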
4. Key Components
MCP Server (FastMCP)
The MCP server is built using FastMCP, a modern framework for creating MCP servers in Python. FastMCP provides:
- Simple tool definition using decorators
- Automatic type conversion between Python and JSON
- Built-in error handling for tool execution
- Async support for concurrent operations
The server exposes scraping functionality as MCP tools that clients can invoke.
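As a brief illustration, a FastMCP server registers a tool with a decorator and starts with mcp.run() (the tool body is elided here):
from fastmcp import FastMCP

mcp = FastMCP("Scrapling Scraper")

@mcp.tool()
async def scrape_simple(url: str, timeout: int = 30000) -> dict:
    """Fetch a URL and return its content."""
    ...  # delegate to the scraping layer

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default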
Scrapling Integration
The server integrates with the scrapling library, which provides:
- AsyncStealthySession: An async session with built-in anti-detection features and tab pooling
- StealthyFetcher: Advanced anti-bot bypass using Camoufox (modified Firefox)
- Page object: Unified interface for accessing page content
- Browser automation: Headless browser with stealth features
- JavaScript rendering: Full DOM rendering for dynamic content
- Spider Framework: Scrapy-like API with concurrent crawling, pause/resume, and streaming mode
- Adaptive Parsing: Smart element tracking that survives website design changes
Stealth Configuration
The stealth system provides multiple configuration options:
| Option | Description | Default |
|---|---|---|
| headless | Run browser in headless mode | True |
| solve_cloudflare | Attempt Cloudflare challenges | False |
| humanize | Human-like behavior simulation | True |
| humanize_duration | Maximum cursor movement duration in seconds | 1.5 |
| geoip | Match browser locale/timezone to the IP's geolocation (recommended with proxies) | False |
| os_randomize | Randomize OS fingerprint | True |
| block_webrtc | Block WebRTC to prevent IP leaks | True |
| allow_webgl | Allow WebGL fingerprinting | True |
| google_search | Set the referer as if arriving from a Google search | True |
| block_images | Block image loading | False |
| block_ads | Block advertisements | True |
| disable_resources | Disable CSS/JS resources | False |
| network_idle | Wait for network inactivity before returning | False |
| load_dom | Wait for DOMContentLoaded event | False |
| wait_selector | Wait for specific element to appear | None |
| wait_selector_state | Element state to wait for (visible/hidden/attached) | None |
| timeout | Request timeout in milliseconds | 30000 |
| proxy | Proxy URL for requests | None |
Session Management
The session management system:
- Global session cache: Maintains a single session instance
- Config-aware: Recreates session when configuration changes
- Proper cleanup: Ensures resources are released on close
- Cookie persistence: Maintains cookies across requests
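A minimal sketch of that config-aware cache, assuming AsyncStealthySession is importable from scrapling.fetchers and usable as an async context manager (its constructor options here are assumptions):
from scrapling.fetchers import AsyncStealthySession

_session = None
_session_config = None

async def get_session(config):
    """Return the cached session, recreating it when the config changes."""
    global _session, _session_config
    if _session is not None and _session_config != config:
        await _session.__aexit__(None, None, None)  # release old browser resources
        _session = None
    if _session is None:
        _session = AsyncStealthySession(headless=config.headless)
        await _session.__aenter__()
        _session_config = config
    return _session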
Error Handling & Retry Logic
The retry system implements:
- Exponential backoff: Delay increases exponentially between retries
- Proxy rotation: Automatic proxy switching on block detection
- Cloudflare handling: Detection and optional solving of challenges
- Block detection: Identifies when requests are blocked
- Custom exceptions: Specific error types for different failure modes
Exception Hierarchy:
ScrapeError (base)
├── CloudflareError - Cloudflare protection detected
├── BlockedError - Request blocked by anti-bot
└── TimeoutError - Request timed out
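A condensed sketch of how the retry loop can combine exponential backoff with proxy rotation; do_fetch is a hypothetical stand-in for the underlying session call:
import asyncio

from mcp_scraper.stealth import BlockedError  # from the hierarchy above

async def do_fetch(url, config):
    ...  # hypothetical stand-in for the real session fetch

async def fetch_with_backoff(url, config, max_retries=3, backoff_factor=2.0, proxy_list=None):
    last_error = None
    for attempt in range(max_retries + 1):
        if proxy_list:
            # Rotate to the next proxy on each attempt (block-detection path)
            config.proxy = proxy_list[attempt % len(proxy_list)]
        try:
            return await do_fetch(url, config)
        except BlockedError as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff: 1s, 2s, 4s, ... between attempts
                await asyncio.sleep(backoff_factor ** attempt)
    raise last_error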
5. MCP Tools Reference
Available Tools
The following MCP tools are available. Note: The exact tool names and parameters depend on the implementation. Below are the conceptual tools provided by the server.
Tool: scrape_simple
Simple web scraping without stealth features. Uses the Fetcher class for fast HTTP requests with TLS fingerprinting.
When to use:
- Fast scraping of static content
- Well-behaved websites without protection
- Initial testing and development
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| selector | string | No | CSS selector for targeted extraction |
| extract | string | No | What to extract: "text", "html", or "both" (default: "text") |
| timeout | integer | No | Request timeout in milliseconds (default: 30000) |
Example:
{
"url": "https://example.com",
"timeout": 30000
}
Tool: scrape_stealth
Stealth web scraping with configurable anti-detection. Uses the StealthyFetcher class with Camoufox (modified Firefox) for maximum stealth.
When to use:
- Websites with basic anti-bot measures
- When avoiding detection is important
- Rate-limited endpoints
- Cloudflare-protected sites
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| stealth_level | string | No | "minimal", "standard", or "maximum" (default: "standard") |
| solve_cloudflare | boolean | No | Attempt Cloudflare challenges (default: false) |
| network_idle | boolean | No | Wait for network inactivity (default: true) |
| load_dom | boolean | No | Wait for DOMContentLoaded (default: true) |
| timeout | integer | No | Request timeout in milliseconds (default: 30000) |
| proxy | string | No | Proxy URL for requests |
Example:
{
"url": "https://example.com",
"stealth_level": "maximum",
"solve_cloudflare": true,
"network_idle": true,
"timeout": 60000
}
Tool: scrape_session
Session-based scraping with persistent state.
When to use:
- Websites requiring authentication
- Multi-step interactions
- Maintaining login state
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| session_id | string | No | Session identifier for persistence |
| cookies | object | No | Initial cookies to set |
| stealth_level | string | No | Stealth level (default: "standard") |
Example:
{
"url": "https://example.com/dashboard",
"session_id": "user-session-123",
"cookies": {"auth": "token-value"}
}
Tool: extract_structured
Extract structured data using CSS selectors.
When to use:
- Extracting specific data from pages
- Building datasets from web content
- Structured data acquisition
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| selectors | object or string | Yes | Map of name → CSS selector; accepts a JSON object or its JSON-string representation |
| stealth_level | string | No | Stealth level (default: "standard") |
Selector Syntax:
The selectors parameter supports CSS selector syntax with the following extensions:
| Syntax | Description | Example |
|---|---|---|
| selector | Extract text content | "title": "h1" |
| selector::html | Extract HTML content | "content": "div::html" |
| selector::text | Extract text using the ::text pseudo-element | "text": "p::text" |
| selector::attr(name) | Extract attribute value | "link": "a::attr(href)" |
| selector@attr | Extract attribute (alternative syntax) | "image": "img@src" |
| selector@attr1@attr2 | Extract multiple attributes | "data": "img@src@alt" |
Example with dict input:
{
"url": "https://example.com/blog",
"selectors": {
"title": "h1.article-title",
"content": "div.article-content",
"author": "span.author-name",
"date": "time.publish-date",
"link": "a.read-more::attr(href)",
"image": "img.featured@src@alt"
}
}
Example with JSON string input:
{
"url": "https://example.com/blog",
"selectors": "{\"title\": \"h1.article-title\", \"content\": \"div.article-content\", \"link\": \"a.read-more::attr(href)\"}"
}
Tool: scrape_batch
Scrape multiple URLs in sequence.
When to use:
- Processing multiple pages
- Building site-wide datasets
- Bulk data collection
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
urls | array | Yes | List of URLs to scrape |
stealth_level | string | No | Stealth level (default: "standard") |
delay | float | No | Delay between requests in seconds (default: 1.0) |
Example:
{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"stealth_level": "minimal",
"delay": 2.0
}
6. Stealth Levels
Overview
The server provides three pre-configured stealth levels, each balancing speed, anonymity, and success rate differently.
Minimal Stealth
Profile: StealthLevel.MINIMAL or get_minimal_stealth()
| Setting | Value |
|---|---|
| Headless | Yes |
| Humanize | No |
| Humanize Duration | N/A |
| Cloudflare solving | No |
| OS randomization | No |
| WebRTC blocking | No |
| Google-referer simulation | No |
| Image blocking | Yes |
| Resource disabling | Yes |
| Ad blocking | Yes |
| Network Idle | No |
| Load DOM | No |
| Timeout | 15s |
When to use:
- Simple websites without anti-bot protection
- High-speed scraping where stealth is not critical
- Testing and development
- Static content and APIs
Performance: Fastest - suitable for high-volume scraping of cooperative sites
Standard Stealth
Profile: StealthLevel.STANDARD or get_standard_stealth()
| Setting | Value |
|---|---|
| Headless | Yes |
| Humanize | Yes |
| Humanize Duration | 1.5s |
| Cloudflare solving | No |
| OS randomization | Yes |
| WebRTC blocking | Yes |
| Google-referer simulation | Yes |
| Image blocking | No |
| Resource disabling | No |
| Ad blocking | Yes |
| Network Idle | Yes |
| Load DOM | Yes |
| Timeout | 30s |
When to use:
- Most web scraping tasks
- Sites with basic anti-bot protection
- General-purpose scraping
- Balance of speed and anonymity required
Performance: Moderate - suitable for most common scraping scenarios
Maximum Stealth
Profile: StealthLevel.MAXIMUM or get_maximum_stealth()
| Setting | Value |
|---|---|
| Headless | Yes |
| Humanize | Yes |
| Humanize Duration | 1.5s |
| Cloudflare solving | Yes |
| OS randomization | Yes |
| WebRTC blocking | Yes |
| Google-referer simulation | Yes |
| Image blocking | No |
| Resource disabling | No |
| Ad blocking | Yes |
| GeoIP matching | Yes |
| Network Idle | Yes |
| Load DOM | Yes |
| Wait Selector | body |
| Wait Selector State | visible |
| Timeout | 60s |
When to use:
- Heavily protected websites
- Cloudflare-protected sites
- Rate-limited endpoints
- Maximum anonymity required
- Challenging anti-bot systems
Performance: Slowest - but highest success rate on protected sites
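Selecting a profile programmatically can be a simple dispatch on the enum. A sketch using the preset helpers named above (the wrapper itself is illustrative):
from mcp_scraper.stealth import (
    StealthLevel,
    get_minimal_stealth,
    get_standard_stealth,
    get_maximum_stealth,
)

def get_stealth_config(level: StealthLevel):
    """Map a preset level to its profile factory."""
    return {
        StealthLevel.MINIMAL: get_minimal_stealth,
        StealthLevel.STANDARD: get_standard_stealth,
        StealthLevel.MAXIMUM: get_maximum_stealth,
    }[level]()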
Configuration Options
You can also create custom stealth configurations:
from mcp_scraper.stealth import StealthConfig
custom_config = StealthConfig(
headless=True,
solve_cloudflare=True,
humanize=True,
humanize_duration=1.5,
geoip=False,
os_randomize=True,
block_webrtc=True,
allow_webgl=True,
google_search=True,
block_images=False,
block_ads=True,
disable_resources=False,
network_idle=True,
load_dom=True,
wait_selector="body",
wait_selector_state="visible",
timeout=45000,
proxy="http://proxy:8080"
)
7. Configuration
Environment Variables
Create a .env file based on .env.example:
# Proxy URL for requests (optional)
# Format: http://user:pass@host:port or socks5://host:port
PROXY_URL=
# Default timeout for requests in seconds (1-300)
DEFAULT_TIMEOUT=30
# Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_LEVEL=INFO
# Maximum number of retries for failed requests (0-10)
MAX_RETRIES=3
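A sketch of the Settings class that would back these variables, assuming standard pydantic-settings conventions (field names map case-insensitively to the env vars above):
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    proxy_url: str | None = None  # PROXY_URL
    default_timeout: int = 30     # DEFAULT_TIMEOUT (seconds, 1-300)
    log_level: str = "INFO"       # LOG_LEVEL
    max_retries: int = 3          # MAX_RETRIES (0-10)

settings = Settings()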
Configuration Files
pyproject.toml
The project uses pyproject.toml for Python package configuration:
[project]
name = "mcp-scraper"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
"scrapling[all]",
"fastmcp>=2.0",
"httpx>=0.25",
"pydantic>=2.0",
"pydantic-settings>=2.0",
"python-dotenv>=1.0",
"loguru>=0.7",
]
Proxy Setup
Single Proxy
Set via environment variable:
PROXY_URL=http://proxy.example.com:8080
Or programmatically:
config = StealthConfig(proxy="http://proxy.example.com:8080")
Proxy Rotation
For proxy rotation, pass a list of proxies to the scraping function:
proxy_list = [
"http://proxy1:8080",
"http://proxy2:8080",
"http://proxy3:8080",
]
page = await scrape_with_retry(
url="https://example.com",
proxy_list=proxy_list,
max_retries=3
)
Supported Proxy Formats
| Protocol | Format |
|---|---|
| HTTP | http://host:port |
| HTTPS | https://host:port |
| SOCKS5 | socks5://host:port |
| With auth | http://user:pass@host:port |
Timeout Settings
Request Timeout
Set per-request or globally:
# Per-request (in milliseconds)
page = await session.fetch(url, timeout=60000)
# Global (via Settings)
# DEFAULT_TIMEOUT=60 in .env (in seconds)
Recommended values:
| Scenario | Timeout |
|---|---|
| Simple static pages | 15-30s (15000-30000ms) |
| Standard scraping | 30-45s (30000-45000ms) |
| Complex JavaScript | 45-60s (45000-60000ms) |
| Slow/blocked sites | 60-120s (60000-120000ms) |
8. Security Considerations
URL Validation
The server implements robust URL validation to prevent Server-Side Request Forgery (SSRF) attacks:
Allowed:
- http:// and https:// protocols only
- Public IP addresses
- Public domain names
Blocked:
- file://, ftp://, and other protocols
- Private IP addresses (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
- Localhost variants (localhost, 127.0.0.1, ::1)
- Internal hostnames (*.local, *.internal, *.corp)
- Link-local addresses (169.254.x.x)
The validation function validate_url() is called automatically before any scraping operation.
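A simplified sketch of those checks; the real validate_url() may go further (for example, resolving DNS before comparing against private ranges):
import ipaddress
from urllib.parse import urlparse

BLOCKED_SUFFIXES = (".local", ".internal", ".corp")

def validate_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # file://, ftp://, etc. are rejected
    host = (parsed.hostname or "").lower()
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal; the hostname checks above apply
    # Reject private (10.x, 172.16-31.x, 192.168.x), loopback, and link-local ranges
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)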
Proxy Security
When using proxies:
- Use trusted proxies - Avoid free/public proxy lists
- Encrypt credentials - Don't hardcode proxy credentials
- Validate proxy URLs - Ensure proxy URLs are valid format
- Rotate responsibly - Don't abuse proxy rotation
Rate Limiting
To avoid overwhelming target sites:
- Use appropriate delays - Add randomized pauses between requests (see the sketch after this list)
- Implement backoff - Use exponential backoff on failures
- Respect robots.txt - Check and follow site policies
- Monitor responses - Watch for rate limit indicators
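A small sketch of jittered pacing between requests (illustrative; the batch tool exposes a fixed delay parameter instead):
import asyncio
import random

async def polite_sleep(base_delay: float = 1.0, jitter: float = 0.5) -> None:
    """Sleep for base_delay plus random jitter so request timing is not uniform."""
    await asyncio.sleep(base_delay + random.uniform(0.0, jitter))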
Legal Compliance
Users are responsible for:
- Ensuring their scraping activities are legal in their jurisdiction
- Respecting website Terms of Service
- Complying with robots.txt directives
- Not bypassing authentication mechanisms they don't have access to
- Handling personal data appropriately
Best practices:
- Only scrape publicly available data
- Identify your scraper in User-Agent when appropriate
- Cache responses to minimize repeated requests
- Consider using official APIs when available
9. Development Guidelines
Best Practices for Undetectable Scraping
Follow these best practices to maximize scraping success while minimizing detection:
- Always use sessions: Reuse browser instances to maintain consistent fingerprints
- Enable geoip with proxies: Match browser locale to proxy location for better anonymity
- Use solve_cloudflare sparingly: Only when needed - it increases detection surface and slows down requests
- Implement exponential backoff: Slow down after failures and only ramp the request rate back up after sustained success
- Rotate user agents: Even with Camoufox, periodic rotation helps avoid pattern detection
- Monitor for blocks: Track 403/429 responses and adjust strategy accordingly
- Enable network_idle and load_dom: Wait for page to fully load before extracting data
- Use wait_selector for dynamic content: Wait for specific elements to appear before extraction
Recommended Configuration Patterns
# Pattern 1: Simple static pages
simple_config = StealthConfig(
headless=True,
disable_resources=True,
timeout=10000
)
# Pattern 2: Protected sites (Cloudflare)
protected_config = StealthConfig(
headless=True,
solve_cloudflare=True,
humanize=True,
geoip=True,
os_randomize=True,
timeout=60000,
google_search=True,
network_idle=True,
load_dom=True
)
# Pattern 3: High-anonymity scraping
anonymous_config = StealthConfig(
headless=True,
block_webrtc=True,
block_images=True,
disable_resources=True,
os_randomize=True,
geoip=True,
solve_cloudflare=True,
humanize=True,
humanize_duration=1.5,
proxy=next(proxy_rotation) # hypothetical rotating-proxy iterator, e.g. itertools.cycle(proxies)
)
# Pattern 4: Debugging (visible browser)
debug_config = StealthConfig(
headless=False, # Visible browser
timeout=120000 # Long timeout for manual intervention
)
How to Extend the Server
Adding New Tools
To add a new MCP tool, follow this pattern:
from fastmcp import FastMCP
from mcp_scraper.stealth import (
    scrape_with_retry,
    get_standard_stealth,
    validate_url,
    format_response,
)

mcp = FastMCP("My Scraper")

@mcp.tool()
async def scrape_with_custom_option(
    url: str,
    custom_option: bool = False
) -> dict:
    """Description of what this tool does.

    Args:
        url: The URL to scrape
        custom_option: Description of custom option

    Returns:
        Dictionary with scraping results
    """
    # Validate URL
    if not validate_url(url):
        raise ValueError(f"Invalid URL: {url}")

    # Get stealth config
    config = get_standard_stealth()

    # Apply custom options
    if custom_option:
        config.timeout = 60000  # example: extend the timeout

    # Scrape
    page = await scrape_with_retry(url, config)

    # Format and return
    return format_response(page, url)
Adding New Stealth Profiles
Add new preset configurations in config.py:
@staticmethod
def custom_profile() -> StealthConfig:
"""Custom profile description.
Suitable for: Your specific use case
"""
return StealthConfig(
# Custom settings
humanize=True,
# ... other options
)
Adding Error Types
Extend the exception hierarchy in stealth.py:
class RateLimitError(ScrapeError):
"""Exception raised when rate limited."""
pass
Testing Approach
Unit Tests:
- Test URL validation
- Test configuration classes
- Test response formatting
Integration Tests:
- Test scraping with mock servers
- Test retry logic
- Test error handling
Example test structure:
tests/
├── __init__.py
├── test_config.py
├── test_stealth.py
├── test_validation.py
└── test_integration.py
Code Style
The project follows:
- Black for code formatting (100 character line length)
- Ruff for linting
- MyPy for type checking
- PEP 8 naming conventions
Key style rules:
- Use type hints for all function parameters and return values
- Use dataclasses for configuration objects
- Use async/await for I/O operations
- Use Loguru for logging
- Document all public functions with docstrings
Pre-commit hooks:
pip install pre-commit
pre-commit install
10. Usage Examples
Basic Scraping Example
Simple scraping of a static webpage:
from mcp_scraper.stealth import scrape_with_retry, format_response
async def basic_example():
url = "https://example.com"
# Simple scrape
page = await scrape_with_retry(url)
# Format response
result = format_response(page, url)
print(f"Title: {result.get('title')}")
print(f"Text content: {result.get('text')[:500]}")
print(f"Status: {result.get('status')}")
Stealth Scraping Example
Scraping a protected website with maximum stealth:
from mcp_scraper.stealth import (
scrape_with_retry,
get_maximum_stealth,
format_response
)
async def stealth_example():
url = "https://protected-site.com/data"
# Use maximum stealth
config = get_maximum_stealth()
try:
page = await scrape_with_retry(
url,
config=config,
max_retries=3
)
result = format_response(page, url)
print(f"Success! Content length: {len(result.get('html', ''))}")
except Exception as e:
print(f"Scraping failed: {e}")
Batch Scraping Example
Processing multiple URLs:
from mcp_scraper.stealth import scrape_with_retry, format_response, validate_url
import asyncio
async def batch_example():
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
]
results = []
delay = 1.0 # Delay between requests
for url in urls:
# Validate first
if not validate_url(url):
print(f"Skipping invalid URL: {url}")
continue
try:
page = await scrape_with_retry(url)
result = format_response(page, url)
results.append(result)
print(f"Scraped: {url}")
except Exception as e:
print(f"Failed to scrape {url}: {e}")
# Delay between requests
await asyncio.sleep(delay)
print(f"Successfully scraped {len(results)}/{len(urls)} URLs")
return results
Structured Extraction Example
Extracting specific data using CSS selectors:
from mcp_scraper.stealth import scrape_with_retry, format_response
async def structured_example():
url = "https://example.com/blogposts"
# Define selectors for data extraction
selectors = {
"titles": "h2.post-title",
"authors": "span.author",
"dates": "time.published",
"summaries": "p.summary",
"links": "a.read-more@href"
}
# Scrape with selectors
page = await scrape_with_retry(url, selectors=selectors)
# Get formatted response with extracted data
result = format_response(page, url, selectors=selectors)
# Access extracted data
extracted = result.get("selectors", {})
for i, title in enumerate(extracted.get("titles", [])):
print(f"Post {i+1}: {title}")
print(f" Author: {extracted.get('authors', [None])[i]}")
print(f" Date: {extracted.get('dates', [None])[i]}")
Custom Configuration Example
Using custom stealth settings:
from mcp_scraper.stealth import StealthConfig, scrape_with_retry
async def custom_config_example():
# Create custom stealth configuration
config = StealthConfig(
headless=True,
solve_cloudflare=True, # Attempt Cloudflare challenges
humanize=True,
humanize_duration=1.5,
geoip=False,
os_randomize=True,
block_webrtc=True,
allow_webgl=True,
google_search=True,
block_images=True, # Reduce bandwidth
block_ads=True,
disable_resources=False,
network_idle=True,
load_dom=True,
timeout=45000,
proxy="http://my-proxy:8080" # Use specific proxy
)
url = "https://cloudflare-protected-site.com"
page = await scrape_with_retry(url, config=config, max_retries=5)
print(f"Success! Content: {page.text[:200]}")
Proxy Rotation Example
Using multiple proxies with automatic rotation:
from mcp_scraper.stealth import scrape_with_retry, get_standard_stealth
async def proxy_rotation_example():
# List of proxy servers
proxy_list = [
"http://proxy1:8080",
"http://proxy2:8080",
"http://proxy3:8080",
]
config = get_standard_stealth()
try:
page = await scrape_with_retry(
url="https://example.com",
config=config,
max_retries=3,
proxy_list=proxy_list
)
print(f"Success with proxy rotation!")
except Exception as e:
print(f"All proxies failed: {e}")
Appendix: API Reference
Core Classes
StealthConfig
Configuration class for stealth web scraping.
Attributes:
- headless (bool): Run browser in headless mode
- solve_cloudflare (bool): Attempt Cloudflare challenges
- humanize (bool): Add human-like behavior
- humanize_duration (float): Maximum cursor movement duration
- geoip (bool): Match fingerprint to the IP's geolocation
- os_randomize (bool): Randomize OS fingerprint
- block_webrtc (bool): Block WebRTC
- allow_webgl (bool): Allow WebGL
- google_search (bool): Set a Google-search referer
- block_images (bool): Block images
- block_ads (bool): Block advertisements
- disable_resources (bool): Disable CSS/JS
- network_idle (bool): Wait for network inactivity
- load_dom (bool): Wait for DOMContentLoaded
- wait_selector (str): Wait for specific element
- wait_selector_state (str): Element state to wait for
- timeout (int): Request timeout in milliseconds
- proxy (str): Proxy URL
StealthLevel
Enum for preset stealth levels.
Values:
- MINIMAL: Fast, minimal protection
- STANDARD: Balanced protection
- MAXIMUM: Highest protection
Core Functions
scrape_with_retry(url, config, max_retries, backoff_factor, proxy_list, selectors)
Scrape a URL with retry logic.
Parameters:
- url (str): URL to scrape
- config (StealthConfig): Stealth configuration
- max_retries (int): Maximum retry attempts
- backoff_factor (float): Exponential backoff multiplier
- proxy_list (list): List of proxy URLs
- selectors (dict): CSS selectors for extraction
Returns: Page object
Raises: ScrapeError, CloudflareError, BlockedError, TimeoutError
validate_url(url)
Validate URL for security.
Parameters:
- url (str): URL to validate
Returns: bool - True if URL is safe
format_response(page, url, selectors)
Format scraping response.
Parameters:
- page (Page): Scraped page object
- url (str): Original URL
- selectors (dict): Optional CSS selectors
Returns: dict with response data
get_element_text(element)
Extract text content from a scraping element with fallbacks.
Parameters:
- element (Any): A page element object from scrapling
Returns: str - The text content of the element
Description: Checks for .text property first, then .inner_text, and falls back to str().
get_element_html(element)
Extract HTML content from a scraping element with fallbacks.
Parameters:
- element (Any): A page element object from scrapling
Returns: str - The HTML content of the element
Description: Checks for .html property first, then .innerHTML.
get_element_attribute(element, attribute)
Extract an attribute value from a scraping element with fallbacks.
Parameters:
- element (Any): A page element object from scrapling
- attribute (str): The name of the attribute to retrieve
Returns: str | None - The attribute value, or None if not found
Description: Checks for .get_attribute() method first, then direct property access.
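A sketch of the fallback chains described above; the attribute names follow the descriptions, while actual element objects may differ by fetcher:
from typing import Any

def get_element_text(element: Any) -> str:
    if hasattr(element, "text"):
        return element.text
    if hasattr(element, "inner_text"):
        return element.inner_text
    return str(element)

def get_element_html(element: Any) -> str:
    if hasattr(element, "html"):
        return element.html
    return getattr(element, "innerHTML", "")

def get_element_attribute(element: Any, attribute: str) -> str | None:
    if hasattr(element, "get_attribute"):
        return element.get_attribute(attribute)
    return getattr(element, attribute, None)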