Scrapling MCP Server - Project Documentation
Table of Contents
- Project Overview
- Project Scope
- Architecture & Structure
- Key Components
- MCP Tools Reference
- Stealth Levels
- Configuration
- Security Considerations
- Development Guidelines
- Usage Examples
- Appendix: API Reference
1. Project Overview
What is the Scrapling MCP Server?
The Scrapling MCP Server is a Model Context Protocol (MCP) server that provides web scraping capabilities through an integrated stealth-aware scraping engine. Built on top of the FastMCP framework and leveraging the scrapling library, this server exposes powerful web scraping tools that AI agents and applications can invoke through a standardized MCP interface.
The project bridges the gap between AI agents that need to fetch web content and the complex reality of modern web scraping—including anti-bot protections, JavaScript rendering requirements, Cloudflare challenges, and the need for stealthy request patterns.
Scrapling Library
The server leverages Scrapling, an adaptive web scraping framework with 9.1k GitHub stars that provides multiple fetcher types:
| Fetcher | Use Case |
|---|---|
| Fetcher | Fast HTTP requests with TLS fingerprinting and HTTP/3 support |
| DynamicFetcher | Full browser automation using Playwright |
| StealthyFetcher | Advanced anti-bot bypass using Camoufox (modified Firefox) |
| AsyncStealthySession | Concurrent stealth browsing with tab pooling |
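For orientation, here is a minimal sketch of invoking two of these fetchers directly. It follows scrapling's published API, though exact signatures may vary between versions:
from scrapling.fetchers import Fetcher, StealthyFetcher

# Fast HTTP request with TLS fingerprinting (no browser involved)
page = Fetcher.get("https://example.com")
print(page.status, page.css_first("title::text"))

# Camoufox-backed stealth fetch for protected pages
page = StealthyFetcher.fetch("https://example.com", headless=True)
print(page.status)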
Purpose and Goals
The primary purpose of this MCP server is to enable AI agents to:
- Fetch web content reliably from websites with varying levels of anti-bot protection
- Render JavaScript when necessary to access dynamically loaded content
- Bypass common anti-bot measures through configurable stealth settings
- Handle session-based scraping for websites requiring authentication or stateful interactions
- Extract structured data using CSS selectors from scraped pages
The project aims to provide a balance between:
- Ease of use - Simple API for common scraping tasks
- Flexibility - Extensive configuration options for advanced use cases
- Reliability - Built-in retry logic and error handling
- Security - URL validation and safe defaults
Key Features and Capabilities
| Feature | Description |
|---|---|
| JavaScript Rendering | Full browser-based rendering for dynamic content |
| Stealth Modes | Multiple pre-configured stealth levels (Minimal, Standard, Maximum) |
| Cloudflare Support | Automatic Cloudflare challenge detection and solving |
| Session Management | Persistent sessions for stateful scraping |
| Proxy Rotation | Support for proxy lists with automatic rotation |
| Retry Logic | Exponential backoff with configurable retry attempts |
| CSS Extraction | Structured data extraction using CSS selectors |
| URL Validation | Built-in SSRF protection and security checks |
| MCP Integration | Native MCP protocol support for AI agent integration |
| Spider Framework | Scrapy-like API with async callbacks, concurrent crawling, and pause/resume support |
| Adaptive Parsing | Smart element tracking that survives website design changes |
| Camoufox Integration | Modified Firefox browser with stealth patches for maximum anti-detection |
2. Project Scope
What the Project Does
The Scrapling MCP Server provides a collection of MCP tools that allow AI agents to:
- Simple Scraping - Fetch a URL and retrieve HTML content
- Stealth Scraping - Fetch URLs with configurable anti-detection measures
- Session-Based Scraping - Maintain cookies and state across multiple requests
- Structured Extraction - Extract specific data using CSS selectors
- Batch Scraping - Process multiple URLs in sequence
Target Use Cases
The server is designed for the following use cases:
- AI Agent Web Research - Enabling AI agents to gather information from the web
- Data Collection - Automated gathering of publicly available web data
- Content Aggregation - Building datasets from multiple web sources
- Monitoring & Alerting - Watching web pages for changes
- API Alternative - Accessing websites that lack public APIs
Supported Scraping Modes
Simple Mode
- Basic HTTP requests without stealth features
- Fastest performance
- Suitable for well-behaved websites without anti-bot protection
- No JavaScript rendering
Stealth Mode
- Configurable anti-detection features
- User-Agent randomization
- Human-like behavior simulation
- Browser automation with headless Camoufox (a stealth-patched Firefox)
Session-Based Mode
- Persistent cookie storage
- State maintenance across requests
- Authentication handling
- Ideal for authenticated scraping
What's In Scope
- HTTP/HTTPS scraping with JavaScript rendering support
- Stealth configuration with multiple preset levels
- Session management for stateful interactions
- Error handling with automatic retry logic
- URL validation for security
- MCP protocol integration
What's Out of Scope
- Authentication handling - While sessions are supported, credential management is outside scope
- CAPTCHA solving - No built-in CAPTCHA solving capabilities (Cloudflare challenges only)
- Distributed scraping - Single-instance operation
- Data storage - The server fetches and returns data but doesn't persist it
- Legal compliance - Users are responsible for ensuring their scraping activities are legal
3. Architecture & Structure
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ AI Agent / Client │
│ (Claude, GPT, or other MCP clients) │
└─────────────────────────────────┬───────────────────────────────────────┘
│ MCP Protocol (JSON-RPC)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MCP Server (FastMCP) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ MCP Tools Layer │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────┐ │ │
│ │ │ scrape │ │ stealth │ │ session │ │ extract │ │ │
│ │ │ simple │ │ scrape │ │ scrape │ │ structured │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Core Logic Layer │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Stealth │ │ Session │ │ Retry & Error │ │ │
│ │ │ Config │ │ Management │ │ Handling │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
└──────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Scrapling Integration Layer │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ AsyncStealthySession (scrapling library) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ │
│ │ │ Browser │ │ HTTP │ │ Anti-Detection │ │ │
│ │ │ Pool │ │ Client │ │ Features │ │ │
│ │ └──────────────┘ └──────────────┘ └────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
└──────────────────────────────────┼──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Target Website │
│ (Any HTTP/HTTPS accessible URL) │
└─────────────────────────────────────────────────────────────────────────┘
Directory Structure
mcp-scraper/
├── .env.example # Example environment configuration
├── .gitignore # Git ignore patterns
├── pyproject.toml # Python project configuration
├── README.md # Project README
├── src/
│ └── mcp_scraper/
│ ├── __init__.py # Package initialization
│ ├── config.py # Configuration classes and settings
│ └── stealth.py # Stealth utilities and scraping logic
└── tests/ # Test directory (to be added)
Key Components and Their Responsibilities
MCP Server (__init__.py)
- Responsibility: Package initialization and exports
- Exports: Settings, StealthConfig, version information
Configuration Module (config.py)
- Responsibility: Define settings and configuration classes
- Components:
  - StealthConfig dataclass - Detailed stealth configuration options
  - Settings class - Environment-based settings using Pydantic
  - StealthProfiles class - Pre-configured stealth profiles
Stealth Module (stealth.py)
- Responsibility: Core scraping logic, session management, and utilities
- Components:
  - StealthConfig class - Stealth configuration with all options
  - StealthLevel enum - Preset stealth levels (MINIMAL, STANDARD, MAXIMUM)
  - scrape_with_retry() - Main scraping function with retry logic
  - get_session() - Session management
  - validate_url() - URL security validation
  - format_response() - Response formatting utility
Data Flow
1. Request Receipt: Client sends MCP request with URL and optional parameters
2. URL Validation: System validates URL for security (SSRF protection)
3. Configuration: Stealth settings are applied based on parameters
4. Session Management: Get or create stealth session
5. Scraping: Execute HTTP request through scrapling engine
6. Response Processing: Format response with requested data
7. Error Handling: Apply retry logic if needed
8. Return: Send formatted response back to client
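The same flow, condensed into a sketch (the function names follow this project's stealth.py; the exact signatures are assumptions):
from mcp_scraper.stealth import (
    validate_url,
    get_standard_stealth,
    scrape_with_retry,
    format_response,
)

async def handle_request(url: str) -> dict:
    if not validate_url(url):  # step 2: SSRF protection
        raise ValueError(f"Unsafe URL rejected: {url}")
    config = get_standard_stealth()  # step 3: apply stealth settings
    # Steps 4-5 and 7: scrape_with_retry acquires the session and retries on failure
    page = await scrape_with_retry(url, config)
    return format_response(page, url)  # steps 6 and 8: format and return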
4. Key Components
MCP Server (FastMCP)
The MCP server is built using FastMCP, a modern framework for creating MCP servers in Python. FastMCP provides:
- Simple tool definition using decorators
- Automatic type conversion between Python and JSON
- Built-in error handling for tool execution
- Async support for concurrent operations
The server exposes scraping functionality as MCP tools that clients can invoke.
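As a brief illustration, a FastMCP server registers a tool with a decorator and starts with mcp.run() (the tool body is elided here):
from fastmcp import FastMCP

mcp = FastMCP("Scrapling Scraper")

@mcp.tool()
async def scrape_simple(url: str, timeout: int = 30000) -> dict:
    """Fetch a URL and return its content."""
    ...  # delegate to the scraping layer

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default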
Scrapling Integration
The server integrates with the scrapling library, which provides:
- AsyncStealthySession: An async session with built-in anti-detection features and tab pooling
- StealthyFetcher: Advanced anti-bot bypass using Camoufox (modified Firefox)
- Page object: Unified interface for accessing page content
- Browser automation: Headless browser with stealth features
- JavaScript rendering: Full DOM rendering for dynamic content
- Spider Framework: Scrapy-like API with concurrent crawling, pause/resume, and streaming mode
- Adaptive Parsing: Smart element tracking that survives website design changes
Stealth Configuration
The stealth system provides multiple configuration options:
| Option | Description | Default |
|---|---|---|
| headless | Run browser in headless mode | True |
| solve_cloudflare | Attempt Cloudflare challenges | False |
| humanize | Human-like behavior simulation | True |
| humanize_duration | Maximum cursor movement duration in seconds | 1.5 |
| geoip | Match browser locale/timezone to the IP's geolocation (recommended with proxies) | False |
| os_randomize | Randomize OS fingerprint | True |
| block_webrtc | Block WebRTC to prevent IP leaks | True |
| allow_webgl | Allow WebGL fingerprinting | True |
| google_search | Set the referer as if arriving from a Google search | True |
| block_images | Block image loading | False |
| block_ads | Block advertisements | True |
| disable_resources | Disable CSS/JS resources | False |
| network_idle | Wait for network inactivity before returning | False |
| load_dom | Wait for DOMContentLoaded event | False |
| wait_selector | Wait for specific element to appear | None |
| wait_selector_state | Element state to wait for (visible/hidden/attached) | None |
| timeout | Request timeout in milliseconds | 30000 |
| proxy | Proxy URL for requests | None |
Session Management
The session management system:
- Global session cache: Maintains a single session instance
- Config-aware: Recreates session when configuration changes
- Proper cleanup: Ensures resources are released on close
- Cookie persistence: Maintains cookies across requests
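A minimal sketch of that config-aware cache, assuming AsyncStealthySession is importable from scrapling.fetchers and usable as an async context manager (its constructor options here are assumptions):
from scrapling.fetchers import AsyncStealthySession

_session = None
_session_config = None

async def get_session(config):
    """Return the cached session, recreating it when the config changes."""
    global _session, _session_config
    if _session is not None and _session_config != config:
        await _session.__aexit__(None, None, None)  # release old browser resources
        _session = None
    if _session is None:
        _session = AsyncStealthySession(headless=config.headless)
        await _session.__aenter__()
        _session_config = config
    return _session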
Error Handling & Retry Logic
The retry system implements:
- Exponential backoff: Delay increases exponentially between retries
- Proxy rotation: Automatic proxy switching on block detection
- Cloudflare handling: Detection and optional solving of challenges
- Block detection: Identifies when requests are blocked
- Custom exceptions: Specific error types for different failure modes
Exception Hierarchy:
ScrapeError (base)
├── CloudflareError - Cloudflare protection detected
├── BlockedError - Request blocked by anti-bot
└── TimeoutError - Request timed out
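A condensed sketch of how the retry loop can combine exponential backoff with proxy rotation; do_fetch is a hypothetical stand-in for the underlying session call:
import asyncio

from mcp_scraper.stealth import BlockedError  # from the hierarchy above

async def do_fetch(url, config):
    ...  # hypothetical stand-in for the real session fetch

async def fetch_with_backoff(url, config, max_retries=3, backoff_factor=2.0, proxy_list=None):
    last_error = None
    for attempt in range(max_retries + 1):
        if proxy_list:
            # Rotate to the next proxy on each attempt (block-detection path)
            config.proxy = proxy_list[attempt % len(proxy_list)]
        try:
            return await do_fetch(url, config)
        except BlockedError as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff: 1s, 2s, 4s, ... between attempts
                await asyncio.sleep(backoff_factor ** attempt)
    raise last_error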
5. MCP Tools Reference
Available Tools
The following MCP tools are available. Note: The exact tool names and parameters depend on the implementation. Below are the conceptual tools provided by the server.
Tool: scrape_simple
Simple web scraping without stealth features. Uses the Fetcher class for fast HTTP requests with TLS fingerprinting.
When to use:
- Fast scraping of static content
- Well-behaved websites without protection
- Initial testing and development
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| selector | string | No | CSS selector for targeted extraction |
| extract | string | No | What to extract: "text", "html", or "both" (default: "text") |
| timeout | integer | No | Request timeout in milliseconds (default: 30000) |
Example:
{
"url": "https://example.com",
"timeout": 30000
}
Tool: scrape_stealth
Stealth web scraping with configurable anti-detection. Uses the StealthyFetcher class with Camoufox (modified Firefox) for maximum stealth.
When to use:
- Websites with basic anti-bot measures
- When avoiding detection is important
- Rate-limited endpoints
- Cloudflare-protected sites
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| stealth_level | string | No | "minimal", "standard", or "maximum" (default: "standard") |
| solve_cloudflare | boolean | No | Attempt Cloudflare challenges (default: false) |
| network_idle | boolean | No | Wait for network inactivity (default: true) |
| load_dom | boolean | No | Wait for DOMContentLoaded (default: true) |
| timeout | integer | No | Request timeout in milliseconds (default: 30000) |
| proxy | string | No | Proxy URL for requests |
Example:
{
"url": "https://example.com",
"stealth_level": "maximum",
"solve_cloudflare": true,
"network_idle": true,
"timeout": 60000
}
Tool: scrape_session
Session-based scraping with persistent state.
When to use:
- Websites requiring authentication
- Multi-step interactions
- Maintaining login state
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| session_id | string | No | Session identifier for persistence |
| cookies | object | No | Initial cookies to set |
| stealth_level | string | No | Stealth level (default: "standard") |
Example:
{
"url": "https://example.com/dashboard",
"session_id": "user-session-123",
"cookies": {"auth": "token-value"}
}
Tool: extract_structured
Extract structured data using CSS selectors.
When to use:
- Extracting specific data from pages
- Building datasets from web content
- Structured data acquisition
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| selectors | object or string | Yes | Map of name → CSS selector; accepts a JSON object or its JSON-string representation |
| stealth_level | string | No | Stealth level (default: "standard") |
Selector Syntax:
The selectors parameter supports CSS selector syntax with the following extensions:
| Syntax | Description | Example |
|---|---|---|
| selector | Extract text content | "title": "h1" |
| selector::html | Extract HTML content | "content": "div::html" |
| selector::text | Extract text using the ::text pseudo-element | "text": "p::text" |
| selector::attr(name) | Extract attribute value | "link": "a::attr(href)" |
| selector@attr | Extract attribute (alternative syntax) | "image": "img@src" |
| selector@attr1@attr2 | Extract multiple attributes | "data": "img@src@alt" |
Example with dict input:
{
"url": "https://example.com/blog",
"selectors": {
"title": "h1.article-title",
"content": "div.article-content",
"author": "span.author-name",
"date": "time.publish-date",
"link": "a.read-more::attr(href)",
"image": "img.featured@src@alt"
}
}
Example with JSON string input:
{
"url": "https://example.com/blog",
"selectors": "{\"title\": \"h1.article-title\", \"content\": \"div.article-content\", \"link\": \"a.read-more::attr(href)\"}"
}
Tool: scrape_batch
Scrape multiple URLs in sequence.
When to use:
- Processing multiple pages
- Building site-wide datasets
- Bulk data collection
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
urls | array | Yes | List of URLs to scrape |
stealth_level | string | No | Stealth level (default: "standard") |
delay | float | No | Delay between requests in seconds (default: 1.0) |
Example:
{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"stealth_level": "minimal",
"delay": 2.0
}
6. Stealth Levels
Overview
The server provides three pre-configured stealth levels, each balancing speed, anonymity, and success rate differently.
Minimal Stealth
Profile: StealthLevel.MINIMAL or get_minimal_stealth()
| Setting | Value |
|---|---|
| Headless | Yes |
| Humanize | No |
| Humanize Duration | N/A |
| Cloudflare solving | No |
| OS randomization | No |
| WebRTC blocking | No |
| Google-referer simulation | No |
| Image blocking | Yes |
| Resource disabling | Yes |
| Ad blocking | Yes |
| Network Idle | No |
| Load DOM | No |
| Timeout | 15s |
When to use:
- Simple websites without anti-bot protection
- High-speed scraping where stealth is not critical
- Testing and development
- Static content and APIs
Performance: Fastest - suitable for high-volume scraping of cooperative sites
Standard Stealth
Profile: StealthLevel.STANDARD or get_standard_stealth()
| Setting | Value |
|---|---|
| Headless | Yes |
| Humanize | Yes |
| Humanize Duration | 1.5s |
| Cloudflare solving | No |
| OS randomization | Yes |
| WebRTC blocking | Yes |
| Google-referer simulation | Yes |
| Image blocking | No |
| Resource disabling | No |
| Ad blocking | Yes |
| Network Idle | Yes |
| Load DOM | Yes |
| Timeout | 30s |
When to use:
- Most web scraping tasks
- Sites with basic anti-bot protection
- General-purpose scraping
- Balance of speed and anonymity required
Performance: Moderate - suitable for most common scraping scenarios
Maximum Stealth
Profile: StealthLevel.MAXIMUM or get_maximum_stealth()
| Setting | Value |
|---|---|
| Headless | Yes |
| Humanize | Yes |
| Humanize Duration | 1.5s |
| Cloudflare solving | Yes |
| OS randomization | Yes |
| WebRTC blocking | Yes |
| Google-referer simulation | Yes |
| Image blocking | No |
| Resource disabling | No |
| Ad blocking | Yes |
| GeoIP matching | Yes |
| Network Idle | Yes |
| Load DOM | Yes |
| Wait Selector | body |
| Wait Selector State | visible |
| Timeout | 60s |
When to use:
- Heavily protected websites
- Cloudflare-protected sites
- Rate-limited endpoints
- Maximum anonymity required
- Challenging anti-bot systems
Performance: Slowest - but highest success rate on protected sites
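Selecting a profile programmatically can be a simple dispatch on the enum. A sketch using the preset helpers named above (the wrapper itself is illustrative):
from mcp_scraper.stealth import (
    StealthLevel,
    get_minimal_stealth,
    get_standard_stealth,
    get_maximum_stealth,
)

def get_stealth_config(level: StealthLevel):
    """Map a preset level to its profile factory."""
    return {
        StealthLevel.MINIMAL: get_minimal_stealth,
        StealthLevel.STANDARD: get_standard_stealth,
        StealthLevel.MAXIMUM: get_maximum_stealth,
    }[level]()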
Configuration Options
You can also create custom stealth configurations:
from mcp_scraper.stealth import StealthConfig
custom_config = StealthConfig(
headless=True,
solve_cloudflare=True,
humanize=True,
humanize_duration=1.5,
geoip=False,
os_randomize=True,
block_webrtc=True,
allow_webgl=True,
google_search=True,
block_images=False,
block_ads=True,
disable_resources=False,
network_idle=True,
load_dom=True,
wait_selector="body",
wait_selector_state="visible",
timeout=45000,
proxy="http://proxy:8080"
)
7. Configuration
Environment Variables
Create a .env file based on .env.example:
# Proxy URL for requests (optional)
# Format: http://user:pass@host:port or socks5://host:port
PROXY_URL=
# Default timeout for requests in seconds (1-300)
DEFAULT_TIMEOUT=30
# Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_LEVEL=INFO
# Maximum number of retries for failed requests (0-10)
MAX_RETRIES=3
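A sketch of the Settings class that would back these variables, assuming standard pydantic-settings conventions (field names map case-insensitively to the env vars above):
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    proxy_url: str | None = None  # PROXY_URL
    default_timeout: int = 30     # DEFAULT_TIMEOUT (seconds, 1-300)
    log_level: str = "INFO"       # LOG_LEVEL
    max_retries: int = 3          # MAX_RETRIES (0-10)

settings = Settings()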
Configuration Files
pyproject.toml
The project uses pyproject.toml for Python package configuration:
[project]
name = "mcp-scraper"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
"scrapling[all]",
"fastmcp>=2.0",
"httpx>=0.25",
"pydantic>=2.0",
"pydantic-settings>=2.0",
"python-dotenv>=1.0",
"loguru>=0.7",
]
Proxy Setup
Single Proxy
Set via environment variable:
PROXY_URL=http://proxy.example.com:8080
Or programmatically:
config = StealthConfig(proxy="http://proxy.example.com:8080")
Proxy Rotation
For proxy rotation, pass a list of proxies to the scraping function:
proxy_list = [
"http://proxy1:8080",
"http://proxy2:8080",
"http://proxy3:8080",
]
page = await scrape_with_retry(
url="https://example.com",
proxy_list=proxy_list,
max_retries=3
)
Supported Proxy Formats
| Protocol | Format |
|---|---|
| HTTP | http://host:port |
| HTTPS | https://host:port |
| SOCKS5 | socks5://host:port |
| With auth | http://user:pass@host:port |
Timeout Settings
Request Timeout
Set per-request or globally:
# Per-request (in milliseconds)
page = await session.fetch(url, timeout=60000)
# Global (via Settings)
# DEFAULT_TIMEOUT=60 in .env (in seconds)
Recommended values:
| Scenario | Timeout |
|---|---|
| Simple static pages | 15-30s (15000-30000ms) |
| Standard scraping | 30-45s (30000-45000ms) |
| Complex JavaScript | 45-60s (45000-60000ms) |
| Slow/blocked sites | 60-120s (60000-120000ms) |
8. Security Considerations
URL Validation
The server implements robust URL validation to prevent Server-Side Request Forgery (SSRF) attacks:
Allowed:
- http:// and https:// protocols only
- Public IP addresses
- Public domain names
Blocked:
- file://, ftp://, and other protocols
- Private IP addresses (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
- Localhost variants (localhost, 127.0.0.1, ::1)
- Internal hostnames (*.local, *.internal, *.corp)
- Link-local addresses (169.254.x.x)
The validation function validate_url() is called automatically before any scraping operation.
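A simplified sketch of those checks; the real validate_url() may go further (for example, resolving DNS before comparing against private ranges):
import ipaddress
from urllib.parse import urlparse

BLOCKED_SUFFIXES = (".local", ".internal", ".corp")

def validate_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # file://, ftp://, etc. are rejected
    host = (parsed.hostname or "").lower()
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal; the hostname checks above apply
    # Reject private (10.x, 172.16-31.x, 192.168.x), loopback, and link-local ranges
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)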
Proxy Security
When using proxies:
- Use trusted proxies - Avoid free/public proxy lists
- Encrypt credentials - Don't hardcode proxy credentials
- Validate proxy URLs - Ensure proxy URLs are valid format
- Rotate responsibly - Don't abuse proxy rotation
Rate Limiting
To avoid overwhelming target sites:
- Use appropriate delays - Add randomized pauses between requests (see the sketch after this list)
- Implement backoff - Use exponential backoff on failures
- Respect robots.txt - Check and follow site policies
- Monitor responses - Watch for rate limit indicators
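A small sketch of jittered pacing between requests (illustrative; the batch tool exposes a fixed delay parameter instead):
import asyncio
import random

async def polite_sleep(base_delay: float = 1.0, jitter: float = 0.5) -> None:
    """Sleep for base_delay plus random jitter so request timing is not uniform."""
    await asyncio.sleep(base_delay + random.uniform(0.0, jitter))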
Legal Compliance
Users are responsible for:
- Ensuring their scraping activities are legal in their jurisdiction
- Respecting website Terms of Service
- Complying with robots.txt directives
- Not bypassing authentication mechanisms they don't have access to
- Handling personal data appropriately
Best practices:
- Only scrape publicly available data
- Identify your scraper in User-Agent when appropriate
- Cache responses to minimize repeated requests
- Consider using official APIs when available
9. Development Guidelines
Best Practices for Undetectable Scraping
Follow these best practices to maximize scraping success while minimizing detection:
- Always use sessions: Reuse browser instances to maintain consistent fingerprints
- Enable geoip with proxies: Match browser locale to proxy location for better anonymity
- Use solve_cloudflare sparingly: Only when needed - it increases detection surface and slows down requests
- Implement exponential backoff: Slow down after failures and only ramp the request rate back up after sustained success
- Rotate user agents: Even with Camoufox, periodic rotation helps avoid pattern detection
- Monitor for blocks: Track 403/429 responses and adjust strategy accordingly
- Enable network_idle and load_dom: Wait for page to fully load before extracting data
- Use wait_selector for dynamic content: Wait for specific elements to appear before extraction
Recommended Configuration Patterns
# Pattern 1: Simple static pages
simple_config = StealthConfig(
headless=True,
disable_resources=True,
timeout=10000
)
# Pattern 2: Protected sites (Cloudflare)
protected_config = StealthConfig(
headless=True,
solve_cloudflare=True,
humanize=True,
geoip=True,
os_randomize=True,
timeout=60000,
google_search=True,
network_idle=True,
load_dom=True
)
# Pattern 3: High-anonymity scraping
anonymous_config = StealthConfig(
headless=True,
block_webrtc=True,
block_images=True,
disable_resources=True,
os_randomize=True,
geoip=True,
solve_cloudflare=True,
humanize=True,
humanize_duration=1.5,
proxy=next(proxy_rotation) # hypothetical rotating-proxy iterator, e.g. itertools.cycle(proxies)
)
# Pattern 4: Debugging (visible browser)
debug_config = StealthConfig(
headless=False, # Visible browser
timeout=120000 # Long timeout for manual intervention
)
How to Extend the Server
Adding New Tools
To add a new MCP tool, follow this pattern:
from fastmcp import FastMCP
from mcp_scraper.stealth import (
    scrape_with_retry,
    get_standard_stealth,
    validate_url,
    format_response,
)

mcp = FastMCP("My Scraper")

@mcp.tool()
async def scrape_with_custom_option(
    url: str,
    custom_option: bool = False
) -> dict:
    """Description of what this tool does.

    Args:
        url: The URL to scrape
        custom_option: Description of custom option

    Returns:
        Dictionary with scraping results
    """
    # Validate URL
    if not validate_url(url):
        raise ValueError(f"Invalid URL: {url}")

    # Get stealth config
    config = get_standard_stealth()

    # Apply custom options
    if custom_option:
        config.timeout = 60000  # example: extend the timeout

    # Scrape
    page = await scrape_with_retry(url, config)

    # Format and return
    return format_response(page, url)
Adding New Stealth Profiles
Add new preset configurations in config.py:
@staticmethod
def custom_profile() -> StealthConfig:
"""Custom profile description.
Suitable for: Your specific use case
"""
return StealthConfig(
# Custom settings
humanize=True,
# ... other options
)
Adding Error Types
Extend the exception hierarchy in stealth.py:
class RateLimitError(ScrapeError):
"""Exception raised when rate limited."""
pass
Testing Approach
Unit Tests:
- Test URL validation
- Test configuration classes
- Test response formatting
Integration Tests:
- Test scraping with mock servers
- Test retry logic
- Test error handling
Example test structure:
tests/
├── __init__.py
├── test_config.py
├── test_stealth.py
├── test_validation.py
└── test_integration.py
Code Style
The project follows:
- Black for code formatting (100 character line length)
- Ruff for linting
- MyPy for type checking
- PEP 8 naming conventions
Key style rules:
- Use type hints for all function parameters and return values
- Use dataclasses for configuration objects
- Use async/await for I/O operations
- Use Loguru for logging
- Document all public functions with docstrings
Pre-commit hooks:
pip install pre-commit
pre-commit install
10. Usage Examples
Basic Scraping Example
Simple scraping of a static webpage:
from mcp_scraper.stealth import scrape_with_retry, format_response
async def basic_example():
url = "https://example.com"
# Simple scrape
page = await scrape_with_retry(url)
# Format response
result = format_response(page, url)
print(f"Title: {result.get('title')}")
print(f"Text content: {result.get('text')[:500]}")
print(f"Status: {result.get('status')}")
Stealth Scraping Example
Scraping a protected website with maximum stealth:
from mcp_scraper.stealth import (
scrape_with_retry,
get_maximum_stealth,
format_response
)
async def stealth_example():
url = "https://protected-site.com/data"
# Use maximum stealth
config = get_maximum_stealth()
try:
page = await scrape_with_retry(
url,
config=config,
max_retries=3
)
result = format_response(page, url)
print(f"Success! Content length: {len(result.get('html', ''))}")
except Exception as e:
print(f"Scraping failed: {e}")
Batch Scraping Example
Processing multiple URLs:
from mcp_scraper.stealth import scrape_with_retry, format_response, validate_url
import asyncio
async def batch_example():
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
]
results = []
delay = 1.0 # Delay between requests
for url in urls:
# Validate first
if not validate_url(url):
print(f"Skipping invalid URL: {url}")
continue
try:
page = await scrape_with_retry(url)
result = format_response(page, url)
results.append(result)
print(f"Scraped: {url}")
except Exception as e:
print(f"Failed to scrape {url}: {e}")
# Delay between requests
await asyncio.sleep(delay)
print(f"Successfully scraped {len(results)}/{len(urls)} URLs")
return results
Structured Extraction Example
Extracting specific data using CSS selectors:
from mcp_scraper.stealth import scrape_with_retry, format_response
async def structured_example():
url = "https://example.com/blogposts"
# Define selectors for data extraction
selectors = {
"titles": "h2.post-title",
"authors": "span.author",
"dates": "time.published",
"summaries": "p.summary",
"links": "a.read-more@href"
}
# Scrape with selectors
page = await scrape_with_retry(url, selectors=selectors)
# Get formatted response with extracted data
result = format_response(page, url, selectors=selectors)
# Access extracted data
extracted = result.get("selectors", {})
for i, title in enumerate(extracted.get("titles", [])):
print(f"Post {i+1}: {title}")
print(f" Author: {extracted.get('authors', [None])[i]}")
print(f" Date: {extracted.get('dates', [None])[i]}")
Custom Configuration Example
Using custom stealth settings:
from mcp_scraper.stealth import StealthConfig, scrape_with_retry
async def custom_config_example():
# Create custom stealth configuration
config = StealthConfig(
headless=True,
solve_cloudflare=True, # Attempt Cloudflare challenges
humanize=True,
humanize_duration=1.5,
geoip=False,
os_randomize=True,
block_webrtc=True,
allow_webgl=True,
google_search=True,
block_images=True, # Reduce bandwidth
block_ads=True,
disable_resources=False,
network_idle=True,
load_dom=True,
timeout=45000,
proxy="http://my-proxy:8080" # Use specific proxy
)
url = "https://cloudflare-protected-site.com"
page = await scrape_with_retry(url, config=config, max_retries=5)
print(f"Success! Content: {page.text[:200]}")
Proxy Rotation Example
Using multiple proxies with automatic rotation:
from mcp_scraper.stealth import scrape_with_retry, get_standard_stealth
async def proxy_rotation_example():
# List of proxy servers
proxy_list = [
"http://proxy1:8080",
"http://proxy2:8080",
"http://proxy3:8080",
]
config = get_standard_stealth()
try:
page = await scrape_with_retry(
url="https://example.com",
config=config,
max_retries=3,
proxy_list=proxy_list
)
print(f"Success with proxy rotation!")
except Exception as e:
print(f"All proxies failed: {e}")
Appendix: API Reference
Core Classes
StealthConfig
Configuration class for stealth web scraping.
Attributes:
- headless (bool): Run browser in headless mode
- solve_cloudflare (bool): Attempt Cloudflare challenges
- humanize (bool): Add human-like behavior
- humanize_duration (float): Maximum cursor movement duration
- geoip (bool): Match fingerprint to the IP's geolocation
- os_randomize (bool): Randomize OS fingerprint
- block_webrtc (bool): Block WebRTC
- allow_webgl (bool): Allow WebGL
- google_search (bool): Set a Google-search referer
- block_images (bool): Block images
- block_ads (bool): Block advertisements
- disable_resources (bool): Disable CSS/JS
- network_idle (bool): Wait for network inactivity
- load_dom (bool): Wait for DOMContentLoaded
- wait_selector (str): Wait for specific element
- wait_selector_state (str): Element state to wait for
- timeout (int): Request timeout in milliseconds
- proxy (str): Proxy URL
StealthLevel
Enum for preset stealth levels.
Values:
- MINIMAL: Fast, minimal protection
- STANDARD: Balanced protection
- MAXIMUM: Highest protection
Core Functions
scrape_with_retry(url, config, max_retries, backoff_factor, proxy_list, selectors)
Scrape a URL with retry logic.
Parameters:
- url (str): URL to scrape
- config (StealthConfig): Stealth configuration
- max_retries (int): Maximum retry attempts
- backoff_factor (float): Exponential backoff multiplier
- proxy_list (list): List of proxy URLs
- selectors (dict): CSS selectors for extraction
Returns: Page object
Raises: ScrapeError, CloudflareError, BlockedError, TimeoutError
validate_url(url)
Validate URL for security.
Parameters:
- url (str): URL to validate
Returns: bool - True if URL is safe
format_response(page, url, selectors)
Format scraping response.
Parameters:
- page (Page): Scraped page object
- url (str): Original URL
- selectors (dict): Optional CSS selectors
Returns: dict with response data
get_element_text(element)
Extract text content from a scraping element with fallbacks.
Parameters:
- element (Any): A page element object from scrapling
Returns: str - The text content of the element
Description: Checks for .text property first, then .inner_text, and falls back to str().
get_element_html(element)
Extract HTML content from a scraping element with fallbacks.
Parameters:
- element (Any): A page element object from scrapling
Returns: str - The HTML content of the element
Description: Checks for .html property first, then .innerHTML.
get_element_attribute(element, attribute)
Extract an attribute value from a scraping element with fallbacks.
Parameters:
- element (Any): A page element object from scrapling
- attribute (str): The name of the attribute to retrieve
Returns: str | None - The attribute value, or None if not found
Description: Checks for .get_attribute() method first, then direct property access.
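A sketch of the fallback chains described above; the attribute names follow the descriptions, while actual element objects may differ by fetcher:
from typing import Any

def get_element_text(element: Any) -> str:
    if hasattr(element, "text"):
        return element.text
    if hasattr(element, "inner_text"):
        return element.inner_text
    return str(element)

def get_element_html(element: Any) -> str:
    if hasattr(element, "html"):
        return element.html
    return getattr(element, "innerHTML", "")

def get_element_attribute(element: Any, attribute: str) -> str | None:
    if hasattr(element, "get_attribute"):
        return element.get_attribute(attribute)
    return getattr(element, attribute, None)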