---
name: web-scraper
description: >-
  Web scraper with SPA/JavaScript rendering, page interaction, and JS
  execution. Two-tier engine (HTTP → Playwright browser). Smart discovery,
  batch fetch, interactive content extraction, OpenAPI parsing. Use when
  read_url_content fails, SPA rendering needed, or page interaction required.
metadata:
  version: 0.2.0
---
# Web Scraper

Usage: `./scrape <command> [options]`. Run `./scrape help` for the full option reference.
## Content Precision Levels

The `fetch` command outputs filtered content by default. Three precision levels:
| Level | Flag | Behavior | When to use |
|---|---|---|---|
| Default | (none) | PruningContentFilter removes boilerplate | Most scenarios — good enough |
| Precision | `--selector ".css"` | Extract only the matched container | Production-quality docs, zero noise |
| Raw | `--raw` | Full page, no filtering | Debugging, page structure analysis |
Precision workflow: fetch a sample page → inspect remaining noise → identify the content container via CSS selector → re-fetch all pages with `--selector`.
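A minimal sketch of that workflow; the URL and the `.doc-content` selector are placeholders for whatever the sample inspection turns up:

```bash
# 1. Fetch one sample page with default filtering and inspect the result
./scrape fetch https://docs.example.com/guide/intro -o /tmp/sample.md

# 2. Re-fetch with the content container's selector until the noise is gone
./scrape fetch https://docs.example.com/guide/intro --selector ".doc-content" -o /tmp/sample.md

# 3. Apply the proven selector to the whole batch
./scrape fetch --from-file urls.txt --selector ".doc-content" -o /tmp/docs/
```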
## Engine Architecture

```text
L0: Pure HTTP (httpx + selectolax + markdownify)
    fetch → strip boilerplate → markdownify → clean content

L1: Browser (crawl4ai + Playwright, networkidle wait)
    fetch → [JS interaction] → PruningContentFilter → fit_markdown → clean content
    Handles: SPAs, anti-bot, JS-rendered content, folded/lazy content

Auto mode: L0 probe → detect SPA/empty → fallback to L1
```
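When auto-detection misjudges a site, the engine can be pinned with `--engine` (hypothetical URLs):

```bash
# Static HTML: stay on the L0 HTTP path, no browser startup
./scrape fetch https://example.com/static-page --engine http

# Known SPA: skip the L0 probe and go straight to Playwright
./scrape fetch https://app.example.com/dashboard/docs --engine cdp
```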
**Smart Discovery (3-tier):** 1. sitemap.xml → 2. Nav DOM extraction → 3. Browser deep crawl
## Workflow Patterns
### Single page fetch

```bash
# Auto-detects if browser needed
./scrape fetch https://docs.example.com/api/create

# Force browser for known SPA sites
./scrape fetch https://open.feishu.cn/document/... --engine cdp

# Precision extraction with CSS selector
./scrape fetch https://developer.work.weixin.qq.com/document/path/90196 --engine cdp --selector ".ep-doc-area"
```
### Interactive content extraction (SPA/folded content)

```bash
# Auto-expand all collapsed/folded sections (universal preset)
./scrape fetch https://open.feishu.cn/document/... --expand-all --wait 5000

# Scroll through entire page to trigger lazy loading
./scrape fetch https://example.com/infinite-scroll --scroll-full

# Custom JavaScript before extraction
./scrape fetch URL --js "document.querySelectorAll('.expand-btn').forEach(e => e.click())"

# JS from file (for complex interactions)
./scrape fetch URL --js-file /path/to/interact.js --wait 5000

# Wait for specific element before extraction
./scrape fetch URL --wait-for ".api-response-table"

# Combine: expand + custom JS + wait
./scrape fetch URL --expand-all --js "extraAction()" --wait-for ".loaded" -o /tmp/docs/
```
### Execute JavaScript on a page

```bash
# Execute JS and return metadata (no markdown conversion)
./scrape exec URL --js "return document.title"

# Extract data from page's JavaScript state
./scrape exec URL --js "return JSON.stringify(window.__NEXT_DATA__)"

# Execute JS then fetch page as markdown
./scrape exec URL --js "expandAll()" --then-fetch -o /tmp/result.md
```
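For multi-step interactions, `--session` and `--js-only` (see Key Options) reuse one browser session across calls. A plausible sequence, assuming the flags combine as documented; verify the exact semantics with `./scrape help`:

```bash
# Step 1: navigate once and keep the session alive under an ID
./scrape exec https://app.example.com/docs --session docs1 --js "return document.title"

# Step 2: act on the already-loaded page without re-navigating
./scrape exec https://app.example.com/docs --session docs1 --js-only \
  --js "document.querySelector('.load-more').click()"

# Step 3: extract the final page state as markdown
./scrape exec https://app.example.com/docs --session docs1 --js-only \
  --js "return true" --then-fetch -o /tmp/result.md
```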
### Batch-download documentation

```bash
./scrape fetch https://docs.example.com --auto -o /tmp/docs/
./scrape fetch --from-file urls.txt -o /tmp/docs/
./scrape fetch --from-file urls.txt --merge -o /tmp/docs/all.md
```
Files are automatically named by page title (e.g., `读取成员.md`, `Getting_Started.md`).
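The URL file is plain text, one URL per line; the entries below are illustrative:

```bash
cat > urls.txt <<'EOF'
https://docs.example.com/getting-started
https://docs.example.com/api/create
https://docs.example.com/api/delete
EOF

./scrape fetch --from-file urls.txt -o /tmp/docs/
```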
### Build organized local documentation

1. Discover site structure: `./scrape discover <url> --json` → get URLs + titles
2. Filter relevant URLs (Agent selects subset based on user's needs)
3. Write filtered URLs to a file
4. Fetch to output directory: `./scrape fetch --from-file urls.txt -o /path/to/docs/`
5. Organize (Agent moves/renames files into logical folder structure)
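An end-to-end sketch of these five steps with hypothetical URLs; nothing is assumed about the `--json` output shape beyond it containing URLs and titles:

```bash
# 1. Discover site structure (URLs + titles) as JSON
./scrape discover https://docs.example.com --json > /tmp/site.json

# 2-3. Agent reviews /tmp/site.json and writes only the relevant URLs
printf '%s\n' \
  https://docs.example.com/api/create \
  https://docs.example.com/api/update \
  > /tmp/urls.txt

# 4. Fetch the filtered set into the docs directory
./scrape fetch --from-file /tmp/urls.txt -o /tmp/docs/

# 5. Agent reorganizes /tmp/docs/ into a logical folder structure
```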
### Analyze site structure

```bash
./scrape discover https://docs.example.com
./scrape discover https://docs.example.com --engine cdp --deep --max-pages 100
```
### Extract OpenAPI specs

```bash
./scrape openapi https://api.example.com/v3/api-docs -o /tmp/api.md
./scrape openapi https://api.example.com/swagger-ui/ -o /tmp/api.md
```
## Key Options
| Option | Commands | Description |
|---|---|---|
| `--engine MODE` | discover, fetch | `auto` (default), `http` (L0 only), `cdp` (L1 only) |
| `--selector CSS` | fetch, exec | CSS selector for content area (precision mode) |
| `--raw` | fetch, exec | Output full page (skip content filtering) |
| `--js CODE` | fetch, exec | JavaScript to execute before extraction |
| `--js-file PATH` | fetch, exec | JavaScript file to execute before extraction |
| `--wait-for CSS` | fetch | Wait for CSS selector to appear before extraction |
| `--expand-all` | fetch | Auto-expand collapsed/folded sections (universal preset) |
| `--scroll-full` | fetch | Scroll entire page to trigger lazy loading |
| `--then-fetch` | exec | After JS execution, also fetch page as markdown |
| `--session ID` | exec | Reuse browser session for multi-step interactions |
| `--js-only` | exec | Skip navigation; execute JS on existing session |
| `--auto` | fetch | Auto-discover pages, then fetch all |
| `--from-file FILE` | fetch | Read URLs from file (one per line) |
| `-o PATH` | fetch, exec, openapi | Output directory or file path |
| `--merge` | fetch | Merge all pages into a single markdown file |
| `--json` | discover, fetch | Machine-readable JSON output |
| `--summary-only` | fetch | Show only URL, title, and char count |
| `--max-pages N` | discover, fetch | Limit pages (discover: 200, fetch: 50) |
| `--deep` | discover | Follow links for browser-based BFS crawl |
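Before a large batch run, `--summary-only` makes a cheap dry run over the URL list (hypothetical file):

```bash
# Confirm each URL resolves and yields content before committing to a full fetch
./scrape fetch --from-file urls.txt --summary-only
```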
## Requirements

- uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Dependencies and Playwright Chromium are auto-installed on first run