---
name: web-scraper
description: Web scraping and data extraction from websites. Use when users need to extract data from web pages, scrape multiple URLs, gather structured information from websites, or collect data in bulk from online sources. Supports single-page scraping, multi-page crawling, and handling dynamic JavaScript-rendered content.
---
# Web Scraper Skill
Extract data from websites using deterministic Python scripts.
## Quick Start
- **Single URL**: Run `scripts/scrape_single.py` with a URL
- **Multiple URLs**: Run `scripts/scrape_batch.py` with a file of URLs
- **Dynamic content**: Use the `--browser` flag for JavaScript-rendered pages
## Scripts
### `scrape_single.py`
Extract content from a single URL.
```bash
python scripts/scrape_single.py "https://example.com" --output .tmp/result.json
```
Options:

- `--output`: Output file path (JSON or CSV)
- `--selector`: CSS selector to target specific content
- `--browser`: Use a headless browser for JavaScript content
- `--wait`: Seconds to wait for page load (browser mode)
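The non-browser extraction step boils down to: fetch the HTML, then pull out the title, visible text, and links. A minimal stdlib-only sketch of that parsing step is below; `PageExtractor` and `extract` are illustrative names and may not match the actual script's internals (which could just as well use `requests` and BeautifulSoup):

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the title, visible text fragments, and outbound links from HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract(url, html):
    """Build a result record (see Output Format below) from raw HTML."""
    parser = PageExtractor()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "metadata": {"scraped_at": datetime.now(timezone.utc).isoformat()},
    }
```

Fetching itself (via `urllib.request` or similar) is kept separate from parsing so the extraction logic can be tested against static HTML without network access.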
### `scrape_batch.py`
Scrape multiple URLs from a file.
```bash
python scripts/scrape_batch.py urls.txt --output .tmp/results.json --delay 1
```
Options:

- `--output`: Output file path
- `--delay`: Delay between requests (seconds)
- `--browser`: Use a headless browser
- `--max-concurrent`: Maximum parallel requests (default: 5)
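Batch mode combines the `--delay` and `--max-concurrent` options: requests run in a bounded worker pool, and submissions are staggered to stay polite. A sketch of that pattern using a thread pool (the real script may use asyncio instead; `scrape_batch` here is an illustrative name):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_batch(urls, fetch, delay=1.0, max_concurrent=5):
    """Fetch URLs with at most `max_concurrent` in flight.

    `fetch` is any callable mapping url -> result dict. Failures are
    collected rather than raised, mirroring the {output}_errors.json idea.
    """
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {}
        for url in urls:
            futures[pool.submit(fetch, url)] = url
            time.sleep(delay)  # stagger submissions between requests
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:
                errors.append({"url": url, "error": str(exc)})
    return results, errors
```

Separating successes from failures lets a run complete even when individual URLs error out, with the failures written to a side file afterwards.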
## Output Format
Results are saved as JSON with this structure:
```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "text": "Extracted text content...",
  "links": ["https://..."],
  "metadata": {"scraped_at": "2024-01-01T12:00:00Z"}
}
```
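Downstream code that consumes this file can sanity-check each record's shape before processing, which catches truncated or malformed output early. A small illustrative loader (`load_results` and `REQUIRED_KEYS` are not part of the scripts):

```python
import json

# Keys every result record is expected to carry, per the format above.
REQUIRED_KEYS = {"url", "title", "text", "links", "metadata"}


def load_results(path):
    """Load scraper output and verify each record has the expected keys."""
    with open(path) as f:
        data = json.load(f)
    records = data if isinstance(data, list) else [data]
    for rec in records:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"{rec.get('url', '?')}: missing {sorted(missing)}")
    return records
```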
## Error Handling
- **Rate limits**: Scripts auto-retry with exponential backoff
- **Timeouts**: Default 30s, configurable via `--timeout`
- **Failed URLs**: Logged to `{output}_errors.json`
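The retry behavior described above can be sketched as a wrapper that doubles the wait after each failure and adds jitter so concurrent workers don't retry in lockstep; `fetch_with_retry` is an illustrative name, not the scripts' actual API:

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0):
    """Call `fetch(url)`, retrying on failure with exponential backoff.

    Waits base_delay * 2**attempt seconds (plus jitter) between attempts;
    the final failure is re-raised so the caller can log it.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```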
## When to Create New Scripts
Create a new script when:
- Existing scripts don't support the required output format
- Site requires specific authentication flow
- Complex multi-step navigation is needed

Follow the self-annealing pattern: fix issues, update scripts, update this skill.