---
name: web-scraper
description: Web scraping and data extraction from websites. Use when users need to extract data from web pages, scrape multiple URLs, gather structured information from websites, or collect data in bulk from online sources. Supports single-page scraping, multi-page crawling, and handling dynamic JavaScript-rendered content.
---
# Web Scraper Skill
Extract data from websites using deterministic Python scripts.
## Quick Start
- **Single URL**: Run `scripts/scrape_single.py` with a URL
- **Multiple URLs**: Run `scripts/scrape_batch.py` with a file of URLs
- **Dynamic content**: Use the `--browser` flag for JavaScript-rendered pages
## Scripts
### `scrape_single.py`
Extract content from a single URL.
```bash
python scripts/scrape_single.py "https://example.com" --output .tmp/result.json
```
Options:

- `--output`: Output file path (JSON or CSV)
- `--selector`: CSS selector to target specific content
- `--browser`: Use a headless browser for JavaScript content
- `--wait`: Seconds to wait for page load (browser mode)
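The non-browser extraction step boils down to: fetch the HTML, then pull out the title, visible text, and links. A minimal stdlib-only sketch of that parsing step is below; `PageExtractor` and `extract` are illustrative names and may not match the actual script's internals (which could just as well use `requests` and BeautifulSoup):

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the title, visible text fragments, and outbound links from HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract(url, html):
    """Build a result record (see Output Format below) from raw HTML."""
    parser = PageExtractor()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "metadata": {"scraped_at": datetime.now(timezone.utc).isoformat()},
    }
```

Fetching itself (via `urllib.request` or similar) is kept separate from parsing so the extraction logic can be tested against static HTML without network access.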
### `scrape_batch.py`
Scrape multiple URLs from a file.
```bash
python scripts/scrape_batch.py urls.txt --output .tmp/results.json --delay 1
```
Options:

- `--output`: Output file path
- `--delay`: Delay between requests (seconds)
- `--browser`: Use a headless browser
- `--max-concurrent`: Maximum parallel requests (default: 5)
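Batch mode combines the `--delay` and `--max-concurrent` options: requests run in a bounded worker pool, and submissions are staggered to stay polite. A sketch of that pattern using a thread pool (the real script may use asyncio instead; `scrape_batch` here is an illustrative name):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_batch(urls, fetch, delay=1.0, max_concurrent=5):
    """Fetch URLs with at most `max_concurrent` in flight.

    `fetch` is any callable mapping url -> result dict. Failures are
    collected rather than raised, mirroring the {output}_errors.json idea.
    """
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {}
        for url in urls:
            futures[pool.submit(fetch, url)] = url
            time.sleep(delay)  # stagger submissions between requests
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:
                errors.append({"url": url, "error": str(exc)})
    return results, errors
```

Separating successes from failures lets a run complete even when individual URLs error out, with the failures written to a side file afterwards.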
## Output Format
Results are saved as JSON with this structure:
```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "text": "Extracted text content...",
  "links": ["https://..."],
  "metadata": {"scraped_at": "2024-01-01T12:00:00Z"}
}
```
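Downstream code that consumes this file can sanity-check each record's shape before processing, which catches truncated or malformed output early. A small illustrative loader (`load_results` and `REQUIRED_KEYS` are not part of the scripts):

```python
import json

# Keys every result record is expected to carry, per the format above.
REQUIRED_KEYS = {"url", "title", "text", "links", "metadata"}


def load_results(path):
    """Load scraper output and verify each record has the expected keys."""
    with open(path) as f:
        data = json.load(f)
    records = data if isinstance(data, list) else [data]
    for rec in records:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"{rec.get('url', '?')}: missing {sorted(missing)}")
    return records
```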
## Error Handling
- **Rate limits**: Scripts auto-retry with exponential backoff
- **Timeouts**: Default 30s, configurable via `--timeout`
- **Failed URLs**: Logged to `{output}_errors.json`
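The retry behavior described above can be sketched as a wrapper that doubles the wait after each failure and adds jitter so concurrent workers don't retry in lockstep; `fetch_with_retry` is an illustrative name, not the scripts' actual API:

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0):
    """Call `fetch(url)`, retrying on failure with exponential backoff.

    Waits base_delay * 2**attempt seconds (plus jitter) between attempts;
    the final failure is re-raised so the caller can log it.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```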
## When to Create New Scripts
Create a new script when:
- Existing scripts don't support the required output format
- Site requires specific authentication flow
- Complex multi-step navigation is needed

Follow the self-annealing pattern: fix issues, update scripts, update this skill.