---
name: harvest-deep-crawl
description: Multi-page deep crawling - documentation sites, wikis, knowledge bases
allowed-tools: [Bash, Read, Write, WebFetch, WebSearch]
keywords: [crawl, deep, multi-page, documentation, wiki, site, knowledge-base, depth]
---
# Harvest Deep Crawl
Crawl multi-page websites by following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.
## Usage

```
/crawl <url> --depth <N>
```
## Examples

```
# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50
```
## Parameters

| Param | Default | Description |
|---|---|---|
| `--depth` | 2 | Max link-following depth |
| `--max-pages` | 100 | Max pages to crawl |
| `--same-domain` | true | Stay on same domain |
| `--include` | `*` | URL pattern to include |
| `--exclude` | - | URL pattern to exclude |
## How It Works

- Start at the root URL and extract all internal links
- Follow links up to the specified depth, in BFS order (see the sketch below)
- Extract content from each page
- Deduplicate pages with > 90% content overlap
- Build a table of contents from the page hierarchy
- Merge into a coherent knowledge base
- Save results to `.claude/cache/agents/harvest/crawl-{domain}/`
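A minimal sketch of this loop, assuming the `requests` library and a naive `href` regex in place of the actual crawl engine (crawl4ai or WebFetch), with `difflib` standing in for the 90% overlap check:

```python
# Illustrative BFS crawl loop; not the skill's actual implementation.
import re
from collections import deque
from difflib import SequenceMatcher
from urllib.parse import urljoin, urlparse

import requests

def extract_links(html, base_url):
    """Return absolute, same-domain links found in an HTML page."""
    domain = urlparse(base_url).netloc
    hrefs = re.findall(r'href="([^"#]+)"', html)
    links = {urljoin(base_url, href) for href in hrefs}
    return {link for link in links if urlparse(link).netloc == domain}

def crawl(root_url, max_depth=2, max_pages=100):
    queue = deque([(root_url, 0)])      # BFS frontier: (url, depth)
    seen = {root_url}
    pages = []                          # (url, content) in crawl order

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        content = requests.get(url, timeout=30).text
        # Deduplicate: skip pages with > 90% content overlap with one already kept
        if any(SequenceMatcher(None, content, kept).ratio() > 0.9 for _, kept in pages):
            continue
        pages.append((url, content))
        if depth < max_depth:
            for link in extract_links(content, url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```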
## Output Structure

```
crawl-{domain}-{timestamp}/
  index.md          # Table of contents + summary
  page-001.md       # First page content
  page-002.md       # Second page content
  ...
  metadata.json     # Crawl stats, URLs, timings
```
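For illustration, a small Python sketch of writing out this layout. The file names come from the structure above, but the `metadata.json` fields shown are assumptions; the spec only says it holds crawl stats, URLs, and timings.

```python
# Illustrative writer for the output layout; metadata field names are assumptions.
import json
import time
from pathlib import Path

def save_crawl(pages, root_url, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    toc = [f"# Crawl of {root_url}", ""]
    for i, (url, content) in enumerate(pages, start=1):
        name = f"page-{i:03d}.md"
        (out / name).write_text(content)
        toc.append(f"- [{url}]({name})")
    (out / "index.md").write_text("\n".join(toc) + "\n")
    (out / "metadata.json").write_text(json.dumps({
        "root_url": root_url,                     # assumed field names
        "pages_crawled": len(pages),
        "urls": [url for url, _ in pages],
        "finished_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }, indent=2))
```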
## Crawl Engine

**Primary:** crawl4ai (Docker, port 11235)
```bash
curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
```
**Fallback:** manual link following, used when Docker is unavailable:
- WebFetch the root URL
- Parse links from the markdown output
- WebFetch each linked page (depth-limited)
- Compile the results (a rough sketch follows below)
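A rough Python equivalent of this fallback. The skill itself drives the WebFetch tool; `web_fetch()` here is a hypothetical stand-in assumed to return markdown, from which links are parsed:

```python
# Depth-1 sketch of the manual fallback; web_fetch() is a hypothetical stand-in
# for the WebFetch tool and is assumed to return markdown.
import re
from urllib.parse import urljoin, urlparse

def manual_crawl(root_url):
    root_md = web_fetch(root_url)                                # hypothetical fetcher
    hrefs = re.findall(r'\[[^\]]*\]\(([^)\s]+)\)', root_md)      # markdown-style links
    domain = urlparse(root_url).netloc
    links = {urljoin(root_url, h) for h in hrefs}
    same_domain = {l for l in links if urlparse(l).netloc == domain}
    # Compile results: url -> markdown content
    return {root_url: root_md, **{l: web_fetch(l) for l in same_domain}}
```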
## Use Cases
| Scenario | Depth | Max Pages |
|---|---|---|
| API reference | 2-3 | 50 |
| Full documentation site | 3-5 | 100 |
| Wiki section | 2 | 30 |
| Changelog history | 1-2 | 20 |
| Tutorial series | 2-3 | 30 |
## Rules
- Respect robots.txt (see the enforcement sketch below)
- Max 2 requests/second
- Skip binary files (PDF, images, videos)
- Detect and skip infinite pagination
- Cache results for 24 hours
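A sketch of how the robots.txt check, the 2 requests/second throttle, and the binary-file skip could be enforced in Python. This is illustrative only; the user-agent string and the extension list are assumptions, and caching is not shown:

```python
# Politeness rules: robots.txt check, throttle, binary skip (illustrative only).
import time
from urllib import robotparser
from urllib.parse import urlparse

BINARY_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".gif", ".mp4", ".zip")

def robots_allows(url, user_agent="harvest-deep-crawl"):   # user agent is an assumption
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                               # re-fetched per URL; fine for a sketch
    return rp.can_fetch(user_agent, url)

def polite_fetch_all(urls, fetch):
    results = {}
    for url in urls:
        if url.lower().endswith(BINARY_EXTENSIONS):         # skip binary files
            continue
        if not robots_allows(url):                          # respect robots.txt
            continue
        results[url] = fetch(url)
        time.sleep(0.5)                                      # max 2 requests/second
    return results
```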