---
name: bio-batch-downloads
description: Download large datasets from NCBI efficiently using history server, batching, and rate limiting. Use when performing bulk sequence downloads, handling large query results, or production-scale data retrieval.
tool_type: python
primary_tool: Bio.Entrez
---
# Batch Downloads
Download large numbers of records from NCBI efficiently using the history server, batching, and proper rate limiting.
## Required Setup
```python
from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'  # Required by NCBI
Entrez.api_key = 'your_api_key'          # Recommended for large downloads
```
## Rate Limits
| Authentication | Requests/Second | Delay Between |
|---|---|---|
| Email only | 3 | 0.34 seconds |
| Email + API key | 10 | 0.1 seconds |
Get an API key at: https://www.ncbi.nlm.nih.gov/account/settings/
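Every pattern below derives its inter-request delay from whether an API key is set. As a minimal sketch, using the limits from the table above:

```python
# Choose the delay between consecutive E-utility requests:
# 0.1 s stays under 10 requests/second (email + API key);
# 0.34 s stays under 3 requests/second (email only).
delay = 0.1 if Entrez.api_key else 0.34
time.sleep(delay)  # call between consecutive requests
```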
## History Server
The history server stores search results on NCBI servers, enabling efficient batch retrieval without re-sending large ID lists.
### How It Works
- Search with `usehistory='y'`
- Get `WebEnv` (session ID) and `query_key` (result set ID)
- Fetch results in batches using these identifiers
- Results stay available for ~15 minutes
```python
# Search with history
handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND mRNA[fkey]', usehistory='y')
search = Entrez.read(handle)
handle.close()

webenv = search['WebEnv']
query_key = search['QueryKey']
total = int(search['Count'])
print(f"Found {total} records, stored in history")
```
## Core Pattern: Batch Download
```python
from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'

def batch_download(db, term, output_file, rettype='fasta', batch_size=500):
    # Search with history
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv = search['WebEnv']
    query_key = search['QueryKey']
    total = int(search['Count'])
    print(f"Downloading {total} records...")

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            print(f"  Fetching {start+1}-{min(start+batch_size, total)}...")
            handle = Entrez.efetch(
                db=db,
                rettype=rettype,
                retmode='text',
                retstart=start,
                retmax=batch_size,
                webenv=webenv,
                query_key=query_key
            )
            out.write(handle.read())
            handle.close()
            time.sleep(0.34)  # Rate limiting (no API key)

    print(f"Saved to {output_file}")
```
## Code Patterns
### Download All Search Results
```python
from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'
Entrez.api_key = 'your_api_key'  # Optional

def download_search_results(db, term, output_file, rettype='fasta', batch_size=500):
    # Search with history server
    handle = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
    search = Entrez.read(handle)
    handle.close()

    webenv = search['WebEnv']
    query_key = search['QueryKey']
    total = int(search['Count'])

    if total == 0:
        print("No records found")
        return

    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            end = min(start + batch_size, total)
            print(f"Downloading {start+1}-{end} of {total}")

            attempts = 3
            for attempt in range(attempts):
                try:
                    handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                           retstart=start, retmax=batch_size,
                                           webenv=webenv, query_key=query_key)
                    out.write(handle.read())
                    handle.close()
                    break
                except Exception as e:
                    if attempt < attempts - 1:
                        print(f"  Retry {attempt+1}: {e}")
                        time.sleep(5)
                    else:
                        raise

            time.sleep(delay)

    print(f"Downloaded {total} records to {output_file}")

download_search_results('nucleotide', 'human[orgn] AND insulin[gene] AND mRNA[fkey]', 'insulin_mrna.fasta')
```
### Download by ID List
```python
def download_by_ids(db, ids, output_file, rettype='fasta', batch_size=200):
    total = len(ids)
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            batch = ids[start:start+batch_size]
            print(f"Downloading {start+1}-{start+len(batch)} of {total}")

            handle = Entrez.efetch(db=db, id=','.join(batch), rettype=rettype, retmode='text')
            out.write(handle.read())
            handle.close()
            time.sleep(delay)

    print(f"Downloaded {total} records to {output_file}")

# Example with list of IDs
ids = ['NM_007294', 'NM_000059', 'NM_000546', 'NM_001126112', 'NM_004985']
download_by_ids('nucleotide', ids, 'genes.fasta')
```
### Post IDs to History (EPost)
For very large ID lists, post them to the history server first:
```python
def post_and_download(db, ids, output_file, rettype='fasta', batch_size=500):
    # Post IDs to history server
    handle = Entrez.epost(db=db, id=','.join(ids))
    result = Entrez.read(handle)
    handle.close()

    webenv = result['WebEnv']
    query_key = result['QueryKey']
    total = len(ids)
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            end = min(start + batch_size, total)
            print(f"Fetching {start+1}-{end} of {total}")

            handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)
            out.write(handle.read())
            handle.close()
            time.sleep(delay)

    print(f"Downloaded {total} records")
```
### Download GenBank with Progress
```python
from Bio import Entrez, SeqIO
import time

def download_genbank_records(term, output_file, batch_size=100):
    Entrez.email = 'your.email@example.com'

    # Search
    handle = Entrez.esearch(db='nucleotide', term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])

    records = []
    for start in range(0, total, batch_size):
        print(f"Fetching {start+1}-{min(start+batch_size, total)} of {total}")

        handle = Entrez.efetch(db='nucleotide', rettype='gb', retmode='text',
                               retstart=start, retmax=batch_size,
                               webenv=webenv, query_key=query_key)
        batch_records = list(SeqIO.parse(handle, 'genbank'))
        handle.close()
        records.extend(batch_records)
        time.sleep(0.34)

    SeqIO.write(records, output_file, 'genbank')
    print(f"Saved {len(records)} GenBank records")
    return records
```
### Download with Retry Logic
```python
import time
from urllib.error import HTTPError

def robust_download(db, term, output_file, rettype='fasta', batch_size=500, max_retries=3):
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            for retry in range(max_retries):
                try:
                    handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                           retstart=start, retmax=batch_size,
                                           webenv=webenv, query_key=query_key)
                    data = handle.read()
                    handle.close()
                    if data.strip():
                        out.write(data)
                    break
                except HTTPError as e:
                    if retry == max_retries - 1:
                        raise  # out of retries: propagate the error
                    if e.code == 429:  # Rate limit
                        wait = 10 * (retry + 1)
                        print(f"Rate limited, waiting {wait}s...")
                        time.sleep(wait)
                    else:
                        time.sleep(5)
            time.sleep(delay)

    print(f"Downloaded to {output_file}")
```
### Stream to File (Memory Efficient)
```python
def stream_download(db, term, output_file, rettype='fasta', batch_size=1000):
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])

    downloaded = 0
    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)
            # Stream chunks to file
            while True:
                chunk = handle.read(8192)
                if not chunk:
                    break
                out.write(chunk)
            handle.close()

            downloaded = min(start + batch_size, total)
            print(f"Progress: {downloaded}/{total} ({100*downloaded/total:.1f}%)")
            time.sleep(0.34)
```
## Batch Size Guidelines
| Database | rettype | Recommended Batch |
|---|---|---|
| nucleotide | fasta | 500-1000 |
| nucleotide | gb | 100-200 |
| protein | fasta | 500-1000 |
| protein | gp | 100-200 |
| pubmed | abstract | 1000-2000 |
| pubmed | xml | 200-500 |
Use smaller batches for GenBank and XML formats, since each record carries more data.
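One way to encode this table in code, as a sketch (the helper name and dictionary are illustrative, not part of Bio.Entrez; values are the upper ends of the recommended ranges):

```python
# Illustrative mapping from (db, rettype) to a batch size per the table above
BATCH_SIZES = {
    ('nucleotide', 'fasta'): 1000,
    ('nucleotide', 'gb'): 200,
    ('protein', 'fasta'): 1000,
    ('protein', 'gp'): 200,
    ('pubmed', 'abstract'): 2000,
    ('pubmed', 'xml'): 500,
}

def suggested_batch_size(db, rettype, default=500):
    """Return a batch size from the guidelines table, falling back to a default."""
    return BATCH_SIZES.get((db, rettype), default)
```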
## Common Errors
| Error | Cause | Solution |
|---|---|---|
| `HTTPError 429` | Rate limit exceeded | Increase delay, use API key |
| `HTTPError 400` | Invalid WebEnv/query_key | Session expired, re-search |
| Incomplete data | Connection interrupted | Add retry logic |
| Memory error | Batch too large | Reduce batch_size |
| Empty response | No more records | Check total vs start |
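For the session-expiry case (`HTTPError 400`), one recovery approach is to redo the search and retry the fetch once. A minimal sketch, assuming setup as in the patterns above (the function name is illustrative):

```python
from urllib.error import HTTPError

def fetch_batch_with_recovery(db, term, webenv, query_key, start, batch_size, rettype='fasta'):
    """Fetch one batch; on HTTPError 400, refresh the expired history session once."""
    try:
        handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                               retstart=start, retmax=batch_size,
                               webenv=webenv, query_key=query_key)
    except HTTPError as e:
        if e.code != 400:
            raise
        # Session expired: re-run the search for fresh WebEnv/query_key
        search_handle = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
        search = Entrez.read(search_handle)
        search_handle.close()
        handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                               retstart=start, retmax=batch_size,
                               webenv=search['WebEnv'], query_key=search['QueryKey'])
    data = handle.read()
    handle.close()
    return data
```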
## Decision Tree
```
Need to download many NCBI records?
├── Have search query?
│   └── Use esearch with usehistory='y', then batch efetch
├── Have list of IDs?
│   ├── < 200 IDs? → Direct efetch with comma-separated IDs
│   └── >= 200 IDs? → Use epost, then batch efetch
├── Need records as Biopython objects?
│   └── Parse each batch with SeqIO
├── Downloading > 10,000 records?
│   └── Use streaming to avoid memory issues
└── Getting rate limited?
    └── Get API key, add retry logic
```
## Related Skills
- `entrez-search` - Build search queries for batch downloads
- `entrez-fetch` - Single record fetching
- `sra-data` - For raw sequencing data (use SRA toolkit instead)