name: "gene-database"
description: "NCBI Gene via E-utilities: curated records across 1M+ taxa. Official symbols, aliases, RefSeq IDs, summaries, coordinates, GO, interactions. Use for gene ID resolution and cross-species function queries. For sequences use Ensembl; for expression use geo-database."
license: "CC0-1.0"
NCBI Gene Database
Overview
NCBI Gene is the authoritative curated database for gene-centric information, covering 1M+ genes across hundreds of thousands of taxa. Each gene record includes the official symbol, aliases, full name, functional summary, genomic coordinates (GRCh38/GRCh37 for human), RefSeq accessions, GO annotations, interaction partners, and links to related databases. Access is free via the E-utilities REST API; an API key is optional but recommended because it raises the rate limit.
When to Use
- Resolving gene aliases and synonyms to the current official HGNC/NCBI symbol
- Fetching the NCBI Gene ID (integer) for a gene symbol for downstream API calls (e.g., dbSNP, ClinVar, GEO)
- Retrieving curated gene summaries and function descriptions programmatically
- Pulling RefSeq mRNA (NM_) and protein (NP_) accessions associated with a gene
- Querying GO functional annotations (Biological Process, Molecular Function, Cellular Component)
- Cross-species gene queries using the same Gene ID space
- For expression profiles across conditions use geo-database; for variant annotations use clinvar-database or ensembl-database
Prerequisites
- Python packages: requests, xml.etree.ElementTree (stdlib), pandas (optional)
- Data requirements: gene symbols, NCBI Gene IDs, or tax IDs
- Environment: internet connection; NCBI email required (set email parameter)
- Rate limits: 3 req/s unauthenticated; 10 req/s with free NCBI API key
pip install requests pandas
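The rate limits and the email / api_key parameters listed above can be wrapped in a small helper. The following is a minimal sketch, not part of the original skill: the names eutils_params, min_interval and Throttle are hypothetical, and the two rates are simply the figures quoted in the prerequisites.

```python
import time

# Request rates quoted in the prerequisites above.
RATE_NO_KEY = 3      # requests per second without an API key
RATE_WITH_KEY = 10   # requests per second with an API key


def eutils_params(email, api_key=None, **extra):
    """Build the query parameters shared by every E-utilities call."""
    params = {"email": email, **extra}
    if api_key:  # only send the key when one is configured
        params["api_key"] = api_key
    return params


def min_interval(api_key=None):
    """Shortest pause (seconds) between calls that respects the rate limit."""
    rate = RATE_WITH_KEY if api_key else RATE_NO_KEY
    return 1.0 / rate


class Throttle:
    """Sleep just long enough to keep calls at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A typical use would be to create one Throttle(min_interval(api_key)) per script and call wait() immediately before each requests.get or requests.post.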
Quick Start
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def gene_search(query, retmax=5):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "term": query,
                             "retmax": retmax, "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]
# Find human BRCA1 gene ID
ids = gene_search("BRCA1[sym] AND Homo sapiens[orgn]")
print(f"Gene IDs for BRCA1: {ids}") # → ['672']
Core API
Query 1: Search by Symbol, Name, or Function
Use ESearch with field tags for precise queries.
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
# Exact symbol match for human gene
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "gene", "email": EMAIL, "retmode": "json",
"term": "TP53[sym] AND Homo sapiens[orgn] AND alive[prop]"})
ids = r.json()["esearchresult"]["idlist"]
print(f"TP53 Gene ID: {ids}") # → ['7157']
# Search by function keyword
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "gene", "email": EMAIL, "retmode": "json",
"term": "CRISPR[title] AND Homo sapiens[orgn]", "retmax": 5})
ids = r.json()["esearchresult"]["idlist"]
print(f"CRISPR-related gene IDs: {ids}")
Query 2: Fetch Gene Summary (JSON/ESummary)
Retrieve key metadata fields for a list of Gene IDs.
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def esummary_gene(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

result = esummary_gene(["672", "675", "7157"])  # BRCA1, BRCA2, TP53
for uid in result.get("uids", []):
    g = result[uid]
    print(f"\n{g.get('name')} (ID {uid})")
    print(f" Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
    print(f" Chr location : {g.get('maplocation')}")
    print(f" Summary (first 100): {g.get('summary', '')[:100]}...")
    print(f" Aliases: {g.get('otheraliases', 'none')}")
Query 3: Fetch Full Gene Record (XML)
Retrieve the complete gene record in XML for RefSeq accessions, GO terms, and interaction data.
import requests
import xml.etree.ElementTree as ET
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def efetch_gene_table(gene_id):
    """Fetch the gene_table text report (not XML) for one Gene ID."""
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "gene", "id": gene_id,
                             "rettype": "gene_table", "retmode": "text", "email": EMAIL})
    r.raise_for_status()
    return r.text

# Get gene table (tab-delimited overview)
table = efetch_gene_table("672")
print(table[:500])
# XML for RefSeq accession extraction
r = requests.get(f"{BASE}/efetch.fcgi",
params={"db": "gene", "id": "672",
"rettype": "xml", "retmode": "xml", "email": EMAIL})
root = ET.fromstring(r.text)
# Extract RefSeq mRNA accessions
for ref in root.iter("Gene-commentary"):
    acc = ref.find("Gene-commentary_accession")
    ver = ref.find("Gene-commentary_version")
    typ = ref.find("Gene-commentary_type")
    if acc is not None and acc.text and acc.text.startswith("NM_"):
        print(f"RefSeq mRNA: {acc.text}.{ver.text if ver is not None else ''}")
Query 4: Batch Symbol-to-ID Mapping
Map a list of gene symbols to NCBI Gene IDs efficiently.
import requests, time
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def symbols_to_ids(symbols, organism="Homo sapiens"):
    """Map gene symbols to NCBI Gene IDs. Returns dict {symbol: gene_id}."""
    mapping = {}
    for sym in symbols:
        r = requests.get(f"{BASE}/esearch.fcgi",
                         params={"db": "gene", "email": EMAIL, "retmode": "json",
                                 "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
        r.raise_for_status()
        ids = r.json()["esearchresult"]["idlist"]
        mapping[sym] = ids[0] if ids else None
        # Stay under 3 req/s without an API key; 0.1 s is enough once a key is sent
        time.sleep(0.35)
    return mapping
genes = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN"]
id_map = symbols_to_ids(genes)
for sym, gid in id_map.items():
    print(f"{sym:10s} → Gene ID {gid}")
Query 5: GO Annotation Retrieval
Parse GO terms from the gene XML record.
import requests
import xml.etree.ElementTree as ET
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
r = requests.get(f"{BASE}/efetch.fcgi",
params={"db": "gene", "id": "7157",
"rettype": "xml", "retmode": "xml", "email": EMAIL})
root = ET.fromstring(r.text)
# Extract GO annotations
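go_terms = []
for ref in root.iter("Gene-commentary"):
    heading = ref.find("Gene-commentary_heading")
    # Heading text may be absent; skip those entries instead of raising TypeError
    if heading is None or not heading.text:
        continue
    # Gene XML typically spells this heading "GeneOntology" (no space);
    # strip spaces so either spelling matches
    if "GeneOntology" in heading.text.replace(" ", ""):
        # GO term names are expected in Other-source_anchor elements under
        # this block; verify against a live record before relying on it
        for anchor in ref.iter("Other-source_anchor"):
            if anchor.text:
                go_terms.append(anchor.text)
print(f"TP53 GO terms ({len(go_terms)} found):")
for term in go_terms[:10]:
    print(f" {term}")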
Query 6: Cross-Species Gene Query
Find orthologs across species using NCBI Gene IDs.
import requests, time
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
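def find_ortholog(human_gene_id, target_organism):
    """Find homolog Gene IDs via NCBI Gene ELink.

    Note: target_organism is not applied here; the returned IDs are not
    filtered by species and need a follow-up ESummary check.
    """
    r = requests.get(f"{BASE}/elink.fcgi",
                     params={"dbfrom": "gene", "db": "gene",
                             "id": human_gene_id, "linkname": "gene_gene_homolog",
                             "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    linksets = r.json().get("linksets", [])
    if not linksets:
        return []
    linksetdbs = linksets[0].get("linksetdbs", [])
    if not linksetdbs:
        return []
    # Links can be bare ID strings or {"id": ...} objects; accept both
    links = linksetdbs[0].get("links", [])
    homolog_ids = [str(l["id"]) if isinstance(l, dict) else str(l) for l in links]
    return homolog_ids[:10]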
# Human TP53 (7157) homologs
homolog_ids = find_ortholog("7157", "Mus musculus")
print(f"Homolog Gene IDs for TP53: {homolog_ids}")
Key Concepts
NCBI Gene ID vs. HGNC ID vs. Ensembl ID
NCBI Gene IDs are integers assigned per gene per organism (e.g., human TP53 = 7157). These are distinct from HGNC IDs (e.g., HGNC:11998) and Ensembl IDs (ENSG00000141510). Many downstream NCBI databases (ClinVar, dbSNP, GEO) use NCBI Gene IDs internally.
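Because the three identifier styles look different, a pipeline can tell them apart before deciding which service to call. The sketch below is illustrative only: classify_gene_id is a hypothetical helper, and the Ensembl pattern assumes human gene IDs of the form ENSG plus 11 digits, as in the example above.

```python
import re

# Patterns for the three identifier styles described above.
# Assumption: the Ensembl pattern covers human gene IDs only (ENSG + 11 digits,
# optionally followed by a version suffix such as ".12").
_PATTERNS = [
    ("hgnc", re.compile(r"^HGNC:\d+$")),
    ("ensembl", re.compile(r"^ENSG\d{11}(\.\d+)?$")),
    ("ncbi_gene", re.compile(r"^\d+$")),
]


def classify_gene_id(value):
    """Return 'hgnc', 'ensembl', 'ncbi_gene', or 'unknown' for an identifier."""
    text = str(value).strip()
    for kind, pattern in _PATTERNS:
        if pattern.match(text):
            return kind
    return "unknown"


for example in (7157, "HGNC:11998", "ENSG00000141510", "TP53"):
    print(example, "->", classify_gene_id(example))
```

Anything classified as unknown, such as a bare symbol, would first need an ESearch lookup to obtain its Gene ID.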
alive[prop] Filter
NCBI Gene records for discontinued genes have status=discontinued. Always add AND alive[prop] to symbol queries to exclude retired entries and avoid retrieving stale data.
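One way to make sure the filter is never forgotten is to build every symbol query through a single function. This is a minimal sketch; build_gene_term is a hypothetical helper name, and it only assembles the same term strings used elsewhere in this document.

```python
def build_gene_term(symbol, organism="Homo sapiens", alive_only=True):
    """Assemble an ESearch term for db=gene from a symbol and an organism."""
    parts = [f"{symbol}[sym]", f"{organism}[orgn]"]
    if alive_only:
        # Exclude discontinued records unless the caller opts out
        parts.append("alive[prop]")
    return " AND ".join(parts)


print(build_gene_term("TP53"))
print(build_gene_term("Trp53", organism="Mus musculus", alive_only=False))
```

The returned string can be passed directly as the term parameter of an ESearch request.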
Common Workflows
Workflow 1: Build a Gene Annotation Table
Goal: For a list of gene symbols, retrieve Gene ID, official name, chromosomal location, and description.
import requests, time, pandas as pd
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def search_gene(sym, organism="Homo sapiens"):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "email": EMAIL, "retmode": "json",
                             "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
    ids = r.json()["esearchresult"]["idlist"]
    return ids[0] if ids else None

def batch_summary(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    return r.json()["result"]
symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC", "KRAS", "PTEN"]
# Step 1: Symbol → Gene ID
id_map = {}
for sym in symbols:
    gid = search_gene(sym)
    id_map[sym] = gid
    time.sleep(0.35)  # stay under 3 req/s without an API key
# Step 2: Batch summary
valid_ids = [v for v in id_map.values() if v]
result = batch_summary(valid_ids)
rows = []
# Reverse lookup: Gene ID → the symbol that was originally queried
id_to_sym = {v: k for k, v in id_map.items() if v}
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({
        "symbol": id_to_sym.get(uid, g.get("name")),
        "gene_id": uid,
        "full_name": g.get("description"),
        "chr_location": g.get("maplocation"),
        "summary": g.get("summary", "")[:200],
    })
df = pd.DataFrame(rows)
df.to_csv("gene_annotations.csv", index=False)
print(df[["symbol", "gene_id", "full_name", "chr_location"]].to_string(index=False))
Workflow 2: Find All Genes in a Pathway Keyword
Goal: Retrieve all human genes associated with a biological keyword from the NCBI Gene summary field.
import requests, time, pandas as pd
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
keyword = "DNA mismatch repair"
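# [title/abstract] is a PubMed field tag; for db=gene, [All Fields] searches
# the whole record, including the summary text
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "retmax": 50,
                         "term": f"{keyword}[All Fields] AND Homo sapiens[orgn] AND alive[prop]"})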
ids = r.json()["esearchresult"]["idlist"]
print(f"Found {len(ids)} genes related to '{keyword}'")
# Fetch summaries
r2 = requests.post(f"{BASE}/esummary.fcgi",
data={"db": "gene", "id": ",".join(ids), "retmode": "json", "email": EMAIL})
result = r2.json()["result"]
rows = []
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({"gene_id": uid, "symbol": g.get("name"),
                 "description": g.get("description"),
                 "location": g.get("maplocation")})
df = pd.DataFrame(rows)
print(df.to_string(index=False))
df.to_csv(f"{keyword.replace(' ', '_')}_genes.csv", index=False)
Key Parameters
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| retmax | ESearch | 20 | 1–10000 | Max records returned |
| retmode | ESearch/ESummary | "xml" | "json", "xml" | Response format |
| rettype | EFetch | depends | "xml", "gene_table", "text" | Record format for full fetch |
| [sym] field tag | ESearch | — | gene symbol | Match exact official symbol only |
| [orgn] field tag | ESearch | — | organism name or tax ID | Filter by taxonomy |
| alive[prop] | ESearch | — | boolean flag | Exclude discontinued gene records |
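Since retmax caps how many IDs one ESearch call returns, larger result sets have to be read in pages. The sketch below assumes the standard ESearch retstart parameter (the zero-based offset of the first record to return), which is not listed in the table above; page_offsets and page_params are hypothetical helper names.

```python
def page_offsets(total, page_size):
    """Yield the retstart offset of each page needed to cover `total` records."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for start in range(0, total, page_size):
        yield start


def page_params(term, total, page_size, email):
    """Build one ESearch parameter dict per page for db=gene."""
    return [
        {"db": "gene", "term": term, "retmode": "json", "email": email,
         "retmax": page_size, "retstart": start}
        for start in page_offsets(total, page_size)
    ]


# `total` would normally come from the "count" field of a first ESearch call
pages = page_params("17[chr] AND Homo sapiens[orgn]", total=25, page_size=10,
                    email="your@email.com")
print([p["retstart"] for p in pages])
```

Each dict can be sent as the params of a separate ESearch request, with a pause between calls to respect the rate limit.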
Best Practices
- Always add alive[prop]: Discontinued gene records remain in the database. Without this filter, symbol searches may return outdated records.
- Use Gene IDs in pipelines: Downstream NCBI databases (ClinVar, dbSNP, GEO) accept Gene IDs; avoid re-searching by symbol in each call.
- Use ESummary for metadata, EFetch for full records: ESummary returns JSON with all common fields; EFetch XML is needed only for RefSeq accessions, GO terms, or interaction links.
- Register for a free API key: Raise your rate limit from 3 to 10 req/s at https://www.ncbi.nlm.nih.gov/account/. Pass it as the api_key parameter.
- Batch with ESummary: POST up to 200 Gene IDs per call to ESummary instead of querying one at a time.
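The batching advice can be sketched as two small helpers: one that splits a long ID list into groups of at most 200, and one that merges the per-batch ESummary results back into a single dict. Both names (chunked, merge_summaries) are hypothetical, and the merge assumes each batch has the shape used throughout this document: a "uids" list plus one entry per UID.

```python
def chunked(ids, size=200):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def merge_summaries(results):
    """Combine several ESummary 'result' dicts into one with the same shape."""
    merged = {"uids": []}
    for result in results:
        for uid in result.get("uids", []):
            if uid not in merged:  # skip IDs already seen in an earlier batch
                merged["uids"].append(uid)
                merged[uid] = result[uid]
    return merged


# Example with made-up stand-ins for two ESummary responses
batch_a = {"uids": ["672"], "672": {"name": "BRCA1"}}
batch_b = {"uids": ["7157"], "7157": {"name": "TP53"}}
combined = merge_summaries([batch_a, batch_b])
print(combined["uids"])
```

In a real run, each chunk from chunked(all_ids) would be POSTed to ESummary and the responses collected before merging.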
Common Recipes
Recipe: Gene ID to RefSeq NM Accession
When to use: Get the canonical mRNA accession for a protein-coding gene.
import requests, re
EMAIL = "your@email.com"
GENE_ID = "672" # BRCA1
r = requests.get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
params={"db": "gene", "id": GENE_ID, "rettype": "gene_table",
"retmode": "text", "email": EMAIL}
)
nm_accessions = re.findall(r"NM_\d+\.\d+", r.text)
print(f"RefSeq mRNA accessions: {list(set(nm_accessions))}")
Recipe: Retrieve Gene Aliases
When to use: Resolve legacy/alias symbols to the current official NCBI symbol.
import requests
EMAIL = "your@email.com"
# P53 is an alias for TP53
r = requests.get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
params={"db": "gene", "email": EMAIL, "retmode": "json",
"term": "p53[sym] AND Homo sapiens[orgn] AND alive[prop]"}
)
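ids = r.json()["esearchresult"]["idlist"]
if not ids:
    # Without this guard, ids[0] below raises IndexError on an empty result
    raise SystemExit("No Gene record matched p53[sym]; try a broader field tag")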
r2 = requests.post("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
data={"db": "gene", "id": ",".join(ids[:1]),
"retmode": "json", "email": EMAIL})
g = r2.json()["result"][ids[0]]
print(f"Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
print(f"Other aliases : {g.get('otheraliases')}")
print(f"Designations : {g.get('otherdesignations', '')[:100]}")
Recipe: List All Genes on a Chromosome
When to use: Get all protein-coding genes on a specific human chromosome.
import requests
EMAIL = "your@email.com"
r = requests.get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
params={"db": "gene", "email": EMAIL, "retmode": "json", "retmax": 5,
"term": "17[chr] AND Homo sapiens[orgn] AND protein coding[filter] AND alive[prop]"}
)
result = r.json()["esearchresult"]
print(f"Protein-coding genes on chr17: {result['count']} total")
print(f"Sample IDs: {result['idlist']}")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty idlist for known symbol | Symbol is an alias, not the official term | Use [gene name] or [title] field tag; check aliases via ESummary |
| Wrong species returned | Missing organism filter | Add AND Homo sapiens[orgn] or target tax ID (9606[taxid]) |
| Discontinued gene returned | Missing alive[prop] filter | Append AND alive[prop] to all symbol queries |
| HTTP 429 rate limit | Too many requests | Add time.sleep(0.35) between calls; use NCBI API key |
| ESummary missing uids key | All IDs invalid/absent | Check id values are valid integers, not empty strings |
| XML parse error | Malformed XML for rare genes | Wrap ET.fromstring in try/except; retry with rettype=text |
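Two of the fixes in the table, backing off after HTTP 429 and guarding the XML parse, can be sketched without touching the network. The helper names below are hypothetical; call_with_backoff accepts any zero-argument callable that returns an object with a status_code attribute (such as a requests response), and the doubling delay schedule is an assumption rather than an NCBI recommendation.

```python
import time
import xml.etree.ElementTree as ET


def parse_xml_or_none(text):
    """Parse XML text, returning None instead of raising on malformed input."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


def call_with_backoff(fetch, retries=3, base_delay=0.35, sleep=time.sleep):
    """Call fetch() again after a growing pause while it keeps returning 429.

    The pause doubles on each retry (0.35 s, 0.7 s, 1.4 s by default). The last
    response is returned even if it is still a 429, so the caller can decide
    what to do with it.
    """
    for attempt in range(retries + 1):
        response = fetch()
        if response.status_code != 429 or attempt == retries:
            return response
        sleep(base_delay * (2 ** attempt))
```

With requests, the call would look like call_with_backoff(lambda: requests.get(url, params=params)), followed by parse_xml_or_none(response.text) for XML responses.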
Related Skills
- geo-database: Gene Expression Omnibus for retrieving expression data linked to genes found here
- clinvar-database: Clinical variant data indexed by NCBI Gene IDs
- ensembl-database: Complementary gene annotations with VEP and comparative genomics
- biopython-molecular-biology: Biopython Entrez module wraps E-utilities with typed return values
References
- NCBI Gene database — Official homepage and search interface
- E-utilities documentation — Complete API reference for ESearch, ESummary, EFetch
- NCBI Gene field tags — Field tag reference for constructing Entrez queries
- NCBI API Key registration — Free registration for 10 req/s rate limit