name: "pubchem-compound-search"
description: "Query PubChem (110M+ compounds) via PubChemPy/PUG-REST. Search by name/CID/SMILES, get properties (MW, LogP, TPSA), similarity/substructure search, bioactivity. For local cheminformatics use rdkit; for multi-DB queries use bioservices."
license: "CC-BY-4.0"
PubChem Compound Search
Overview
PubChem is the world's largest freely available chemical database with 110M+ compounds. This skill covers searching compounds by name, structure, or identifier, retrieving molecular properties, performing similarity/substructure searches, and accessing bioactivity data through PubChemPy (Python wrapper) and PUG-REST API (direct HTTP).
When to Use
- Looking up a compound by name, CAS number, or SMILES to get its PubChem CID and properties
- Retrieving molecular properties (molecular weight, LogP, TPSA, H-bond counts) for known compounds
- Finding structurally similar compounds via Tanimoto similarity search
- Searching for compounds containing a specific substructure (pharmacophore screening)
- Converting between chemical identifier formats (name ↔ CID ↔ SMILES ↔ InChI)
- Accessing bioactivity screening data (assay results, active/inactive status)
- Batch property comparison across a set of drug candidates
- For local molecular computation (fingerprints, descriptors, 3D conformers), use rdkit instead
- For querying multiple databases (UniProt, KEGG, ChEMBL) in one workflow, use bioservices instead
Prerequisites
- Python packages: pubchempy, requests (for direct API), pandas (for batch processing)
- No API key required: PubChem is freely accessible
- Rate limits: Max 5 requests/second, 400 requests/minute
pip install pubchempy requests pandas
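The rate limits above can be enforced with a small client-side throttle. The sketch below is one possible approach, not part of PubChemPy: the `Throttle` class and its parameters are hypothetical names introduced for illustration. Spacing calls at least 0.2 s apart keeps traffic at 5 requests/second, which also stays under 400 requests/minute.

```python
import time


class Throttle:
    """Space successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first call

    def wait(self):
        """Sleep just long enough to honor the minimum interval, then record the call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_second=5)
throttle.wait()  # call this immediately before each PubChem request
```

Calling `throttle.wait()` before every request replaces scattered `time.sleep` calls and only sleeps when requests would otherwise arrive too quickly.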
Quick Start
import pubchempy as pcp
# Search by name → get properties
compound = pcp.get_compounds("aspirin", "name")[0]
print(f"CID: {compound.cid}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"MW: {compound.molecular_weight}, LogP: {compound.xlogp}")
print(f"HBD: {compound.h_bond_donor_count}, HBA: {compound.h_bond_acceptor_count}")
Workflow
Step 1: Compound Search
Search by name, CID, SMILES, InChI, or molecular formula.
import pubchempy as pcp
# By name
compounds = pcp.get_compounds("caffeine", "name")
print(f"Found {len(compounds)} compounds for 'caffeine'")
# By CID (fastest)
compound = pcp.Compound.from_cid(2244) # Aspirin
print(f"CID 2244 = {compound.iupac_name}")
# By SMILES
compound = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")[0]
print(f"SMILES lookup: CID {compound.cid}")
# By molecular formula (returns all matches)
formula_matches = pcp.get_compounds("C9H8O4", "formula")
print(f"Formula C9H8O4 matches: {len(formula_matches)} compounds")
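The same lookups can be issued directly against PUG-REST, whose compound URLs follow the pattern `/rest/pug/compound/<namespace>/<identifier>/<operation>/<output>`. The helper below is a minimal sketch that only builds the URL; `build_compound_url` is a hypothetical name, and SMILES containing characters such as `/` or `#` may need to be sent another way (for example in a POST body), which is not covered here.

```python
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_compound_url(namespace, identifier, operation="cids", output="JSON"):
    """Compose a PUG-REST compound URL from its path segments."""
    # Percent-encode the identifier so spaces and punctuation survive in the path.
    encoded = quote(str(identifier), safe="")
    return f"{PUG_BASE}/compound/{namespace}/{encoded}/{operation}/{output}"


print(build_compound_url("name", "aspirin"))
print(build_compound_url("cid", 2244, "property/MolecularWeight,XLogP"))
```

The resulting string can be passed to `requests.get` in the same way as the bioactivity URL used later in this workflow.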
Step 2: Property Retrieval
Get molecular properties for one or more compounds.
import pubchempy as pcp
# Full compound object
compound = pcp.get_compounds("ibuprofen", "name")[0]
print(f"MW: {compound.molecular_weight}")
print(f"LogP: {compound.xlogp}")
print(f"TPSA: {compound.tpsa}")
print(f"Rotatable bonds: {compound.rotatable_bond_count}")
# Selective property retrieval (more efficient for specific needs)
props = pcp.get_properties(
["MolecularWeight", "XLogP", "TPSA", "HBondDonorCount"],
"aspirin", "name"
)
print(props) # List of dicts
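Values in the returned dictionaries are not always numeric: depending on the PubChemPy and PubChem versions, fields such as `MolecularWeight` may arrive as strings, and unavailable properties may be absent. The sketch below normalizes such rows before any arithmetic; `normalize_properties` is a hypothetical helper, and the sample row is hand-written for illustration rather than fetched live.

```python
def normalize_properties(rows, numeric_keys=("MolecularWeight", "XLogP", "TPSA")):
    """Return copies of `rows` with the listed keys cast to float (None if absent)."""
    cleaned = []
    for row in rows:
        out = dict(row)  # copy so the caller's data is not modified
        for key in numeric_keys:
            value = row.get(key)
            out[key] = float(value) if value is not None else None
        cleaned.append(out)
    return cleaned


# Illustrative sample shaped like a get_properties() result; not fetched live.
sample = [{"CID": 2244, "MolecularWeight": "180.16", "XLogP": 1.2}]
print(normalize_properties(sample))
```

Missing keys become `None` rather than raising, so downstream code can decide explicitly how to treat unavailable properties.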
Step 3: Similarity Search
Find structurally similar compounds using Tanimoto coefficient.
import pubchempy as pcp
# Get reference compound SMILES
ref = pcp.get_compounds("gefitinib", "name")[0]
# Similarity search (may take 15-30s for async processing)
similar = pcp.get_compounds(
ref.canonical_smiles, "smiles",
searchtype="similarity",
Threshold=85, # Tanimoto threshold (0-100)
MaxRecords=50
)
print(f"Found {len(similar)} compounds with ≥85% similarity to gefitinib")
for comp in similar[:5]:
    print(f" CID {comp.cid}: MW={comp.molecular_weight}")
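For intuition about the threshold, the Tanimoto coefficient of two fingerprints is the number of shared "on" bits divided by the number of bits set in either one. The toy example below uses small hand-picked bit sets rather than real PubChem fingerprints, so the numbers are purely illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


fp_ref = {1, 4, 7, 9}    # toy fingerprint of the reference compound
fp_hit = {1, 4, 7, 12}   # toy fingerprint of a candidate
score = tanimoto(fp_ref, fp_hit)
print(f"Tanimoto = {score:.2f}")  # 3 shared bits / 5 bits in total = 0.60
```

On this scale, `Threshold=85` in the search above corresponds to a coefficient of at least 0.85.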
Step 4: Substructure Search
Find compounds containing a specific structural motif.
import pubchempy as pcp
# Search for sulfonamide-containing compounds
hits = pcp.get_compounds(
"S(=O)(=O)N", "smiles",
searchtype="substructure",
MaxRecords=100
)
print(f"Found {len(hits)} compounds with sulfonamide group")
Step 5: Bioactivity Data Access
Retrieve biological screening results via PUG-REST API.
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url, timeout=30)
if response.status_code == 200:
    data = response.json()
    rows = data.get("Table", {}).get("Row", [])
    print(f"Aspirin has {len(rows)} bioassay records")
else:
    print(f"Request failed with HTTP {response.status_code}")
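Counting rows is only a first step; the per-assay outcomes are usually what matters. The sketch below assumes the response lists column names under `Table.Columns.Column` and stores each row's values in a parallel `Cell` list, and that one column is called `Activity Outcome`. Both are assumptions to verify against a real response. The sample payload is hand-written with placeholder AIDs.

```python
from collections import Counter


def assay_rows_to_dicts(data):
    """Pair each row's cells with the column names (assumed layout, see lead-in)."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


def count_outcomes(records, key="Activity Outcome"):
    """Tally records by outcome; the column name is an assumption."""
    return Counter(rec.get(key, "Unknown") for rec in records)


# Hand-written sample in the assumed layout; the AIDs are placeholders.
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "Activity Outcome"]},
        "Row": [
            {"Cell": ["1", "Active"]},
            {"Cell": ["2", "Inactive"]},
            {"Cell": ["3", "Inactive"]},
        ],
    }
}
print(count_outcomes(assay_rows_to_dicts(sample)))
```

With a live response, pass the parsed JSON in place of `sample`.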
Step 6: Batch Property Comparison
Compare properties across multiple compounds.
import pubchempy as pcp
import pandas as pd
import time
compounds = ["aspirin", "ibuprofen", "naproxen", "celecoxib"]
results = []
for name in compounds:
    matches = pcp.get_compounds(name, "name")
    if not matches:  # Skip names PubChem cannot resolve
        continue
    comp = matches[0]
    results.append({
        "Name": name, "CID": comp.cid,
        "MW": comp.molecular_weight, "LogP": comp.xlogp,
        "TPSA": comp.tpsa, "HBD": comp.h_bond_donor_count,
        "HBA": comp.h_bond_acceptor_count,
    })
    time.sleep(0.25)  # Respect rate limits
df = pd.DataFrame(results)
print(df.to_string(index=False))
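When CIDs are already known, one request per compound is not the only option: PUG-REST accepts a comma-separated CID list in the URL path, so a batch can be split into a few larger requests. The sketch below only builds the URLs; `chunked`, `batch_property_urls`, and the default chunk size of 100 are hypothetical choices for illustration, not documented limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_property_urls(cids, properties, chunk_size=100):
    """Build one PUG-REST property URL per chunk of CIDs."""
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"
    prop_list = ",".join(properties)
    urls = []
    for chunk in chunked(list(cids), chunk_size):
        cid_list = ",".join(str(cid) for cid in chunk)
        urls.append(f"{base}/{cid_list}/property/{prop_list}/JSON")
    return urls


# 2244 (aspirin) and 2519 (caffeine) are the CIDs used elsewhere in this document.
for url in batch_property_urls([2244, 2519], ["MolecularWeight", "XLogP"]):
    print(url)
```

Fewer, larger requests make it easier to stay within the rate limits than a tight loop of single lookups.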
Step 7: Identifier Format Conversion
Convert between chemical identifier formats.
import pubchempy as pcp
compound = pcp.get_compounds("caffeine", "name")[0]
print(f"CID: {compound.cid}")
print(f"IUPAC: {compound.iupac_name}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"InChI: {compound.inchi}")
print(f"InChIKey: {compound.inchikey}")
print(f"Formula: {compound.molecular_formula}")
# Download structure files (argument order: format, output path, identifier, namespace)
pcp.download("SDF", "caffeine.sdf", "caffeine", "name", overwrite=True)
print("Downloaded caffeine.sdf")
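When identifiers arrive in mixed formats, the namespace has to be chosen before calling `get_compounds`. The heuristic below is a rough sketch: `guess_namespace` is a hypothetical helper, and it deliberately does not try to detect SMILES, because short SMILES strings are ambiguous with names and are better passed with an explicit `"smiles"` namespace. CAS numbers fall through to `"name"`.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_namespace(identifier):
    """Guess a PubChemPy namespace for an identifier (heuristic, see lead-in)."""
    text = str(identifier).strip()
    if text.isdigit():
        return "cid"
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(text):
        return "inchikey"
    return "name"  # names, synonyms, and CAS numbers


for value in ["2244", "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "50-78-2", "caffeine"]:
    print(value, "->", guess_namespace(value))
```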
Key Parameters
| Parameter | Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| namespace | get_compounds | required | "name", "cid", "smiles", "inchi", "formula" | Identifier type for search |
| searchtype | get_compounds | None | "similarity", "substructure" | Type of structure search |
| Threshold | similarity search | 90 | 0-100 | Tanimoto similarity cutoff (%) |
| MaxRecords | structure search | None | 1-10000 | Maximum results returned |
| properties | get_properties | required | See API reference | Which molecular properties to retrieve |
| record_type | download | "2d" | "2d", "3d" | Structure dimensionality |
Common Recipes
Recipe: Drug-Likeness Screening (Lipinski's Rule of Five)
When to use: Quick heuristic check of whether a compound is likely to be orally bioavailable. Passing the rule does not guarantee bioavailability.
import pubchempy as pcp
def check_lipinski(name):
    comp = pcp.get_compounds(name, "name")[0]
    rules = {
        # float() guards against PubChemPy versions that return the weight as a string
        "MW ≤ 500": float(comp.molecular_weight) <= 500,
        "LogP ≤ 5": (comp.xlogp or 0) <= 5,  # A missing XLogP is treated as 0
        "HBD ≤ 5": comp.h_bond_donor_count <= 5,
        "HBA ≤ 10": comp.h_bond_acceptor_count <= 10,
    }
    violations = sum(1 for v in rules.values() if not v)
    return rules, violations
rules, v = check_lipinski("metformin")
print(f"Violations: {v}/4 - {'PASS' if v <= 1 else 'FAIL'}")
for rule, passed in rules.items():
    print(f" {'✓' if passed else '✗'} {rule}")
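The recipe above counts a missing XLogP as 0, which silently passes that rule. An alternative, sketched below, works on a plain property dictionary (for example one row from `get_properties`) and reports missing values separately. `lipinski_violations` is a hypothetical helper, and the sample values are illustrative rather than fetched from PubChem.

```python
LIPINSKI_LIMITS = {
    "MolecularWeight": 500,
    "XLogP": 5,
    "HBondDonorCount": 5,
    "HBondAcceptorCount": 10,
}


def lipinski_violations(props):
    """Return (violated_keys, missing_keys) for one property dictionary."""
    violations, missing = [], []
    for key, limit in LIPINSKI_LIMITS.items():
        value = props.get(key)
        if value is None:
            missing.append(key)          # unknown, neither pass nor fail
        elif float(value) > limit:       # float() also accepts string values
            violations.append(key)
    return violations, missing


# Illustrative values only; XLogP is deliberately left out.
sample = {"MolecularWeight": "612.7", "HBondDonorCount": 2, "HBondAcceptorCount": 9}
print(lipinski_violations(sample))
```

Keeping "missing" distinct from "pass" lets a screening pipeline decide whether unknown properties should block a compound or merely flag it.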
Recipe: Get All Synonyms for a Compound
When to use: Finding alternative names, trade names, or CAS numbers.
import pubchempy as pcp
synonyms = pcp.get_synonyms("aspirin", "name")
if synonyms:
    names = synonyms[0]["Synonym"]
    print(f"Found {len(names)} synonyms for aspirin:")
    for name in names[:10]:
        print(f" {name}")
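Synonym lists mix trade names, systematic names, and registry numbers, so pulling out CAS numbers needs a filter. The sketch below matches the CAS pattern and verifies the check digit (the weighted sum of the preceding digits, modulo 10). `is_cas_number` and `extract_cas` are hypothetical helpers, and the input list is a short hand-written sample.

```python
import re

CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas_number(text):
    """Return True if `text` has CAS format and a valid check digit."""
    match = CAS_RE.match(text.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def extract_cas(synonyms):
    """Keep only the synonyms that look like valid CAS numbers."""
    return [s for s in synonyms if is_cas_number(s)]


print(extract_cas(["aspirin", "50-78-2", "Acetylsalicylic acid"]))  # ['50-78-2']
```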
Recipe: Download 2D Structure Image
When to use: Generating structure images for reports or presentations.
import requests
cid = 2519 # Caffeine
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Avoid writing an error page to disk
with open("caffeine_structure.png", "wb") as f:
    f.write(response.content)
print("Saved caffeine_structure.png")
Expected Outputs
- Compound search: pubchempy.Compound objects with properties (CID, name, SMILES, MW, etc.)
- Property retrieval: List of dictionaries with requested properties
- Similarity search: List of Compound objects sorted by similarity
- Bioactivity query: JSON with assay results (activity outcome, assay ID, target)
- Structure download: SDF, JSON, or PNG files
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| IndexError: list index out of range | No compounds found for query | Check spelling; try alternative names or CID |
| Request timeout (>30s) | Large similarity/substructure search | Reduce MaxRecords; PubChemPy handles async polling automatically |
| Empty property values (None) | Property not available for this compound | Check if property exists before use: if comp.xlogp is not None |
| HTTP 503 Service Unavailable | Rate limit exceeded | Add time.sleep(0.25) between requests; max 5 req/sec |
| BadRequestError | Invalid SMILES or identifier | Validate SMILES syntax; use canonical SMILES from RDKit |
| Formula search returns too many hits | Common formula shared by many isomers | Use SMILES or InChI for more specific searches |
| Bioactivity API returns empty | Compound has no bioassay data | Not all compounds have been tested; check PubChem web interface |
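Transient failures such as HTTP 503 are often handled by retrying with an increasing delay instead of failing on the first error. The helper below is a generic sketch, not a PubChemPy feature: `call_with_retry` and its parameters are hypothetical, and in practice `retry_on` should be narrowed to the specific exception classes that indicate a temporary problem.

```python
import time


def call_with_retry(func, retries=3, base_delay=1.0, sleep=time.sleep,
                    retry_on=(Exception,)):
    """Call `func()`; on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


# Example wiring (not executed here):
# compounds = call_with_retry(lambda: pcp.get_compounds("aspirin", "name"))
```

With the defaults, a call that keeps failing waits 1 s, 2 s, and 4 s before the final attempt raises.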
References
- PubChem PUG-REST API — official REST API docs
- PubChemPy documentation — Python wrapper docs
- PubChem PUG-REST tutorial — step-by-step guide
- PubChem database — web interface