name: chembl-database-bioactivity
description: Query ChEMBL via Python SDK. Search compounds by structure/properties, retrieve bioactivity (IC50, Ki, EC50), find target inhibitors, run SAR, access drug mechanism/indication data.
license: CC-BY-SA-3.0
ChEMBL Database — Bioactivity Queries
Overview
Query the ChEMBL bioactive molecule database (2M+ compounds, 19M+ bioactivity measurements, 13K+ targets) using the chembl_webresource_client Python SDK. Covers compound search, target lookup, bioactivity retrieval, structure-based search, and drug information access.
When to Use
- Finding compounds by name, ChEMBL ID, or physicochemical properties
- Querying bioactivity data (IC50, Ki, EC50) for specific targets
- Performing similarity or substructure searches using SMILES
- Retrieving drug mechanisms of action and clinical indications
- Identifying inhibitors, agonists, or bioactive molecules for a target
- Analyzing structure-activity relationships (SAR) across compound series
- Filtering molecules by Lipinski rule-of-5 or other drug-likeness criteria
- For general cheminformatics (SMILES manipulation, fingerprints, descriptors) use rdkit-cheminformatics instead
Prerequisites
uv pip install chembl_webresource_client
# Optional: pandas for tabular analysis
uv pip install pandas
Rate limiting: The SDK handles rate limiting internally via automatic retries and caching. No time.sleep() needed between queries. For large-scale data retrieval (100K+ records), consider ChEMBL bulk downloads instead of API queries.
Quick Start
from chembl_webresource_client.new_client import new_client
# Each entity type has its own client endpoint
molecule = new_client.molecule
target = new_client.target
activity = new_client.activity
# Retrieve a molecule by ChEMBL ID
aspirin = molecule.get('CHEMBL25')
print(f"{aspirin['pref_name']}: MW={aspirin['molecule_properties']['mw_freebase']}")
# Search targets by name
egfr_targets = target.filter(pref_name__icontains='EGFR', target_type='SINGLE PROTEIN')
print(f"Found {len(list(egfr_targets))} EGFR-related targets")
# Query bioactivities with filters
potent = activity.filter(
    target_chembl_id='CHEMBL203',  # EGFR
    standard_type='IC50',
    standard_value__lte=100,       # <= 100 nM
    standard_units='nM'
)
Key Concepts
Filter Operators
ChEMBL uses Django-style query filters on all endpoints:
| Operator | Meaning | Example |
|---|---|---|
| `__exact` | Exact match (default) | `target_type__exact='SINGLE PROTEIN'` |
| `__iexact` | Case-insensitive exact | `pref_name__iexact='aspirin'` |
| `__contains` | Substring match | `pref_name__contains='kinase'` |
| `__icontains` | Case-insensitive substring | `pref_name__icontains='egfr'` |
| `__startswith` | Prefix match | `pref_name__startswith='Epi'` |
| `__endswith` | Suffix match | `pref_name__endswith='nib'` |
| `__gt` / `__gte` | Greater than (or equal) | `standard_value__gte=10` |
| `__lt` / `__lte` | Less than (or equal) | `standard_value__lte=100` |
| `__range` | Value in range | `mw_freebase__range=[300, 500]` |
| `__in` | Value in list | `target_chembl_id__in=['CHEMBL203', 'CHEMBL240']` |
| `__isnull` | Null check | `pchembl_value__isnull=False` |
| `__regex` | Regular expression | `pref_name__regex='^EGF.*kinase$'` |
| `__search` | Full-text search | `description__search='apoptosis'` |
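Because the operators are plain keyword-argument suffixes, filters can be assembled as a dictionary and unpacked into any endpoint's `filter()` call. The following is a minimal sketch; `build_filters` is a hypothetical helper written for this illustration, not part of the SDK.

```python
def build_filters(conditions):
    """Turn (field, operator, value) triples into Django-style keyword arguments.

    Hypothetical helper for illustration only; it is not provided by the SDK.
    """
    kwargs = {}
    for field, operator, value in conditions:
        # An empty operator means a plain exact match: field=value
        key = f"{field}__{operator}" if operator else field
        kwargs[key] = value
    return kwargs


filters = build_filters([
    ("target_chembl_id", "", "CHEMBL203"),
    ("standard_type", "in", ["IC50", "Ki"]),
    ("standard_value", "lte", 100),
])
print(filters)
# The dictionary can then be unpacked into a query, for example:
#   new_client.activity.filter(**filters)
```

Building the dictionary first keeps the query conditions inspectable and reusable before any request is sent.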
Core Endpoints
| Endpoint | Access | Description |
|---|---|---|
| `molecule` | `new_client.molecule` | Compound structures, properties, synonyms |
| `target` | `new_client.target` | Protein and non-protein biological targets |
| `activity` | `new_client.activity` | Bioassay measurement results |
| `assay` | `new_client.assay` | Experimental assay details |
| `drug` | `new_client.drug` | Approved pharmaceutical information |
| `mechanism` | `new_client.mechanism` | Drug mechanism of action data |
| `drug_indication` | `new_client.drug_indication` | Drug therapeutic indications |
| `similarity` | `new_client.similarity` | Tanimoto similarity search |
| `substructure` | `new_client.substructure` | Substructure search |
| `image` | `new_client.image` | SVG molecular structure images |
| `molecule_form` | `new_client.molecule_form` | Parent/salt forms |
| `protein_class` | `new_client.protein_class` | Protein classification hierarchy |
| `target_component` | `new_client.target_component` | Target component details |
| `cell_line` | `new_client.cell_line` | Cell line information |
| `tissue` | `new_client.tissue` | Tissue type information |
| `compound_structural_alert` | `new_client.compound_structural_alert` | Structural alerts for toxicity |
| `document` | `new_client.document` | Literature source references |
Molecular Properties
Properties accessible via molecule['molecule_properties']:
| Field | Description |
|---|---|
| `mw_freebase` | Molecular weight (free base) |
| `full_mwt` | Full molecular weight (including salts) |
| `alogp` | Calculated LogP |
| `hba` | Hydrogen bond acceptors |
| `hbd` | Hydrogen bond donors |
| `psa` | Polar surface area |
| `rtb` | Rotatable bonds |
| `num_ro5_violations` | Lipinski rule-of-5 violations |
| `ro3_pass` | Rule of 3 compliance |
| `cx_most_apka` | Most acidic pKa |
| `cx_most_bpka` | Most basic pKa |
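As an illustration of working with these fields locally, the sketch below counts Lipinski rule-of-5 violations from a `molecule_properties` dictionary. The helper name `ro5_violations` and the sample values are made up for this example. It assumes values may arrive as strings or be missing, so it converts with `float()` and skips absent fields.

```python
def ro5_violations(props):
    """Count Lipinski rule-of-5 violations from a molecule_properties dict.

    Hypothetical helper; missing values are skipped rather than counted.
    """
    props = props or {}  # tolerate a record whose molecule_properties is None
    limits = {"mw_freebase": 500, "alogp": 5, "hba": 10, "hbd": 5}
    violations = 0
    for field, limit in limits.items():
        raw = props.get(field)
        if raw is None:
            continue
        if float(raw) > limit:
            violations += 1
    return violations


# Illustrative values only, not fetched from ChEMBL
small_molecule = {"mw_freebase": "180.16", "alogp": "1.31", "hba": 3, "hbd": 1}
large_molecule = {"mw_freebase": "650.2", "alogp": "6.1", "hba": 8, "hbd": 2}

print(ro5_violations(small_molecule))
print(ro5_violations(large_molecule))
```

When the `num_ro5_violations` field is populated, it is the simpler choice. A local count like this is mainly useful for custom thresholds.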
Target Information Fields
Key fields in target records:
| Field | Description |
|---|---|
| `target_chembl_id` | ChEMBL target identifier |
| `pref_name` | Preferred target name |
| `target_type` | Type: SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM |
| `organism` | Target organism (e.g., Homo sapiens) |
| `tax_id` | NCBI taxonomy ID |
| `target_components` | Component details (UniProt accession, etc.) |
Bioactivity Data Fields
Key fields in activity records:
| Field | Description |
|---|---|
| `standard_type` | Activity type: IC50, Ki, Kd, EC50, etc. |
| `standard_value` | Numerical activity value |
| `standard_units` | Units: nM, uM, etc. |
| `pchembl_value` | Normalized activity (-log10 scale, comparable across types) |
| `activity_comment` | Activity annotations |
| `data_validity_comment` | Data quality flags (check before analysis) |
| `potential_duplicate` | Duplicate flag |
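To make the -log10 scale concrete, the sketch below converts a value in nM to the corresponding negative log10 of the molar concentration. ChEMBL computes `pchembl_value` itself and leaves it null for some records, so this only shows the arithmetic. `pchembl_from_nm` is a hypothetical helper name.

```python
import math


def pchembl_from_nm(value_nm):
    """Return -log10 of a concentration given in nM, expressed in molar units.

    Since 1 nM = 1e-9 M, -log10(value_nm * 1e-9) equals 9 - log10(value_nm).
    Returns None for missing or non-positive input.
    """
    if value_nm is None:
        return None
    value = float(value_nm)
    if value <= 0:
        return None
    return 9.0 - math.log10(value)


for nm in (1, 100, 1000):
    print(nm, "nM ->", round(pchembl_from_nm(nm), 2))
```

On this scale, a tenfold gain in potency adds one unit, which is why values from different activity types can be compared side by side.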
Core API
1. Molecule Queries
molecule = new_client.molecule
# By ChEMBL ID
aspirin = molecule.get('CHEMBL25')
# By name (case-insensitive)
results = molecule.filter(pref_name__icontains='imatinib')
# By properties (Lipinski rule-of-5 compliant)
drug_like = molecule.filter(
    molecule_properties__mw_freebase__lte=500,
    molecule_properties__alogp__lte=5,
    molecule_properties__hba__lte=10,
    molecule_properties__hbd__lte=5
)
# By property range
mid_weight = molecule.filter(
    molecule_properties__mw_freebase__range=[300, 500]
)
2. Target Queries
target = new_client.target
# By ChEMBL ID
egfr = target.get('CHEMBL203')
print(f"{egfr['pref_name']} ({egfr['organism']})")
# Search by name and type
kinases = target.filter(
    target_type='SINGLE PROTEIN',
    pref_name__icontains='kinase'
)
# By organism
human_targets = target.filter(organism='Homo sapiens')
3. Bioactivity Data
activity = new_client.activity
# Potent inhibitors for a target
potent = activity.filter(
    target_chembl_id='CHEMBL203',
    standard_type='IC50',
    standard_value__lte=100,
    standard_units='nM'
)
# All activities for a compound (with pChEMBL values)
compound_acts = activity.filter(
    molecule_chembl_id='CHEMBL25',
    pchembl_value__isnull=False
)
# Multiple activity types
ki_data = activity.filter(
    target_chembl_id='CHEMBL240',
    standard_type__in=['IC50', 'Ki', 'Kd']
)
4. Structure-Based Search
# Similarity search (Tanimoto)
similarity = new_client.similarity
similar = similarity.filter(
    smiles='CC(=O)Oc1ccccc1C(=O)O',  # aspirin
    similarity=85                     # >=85% similarity
)
# Substructure search
substructure = new_client.substructure
benzimidazoles = substructure.filter(smiles='c1ccc2[nH]cnc2c1')
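For intuition about the `similarity` threshold, the sketch below computes a Tanimoto coefficient over two toy fingerprints represented as sets of "on" bit positions. The fingerprints are invented for this example. ChEMBL computes similarity server-side with its own fingerprints, so this only illustrates the coefficient, not the service's implementation.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Toy fingerprints (made up): 4 shared bits out of 6 distinct bits
fp_query = {1, 4, 7, 9, 12}
fp_hit = {1, 4, 7, 9, 15}

score = tanimoto(fp_query, fp_hit)
print(f"Tanimoto: {score * 100:.0f}%")
print("Passes an 85% threshold:", score * 100 >= 85)
```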
5. Drug and Mechanism Data
drug = new_client.drug
mechanism = new_client.mechanism
drug_indication = new_client.drug_indication
# Drug details
drug_info = drug.get('CHEMBL25')
# Mechanisms of action
mechs = mechanism.filter(molecule_chembl_id='CHEMBL941')
for m in mechs:
    print(f"{m['mechanism_of_action']} → {m.get('target_chembl_id')}")
# Therapeutic indications
indications = drug_indication.filter(molecule_chembl_id='CHEMBL941')
for ind in indications:
    print(f"{ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})")
# SVG molecular image (request SVG explicitly; the endpoint may otherwise return PNG data)
image = new_client.image
image.set_format('svg')
svg_data = image.get('CHEMBL25')
with open('aspirin.svg', 'w') as f:
    f.write(svg_data)
Common Workflows
Workflow 1: Find Inhibitors for a Target
from chembl_webresource_client.new_client import new_client
import pandas as pd
# Step 1: Identify the target
targets = new_client.target.filter(pref_name__icontains='BRAF', target_type='SINGLE PROTEIN')
target_id = list(targets)[0]['target_chembl_id']
# Step 2: Query potent activities
activities = new_client.activity.filter(
    target_chembl_id=target_id,
    standard_type='IC50',
    standard_value__lte=100,
    standard_units='nM',
    pchembl_value__isnull=False
)
# Step 3: Convert to DataFrame for analysis
df = pd.DataFrame(list(activities))
df['standard_value'] = pd.to_numeric(df['standard_value'])
print(f"Found {len(df)} potent compounds")
print(df[['molecule_chembl_id', 'standard_value', 'pchembl_value']].head(10))
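A target query like the one above often returns several measurements per compound. The sketch below keeps the highest `pchembl_value` per molecule and ranks the results, using only the standard library. `best_per_molecule` is a hypothetical helper, and the sample records use placeholder IDs rather than real ChEMBL entries.

```python
def best_per_molecule(records):
    """Keep the highest pchembl_value per molecule, sorted most potent first."""
    best = {}
    for rec in records:
        raw = rec.get("pchembl_value")
        if raw is None:
            continue  # skip records without a normalized value
        value = float(raw)  # convert defensively; values may arrive as strings
        mol_id = rec["molecule_chembl_id"]
        if mol_id not in best or value > best[mol_id]:
            best[mol_id] = value
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Placeholder records for illustration (not real ChEMBL data)
sample = [
    {"molecule_chembl_id": "MOL_A", "pchembl_value": "7.2"},
    {"molecule_chembl_id": "MOL_A", "pchembl_value": "8.1"},
    {"molecule_chembl_id": "MOL_B", "pchembl_value": "7.9"},
    {"molecule_chembl_id": "MOL_C", "pchembl_value": None},
]

ranked = best_per_molecule(sample)
for mol_id, value in ranked:
    print(mol_id, value)
```

Taking the maximum is one possible choice. A median per compound is more robust when assays disagree.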
Workflow 2: Analyze a Known Drug
from chembl_webresource_client.new_client import new_client
chembl_id = 'CHEMBL941' # Imatinib
# Drug information
drug_info = new_client.molecule.get(chembl_id)
print(f"Name: {drug_info['pref_name']}")
print(f"MW: {drug_info['molecule_properties']['mw_freebase']}")
# Mechanisms of action
mechs = list(new_client.mechanism.filter(molecule_chembl_id=chembl_id))
for m in mechs:
    print(f"Mechanism: {m['mechanism_of_action']}")
# Indications
indications = list(new_client.drug_indication.filter(molecule_chembl_id=chembl_id))
for ind in indications:
    print(f"Indication: {ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})")
# All bioactivity data
activities = list(new_client.activity.filter(
    molecule_chembl_id=chembl_id, pchembl_value__isnull=False
))
print(f"Total bioactivity records: {len(activities)}")
Workflow 3: SAR Study
from chembl_webresource_client.new_client import new_client
import pandas as pd
# Step 1: Find similar compounds to lead
similar = new_client.similarity.filter(
    smiles='c1ccc2c(c1)cc(nc2N)c3ccc(cc3)NC(=O)c4ccccc4',  # lead compound
    similarity=80
)
analogs = list(similar)
# Step 2: Collect activities for each analog
records = []
for compound in analogs[:20]:  # limit for demo
    cid = compound['molecule_chembl_id']
    # molecule_properties can be None, so fall back to an empty dict
    props = compound.get('molecule_properties') or {}
    acts = list(new_client.activity.filter(
        molecule_chembl_id=cid,
        standard_type='IC50',
        standard_units='nM',  # keep the IC50_nM column in consistent units
        pchembl_value__isnull=False
    ))
    for act in acts:
        records.append({
            'chembl_id': cid,
            'target': act.get('target_pref_name'),
            'IC50_nM': act.get('standard_value'),
            'pchembl': act.get('pchembl_value'),
            'mw': props.get('mw_freebase'),
            'alogp': props.get('alogp')
        })
# Step 3: Analyze property-activity relationships
df = pd.DataFrame(records)
if not df.empty:
    df['IC50_nM'] = pd.to_numeric(df['IC50_nM'])
    print(df.groupby('target')['IC50_nM'].describe())
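One simple way to look at a property-activity relationship in records like these is a Pearson correlation between a property column and `pchembl`. The sketch below uses only the standard library. The `pearson` helper and the numbers are invented for illustration and do not come from ChEMBL.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns None if either series is constant."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


# Made-up analog series: calculated LogP versus pChEMBL
alogp = [2.1, 2.8, 3.5, 4.0, 4.6]
pchembl = [6.0, 6.4, 7.1, 7.3, 7.9]

r = pearson(alogp, pchembl)
print(f"Pearson r = {r:.2f}")
```

A strong correlation within a small analog series is a prompt for closer inspection, not evidence of causation. Lipophilicity often tracks potency for nonspecific reasons.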
Common Recipes
Recipe: Virtual Screening Filter (Lipinski + Activity)
from chembl_webresource_client.new_client import new_client
candidates = new_client.molecule.filter(
    molecule_properties__mw_freebase__range=[300, 500],
    molecule_properties__alogp__lte=5,
    molecule_properties__hba__lte=10,
    molecule_properties__hbd__lte=5,
    molecule_properties__num_ro5_violations=0
)
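# len() on the query should use the total count reported by the API,
# which avoids downloading every matching record just to count them
print(f"Drug-like candidates: {len(candidates)}")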
Recipe: Client Configuration
from chembl_webresource_client.settings import Settings
Settings.Instance().CACHING = True # enable/disable cache
Settings.Instance().CACHE_EXPIRE = 86400 # cache duration (seconds)
Settings.Instance().TIMEOUT = 30 # request timeout (seconds)
Settings.Instance().TOTAL_RETRIES = 3 # retry count on failure
Recipe: Export Activities to CSV
import pandas as pd
from chembl_webresource_client.new_client import new_client
activities = new_client.activity.filter(
    target_chembl_id='CHEMBL203',
    standard_type='IC50',
    pchembl_value__isnull=False
)
df = pd.DataFrame(list(activities))
df.to_csv('egfr_activities.csv', index=False)
print(f"Exported {len(df)} records")
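Activity records carry many fields, so a narrower export is often easier to work with. The sketch below writes only selected columns using the standard-library `csv` module. `records_to_csv`, the column selection, and the placeholder record are assumptions made for this example.

```python
import csv
import io

# Column selection is an assumption; pick whichever activity fields you need
COLUMNS = [
    "molecule_chembl_id",
    "standard_type",
    "standard_value",
    "standard_units",
    "pchembl_value",
]


def records_to_csv(records, columns=COLUMNS):
    """Serialize selected fields of activity-like dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for rec in records:
        # Missing fields become empty cells instead of raising KeyError
        writer.writerow({col: rec.get(col, "") for col in columns})
    return buffer.getvalue()


# Placeholder record for illustration (not real ChEMBL data)
demo = [{
    "molecule_chembl_id": "MOL_A",
    "standard_type": "IC50",
    "standard_value": "12.5",
    "standard_units": "nM",
    "pchembl_value": "7.9",
    "assay_description": "ignored extra field",
}]

csv_text = records_to_csv(demo)
print(csv_text)
```

To write a file instead, pass the text to `open(path, 'w', newline='')` and call `write()`.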
Key Parameters
| Parameter | Endpoint | Default | Description |
|---|---|---|---|
| `similarity` | `similarity.filter()` | none | Tanimoto threshold (0-100), typically 70-90 |
| `standard_type` | `activity.filter()` | none | Activity type: IC50, Ki, Kd, EC50 |
| `standard_value__lte` | `activity.filter()` | none | Max activity value (nM) |
| `pchembl_value` | `activity.filter()` | none | Normalized -log10 activity (>6 = potent) |
| `target_type` | `target.filter()` | none | SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM |
| `CACHING` | `Settings` | `True` | Enable HTTP response caching |
| `CACHE_EXPIRE` | `Settings` | `86400` | Cache TTL in seconds |
| `TIMEOUT` | `Settings` | `30` | HTTP request timeout in seconds |
| `TOTAL_RETRIES` | `Settings` | `3` | Auto-retry count on failure |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty results from filter | No matches or too strict filters | Relax filters; verify IDs exist with .get() first |
| `KeyError` or `AttributeError` on molecule properties | Not all molecules have full property data; `molecule_properties` can be `None` | Use `(mol.get('molecule_properties') or {}).get('field')` |
| Query returns unexpectedly few results | Lazy evaluation not consumed | Convert to list() before checking length |
| Slow queries | Large result sets paginated automatically | Add more filters to narrow results; use __range |
| `404` on `.get()` | Invalid ChEMBL ID | Verify ID format (e.g., `CHEMBL25`, not `25`) |
| Stale data | Aggressive caching | Set Settings.Instance().CACHING = False or clear cache |
| Timeout errors | Server overload or large query | Increase TIMEOUT; split into smaller queries |
| Mixed units in activity data | Different assays use different units | Filter by standard_units='nM' or use pchembl_value |
| Duplicate activity records | Same measurement from different sources | Check potential_duplicate and data_validity_comment |
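The last two rows can be combined into a small pre-analysis filter. The sketch below drops records that carry a validity comment, are flagged as potential duplicates, or lack a `pchembl_value`. `is_clean` is a hypothetical helper. How the duplicate flag is encoded (integer, string, or boolean) is an assumption, so several encodings are accepted.

```python
def is_clean(record):
    """Return True if an activity record passes basic quality checks."""
    if record.get("data_validity_comment"):
        return False  # any non-empty comment marks a data quality concern
    # The encoding of the flag is assumed; accept 1, "1", or True as "duplicate"
    if record.get("potential_duplicate") in (1, "1", True):
        return False
    return record.get("pchembl_value") is not None


# Placeholder records for illustration (not real ChEMBL data)
raw_records = [
    {"id": 1, "pchembl_value": "7.5", "data_validity_comment": None, "potential_duplicate": 0},
    {"id": 2, "pchembl_value": "6.9", "data_validity_comment": "Outside typical range", "potential_duplicate": 0},
    {"id": 3, "pchembl_value": "8.0", "data_validity_comment": None, "potential_duplicate": 1},
    {"id": 4, "pchembl_value": None, "data_validity_comment": None, "potential_duplicate": 0},
]

clean = [rec for rec in raw_records if is_clean(rec)]
print([rec["id"] for rec in clean])
```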
Best Practices
- Use `pchembl_value` for cross-study comparisons: it normalizes IC50/Ki/EC50 to a comparable -log10 scale
- Always check `data_validity_comment` before using activity values, since it flags data quality issues
- Filter by `standard_units` to ensure consistent units across results
- Pagination is automatic: the SDK handles pagination transparently, so iterate directly over query results without manual page handling
- Use lazy evaluation: queries execute only when iterated, so convert to `list()` only when you need all results in memory
- Cache results: the SDK caches for 24h by default, so repeated queries are cheap
- For bulk data (>100K records): use ChEMBL FTP downloads rather than API queries
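The lazy-evaluation advice can be sketched with `itertools.islice`, which pulls only the first few items from any iterable. The generator below is a stand-in for a query result so the example runs without network access. That real query objects behave the same way under `islice` follows from their being iterable, and is an assumption not verified here.

```python
from itertools import islice


def take(results, n):
    """Return at most n items from an iterable without consuming the rest."""
    return list(islice(results, n))


fetched = []  # records which items were actually produced


def fake_results():
    """Stand-in for a lazy query result; yields one record at a time."""
    for i in range(1000):
        fetched.append(i)
        yield {"molecule_chembl_id": f"DEMO{i}"}


first_five = take(fake_results(), 5)
print(len(first_five), "records taken;", len(fetched), "records produced")
```

With a real query this would look like `take(new_client.activity.filter(...), 5)`, which avoids materializing the full result set just to preview a few rows.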
Related Skills
- `rdkit-cheminformatics`: SMILES manipulation, molecular descriptors, fingerprints
- `datamol-cheminformatics`: Molecular preprocessing and featurization
- `pubchem-compound-search`: Alternative compound database (NIH)
References
- ChEMBL website: https://www.ebi.ac.uk/chembl/
- API documentation: https://www.ebi.ac.uk/chembl/api/data/docs
- Python client: https://github.com/chembl/chembl_webresource_client
- Interface docs: https://chembl.gitbook.io/chembl-interface-documentation/
- Example notebooks: https://github.com/chembl/notebooks
Bundled Resources
Self-contained entry (no references/ directory). Original total: 662 lines (SKILL.md 389 + api_reference.md 273). Scripts: 279 lines (example_queries.py).
Original file disposition:
- `SKILL.md` (389 lines) → Core API, Workflows, Quick Start. "Common Use Cases" consolidated (rule 7b): Find Kinase Inhibitors → Workflow 1 pattern, Virtual Screening → Recipe, Drug Repurposing → omitted (trivial loop over drug endpoint, not a distinct analytical workflow). "Important Notes" section routed to Best Practices and Troubleshooting (rule 9)
- `references/api_reference.md` (273 lines) → Consolidated inline. Filter Operators → Key Concepts table. Core Endpoints listing → Key Concepts table (all endpoints). Molecular Properties → Key Concepts table. Bioactivity Data Fields → Key Concepts table. Target Information Fields → Key Concepts table. Configuration/Settings → Common Recipes. Error handling/rate limiting → Troubleshooting + Best Practices. Response formats (JSON/XML/YAML) → omitted (JSON is default and only format used via Python SDK). Advanced query examples already covered in Core API
- `scripts/example_queries.py` (279 lines) → Thin-wrapper functions absorbed into Core API modules: `get_molecule_info` / `search_molecules_by_name` / `find_molecules_by_properties` → Module 1 (Molecule Queries); `get_target_info` / `search_targets_by_name` → Module 2 (Target Queries); `get_bioactivity_data` / `get_compound_bioactivities` → Module 3 (Bioactivity Data); `find_similar_compounds` / `substructure_search` → Module 4 (Structure-Based Search); `get_drug_info` → Module 5 (Drug and Mechanism Data); `find_kinase_inhibitors` → Workflow 1; `export_to_dataframe` → Workflow 1 + Recipe (Export)
Retention: ~460 lines / 662 original = ~69%. Vendor metadata stripped (rule 13). Agent-behavior section stripped (rule 4).