name: datamol-cheminformatics description: >- Pythonic RDKit wrapper with sensible defaults for drug discovery. SMILES parsing, standardization, descriptors, fingerprints, similarity, clustering, diversity selection, scaffold analysis, BRICS/RECAP fragmentation, 3D conformers, and visualization. Returns native rdkit.Chem.Mol. Prefer datamol for standard workflows; use RDKit directly for advanced control. license: Apache-2.0
Datamol Cheminformatics Toolkit
Overview
Datamol provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies common drug discovery operations — SMILES parsing, standardization, descriptors, fingerprints, clustering, scaffolds, conformers, and visualization — with sensible defaults, built-in parallelization, and cloud storage support via fsspec. All molecular objects are native rdkit.Chem.Mol instances, ensuring full RDKit compatibility.
When to Use
- Parsing, validating, and standardizing molecular structures from SMILES, SDF, or other formats
- Computing molecular descriptors and fingerprints for ML featurization
- Similarity searching and diversity selection from compound libraries
- Clustering compounds by structural similarity (Butina clustering)
- Scaffold analysis and scaffold-based train/test splitting for ML
- BRICS/RECAP molecular fragmentation for fragment-based design
- 3D conformer generation and analysis
- Visualizing molecules as grids with alignment and highlighting
- Batch processing molecular datasets with parallelization
- For quick gene lookups use gget instead; for advanced substructure queries or custom fingerprints, use RDKit directly
Prerequisites
uv pip install datamol
import datamol as dm
import numpy as np
import pandas as pd
Quick Start
import datamol as dm
# Parse and standardize
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
mol = dm.standardize_mol(mol)
print(dm.to_smiles(mol)) # Canonical SMILES
# Compute descriptors
desc = dm.descriptors.compute_many_descriptors(mol)
print(f"MW: {desc['mw']:.1f}, LogP: {desc['logp']:.2f}, TPSA: {desc['tpsa']:.1f}")
# Generate fingerprint
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
print(f"Fingerprint shape: {fp.shape}") # (2048,)
Core API
1. Molecular I/O & Standardization
Parsing molecules:
import datamol as dm
# From SMILES (returns None on failure)
mol = dm.to_mol("CCO")
if mol is None:
print("Invalid SMILES")
# Format conversions
smiles = dm.to_smiles(mol, isomeric=True) # Canonical SMILES
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
Standardization (always recommended for external data):
mol = dm.standardize_mol(
mol,
disconnect_metals=True,
normalize=True,
reionize=True
)
clean_smiles = dm.standardize_smiles("C(C)O") # From SMILES directly
File I/O:
# Reading (supports local, S3, GCS, HTTP via fsspec)
df = dm.read_sdf("compounds.sdf", mol_column='mol')
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
df = dm.open_df("file.sdf") # Auto-detect format
# Writing
dm.to_sdf(df, "output.sdf", mol_column="mol")
dm.to_smi(mols, "output.smi")
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"]) # Renders molecule images
# Remote files
df = dm.read_sdf("s3://bucket/compounds.sdf")
dm.to_sdf(mols, "s3://bucket/output.sdf")
2. Descriptors & Properties
import datamol as dm
mol = dm.to_mol("c1ccc(cc1)CCN")
# Standard descriptor set (single molecule)
desc = dm.descriptors.compute_many_descriptors(mol)
# Returns dict: {'mw': 121.18, 'logp': 1.41, 'hbd': 1, 'hba': 1,
# 'tpsa': 26.02, 'n_aromatic_atoms': 6, ...}
# Batch computation (parallel)
mols = [dm.to_mol(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
desc_df = dm.descriptors.batch_compute_many_descriptors(
mols, n_jobs=-1, progress=True
)
print(desc_df.head())
# Specific descriptors
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
n_rigid = dm.descriptors.n_rigid_bonds(mol)
Drug-likeness filtering (Lipinski Rule of Five):
def is_druglike(mol):
desc = dm.descriptors.compute_many_descriptors(mol)
return (desc['mw'] <= 500 and desc['logp'] <= 5
and desc['hbd'] <= 5 and desc['hba'] <= 10)
druglike = [m for m in mols if is_druglike(m)]
print(f"Drug-like: {len(druglike)}/{len(mols)}")
3. Fingerprints & Similarity
import datamol as dm
mol = dm.to_mol("c1ccc(cc1)CCN")
# Fingerprint types
fp_ecfp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048) # Morgan/ECFP
fp_maccs = dm.to_fp(mol, fp_type='maccs') # MACCS keys (167 bits)
fp_topo = dm.to_fp(mol, fp_type='topological') # Topological
fp_ap = dm.to_fp(mol, fp_type='atompair') # Atom pairs
# Pairwise distances (Tanimoto distance = 1 - similarity)
mols = [dm.to_mol(s) for s in ["CCO", "CCCO", "c1ccccc1"]]
dist_matrix = dm.pdist(mols, n_jobs=-1)
print(f"Distance vector shape: {dist_matrix.shape}")
# Distances between two sets
query = [dm.to_mol("CCO")]
library = [dm.to_mol(s) for s in ["CCCO", "c1ccccc1", "CC(=O)O"]]
distances = dm.cdist(query, library, n_jobs=-1)
print(f"Query-library distances: {distances.shape}")
4. Clustering & Diversity Selection
import datamol as dm
mols = [dm.to_mol(s) for s in smiles_list] # Assume smiles_list defined
# Butina clustering (suitable for ~1000 molecules, builds full distance matrix)
clusters = dm.cluster_mols(mols, cutoff=0.2, n_jobs=-1)
for i, cluster in enumerate(clusters[:5]):
print(f"Cluster {i}: {len(cluster)} molecules")
# Diversity selection (works for larger libraries)
diverse_mols = dm.pick_diverse(mols, npick=100)
print(f"Selected {len(diverse_mols)} diverse molecules")
# Cluster centroids
centroids = dm.pick_centroids(mols, npick=50)
print(f"Selected {len(centroids)} centroids")
5. Scaffolds & Fragments
Murcko scaffold extraction:
import datamol as dm
from collections import Counter
mol = dm.to_mol("c1ccc(cc1)CCN")
scaffold = dm.to_scaffold_murcko(mol)
print(f"Scaffold: {dm.to_smiles(scaffold)}")
# Scaffold frequency analysis
scaffolds = [dm.to_scaffold_murcko(m) for m in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
counts = Counter(scaffold_smiles)
print(f"Top scaffolds: {counts.most_common(5)}")
# Scaffold-based train/test split (for ML)
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
scaffold_to_mols.setdefault(scaf, []).append(mol)
scaffolds_list = list(scaffold_to_mols.keys())
split_idx = int(0.8 * len(scaffolds_list))
train_mols = [m for s in scaffolds_list[:split_idx] for m in scaffold_to_mols[s]]
test_mols = [m for s in scaffolds_list[split_idx:] for m in scaffold_to_mols[s]]
Fragmentation:
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
# BRICS (16 bond types, retrosynthetic)
brics_frags = dm.fragment.brics(mol)
print(f"BRICS fragments: {brics_frags}") # Set of fragment SMILES with [1*] attachment points
# RECAP (11 bond types, combinatorial)
recap_frags = dm.fragment.recap(mol)
# MMPA (matched molecular pair analysis)
mmpa_frags = dm.fragment.mmpa_frag(mol)
6. 3D Conformers
import datamol as dm
mol = dm.to_mol("c1ccc(cc1)CCN")
# Generate conformers
mol_3d = dm.conformers.generate(
mol,
n_confs=50, # Number to generate
rms_cutoff=0.5, # Filter similar (Angstroms)
minimize_energy=True, # UFF minimization
method='ETKDGv3' # Embedding method
)
print(f"Generated {mol_3d.GetNumConformers()} conformers")
# Access coordinates
conf = mol_3d.GetConformer(0)
positions = conf.GetPositions() # Nx3 array
print(f"Atom positions shape: {positions.shape}")
# Cluster conformers by RMSD
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0)
centroids = dm.conformers.return_centroids(mol_3d, clusters)
# Solvent accessible surface area
sasa = dm.conformers.sasa(mol_3d, n_jobs=-1)
print(f"SASA values: {sasa[:3]}")
Key Concepts
Datamol vs RDKit Decision Guide
| Use Datamol when... | Use RDKit directly when... |
|---|---|
| Standard SMILES ↔ Mol conversions | Custom fingerprint definitions |
| Batch processing with parallelization | Low-level atom/bond manipulation |
| Quick descriptor computation | Substructure query optimization |
| File I/O (SDF, CSV, Excel, cloud) | Reaction enumeration (large-scale) |
| Clustering & diversity selection | Custom force field parameters |
| Scaffold analysis | Advanced stereochemistry handling |
Key Data Types
- All molecules are native
rdkit.Chem.Molobjects — fully compatible with RDKit functions - Fingerprints are numpy arrays (dense bit vectors)
- DataFrames use pandas with a
molcolumn containing Mol objects - Distance matrices use Tanimoto distance (0 = identical, 1 = completely different)
Parallelization
Functions supporting n_jobs parameter: dm.read_sdf, dm.descriptors.batch_compute_many_descriptors, dm.cluster_mols, dm.pdist, dm.cdist, dm.conformers.sasa. Use n_jobs=-1 for all cores, progress=True for progress bars.
Common Workflows
1. Drug Discovery Pipeline: Load → Filter → Cluster → Visualize
import datamol as dm
# 1. Load and standardize
df = dm.read_sdf("compounds.sdf")
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()]
print(f"Loaded {len(df)} valid molecules")
# 2. Compute descriptors and filter by drug-likeness
desc_df = dm.descriptors.batch_compute_many_descriptors(
df['mol'].tolist(), n_jobs=-1, progress=True
)
druglike = (desc_df['mw'] <= 500) & (desc_df['logp'] <= 5) & (desc_df['hbd'] <= 5) & (desc_df['hba'] <= 10)
filtered_df = df[druglike.values].reset_index(drop=True)
print(f"Drug-like compounds: {len(filtered_df)}")
# 3. Select diverse subset
diverse = dm.pick_diverse(filtered_df['mol'].tolist(), npick=100)
# 4. Visualize
dm.viz.to_image(diverse[:20], legends=[dm.to_smiles(m) for m in diverse[:20]],
n_cols=5, mol_size=(300, 300), outfile="diverse_hits.png")
2. Virtual Screening: Query → Similarity → Rank
import datamol as dm
import numpy as np
# Query actives and screening library
actives = [dm.to_mol(s) for s in active_smiles] # Known actives
library = [dm.to_mol(s) for s in library_smiles] # Screening library
# Calculate distances (Tanimoto)
distances = dm.cdist(actives, library, n_jobs=-1)
min_distances = distances.min(axis=0) # Best match to any active
similarities = 1 - min_distances
# Rank and select top hits
top_idx = np.argsort(similarities)[::-1][:100]
top_hits = [library[i] for i in top_idx]
top_scores = [similarities[i] for i in top_idx]
print(f"Top hit similarity: {top_scores[0]:.3f}")
# Visualize top hits
dm.viz.to_image(top_hits[:20],
legends=[f"Sim: {s:.3f}" for s in top_scores[:20]],
outfile="screening_hits.png")
3. SAR Analysis: Group by Scaffold → Compare Activities
import datamol as dm
# Group compounds by scaffold
scaffolds = [dm.to_scaffold_murcko(m) for m in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
sar_df = pd.DataFrame({
'mol': mols, 'scaffold': scaffold_smiles, 'activity': activities
})
# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
if len(group) >= 3:
print(f"Scaffold: {scaffold} | N={len(group)} | "
f"Activity: {group['activity'].min():.2f}–{group['activity'].max():.2f}")
dm.viz.to_image(group['mol'].tolist(), align=True,
legends=[f"Act: {a:.2f}" for a in group['activity']])
Key Parameters
| Function | Parameter | Default | Description |
|---|---|---|---|
dm.to_fp | fp_type | 'ecfp' | Fingerprint type: ecfp, maccs, topological, atompair |
dm.to_fp | radius | 2 | Morgan radius (ecfp only); radius=2 ≈ ECFP4 |
dm.to_fp | n_bits | 2048 | Fingerprint length (ecfp, topological) |
dm.cluster_mols | cutoff | 0.2 | Tanimoto distance threshold (0=identical, 1=different) |
dm.pick_diverse | npick | required | Number of diverse molecules to select |
dm.conformers.generate | n_confs | None | Number of conformers (None = auto) |
dm.conformers.generate | rms_cutoff | None | RMSD filter threshold (Angstroms) |
dm.conformers.generate | method | 'ETKDGv3' | Embedding: ETKDGv3, ETKDGv2, ETKDG |
dm.standardize_mol | disconnect_metals | False | Remove metal-ligand bonds |
dm.read_sdf | sanitize | True | Apply molecule sanitization |
dm.read_sdf | remove_hs | True | Remove explicit hydrogens |
dm.viz.to_image | align | False | Align molecules by MCS |
dm.viz.to_image | use_svg | False | Output SVG (True) or PNG (False) |
Best Practices
- Always standardize molecules from external sources — call
dm.standardize_mol()withdisconnect_metals=True, normalize=True, reionize=Truebefore any analysis. Different SMILES representations of the same molecule will produce different fingerprints - Check for None after parsing —
dm.to_mol()returnsNonefor invalid SMILES. Filter these before batch operations to avoid crashes - Use parallel processing for datasets — pass
n_jobs=-1, progress=Trueto batch operations. Sequential processing of 10,000+ molecules is unnecessarily slow - Choose fingerprints by use case — ECFP (Morgan): general structural similarity; MACCS: fast, smaller space; Atom pairs: distance-sensitive. ECFP with radius=2, n_bits=2048 is the most common default
- Mind clustering scale limits — Butina clustering (
dm.cluster_mols) builds a full distance matrix. Use for ≤~1,000 molecules. For larger sets, usedm.pick_diverse()or hierarchical methods - Use scaffold splitting for ML — random splits leak similar structures into train/test. Always use scaffold-based splitting for molecular property prediction models
- Leverage fsspec for cloud data — all I/O functions accept S3, GCS, and HTTP paths directly. Install
s3fsorgcsfsfor cloud support
Common Recipes
Recipe: Batch SMILES Validation and Standardization
When to use: Clean a list of SMILES strings before any downstream analysis.
import datamol as dm
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "invalid_smiles", "CC(N)C(=O)O"]
mols = [dm.to_mol(s) for s in smiles_list]
valid = [(s, m) for s, m in zip(smiles_list, mols) if m is not None]
standardized = [(s, dm.standardize_mol(m)) for s, m in valid]
print(f"Valid: {len(valid)}/{len(smiles_list)}")
for orig, mol in standardized:
print(f" {orig} → {dm.to_smiles(mol)}")
Recipe: Pairwise Similarity Matrix
When to use: Compare a small compound set against each other or a reference library.
import datamol as dm
import numpy as np
smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccc(cc1)C(=O)O", "CC(N)C(=O)O", "c1ccccc1"]
mols = [dm.to_mol(s) for s in smiles]
fps = [dm.to_fp(m) for m in mols]
# Pairwise Tanimoto similarity
n = len(fps)
sim_matrix = np.zeros((n, n))
for i in range(n):
for j in range(n):
sim_matrix[i, j] = dm.similarity.tanimoto(fps[i], fps[j])
print(f"Similarity matrix shape: {sim_matrix.shape}")
print(f"Most similar pair: {np.unravel_index(np.argsort(sim_matrix.ravel())[-3], (n, n))}")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
dm.to_mol() returns None | Invalid or non-canonical SMILES | Try dm.standardize_smiles() first; check for kekulization issues |
| MemoryError during clustering | Full distance matrix for large set | Use dm.pick_diverse() instead of dm.cluster_mols for >1000 molecules |
| Slow conformer generation | Too many conformers or large molecule | Reduce n_confs, increase rms_cutoff, or limit molecule size |
| Remote file access fails | Missing fsspec backend | Install s3fs (AWS), gcsfs (GCP), or adlfs (Azure) |
| Descriptor computation fails | Molecule has no conformer | Standardize first; some 3D descriptors need dm.conformers.generate() |
dm.to_xlsx missing images | openpyxl not installed | uv pip install openpyxl |
| Inconsistent fingerprints | Different SMILES for same molecule | Standardize all molecules before fingerprint computation |
| Scaffold extraction returns full molecule | No ring system in molecule | Murcko scaffolds require at least one ring; acyclic molecules return themselves |
| Reaction product is None | Reactant doesn't match SMARTS pattern | Verify reactant matches reaction template; check atom mapping |
Import error for dm.viz | Missing visualization dependencies | uv pip install Pillow cairosvg |
Related Skills
- rdkit-cheminformatics — full RDKit API for advanced operations not covered by datamol's simplified interface
- pubchem-compound-search — retrieve compound data by name, CID, or structure from PubChem
- scikit-learn-machine-learning — ML model training using datamol-generated features
- matplotlib-scientific-plotting — custom publication-quality molecular property plots
References
- Datamol documentation: https://docs.datamol.io/
- RDKit documentation: https://www.rdkit.org/docs/
- GitHub repository: https://github.com/datamol-io/datamol
- Bemis, G. W. & Murcko, M. A. (1996). The Properties of Known Drugs. J. Med. Chem. 39(15), 2887–2893