name: matchms-spectral-matching
description: MS spectral matching and metabolite ID with matchms. Import spectra (mzML, MGF, MSP, JSON), filter/normalize peaks, score similarity (cosine, modified cosine, fingerprint), build reproducible pipelines, identify unknowns vs spectral libraries. Use pyopenms for full LC-MS/MS proteomics.
license: Apache-2.0
Matchms — Spectral Matching & Metabolite Identification
Overview
Matchms is a Python library for mass spectrometry data processing focused on spectral similarity calculation and compound identification. It provides multi-format I/O, 50+ spectrum filters for metadata harmonization and peak processing, 8 similarity scoring functions, and a pipeline framework for reproducible analytical workflows.
When to Use
- Identifying unknown metabolites by matching MS/MS spectra against reference libraries
- Computing spectral similarity scores (cosine, modified cosine, fingerprint-based)
- Processing and standardizing mass spectral data from multiple formats (mzML, MGF, MSP, JSON)
- Building reproducible spectral processing pipelines for quality control
- Harmonizing metadata across spectral databases (compound names, SMILES, InChI, adducts)
- Large-scale spectral library comparisons and duplicate detection
- For full LC-MS/MS proteomics workflows (feature detection, protein ID), use pyopenms instead
- For chemical structure similarity without mass spectra, use rdkit fingerprint comparison
Prerequisites
uv pip install matchms numpy pandas
# For chemical structure processing (SMILES, InChI, fingerprints):
uv pip install "matchms[chemistry]"
- Python 3.8+; NumPy for peak array operations
- Input: spectral data in MGF, MSP, mzML, mzXML, JSON, or pickle format
- Reference library in any supported format for matching
Quick Start
from matchms.importing import load_from_mgf
from matchms.filtering import default_filters, normalize_intensities
from matchms.filtering import select_by_relative_intensity, require_minimum_number_of_peaks
from matchms import calculate_scores
from matchms.similarity import CosineGreedy
# Load and process query spectra
queries = list(load_from_mgf("queries.mgf"))
queries = [default_filters(s) for s in queries]
queries = [normalize_intensities(s) for s in queries if s is not None]
queries = [require_minimum_number_of_peaks(s, n_required=5) for s in queries if s is not None]
# Load reference library
refs = list(load_from_mgf("library.mgf"))
refs = [default_filters(s) for s in refs]
refs = [normalize_intensities(s) for s in refs if s is not None]
# Calculate similarity scores
scores = calculate_scores(references=refs, queries=queries,
                          similarity_function=CosineGreedy(tolerance=0.1))
# Get best matches for first query
best = scores.scores_by_query(queries[0], sort=True)[:5]
for match, score_tuple in best:
    print(f"Score: {score_tuple['score']:.3f}, Matches: {score_tuple['matches']}")
Core API
Module 1: Spectrum I/O
Import spectra from multiple file formats and export processed data.
from matchms.importing import (load_from_mgf, load_from_mzml, load_from_msp,
                               load_from_json, load_from_mzxml, load_from_pickle,
                               load_from_usi)
from matchms.exporting import save_as_mgf, save_as_msp, save_as_json, save_as_pickle
# Import from various formats (returns generators)
spectra_mgf = list(load_from_mgf("library.mgf"))
spectra_mzml = list(load_from_mzml("data.mzML"))
spectra_msp = list(load_from_msp("nist_library.msp"))
spectra_json = list(load_from_json("gnps_spectra.json"))
print(f"Loaded: MGF={len(spectra_mgf)}, mzML={len(spectra_mzml)}")
# Export processed spectra
save_as_mgf(spectra_mgf, "processed.mgf")
save_as_json(spectra_mgf, "processed.json")
save_as_pickle(spectra_mgf, "spectra.pickle") # Fast for intermediate results
# Pickle for large datasets (fastest I/O)
from matchms.importing import load_from_pickle
spectra = list(load_from_pickle("spectra.pickle"))
from matchms import Spectrum
import numpy as np
# Create spectrum manually
mz = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
intensities = np.array([0.1, 0.5, 0.9, 0.3, 0.7])
metadata = {
    "precursor_mz": 325.5,
    "ionmode": "positive",
    "compound_name": "Caffeine",
    "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
}
spectrum = Spectrum(mz=mz, intensities=intensities, metadata=metadata)
# Access spectrum data
print(f"Peaks: {spectrum.peaks.mz}")
print(f"Precursor: {spectrum.get('precursor_mz')}")
print(f"Name: {spectrum.get('compound_name')}")
# Visualize
spectrum.plot()
Module 2: Spectrum Filtering & Processing
Apply metadata harmonization and peak processing filters. Matchms provides 50+ filters.
from matchms.filtering import (
    default_filters, normalize_intensities,
    select_by_relative_intensity, select_by_mz,
    require_minimum_number_of_peaks, reduce_to_number_of_peaks,
    remove_peaks_around_precursor_mz, add_losses
)
# default_filters applies: metadata cleanup, charge correction, adduct parsing
spectrum = default_filters(spectrum)
# Peak normalization (max intensity → 1.0)
spectrum = normalize_intensities(spectrum)
# Filter peaks by relative intensity (remove noise below 1%)
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01, intensity_to=1.0)
# Filter peaks by m/z range
spectrum = select_by_mz(spectrum, mz_from=50.0, mz_to=500.0)
# Keep top N peaks only
spectrum = reduce_to_number_of_peaks(spectrum, n_max=50)
# Remove peaks near the precursor m/z (residual precursor ion signal and related artifacts)
spectrum = remove_peaks_around_precursor_mz(spectrum, mz_tolerance=17.0)
# Require minimum peaks for matching
spectrum = require_minimum_number_of_peaks(spectrum, n_required=5)
# Add neutral losses (useful for NeutralLossesCosine)
spectrum = add_losses(spectrum)
if spectrum is not None:
    print(f"After filtering: {len(spectrum.peaks.mz)} peaks")
# Chemical annotation filters (require matchms[chemistry])
from matchms.filtering import (
    derive_inchi_from_smiles, derive_inchikey_from_inchi,
    derive_smiles_from_inchi, add_fingerprint,
    repair_inchi_inchikey_smiles, require_valid_annotation
)
# Derive chemical identifiers from SMILES
spectrum = derive_inchi_from_smiles(spectrum)
spectrum = derive_inchikey_from_inchi(spectrum)
# Add molecular fingerprint for structural similarity
spectrum = add_fingerprint(spectrum, fingerprint_type="morgan3", nbits=2048)
# Validate annotations
spectrum = require_valid_annotation(spectrum)
if spectrum is not None:
    print(f"InChIKey: {spectrum.get('inchikey')}")
Module 3: Similarity Scoring
Compare spectra using multiple similarity metrics.
from matchms import calculate_scores
from matchms.similarity import (
    CosineGreedy, CosineHungarian, ModifiedCosine,
    NeutralLossesCosine, FingerprintSimilarity,
    MetadataMatch, PrecursorMzMatch
)
# CosineGreedy — fast peak matching (greedy algorithm)
scores = calculate_scores(references=library, queries=unknowns,
                          similarity_function=CosineGreedy(tolerance=0.1))
# ModifiedCosine — accounts for precursor mass differences (best for analog search)
scores = calculate_scores(references=library, queries=unknowns,
                          similarity_function=ModifiedCosine(tolerance=0.1))
# CosineHungarian — optimal peak matching (slower but more accurate)
scores = calculate_scores(references=library, queries=unknowns,
                          similarity_function=CosineHungarian(tolerance=0.1))
# NeutralLossesCosine — similarity based on neutral loss patterns
scores = calculate_scores(references=library, queries=unknowns,
                          similarity_function=NeutralLossesCosine(tolerance=0.1))
# Access results
for i, query in enumerate(unknowns[:3]):
    best_matches = scores.scores_by_query(query, sort=True)[:3]
    print(f"\nQuery {i}: precursor_mz={query.get('precursor_mz')}")
    for ref, score_tuple in best_matches:
        print(f"  {ref.get('compound_name', 'Unknown')}: "
              f"score={score_tuple['score']:.3f}, matches={score_tuple['matches']}")
# FingerprintSimilarity — structural similarity (requires fingerprints)
from matchms.similarity import FingerprintSimilarity
scores = calculate_scores(references=library, queries=unknowns,
                          similarity_function=FingerprintSimilarity(
                              similarity_measure="jaccard"))
# PrecursorMzMatch — fast mass-based pre-filtering
from matchms.similarity import PrecursorMzMatch
scores = calculate_scores(references=library, queries=unknowns,
                          similarity_function=PrecursorMzMatch(tolerance=0.1))
# Multi-metric scoring: combine peak + structural similarity
cosine_scores = calculate_scores(references=library, queries=unknowns,
                                 similarity_function=CosineGreedy(tolerance=0.1))
fp_scores = calculate_scores(references=library, queries=unknowns,
                             similarity_function=FingerprintSimilarity())
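Once the raw score matrices are extracted from the two Scores objects, they can be combined into a consensus ranking. A minimal numpy sketch with made-up score values (the layout, rows = references and columns = queries, mirrors matchms, but the numbers and the 0.7/0.3 weights are illustrative):

```python
import numpy as np

# Hypothetical raw score matrices (rows = references, columns = queries),
# standing in for values extracted from the two Scores objects above.
cosine = np.array([[0.92, 0.10],
                   [0.40, 0.85],
                   [0.88, 0.05]])
fingerprint = np.array([[0.80, 0.20],
                        [0.30, 0.90],
                        [0.20, 0.10]])

# Weighted consensus: favor peak matching, let structure break near-ties
combined = 0.7 * cosine + 0.3 * fingerprint

# Top-ranked reference index per query under the combined metric
best_ref_per_query = combined.argmax(axis=0)
print(best_ref_per_query.tolist())  # [0, 1]
```

Weights should be tuned on annotated data; an even split often over-weights structural similarity for close analogs.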
Module 4: Processing Pipelines
Build reusable, reproducible multi-step processing workflows.
from matchms import SpectrumProcessor
from matchms.filtering import (
    default_filters, normalize_intensities,
    select_by_relative_intensity, require_minimum_number_of_peaks,
    remove_peaks_around_precursor_mz, add_losses
)
# Define reusable pipeline
pipeline = SpectrumProcessor([
    default_filters,
    normalize_intensities,
    lambda s: select_by_relative_intensity(s, intensity_from=0.01),
    lambda s: remove_peaks_around_precursor_mz(s, mz_tolerance=17.0),
    lambda s: require_minimum_number_of_peaks(s, n_required=5),
    add_losses
])
# Apply to all spectra (filters returning None remove the spectrum)
processed = [pipeline.process_spectrum(s) for s in raw_spectra]
processed = [s for s in processed if s is not None]
print(f"Processed: {len(processed)}/{len(raw_spectra)} spectra retained")
Key Concepts
Similarity Function Comparison
| Function | Speed | Accuracy | Best For |
|---|---|---|---|
| CosineGreedy | Fast | Good | General library matching |
| CosineHungarian | Slow | Best | Small comparisons, validation |
| ModifiedCosine | Fast | Good | Analog search (different precursors) |
| NeutralLossesCosine | Medium | Good | Structural class identification |
| FingerprintSimilarity | Fast | Moderate | Structure-based pre-filtering |
| PrecursorMzMatch | Fastest | N/A | Mass-based pre-filtering |
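The difference between plain cosine and modified cosine can be illustrated without matchms: modified cosine additionally counts peak pairs offset by the precursor mass difference. A toy numpy sketch (not the matchms implementation; the masses are invented):

```python
import numpy as np

def matched_pairs(mz_a, mz_b, tolerance, shift=0.0):
    """Count peak pairs with |mz_a - (mz_b + shift)| <= tolerance (toy helper)."""
    diff = np.abs(mz_a[:, None] - (mz_b[None, :] + shift))
    return int((diff <= tolerance).sum())

# Invented fragments: spectrum B carries a +14.016 Da modification, so its
# precursor and two of its fragments are shifted by that mass.
mz_a = np.array([81.07, 109.10, 137.13])
mz_b = np.array([81.07, 123.12, 151.14])
precursor_diff = 14.016

direct = matched_pairs(mz_a, mz_b, tolerance=0.1)
shifted = matched_pairs(mz_a, mz_b, tolerance=0.1, shift=-precursor_diff)
print(direct, shifted)  # 1 2
```

Plain cosine only sees the one unshifted pair; modified cosine also credits the two shifted pairs, which is why it is preferred for analog search.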
Filter Categories
| Category | Examples | Purpose |
|---|---|---|
| Metadata cleanup | default_filters, clean_compound_name, clean_adduct | Standardize metadata fields |
| Chemical derivation | derive_inchi_from_smiles, add_fingerprint | Compute chemical identifiers |
| Mass/charge | add_precursor_mz, correct_charge, add_parent_mass | Fix and validate mass info |
| Peak normalization | normalize_intensities, select_by_relative_intensity | Scale and filter peaks |
| Peak reduction | reduce_to_number_of_peaks, remove_peaks_around_precursor_mz | Remove noise/artifacts |
| Quality control | require_minimum_number_of_peaks, require_precursor_mz | Enforce minimum quality |
| Neutral losses | add_losses | Compute precursor-fragment losses |
Score Tuple Structure
Peak-matching similarity functions (CosineGreedy, CosineHungarian, ModifiedCosine, NeutralLossesCosine) return a (score, matches) pair:
- score: float in 0.0–1.0 (cosine similarity value)
- matches: int (number of matched peaks between query and reference)
FingerprintSimilarity and PrecursorMzMatch return a single value instead. Higher scores and more matched peaks indicate a better match. Typical thresholds for confident identifications: score > 0.7 and matches > 6.
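A small helper applying these thresholds (the function name and defaults here are ours, not part of the matchms API):

```python
# Illustrative helper; thresholds follow the rule of thumb above.
def is_confident(score: float, matches: int,
                 min_score: float = 0.7, min_matches: int = 6) -> bool:
    """A match is confident only when both the score and the peak count pass."""
    return score > min_score and matches > min_matches

# (score, matches) pairs as returned by peak-matching similarity functions
hits = [(0.91, 12), (0.85, 4), (0.55, 15)]
confident = [h for h in hits if is_confident(*h)]
print(confident)  # [(0.91, 12)]
```

Note that a high score with few matched peaks (or vice versa) fails the check; both conditions matter.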
Common Workflows
Workflow 1: Library Matching for Metabolite Identification
from matchms.importing import load_from_mgf
from matchms.filtering import default_filters, normalize_intensities
from matchms.filtering import select_by_relative_intensity, require_minimum_number_of_peaks
from matchms import calculate_scores
from matchms.similarity import ModifiedCosine
import pandas as pd
# Load and process both queries and library identically
def process_spectra(spectra):
    processed = []
    for s in spectra:
        s = default_filters(s)
        if s is None:
            continue
        s = normalize_intensities(s)
        s = select_by_relative_intensity(s, intensity_from=0.01)
        s = require_minimum_number_of_peaks(s, n_required=5)
        if s is not None:
            processed.append(s)
    return processed
queries = process_spectra(load_from_mgf("unknowns.mgf"))
library = process_spectra(load_from_mgf("reference_library.mgf"))
print(f"Queries: {len(queries)}, Library: {len(library)}")
# Score all query-reference pairs
scores = calculate_scores(references=library, queries=queries,
                          similarity_function=ModifiedCosine(tolerance=0.1))
# Extract best matches
results = []
for query in queries:
    best = scores.scores_by_query(query, sort=True)[:1]
    if best:
        ref, score_tuple = best[0]
        results.append({
            "query_precursor_mz": query.get("precursor_mz"),
            "match_name": ref.get("compound_name", "Unknown"),
            "match_smiles": ref.get("smiles", ""),
            "score": score_tuple["score"],
            "matched_peaks": score_tuple["matches"]
        })
df = pd.DataFrame(results)
confident = df[df["score"] > 0.7]
print(f"Confident matches (score>0.7): {len(confident)}/{len(df)}")
df.to_csv("identification_results.csv", index=False)
Workflow 2: Quality Control and Data Cleaning
from matchms.importing import load_from_msp
from matchms.exporting import save_as_mgf
from matchms import SpectrumProcessor
from matchms.filtering import (
    default_filters, normalize_intensities,
    select_by_relative_intensity, require_minimum_number_of_peaks,
    require_precursor_mz, add_parent_mass
)
# Define QC pipeline
qc_pipeline = SpectrumProcessor([
    default_filters,
    require_precursor_mz,
    add_parent_mass,
    normalize_intensities,
    lambda s: select_by_relative_intensity(s, intensity_from=0.001),
    lambda s: require_minimum_number_of_peaks(s, n_required=3)
])
# Process and filter
raw = list(load_from_msp("raw_library.msp"))
cleaned = [qc_pipeline.process_spectrum(s) for s in raw]
cleaned = [s for s in cleaned if s is not None]
print(f"Input: {len(raw)}, Output: {len(cleaned)} ({len(cleaned)/len(raw)*100:.1f}% retained)")
save_as_mgf(cleaned, "cleaned_library.mgf")
Workflow 3: Format Conversion
- Load spectra from the source format (Core API Module 1 — e.g. load_from_mzml)
- Apply default_filters for metadata harmonization (Core API Module 2)
- Export to the target format (Core API Module 1 — e.g. save_as_mgf)
Key Parameters
| Parameter | Function/Module | Default | Range/Options | Effect |
|---|---|---|---|---|
| tolerance | CosineGreedy/ModifiedCosine | 0.1 | 0.005–0.5 Da | m/z tolerance for peak matching |
| mz_power | CosineGreedy | 0.0 | 0.0–2.0 | Weight of m/z in scoring (0 = ignore) |
| intensity_power | CosineGreedy | 1.0 | 0.0–2.0 | Weight of intensity in scoring |
| intensity_from | select_by_relative_intensity | 0.0 | 0.0–1.0 | Minimum relative intensity to keep |
| n_required | require_minimum_number_of_peaks | 10 | 1–100 | Minimum peaks to retain spectrum |
| n_max | reduce_to_number_of_peaks | 100 | 10–500 | Maximum peaks to retain |
| mz_tolerance | remove_peaks_around_precursor_mz | 17.0 | 0.5–50 Da | Window around precursor to remove |
| fingerprint_type | add_fingerprint | "daylight" | "daylight"/"morgan1"/"morgan2"/"morgan3" | Molecular fingerprint type |
| nbits | add_fingerprint | 2048 | 256–4096 | Fingerprint bit vector length |
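The effect of tolerance on peak matching can be sketched with plain numpy (toy m/z values, not the matchms matcher):

```python
import numpy as np

# Toy fragment lists; the small m/z offsets mimic instrument mass error
mz_query = np.array([100.000, 150.050, 200.100])
mz_ref = np.array([100.004, 150.300, 200.090])

def count_within(tol):
    # A query peak counts as matched if any reference peak lies within +/- tol
    diff = np.abs(mz_query[:, None] - mz_ref[None, :])
    return int((diff.min(axis=1) <= tol).sum())

for tol in (0.005, 0.1, 0.5):
    print(f"tolerance={tol}: {count_within(tol)} matched peaks")
```

Widening the tolerance finds more peak pairs but risks matching unrelated fragments; pick it to match the instrument's mass accuracy.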
Best Practices
- Always process queries and references identically — apply the same filtering pipeline to both sets to avoid systematic bias in similarity scores
- Save intermediate results in pickle — pickle format is fastest for re-loading; use MGF/MSP for sharing with other tools
- Pre-filter by precursor mass — for large libraries, use PrecursorMzMatch first to reduce the comparison space, then score with CosineGreedy
- Combine multiple metrics — use both CosineGreedy (peak matching) and FingerprintSimilarity (structure) for more robust identification
- Check for None after filtering — filters return None when a spectrum fails quality requirements. Always filter: [s for s in processed if s is not None]
- Use ModifiedCosine for analog search — when querying against libraries that may not contain the exact compound, ModifiedCosine handles precursor mass differences
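The None-check practice can be packaged as a small chain runner that short-circuits on rejection. This is an illustrative pattern using dummy dict "spectra" and made-up filter names, not matchms code:

```python
# Illustrative chain runner: stops as soon as any filter rejects the
# spectrum by returning None, mirroring how matchms filters behave.
def apply_chain(spectrum, filters):
    for f in filters:
        if spectrum is None:
            return None
        spectrum = f(spectrum)
    return spectrum

# Dummy "spectrum" objects (plain dicts) and dummy filters
def reject_empty(s):
    return s if s["peaks"] else None

def normalize(s):
    top = max(s["peaks"])
    return {**s, "peaks": [p / top for p in s["peaks"]]}

good = apply_chain({"peaks": [2.0, 4.0]}, [reject_empty, normalize])
bad = apply_chain({"peaks": []}, [reject_empty, normalize])
print(good, bad)  # {'peaks': [0.5, 1.0]} None
```

The early return means later filters never see a rejected spectrum, which is exactly why a final None-filter pass is still needed on the output list.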
Common Recipes
Recipe 1: Precursor-Filtered Library Search (Efficient)
from matchms import calculate_scores
from matchms.similarity import PrecursorMzMatch, CosineGreedy
# Step 1: Fast mass filter
mass_scores = calculate_scores(references=library, queries=unknowns,
                               similarity_function=PrecursorMzMatch(tolerance=0.5))
# Step 2: Detailed scoring only for mass-matched pairs
cosine = CosineGreedy(tolerance=0.1)
for query in unknowns:
    candidates = mass_scores.scores_by_query(query, sort=True)
    mass_matched = [ref for ref, score in candidates if score]  # PrecursorMzMatch yields True/False
    if mass_matched:
        detailed = calculate_scores(references=mass_matched, queries=[query],
                                    similarity_function=cosine)
        best = detailed.scores_by_query(query, sort=True)[:3]
        for ref, s in best:
            print(f"{ref.get('compound_name')}: {s['score']:.3f}")
Recipe 2: Ion Mode-Specific Processing
from matchms.importing import load_from_mgf
from matchms.filtering import default_filters, normalize_intensities
spectra = list(load_from_mgf("mixed_library.mgf"))
spectra = [default_filters(s) for s in spectra]
spectra = [s for s in spectra if s is not None]
# Separate by ion mode
positive = [s for s in spectra if s.get("ionmode") == "positive"]
negative = [s for s in spectra if s.get("ionmode") == "negative"]
print(f"Positive: {len(positive)}, Negative: {len(negative)}")
# Process each mode with mode-specific filtering
positive = [normalize_intensities(s) for s in positive]
negative = [normalize_intensities(s) for s in negative]
Recipe 3: Metadata Enrichment Report
import pandas as pd
from matchms.importing import load_from_mgf
from matchms.filtering import default_filters
spectra = [default_filters(s) for s in load_from_mgf("library.mgf")]
spectra = [s for s in spectra if s is not None]
# Extract metadata summary
rows = []
for s in spectra:
    rows.append({
        "compound_name": s.get("compound_name", ""),
        "precursor_mz": s.get("precursor_mz"),
        "ionmode": s.get("ionmode", ""),
        "smiles": s.get("smiles", ""),
        "inchikey": s.get("inchikey", ""),
        "num_peaks": len(s.peaks.mz)
    })
df = pd.DataFrame(rows)
print(f"Library: {len(df)} spectra")
print(f"Named: {(df.compound_name != '').sum()}")
print(f"With SMILES: {(df.smiles != '').sum()}")
print(f"Ion modes: {df.ionmode.value_counts().to_dict()}")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| All scores are 0.0 | No matching peaks within tolerance | Increase tolerance (try 0.2–0.5 Da); verify both spectra have peaks |
| Low scores despite same compound | Different fragmentation conditions | Use ModifiedCosine instead of CosineGreedy; check ion mode consistency |
| Many spectra filtered to None | Too strict quality filters | Lower n_required in require_minimum_number_of_peaks; relax intensity thresholds |
| KeyError on metadata field | Field name not harmonized | Apply default_filters first to harmonize metadata keys |
| Memory error with large library | All-vs-all comparison | Pre-filter by precursor mass (PrecursorMzMatch) before detailed scoring |
| add_fingerprint fails | RDKit not installed | Install chemistry extras: pip install "matchms[chemistry]" |
| Import returns empty list | Wrong file format or path | Verify format matches loader (MGF for .mgf, MSP for .msp); check file is not empty |
| Inconsistent scores between runs | Different processing pipelines | Use SpectrumProcessor to ensure identical processing for queries and references |
Bundled Resources
- references/filtering_catalog.md — Complete catalog of 50+ matchms filter functions organized by category (metadata processing, chemical structure, mass/charge, peak processing, quality control).
- Covers: all filter function signatures with parameters and brief descriptions, common filter combinations
- Relocated inline: key filters (default_filters, normalize_intensities, select_by_relative_intensity, reduce_to_number_of_peaks, remove_peaks_around_precursor_mz, add_fingerprint) in Core API Module 2; filter category summary in Key Concepts
- Omitted: individual filter examples duplicating Core API patterns
- references/workflows_similarity.md — Extended workflows and detailed similarity function documentation consolidated from two original references.
- Covers: all 8 similarity functions with parameter tables, large-scale comparison strategies, multi-metric scoring patterns, 7 extended workflows (library matching, QC, multi-metric, format conversion, metadata enrichment, large-scale comparison, automated identification report), performance considerations
- Relocated inline: CosineGreedy/ModifiedCosine/CosineHungarian/FingerprintSimilarity usage in Core API Module 3; similarity comparison table in Key Concepts; library matching + QC workflows in Common Workflows
- Omitted: network-based spectral clustering workflow — requires external tool (spec2vec); spectra visualization tutorial — covered by spectrum.plot() in Core API Module 1
Disposition of original reference files:
- importing_exporting.md (417 lines) → fully consolidated into Core API Module 1 (I/O functions, format list, Spectrum creation, pickle usage). Retained: all 7 import functions, 4 export functions, Spectrum class creation, format selection guidance. Omitted: USI detailed examples — niche use case
- filtering.md (289 lines) → migrated as references/filtering_catalog.md with key filters relocated to Core API Module 2
- similarity.md (381 lines) → consolidated into references/workflows_similarity.md with core functions in Core API Module 3
- workflows.md (648 lines) → consolidated into references/workflows_similarity.md with top workflows in Common Workflows
Related Skills
- pyopenms-mass-spectrometry — full LC-MS/MS proteomics and metabolomics pipelines (feature detection, protein ID)
- rdkit-cheminformatics — molecular fingerprint generation and chemical structure similarity
References
- matchms documentation: https://matchms.readthedocs.io
- Huber et al. (2020) matchms — processing and similarity scoring of mass spectrometry data. Journal of Open Source Software, DOI: 10.21105/joss.02411
- GitHub: https://github.com/matchms/matchms