name: "muon-multiomics-singlecell" description: "Multi-modal single-cell analysis with muon/MuData. Joint RNA+ATAC (10x Multiome), CITE-seq (RNA+protein), other multi-omics. MuData holds per-modality AnnData with shared obs. WNN joint embedding, per-modality preprocessing, MOFA factor analysis. Use scanpy-scrna-seq for single-modality RNA; use muon when combining 2+ omics from the same cells." license: "BSD-3-Clause"
muon — Multi-Modal Single-Cell Analysis
Overview
muon is a Python framework for multi-modal single-cell data analysis that extends the AnnData ecosystem. Its core data structure, MuData, holds multiple AnnData objects (one per modality: RNA, ATAC, protein, etc.) with shared observation and variable axes, enabling coordinated operations across all modalities. muon provides modality-specific preprocessing routines (TF-IDF and LSI for ATAC, CLR normalization for surface proteins), Weighted Nearest Neighbor (WNN) graph construction for joint dimensionality reduction, and cross-modal analysis tools. It integrates directly with scanpy, scvi-tools, and MOFA+ for a complete multi-omics single-cell workflow.
When to Use
- Analyzing 10x Genomics Multiome data (simultaneous RNA + ATAC from the same nuclei)
- Processing CITE-seq experiments (RNA + surface protein from the same cells)
- Building joint UMAP embeddings that integrate signals from two or more modalities via WNN
- Preprocessing ATAC-seq modalities (TF-IDF normalization, LSI dimensionality reduction)
- Normalizing surface protein data with centered log-ratio (CLR) normalization
- Performing cross-modal feature linkage (associating ATAC peaks with nearby gene expression)
- Applying MOFA+ factor analysis across multiple omics layers within a unified container
- Use scanpy-scrna-seq instead when analyzing a single RNA-seq modality without any co-measured omics
- Use scvi-tools (MultiVI / totalVI) when you need probabilistic deep generative batch correction across modalities
Prerequisites
- Python packages:
muon>=0.1.6,scanpy>=1.10,anndata>=0.10,numpy,scipy,pandas,matplotlib,leidenalg - Data requirements: 10x Multiome h5 or h5mu files, or per-modality AnnData objects (cells x features) sharing the same
obs_names - Environment: Python 3.9+; 16 GB+ RAM for datasets >50k cells; optional
mofapy2for MOFA factor analysis
pip install "muon[all]" "scanpy[leiden]" anndata
# Optional: for MOFA+ integration
pip install mofapy2
Quick Start
import muon as mu
import scanpy as sc
import numpy as np
import anndata as ad
import pandas as pd
# Simulate a small RNA + ATAC MuData object (200 cells)
np.random.seed(42)
n_cells = 200
rna = ad.AnnData(np.abs(np.random.negative_binomial(5, 0.3, (n_cells, 2000))).astype(float),
obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]),
var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)]))
atac = ad.AnnData(np.abs(np.random.negative_binomial(2, 0.5, (n_cells, 5000))).astype(float),
obs=rna.obs.copy(),
var=pd.DataFrame(index=[f"peak_{j}" for j in range(5000)]))
mdata = mu.MuData({"rna": rna, "atac": atac})
print(mdata)
# MuData object with n_obs × n_vars = 200 × 7000
# 2 modalities
# rna: 200 × 2000
# atac: 200 × 5000
# Minimal preprocessing
sc.pp.normalize_total(mdata["rna"], target_sum=1e4)
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500)
sc.pp.pca(mdata["rna"])
mu.atac.pp.tfidf(mdata["atac"])
mu.atac.tl.lsi(mdata["atac"])
# WNN joint embedding
mu.pp.neighbors(mdata, key_added="wnn",
use_rep={"rna": "X_pca", "atac": "X_lsi"})
sc.tl.umap(mdata, neighbors_key="wnn")
sc.tl.leiden(mdata, neighbors_key="wnn", key_added="leiden_wnn")
print(f"Clusters: {mdata.obs['leiden_wnn'].nunique()}")
Core API
Module 1: MuData Creation and I/O
MuData is the central container: a dictionary of AnnData modalities plus shared obs and var slots. The mdata.obs DataFrame merges per-modality observation metadata with a modality prefix for ambiguous columns.
import muon as mu
import anndata as ad
import numpy as np
import pandas as pd
# --- Build from per-modality AnnData objects ---
n_cells = 300
rna = ad.AnnData(np.abs(np.random.negative_binomial(5, 0.3, (n_cells, 3000))).astype(float),
obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]),
var=pd.DataFrame(index=[f"gene_{j}" for j in range(3000)]))
prot = ad.AnnData(np.abs(np.random.randn(n_cells, 30)) + 3,
obs=rna.obs.copy(),
var=pd.DataFrame(index=[f"protein_{j}" for j in range(30)]))
mdata = mu.MuData({"rna": rna, "protein": prot})
print(mdata)
# Access individual modalities
print(mdata.mod["rna"]) # AnnData: 300 × 3000
print(mdata["rna"]) # same shorthand
print(mdata.obs.head()) # shared observation metadata
print(mdata.obs_names[:5]) # cell barcodes
# Save and load
mdata.write("multiome_data.h5mu")
mdata2 = mu.read("multiome_data.h5mu")
print(f"Loaded: {mdata2.n_obs} cells, {mdata2.n_mod} modalities")
# --- Load from 10x Multiome h5 file ---
# mdata = mu.read_10x_h5("filtered_feature_bc_matrix.h5")
# Produces MuData with mdata["rna"] and mdata["atac"] modalities
# --- Subsetting by cells or features ---
# Select high-quality cells (e.g., after QC)
mask = np.ones(mdata.n_obs, dtype=bool) # replace with actual QC mask
mdata_filtered = mdata[mask].copy()
print(f"After filtering: {mdata_filtered.n_obs} cells")
# Propagate obs mask to each modality
mu.pp.intersect_obs(mdata) # ensures obs consistency across modalities
Module 2: RNA Modality Preprocessing
Standard scRNA-seq preprocessing applied to mdata["rna"] using scanpy functions. The RNA modality is preprocessed identically to a standalone scanpy workflow but operates on the slice of the MuData container.
import scanpy as sc
import numpy as np
# Assume mdata["rna"] has raw integer counts
rna = mdata["rna"]
# QC metrics
sc.pp.calculate_qc_metrics(rna, percent_top=None, log1p=False, inplace=True)
rna.obs["pct_counts_mt"] = rna[:, rna.var_names.str.startswith("MT-")].X.sum(axis=1).A1 / rna.obs["total_counts"] * 100
# Filter cells and genes
min_genes, max_genes, max_mt = 200, 5000, 20
sc.pp.filter_cells(rna, min_genes=min_genes)
sc.pp.filter_genes(rna, min_cells=5)
rna = rna[(rna.obs["n_genes_by_counts"] < max_genes) &
(rna.obs["pct_counts_mt"] < max_mt)].copy()
print(f"RNA after QC: {rna.n_obs} cells × {rna.n_vars} genes")
# Normalize and log-transform
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)
# Highly variable genes
sc.pp.highly_variable_genes(rna, n_top_genes=3000, flavor="seurat_v3",
batch_key=None)
print(f"HVGs: {rna.var['highly_variable'].sum()}")
# PCA on HVGs
sc.pp.pca(rna, n_comps=50, use_highly_variable=True)
print(f"PCA embedding shape: {rna.obsm['X_pca'].shape}")
# Update slice in MuData
mdata.mod["rna"] = rna
Module 3: ATAC Modality Preprocessing
ATAC-seq modalities require a different normalization strategy. TF-IDF (term frequency–inverse document frequency) normalizes peak accessibility across cells and peaks; LSI (latent semantic indexing, equivalent to truncated SVD after TF-IDF) produces a low-dimensional embedding. The first LSI component typically captures sequencing depth rather than biology and is excluded.
# Assume mdata["atac"] contains raw binary or integer peak accessibility counts
atac = mdata["atac"]
# Basic QC: filter low-coverage cells and low-frequency peaks
sc.pp.calculate_qc_metrics(atac, percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(atac, min_genes=200) # min peaks detected
sc.pp.filter_genes(atac, min_cells=10) # min cells a peak appears in
print(f"ATAC after QC: {atac.n_obs} cells × {atac.n_vars} peaks")
# TF-IDF normalization (log(TF) * log(IDF) scaling)
mu.atac.pp.tfidf(atac, scale_factor=1e4)
print("TF-IDF normalization complete")
print(f"ATAC data range: [{atac.X.min():.2f}, {atac.X.max():.2f}]")
# LSI dimensionality reduction (truncated SVD on TF-IDF matrix)
mu.atac.tl.lsi(atac, n_comps=50, use_highly_variable=False)
# LSI component 1 correlates with sequencing depth — exclude it
# mu.atac.tl.lsi sets X_lsi starting from component 2 by default
print(f"LSI embedding shape: {atac.obsm['X_lsi'].shape}")
# Update modality in MuData
mdata.mod["atac"] = atac
Module 4: WNN Graph and Joint Embedding
Weighted Nearest Neighbor (WNN) integrates multiple modality embeddings by learning per-cell, per-modality weights. Cells with high-quality RNA signal receive higher RNA weight; cells with cleaner ATAC signal receive higher ATAC weight. The resulting WNN graph is used for UMAP layout and Leiden clustering.
# Compute per-modality neighbor graphs first (optional but enables modality-specific UMAPs)
mu.pp.neighbors(mdata["rna"], use_rep="X_pca", n_neighbors=30, key_added="neighbors")
mu.pp.neighbors(mdata["atac"], use_rep="X_lsi", n_neighbors=30, key_added="neighbors")
# WNN joint neighbor graph across modalities
mu.pp.neighbors(
mdata,
key_added="wnn",
use_rep={"rna": "X_pca", "atac": "X_lsi"},
n_neighbors=30,
random_state=42,
)
print("WNN graph built. Keys:", list(mdata.obsp.keys()))
# Expected: ['wnn_connectivities', 'wnn_distances']
# UMAP from WNN graph
sc.tl.umap(mdata, neighbors_key="wnn", random_state=42)
print(f"UMAP embedding shape: {mdata.obsm['X_umap'].shape}")
# Leiden clustering from WNN graph
sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.5, key_added="leiden_wnn")
n_clusters = mdata.obs["leiden_wnn"].nunique()
print(f"Leiden WNN clustering: {n_clusters} clusters at resolution 0.5")
Module 5: Visualization
muon extends scanpy's plotting interface with modality-aware functions. mu.pl.embedding() colors joint UMAP embeddings by features from any modality; sc.pl.umap() with color pointing to modality-prefixed feature names (e.g., "rna:CD3E") is also supported.
import matplotlib.pyplot as plt
# Joint UMAP colored by cluster assignment
sc.pl.umap(mdata, color="leiden_wnn", title="WNN Leiden clusters",
legend_loc="on data", show=False)
plt.savefig("wnn_umap_clusters.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved wnn_umap_clusters.png")
# Color by RNA gene expression on joint UMAP
sc.pl.umap(mdata, color=["rna:CD3E", "rna:CD19", "rna:CD14"],
use_raw=False, vmax="p99", show=False, ncols=3)
plt.savefig("wnn_umap_markers.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved wnn_umap_markers.png")
# Per-modality scatter / embedding plots
mu.pl.embedding(mdata, basis="X_umap", color="leiden_wnn",
show=False)
plt.savefig("embedding_clusters.png", dpi=150, bbox_inches="tight")
plt.close()
# Violin plot: RNA QC metrics per cluster
sc.pl.violin(mdata["rna"], keys=["n_genes_by_counts", "total_counts"],
groupby=mdata.obs.loc[mdata["rna"].obs_names, "leiden_wnn"],
rotation=90, show=False)
plt.savefig("rna_qc_by_cluster.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved rna_qc_by_cluster.png")
Module 6: Cross-Modal Analysis
Cross-modal analysis links features across modalities — for example, associating ATAC peak accessibility near a gene's promoter with its RNA expression, or applying MOFA+ to find shared latent factors.
# --- Peak-to-gene distance annotation ---
# Annotate ATAC peaks with nearest gene (requires genomic coordinates in var)
# mdata["atac"].var should contain columns: chrom, chromStart, chromEnd
# mu.atac.tl.rank_peaks_groups(mdata, groupby="leiden_wnn") # differential peaks
# --- Differentially accessible peaks per cluster ---
atac = mdata["atac"]
# Transfer cluster labels from MuData obs to ATAC obs
atac.obs["cluster"] = mdata.obs.loc[atac.obs_names, "leiden_wnn"].values
sc.tl.rank_genes_groups(atac, groupby="cluster", method="wilcoxon",
use_raw=False)
top_peaks = sc.get.rank_genes_groups_df(atac, group="0").head(5)
print("Top differential peaks in cluster 0:")
print(top_peaks[["names", "logfoldchanges", "pvals_adj"]])
# --- MOFA+ factor analysis on MuData ---
# Requires: pip install mofapy2
try:
mu.tl.mofa(mdata, n_factors=10, seed=42,
use_obs="all", outfile="mofa_model.hdf5")
# Factor scores stored in mdata.obsm["X_mofa"]
print(f"MOFA factors: {mdata.obsm['X_mofa'].shape}")
# Variance explained per factor per modality
# Inspect with mdata.uns["mofa"]
except ImportError:
print("Install mofapy2 to enable MOFA factor analysis")
Common Workflows
Workflow 1: Full 10x Multiome (RNA + ATAC) Joint WNN Clustering
Goal: Process paired RNA and ATAC data from 10x Genomics Multiome, build a WNN joint embedding, cluster cells, and identify RNA marker genes per cluster.
import muon as mu
import scanpy as sc
import numpy as np
import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt
# --- 1. Simulate 10x Multiome-like data (replace with mu.read_10x_h5()) ---
np.random.seed(0)
n_cells = 400
cell_ids = [f"AAACGAATC{i:05d}-1" for i in range(n_cells)]
# Simulate two cell types with distinct expression/accessibility
labels = np.array(["typeA"] * 200 + ["typeB"] * 200)
rna_counts = np.abs(np.random.negative_binomial(6, 0.35, (n_cells, 2000))).astype(float)
rna_counts[:200, :100] += 10 # typeA marker genes
rna_counts[200:, 100:200] += 10 # typeB marker genes
atac_counts = np.abs(np.random.binomial(1, 0.05, (n_cells, 50000))).astype(float)
atac_counts[:200, :5000] += 1 # typeA accessible peaks
atac_counts[200:, 5000:10000] += 1 # typeB accessible peaks
rna = ad.AnnData(rna_counts,
obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)]))
atac = ad.AnnData(atac_counts,
obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
var=pd.DataFrame(index=[f"peak_{j}" for j in range(50000)]))
mdata = mu.MuData({"rna": rna, "atac": atac})
print(f"Loaded: {mdata}")
# --- 2. RNA preprocessing ---
sc.pp.filter_cells(mdata["rna"], min_genes=50)
sc.pp.filter_genes(mdata["rna"], min_cells=5)
sc.pp.normalize_total(mdata["rna"], target_sum=1e4)
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500)
sc.pp.pca(mdata["rna"], n_comps=30, use_highly_variable=True)
print(f"RNA PCA: {mdata['rna'].obsm['X_pca'].shape}")
# --- 3. ATAC preprocessing: TF-IDF + LSI ---
sc.pp.filter_cells(mdata["atac"], min_genes=100)
sc.pp.filter_genes(mdata["atac"], min_cells=5)
mu.atac.pp.tfidf(mdata["atac"])
mu.atac.tl.lsi(mdata["atac"], n_comps=30)
print(f"ATAC LSI: {mdata['atac'].obsm['X_lsi'].shape}")
# --- 4. Per-modality neighbors (optional for modality-specific plots) ---
mu.pp.neighbors(mdata["rna"], use_rep="X_pca", n_neighbors=15, key_added="neighbors")
mu.pp.neighbors(mdata["atac"], use_rep="X_lsi", n_neighbors=15, key_added="neighbors")
# --- 5. WNN joint neighbor graph ---
mu.pp.intersect_obs(mdata) # align obs after per-modality filtering
mu.pp.neighbors(mdata, key_added="wnn",
use_rep={"rna": "X_pca", "atac": "X_lsi"},
n_neighbors=15, random_state=42)
# --- 6. UMAP + Leiden clustering ---
sc.tl.umap(mdata, neighbors_key="wnn", random_state=42)
sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.5, key_added="leiden_wnn")
print(f"Clusters: {mdata.obs['leiden_wnn'].value_counts().to_dict()}")
# --- 7. Marker genes per cluster ---
sc.tl.rank_genes_groups(mdata["rna"],
groupby=mdata.obs.loc[mdata["rna"].obs_names, "leiden_wnn"],
method="wilcoxon", use_raw=False)
top_markers = sc.get.rank_genes_groups_df(mdata["rna"], group="0").head(5)
print("Top RNA markers for cluster 0:")
print(top_markers[["names", "logfoldchanges", "pvals_adj"]])
# --- 8. Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sc.pl.umap(mdata, color="leiden_wnn", ax=axes[0], show=False, title="WNN clusters")
sc.pl.umap(mdata, color="rna:gene_0", ax=axes[1], show=False, title="gene_0 expression",
use_raw=False, vmax="p99")
plt.tight_layout()
plt.savefig("multiome_wnn_pipeline.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved multiome_wnn_pipeline.png")
Workflow 2: CITE-seq (RNA + Surface Protein) Analysis
Goal: Process CITE-seq data with paired RNA and antibody-derived tag (ADT/protein) counts. Normalize proteins with CLR, annotate cell types from protein markers, and build a joint UMAP.
import muon as mu
import scanpy as sc
import numpy as np
import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
np.random.seed(1)
n_cells = 350
cell_ids = [f"CITESEQ_{i:04d}" for i in range(n_cells)]
# Simulate RNA + protein modalities
# Three cell types: T-cells (CD3+CD4+), B-cells (CD19+CD20+), Monocytes (CD14+CD16+)
n_t, n_b, n_mono = 120, 130, 100
labels = ["Tcell"] * n_t + ["Bcell"] * n_b + ["Monocyte"] * n_mono
rna_mat = np.abs(np.random.negative_binomial(4, 0.4, (n_cells, 1500))).astype(float)
rna_mat[:n_t, :50] += 15 # T-cell RNA markers
rna_mat[n_t:n_t+n_b, 50:100] += 15 # B-cell RNA markers
rna_mat[n_t+n_b:, 100:150] += 15 # Monocyte RNA markers
# 20 surface proteins: first 5 T-cell, next 5 B-cell, next 5 Monocyte, rest baseline
prot_mat = np.abs(np.random.normal(2, 0.5, (n_cells, 20)))
prot_mat[:n_t, :5] += 8 # CD3, CD4, CD5, CD7, CD8 for T-cells
prot_mat[n_t:n_t+n_b, 5:10] += 8 # CD19, CD20, CD22, CD24, CD79 for B-cells
prot_mat[n_t+n_b:, 10:15] += 8 # CD14, CD16, CD64, CD11b, HLA-DR for Monocytes
protein_names = ["CD3", "CD4", "CD5", "CD7", "CD8",
"CD19", "CD20", "CD22", "CD24", "CD79a",
"CD14", "CD16", "CD64", "CD11b", "HLA-DR",
"CD25", "CD56", "CD45RA", "CD45RO", "IgG-ctrl"]
rna = ad.AnnData(csr_matrix(rna_mat),
obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
var=pd.DataFrame(index=[f"gene_{j}" for j in range(1500)]))
prot = ad.AnnData(prot_mat,
obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
var=pd.DataFrame(index=protein_names))
mdata = mu.MuData({"rna": rna, "protein": prot})
print(mdata)
# --- RNA preprocessing ---
sc.pp.normalize_total(mdata["rna"], target_sum=1e4)
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500)
sc.pp.pca(mdata["rna"], n_comps=30, use_highly_variable=True)
# --- Protein CLR normalization (Centered Log-Ratio) ---
# CLR normalizes each protein across cells: log(x / geometric_mean(x))
mu.prot.pp.clr(mdata["protein"])
print("Protein CLR normalization applied")
print(f"Protein data range: [{mdata['protein'].X.min():.2f}, {mdata['protein'].X.max():.2f}]")
# PCA on protein modality
sc.pp.pca(mdata["protein"], n_comps=min(15, prot.n_vars - 1))
# --- WNN joint embedding ---
mu.pp.neighbors(mdata, key_added="wnn",
use_rep={"rna": "X_pca", "protein": "X_pca"},
n_neighbors=20, random_state=0)
sc.tl.umap(mdata, neighbors_key="wnn", random_state=0)
sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.4, key_added="leiden_wnn")
# --- Protein-based cell type annotation ---
# Compute mean CLR protein per cluster
prot_df = pd.DataFrame(
mdata["protein"].X,
index=mdata["protein"].obs_names,
columns=protein_names
)
prot_df["cluster"] = mdata.obs.loc[mdata["protein"].obs_names, "leiden_wnn"].values
cluster_means = prot_df.groupby("cluster").mean()
print("\nMean CLR protein per cluster:")
print(cluster_means[["CD3", "CD4", "CD19", "CD14"]].round(2))
# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.umap(mdata, color="leiden_wnn", ax=axes[0], show=False, title="WNN Leiden")
sc.pl.umap(mdata, color="protein:CD3", ax=axes[1], show=False, title="CD3 (CLR)",
use_raw=False, vmax="p99")
sc.pl.umap(mdata, color="protein:CD19", ax=axes[2], show=False, title="CD19 (CLR)",
use_raw=False, vmax="p99")
plt.tight_layout()
plt.savefig("citeseq_joint_umap.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved citeseq_joint_umap.png")
# --- Dot plot: protein markers per cluster ---
sc.pl.dotplot(mdata["protein"],
var_names=["CD3", "CD4", "CD19", "CD20", "CD14", "CD16"],
groupby=mdata.obs.loc[mdata["protein"].obs_names, "leiden_wnn"],
show=False)
plt.savefig("citeseq_protein_dotplot.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved citeseq_protein_dotplot.png")
Key Parameters
| Parameter | Module / Function | Default | Range / Options | Effect |
|---|---|---|---|---|
n_neighbors | mu.pp.neighbors() | 30 | 10–100 | Neighborhood size for graph; lower = finer local structure |
use_rep | mu.pp.neighbors() | None (auto) | dict of {modality: embedding_key} | Which embedding per modality to use for WNN |
key_added | mu.pp.neighbors() | "neighbors" | any string | Key under which graph is stored; use "wnn" for joint graphs |
n_comps | sc.pp.pca(), mu.atac.tl.lsi() | 50 | 10–100 | Number of reduced dimensions; 30–50 typical |
resolution | sc.tl.leiden() | 1.0 | 0.1–2.0 | Clustering granularity; lower = fewer, larger clusters |
scale_factor | mu.atac.pp.tfidf() | 1e4 | 1e3–1e5 | TF scaling constant before log transform |
n_factors | mu.tl.mofa() | 10 | 5–50 | Number of MOFA latent factors; use elbow on variance explained |
target_sum | sc.pp.normalize_total() | 1e4 | 1e3–1e6 | Library size normalization target per cell (RNA) |
n_top_genes | sc.pp.highly_variable_genes() | varies | 1000–5000 | Number of highly variable genes to retain for PCA |
Best Practices
-
Align obs before WNN: After per-modality QC filtering, cell counts across modalities may differ. Always call
mu.pp.intersect_obs(mdata)before WNN to ensure every modality has the same cell set.mu.pp.intersect_obs(mdata) print(f"Shared cells after intersect: {mdata.n_obs}") -
Drop LSI component 1 from ATAC: The first LSI component captures sequencing depth (total counts per cell) rather than biological signal.
mu.atac.tl.lsi()stores all components inX_lsi, so verify that component 1 correlates withlog_total_countsbefore usingX_lsiin WNN; if so, restrict toX_lsi[:, 1:].import numpy as np, pandas as pd atac = mdata["atac"] corr = np.corrcoef(atac.obsm["X_lsi"][:, 0], np.log1p(atac.obs["total_counts"]))[0, 1] print(f"LSI1 vs log_counts correlation: {corr:.3f}") # If |corr| > 0.9, exclude component 1: # atac.obsm["X_lsi"] = atac.obsm["X_lsi"][:, 1:] -
Use
key_added="wnn"consistently: Always name the joint graph"wnn"and passneighbors_key="wnn"to all downstreamsc.tl.umap(),sc.tl.leiden(), andsc.tl.paga()calls. Mixingneighbors_keyvalues silently uses the wrong graph. -
CLR normalization for proteins, not log-normalization: Antibody-derived tag (ADT) counts from CITE-seq have a different noise model than RNA. Use
mu.prot.pp.clr()for per-protein CLR normalization. Do NOT applysc.pp.normalize_total()+sc.pp.log1p()to the protein modality. -
Store raw RNA counts before normalization: Before any normalization, store the raw RNA counts in
mdata["rna"].layers["counts"]and setmdata["rna"].raw = mdata["rna"]. This is required for downstream differential expression tests and scvi-tools integration.import scipy.sparse as sp mdata["rna"].layers["counts"] = mdata["rna"].X.copy() # Store pre-normalization snapshot mdata["rna"].raw = mdata["rna"] -
Match WNN embedding dimensions: The RNA PCA and ATAC LSI embeddings passed to WNN should have comparable dimensionality (e.g., both 30 or 50 components). Mismatched dimensions do not cause errors but can down-weight the smaller modality.
Common Recipes
Recipe: Compute Per-Cluster Modality Weights from WNN
When to use: Inspect which cells rely more on RNA vs ATAC signal in the WNN graph. High RNA weight in a cluster suggests cleaner RNA data; high ATAC weight suggests stronger chromatin accessibility signal.
# WNN stores per-cell modality weights in mdata.obsm after mu.pp.neighbors()
# Key name: "{key_added}_weights" — check available keys
print([k for k in mdata.obsm.keys() if "wnn" in k.lower()])
# If weights are stored (depends on muon version):
if "wnn_weights" in mdata.obsm:
weights_df = pd.DataFrame(
mdata.obsm["wnn_weights"],
index=mdata.obs_names,
columns=["rna_weight", "atac_weight"]
)
weights_df["cluster"] = mdata.obs["leiden_wnn"].values
print(weights_df.groupby("cluster").mean().round(3))
else:
# Proxy: compare RNA vs ATAC PCA explained variance per cell
rna_var = np.var(mdata["rna"].obsm["X_pca"], axis=1)
atac_var = np.var(mdata["atac"].obsm["X_lsi"], axis=1)
mdata.obs["rna_signal_proxy"] = rna_var / (rna_var + atac_var)
mdata.obs["atac_signal_proxy"] = atac_var / (rna_var + atac_var)
print(mdata.obs[["rna_signal_proxy", "atac_signal_proxy", "leiden_wnn"]].groupby("leiden_wnn").mean().round(3))
Recipe: Export Cluster Labels Back to Per-Modality AnnData
When to use: After joint WNN clustering, copy shared cluster labels back into each modality's .obs for modality-specific downstream analyses (e.g., differential peaks within clusters).
# Copy joint cluster labels into each modality's obs
for mod_name in mdata.mod:
mod_adata = mdata[mod_name]
shared_cells = mod_adata.obs_names
mod_adata.obs["leiden_wnn"] = mdata.obs.loc[shared_cells, "leiden_wnn"].values
print(f"{mod_name}: added leiden_wnn, {mod_adata.obs['leiden_wnn'].nunique()} clusters")
# Now run modality-specific differential analysis per cluster
sc.tl.rank_genes_groups(mdata["rna"], groupby="leiden_wnn",
method="wilcoxon", use_raw=False)
sc.tl.rank_genes_groups(mdata["atac"], groupby="leiden_wnn",
method="wilcoxon", use_raw=False)
print("Differential features computed for both modalities")
Recipe: Convert MuData to Concatenated AnnData for scvi-tools
When to use: MultiVI and totalVI in scvi-tools require a concatenated AnnData with modality identity tracked in var. This recipe prepares the input for these models.
# Concatenate RNA and ATAC into a single AnnData with modality column in var
rna_adata = mdata["rna"].copy()
atac_adata = mdata["atac"].copy()
rna_adata.var["modality"] = "Gene Expression"
atac_adata.var["modality"] = "Peaks"
# Restore raw counts for scvi-tools (requires integer counts)
# Ensure mdata["rna"].layers["counts"] and mdata["atac"].X are integer
import anndata as ad
combined = ad.concat([rna_adata, atac_adata], axis=1, merge="unique")
combined.obs = mdata.obs.loc[combined.obs_names].copy()
print(f"Combined AnnData: {combined.n_obs} cells × {combined.n_vars} features")
print(combined.var["modality"].value_counts())
# Use combined with scvi.model.MULTIVI.setup_anndata(combined, ...)
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
KeyError: 'rna' when building MuData | Modality key not set or loaded from file without expected names | Check mdata.mod.keys(). When loading 10x h5 files, use mu.read_10x_h5() which auto-names modalities "rna" and "atac" |
ValueError: obs_names mismatch during WNN | Cells filtered differently per modality leaving mismatched obs | Call mu.pp.intersect_obs(mdata) after per-modality QC to harmonize cell sets |
| UMAP/Leiden uses wrong graph | Multiple neighbor graphs exist; function uses default key "neighbors" | Always pass neighbors_key="wnn" explicitly to sc.tl.umap() and sc.tl.leiden() |
| LSI component 1 dominates ATAC embedding | First component captures sequencing depth, not biology | Compute correlation of X_lsi[:, 0] with log_total_counts; if |
mu.atac.pp.tfidf raises sparse matrix error | Input matrix is dense or wrong dtype | Convert first: mdata["atac"].X = scipy.sparse.csr_matrix(mdata["atac"].X.astype(float)) |
| CLR normalization produces NaN for proteins | Zero counts in a cell for all proteins | Filter cells with sc.pp.filter_cells(mdata["protein"], min_genes=1) before CLR |
mu.tl.mofa() fails with missing mofapy2 | mofapy2 not installed | pip install mofapy2; ensure muon version ≥ 0.1.5 for MOFA integration |
| Memory error for large ATAC peak matrices | ATAC matrices (50k+ peaks × 10k+ cells) exceed RAM | Use sc.pp.highly_variable_genes() to select top 50k peaks first, or load in chunks |
Related Skills
- scanpy-scrna-seq — single-modality RNA analysis; use as the foundation for the RNA preprocessing steps within muon
- scvi-tools-single-cell — deep generative multi-modal integration (MultiVI for RNA+ATAC, totalVI for CITE-seq); use when probabilistic batch correction across samples is needed
- mofaplus-multi-omics — multi-omics factor analysis;
mu.tl.mofa()calls mofapy2 internally and stores factors in MuData - anndata-data-structure — AnnData fundamentals; each MuData modality is a standard AnnData object
- deeptools-ngs-analysis — upstream ATAC-seq BAM → bigWig normalization before peak calling
- macs3-peak-calling — produces the peak BED files used to generate the ATAC AnnData count matrix
References
- muon documentation — official docs, tutorials, API reference
- muon GitHub (scverse) — source code, issues, examples
- Bredikhin et al. Genome Biology 2022 — muon/MuData publication
- 10x Genomics Multiome analysis guide — step-by-step Multiome tutorial in muon
- WNN method (Hao et al. Cell 2021) — original Weighted Nearest Neighbor paper from Seurat v4