name: umap-learn
description: >-
  UMAP dimensionality reduction for visualization, clustering prep, and feature engineering. Fast nonlinear manifold learning preserving local and global structure. Standard UMAP (fit/transform, sklearn-compatible), supervised/semi-supervised, Parametric UMAP (NN encoder/decoder, TensorFlow), DensMAP (density), AlignedUMAP (temporal/batch). 15+ distance metrics, custom Numba metrics, precomputed distances. For linear reduction use PCA; for neighborhood graphs use sklearn NearestNeighbors.
license: BSD-3-Clause
UMAP-Learn
Overview
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction algorithm used for visualization and as a general-purpose preprocessing step. It is typically faster than t-SNE and scales to larger datasets, tends to preserve more global structure alongside local structure, and supports supervised learning and the embedding of new data points.
When to Use
- Reducing high-dimensional data to 2D/3D for visualization
- Preprocessing for density-based clustering (HDBSCAN, DBSCAN)
- Feature engineering in ML pipelines (transform new data into learned embedding)
- Supervised/semi-supervised embedding with partial labels
- Tracking embeddings across time points or batches (AlignedUMAP)
- Density-preserving embeddings (DensMAP)
- Neural network-based embedding with custom architectures (Parametric UMAP)
- For linear dimensionality reduction use PCA (scikit-learn)
- For neighborhood-graph construction without embedding use scikit-learn NearestNeighbors
Prerequisites
pip install umap-learn
# For Parametric UMAP (neural network variant)
pip install umap-learn[parametric_umap] # requires TensorFlow 2.x
Critical: Always standardize features before applying UMAP to ensure equal weighting across dimensions.
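The effect of scale can be seen without running UMAP at all. The sketch below uses synthetic data (the two features and their scales are made up for illustration) and measures how much of the total squared Euclidean distance each feature contributes before and after standardization. The manual scaling is meant to mirror what StandardScaler does with its default settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(0.0, 1.0, size=500),      # small-scale feature
    rng.normal(0.0, 1000.0, size=500),   # large-scale feature
])

def feature_share(X):
    """Fraction of the summed squared pairwise distance contributed by each feature.

    For one feature, the sum of squared pairwise differences equals
    2 * n**2 * variance, so the shares reduce to normalized variances.
    """
    per_feature = X.var(axis=0)
    return per_feature / per_feature.sum()

raw_share = feature_share(X)

# Zero mean, unit variance per feature (StandardScaler's default behaviour).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = feature_share(X_scaled)

print("share before scaling:", raw_share)
print("share after scaling: ", scaled_share)
```

Before scaling, almost all of the distance comes from the large-scale feature, so the neighbor graph UMAP builds would effectively ignore the other one.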
Quick Start
import umap
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
# Load and scale data
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
# Fit and transform
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled)
print(f"Input: {X_scaled.shape}, Output: {embedding.shape}")
# Input: (1797, 64), Output: (1797, 2)
Core API
1. Standard UMAP
Basic dimensionality reduction following scikit-learn conventions.
import umap
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(data)
# Method 1: fit_transform (single step)
embedding = umap.UMAP(
n_neighbors=15, # local neighborhood size (2-200)
min_dist=0.1, # min distance between embedded points (0.0-0.99)
n_components=2, # output dimensions
metric='euclidean', # distance metric
random_state=42, # reproducibility
).fit_transform(X_scaled)
print(f"Embedding shape: {embedding.shape}")
# Method 2: fit + access (for reuse)
reducer = umap.UMAP(random_state=42)
reducer.fit(X_scaled)
embedding = reducer.embedding_ # trained embedding
graph = reducer.graph_ # fuzzy simplicial set (sparse matrix)
# Visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.tight_layout()
plt.savefig('umap_embedding.png', dpi=150)
2. Supervised & Semi-Supervised UMAP
Incorporate label information to guide embedding via the y parameter.
import umap
# Supervised — all labels known
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=labels)
# Semi-supervised — partial labels (mark unlabeled as -1)
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=semi_labels)
# Control label influence with target_weight (0.0=unsupervised, 1.0=fully supervised)
reducer = umap.UMAP(
target_weight=0.7, # emphasize labels
target_metric='categorical', # for classification; use distance metric for regression
random_state=42
)
embedding = reducer.fit_transform(X_scaled, y=labels)
print(f"Supervised embedding: {embedding.shape}")
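The semi-supervised snippet assumes an unlabeled_indices array already exists. One way to build it, as a sketch on synthetic labels (the class count and the 70% masking fraction are arbitrary assumptions), is to sample indices without replacement and overwrite those positions with -1:

```python
import numpy as np

rng = np.random.default_rng(42)

labels = np.repeat(np.arange(5), 20)   # hypothetical: 100 samples, 5 classes
unlabeled_fraction = 0.7               # assumption: hide 70% of the labels

# Choose which samples lose their label, without replacement.
n_unlabeled = round(unlabeled_fraction * labels.size)
unlabeled_indices = rng.choice(labels.size, size=n_unlabeled, replace=False)

# Work on a copy so the full label vector stays available for evaluation.
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1    # -1 marks "label unknown"

print(f"Labeled: {(semi_labels != -1).sum()}, unlabeled: {(semi_labels == -1).sum()}")
```

The resulting semi_labels array can be passed as y to fit_transform in the same way as in the snippet above.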
3. Transform New Data
Project unseen data into the trained embedding space.
import umap
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit on training data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_scaled)
# Transform test data
X_test_emb = reducer.transform(X_test_scaled)
print(f"Train: {X_train_emb.shape}, Test: {X_test_emb.shape}")
# Works in sklearn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipeline = Pipeline([
('scaler', StandardScaler()),
('umap', umap.UMAP(n_components=10, random_state=42)),
('classifier', SVC())
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")
4. Parametric UMAP
Neural network-based embedding via TensorFlow/Keras. Enables efficient transform, reconstruction, and custom architectures.
from umap.parametric_umap import ParametricUMAP
# Default architecture (3-layer, 100-neuron FC network)
embedder = ParametricUMAP(n_components=2, random_state=42)
embedding = embedder.fit_transform(X_scaled)
new_emb = embedder.transform(new_data) # fast neural network inference
print(f"Parametric embedding: {embedding.shape}")
import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP
# Custom encoder/decoder for autoencoder mode
input_dim = X_scaled.shape[1]
encoder = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(input_dim,)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(2),
])
decoder = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(2,)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(input_dim),
])
embedder = ParametricUMAP(
encoder=encoder, decoder=decoder, dims=(input_dim,),
parametric_reconstruction=True, autoencoder_loss=True,
n_training_epochs=10, batch_size=128,
n_neighbors=15, min_dist=0.1, random_state=42
)
embedding = embedder.fit_transform(X_scaled)
reconstructed = embedder.inverse_transform(embedding)
print(f"Reconstruction error: {np.mean((X_scaled - reconstructed)**2):.4f}")
5. DensMAP
Variant preserving local density information in the embedding.
import numpy as np
import umap
reducer = umap.UMAP(
    densmap=True, # enable DensMAP
    dens_lambda=2.0, # density preservation weight
    dens_frac=0.3, # fraction of epochs that use the density term
    output_dens=True, # also return local radius estimates
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
# With output_dens=True, fit_transform returns a tuple rather than a single array
embedding, rad_orig, rad_emb = reducer.fit_transform(X_scaled)
# The same estimates are stored on the fitted model
original_radii = reducer.rad_orig_ # log local radius in original space
embedded_radii = reducer.rad_emb_ # log local radius in embedded space
print(f"DensMAP embedding: {embedding.shape}")
print(f"Radius correlation: {np.corrcoef(original_radii, embedded_radii)[0,1]:.3f}")
6. AlignedUMAP
Align embeddings across multiple related datasets (time points, batches).
from umap import AlignedUMAP
# Multiple related datasets
datasets = [day1_data, day2_data, day3_data]
mapper = AlignedUMAP(
n_neighbors=15,
alignment_regularisation=1e-2, # alignment strength
alignment_window_size=2, # align with N adjacent datasets
n_components=2,
random_state=42
)
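# AlignedUMAP needs relations between consecutive datasets:
# relations[i] maps row indices in datasets[i] to row indices in datasets[i + 1].
# Assumption here: every dataset holds the same samples in the same order.
relations = [{j: j for j in range(len(d))} for d in datasets[:-1]]
mapper.fit(datasets, relations=relations)
aligned_embeddings = mapper.embeddings_ # list of aligned embedding arrays
print(f"Aligned {len(aligned_embeddings)} datasets")
for i, emb in enumerate(aligned_embeddings):
    print(f" Dataset {i}: {emb.shape}")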
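AlignedUMAP's fit expects a relations keyword argument: a list with one dictionary per consecutive pair of datasets, mapping row indices in one dataset to the rows of the same samples in the next. When datasets only partly overlap, these dictionaries can be derived from sample identifiers. The helper and the IDs below are hypothetical, a minimal sketch in plain Python:

```python
def build_relations(id_lists):
    """Map row indices of consecutive datasets that share a sample ID.

    id_lists[k][i] is the ID of row i in dataset k. The result has
    len(id_lists) - 1 dictionaries; dictionary k maps a row index in
    dataset k to the row index of the same sample in dataset k + 1.
    """
    relations = []
    for current_ids, next_ids in zip(id_lists[:-1], id_lists[1:]):
        next_position = {sample_id: j for j, sample_id in enumerate(next_ids)}
        relations.append({
            i: next_position[sample_id]
            for i, sample_id in enumerate(current_ids)
            if sample_id in next_position
        })
    return relations

# Hypothetical sample IDs for three time points.
ids_day1 = ["a", "b", "c", "d"]
ids_day2 = ["b", "c", "d", "e"]
ids_day3 = ["c", "e", "f"]

relations = build_relations([ids_day1, ids_day2, ids_day3])
print(relations)
```

The result would then be passed as mapper.fit(datasets, relations=relations), with datasets in the same order as the ID lists.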
Key Concepts
Parameter Tuning Guide
| Parameter | Low | Medium (default) | High | Effect |
|---|---|---|---|---|
| n_neighbors | 2-5 | 15 | 50-200 | Local detail vs global structure |
| min_dist | 0.0 | 0.1 | 0.5-0.99 | Tight clusters vs spread out |
| n_components | 2 | 2 | 5-50 | Visualization vs ML/clustering |
| spread | 0.5 | 1.0 | 2.0 | Embedding scale (with min_dist) |
Configuration by Use-Case
| Use-Case | n_neighbors | min_dist | n_components | metric |
|---|---|---|---|---|
| Visualization | 15 | 0.1 | 2 | euclidean |
| Clustering (HDBSCAN) | 30 | 0.0 | 5-10 | euclidean |
| Text/document embedding | 15 | 0.1 | 2 | cosine |
| Global structure | 100 | 0.5 | 2 | euclidean |
| ML feature engineering | 15-30 | 0.1 | 10-50 | euclidean |
| Binary/set data | 15 | 0.1 | 2 | hamming/jaccard |
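If these configurations are reused across scripts, keeping them in one place avoids copy-paste drift. The sketch below transcribes the table into a dictionary; the preset names and the helper are hypothetical, and where the table gives a range a single value inside that range has been picked as an assumption.

```python
# Presets transcribed from the use-case table; ranges are collapsed to one
# illustrative value (an assumption, not a recommendation).
UMAP_PRESETS = {
    "visualization": dict(n_neighbors=15, min_dist=0.1, n_components=2, metric="euclidean"),
    "clustering": dict(n_neighbors=30, min_dist=0.0, n_components=10, metric="euclidean"),
    "text": dict(n_neighbors=15, min_dist=0.1, n_components=2, metric="cosine"),
    "global_structure": dict(n_neighbors=100, min_dist=0.5, n_components=2, metric="euclidean"),
    "ml_features": dict(n_neighbors=30, min_dist=0.1, n_components=20, metric="euclidean"),
    "binary": dict(n_neighbors=15, min_dist=0.1, n_components=2, metric="jaccard"),
}

def umap_kwargs(use_case, **overrides):
    """Return a copy of the preset for use_case with any overrides applied."""
    if use_case not in UMAP_PRESETS:
        raise KeyError(f"unknown use case {use_case!r}; choose from {sorted(UMAP_PRESETS)}")
    params = dict(UMAP_PRESETS[use_case])  # copy so the preset is not mutated
    params.update(overrides)
    return params

clustering_params = umap_kwargs("clustering", n_components=5)
print(clustering_params)
```

A reducer could then be created with umap.UMAP(**clustering_params, random_state=42).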
Supported Metrics
Minkowski family: euclidean, manhattan, chebyshev, minkowski. Spatial: canberra, braycurtis, haversine. Correlation: cosine, correlation. Binary: hamming, jaccard, dice, russellrao, rogerstanimoto, sokalmichener, sokalsneath, yule. Special: precomputed (distance matrix), custom Numba-compiled callables.
Standard UMAP vs Parametric UMAP
| Feature | Standard | Parametric |
|---|---|---|
| Backend | Direct optimization | TensorFlow neural network |
| Transform speed | Moderate | Fast (neural net inference) |
| Inverse transform | Approximate, expensive | Decoder network, fast |
| Custom architecture | No | Yes (CNNs, RNNs, etc.) |
| Requirements | umap-learn | umap-learn + TensorFlow 2.x |
| Best for | Quick exploration | Production pipelines, reconstruction |
Common Workflows
Workflow 1: UMAP + HDBSCAN Clustering Pipeline
import umap
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
# Step 1: Preprocess
X_scaled = StandardScaler().fit_transform(data)
print(f"Input shape: {X_scaled.shape}")
# Step 2: UMAP for clustering (NOT visualization parameters)
reducer = umap.UMAP(
n_neighbors=30, # more global structure for clustering
min_dist=0.0, # allow tight packing
n_components=10, # higher dims preserve density better than 2D
metric='euclidean',
random_state=42
)
embedding = reducer.fit_transform(X_scaled)
# Step 3: HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
cluster_labels = clusterer.fit_predict(embedding)
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise = sum(cluster_labels == -1)
print(f"Clusters: {n_clusters}, Noise: {noise}")
# Step 4: Separate 2D embedding for visualization
vis_emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_scaled)
plt.scatter(vis_emb[:, 0], vis_emb[:, 1], c=cluster_labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title(f'HDBSCAN Clusters (n={n_clusters})')
plt.tight_layout()
plt.savefig('umap_clusters.png', dpi=150)
Workflow 2: Supervised Embedding for Classification
import umap
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Supervised UMAP for feature engineering
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_s, y=y_train)
X_test_emb = reducer.transform(X_test_s)
# Downstream classifier
clf = SVC(kernel='rbf')
clf.fit(X_train_emb, y_train)
y_pred = clf.predict(X_test_emb)
print(classification_report(y_test, y_pred))
Workflow 3: Exploring Embedding Space with Inverse Transform
Text-only — combines Core API modules 1 and 3 (inverse_transform on standard UMAP):
- Fit standard UMAP on data (Core API: Standard UMAP)
- Create a grid of points spanning the embedding space
- Apply reducer.inverse_transform(grid_points) to reconstruct high-dimensional data
- Visualize reconstructed samples to understand embedding regions
Note: inverse transform is approximate; works poorly outside the convex hull of the training embedding.
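The grid step can be sketched with NumPy alone. The example below uses random points as a stand-in for a fitted reducer.embedding_ and assumes a 2D embedding; the helper name and grid resolution are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 2))   # stand-in for reducer.embedding_

def embedding_grid(embedding, n_per_axis=10):
    """Return an (n_per_axis**2, 2) array covering the embedding's bounding box."""
    mins = embedding.min(axis=0)
    maxs = embedding.max(axis=0)
    xs = np.linspace(mins[0], maxs[0], n_per_axis)
    ys = np.linspace(mins[1], maxs[1], n_per_axis)
    gx, gy = np.meshgrid(xs, ys)
    # Flatten the two coordinate grids into one row per grid point.
    return np.column_stack([gx.ravel(), gy.ravel()])

grid_points = embedding_grid(embedding, n_per_axis=10)
print(grid_points.shape)
```

A bounding box usually extends past the convex hull of the embedding at its corners, so reconstructions for corner points deserve extra skepticism. With a fitted standard UMAP model, grid_points would be passed to reducer.inverse_transform.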
Key Parameters
| Parameter | Module | Default | Range | Effect |
|---|---|---|---|---|
| n_neighbors | UMAP | 15 | 2-200 | Local vs global structure balance |
| min_dist | UMAP | 0.1 | 0.0-0.99 | Cluster tightness |
| n_components | UMAP | 2 | 2-100 | Output dimensionality |
| metric | UMAP | 'euclidean' | See metrics list | Distance calculation method |
| spread | UMAP | 1.0 | >0 | Embedding scale (with min_dist) |
| n_epochs | UMAP | None (auto) | 50-500+ | Training iterations |
| learning_rate | UMAP | 1.0 | >0 | SGD step size |
| init | UMAP | 'spectral' | spectral/random/pca | Embedding initialization |
| random_state | UMAP | None | int | Reproducibility seed |
| target_weight | UMAP | 0.5 | 0.0-1.0 | Label influence (supervised) |
| densmap | UMAP | False | bool | Enable DensMAP |
| dens_lambda | UMAP | 2.0 | >0 | DensMAP density weight |
| low_memory | UMAP | True | bool | Memory-efficient mode |
| encoder | ParametricUMAP | None | Keras model | Custom encoder network |
| decoder | ParametricUMAP | None | Keras model | Custom decoder network |
| n_training_epochs | ParametricUMAP | 1 | 1-100 | Neural network training epochs |
| alignment_regularisation | AlignedUMAP | 0.01 | >0 | Alignment strength |
| alignment_window_size | AlignedUMAP | 3 | 1-N | Adjacent datasets to align |
Best Practices
- Always standardize features: Use StandardScaler before UMAP; unscaled features with larger ranges will dominate the embedding.
- Set random_state for reproducibility: UMAP uses stochastic optimization; results vary between runs without a fixed seed.
- Use different parameters for clustering vs visualization: Clustering needs n_neighbors=30, min_dist=0.0, n_components=5-10. Visualization needs n_neighbors=15, min_dist=0.1, n_components=2.
- Anti-pattern (interpreting distances literally): UMAP preserves topology, not precise distances. Cluster separations and point distances in the embedding are not proportional to original distances.
- Anti-pattern (using 2D embeddings for clustering): 2D projections lose density information. Use 5-10 components for HDBSCAN input.
- Consider PCA preprocessing for very high dimensions: For data with >1000 features, reducing to 50-100 PCA components first can speed up UMAP, usually with little loss of quality.
- Use Parametric UMAP for production: When you need fast transform on new data or reconstruction capabilities, Parametric UMAP's neural network provides consistent, fast inference.
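The PCA preprocessing suggestion can be sketched as follows. The data is synthetic, and 50 components is one value from the suggested 50-100 range rather than a tuned choice; the explained variance ratio gives a rough sense of how much signal the reduction keeps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))   # synthetic stand-in: 300 samples, 2000 features

# Scale first so that PCA is not dominated by high-variance features.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 50 components before handing the data to UMAP.
pca = PCA(n_components=50, random_state=42)
X_reduced = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(f"Reduced shape: {X_reduced.shape}, variance kept: {explained:.2%}")
```

X_reduced would then replace X_scaled as the input to umap.UMAP(...).fit_transform. On pure noise like this example the retained variance is low; on real data with correlated features it is typically much higher.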
Common Recipes
Recipe: Custom Numba Distance Metric
import numpy as np
import umap
from numba import njit
@njit()
def weighted_euclidean(x, y):
    """Custom distance with feature weights."""
    result = 0.0
    for i in range(x.shape[0]):
        result += (x[i] - y[i]) ** 2 * (1.0 + i * 0.01) # increasing weight
    return np.sqrt(result)
embedding = umap.UMAP(metric=weighted_euclidean, random_state=42).fit_transform(data)
Recipe: Precomputed Distance Matrix
import umap
from scipy.spatial.distance import pdist, squareform
# Compute custom distance matrix
dist_matrix = squareform(pdist(data, metric='correlation'))
# Use precomputed distances
embedding = umap.UMAP(
metric='precomputed', random_state=42
).fit_transform(dist_matrix)
print(f"Embedding from precomputed: {embedding.shape}")
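Before passing a matrix with metric='precomputed', it can help to confirm that it looks like a distance matrix. Correlation distance, for example, produces NaN for constant rows. The checker below is a hypothetical helper, not part of umap-learn, and the tolerance is an arbitrary assumption.

```python
import numpy as np

def check_distance_matrix(dist, atol=1e-8):
    """Raise ValueError if dist is not a plausible pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {dist.shape}")
    if not np.all(np.isfinite(dist)):
        raise ValueError("matrix contains NaN or infinite entries")
    if np.any(dist < -atol):
        raise ValueError("matrix contains negative distances")
    if not np.allclose(dist, dist.T, atol=atol):
        raise ValueError("matrix is not symmetric")
    if not np.allclose(np.diag(dist), 0.0, atol=atol):
        raise ValueError("diagonal entries are not zero")
    return dist

# Small Euclidean example built with broadcasting.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
diff = points[:, None, :] - points[None, :, :]
dist_matrix = check_distance_matrix(np.sqrt((diff ** 2).sum(axis=-1)))
print(dist_matrix)
```

Running the check on the output of squareform(pdist(...)) catches problems early, before they surface as less obvious errors during fitting.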
Recipe: Metric Learning Pipeline
import umap
from sklearn.svm import SVC
# Train supervised embedding on labeled data
mapper = umap.UMAP(n_components=10, random_state=42)
train_emb = mapper.fit_transform(X_train, y=y_train)
# Transform unlabeled test data using learned metric
test_emb = mapper.transform(X_test)
# Downstream classifier
clf = SVC().fit(train_emb, y_train)
predictions = clf.predict(test_emb)
print(f"Accuracy: {(predictions == y_test).mean():.3f}")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Disconnected/fragmented clusters | n_neighbors too low | Increase n_neighbors (try 30-50) |
| Clusters too spread out | min_dist too high | Decrease min_dist (try 0.0-0.05) |
| All points collapsed | Bad preprocessing or min_dist too low | Check StandardScaler; increase min_dist |
| Poor clustering results | Using visualization parameters for clustering | Set n_neighbors=30, min_dist=0.0, n_components=5-10 |
| Transform results differ from training | Distribution shift | Ensure test data matches training distribution; use Parametric UMAP |
| Slow on large datasets (>100k) | High input dimensionality; large neighbor graph | Preprocess with PCA to 50-100 dims. low_memory=True is already the default and reduces memory use rather than runtime |
| First run very slow | Numba JIT compilation | Expected — subsequent runs are fast (compiled cache) |
| ImportError: umap | Name conflict with umap package | pip install umap-learn (not pip install umap) |
| Parametric UMAP import error | Missing TensorFlow | pip install umap-learn[parametric_umap] |
| Non-reproducible results | Missing random_state | Always set random_state=42 (or any int) |
Bundled Resources
references/api_reference.md
Complete UMAP constructor parameter reference (60+ parameters organized by category: core, training, advanced structural, supervised, transform, performance, DensMAP), all methods and attributes, ParametricUMAP class with autoencoder parameters, AlignedUMAP class, utility functions (nearest_neighbors, fuzzy_simplicial_set). Core parameter tuning guidance was relocated to SKILL.md Key Concepts and Core API modules. Usage examples duplicating SKILL.md workflows omitted.
Related Skills
- scikit-learn-machine-learning: ML classifiers, preprocessing, pipelines for downstream tasks
- matplotlib-scientific-plotting: Visualization of UMAP embeddings
- scikit-bio: Biological distance matrices that can feed into UMAP via metric='precomputed'
References
- McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
- Sainburg T, McInnes L, Gentner TQ. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Computation (2021)
- Narayan A, Berger B, Cho H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology (2021) — DensMAP
- Official docs: https://umap-learn.readthedocs.io/
- GitHub: https://github.com/lmcinnes/umap