name: "snakemake-workflow-engine" description: "Python-based workflow manager for reproducible, scalable pipelines. Define rules with file-based dependencies; Snakemake resolves execution order and parallelism. Runs local, SLURM, LSF, AWS, GCP via profiles; per-rule conda/Singularity envs. For NGS pipelines, ML training, and multi-step file processing. Use Nextflow for Groovy dataflow or nf-core integration." license: "MIT"
Snakemake — Python Workflow Engine
Overview
Snakemake is a Python-based workflow management system that scales analyses from laptop to HPC and cloud. Workflows are defined as rules with explicit input/output file dependencies; Snakemake resolves the execution order automatically and runs independent steps in parallel. Rules can call shell commands, Python/R/Julia scripts, or inline Python. Per-rule conda or Singularity environments make workflows fully reproducible. Widely used in bioinformatics for NGS, genome assembly, and variant-calling pipelines.
When to Use
- Building reproducible multi-step bioinformatics pipelines (align → sort → call variants → annotate)
- Scaling the same workflow from local development to SLURM cluster without code changes
- Processing multiple samples identically using wildcard-based rules
- Managing dependencies automatically — only rerun steps whose inputs changed
- Deploying per-rule conda or Singularity environments for tool isolation
- Generating visual DAGs and dry-run previews before committing computational resources
- Use Nextflow instead when you need the Groovy DSL + dataflow channels, or when leveraging the nf-core community pipeline library
- For simple shell loops, use bash scripts; Snakemake is worth the overhead only for 3+ sequential steps with branching
- Use Prefect or Airflow instead for data engineering workflows with dynamic task graphs or time-based scheduling
Prerequisites
- Python packages: snakemake, graphviz (for DAG visualization)
- Environment: Python 3.11+; conda/mamba recommended for per-rule environments
- Data requirements: input files and reference files; output paths are declared in rules
Check before installing: The tool may already be available in the current environment (e.g., inside a pixi/conda env). Run command -v snakemake first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run snakemake rather than bare snakemake.
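For example, this check can be run before the install commands below:
# Quick pre-install check (prints a path and version if Snakemake is already available)
command -v snakemake && snakemake --version   # skip the install commands below if this succeeds
pixi run snakemake --version                  # inside a pixi project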
# Install via conda (includes optional dependencies)
conda install -c conda-forge -c bioconda snakemake
# Minimal pip install
pip install snakemake
# Verify
snakemake --version
# 8.x.x
Quick Start
# Snakefile — minimal 2-rule pipeline
SAMPLES = ["sampleA", "sampleB"]
rule all: # Target rule: request final outputs
input:
expand("results/{sample}.sorted.bam", sample=SAMPLES)
rule align:
input:
fastq="data/{sample}.fastq",
ref="refs/genome.fa"
output:
bam="results/{sample}.sorted.bam"
threads: 4
shell:
"bwa mem -t {threads} {input.ref} {input.fastq} "
"| samtools sort -@ {threads} -o {output.bam}"
# Run: dry-run first, then execute
snakemake -n # dry-run: show what would run
snakemake --cores 8 # execute with 8 cores
Core API
Module 1: Rule Definition
Each rule defines one analysis step with inputs, outputs, and an execution method.
# Shell rule: run a command with {input} and {output} placeholders
rule fastqc:
input:
fastq="data/{sample}.fastq"
output:
html="qc/{sample}_fastqc.html",
zip="qc/{sample}_fastqc.zip"
log:
"logs/fastqc/{sample}.log"
shell:
"fastqc {input.fastq} -o qc/ 2> {log}"
# Run rule: inline Python for logic-heavy steps
rule parse_stats:
input:
txt="results/{sample}.flagstat.txt"
output:
csv="results/{sample}.stats.csv"
run:
import re, csv
stats = open(input.txt).read()
mapped = re.search(r"(\d+) \+ \d+ mapped", stats).group(1)
with open(output.csv, "w") as f:
csv.writer(f).writerow([wildcards.sample, mapped])
# Script rule: delegate to external R/Python/Julia script
rule plot_coverage:
input:
depth="results/{sample}.depth.txt"
output:
pdf="results/{sample}.coverage.pdf"
script:
"scripts/plot_coverage.R"
# In the R script, access via snakemake object:
# depth_file <- snakemake@input[["depth"]]
# pdf_path <- snakemake@output[["pdf"]]
Module 2: Wildcards and Pattern Expansion
Wildcards let one rule process any number of samples; expand() generates all required file paths.
# Define sample list (from config or glob)
SAMPLES = ["ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]
rule all:
input:
# expand() generates: qc/ctrl_rep1_fastqc.html, qc/ctrl_rep2_fastqc.html, ...
expand("qc/{sample}_fastqc.html", sample=SAMPLES),
expand("results/{sample}.bam", sample=SAMPLES)
# Access wildcard values inside shell/run
rule align:
input:
"data/{sample}.fastq"
output:
"results/{sample}.bam"
shell:
"echo Processing {wildcards.sample}; "
"bwa mem refs/genome.fa {input} | samtools view -b > {output}"
# Wildcard constraints prevent ambiguous matches
rule process:
input:
"data/{sample}_{rep}.fastq"
output:
"results/{sample}_{rep}.txt"
wildcard_constraints:
sample="[A-Za-z]+", # letters only
rep="\d+" # digits only
# multiext: multiple outputs sharing a common path base
rule bwa_index:
input:
"refs/genome.fa"
output:
multiext("refs/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
shell:
"bwa index {input}"
Module 3: Configuration and Parameters
Config files externalize settings; params passes rule-level values without file dependencies.
# Snakefile: declare config file
configfile: "config/config.yaml"
# config/config.yaml:
# samples: [ctrl, treat]
# threads:
# align: 8
# sort: 4
# min_mapq: 20
SAMPLES = config["samples"]
rule filter_reads:
input:
"results/{sample}.bam"
output:
"results/{sample}.filtered.bam"
params:
mapq=config["min_mapq"] # from config, not a file
threads:
config["threads"]["sort"]
shell:
"samtools view -q {params.mapq} -b {input} > {output}"
# Dynamic params via lambda functions
rule trim:
input:
fastq="data/{sample}.fastq"
output:
trimmed="trimmed/{sample}.fastq"
params:
# Adapt quality threshold based on sample name
quality=lambda wildcards: 25 if "ctrl" in wildcards.sample else 20
shell:
"fastp -q {params.quality} -i {input.fastq} -o {output.trimmed}"
Module 4: Resources and Environments
Declare computational resources for scheduler integration; use conda/Singularity for tool isolation.
# Resource declaration (used by SLURM/LSF profiles)
rule variant_calling:
input:
bam="results/{sample}.deduped.bam",
ref="refs/genome.fa"
output:
vcf="variants/{sample}.vcf.gz"
resources:
mem_mb=16000, # memory in MB
runtime=240, # max walltime in minutes
disk_mb=20000 # scratch disk space
threads: 8
shell:
"bcftools mpileup -f {input.ref} {input.bam} "
"| bcftools call -m -Oz -o {output.vcf}"
# Conda environment per rule (for reproducibility)
rule star_align:
input:
reads="data/{sample}.fastq",
genome_dir="refs/star_index/"
output:
bam="star_out/{sample}/Aligned.sortedByCoord.out.bam"
conda:
"envs/star.yaml"
# envs/star.yaml:
# channels:
# - bioconda
# dependencies:
# - star=2.7.10b
# - samtools=1.17
threads: 8
shell:
"STAR --runThreadN {threads} --genomeDir {input.genome_dir} "
"--readFilesIn {input.reads} --outSAMtype BAM SortedByCoordinate"
# Singularity/Apptainer container
rule gatk_haplotypecaller:
input:
bam="results/{sample}.bam",
ref="refs/genome.fa"
output:
gvcf="gvcfs/{sample}.g.vcf.gz"
container:
"docker://broadinstitute/gatk:4.4.0.0"
shell:
"gatk HaplotypeCaller -I {input.bam} -R {input.ref} "
"-O {output.gvcf} -ERC GVCF"
Module 5: Execution and Cluster Profiles
Execute locally, on clusters, or in cloud; profiles configure executors without changing the Snakefile.
# Local execution
snakemake --cores 8 # use 8 CPU cores
snakemake --cores all # use all available cores
# Dry run: show tasks without executing
snakemake -n --cores 8
# Output: 12 of 24 steps are complete. 12 jobs to run.
# Force rerun (ignore existing outputs)
snakemake --forceall --cores 8
# Visualize DAG as PDF
snakemake --dag | dot -Tpdf > workflow_dag.pdf
# SLURM cluster profile (profiles/slurm/config.yaml)
# executor: slurm
# jobs: 50
# default-resources:
# mem_mb: 2000
# runtime: 60
# use-conda: true
# Run with profile (cluster submit + monitor)
snakemake --profile profiles/slurm --cores 128
# Override resources at runtime
snakemake --profile profiles/slurm \
--set-resources variant_calling:mem_mb=32000 --cores 128
# Override threads
snakemake --set-threads align=16 --cores 64
Module 6: Special Output Types and Utilities
Handle temporary files, protected outputs, checkpoints, and output validation.
# temp: auto-delete after downstream rules consume it
rule sort_bam:
input:
"results/{sample}.raw.bam"
output:
temp("results/{sample}.sorted_temp.bam") # deleted after indexing
shell:
"samtools sort {input} -o {output}"
# protected: write-protect final outputs (prevent overwrite)
rule final_report:
input:
"results/{sample}.vcf.gz"
output:
protected("reports/{sample}.final.vcf.gz")
shell:
"cp {input} {output}"
# directory: rule that outputs a directory
rule denovo_assembly:
input:
fastq="data/{sample}.fastq"
output:
directory("assemblies/{sample}/")
shell:
"spades.py -s {input.fastq} -o {output}"
# touch: create empty flag file (for ordering-only dependencies)
rule validate_bam:
input:
"results/{sample}.bam"
output:
touch("checkpoints/{sample}.validated")
shell:
"samtools quickcheck {input} && echo OK"
# ensure: validate output properties before considering rule complete
rule download_reference:
output:
ensure("refs/genome.fa", min_size=1_000_000)
shell:
"wget -O {output} https://example.com/genome.fa"
Key Concepts
Rule Resolution and DAG
Snakemake works backward from targets: given a list of desired output files, it builds a DAG of rules needed to produce them. Rules not needed for the current targets are ignored.
# rule all: declare all final outputs here
# Without this, snakemake runs only the first rule
rule all:
input:
expand("results/{sample}.vcf.gz", sample=SAMPLES),
expand("qc/{sample}_fastqc.html", sample=SAMPLES)
Wildcards vs Expand
- {sample} in a rule's input/output is a wildcard: Snakemake fills it in at execution time
- expand("results/{sample}.bam", sample=SAMPLES) is plain Python: it generates a concrete list of strings immediately (used in rule all)
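A short sketch of the distinction, using hypothetical sample names and paths:
SAMPLES = ["s1", "s2"]

# expand() is ordinary Python, evaluated while the Snakefile is parsed:
print(expand("results/{sample}.bam", sample=SAMPLES))
# ['results/s1.bam', 'results/s2.bam']

rule sort:
    input: "results/{sample}.bam"           # {sample} is a wildcard: left unresolved here,
    output: "results/{sample}.sorted.bam"   # filled in per requested output file
    shell: "samtools sort {input} -o {output}"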
Common Workflows
Workflow 1: Standard NGS QC Pipeline
Goal: FastQC → trim → align → sort → dedup → flagstat for multiple samples.
configfile: "config/config.yaml"
SAMPLES = config["samples"]
rule all:
input:
expand("qc/{sample}_fastqc.html", sample=SAMPLES),
expand("results/{sample}.flagstat.txt", sample=SAMPLES)
rule fastqc:
input: "data/{sample}.fastq"
output: "qc/{sample}_fastqc.html", "qc/{sample}_fastqc.zip"
shell: "fastqc {input} -o qc/"
rule trim:
input: "data/{sample}.fastq"
output: "trimmed/{sample}.fastq"
shell: "fastp -q 20 -i {input} -o {output}"
rule align:
input:
fastq="trimmed/{sample}.fastq",
ref="refs/genome.fa"
output: temp("results/{sample}.raw.bam")
threads: 8
shell:
"bwa mem -t {threads} {input.ref} {input.fastq} | samtools view -b > {output}"
rule sort_dedup:
input: "results/{sample}.raw.bam"
output:
bam="results/{sample}.bam",
bai="results/{sample}.bam.bai"
threads: 4
shell:
"samtools sort -@ {threads} {input} | samtools markdup -r - {output.bam} "
"&& samtools index {output.bam}"
rule flagstat:
input: "results/{sample}.bam"
output: "results/{sample}.flagstat.txt"
shell: "samtools flagstat {input} > {output}"
Workflow 2: Running on a SLURM Cluster
Goal: Deploy the same Snakefile to HPC with per-job resource allocation.
# 1. Create profiles/slurm/config.yaml
mkdir -p profiles/slurm
cat > profiles/slurm/config.yaml << 'EOF'
executor: slurm
jobs: 100
default-resources:
mem_mb: 4000
runtime: 60
use-conda: true
latency-wait: 30
rerun-incomplete: true
EOF
# 2. Add resources to compute-heavy rules in Snakefile
# resources:
# mem_mb=16000, runtime=120
# 3. Submit
snakemake --profile profiles/slurm --cores 256 -n # dry-run
snakemake --profile profiles/slurm --cores 256 # submit
# 4. Monitor
snakemake --profile profiles/slurm --report report.html # after completion
Key Parameters
| Parameter | Context | Default | Range/Options | Effect |
|---|---|---|---|---|
| --cores | CLI | 1 | 1–N or all | Max concurrent jobs/threads |
| threads: | Rule | 1 | 1–N | Threads per rule (capped at --cores) |
| mem_mb: | resources: | None | integer | Memory in MB (used by SLURM profile) |
| runtime: | resources: | None | integer (minutes) | Max walltime per job |
| --profile | CLI | None | path | YAML profile for executor config |
| --use-conda | CLI | False | flag | Activate per-rule conda environments |
| --use-apptainer | CLI | False | flag | Enable Singularity/Apptainer containers |
| -n | CLI | False | flag | Dry-run (show jobs, don't execute) |
| --forceall | CLI | False | flag | Rerun all rules regardless of status |
| --rerun-incomplete | CLI | False | flag | Rerun jobs whose outputs are marked incomplete |
| configfile: | Snakefile | None | YAML path | Load config dictionary from YAML |
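For example, a typical invocation combining several of these options:
snakemake --cores 16 --use-conda --rerun-incomplete -n    # preview the jobs
snakemake --cores 16 --use-conda --rerun-incomplete       # then execute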
Best Practices
-
Always define
rule all: Without it, only the first rule in the Snakefile runs.rule allcollects all final outputs; Snakemake runs everything needed to produce them. -
Use
temp()for large intermediates: BAM files before deduplication, unsorted BAMs, and intermediate assemblies can be markedtemp()to auto-delete after consumption — saves significant disk. -
Separate config from code: Put sample lists, thread counts, file paths, and thresholds in
config.yaml. Hard-coded values in Snakefiles make pipelines brittle and non-reusable. -
Test with
snakemake -nfirst: The dry-run shows exactly which rules will run and in what order. Run it before every production execution to confirm the DAG is correct. -
Use
log:for every shell rule: Redirect tool stderr/stdout to per-rule log files (2> {log}). Without logs, debugging cluster job failures is nearly impossible. -
Benchmark rules in production: Add
benchmark: "benchmarks/{rule}/{sample}.txt"to measure actual runtime and memory — essential data for tuning SLURM resource requests.
Common Recipes
Recipe: Generate Sample List from Files
# Auto-discover samples from input directory (no hardcoded list)
from pathlib import Path
SAMPLES = [p.stem for p in Path("data").glob("*.fastq")]  # "sampleA.fastq" -> "sampleA"
print(f"Found {len(SAMPLES)} samples: {SAMPLES[:3]}...")
rule all:
input:
expand("results/{sample}.bam", sample=SAMPLES)
Recipe: Conditional Execution Based on Config
configfile: "config/config.yaml"
# Only run deduplication for WGS (not amplicon) data
rule dedup:
input:
"results/{sample}.sorted.bam"
output:
"results/{sample}.deduped.bam"
run:
if config.get("assay_type") == "WGS":
shell("samtools markdup -r {input} {output}")
else:
shell("cp {input} {output}")
Recipe: Aggregate Multiple Samples
# Collect all per-sample stats into one summary table
rule multiqc:
input:
expand("qc/{sample}_fastqc.zip", sample=SAMPLES),
expand("results/{sample}.flagstat.txt", sample=SAMPLES)
output:
"multiqc/multiqc_report.html"
shell:
"multiqc qc/ results/ -o multiqc/"
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| AmbiguousRuleException | Multiple rules match same output | Add wildcard_constraints:, use ruleorder: rule_a > rule_b, or rename outputs |
| MissingOutputException | Rule completed but output file absent | Check working directory in shell; verify output path; check disk space |
| TargetFileException | rule all requests a file no rule can produce | Verify expand() args match wildcard names; use -n to trace resolution |
| Cluster jobs all fail | Resources too low for tool | Increase mem_mb or runtime; check cluster queue with squeue |
| Conda env build fails | Package conflict or wrong channel | Add conda-forge before bioconda; pin package versions |
| Rule reruns unexpectedly | Output file timestamp older than input | Touch output files with snakemake --touch; or delete and rerun |
| PermissionError on protected output | protected() wrapper applied | Remove write protection manually (chmod +w) or delete the file, then rerun; or regenerate without protected() |
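For the AmbiguousRuleException case, the two usual fixes look roughly like this (rule names are illustrative):
# Option 1: constrain wildcards globally so only one rule can match a given path
wildcard_constraints:
    sample="[A-Za-z0-9_]+"

# Option 2: give one rule explicit priority when both could produce the same output
ruleorder: align_paired > align_single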
Related Skills
- samtools-bam-processing — BAM sorting and indexing rules commonly used in Snakemake pipelines
- bedtools-genomic-intervals — interval operations in downstream annotation rules
- neuropixels-analysis — example of a complex multi-step pipeline that benefits from Snakemake
References
- Snakemake documentation — rules, wildcards, profiles, API reference
- Snakemake GitHub — source, releases, issue tracker
- Mölder et al. (2021) "Sustainable data analysis with Snakemake" — F1000Research 10:33
- Snakemake workflow catalog — community-maintained reference pipelines