name: "bedtools-genomic-intervals" description: "Genomic interval ops on BED/BAM/GFF/VCF. Find overlaps, merge intervals, compute coverage, extract FASTA, find nearest features. Core for ChIP-seq peak annotation, region filtering, genome arithmetic. Use tabix for indexed single-region queries; use deeptools for normalized bigWig coverage." license: "GPL-2.0"
bedtools — Genomic Interval Analysis Toolkit
Overview
bedtools is the standard toolkit for operating on genomic intervals in BED, BAM, GFF, and VCF formats. It solves the core problem of genome arithmetic: finding overlaps between feature sets, computing coverage, extracting sequences, merging adjacent regions, and annotating features with nearest neighbors. bedtools operates on sorted coordinate lists and runs at C speed, making it practical for whole-genome analyses.
When to Use
- Intersecting ChIP-seq peaks with gene annotations to find promoter-overlapping peaks
- Merging overlapping ATAC-seq peaks or called regions across replicates
- Computing read coverage depth over target capture regions
- Extracting FASTA sequences for motif discovery or primer design
- Finding the nearest gene to each regulatory element or variant
- Subtracting blacklist or repeat regions from peak calls
- Expanding genomic intervals by fixed distance (promoter regions)
- Use
tabixinstead for fast indexed queries of a single genomic region - For normalized coverage bigWig tracks, use
deeptools bamCoverageinstead - Use
mosdepthinstead for whole-genome per-base depth (10× faster)
Prerequisites
- Python packages: None required (command-line only)
- Input requirements: BED/BAM/GFF/VCF files; FASTA reference for
getfasta; genome file (chromosome sizes) forslop/flank/genomecov - Sorting: Most operations require coordinate-sorted input
Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v bedtoolsfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run bedtoolsrather than barebedtools.
# Bioconda (recommended)
conda install -c bioconda bedtools
# Homebrew (macOS)
brew install bedtools
# Verify
bedtools --version
# bedtools v2.31.0
# Create genome file from FASTA index
samtools faidx reference.fa
cut -f1,2 reference.fa.fai > genome.txt # chr → size table
Quick Start
# Find peaks overlapping genes, then merge overlapping peaks
bedtools intersect -a peaks.bed -b genes.bed -wa -wb > peaks_with_genes.bed
bedtools merge -i peaks.bed > merged_peaks.bed
bedtools coverage -a genes.bed -b reads.bam > gene_coverage.bed
Core API
Module 1: Interval Intersection and Overlap Analysis
Find regions that overlap between two feature sets.
# Basic intersection: output overlapping regions
bedtools intersect -a peaks.bed -b genes.bed
# Report original A and B features for each overlap
bedtools intersect -a peaks.bed -b genes.bed -wa -wb
# Count B overlaps per A feature (adds column)
bedtools intersect -a peaks.bed -b genes.bed -c
# Output: chr1 1000 2000 peak1 gene_count
# Peaks with ANY overlap (report each peak once)
bedtools intersect -a peaks.bed -b genes.bed -u
# Peaks with NO overlap in B (invert filter)
bedtools intersect -a peaks.bed -b blacklist.bed -v
# Require reciprocal 50% overlap both ways
bedtools intersect -a exp1.bed -b exp2.bed -f 0.5 -F 0.5 -r
# Same-strand intersections only
bedtools intersect -a peaks.bed -b genes.bed -s
# Multiple database files with overlap counts per file
bedtools intersect -a query.bed -b enhancers.bed promoters.bed \
-names enh prom -C
# Memory-efficient mode for pre-sorted large files
bedtools intersect -a sorted_peaks.bed -b sorted_genes.bed -sorted
Module 2: Interval Merging and Arithmetic
Combine overlapping intervals and perform set operations.
# Merge overlapping and adjacent intervals
sort -k1,1 -k2,2n peaks.bed | bedtools merge -i stdin
# Merge intervals within 500 bp of each other
bedtools merge -i peaks.bed -d 500
# Merge and count original features
bedtools merge -i peaks.bed -c 1 -o count
# Output: chr1 1000 5000 3 (3 original peaks merged)
# Merge and collapse feature names
bedtools merge -i peaks.bed -c 4 -o collapse -delim ";"
# Output: chr1 1000 5000 peak1;peak2;peak3
# Subtract B from A (remove covered bases)
bedtools subtract -a peaks.bed -b blacklist.bed
# Remove entire A feature if ANY B overlap
bedtools subtract -a peaks.bed -b exclusion.bed -A
# Find genomic gaps (complement of covered regions)
bedtools complement -i merged.bed -g genome.txt
Module 3: Coverage Analysis
Calculate depth and breadth of read coverage over features.
# Coverage stats per feature (count, bases covered, % covered)
bedtools coverage -a target_genes.bed -b aligned.bam
# Output: chr start end gene n_overlapping_reads bases_covered feature_len fraction_covered
# Per-base depth within each feature
bedtools coverage -a targets.bed -b aligned.bam -d
# Output: chr start end name position depth
# Coverage histogram per feature
bedtools coverage -a features.bed -b aligned.bam -hist
# Genome-wide BEDGRAPH (coverage per bin)
bedtools genomecov -ibam aligned.bam -bg -o coverage.bedgraph
# Include zero-coverage regions (for whole-genome coverage)
bedtools genomecov -ibam aligned.bam -bga > full_coverage.bedgraph
# Per-base depth for whole genome
bedtools genomecov -ibam aligned.bam -d > depth.txt
# Scaled BEDGRAPH (RPM normalization: total=50M reads → scale=1/50)
bedtools genomecov -ibam aligned.bam -bg -scale 0.00000002 > rpm.bedgraph
# Strand-specific coverage tracks
bedtools genomecov -ibam rnaseq.bam -bg -strand + > forward.bedgraph
bedtools genomecov -ibam rnaseq.bam -bg -strand - > reverse.bedgraph
Module 4: Sequence Extraction and Nearest Feature
Extract genomic sequences and annotate features with neighbors.
# Extract FASTA sequences for each BED region
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fasta
# Strand-aware extraction (reverse complement - strand)
bedtools getfasta -fi genome.fa -bed regions.bed -s -fo stranded.fasta
# Custom FASTA headers (name + coords)
bedtools getfasta -fi genome.fa -bed peaks.bed -name -fo named.fasta
# Extract and concatenate exons (BED12 spliced transcripts)
bedtools getfasta -fi genome.fa -bed transcripts.bed12 -split -fo exons.fasta
# Find nearest gene to each peak (with distance)
bedtools closest -a peaks.bed -b genes.bed -d
# Output: peak fields... | gene fields... | distance_bp
# Nearest feature on same strand only
bedtools closest -a peaks.bed -b genes.bed -s -d
# Ignore overlapping features (find nearest non-overlapping)
bedtools closest -a peaks.bed -b genes.bed -io -d
# Multiple annotation databases
bedtools closest -a query.bed -b genes.bed enhancers.bed \
-names genes enhancers -d
Module 5: Interval Manipulation
Expand, contract, and shift genomic intervals.
# Expand regions by 500 bp on each side
bedtools slop -i peaks.bed -g genome.txt -b 500
# Asymmetric: 2000 bp upstream, 500 bp downstream of TSS
bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500
# Strand-aware expansion (upstream = 5' side)
bedtools slop -i genes.bed -g genome.txt -l 1000 -r 200 -s
# Create flanking regions (not overlapping the feature)
bedtools flank -i genes.bed -g genome.txt -b 1000
bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s # upstream only
Key Concepts
Coordinate Systems
BED files use 0-based half-open intervals: start is 0-indexed (like Python), end is exclusive. A region chr1:1000-2000 in BED covers bases 1000–1999 (1000 bases).
chr1 1000 2000 peak1 ← covers positions 1000,1001,...,1999
# BED: 0-based start, exclusive end
# VCF: 1-based position (POS)
# GFF: 1-based start and end (both inclusive)
bedtools converts internally — input format is auto-detected. Problems arise when mixing tools with different conventions.
Sorting Requirements
Most bedtools operations require coordinate-sorted input. Pre-sort with:
sort -k1,1 -k2,2n input.bed > sorted.bed
# For large files, use -S 4G for 4 GB sort buffer
sort -k1,1 -k2,2n -S 4G --parallel=8 input.bed > sorted.bed
The -sorted flag in bedtools intersect uses a sweep algorithm that requires sorted input but uses O(1) memory instead of O(N).
Common Workflows
Workflow 1: ChIP-seq Peak Annotation
Goal: Annotate peaks with overlapping genes, distances to TSS, and filter blacklisted regions.
#!/bin/bash
PEAKS="peaks.bed"
GENES="refseq_genes.bed"
TSS="refseq_tss.bed" # BED with TSS positions
BLACKLIST="encode_blacklist_hg38.bed"
GENOME="hg38.genome"
# 1. Remove blacklisted regions
bedtools subtract -a $PEAKS -b $BLACKLIST -A > peaks_clean.bed
echo "After blacklist filter: $(wc -l < peaks_clean.bed) peaks"
# 2. Annotate with overlapping gene (allow 2 kb from gene body)
bedtools slop -i $GENES -g $GENOME -b 2000 > genes_padded.bed
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -wa -wb \
> peaks_gene_overlap.bed
# 3. For non-overlapping peaks: find nearest gene
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -v > peaks_distal.bed
bedtools closest -a peaks_distal.bed -b $TSS -d > peaks_distal_nearest.bed
echo "Promoter peaks: $(wc -l < peaks_gene_overlap.bed)"
echo "Distal peaks: $(wc -l < peaks_distal.bed)"
Workflow 2: Coverage Analysis for WES Target Regions
Goal: Calculate on-target read depth and coverage breadth for exome sequencing QC.
#!/bin/bash
BAM="sample.deduped.bam"
TARGETS="capture_targets.bed"
# Per-target coverage statistics
bedtools coverage -a $TARGETS -b $BAM > per_target_coverage.bed
# Summary: total targets, mean depth, % at ≥20×
awk 'BEGIN{n=0; depth=0; covered=0}
{n++; depth+=$7; if($8>=20) covered++}
END{printf "Targets: %d\nMean depth: %.1f×\n%% at 20×: %.1f%%\n",
n, depth/n, covered/n*100}' per_target_coverage.bed
# Per-base depth for IGV visualization
bedtools coverage -a $TARGETS -b $BAM -d > per_base_depth.txt
echo "Per-base depth written to per_base_depth.txt"
Key Parameters
| Parameter | Command | Default | Range/Options | Effect |
|---|---|---|---|---|
-f | intersect, coverage | 1e-9 | 0.0–1.0 | Min fraction of A that must overlap |
-F | intersect, coverage | 1e-9 | 0.0–1.0 | Min fraction of B that must overlap |
-r | intersect | — | flag | Require reciprocal overlap (-f AND -F) |
-s | Most | — | flag | Strand-aware (same strand only) |
-v | intersect | — | flag | Report features with NO overlap (invert) |
-c | intersect | — | flag | Append overlap count per A feature |
-d | merge | 0 | integer | Max gap to merge (bp) |
-bg | genomecov | — | flag | BEDGRAPH output format |
-scale | genomecov | 1.0 | float | Multiply coverage by constant (for RPM) |
-sorted | intersect, closest | — | flag | Use sweep algorithm (sorted input required) |
-b | slop | — | integer | Expand interval by N bp on both sides |
-D | closest | — | ref/a/b | Report signed distance (upstream negative) |
Best Practices
-
Always sort before bedtools: Most bedtools commands fail silently on unsorted input. Sort with
sort -k1,1 -k2,2n input.bedbefore any bedtools operation. -
Use
-sortedfor large files: For pre-sorted files,-sortedreduces memory from O(N) to O(1). Required when intersecting multi-gigabyte BED files. -
Check chromosome naming consistency: The single most common failure — some tools use
chr1, others use1. Verify withcut -f1 file.bed | sort -ubefore running intersections. -
Apply blacklist early: Run
bedtools subtract -b blacklist.bed -Abefore any peak analysis. ENCODE blacklists remove artifactual signal in repetitive/high-copy regions. -
Use
-f 0.5 -rfor peak reproducibility: When intersecting peaks across replicates, require 50% reciprocal overlap to avoid spurious short overlaps at interval boundaries. -
Validate BED format: Malformed BED (wrong column count, text in numeric columns) causes silent failures. Test with
bedtools merge -i file.bed 2>&1 | head -5.
Common Recipes
Recipe: Count Feature Overlaps Across Annotation Categories
# Report how many peaks overlap each category (genes, promoters, enhancers)
for category in genes.bed promoters.bed enhancers.bed repeats.bed; do
label=$(basename $category .bed)
count=$(bedtools intersect -a peaks.bed -b $category -u | wc -l)
total=$(wc -l < peaks.bed)
echo "$label: $count/$total ($(echo "scale=1; $count*100/$total" | bc)%)"
done
Recipe: Create Promoter Regions from Gene Annotations
# Extract 2kb upstream of TSS for ChIP annotation
# For genes on + strand: TSS = start; on - strand: TSS = end
awk 'BEGIN{OFS="\t"} $6=="+" {print $1,$2,$2+1,$4,$5,$6}
$6=="-" {print $1,$3-1,$3,$4,$5,$6}' genes.bed > tss.bed
bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500 -s > promoters.bed
echo "Created $(wc -l < promoters.bed) promoter regions"
Recipe: Calculate Intersection Statistics
# Jaccard similarity between two peak sets (0=no overlap, 1=identical)
bedtools sort -i set1.bed > s1.bed
bedtools sort -i set2.bed > s2.bed
bedtools jaccard -a s1.bed -b s2.bed
# Output: intersection union jaccard n_intersections
# 423456 2345678 0.1804 892
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty intersect output | Chromosome name mismatch (chr1 vs 1) | Check: cut -f1 a.bed | sort -u vs cut -f1 b.bed | sort -u |
| Memory error on large files | Not using -sorted flag | Pre-sort inputs and add -sorted to intersect/closest |
getfasta: sequence not found | FASTA headers differ from BED chr names | Index FASTA: samtools faidx genome.fa; match names exactly |
| Zero coverage everywhere | BAM not indexed or BED/BAM chr mismatch | Run samtools index file.bam; verify chr naming |
| Merge doesn't merge expected features | Input not sorted by coordinate | Sort: sort -k1,1 -k2,2n file.bed | bedtools merge -i stdin |
getfasta produces wrong-strand sequence | Using -s without strand column in BED | Ensure BED col 6 has +/-; add strand: awk '{$6="+"; print}' OFS="\t" |
| Off-by-one in coordinates | Mixing 0-based BED and 1-based VCF/GFF | Convert GFF to BED: subtract 1 from start |
| Slow on large genomes | Processing unsorted files | Sort both files; use -sorted; pipe through sort without writing temp files |
Related Skills
- samtools-bam-processing — BAM sorting, filtering, and QC before passing to bedtools
- deeptools-ngs-analysis — normalized coverage bigWig tracks and heatmaps from the BAM files bedtools processes
- pysam-genomic-files — Python-native BAM/BED manipulation for custom logic beyond bedtools
References
- bedtools documentation — command reference and examples
- GitHub: arq5x/bedtools2 — source, releases, issue tracker
- Quinlan & Hall (2010) "BEDTools: a flexible suite of utilities for comparing genomic features" — Bioinformatics 26(6)
- ENCODE blacklist regions — curated problematic genomic regions