Cancer Multi-Omics Analysis

Integrated analysis of expression, mutation, copy number, and methylation data from TCGA and GEO for solid tumor characterization.

When to Use This Skill

Activate when the user requests:

TCGA data download and processing for any cancer type
Differential expression analysis between tumor conditions
Pathway and gene set enrichment analysis
Mutation landscape analysis (oncoplot, signatures, co-occurrence)
Copy number variation analysis
DNA methylation analysis (450K/EPIC arrays)
Integration of two or more omics layers

Inputs

Data Type	Format	Source
Expression	Raw counts (STAR - Counts)	TCGA via TCGAbiolinks (v2.38+), GEO via GEOquery
Mutations	MAF	TCGA GDC, cBioPortal
Copy number	Segment files, GISTIC2.0 output	TCGA GDC, GDAC Firehose
Methylation	IDAT files or beta-value matrices	TCGA GDC, GEO
Clinical	Tabular	TCGA GDC, cBioPortal

Expression Analysis

Data Retrieval

library(TCGAbiolinks)  # v2.38.0+, Bioconductor 3.22

query <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
GDCdownload(query, directory = "GDCdata")
se <- GDCprepare(query, directory = "GDCdata")
# Returns SummarizedExperiment; raw counts in assay(se, "unstranded")
# TPM in assay(se, "tpm_unstrand") — use for visualization only, never for DE

# Extract clinical data
clinical <- as.data.frame(colData(se))

GEO Data Retrieval (Alternative)

library(GEOquery)
gse <- getGEO("GSE72094", GSEMatrix = TRUE)[[1]]
# For count data from GEO Supplementary files:
# Download counts matrix manually, read with read.csv/read.delim

Normalization Decision Tree

Input type?
  Raw counts (HTSeq, STAR)
    -> DE analysis: feed directly to DESeq2 (handles normalization internally)
    -> Visualization (PCA, heatmap): apply vst() or rlog() from DESeq2
    -> Cross-sample comparison: vst() preferred for n > 30 samples (faster than rlog)
  TPM/FPKM (already normalized)
    -> DE analysis: STOP. Go back, get raw counts. TPM/FPKM invalid for DESeq2/edgeR.
    -> Correlation/visualization: log2(TPM + 1), acceptable
    -> Gene set scoring (ssGSEA, GSVA): TPM acceptable as input
  RSEM expected counts (non-integer)
    -> round() before DESeq2: DESeqDataSetFromMatrix(countData = round(counts), ...)

Differential Expression (DESeq2 v1.50+)

library(DESeq2)  # v1.50.2+, Bioconductor 3.22
set.seed(42)

counts_mat <- assay(se, "unstranded")
col_data <- colData(se)

# Ensure consistent sample ordering
stopifnot(identical(colnames(counts_mat), rownames(col_data)))

dds <- DESeqDataSetFromMatrix(
  countData = counts_mat,
  colData = col_data,
  design = ~ sample_type  # "Primary Tumor" vs "Solid Tissue Normal"
)

# Pre-filter: keep genes with >= 10 counts in at least 5 samples
# More stringent than rowSums >= 10; reduces false discoveries
keep <- rowSums(counts(dds) >= 10) >= 5
dds <- dds[keep, ]

# Set reference level explicitly
dds$sample_type <- relevel(factor(dds$sample_type), ref = "Solid Tissue Normal")

dds <- DESeq(dds)

# Check available contrasts before extracting results
resultsNames(dds)

res <- results(dds,
  name = "sample_type_Primary.Tumor_vs_Solid.Tissue.Normal",
  alpha = 0.05
)

# LFC shrinkage for ranking and visualization (not for significance testing)
# apeglm is recommended: Cauchy prior, uses likelihood directly
res_shrunk <- lfcShrink(dds,
  coef = "sample_type_Primary.Tumor_vs_Solid.Tissue.Normal",
  type = "apeglm"
)

# Export significant genes
sig <- subset(as.data.frame(res_shrunk), padj < 0.05 & abs(log2FoldChange) > 1)
sig <- sig[order(sig$padj), ]
write.csv(sig, "results/significant_degs.csv")
message(sprintf("DEGs (padj<0.05, |LFC|>1): %d up, %d down",
  sum(sig$log2FoldChange > 0), sum(sig$log2FoldChange < 0)))

Multi-Factor Design (Batch + Condition)

# When batch information is available, include in design
dds <- DESeqDataSetFromMatrix(
  countData = counts_mat, colData = col_data,
  design = ~ batch + sample_type  # batch first, condition last
)
# DESeq2 accounts for batch in the model; no separate correction needed for DE

Batch Correction for Visualization

Batch correction decision:
  For DE testing -> Include batch in DESeq2 design formula. Do NOT pre-correct.
  For PCA/heatmap (log-scale data) -> limma::removeBatchEffect on vst/rlog values
  For downstream tools requiring counts -> ComBat-seq (sva package) on raw counts
  NEVER: ComBat (original) on raw counts. NEVER: removeBatchEffect before DE.

# Visualization with batch removal
vsd <- vst(dds, blind = FALSE)
library(limma)
assay(vsd) <- removeBatchEffect(assay(vsd), batch = vsd$batch)
# Use this corrected vsd for PCA and heatmaps only

# ComBat-seq for count-level correction (when other tools need corrected counts)
library(sva)
adjusted_counts <- ComBat_seq(
  counts = as.matrix(counts_mat),
  batch = col_data$batch,
  group = col_data$sample_type  # preserve biological signal
)

Gene ID Conversion

# TCGA uses Ensembl IDs (ENSG...). Convert to symbols for downstream analysis.
library(org.Hs.eg.db)

# Method 1: AnnotationDbi (fast, local)
gene_symbols <- mapIds(org.Hs.eg.db,
  keys = sub("\\..*", "", rownames(res)),  # strip version suffix
  keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first")

# Remove unmapped and duplicates
res$symbol <- gene_symbols[sub("\\..*", "", rownames(res))]
res <- res[!is.na(res$symbol) & !duplicated(res$symbol), ]

Pathway and Enrichment Analysis

Over-Representation Analysis (ORA)

library(clusterProfiler)
library(org.Hs.eg.db)

# ORA on significant DEGs — use when you have a clear gene list
sig_genes <- res$symbol[res$padj < 0.05 & abs(res$log2FoldChange) > 1]

ego <- enrichGO(
  gene = sig_genes, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
  ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.2
)

# KEGG enrichment (requires Entrez IDs)
entrez_ids <- mapIds(org.Hs.eg.db, keys = sig_genes,
  keytype = "SYMBOL", column = "ENTREZID")
ekegg <- enrichKEGG(gene = entrez_ids[!is.na(entrez_ids)],
  organism = "hsa", pvalueCutoff = 0.05)

Gene Set Enrichment Analysis (GSEA)

# GSEA on the FULL ranked gene list — never pre-filter
# Rank by: sign(log2FC) * -log10(pvalue) or stat column from DESeq2
gene_list <- res_shrunk$log2FoldChange
names(gene_list) <- res_shrunk$symbol
gene_list <- sort(gene_list[!is.na(names(gene_list))], decreasing = TRUE)

# Remove duplicates (keep highest absolute value)
gene_list <- gene_list[!duplicated(names(gene_list))]

gsea_go <- gseGO(
  geneList = gene_list, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
  ont = "BP", minGSSize = 15, maxGSSize = 500,
  pvalueCutoff = 0.05, eps = 0, seed = TRUE
)

MSigDB Gene Sets (Hallmark, C2, C5)

library(msigdbr)
# Hallmark gene sets (most interpretable for cancer biology)
h_sets <- msigdbr(species = "Homo sapiens", category = "H")
h_list <- split(h_sets$gene_symbol, h_sets$gs_name)

gsea_h <- GSEA(
  geneList = gene_list,
  TERM2GENE = h_sets[, c("gs_name", "gene_symbol")],
  minGSSize = 15, maxGSSize = 500,
  pvalueCutoff = 0.05, eps = 0, seed = TRUE
)

Sample-Level Pathway Scoring (GSVA/ssGSEA)

library(GSVA)
# GSVA transforms expression matrix to pathway activity scores per sample
# Input: log2-transformed expression (TPM or vst counts), not raw counts
expr_mat <- assay(vst(dds, blind = FALSE))

gsva_params <- gsvaParam(expr_mat, h_list, kcdf = "Gaussian")
gsva_scores <- gsva(gsva_params)
# Output: pathway x sample matrix of enrichment scores
# Use for: clustering, survival association, heatmaps

Visualization

library(ggplot2); library(ggrepel); library(ComplexHeatmap)

# --- PCA ---
vsd <- vst(dds, blind = FALSE)
pcaData <- plotPCA(vsd, intgroup = "sample_type", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = sample_type)) +
  geom_point(size = 2) +
  labs(x = paste0("PC1: ", percentVar[1], "%"),
       y = paste0("PC2: ", percentVar[2], "%")) +
  theme_classic(base_size = 12)

# --- Volcano plot ---
df <- as.data.frame(res_shrunk)
df$sig <- ifelse(df$padj < 0.05 & abs(df$log2FoldChange) > 1,
  ifelse(df$log2FoldChange > 0, "Up", "Down"), "NS")
top10 <- head(df[order(df$padj), ], 10)
ggplot(df, aes(log2FoldChange, -log10(padj), color = sig)) +
  geom_point(size = 0.5, alpha = 0.5) +
  scale_color_manual(values = c(Up = "#D55E00", Down = "#0072B2", NS = "grey70")) +
  geom_text_repel(data = top10, aes(label = rownames(top10)),
    size = 3, color = "black", max.overlaps = 15) +
  theme_classic(base_size = 12)

Output Specification

Output	Format	Description
`significant_degs.csv`	CSV	DEGs with log2FC, padj, gene symbol
`dds.rds`	RDS	DESeqDataSet object for downstream use
`vsd.rds`	RDS	VST-transformed data for visualization
`gsva_scores.rds`	RDS	Pathway x sample enrichment matrix
`pca_plot.pdf`	PDF	PCA colored by condition
`volcano_plot.pdf`	PDF	Volcano with top genes labeled

Validation Checks

After running expression analysis, verify:
  DEG count: Expect 1000-5000 DEGs for tumor vs normal (varies by cancer type)
    TCGA-LUAD: ~3000-4000 DEGs at padj<0.05, |LFC|>1
    < 100 DEGs: check design formula, sample labels, reference level
    > 10000 DEGs: check for confounders, tumor purity bias
  PCA: tumor and normal should separate on PC1 or PC2
  Known markers: TP53, EGFR, KRAS mutations reflected in expression?
  Direction: oncogenes should trend up in tumor, TSGs down
  Volcano symmetry: roughly symmetric unless strong biological bias expected

Mutation Analysis

Data Retrieval

library(TCGAbiolinks)
library(maftools)  # v2.22+, Bioconductor 3.22

# Method 1: GDCquery_Maf (downloads open-access MAF, hg38-aligned)
# Pipelines: mutect2, muse, varscan2, somaticsniper
maf_df <- GDCquery_Maf("LUAD", pipelines = "mutect2")
maf <- read.maf(maf = maf_df, clinicalData = clinical_df)

# Method 2: GDC query (more control over filters)
query_maf <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  access = "open"
)
GDCdownload(query_maf, directory = "GDCdata")
maf_file <- GDCprepare(query_maf, directory = "GDCdata")
maf <- read.maf(maf = maf_file, clinicalData = clinical_df)

Pipeline Selection

Which caller to use?
  Multi-caller consensus (MC3) -> Most stringent, used in TCGA PanCancer Atlas
    Access via: maftools::tcgaLoad(study = "LUAD")
  Single caller:
    MuTect2 -> Best sensitivity for low-frequency variants
    MuSE -> Good for paired tumor-normal, models tumor heterogeneity
    VarScan2 -> Works with and without matched normal
    SomaticSniper -> Bayesian, good specificity
  For most analyses -> MuTect2 or MC3 consensus

Visualization

plotmafSummary(maf, rmOutlier = TRUE, addStat = "median")

oncoplot(maf, top = 20,
  clinicalFeatures = c("ajcc_pathologic_stage", "gender"),
  sortByAnnotation = TRUE)

# Lollipop: protein domain plot for a single gene
lollipopPlot(maf, gene = "TP53", AACol = "HGVSp_Short",
  showMutationRate = TRUE)

# Rainfall: inter-mutation distance, detects kataegis (localized hypermutation)
rainfallPlot(maf, detectChangePoints = TRUE, pointSize = 0.4)

Tumor Mutation Burden

# TMB = total nonsynonymous mutations / capture region size (MB)
laml.tmb <- tmb(maf, captureSize = 50, logScale = TRUE)

# Compare against all 33 TCGA cohorts (uses MC3 data, 35.8 MB capture)
tcgaCompare(maf, cohortName = "LUAD-study", logscale = TRUE,
  capture_size = 50)
# Useful for: immunotherapy response prediction (TMB-high cutoff varies by cancer type)

Somatic Interactions

# Pairwise Fisher's exact test for co-occurrence and mutual exclusivity
somaticInteractions(maf, top = 25, pvalue = c(0.05, 0.01))
# Green = co-occurring, brown = mutually exclusive
# Classic examples: TP53 + KRAS co-occurrence in LUAD; EGFR vs KRAS mutual exclusivity

Clinical Enrichment

# Test which mutations are enriched in clinical subgroups
ce <- clinicalEnrichment(maf, clinicalFeature = "ajcc_pathologic_stage")
ce$groupwise_comparision[p_value < 0.05]
plotEnrichmentResults(ce, pvalue = 0.05, geneFontSize = 0.8)

Mutational Signatures

library(BSgenome.Hsapiens.UCSC.hg38)

# Step 1: Extract trinucleotide context matrix (96 substitution classes)
# Needs >= 30 samples for reliable decomposition
tnm <- trinucleotideMatrix(maf, prefix = "chr", add = TRUE,
  ref_genome = "BSgenome.Hsapiens.UCSC.hg38")

# Step 2: Estimate optimal number of signatures (cophenetic correlation)
est <- estimateSignatures(mat = tnm, nTry = 6)
# Look for the elbow in the cophenetic plot

# Step 3: Extract n signatures via NMF
sig <- extractSignatures(mat = tnm, n = 3)  # n from step 2

# Step 4: Compare to COSMIC
# "legacy" = 30 COSMIC v2 signatures, "SBS" = 65 COSMIC v3 signatures
cosim <- compareSignatures(nmfRes = sig, sig_db = "SBS")
# cosim$cosine_similarities: sample x COSMIC matrix
# Match: cosine similarity > 0.85 is a confident assignment

plotSignatures(sig, contributions = FALSE)  # signature profiles
plotSignatures(sig, contributions = TRUE)   # per-sample contributions

Driver Gene Detection

# oncodrive: identifies genes with clustered mutations (positional clustering)
# Not dN/dS — uses the concept that driver mutations cluster in specific protein regions
oncodrive_res <- oncodrive(maf, AACol = "HGVSp_Short", minMut = 5,
  pvalMethod = "zscore")
plotOncodrive(oncodrive_res, fdrCutOff = 0.1, useFraction = TRUE)

# For proper dN/dS analysis, use the dndscv package separately:
# library(dndscv); dndsout <- dndscv(mutations_df)

Copy Number Analysis

Data Retrieval

library(TCGAbiolinks)

# Download masked copy number segments (preferred: excludes germline-prone regions)
query_cnv <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Copy Number Variation",
  data.type = "Masked Copy Number Segment"
)
GDCdownload(query_cnv, directory = "GDCdata")
cnv_seg <- GDCprepare(query_cnv, directory = "GDCdata")
# Returns: Sample, Chromosome, Start, End, Num_Probes, Segment_Mean
# Segment_Mean is log2(copy_number / 2): 0 = diploid, >0 = gain, <0 = loss

Segment Interpretation

Segment_Mean thresholds (log2 ratio):
  > 0.3   -> Gain
  > 0.7   -> Amplification (high-level gain)
  < -0.3  -> Loss (heterozygous deletion)
  < -0.7  -> Deep deletion (likely homozygous)
  These are conventions, not absolute cutoffs. Adjust based on tumor purity.

GISTIC2.0 Analysis

# GISTIC identifies recurrent focal CNV events across a cohort
# Pre-computed GISTIC results available from Firehose/GDAC for most TCGA cancers:
# https://gdac.broadinstitute.org/

# Read GISTIC output with maftools
gistic <- readGistic(
  gisticAllLesionsFile = "all_lesions.conf_99.txt",
  gisticAmpGenesFile = "amp_genes.conf_99.txt",
  gisticDelGenesFile = "del_genes.conf_99.txt",
  gisticScoresFile = "scores.gistic",
  isTCGA = TRUE
)

gisticChromPlot(gistic)            # genome-wide q-values per arm
gisticBubblePlot(gistic)           # focal events with significance
gisticOncoPlot(gistic, top = 10)   # oncoplot of recurrent CNVs

Gene-Level Copy Number from Segments

# Map segments to genes when GISTIC output is not available
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
seg_gr <- GRanges(
  seqnames = paste0("chr", cnv_seg$Chromosome),
  ranges = IRanges(start = cnv_seg$Start, end = cnv_seg$End),
  segment_mean = cnv_seg$Segment_Mean,
  sample = cnv_seg$Sample
)

# Find overlaps: assign segment value to each gene
hits <- findOverlaps(genes, seg_gr)
# For genes spanning multiple segments, use the segment with largest overlap

Integration: Mutation + CNV

# Add CNV data to MAF object for combined oncoplot
# cnv_table: data.frame with columns: Gene, Sample_name, CN (amp/del)
oncoplot(maf, top = 20,
  additionalFeature = list(cnv_data),
  additionalFeaturePch = 15,
  additionalFeatureCol = c(amp = "red", del = "blue"))

Methylation Analysis

Data Retrieval (TCGA)

library(TCGAbiolinks)

# Download 450K methylation data from GDC (harmonized, hg38)
query_met <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "DNA Methylation",
  platform = "Illumina Human Methylation 450",
  sample.type = "Primary Tumor"
)
GDCdownload(query_met)
met_se <- GDCprepare(query_met)
# Returns SummarizedExperiment; beta values as assay matrix
# rowRanges contains probe genomic coordinates (hg38)

# Also download matched normals for differential analysis
query_met_normal <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "DNA Methylation",
  platform = "Illumina Human Methylation 450",
  sample.type = "Solid Tissue Normal"
)

Processing with minfi

library(minfi)

# --- From IDAT files (non-TCGA data) ---
targets <- read.metharray.sheet("/path/to/idats/", pattern = "csv$")
rgSet <- read.metharray.exp(targets = targets, extended = TRUE)
# extended = TRUE needed for Noob background correction

# --- Detection p-values (probe reliability) ---
detP <- detectionP(rgSet)
# Remove samples with mean detection p > 0.01
keep_samples <- colMeans(detP) < 0.01
rgSet <- rgSet[, keep_samples]

# --- Normalization ---
# Funnorm: best for cancer (large methylation differences between groups)
# Uses control probes to remove technical variation via PCA
grSet <- preprocessFunnorm(rgSet, nPCs = 2, bgCorr = TRUE, dyeCorr = TRUE)
# Returns GenomicRatioSet with beta and M-values

# Alternative: preprocessQuantile for subtle differences (cell type studies)
# Alternative: preprocessNoob for single-sample normalization (clinical)

Probe Filtering

# Order matters: filter after normalization

# 1. Failed probes (detection p > 0.01 in > 5% of samples)
keep <- rowSums(detP[rownames(grSet), colnames(grSet)] < 0.01) >
  (ncol(grSet) * 0.95)
grSet <- grSet[keep, ]

# 2. SNP-affected probes (polymorphisms at CpG site or single-base extension)
grSet <- dropLociWithSnps(grSet, snps = c("SBE", "CpG"), maf = 0)

# 3. Cross-reactive probes (map to multiple genomic locations)
# Use DMRcate's built-in filter or maxprobes package
library(DMRcate)
grSet <- rmSNPandCH(grSet, rmcrosshyb = TRUE, rmXY = FALSE)

# 4. Sex chromosome probes (remove for mixed-sex cohorts)
grSet <- grSet[!(seqnames(grSet) %in% c("chrX", "chrY")), ]

# Typical probe counts after filtering:
# 450K: ~485K -> ~420-440K retained
# EPIC: ~866K -> ~780-810K retained

Alternative: ChAMP Pipeline

library(ChAMP)

# All-in-one loading with built-in filtering
# Handles detection p, SNPs, cross-reactive, sex chromosomes in one call
myLoad <- champ.load(
  directory = "/path/to/idats/",
  arraytype = "EPIC",         # or "450K"
  filterDetP = TRUE,          # detection p-value filter
  detPcut = 0.01,
  filterSNPs = TRUE,          # remove SNP-affected probes
  filterMultiHit = TRUE,      # remove cross-reactive probes
  filterXY = TRUE,            # remove sex chromosome probes
  filterBeads = TRUE,         # remove low bead count probes
  filterNoCG = TRUE           # remove non-CpG probes
)

# Normalization: BMIQ corrects Type I/II probe bias
myNorm <- champ.norm(beta = myLoad$beta, arraytype = "EPIC", method = "BMIQ")

# Batch effect detection
champ.SVD(beta = myNorm, pd = myLoad$pd)

# Batch correction if needed
myNorm <- champ.runCombat(beta = myNorm, pd = myLoad$pd, batchname = c("Slide"))

When to use which:
  minfi -> Fine-grained control, Funnorm (best for cancer), EPIC v2 support
  ChAMP -> Rapid standardized pipeline, built-in filtering + batch correction
  Both use limma internally for DMP calling

Beta-Values vs M-Values

Beta-values: ratio = M / (M + U + 100), range [0, 1]
  Use for: biological interpretation, visualization, reporting delta-beta
  Problem: heteroscedastic (variance compressed at 0 and 1)

M-values: log2(M / U), range (-Inf, Inf)
  Use for: ALL statistical testing (limma, t-tests, linear models)
  Reason: approximately normal, satisfies linear model assumptions

Conversion: M = log2(beta / (1 - beta))
Always test on M-values, report delta-beta for biological effect size.

Differentially Methylated Positions (DMPs)

library(limma)

# Extract M-values for statistical testing
mVals <- getM(grSet)
bVals <- getBeta(grSet)

# Design matrix
group <- factor(pData(grSet)$sample_type,
  levels = c("Solid Tissue Normal", "Primary Tumor"))
design <- model.matrix(~ 0 + group)
colnames(design) <- c("Normal", "Tumor")

# Contrast
contMatrix <- makeContrasts(Tumor - Normal, levels = design)

# Fit — trend = TRUE is critical for M-values
fit <- lmFit(mVals, design)
fit2 <- contrasts.fit(fit, contMatrix)
fit2 <- eBayes(fit2, trend = TRUE)

# Results
dmps <- topTable(fit2, coef = 1, number = Inf, sort.by = "p")

# Add delta-beta for biological interpretation
dmps$deltaBeta <- rowMeans(bVals[rownames(dmps), group == "Primary Tumor"]) -
  rowMeans(bVals[rownames(dmps), group == "Solid Tissue Normal"])

# Filter: statistical + biological significance
sig_dmps <- dmps[dmps$adj.P.Val < 0.05 & abs(dmps$deltaBeta) > 0.2, ]
message(sprintf("Significant DMPs: %d hyper, %d hypo",
  sum(sig_dmps$deltaBeta > 0), sum(sig_dmps$deltaBeta < 0)))

Differentially Methylated Regions (DMRs)

library(DMRcate)

# DMRcate: recommended for arrays. Uses kernel smoothing of limma statistics.
# Faster and better benchmarked than bumphunter for 450K/EPIC data.

# Step 1: Annotate CpGs with limma statistics
myAnnotation <- cpg.annotate(
  datatype = "array",
  object = mVals,
  what = "M",
  analysis.type = "differential",
  design = design,
  contrasts = TRUE,
  cont.matrix = contMatrix,
  coef = "Tumor - Normal",
  arraytype = "450K",    # or "EPIC"
  fdr = 0.05
)

# Step 2: Find DMRs
dmResults <- dmrcate(myAnnotation,
  lambda = 1000,    # Gaussian kernel bandwidth (bp)
  C = 2,            # scaling factor
  min.cpgs = 3      # minimum CpGs per DMR
)

# Step 3: Extract results as GRanges
dmr_ranges <- extractRanges(dmResults)
# Columns: coord, no.cpgs, min_smoothed_fdr, Stouffer, HMFDR,
#   Fisher, maxdiff, meandiff, overlapping.genes

Methylation Subtypes (CIMP)

library(ConsensusClusterPlus)

# CIMP identification: cluster on CpG island probes with highest variance
ann <- getAnnotation(grSet)

# Select CpG island probes only
island_probes <- rownames(ann)[ann$Relation_to_Island == "Island"]
island_beta <- bVals[intersect(island_probes, rownames(bVals)), ]

# Top 5000 most variable CpG island probes
vars <- apply(island_beta, 1, var, na.rm = TRUE)
top_probes <- names(sort(vars, decreasing = TRUE))[1:5000]
mat <- island_beta[top_probes, ]

# Consensus clustering (k = 2-6, 1000 iterations)
results <- ConsensusClusterPlus(mat,
  maxK = 6, reps = 1000, pItem = 0.8, pFeature = 1,
  clusterAlg = "hc", distance = "euclidean",
  innerLinkage = "ward.D2", finalLinkage = "ward.D2",
  title = "CIMP_clustering", plot = "pdf", seed = 42)

# Optimal k: look at delta area plot and consensus CDF
# k = 2 or 3 typical: CIMP-high, CIMP-low, (non-CIMP)
cluster_assignments <- results[[3]]$consensusClass  # for k=3

Cancer-specific CIMP markers:
  Colorectal (COAD): CACNA1G, IGF2, NEUROG1, RUNX3, SOCS1 (5-marker panel)
    CIMP-H -> BRAF V600E, MLH1 hypermethylation, MSI-H
    CIMP-L -> KRAS mutations
  Glioblastoma (GBM): G-CIMP defined by IDH1 mutation + global hypermethylation
    G-CIMP -> proneural subtype, younger patients, better survival
  Pan-cancer: 19/26 TCGA cancer types have confirmed CIMP subtypes

Integration: Methylation-Expression Correlation

# --- Match samples across platforms ---
met_patients <- substr(colnames(bVals), 1, 12)
exp_patients <- substr(colnames(expr_mat), 1, 12)
common <- intersect(met_patients, exp_patients)

met_idx <- match(common, met_patients)
exp_idx <- match(common, exp_patients)

# --- Build promoter methylation per gene ---
ann <- getAnnotation(grSet)
promoter <- ann[grep("TSS200|TSS1500|1stExon", ann$UCSC_RefGene_Group), ]

# Probe-to-gene mapping (probes can map to multiple genes)
probe_gene <- data.frame(
  probe = rep(rownames(promoter),
    lengths(strsplit(promoter$UCSC_RefGene_Name, ";"))),
  gene = unlist(strsplit(promoter$UCSC_RefGene_Name, ";")),
  stringsAsFactors = FALSE
)
probe_gene <- unique(probe_gene[probe_gene$gene != "", ])
probe_gene <- probe_gene[probe_gene$probe %in% rownames(bVals), ]

# Average promoter beta per gene per sample
gene_meth <- tapply(seq_len(nrow(probe_gene)), probe_gene$gene, function(idx) {
  probes <- probe_gene$probe[idx]
  colMeans(bVals[probes, met_idx, drop = FALSE], na.rm = TRUE)
})
gene_meth_mat <- do.call(rbind, gene_meth)

# --- Spearman correlation per gene ---
exp_matched <- expr_mat[, exp_idx]
shared_genes <- intersect(rownames(gene_meth_mat), rownames(exp_matched))

cor_results <- t(sapply(shared_genes, function(g) {
  m <- as.numeric(gene_meth_mat[g, ])
  e <- as.numeric(exp_matched[g, ])
  ok <- !is.na(m) & !is.na(e)
  if (sum(ok) < 10) return(c(rho = NA, pval = NA))
  ct <- cor.test(m[ok], e[ok], method = "spearman", exact = FALSE)
  c(rho = ct$estimate, pval = ct$p.value)
}))
cor_df <- as.data.frame(cor_results)
cor_df$padj <- p.adjust(cor_df$pval, method = "BH")

# Genes silenced by promoter methylation
silenced <- cor_df[which(cor_df$rho < -0.3 & cor_df$padj < 0.05), ]

Pathway Analysis (Probe-Bias Corrected)

library(missMethyl)

# Standard GO/KEGG enrichment is biased for methylation arrays:
# genes with more probes are more likely to be called significant.
# gometh() uses Wallenius' noncentral hypergeometric test to correct this.

sig_cpgs <- rownames(sig_dmps)
all_cpgs <- rownames(dmps)

go_meth <- gometh(
  sig.cpg = sig_cpgs,
  all.cpg = all_cpgs,
  collection = "GO",
  array.type = "450K",     # or "EPIC"
  plot.bias = TRUE,        # diagnostic: shows probe-number bias
  prior.prob = TRUE         # applies bias correction
)

kegg_meth <- gometh(
  sig.cpg = sig_cpgs, all.cpg = all_cpgs,
  collection = "KEGG", array.type = "450K", prior.prob = TRUE
)
# Never use standard enrichGO/enrichKEGG on methylation probe lists

Visualization

library(ComplexHeatmap); library(circlize)

# --- DMP heatmap (top 100 probes) ---
top_probes <- rownames(sig_dmps)[1:100]
mat <- bVals[top_probes, ]

col_fun <- colorRamp2(c(0, 0.5, 1), c("#2166AC", "white", "#B2182B"))
ha <- HeatmapAnnotation(
  Group = pData(grSet)$sample_type,
  col = list(Group = c("Primary Tumor" = "#D55E00", "Solid Tissue Normal" = "#0072B2"))
)

Heatmap(mat, name = "Beta", col = col_fun, top_annotation = ha,
  show_row_names = FALSE, clustering_distance_rows = "euclidean",
  clustering_method_rows = "ward.D2",
  column_title = "Top 100 Differentially Methylated Positions")

# --- Methylation-expression scatter for a single gene ---
library(ggplot2)
gene <- "MGMT"
df <- data.frame(
  methylation = as.numeric(gene_meth_mat[gene, ]),
  expression = as.numeric(exp_matched[gene, ])
)
ggplot(df, aes(methylation, expression)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
  annotate("text", x = 0.8, y = max(df$expression, na.rm = TRUE),
    label = sprintf("rho = %.2f", cor(df$methylation, df$expression,
      method = "spearman", use = "complete.obs")), hjust = 1) +
  labs(x = paste0(gene, " promoter methylation (beta)"),
       y = paste0(gene, " expression (log2)")) +
  theme_classic(base_size = 12)

Output Specification

Output	Format	Description
`dmps.csv`	CSV	All tested probes with delta-beta, adj.P.Val, genomic annotation
`sig_dmps.csv`	CSV	Significant DMPs (adj.P.Val < 0.05, \|delta-beta\| > 0.2)
`dmr_ranges.rds`	RDS	GRanges of differentially methylated regions
`grSet.rds`	RDS	Filtered, normalized GenomicRatioSet
`cimp_clusters.csv`	CSV	Sample-to-cluster assignments from consensus clustering
`meth_expr_cor.csv`	CSV	Per-gene Spearman rho and adjusted p-values
`dmp_heatmap.pdf`	PDF	Top DMPs across samples

Validation Checks

After running methylation analysis, verify:
  DMP count: Expect 10,000-100,000 DMPs for tumor vs normal at adj.P.Val < 0.05
    < 1,000 DMPs: check normalization, sample labels, detection p filtering
    Global hypomethylation in tumors is expected (more hypo than hyper DMPs)
  Delta-beta distribution: most DMPs should have |delta-beta| between 0.1-0.5
    |delta-beta| > 0.7 at many probes: check for sample swaps or contamination
  Known markers:
    MGMT promoter methylation in GBM (predicts temozolomide response)
    MLH1 methylation in MSI-H colorectal (silences mismatch repair)
    BRCA1 methylation in basal-like breast cancer
  Probe filtering: verify 40,000-65,000 probes removed (450K) or 55,000-85,000 (EPIC)
  Integration: promoter methylation-expression correlation should be predominantly negative
    Median rho around -0.05 to -0.15 across all genes
    Gene body methylation should show weak positive correlation

Common Pitfalls

Expression

DESeq2 input: Using FPKM/TPM as input. DESeq2 requires raw integer counts.
Pre-filtering GSEA input: GSEA requires the full ranked gene list. Never pre-filter to significant genes only (that is ORA, not GSEA).
LFC shrinkage for hypothesis testing: apeglm shrinkage is for ranking and visualization. Use unshrunken results for strict significance calls unless using lfcThreshold > 0.
Gene ID mismatch: TCGA uses Ensembl IDs with version suffixes (ENSG00000141510.18). Strip versions before mapping: sub("\\..*", "", ensembl_ids).

Mutation

Mixing callers: Different variant callers produce different mutation sets. Pick one pipeline (or use MC3 consensus) and stick with it across the cohort.
Signature sample size: Extracting signatures from < 30 samples is unreliable. For small cohorts, fit to known COSMIC signatures instead of de novo extraction.
Reference genome for signatures: trinucleotideMatrix() requires the correct BSgenome. GDC harmonized data uses hg38 — using hg19 reference silently produces wrong trinucleotide contexts.
TMB capture size: tcgaCompare() uses 35.8 MB (Agilent SureSelect). If your data uses a different capture kit, set capture_size accordingly or TMB comparison is meaningless.

Copy Number

Segment_Mean interpretation: Segment_Mean is log2(CN/2), not raw copy number. A value of 0.0 is diploid, not zero copies.
CNV thresholds and purity: Fixed cutoffs (>0.3 gain, <-0.3 loss) assume high tumor purity. Low-purity samples compress the signal — a real amplification may show Segment_Mean of only 0.2.

Methylation

Beta-values for statistics: Using beta-values in limma or t-tests. Beta-values are heteroscedastic — use M-values for all statistical testing, report delta-beta for effect size.
Skipping cross-reactive probe removal: ~40K probes on 450K and ~43K on EPIC map to multiple genomic locations. These produce false DMPs at the wrong loci.
Standard enrichment on methylation results: Genes with more probes on the array are more likely to appear significant. Use missMethyl::gometh() instead of enrichGO() for methylation pathway analysis.
Mixing 450K and EPIC without harmonization: EPIC has ~380K additional probes. Restrict to shared probes before combining cohorts, or use ComBat on the intersection.
Ignoring probe type bias: Type I and Type II probes have different dynamic ranges. BMIQ or Funnorm corrects this; raw beta-values without type correction are biased.

General

Genome build mismatch: TCGA legacy archive uses hg19, GDC harmonized uses hg38. Never mix builds across data types. Check with GenomeInfoDb::genome(se).
Tumor purity: TCGA samples range 20-95% purity. Use ESTIMATE or ABSOLUTE purity scores to filter (> 60%) or include as covariate.
TCGA barcode matching: When integrating data types, match on patient barcode (first 12 characters), not full barcode. A patient may have multiple aliquots.

Related Skills

immune-deconvolution: Estimate immune cell composition from the expression data produced here
survival-analysis: Build prognostic models using mutation, expression, and methylation features as Cox covariates

Public Datasets for Testing

Dataset	Samples	Use Case
TCGA-LUAD	585	Lung adenocarcinoma, well-characterized drivers (EGFR, KRAS, ALK)
TCGA-BRCA	1098	Breast cancer, strong molecular subtypes (Luminal A/B, HER2, Basal)
TCGA-GBM	617	Glioblastoma, classic for survival and methylation subtyping
TCGA-COAD	521	Colorectal, MSI-H vs MSS comparison

ナビゲーション

Skillsとは？

リンク

Cancer Multi-Omics Analysis

Cancer Multi-Omics Analysis

When to Use This Skill

Inputs

Expression Analysis

Data Retrieval

GEO Data Retrieval (Alternative)

Normalization Decision Tree

Differential Expression (DESeq2 v1.50+)

Multi-Factor Design (Batch + Condition)

Batch Correction for Visualization

Gene ID Conversion

Pathway and Enrichment Analysis

Over-Representation Analysis (ORA)

Gene Set Enrichment Analysis (GSEA)

MSigDB Gene Sets (Hallmark, C2, C5)

Sample-Level Pathway Scoring (GSVA/ssGSEA)

Visualization

Output Specification

Validation Checks

Mutation Analysis

Data Retrieval

Pipeline Selection

Visualization

Tumor Mutation Burden

Somatic Interactions

Clinical Enrichment

Mutational Signatures

Driver Gene Detection

Copy Number Analysis

Data Retrieval

Segment Interpretation

GISTIC2.0 Analysis

Gene-Level Copy Number from Segments

Integration: Mutation + CNV

Methylation Analysis

Data Retrieval (TCGA)

Processing with minfi

Probe Filtering

Alternative: ChAMP Pipeline

Beta-Values vs M-Values

Differentially Methylated Positions (DMPs)

Differentially Methylated Regions (DMRs)

Methylation Subtypes (CIMP)

Integration: Methylation-Expression Correlation

Pathway Analysis (Probe-Bias Corrected)

Visualization

Output Specification

Validation Checks

Common Pitfalls

Expression

Mutation

Copy Number

Methylation

General

Related Skills

Public Datasets for Testing

関連スキル(📊 データ・分析)