Survival Analysis
Time-to-event analysis for cancer clinical data. Covers Kaplan-Meier, Cox proportional hazards, competing risks, restricted mean survival time, and optimal cutpoint selection using the survival, ggsurvfit, tidycmprsk, and survRM2 packages.
When to Use This Skill
Activate when the user requests:
- Kaplan-Meier survival curves with risk tables
- Log-rank or stratified log-rank tests between groups
- Cox proportional hazards modeling (univariable or multivariable)
- Proportional hazards assumption checking
- Competing risks analysis (cause-specific or Fine-Gray)
- Restricted mean survival time (RMST) comparisons
- Optimal biomarker cutpoint selection for survival
- Forest plots for multivariate Cox models
Inputs
| Data Type | Format | Source |
|---|
| Clinical | Tabular (time, event, covariates) | TCGA GDC via TCGAbiolinks, cBioPortal |
| Time variable | Days/months to event or last follow-up | days_to_death, days_to_last_follow_up |
| Event indicator | Binary (0 = censored, 1 = event) | Derived from vital_status |
| Covariates | Categorical or continuous | Age, stage, gene expression, mutations |
Preparing TCGA Survival Data
library(TCGAbiolinks)
library(SummarizedExperiment)
query <- GDCquery(
project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts"
)
GDCdownload(query, directory = "GDCdata")
se <- GDCprepare(query, directory = "GDCdata")
clinical <- as.data.frame(colData(se))
# Construct survival endpoint from TCGA clinical fields
# TCGA stores time in days. Convert to months for readability.
clinical$os_time <- ifelse(!is.na(clinical$days_to_death),
as.numeric(clinical$days_to_death),
as.numeric(clinical$days_to_last_follow_up))
clinical$os_event <- ifelse(clinical$vital_status == "Dead", 1, 0)
clinical$os_months <- clinical$os_time / 30.44
# Drop samples with missing survival data
clinical <- clinical[!is.na(clinical$os_months) & clinical$os_months > 0, ]
Kaplan-Meier Estimation
library(survival) # v3.8+
library(ggsurvfit) # v1.2+, pharmaverse — modern replacement for survminer
# survfit2() tracks the calling environment for cleaner legend labels
# Use survfit2() with ggsurvfit; use survfit() with base R plotting
km_fit <- survfit2(Surv(os_months, os_event) ~ gender, data = clinical)
ggsurvfit(km_fit) +
add_confidence_interval() +
add_risktable() +
add_pvalue(location = "annotation") +
scale_x_continuous(breaks = seq(0, 60, 12)) +
labs(x = "Months", y = "Overall survival probability") +
theme_classic(base_size = 12)
# Median survival per group
summary(km_fit)$table[, "median"]
# Median follow-up: reverse KM method (censor events, event the censored)
km_fu <- survfit(Surv(os_months, 1 - os_event) ~ 1, data = clinical)
summary(km_fu)$table["median"]
# Always report median follow-up alongside survival estimates
Log-Rank Test
# Two-group comparison
survdiff(Surv(os_months, os_event) ~ idh1_status, data = clinical)
# Returns chi-square and p-value. Tests H0: survival curves are identical.
# Stratified log-rank — adjust for a confounder without estimating its effect
survdiff(Surv(os_months, os_event) ~ treatment + strata(age_group), data = clinical)
When log-rank is appropriate:
Groups have proportional hazards (non-crossing curves)
-> log-rank is most powerful
Curves cross or separate late (common in immunotherapy)
-> log-rank has low power. Use RMST or weighted log-rank.
> 2 groups
-> log-rank gives a global test. Follow up with pairwise comparisons
using p.adjust() for multiple testing correction.
Cox Proportional Hazards
Univariable
# Continuous covariate
cox_age <- coxph(Surv(os_months, os_event) ~ age_at_diagnosis, data = clinical)
summary(cox_age)
# HR per 1-year increase. exp(coef) = HR, lower/upper .95 = CI
# Categorical covariate — set reference level explicitly
clinical$stage <- relevel(factor(clinical$stage), ref = "Stage I")
cox_stage <- coxph(Surv(os_months, os_event) ~ stage, data = clinical)
Multivariable
cox_multi <- coxph(
Surv(os_months, os_event) ~ age_at_diagnosis + gender + idh1_status + mgmt_methylation,
data = clinical
)
summary(cox_multi)
# Concordance index (C-statistic): 0.5 = random, 1.0 = perfect discrimination
# TCGA-GBM with IDH + MGMT: expect C-index ~0.65-0.75
concordance(cox_multi)
Design Considerations
Variable selection:
Include variables that are clinically established prognostic factors.
Do NOT use stepwise selection (AIC/BIC) for final models in clinical contexts —
it inflates Type I error and produces unstable models.
Stepwise is acceptable for exploratory analysis with clear disclosure.
Events per variable (EPV):
Classic rule: >= 10 events per predictor in the model.
With 50 events, fit at most 5 covariates.
More recent work (Riley 2019) recommends >= 20 EPV for adequate precision.
Too many covariates relative to events = overfitting and unstable HRs.
Continuous vs categorical:
Keep continuous variables continuous in Cox models. Dichotomizing age at
median discards information and reduces power.
If a cutpoint is needed for clinical communication, derive it separately
(see Optimal Cutpoints below) and validate externally.
Proportional Hazards Assumption
# Schoenfeld residual test
zph <- cox.zph(cox_multi, transform = "km")
print(zph)
# Per-covariate p-values + global test
# p > 0.05: PH assumption not rejected (but doesn't prove PH holds)
# p < 0.05: evidence of non-proportionality for that covariate
# Visual check — more informative than the p-value alone
# Horizontal band of residuals -> PH holds
# Systematic trend (slope) -> PH violated, effect changes over time
plot(zph)
# Or with survminer: survminer::ggcoxzph(zph)
Interpreting Schoenfeld residuals:
Positive slope: effect increases over time (HR drifts upward)
Negative slope: effect weakens over time (early benefit that fades)
U-shape or crossing zero: complex time-varying effect
The p-value is sample-size dependent: large datasets reject PH
for clinically negligible departures. Always look at the plot.
Handling PH Violations
# Option 1: Time-varying coefficient with tt()
# Preferred when a specific covariate violates PH
cox_tv <- coxph(
Surv(os_months, os_event) ~ age_at_diagnosis + tt(treatment),
data = clinical,
tt = function(x, t, ...) x * log(t + 1)
)
# Tests whether treatment effect changes with log(time)
# Significant tt() term confirms time-varying effect
# Option 2: Stratification — for nuisance variables
# Each stratum gets its own baseline hazard; no HR estimated for that variable
cox_strat <- coxph(
Surv(os_months, os_event) ~ age_at_diagnosis + mgmt_methylation + strata(center),
data = clinical
)
# Re-run cox.zph() after stratification to confirm remaining covariates satisfy PH
# Option 3: Time-splitting with tmerge() — for time-varying covariates
# When a covariate value changes during follow-up (e.g., treatment switch)
library(survival)
td_data <- tmerge(
data1 = clinical, data2 = clinical,
id = patient_id,
os_event = event(os_months, os_event),
treatment_switch = tdc(switch_time, new_treatment)
)
cox_td <- coxph(Surv(tstart, tstop, os_event) ~ treatment_switch + age_at_diagnosis,
data = td_data)
Competing Risks
Two hazard models — they answer different questions:
Cause-specific hazard (CSH):
"Does treatment affect the rate of cancer death among patients still alive?"
Standard coxph() treating competing events as censored.
Use for: etiological questions, causal inference.
Fine-Gray subdistribution hazard (SDH):
"Does treatment affect the cumulative probability of dying from cancer
when competing events are accounted for?"
Use for: prediction, prognostic models, clinical decision-making.
Run both and report side-by-side.
A covariate can have a significant CSH but non-significant SDH (or vice versa)
because they measure different things.
library(tidycmprsk) # v1.1+, wraps cmprsk with tidy interface
library(ggsurvfit)
# Event coding for competing risks:
# 0 = censored, 1 = event of interest, 2 = competing event
# Example: 1 = cancer death, 2 = death from other cause
clinical$event_type <- factor(
ifelse(clinical$os_event == 0, "censored",
ifelse(clinical$cause_of_death == "cancer", "cancer_death", "other_death")),
levels = c("censored", "cancer_death", "other_death")
)
# Cumulative incidence function (CIF)
cif <- cuminc(Surv(os_months, event_type) ~ treatment, data = clinical)
ggcuminc(cif, outcome = "cancer_death") +
add_risktable() +
labs(x = "Months", y = "Cumulative incidence of cancer death") +
theme_classic(base_size = 12)
# Fine-Gray regression (subdistribution hazard)
fg <- crr(Surv(os_months, event_type) ~ age + treatment + stage,
data = clinical, failcode = "cancer_death")
summary(fg)
# Cause-specific Cox (standard Cox, treating competing events as censored)
clinical$cancer_event <- ifelse(clinical$event_type == "cancer_death", 1, 0)
cs <- coxph(Surv(os_months, cancer_event) ~ age + treatment + stage, data = clinical)
summary(cs)
Restricted Mean Survival Time
library(survRM2) # v1.0-4
# RMST: area under the survival curve up to time tau
# Avoids the PH assumption entirely — compares mean survival time between groups
rmst <- rmst2(
time = clinical$os_months,
status = clinical$os_event,
arm = as.numeric(clinical$treatment == "experimental"),
tau = 24 # restrict to 24 months — must be pre-specified
)
print(rmst)
# Returns:
# RMST difference (treatment - control), CI, p-value
# RMST ratio
# RMTL ratio (restricted mean time lost)
# tau must be less than the minimum of the largest observed time in each group.
# Pre-specify tau based on clinical context (e.g., 12, 24, 36 months).
# Run sensitivity analysis across multiple tau values.
When to use RMST vs HR:
Proportional hazards hold?
-> HR is more efficient. Report both HR and RMST.
Curves cross or separate late (immunotherapy, delayed effect)?
-> RMST captures the difference that HR misses.
Need a clinically interpretable number?
-> RMST: "Treatment gives 3.2 extra months of survival on average"
HR: "Treatment reduces the hazard by 30%" (less intuitive)
Optimal Cutpoint Selection
library(survminer) # v0.5+, provides surv_cutpoint
# maxstat approach: maximally selected log-rank statistic
# Finds the cutpoint that maximally separates high/low groups
cp <- surv_cutpoint(clinical, time = "os_months", event = "os_event",
variables = "biomarker_score")
summary(cp)
cat(sprintf("Optimal cutpoint: %.2f (corrected p = %.4f)\n",
cp$cutpoint$cutpoint, cp$cutpoint$statistic))
# Categorize samples at the cutpoint
clinical$biomarker_group <- surv_categorize(cp)$biomarker_score
# Plot with survminer (surv_cutpoint is survminer-specific, not ggsurvfit)
fit_cp <- survfit(Surv(os_months, os_event) ~ biomarker_group, data = clinical)
ggsurvplot(fit_cp, data = clinical, pval = TRUE, risk.table = TRUE)
Cutpoint pitfalls — read before using:
1. Inflated significance: searching all cutpoints maximizes the test statistic.
The p-value from maxstat is corrected for this, but the effect size (HR)
is still biased upward. A random variable can appear highly prognostic.
2. Non-reproducibility: data-driven cutpoints rarely replicate.
Any cutpoint found this way MUST be validated in an independent cohort
or via cross-validation. Without validation, it is a hypothesis, not a finding.
3. Information loss: dichotomizing a continuous biomarker at any cutpoint
discards information. A continuous covariate in Cox regression is almost
always more powerful than a dichotomized one.
4. When cutpoints are justified:
- Clinical decision-making (treat/don't treat needs a threshold)
- An established cutpoint exists (Ki-67 > 14%, PSA > 4 ng/mL)
- Communication to non-statistical audiences
Use continuous modeling for primary analysis, cutpoints for clinical translation.
Forest Plots
library(forestmodel) # v0.6+
# Directly from a coxph object — one function call
forest_model(cox_multi)
# Produces a ggplot2-based forest plot with HRs, 95% CIs, p-values
# Customize
forest_model(cox_multi,
factor_separate_line = TRUE, # categorical levels on separate lines
format_options = forest_model_format_options(
text_size = 3.5, point_size = 3
)
)
# Alternative: survminer::ggforest() for quick plots
# Limitation: does not handle interactions or spline terms properly
survminer::ggforest(cox_multi, data = clinical)
Output Specification
| Output | Format | Description |
|---|
km_curves.pdf | PDF | Kaplan-Meier curves with risk table and p-value |
cox_summary.csv | CSV | HR, 95% CI, p-value for each covariate |
cox_zph.pdf | PDF | Schoenfeld residual plots for PH assumption |
forest_plot.pdf | PDF | Multivariate Cox forest plot |
cif_curves.pdf | PDF | Cumulative incidence curves for competing risks |
rmst_results.csv | CSV | RMST difference, ratio, CI, p-value |
cutpoint_analysis.csv | CSV | Optimal cutpoint, corrected p-value, HR at cutpoint |
Validation Checks
After running survival analysis on TCGA-GBM, verify:
Overall survival:
Median OS for all GBM: ~14-16 months
If median OS > 24 months: likely includes IDH-mutant cases (not true GBM under WHO 2021)
2-year survival: ~20-25%
IDH1 mutation (strongest single prognostic factor in glioma):
IDH-mutant median OS: ~31 months
IDH-wildtype median OS: ~14-16 months
Log-rank p-value: should be < 0.001
Cox HR for IDH-mutant vs wildtype: ~0.3-0.4 (strong protective effect)
IDH1 R132H mutation frequency: ~6% of original TCGA-GBM cohort
MGMT promoter methylation:
MGMT methylated: ~42% of IDH-wildtype GBMs
Methylated vs unmethylated: HR ~0.5-0.6
This is both prognostic (survival) and predictive (TMZ response)
Cox model quality:
C-index with IDH + MGMT + age: ~0.65-0.75
EPV check: with ~400 events and 3-5 covariates, EPV is adequate (> 80)
Proportional hazards:
Age typically satisfies PH in GBM (chronic risk factor)
Treatment effect may violate PH (delayed separation with TMZ)
If global cox.zph p < 0.05, identify the violating covariate
WHO 2021 note:
"Glioblastoma" is now restricted to IDH-wildtype tumors.
IDH-mutant grade 4 gliomas are reclassified as "Astrocytoma, IDH-mutant, Grade 4."
When using TCGA-GBM data, reclassify or note this in your analysis.
Common Pitfalls
Kaplan-Meier
- Median follow-up from the KM estimate: The median of the KM curve is median survival time, not median follow-up. Calculate median follow-up using the reverse KM method:
survfit(Surv(time, 1-event) ~ 1). Short follow-up with few events makes survival estimates unreliable beyond the last observed event.
- Interpreting the tail: KM curves at the right end are based on very few patients. Wide confidence intervals there mean the curve is unreliable. Truncate plots at the time when < 10% of patients remain at risk.
Cox Regression
- Ignoring the PH assumption: Running
coxph() and reporting HRs without checking cox.zph() first. If PH is violated, the reported HR is an average over time and may be misleading for both early and late effects.
- Overfitting: Including too many covariates relative to events. With 50 events and 10 covariates, the model is unstable. Follow the EPV >= 10 rule (preferably >= 20).
- Stepwise selection for final models: Forward/backward stepwise selection inflates Type I error and produces non-reproducible models. Pre-specify covariates based on clinical knowledge.
- Ignoring collinearity: Stage and tumor size are correlated. Including both in a Cox model inflates standard errors and produces unstable HRs. Check VIF or pairwise correlations before fitting.
Competing Risks
- Standard KM for competing risks endpoints: KM treats competing events as censored, which inflates cumulative incidence of the event of interest. Use cumulative incidence functions (CIF) via
tidycmprsk::cuminc() when competing events are present.
- Interpreting Fine-Gray as causal: Fine-Gray SDH measures the effect on cumulative incidence, not on the rate among those at risk. A covariate that increases competing event risk will artificially appear to decrease the subdistribution hazard for the primary event.
RMST and Cutpoints
- Post-hoc tau selection: Choosing tau after seeing the data invalidates the RMST test. Pre-specify tau based on clinical context (standard follow-up duration, planned analysis timepoint).
- Cutpoint without validation: An optimal cutpoint found by maxstat will appear highly significant in the discovery cohort even for a non-prognostic biomarker. Without independent validation, the cutpoint is a hypothesis, not a result.
Related Skills
cancer-multiomics: TCGA data retrieval, mutation status and gene expression for Cox covariates
immune-deconvolution: Immune cell fraction estimates as continuous survival predictors
Public Datasets for Testing
| Dataset | Samples | Use Case |
|---|
| TCGA-GBM | 611 | Glioblastoma, strong IDH1/MGMT prognostic markers, poor prognosis |
| TCGA-BRCA | 1098 | Breast cancer, molecular subtype-specific survival differences |
| TCGA-LUAD | 585 | Lung adenocarcinoma, EGFR/KRAS mutation-driven survival |
| TCGA-OV | 608 | Ovarian cancer, classic for competing risks (progression vs death) |