Survival Analysis

Time-to-event analysis for cancer clinical data. Covers Kaplan-Meier, Cox proportional hazards, competing risks, restricted mean survival time, and optimal cutpoint selection using the survival, ggsurvfit, tidycmprsk, and survRM2 packages.

When to Use This Skill

Activate when the user requests:

Kaplan-Meier survival curves with risk tables
Log-rank or stratified log-rank tests between groups
Cox proportional hazards modeling (univariable or multivariable)
Proportional hazards assumption checking
Competing risks analysis (cause-specific or Fine-Gray)
Restricted mean survival time (RMST) comparisons
Optimal biomarker cutpoint selection for survival
Forest plots for multivariate Cox models

Inputs

Data Type	Format	Source
Clinical	Tabular (time, event, covariates)	TCGA GDC via TCGAbiolinks, cBioPortal
Time variable	Days/months to event or last follow-up	`days_to_death`, `days_to_last_follow_up`
Event indicator	Binary (0 = censored, 1 = event)	Derived from `vital_status`
Covariates	Categorical or continuous	Age, stage, gene expression, mutations

Preparing TCGA Survival Data

library(TCGAbiolinks)
library(SummarizedExperiment)

query <- GDCquery(
  project = "TCGA-GBM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
GDCdownload(query, directory = "GDCdata")
se <- GDCprepare(query, directory = "GDCdata")
clinical <- as.data.frame(colData(se))

# Construct survival endpoint from TCGA clinical fields
# TCGA stores time in days. Convert to months for readability.
clinical$os_time <- ifelse(!is.na(clinical$days_to_death),
  as.numeric(clinical$days_to_death),
  as.numeric(clinical$days_to_last_follow_up))
clinical$os_event <- ifelse(clinical$vital_status == "Dead", 1, 0)
clinical$os_months <- clinical$os_time / 30.44

# Drop samples with missing survival data
clinical <- clinical[!is.na(clinical$os_months) & clinical$os_months > 0, ]

Kaplan-Meier Estimation

library(survival)    # v3.8+
library(ggsurvfit)   # v1.2+, pharmaverse — modern replacement for survminer

# survfit2() tracks the calling environment for cleaner legend labels
# Use survfit2() with ggsurvfit; use survfit() with base R plotting
km_fit <- survfit2(Surv(os_months, os_event) ~ gender, data = clinical)

ggsurvfit(km_fit) +
  add_confidence_interval() +
  add_risktable() +
  add_pvalue(location = "annotation") +
  scale_x_continuous(breaks = seq(0, 60, 12)) +
  labs(x = "Months", y = "Overall survival probability") +
  theme_classic(base_size = 12)

# Median survival per group
summary(km_fit)$table[, "median"]

# Median follow-up: reverse KM method (censor events, event the censored)
km_fu <- survfit(Surv(os_months, 1 - os_event) ~ 1, data = clinical)
summary(km_fu)$table["median"]
# Always report median follow-up alongside survival estimates

Log-Rank Test

# Two-group comparison
survdiff(Surv(os_months, os_event) ~ idh1_status, data = clinical)
# Returns chi-square and p-value. Tests H0: survival curves are identical.

# Stratified log-rank — adjust for a confounder without estimating its effect
survdiff(Surv(os_months, os_event) ~ treatment + strata(age_group), data = clinical)

When log-rank is appropriate:
  Groups have proportional hazards (non-crossing curves)
    -> log-rank is most powerful
  Curves cross or separate late (common in immunotherapy)
    -> log-rank has low power. Use RMST or weighted log-rank.
  > 2 groups
    -> log-rank gives a global test. Follow up with pairwise comparisons
       using p.adjust() for multiple testing correction.

Cox Proportional Hazards

Univariable

# Continuous covariate
cox_age <- coxph(Surv(os_months, os_event) ~ age_at_diagnosis, data = clinical)
summary(cox_age)
# HR per 1-year increase. exp(coef) = HR, lower/upper .95 = CI

# Categorical covariate — set reference level explicitly
clinical$stage <- relevel(factor(clinical$stage), ref = "Stage I")
cox_stage <- coxph(Surv(os_months, os_event) ~ stage, data = clinical)

Multivariable

cox_multi <- coxph(
  Surv(os_months, os_event) ~ age_at_diagnosis + gender + idh1_status + mgmt_methylation,
  data = clinical
)
summary(cox_multi)

# Concordance index (C-statistic): 0.5 = random, 1.0 = perfect discrimination
# TCGA-GBM with IDH + MGMT: expect C-index ~0.65-0.75
concordance(cox_multi)

Design Considerations

Variable selection:
  Include variables that are clinically established prognostic factors.
  Do NOT use stepwise selection (AIC/BIC) for final models in clinical contexts —
  it inflates Type I error and produces unstable models.
  Stepwise is acceptable for exploratory analysis with clear disclosure.

Events per variable (EPV):
  Classic rule: >= 10 events per predictor in the model.
  With 50 events, fit at most 5 covariates.
  More recent work (Riley 2019) recommends >= 20 EPV for adequate precision.
  Too many covariates relative to events = overfitting and unstable HRs.

Continuous vs categorical:
  Keep continuous variables continuous in Cox models. Dichotomizing age at
  median discards information and reduces power.
  If a cutpoint is needed for clinical communication, derive it separately
  (see Optimal Cutpoints below) and validate externally.

Proportional Hazards Assumption

# Schoenfeld residual test
zph <- cox.zph(cox_multi, transform = "km")
print(zph)
# Per-covariate p-values + global test
# p > 0.05: PH assumption not rejected (but doesn't prove PH holds)
# p < 0.05: evidence of non-proportionality for that covariate

# Visual check — more informative than the p-value alone
# Horizontal band of residuals -> PH holds
# Systematic trend (slope) -> PH violated, effect changes over time
plot(zph)
# Or with survminer: survminer::ggcoxzph(zph)

Interpreting Schoenfeld residuals:
  Positive slope: effect increases over time (HR drifts upward)
  Negative slope: effect weakens over time (early benefit that fades)
  U-shape or crossing zero: complex time-varying effect

  The p-value is sample-size dependent: large datasets reject PH
  for clinically negligible departures. Always look at the plot.

Handling PH Violations

# Option 1: Time-varying coefficient with tt()
# Preferred when a specific covariate violates PH
cox_tv <- coxph(
  Surv(os_months, os_event) ~ age_at_diagnosis + tt(treatment),
  data = clinical,
  tt = function(x, t, ...) x * log(t + 1)
)
# Tests whether treatment effect changes with log(time)
# Significant tt() term confirms time-varying effect

# Option 2: Stratification — for nuisance variables
# Each stratum gets its own baseline hazard; no HR estimated for that variable
cox_strat <- coxph(
  Surv(os_months, os_event) ~ age_at_diagnosis + mgmt_methylation + strata(center),
  data = clinical
)
# Re-run cox.zph() after stratification to confirm remaining covariates satisfy PH

# Option 3: Time-splitting with tmerge() — for time-varying covariates
# When a covariate value changes during follow-up (e.g., treatment switch)
library(survival)
td_data <- tmerge(
  data1 = clinical, data2 = clinical,
  id = patient_id,
  os_event = event(os_months, os_event),
  treatment_switch = tdc(switch_time, new_treatment)
)
cox_td <- coxph(Surv(tstart, tstop, os_event) ~ treatment_switch + age_at_diagnosis,
  data = td_data)

Competing Risks

Two hazard models — they answer different questions:

Cause-specific hazard (CSH):
  "Does treatment affect the rate of cancer death among patients still alive?"
  Standard coxph() treating competing events as censored.
  Use for: etiological questions, causal inference.

Fine-Gray subdistribution hazard (SDH):
  "Does treatment affect the cumulative probability of dying from cancer
   when competing events are accounted for?"
  Use for: prediction, prognostic models, clinical decision-making.

Run both and report side-by-side.
A covariate can have a significant CSH but non-significant SDH (or vice versa)
because they measure different things.

library(tidycmprsk)  # v1.1+, wraps cmprsk with tidy interface
library(ggsurvfit)

# Event coding for competing risks:
# 0 = censored, 1 = event of interest, 2 = competing event
# Example: 1 = cancer death, 2 = death from other cause
clinical$event_type <- factor(
  ifelse(clinical$os_event == 0, "censored",
    ifelse(clinical$cause_of_death == "cancer", "cancer_death", "other_death")),
  levels = c("censored", "cancer_death", "other_death")
)

# Cumulative incidence function (CIF)
cif <- cuminc(Surv(os_months, event_type) ~ treatment, data = clinical)

ggcuminc(cif, outcome = "cancer_death") +
  add_risktable() +
  labs(x = "Months", y = "Cumulative incidence of cancer death") +
  theme_classic(base_size = 12)

# Fine-Gray regression (subdistribution hazard)
fg <- crr(Surv(os_months, event_type) ~ age + treatment + stage,
  data = clinical, failcode = "cancer_death")
summary(fg)

# Cause-specific Cox (standard Cox, treating competing events as censored)
clinical$cancer_event <- ifelse(clinical$event_type == "cancer_death", 1, 0)
cs <- coxph(Surv(os_months, cancer_event) ~ age + treatment + stage, data = clinical)
summary(cs)

Restricted Mean Survival Time

library(survRM2)  # v1.0-4

# RMST: area under the survival curve up to time tau
# Avoids the PH assumption entirely — compares mean survival time between groups

rmst <- rmst2(
  time = clinical$os_months,
  status = clinical$os_event,
  arm = as.numeric(clinical$treatment == "experimental"),
  tau = 24  # restrict to 24 months — must be pre-specified
)
print(rmst)
# Returns:
#   RMST difference (treatment - control), CI, p-value
#   RMST ratio
#   RMTL ratio (restricted mean time lost)

# tau must be less than the minimum of the largest observed time in each group.
# Pre-specify tau based on clinical context (e.g., 12, 24, 36 months).
# Run sensitivity analysis across multiple tau values.

When to use RMST vs HR:
  Proportional hazards hold?
    -> HR is more efficient. Report both HR and RMST.
  Curves cross or separate late (immunotherapy, delayed effect)?
    -> RMST captures the difference that HR misses.
  Need a clinically interpretable number?
    -> RMST: "Treatment gives 3.2 extra months of survival on average"
       HR: "Treatment reduces the hazard by 30%" (less intuitive)

Optimal Cutpoint Selection

library(survminer)  # v0.5+, provides surv_cutpoint

# maxstat approach: maximally selected log-rank statistic
# Finds the cutpoint that maximally separates high/low groups
cp <- surv_cutpoint(clinical, time = "os_months", event = "os_event",
  variables = "biomarker_score")
summary(cp)
cat(sprintf("Optimal cutpoint: %.2f (corrected p = %.4f)\n",
  cp$cutpoint$cutpoint, cp$cutpoint$statistic))

# Categorize samples at the cutpoint
clinical$biomarker_group <- surv_categorize(cp)$biomarker_score

# Plot with survminer (surv_cutpoint is survminer-specific, not ggsurvfit)
fit_cp <- survfit(Surv(os_months, os_event) ~ biomarker_group, data = clinical)
ggsurvplot(fit_cp, data = clinical, pval = TRUE, risk.table = TRUE)

Cutpoint pitfalls — read before using:
  1. Inflated significance: searching all cutpoints maximizes the test statistic.
     The p-value from maxstat is corrected for this, but the effect size (HR)
     is still biased upward. A random variable can appear highly prognostic.

  2. Non-reproducibility: data-driven cutpoints rarely replicate.
     Any cutpoint found this way MUST be validated in an independent cohort
     or via cross-validation. Without validation, it is a hypothesis, not a finding.

  3. Information loss: dichotomizing a continuous biomarker at any cutpoint
     discards information. A continuous covariate in Cox regression is almost
     always more powerful than a dichotomized one.

  4. When cutpoints are justified:
     - Clinical decision-making (treat/don't treat needs a threshold)
     - An established cutpoint exists (Ki-67 > 14%, PSA > 4 ng/mL)
     - Communication to non-statistical audiences

  Use continuous modeling for primary analysis, cutpoints for clinical translation.

Forest Plots

library(forestmodel)  # v0.6+

# Directly from a coxph object — one function call
forest_model(cox_multi)
# Produces a ggplot2-based forest plot with HRs, 95% CIs, p-values

# Customize
forest_model(cox_multi,
  factor_separate_line = TRUE,       # categorical levels on separate lines
  format_options = forest_model_format_options(
    text_size = 3.5, point_size = 3
  )
)

# Alternative: survminer::ggforest() for quick plots
# Limitation: does not handle interactions or spline terms properly
survminer::ggforest(cox_multi, data = clinical)

Output Specification

Output	Format	Description
`km_curves.pdf`	PDF	Kaplan-Meier curves with risk table and p-value
`cox_summary.csv`	CSV	HR, 95% CI, p-value for each covariate
`cox_zph.pdf`	PDF	Schoenfeld residual plots for PH assumption
`forest_plot.pdf`	PDF	Multivariate Cox forest plot
`cif_curves.pdf`	PDF	Cumulative incidence curves for competing risks
`rmst_results.csv`	CSV	RMST difference, ratio, CI, p-value
`cutpoint_analysis.csv`	CSV	Optimal cutpoint, corrected p-value, HR at cutpoint

Validation Checks

After running survival analysis on TCGA-GBM, verify:

Overall survival:
  Median OS for all GBM: ~14-16 months
  If median OS > 24 months: likely includes IDH-mutant cases (not true GBM under WHO 2021)
  2-year survival: ~20-25%

IDH1 mutation (strongest single prognostic factor in glioma):
  IDH-mutant median OS: ~31 months
  IDH-wildtype median OS: ~14-16 months
  Log-rank p-value: should be < 0.001
  Cox HR for IDH-mutant vs wildtype: ~0.3-0.4 (strong protective effect)
  IDH1 R132H mutation frequency: ~6% of original TCGA-GBM cohort

MGMT promoter methylation:
  MGMT methylated: ~42% of IDH-wildtype GBMs
  Methylated vs unmethylated: HR ~0.5-0.6
  This is both prognostic (survival) and predictive (TMZ response)

Cox model quality:
  C-index with IDH + MGMT + age: ~0.65-0.75
  EPV check: with ~400 events and 3-5 covariates, EPV is adequate (> 80)

Proportional hazards:
  Age typically satisfies PH in GBM (chronic risk factor)
  Treatment effect may violate PH (delayed separation with TMZ)
  If global cox.zph p < 0.05, identify the violating covariate

WHO 2021 note:
  "Glioblastoma" is now restricted to IDH-wildtype tumors.
  IDH-mutant grade 4 gliomas are reclassified as "Astrocytoma, IDH-mutant, Grade 4."
  When using TCGA-GBM data, reclassify or note this in your analysis.

Common Pitfalls

Kaplan-Meier

Median follow-up from the KM estimate: The median of the KM curve is median survival time, not median follow-up. Calculate median follow-up using the reverse KM method: survfit(Surv(time, 1-event) ~ 1). Short follow-up with few events makes survival estimates unreliable beyond the last observed event.
Interpreting the tail: KM curves at the right end are based on very few patients. Wide confidence intervals there mean the curve is unreliable. Truncate plots at the time when < 10% of patients remain at risk.

Cox Regression

Ignoring the PH assumption: Running coxph() and reporting HRs without checking cox.zph() first. If PH is violated, the reported HR is an average over time and may be misleading for both early and late effects.
Overfitting: Including too many covariates relative to events. With 50 events and 10 covariates, the model is unstable. Follow the EPV >= 10 rule (preferably >= 20).
Stepwise selection for final models: Forward/backward stepwise selection inflates Type I error and produces non-reproducible models. Pre-specify covariates based on clinical knowledge.
Ignoring collinearity: Stage and tumor size are correlated. Including both in a Cox model inflates standard errors and produces unstable HRs. Check VIF or pairwise correlations before fitting.

Competing Risks

Standard KM for competing risks endpoints: KM treats competing events as censored, which inflates cumulative incidence of the event of interest. Use cumulative incidence functions (CIF) via tidycmprsk::cuminc() when competing events are present.
Interpreting Fine-Gray as causal: Fine-Gray SDH measures the effect on cumulative incidence, not on the rate among those at risk. A covariate that increases competing event risk will artificially appear to decrease the subdistribution hazard for the primary event.

RMST and Cutpoints

Post-hoc tau selection: Choosing tau after seeing the data invalidates the RMST test. Pre-specify tau based on clinical context (standard follow-up duration, planned analysis timepoint).
Cutpoint without validation: An optimal cutpoint found by maxstat will appear highly significant in the discovery cohort even for a non-prognostic biomarker. Without independent validation, the cutpoint is a hypothesis, not a result.

Related Skills

cancer-multiomics: TCGA data retrieval, mutation status and gene expression for Cox covariates
immune-deconvolution: Immune cell fraction estimates as continuous survival predictors

Public Datasets for Testing

Dataset	Samples	Use Case
TCGA-GBM	611	Glioblastoma, strong IDH1/MGMT prognostic markers, poor prognosis
TCGA-BRCA	1098	Breast cancer, molecular subtype-specific survival differences
TCGA-LUAD	585	Lung adenocarcinoma, EGFR/KRAS mutation-driven survival
TCGA-OV	608	Ovarian cancer, classic for competing risks (progression vs death)

ナビゲーション

Skillsとは？

リンク

Survival Analysis

Survival Analysis

When to Use This Skill

Inputs

Preparing TCGA Survival Data

Kaplan-Meier Estimation

Log-Rank Test

Cox Proportional Hazards

Univariable

Multivariable

Design Considerations

Proportional Hazards Assumption

Handling PH Violations

Competing Risks

Restricted Mean Survival Time

Optimal Cutpoint Selection

Forest Plots

Output Specification

Validation Checks

Common Pitfalls

Kaplan-Meier

Cox Regression

Competing Risks

RMST and Cutpoints

Related Skills

Public Datasets for Testing

関連スキル(🌐 Web開発)