AI Agent Guidelines for Survival Analysis Workshop
This document provides guidance for AI assistants (GitHub Copilot, etc.) working with this codebase.
🎯 Quick Context
What is this? Educational workshop on survival analysis using R and the telco churn dataset
Primary tool: RStudio Server in Podman (rocker/tidyverse)
Key technologies: R, Quarto (qmd), survival package, survminer, tidyverse
Workshop structure: Sequential multi-part workflow where later parts depend on earlier outputs
Current parts: Part 1 (KM/exploratory) → Part 2 (Cox regression) with data/model persistence between parts
Main challenge: Creating pedagogical content that balances statistical rigor with accessibility
Primary output: Rendered HTML files (one per workshop part) with embedded visualizations and explanations
🚨 Critical Rules - Read These First!
1. Threading Environment Variables (IMPORTANT for parallel processing)
Required for parallel processing with furrr/future to prevent crashes:
-e OPENBLAS_NUM_THREADS=1
-e OMP_NUM_THREADS=1
-e MKL_NUM_THREADS=1
Why: Linear algebra libraries spawn multiple threads per worker. This compounds in parallel processing and can exhaust system resources, causing "Resource temporarily unavailable" errors.
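If the container was launched without these flags, a fallback sketch (assuming the RhpcBLASctl package, which is not part of this project's stated package list) is to cap threads from inside R before spawning workers:

```r
# Fallback sketch if the container lacks the threading flags.
# Assumes the RhpcBLASctl package (not in this project's package list).
library(RhpcBLASctl)

blas_set_num_threads(1)  # one BLAS thread per R process
omp_set_num_threads(1)   # one OpenMP thread per R process

# Confirm the container-level variables are in effect
Sys.getenv(c("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS"))
```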
2. Join Relationship Specification (REQUIRED in dplyr 1.1.0+)
# ✅ CORRECT - Explicit relationship prevents warnings
left_join(data1, data2, by = "key", relationship = "many-to-many")
left_join(data1, data2, by = "key", relationship = "many-to-one")
# ❌ WRONG - Warns in dplyr >= 1.1.0 when the join turns out to be many-to-many
left_join(data1, data2, by = "key")
3. Function Documentation Standards
All library files (lib_*.R) must have complete roxygen2 documentation. Check function signatures in the files before asking about parameters.
4. Reproducibility
- Always set seeds: Use a consistent seed (e.g., `seed = 42`) for reproducible results
- Document package versions: Use renv or specify versions in the Dockerfile (see the sketch after this list)
- Cache expensive computations: Save fitted models and large results
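A minimal renv workflow sketch for documenting package versions (renv is suggested above; the commands are standard renv API):

```r
# Minimal renv workflow sketch for documenting package versions
renv::init()      # one-time: create a project-local library
renv::snapshot()  # record installed versions in renv.lock
renv::restore()   # later / elsewhere: reinstall the recorded versions
```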
5. Markdown/Quarto Formatting (CRITICAL for this project)
List Formatting Rules:
# ✅ CORRECT - Proper list formatting
Some introductory text explaining the list:

  - First item with two-space indentation
  - Second item
  - Multi-line item continues on next line
    with proper indentation (four spaces total)

# ❌ WRONG - Missing blank line before list
Some introductory text:
  - First item (will merge with paragraph above!)

# ❌ WRONG - Blank lines between items
  - First item

  - Second item (creates separate lists!)

# ❌ WRONG - No indentation
- First item (inconsistent with rest of document)
Critical Rules:
- Blank line BEFORE lists: Always have one blank line between intro text and first list item
- NO blank lines BETWEEN items: List items should be consecutive (no blank lines)
- Two-space indentation: All list items start with two spaces, then the marker (- or 1.)
- Continuation lines: Multi-line list items need proper indentation (4 spaces for continuation)
- Auto-numbering: Use `1.` for ALL numbered list items (Markdown auto-numbers them)
Why This Matters:
- Without blank line before: List merges with preceding paragraph
- With blank lines between items: Renders as separate lists
- Without proper indentation: Markdown does not recognize as list
- Manual numbering (1. 2. 3.): Fragile when reordering/adding items
🐳 Podman Setup
Standard Container Launch
# Use Justfile command (recommended)
just podman-run-image
# Manual command (explicit Podman)
podman run --rm -d \
--userns=keep-id \
-e RUNROOTLESS=false \
-p "127.0.0.1:8787:8787" \
-e USER=rstudio \
-e PASSWORD=CHANGEME \
-e USERID=$(id -u) \
-e GROUPID=$(id -g) \
-v "$(pwd):/home/rstudio/surv_workshop:z" \
-v "$(pwd)/.rstudio_copilot:/home/rstudio/.config/github-copilot:rw" \
--name survival-workshop \
kaybenleroll/ws_survival_202601:latest
Key Configuration Notes
- RUNROOTLESS=false - rocker images need root for initial setup
- Port binding to 127.0.0.1 - Never use 0.0.0.0 (security risk)
- SELinux :z suffix - Required for volume mounts on SELinux systems (lowercase :z for shared)
- --userns=keep-id - Maintains user ID mapping in rootless mode
- --rm flag - Auto-removes container when stopped
- GitHub Copilot mount - Persists authentication across container restarts
- Project mount - Maps to /home/rstudio/surv_workshop (not /project)
Podman Image Management
# Build image
just podman-build-image
# Rebuild from scratch (no cache)
just podman-rebuild-image
# Stop container
just podman-stop-image
# Remove container
just podman-rm
# Restart container
just podman-restart
# Enter container shell
just podman-bash
# View container logs
just podman-logs
# Check container status
just podman-status
SSH Tunneling for Remote Access
# Display tunnel command with your username and hostname
just ssh-tunnel
# Example output:
# Run this command on your local machine:
# ssh -L 8787:localhost:8787 username@hostname
# Then access RStudio at: http://localhost:8787
Quarto Rendering Targets
# SEQUENTIAL WORKSHOP (RECOMMENDED)
# Render workshop parts in dependency order (earlier parts create data for later parts)
just workshop-sequence # Render current workshop sequence (Part 1 → Part 2)
# or
just worksheet
# Render individual parts (respect dependencies!)
just initial_survival_models # Part 1: Initial survival models and exploratory analysis
just expanded_coxph_models # Part 2: Cox regression (requires Part 1 first!)
# CONTAINER RENDERING (uses Podman, no host R/Quarto needed)
just render-container initial_survival_models
just render-container expanded_coxph_models
just render-container-sequence # Render current workshop sequence in container
just render-container-all # Render all QMDs in dependency order (skips temp_*)
# HOST RENDERING (requires Quarto + R on host)
just render-host initial_survival_models
just render-host-all # Render all on host
# SUPPLEMENTARY NOTEBOOKS
just classic_survival_models # Classical models reference
# or
just classic
just worksheet_survival # Legacy monolithic version (deprecated)
# Render complete workshop (all current parts + supplementary)
just all
# CLEANING
just clean-html # Remove HTML files
just clean-cache # Remove Quarto cache
just clean-precompute # Remove Part 1 saved data
just clean-outputs # Remove rendering logs
just clean-all # Remove everything
just nuke # Nuclear option (same as clean-all)
📊 Project Structure
project_name/
├── data/ # Input data files (read-only)
│ └── telcochurn.csv # Main dataset
├── precompute/ # Inter-part data transfer (gitignored binaries)
│ ├── telco_churn_cat.parquet # Shared data from earlier parts
│ ├── model00_null.qs # Persisted models
│ ├── model01_vmailplan.qs # (Part N saves, Part N+1 loads)
│ └── model03_combined.qs #
├── build/ # Docker build configurations
│ ├── Dockerfile
│ └── install_packages.R
├── lib_utils.R # Shared utility functions (documented with roxygen2!)
├── initial_survival_models.qmd # Workshop Part 1: KM estimation, log-rank tests
├── expanded_coxph_models.qmd # Workshop Part 2: Cox regression, diagnostics
├── classic_survival_models.qmd # Supplementary: Classical methods reference
├── worksheet_survival.qmd # Legacy monolithic version (deprecated)
├── temp_*.qmd # Experimental notebooks (skipped by render-all)
├── Justfile # Task automation with extensive comments
├── .just-cache/ # Hash-based rebuild cache (gitignored)
├── quarto_render_output.log # Rendering logs
└── README.md # Project documentation
🛠️ R Code Style
Pipe Operators
# ✅ Use native pipe |> for most operations
result <- data |>
filter(condition) |>
mutate(new_col = value) |>
select(columns)
# ✅ Use magrittr %>% ONLY when you need (.) functionality
data %>% set_colnames(names(.) |> to_snake_case())
# Why? Native |> is faster and built into R 4.1+
Naming Conventions
# Functions: snake_case with verb-noun pattern
calculate_metric <- function() {...}
extract_features <- function() {...}
# Variables: snake_case with type suffix
customer_data_tbl # tibble
order_df # data frame
account_ids_vec # vector
config_lst # list
# Plots: _plot suffix
distribution_plot <- ggplot(...) + ...
survival_plot <- ggplot(...) + ...
# Models: descriptive_name_modeltype
baseline_lm # linear model
complex_brmsfit # brms Bayesian model
model1_coxph # Cox proportional hazards
# Shared vs Model-Specific objects:
# NO prefix for shared objects: training_tbl, validation_tbl, lookup_tbl
# modelN_ prefix for model-specific: model1_fit, model2_predictions
# Constants: UPPER_SNAKE_CASE
MAX_RETRIES <- 5
BATCH_SIZE <- 1000
RANDOM_SEED <- 4000
# Temporary: temp_ prefix
temp_calculated_age <- year(birthdate) - current_year
Column Name Standardization
# ✅ ALWAYS convert to snake_case immediately after reading
# (use %>% here - the native |> pipe has no . placeholder)
data <- read_excel("file.xlsx") %>%
  set_colnames(names(.) |> to_snake_case())
# Standard conventions:
# - All lowercase: customer_id (not CustomerID)
# - No spaces: product_name (not "Product Name")
# - Underscores for clarity: date_created, is_active
# - Minimal abbreviations: email (not eml), category (not cat)
# - Boolean prefix: is_, has_, should_, can_
Joins (Always Specify Relationship!)
# ✅ CORRECT - Explicit relationship prevents warnings
left_join(data1, data2, by = "key", relationship = "many-to-many")
left_join(data1, data2, by = "key", relationship = "many-to-one")
left_join(data1, data2, by = "key", relationship = "one-to-one")
# ❌ WRONG - Generates warnings in dplyr >= 1.1.0
left_join(data1, data2, by = "key")
Avoid Base R Shortcuts - Use Tidyverse Pipelines
# ❌ BAD - Using $ accessor
sum(data$column == value)
mean(data$column)
# ✅ GOOD - Tidyverse pipeline
data |> filter(column == value) |> nrow()
data |> pull(column) |> mean(na.rm = TRUE)
Use summarise() for Aggregations
# ✅ Group multiple statistics in a single summarise() block
summary_tbl <- data_tbl |>
summarise(
count = n(),
mean_val = mean(value, na.rm = TRUE),
sd_val = sd(value, na.rm = TRUE),
min_val = min(value, na.rm = TRUE),
max_val = max(value, na.rm = TRUE)
)
Parentheses Indentation (CRITICAL for readability)
# Function arguments indented 2 spaces from opening parenthesis
# Closing ) on own line at same indentation as function call
# ❌ BAD - Wrong indentation
data_tbl |>
  select(
    id, value
    )

# ✅ GOOD - Proper indentation
data_tbl |>
  select(
    id, value
  )
# For ggplot - same rule for each layer
ggplot(data_tbl) +
geom_point(
aes(x = var1, y = var2, color = group),
alpha = 0.5,
size = 2
) +
labs(
title = "Plot Title",
x = "X Label",
y = "Y Label"
)
Command Line Arguments (Optional - for scripts)
# Use argparse package for all R scripts
library(argparse)
parser <- ArgumentParser(
description = "Script description",
formatter_class = "argparse.RawDescriptionHelpFormatter",
epilog = paste(
sep = "\n",
"ENVIRONMENT VARIABLES:",
" VAR_NAME Description of variable",
"",
"EXAMPLES:",
" Rscript script.R --option value"
)
)
parser$add_argument("--option", dest = "option_name", help = "Description")
args <- parser$parse_args()
# Access with underscores (not dashes)
value <- args$option_name
Logging Standards
# Prefer structured logging over print()
write_log_entry <- function(section, message) {
timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
cat(glue("[{timestamp}] [{section}] {message}\n"))
}
# Usage
write_log_entry("STARTUP", "Starting data processing")
write_log_entry("PROCESSING", glue("Processed {nrow(data)} rows"))
write_log_entry("COMPLETE", "Processing finished successfully")
Output Formatting in Quarto (Optional - for notebooks)
# Use write_lines() from readr - works regardless of chunk options
write_lines("Summary Statistics:", stdout())
# With glue for formatting
summary_tbl <- data_tbl |>
summarise(
count = n(),
mean_val = mean(value, na.rm = TRUE)
)
write_lines(glue(
"Summary Statistics:
Count: {summary_tbl$count}
Mean: {format(summary_tbl$mean_val, digits = 4)}"
), stdout())
📚 Data Persistence
File Formats
# Use parquet for tibbles (fast, portable, compressed)
data_tbl |> write_parquet_compressed("output/data.parquet")
# Use qs2 for complex objects (2-10x faster than RDS)
model_results_lst |> qs_save("models/results.qs")
# Helper function in lib_utils.R
write_parquet_compressed <- function(data, path) {
arrow::write_parquet(data, path, compression = "zstd", compression_level = 3)
}
Always Use Pipe Notation
# ✅ GOOD
object_tbl |> write_parquet_compressed(path)
object_lst |> qs_save(path)
# ❌ BAD
write_parquet(object_tbl, path)
Manual Caching Pattern
# For expensive operations (models, large computations)
cache_file <- "models/model1_results.qs"
if (file_exists(cache_file)) {
model1_results <- qs_read(cache_file)
} else {
model1_results <- expensive_computation(...)
model1_results |> qs_save(cache_file)
}
Save Before Remove
# Save large objects before clearing from memory
large_tbl |> write_parquet_compressed("output/large_tbl.parquet")
rm(large_tbl)
gc() # Trigger garbage collection
📚 Library File Organization
lib_utils.R - Common Utilities
#' Write Log Entry
#'
#' Appends a timestamped log message to console
#'
#' @param section Character string for log section (e.g., "STARTUP", "PROCESSING")
#' @param message Log message to write
#'
#' @examples
#' \dontrun{
#' write_log_entry("PROCESSING", "Starting analysis")
#' }
write_log_entry <- function(section, message) {
timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
cat(glue("[{timestamp}] [{section}] {message}\n"))
}
#' Write Compressed Parquet
#'
#' Writes parquet with zstd compression level 3
#'
#' @param data Tibble or data frame to write
#' @param path Output file path
#'
#' @examples
#' \dontrun{
#' data_tbl |> write_parquet_compressed("output/data.parquet")
#' }
write_parquet_compressed <- function(data, path) {
arrow::write_parquet(data, path, compression = "zstd", compression_level = 3)
}
Domain-Specific Libraries
Create separate library files for different domains:
- lib_data_import.R - Data loading and import functions
- lib_data_quality.R - Validation and quality checks
- lib_modeling.R - Model fitting and prediction (Optional)
- lib_visualization.R - Plotting functions (Optional)
roxygen2 Documentation Template
#' Brief One-Line Title
#'
#' More detailed description of what the function does and why it exists.
#' Explain use cases and important context.
#'
#' @param param1 Description with type (e.g., "Numeric vector of customer IDs")
#' @param param2 Description with defaults (e.g., "Maximum records, default 1000")
#' @param output_path Optional output file path (character string), default NULL
#'
#' @return Tibble with columns: col1 (numeric), col2 (character), col3 (date)
#'
#' @details Processing steps:
#' 1. Step one explanation
#' 2. Step two explanation
#' 3. Final output generation
#'
#' @note Performance: Processes ~10K records/second. Uses parallel processing.
#'
#' @examples
#' \dontrun{
#' result <- function_name(
#' param1 = sample_data,
#' param2 = 100,
#' output_path = "output.parquet"
#' )
#' glimpse(result)
#' }
function_name <- function(param1, param2 = 1000, output_path = NULL) {
# Implementation
}
Before asking "what parameters does X take?" → Read the roxygen2 docs in the file!
🧪 Common Patterns
Data Loading and Validation
# Load data
data_tbl <- read_parquet("data/processed.parquet")
# Validate structure
glimpse(data_tbl)
summary(data_tbl)
# Check for nulls
data_tbl |>
summarise(
across(everything(), ~sum(is.na(.)))
)
# Verify dimensions
write_log_entry("DATA", glue("Loaded {nrow(data_tbl)} rows, {ncol(data_tbl)} columns"))
Join Pattern with Validation
# Join with explicit relationship
combined_tbl <- left_data_tbl |>
left_join(
right_data_tbl,
by = "key",
relationship = "many-to-one"
) |>
# Validate immediately after join
filter(!is.na(key))
# Check join success
combined_tbl |>
summarise(
total_rows = n(),
missing_values = sum(is.na(joined_column)),
match_rate = 100 * (1 - sum(is.na(joined_column)) / n())
)
Parallel Processing Pattern (Optional - for ETL projects)
library(furrr)
# Set up parallel processing (8 workers max)
plan(multisession, workers = 8)
# Process in parallel
results_tbl <- items |>
future_map_dfr(
~process_item(.x),
.progress = TRUE
)
# Reset to sequential
plan(sequential)
Testing with Small Samples
# Always test with small sample first
test_size <- 100
test_data <- full_data |> slice(1:test_size)
# Run function on test data
test_result <- process_function(test_data)
# Validate structure
glimpse(test_result)
summary(test_result)
# If successful, run on full data
full_result <- process_function(full_data)
🎨 Quarto Notebooks (Optional - for analysis projects)
Chunk Options
#| echo: false # Hide code by default
#| message: false # Suppress messages
#| warning: false # Suppress warnings
#| fig-width: 10 # Figure width
#| fig-height: 6 # Figure height
Descriptive Chunk Labels
Use descriptive labels: load_data, fit_model1, visualize_results
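For instance, a labeled chunk in Quarto's hash-pipe syntax (the model and column names below are illustrative, not taken from the workshop code):

```r
#| label: fit_model1
#| echo: false

# Column names (churn_time, churn_event, intl_plan) are illustrative,
# not taken from the workshop dataset
model1_coxph <- coxph(
  Surv(churn_time, churn_event) ~ intl_plan,
  data = training_tbl
)
```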
Narrative Text
Add explanatory text before and after code chunks explaining purpose and results.
Markdown List Formatting in Narrative Sections
CRITICAL: Lists in Quarto documents follow standard Markdown rules:
# ✅ CORRECT
Explanatory paragraph about what comes next:

  1. First numbered item
  1. Second numbered item (auto-numbered by Markdown)
  1. Third item with continuation
     that wraps to next line

Another paragraph after the list.

# ❌ WRONG - No blank line before list
This paragraph introduces:
  1. List item (will merge with paragraph!)

# ❌ WRONG - Blank lines between items
  1. First item

  1. Second item (creates two separate lists!)

# ❌ WRONG - Manual numbering
  1. First item
  2. Second item (fragile when reordering)
  3. Third item
Common Patterns in This Workshop:
- Bulleted lists for examples, properties, characteristics
- Numbered lists for step-by-step procedures, key findings
- Always use `1.` for numbered items (let Markdown handle numbering)
- Ensure blank line before list, no blank lines between items
- Use two-space indentation consistently
Multi-Model Organization
# Load Data section
training_tbl <- read_parquet(...)
validation_tbl <- read_parquet(...)
# Create Shared Data Objects section
lookup_tbl <- tibble(...)
subset_tbl <- training_tbl |> filter(...)
# Build Model 1 section
model1_fit <- fit_model(..., data = training_tbl)
model1_predictions <- predict(model1_fit, newdata = subset_tbl)
# Build Model 2 section
model2_fit <- fit_model(..., data = training_tbl)
model2_predictions <- predict(model2_fit, newdata = subset_tbl)
Chunk Timing (Optional but recommended)
# Setup timing hooks
chunk_times <- list()
knitr::knit_hooks$set(
time_it = function(before, options, envir) {
if (before) {
chunk_times[[options$label]] <<- list(start = Sys.time())
} else {
chunk_times[[options$label]]$end <<- Sys.time()
chunk_times[[options$label]]$elapsed <<-
chunk_times[[options$label]]$end |>
difftime(chunk_times[[options$label]]$start) |>
as.numeric()
}
}
)
knitr::opts_chunk$set(time_it = TRUE)
# Add timing summary section before R Environment
🎯 Statistical Modeling (Optional - remove if not applicable)
Model Naming and Caching
# Descriptive names with model type
baseline_lm <- lm(...)
model1_brmsfit <- brm(...)
# Cache fitted models
model1_brmsfit <- brm(
formula,
data = training_tbl,
family = gaussian(),
backend = "cmdstanr",
seed = 4000,
file = "models/model1_brmsfit", # Auto-caching
output_dir = "stan_output", # CSV location
output_basename = "model1_brmsfit" # Readable names
)
Always Set Seed
# For reproducibility
set.seed(4000)
# For Stan/brms models
brm(..., seed = 4000)
Model Diagnostics Checklist
When adding model diagnostics (a sketch for the Cox case follows this list):
- Convergence checks (Rhat, ESS)
- Residual plots
- Posterior predictive checks (for Bayesian)
- Model comparison metrics (AIC, BIC, LOO-CV)
- Visualization of fitted vs. actual
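A hedged diagnostics sketch for the Cox case, assuming a fitted model named model1_coxph (the name follows the conventions above; it is not taken from the workshop code):

```r
# Hedged sketch: diagnostics for a fitted Cox model. The object name
# model1_coxph is illustrative.
library(survival)

# Proportional hazards assumption via scaled Schoenfeld residuals
ph_test <- cox.zph(model1_coxph)
print(ph_test)   # per-covariate and global tests
plot(ph_test)    # look for non-flat smooths over time

# Martingale residuals (functional form) and deviance residuals (outliers)
martingale_res <- residuals(model1_coxph, type = "martingale")
deviance_res   <- residuals(model1_coxph, type = "deviance")

# Fit summaries for model comparison
summary(model1_coxph)$concordance  # concordance index with standard error
AIC(model1_coxph)                  # information criterion
```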
📊 Visualization Guidelines
Standard Theme
# Set theme at start of notebook/script
library(cowplot)
theme_set(theme_cowplot())
Proper Labeling
ggplot(data_tbl) +
geom_point(
aes(x = var1, y = var2, color = group),
alpha = 0.5
) +
labs(
title = "Descriptive Title",
subtitle = "Additional context",
x = "X Axis Label",
y = "Y Axis Label",
caption = "Source: Data description"
) +
scale_y_continuous(labels = scales::label_comma()) +
scale_color_brewer(palette = "Set1")
Multi-Panel Plots
library(cowplot)
# Combine plots
plot_grid(
plot1, plot2, plot3,
ncol = 2,
labels = c("A", "B", "C")
)
🚫 Common Mistakes to Avoid
1. Missing Join Relationships
# ❌ WRONG - Generates warnings
left_join(data1, data2, by = "key")
# ✅ CORRECT - Explicit relationship
left_join(data1, data2, by = "key", relationship = "many-to-many")
2. Markdown List Formatting Errors
# ❌ BAD - No blank line before list
Text introducing the list:
  - Item 1 (merges with paragraph!)

# ❌ BAD - Blank lines between items
  - Item 1

  - Item 2 (separate lists!)

# ❌ BAD - No indentation
- Item 1 (inconsistent)

# ❌ BAD - Manual numbering
  1. First
  2. Second
  3. Third (fragile when reordering)

# ✅ GOOD - Proper formatting
Text introducing the list:

  - Item 1
  - Item 2
  - Item 3 continues on next line
    with proper indentation

Next paragraph.

# ✅ GOOD - Auto-numbered list
Numbered list example:

  1. First item
  1. Second item
  1. Third item
3. Too Many Parallel Workers
# ❌ DANGEROUS - May crash
plan(multisession, workers = 16)
# ✅ SAFE - 8 workers maximum
plan(multisession, workers = 8)
4. Using print() Instead of Logging
# ❌ BAD - Clutters output
print(paste("Processing", nrow(data), "records"))
# ✅ GOOD - Structured logging
write_log_entry("PROCESSING", glue("Processing {nrow(data)} records"))
5. Not Converting Column Names on Import
# ❌ BAD - Inconsistent names
data <- read_csv("messy_data.csv")
# ✅ GOOD - Standardize immediately
data <- read_csv("messy_data.csv") |>
set_colnames(names(.) |> to_snake_case())
6. Skipping roxygen2 Documentation
# ❌ BAD - Undocumented
process_data <- function(x, y) {...}
# ✅ GOOD - Complete documentation
#' Process Data with Validation
#'
#' @param x Numeric vector of values
#' @param y Character vector of labels
#' @return Tibble with processed results
process_data <- function(x, y) {...}
7. Not Testing Before Full Run
# ❌ BAD - Running on full dataset without testing
result <- process_all_items(million_items)
# ✅ GOOD - Test with sample first
test_sample <- million_items |> slice(1:100)
test_result <- process_all_items(test_sample)
glimpse(test_result) # Verify before full run
8. Changing Seeds Without Documentation
# ❌ BAD - Breaks reproducibility
set.seed(sample(1:10000, 1))
# ✅ GOOD - Consistent, documented seed
set.seed(42) # Project standard seed
🔧 Troubleshooting
"Resource temporarily unavailable" with furrr
Symptom: Parallel processing crashes
Cause: Thread explosion from linear algebra libraries
Solution: Set threading environment variables in the container:
-e OPENBLAS_NUM_THREADS=1
-e OMP_NUM_THREADS=1
-e MKL_NUM_THREADS=1
"Permission denied" on volume mounts
Symptom: Container cannot read or write files
Cause: SELinux blocking access
Solution 1: Add :Z suffix: -v ./data:/data:Z
Solution 2: Disable SELinux: --security-opt label=disable
Joins creating unexpectedly large output
Symptom: left_join() produces many more rows than input
Cause: Many-to-many join without filtering
Solution: Add filtering or check relationship:
combined <- data1 |>
left_join(data2, by = "key", relationship = "many-to-one") |>
filter(!is.na(key))
Container will not start with RUNROOTLESS=true
Symptom: rocker/tidyverse fails to start
Cause: Image needs root for setup
Solution: Use RUNROOTLESS=false
📦 Recommended R Packages
Core Data Wrangling
- tidyverse - Data manipulation (dplyr, tidyr, readr, ggplot2)
- arrow - Parquet and columnar formats
- qs2 - Fast object serialization
- lubridate - Date/time operations
- glue - String interpolation
- fs - Cross-platform filesystem operations
Data Import/Export
- readxl - Excel files
- haven - SPSS, Stata, SAS
- jsonlite - JSON data
Parallel Processing (Optional - for ETL)
- furrr - Parallel functional programming
- future - Parallel execution backend
Statistical Modeling (Optional - for analysis)
- survival - Survival analysis
- brms - Bayesian regression
- rstanarm - Applied regression modeling
- tidybayes - Tidy Bayesian analysis
Visualization
- cowplot - Publication-quality plots
- patchwork - Combine plots
- scales - Scale formatting
Utilities
- conflicted - Namespace conflict management (usage sketch below)
- argparse - Command-line arguments
- tictoc - Timing benchmarks
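A usage sketch for conflicted (conflicts_prefer() assumes conflicted >= 1.2.0):

```r
# Sketch: resolve common namespace conflicts up front with conflicted
library(conflicted)

conflicts_prefer(dplyr::filter)  # prefer dplyr::filter over stats::filter
conflicts_prefer(dplyr::lag)     # prefer dplyr::lag over stats::lag
```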
Performance Optimization
- Arrow threading: `arrow::set_cpu_count()`, `arrow::set_io_thread_count()` (see the sketch after this list)
- Parquet compression: zstd level 3 (via `write_parquet_compressed()`)
- qs2 package: Automatic compression/speed optimization
- Manual caching: Use the `file_exists()` caching pattern for objects over 2GB
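A sketch of pinning arrow's thread pools explicitly; the counts are illustrative, not project policy:

```r
# Sketch: pin arrow's thread pools explicitly (counts are illustrative)
library(arrow)

set_cpu_count(4)        # compute threads for arrow kernels
set_io_thread_count(4)  # threads for file/network IO
```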
✅ Pre-Flight Checklist
Before Running Analysis/Processing
- All library files have roxygen2 documentation
- Column names standardized to snake_case
- Join relationships explicitly specified
- Threading environment variables set in Podman
- Tested with small sample first
- Random seeds set for reproducibility
- Expensive computations cached
- Podman container running with proper volumes
Data Quality Checks
# Structure validation
glimpse(data)
summary(data)
# Null checks
data |> summarise(across(everything(), ~sum(is.na(.))))
# Dimension checks
write_log_entry("VALIDATION", glue("{nrow(data)} rows, {ncol(data)} cols"))
# Duplicate checks (if applicable)
data |> summarise(duplicates = n() - n_distinct(key_column))
🎯 Tips for AI Assistants
- Context matters: Check existing code for patterns before suggesting new approaches
- Preserve style: Match existing naming and formatting conventions
- Document thoroughly: Add roxygen2 headers and explanatory text
- Consider performance: Large datasets may need sampling or parallelization
- Validate assumptions: Test assumptions explicitly with data checks
- Reproducibility first: Always set seeds, document versions
- Read existing docs: Check roxygen2 headers before asking about parameters
- Use tidyverse patterns: Avoid base R shortcuts, use pipelines
- Explicit relationships: Always specify join relationships
- Test incrementally: Small samples before full runs
- Output formatting: Use `write_lines(text, stdout())` in Quarto
- Proper indentation: Closing `)` on its own line; arguments indented 2 spaces from the opening call
- Markdown lists: Blank line before, no blanks between, two-space indent, use `1.` for auto-numbering
- Check list formatting: Always verify lists will render as single unified lists
❓ Questions to Ask Before Making Changes
- Have I read the relevant existing code?
- Am I following the project's naming conventions?
- Do I understand the domain context?
- Have I included roxygen2 documentation for new functions?
- Will this work with the Podman environment?
- Are there similar implementations I can learn from?
- Have I considered computational cost?
- Is this change consistent with the project goals?
- Am I using `write_lines(text, stdout())` for Quarto output?
- Are all tibbles named with the `_tbl` suffix?
- Are join relationships explicitly specified?
- Are function arguments properly indented?
- Have I set seeds for reproducibility?
- Do lists have a blank line before but not between items?
- Are list items using two-space indentation?
- Are numbered lists using `1.` for auto-numbering?
🛠️ Justfile Commands Reference
Available Commands
# List all commands
just --list
# or
just
# Project information
just info # Show project configuration
just check-quarto # Verify Quarto installation
just check-r # Check R in container
just check-data # Validate data files exist
just check-precompute # Check Part 1 output files
just check-dependencies # Validate all required files for Part 2
# Development
just watch-workshop # Auto-render workshop sequence on changes (requires entr)
just watch <notebook> # Watch specific notebook
just validate # Check all QMD files without rendering
just list-notebooks # Show all available notebooks
just list-html # Show rendered HTML files
# Data management
just list-data # List CSV data files
just data-size # Show data directory size
# Utilities
just disk-usage # Show project disk usage
just count-code # Count lines of code
just show-logs # View recent rendering logs
just preview <file> # Preview HTML in terminal (needs w3m/lynx)
just open-rstudio # Open RStudio in browser (Linux)
Hash-Based Smart Rendering
The Justfile implements content-based caching:
- Only re-renders when QMD file or dependencies (lib_utils.R) change
- Tracks changes via MD5 hashes in `.just-cache/`
- Prevents unnecessary re-renders when only comments/whitespace change
Container vs Host Rendering
Container rendering (RECOMMENDED):
- Uses Podman container for rendering (no host R/Quarto needed)
- Targets: `render-container`, `render-container-all`, `render-container-sequence`
- Requires the container to be running (`just podman-run-image`)
Host rendering:
- Requires Quarto + R installed on host system
- Targets: `render-host`, `render-host-all`
- Useful for quick edits when the container is not running
Dependency Ordering
- `render-container-all` enforces sequential rendering (primary notebooks in dependency order)
- Current sequence: `initial_survival_models.qmd` → `expanded_coxph_models.qmd`
- Automatically skips `temp_*` experimental notebooks
- Later parts require earlier parts' precompute/ output files (see the sketch after this list)
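A minimal dependency guard a later part can run before loading Part 1 outputs; the file names come from the project-structure listing above, and the error message wording is illustrative:

```r
# Dependency guard sketch for a later workshop part
library(fs)

required_files <- c(
  "precompute/telco_churn_cat.parquet",
  "precompute/model00_null.qs"
)

missing_files <- required_files[!file_exists(required_files)]

if (length(missing_files) > 0) {
  stop(
    "Missing Part 1 outputs: ",
    paste(missing_files, collapse = ", "),
    ". Render initial_survival_models.qmd first."
  )
}
```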
📝 Environment Variables
Standard environment variables in Dockerfile:
# Timezone
TZ=Europe/Dublin
# RStudio credentials (set in Justfile)
USER=rstudio
PASSWORD=CHANGEME
# User/Group mapping
USERID=$(id -u)
GROUPID=$(id -g)
🚀 Quick Start
# 1. Build image (first time only)
just podman-build-image
# 2. Start container
just podman-run-image
# 3. Setup SSH tunnel (if remote)
just ssh-tunnel
# Copy the displayed command, run on local machine
# 4. Access RStudio
# Open: http://localhost:8787
# User: rstudio / Password: CHANGEME
# 5. In RStudio, load libraries and test
library(tidyverse)
library(survival)
library(survminer)
source("lib_utils.R")
# 6. Run quick validation
telco_churn_tbl <- read_csv("data/telcochurn.csv")
glimpse(telco_churn_tbl)
# 7. Render workshop sequence
just workshop-sequence # Current workshop parts in order
# or use container rendering
just render-container-sequence # Same, but inside container
🔄 Recent Updates & Known Issues
Justfile Enhancements (2026-01-08)
- Comprehensive comments added: All Justfile sections now have detailed explanations
- Project configuration variables with purpose and usage
- Hash-based caching logic and rebuild conditions
- Sequential dependency ordering enforcement (primary notebooks rendered in order)
- Container vs host rendering strategies
- Temp file skipping for experimental notebooks
- Container rendering targets: Added `render-container`, `render-container-all`, `render-container-sequence`
- Host rendering targets: Added `render-host`, `render-host-all` (require Quarto on host)
- Smart dependency ordering: `render-container-all` respects notebook dependencies automatically
- Temp file handling: All `temp_*.qmd` files automatically skipped in batch renders
- Scalable structure: Workshop can expand to additional parts (Part 3, 4, etc.) without Justfile changes
Markdown List Formatting (2026-01-08)
- Critical formatting rules established: All lists now follow strict Markdown conventions
- Blank line required BEFORE list (separates from preceding paragraph)
- NO blank lines BETWEEN list items (prevents splitting into multiple lists)
- Two-space indentation for all list items
- Use `1.` for all numbered items (Markdown auto-numbers sequentially)
- Four-space indentation for continuation lines
- Workshop-wide formatting audit completed: All ~30+ lists verified and corrected
- Impact: Proper rendering across all Markdown processors (Quarto, GitHub, Pandoc)
Survival Analysis Workshop Content (2026-01-08)
- Expanded pedagogical sections: Added detailed explanations for Cox PH model sections
- Model syntax and output interpretation
- Assessment metrics (pseudo-R², concordance index) with benchmarks
- Model building narrative for predictor selection
- Comprehensive residual diagnostics (martingale, deviance)
- Proportional hazards assumption testing
- Cross-references added: Theory sections now link to diagnostic validation sections
- 80-column formatting maintained throughout document
Last Updated: 2026-01-08
Maintainer: Mick Cooney (mcooney@describedata.com)
📖 Template Usage Guide
When using this template:
- Replace all placeholders in `[BRACKETS]` with project-specific information
- Remove inapplicable sections:
- Remove "Statistical Modeling" section for pure ETL projects
- Remove "Parallel Processing" section if not doing batch processing
- Remove "Quarto Notebooks" section for script-only projects
- Add domain-specific sections: Include domain knowledge crucial for understanding
- Customize conventions: Adjust to match existing team standards
- Document gotchas: Add common mistakes specific to your domain
- Keep updated: Treat as living document that evolves with project
- Be specific: Generic guidelines less helpful than concrete examples
- Include examples: Show actual code patterns to follow
- Think like an AI: What context helps understand this codebase quickly?
Goal: Make AI assistants maximally effective by providing clear conventions, domain context, common patterns, and anti-patterns.