name: data-analysis
description: End-to-end R data analysis for the sewage project. Writes analysis scripts following project conventions (here::here, arrow/parquet, fixest, modelsummary, native pipe), runs code review, and produces publication-ready tables and figures. This skill should be used when asked to "run an analysis", "estimate the model", "add a specification", or "write an R script".
argument-hint: "[dataset path, analysis goal, or specification description]"
allowed-tools: ["Read", "Grep", "Glob", "Write", "Edit", "Bash", "Agent"]
Data Analysis
Run an end-to-end data analysis following sewage project conventions.
Input: $ARGUMENTS — a dataset path, analysis goal description, or specification to estimate.
Project-Specific Context
Analysis Organisation
Scripts in scripts/R/09_analysis/ by approach:
- 01_descriptive/: Maps, scatter plots, Google Trends
- 02_hedonic/: Cross-sectional hedonic regressions
- 03_repeat_sales/: Repeat-transaction regressions
- 04_long_difference/: 250m grid-level long differences
- 05_news/: DiD and event studies with media coverage
- 06_upstream_downstream/: Directional spillover
- 07_dry_spills/: Dry spill analysis
Datasets
- data/final/: Analysis-ready datasets
- data/processed/: Intermediate pipeline outputs (parquet)
- All data loaded via arrow::read_parquet() or arrow::open_dataset()
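The difference between the two loaders matters mostly for large intermediate files: read_parquet() pulls the whole file into memory, while open_dataset() lets filters and column selection run before anything is materialised. A minimal sketch, assuming a toy file written to a temporary directory; the file name and the columns year and log_price are hypothetical stand-ins, not confirmed project names:

```r
library(arrow)
library(dplyr)

# Toy stand-in for a project file. In a project script the path would be
# built with here::here("data", "processed", ...) instead (name hypothetical).
toy_path <- file.path(tempdir(), "toy_transactions.parquet")
toy <- data.frame(
  year      = c(2019L, 2020L, 2021L, 2021L),
  log_price = c(12.1, 12.3, 12.4, 12.6)
)
write_parquet(toy, toy_path)

# Eager read: the full file is loaded as a data frame.
df_all <- read_parquet(toy_path)

# Lazy read: filter() and select() are evaluated by arrow,
# and rows are only pulled into R at collect().
df_recent <- open_dataset(toy_path) |>
  filter(year >= 2021L) |>
  select(year, log_price) |>
  collect()
```

For small analysis-ready files the eager read is usually simpler; the lazy form is worth it when only a subset of rows or columns is needed.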
Output Destinations
- Tables: output/tables/*.tex (modelsummary → LaTeX with tabularray)
- Figures: output/figures/*.pdf or *.png
- Regression objects: output/regs/*.rds
- HTML interactive: output/html_plots/
Required R Conventions
- here::here() for all paths
- Native pipe |>
- fixest::feols() for regressions with vcov = "hetero"
- modelsummary for table output (tabularray format, [H] placement)
- arrow for parquet I/O
- snake_case naming
- forcats::as_factor() for factors
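Several of these conventions can be seen together in a few lines. The sketch below uses simulated data so that it runs on its own; the variable names mirror the template in Step 2, but the values and the LSOA codes are made up for illustration:

```r
library(dplyr)
library(forcats)
library(fixest)

# Simulated stand-in data with snake_case column names.
set.seed(1)
n <- 200
toy <- data.frame(
  log_price    = rnorm(n, mean = 12, sd = 0.5),
  spill_count  = rpois(n, lambda = 3),
  lsoa         = sample(c("E01000001", "E01000002", "E01000003"), n, replace = TRUE),
  year_quarter = sample(c("2020Q1", "2020Q2", "2021Q1"), n, replace = TRUE)
)

# Native pipe plus forcats::as_factor() for the grouping variables.
toy_clean <- toy |>
  mutate(across(c(lsoa, year_quarter), as_factor))

# Two-way fixed effects with heteroskedasticity-robust standard errors.
model_toy <- feols(
  log_price ~ spill_count | lsoa + year_quarter,
  data = toy_clean,
  vcov = "hetero"
)
```

as_factor() keeps levels in order of first appearance rather than sorting them alphabetically, which is the main behavioural difference from base factor().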
Workflow
Step 1: Context Gathering
- Understand the analysis goal from $ARGUMENTS
- Read existing analysis scripts in the relevant subdirectory for patterns
- Read scripts/R/utils/spill_aggregation_utils.R if spill metrics are involved
- Check data/final/ for available datasets
- Read the relevant manuscript section in docs/overleaf/ if the analysis feeds into the paper
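Checking data/final/ can be done from R without loading any rows. One possible way, sketched under the assumption that the datasets are stored as individual .parquet files directly inside that directory:

```r
library(arrow)
library(here)

# List parquet files in data/final/ (returns an empty vector if none exist).
final_dir <- here("data", "final")
parquet_files <- list.files(final_dir, pattern = "\\.parquet$", full.names = TRUE)

# Print the column names of each file; only metadata is read, not the data.
for (path in parquet_files) {
  cat(basename(path), "\n")
  print(open_dataset(path)$schema$names)
}
```

This gives a quick view of which variables are available before deciding on a specification.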
Step 2: Write Analysis Script
Follow the analysis script structure:
# ================================================================
# [Descriptive Title]
# Purpose: [What this script does]
# Inputs: [Data files]
# Outputs: [Figures, tables, RDS files]
# ================================================================
# === 1. Setup ============================================
library(tidyverse)
library(fixest)
library(modelsummary)
library(arrow)
library(here)
# === 2. Data Loading =====================================
df <- read_parquet(here("data", "final", "dataset.parquet"))
# === 3. Main Analysis ====================================
model <- feols(
  log_price ~ spill_count | lsoa + year_quarter,
  data = df,
  vcov = "hetero"
)
# === 4. Tables and Figures ================================
modelsummary(
  list("Main" = model),
  output = here("output", "tables", "table_name.tex"),
  fmt = 3
)
# === 5. Export ============================================
saveRDS(model, here("output", "regs", "model_name.rds"))
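When the request is to add a specification, the same structure extends naturally to a named list of models passed to modelsummary(). The sketch below is illustrative only: the data are simulated, the model labels are invented, and the table is returned as a data frame so the example runs without writing to output/tables/.

```r
library(fixest)
library(modelsummary)

# Simulated stand-in for an analysis dataset.
set.seed(42)
n <- 300
toy <- data.frame(
  log_price    = rnorm(n, mean = 12, sd = 0.5),
  spill_count  = rpois(n, lambda = 3),
  lsoa         = sample(c("a", "b", "c", "d"), n, replace = TRUE),
  year_quarter = sample(c("2020Q1", "2020Q2", "2021Q1"), n, replace = TRUE)
)

# One list entry per specification; the names become column headers.
models <- list(
  "LSOA FE" = feols(
    log_price ~ spill_count | lsoa,
    data = toy, vcov = "hetero"
  ),
  "LSOA + quarter FE" = feols(
    log_price ~ spill_count | lsoa + year_quarter,
    data = toy, vcov = "hetero"
  )
)

# In a project script, output would be a .tex path built with here();
# "data.frame" is used here only to keep the sketch self-contained.
tab <- modelsummary(models, output = "data.frame", fmt = 3)
```

Keeping every specification in one list also makes it straightforward to save the whole set with a single saveRDS() call.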
Step 3: Code Review
After writing the script, review it against the 9 categories from /review-r:
- Script structure, console hygiene, reproducibility
- Function design, figure quality, data persistence
- Comments, error handling, polish
Fix any Critical or Major issues before presenting.
Step 4: Run the Script
If the user wants execution:
cd /Users/jacopoolivieri/Library/CloudStorage/Dropbox/01_projects/sewage
Rscript scripts/R/09_analysis/[subdir]/[script_name].R
Step 5: Present Results
- Results summary — Key estimates with SEs and economic interpretation
- Script created — Path and description
- Output files — Tables and figures generated
- Code review notes — Any conventions to flag
- TODO items — Missing data, additional specifications needed
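For the results summary, the key estimate and its standard error can be read back from the saved regression object rather than copied by hand. A minimal sketch, assuming a log-price outcome as in the Step 2 template and using simulated data with a temporary file in place of output/regs/:

```r
library(fixest)

# Simulated model standing in for a saved project regression.
set.seed(7)
n <- 300
toy <- data.frame(
  log_price   = rnorm(n, mean = 12, sd = 0.5),
  spill_count = rpois(n, lambda = 3),
  lsoa        = sample(c("a", "b", "c", "d"), n, replace = TRUE)
)
model <- feols(log_price ~ spill_count | lsoa, data = toy, vcov = "hetero")

# Round trip through RDS, as the export step of the template does.
rds_path <- file.path(tempdir(), "model_toy.rds")
saveRDS(model, rds_path)
model_loaded <- readRDS(rds_path)

# Point estimate and standard error for the regressor of interest.
est     <- coef(model_loaded)[["spill_count"]]
std_err <- se(model_loaded)[["spill_count"]]

# With a log outcome, 100 * (exp(b) - 1) is the implied percent change
# in price per additional spill.
pct_effect <- 100 * (exp(est) - 1)
ci_95 <- est + c(-1, 1) * 1.96 * std_err
```

Reporting pct_effect alongside the raw coefficient gives the economic interpretation directly.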
Principles
- Reproduce, don't guess. If a specific regression is requested, implement exactly that.
- Strategy alignment. If an analysis feeds into a manuscript section, the code must implement what the paper claims.
- Publication-ready output. Tables and figures should be directly includable in the paper.
- Follow existing patterns. Read neighbouring scripts in the same subdirectory for style consistency.
- Save everything. Every regression object saved as RDS, every table as LaTeX, every figure as PDF.
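The script template in Step 2 covers tables and regression objects but not figures. A possible pattern for the figure part of "save everything" is sketched below with simulated data; the plot contents, dimensions, and file name are placeholders, and a temporary path is used instead of output/figures/:

```r
library(ggplot2)

# Simulated data for a simple scatter plot.
set.seed(3)
toy <- data.frame(
  spill_count = rpois(100, lambda = 3),
  log_price   = rnorm(100, mean = 12, sd = 0.5)
)

p <- ggplot(toy, aes(x = spill_count, y = log_price)) +
  geom_point(alpha = 0.6) +
  labs(x = "Spill count", y = "Log price") +
  theme_minimal()

# In a project script this path would be here("output", "figures", ...).
fig_path <- file.path(tempdir(), "toy_scatter.pdf")
ggsave(fig_path, plot = p, width = 6, height = 4)
```

Passing the plot object explicitly to ggsave() avoids depending on whichever plot was displayed last.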