name: verify-dag-runtimes description: Query prod Airflow for the last N successful runs of a set of DAGs, compute avg / min / max durations, and produce a markdown report. Use this before proposing a schedule change, before updating runtime estimates in Notion, or any time someone asks "how long does X actually take in prod?".
Verify DAG runtimes from prod Airflow
What this produces
A markdown table comparing actual prod runtimes (avg of last N successful runs) to an optional expected runtime, with min/max and a variance flag. Use it to:
- Refresh runtime estimates for the Notion ingestion schedule
- Back a proposed schedule reshuffle with real data
- Detect DAGs that have drifted (actual much slower than documented) or have no recent successful runs
Prerequisites
kubectlcontext is set to the prod cluster. Runkubectl config current-context— the exact context name varies per user, but it will contain the tokensunic,prod, andetl(e.g.unic.prod.etl).- You have exec access to namespace
unic-prod
If the context isn't set, ask the user to switch contexts themselves — don't switch it for them.
Steps
1. Gather inputs from the user
Ask — don't assume — for:
- DAG list. Either: specific DAG IDs, "all DAGs on the Notion schedule", or "all active non-paused DAGs." If the user points to the Notion page, fetch it and extract the red (our DAGs) and green (project DAGs) entries.
- N (runs to average over). Default: 5. More runs = more stable average but stale runs may no longer reflect current data volumes.
- Expected runtimes (optional). If the user wants a comparison, ask where (Notion page URL, or explicit list).
2. Discover the adhoc pod dynamically
The pod name has a rolling hash. Always look it up:
POD=$(kubectl -n unic-prod get pods -l app=airflow-adhoc-queries -o jsonpath='{.items[0].metadata.name}')
3. Pull successful run history per DAG
kubectl -n unic-prod exec "$POD" -- airflow dags list-runs -d <DAG_ID> --state success -o json
The output is a JSON array of runs, newest first, each with start_date and end_date in ISO 8601. Parse the first N entries.
Serial kubectl exec calls are fine — ~1–2s each. For ~30 DAGs this takes about a minute. Don't try to parallelize unless the user asks; large parallel exec fan-outs can trip rate limits.
4. Compute stats
For each DAG's N runs: duration = end_date - start_date (minutes). Compute avg, min, max.
Apply these verdict rules:
| Condition | Verdict |
|---|---|
| 0 successful runs | "no data" — report separately |
avg > 1.5 * expected or avg - expected > 15 min | ⚠️ over |
avg < 0.5 * expected and expected - avg > 5 min | ✨ under — estimate may be too generous |
max - min > 30 min | high variance (note alongside the verdict) |
| otherwise | ✓ close |
Print durations in human-friendly format: 6m, 1h07, 3h15. Under 1 min → seconds.
5. Output a markdown report
Structure:
# Average runtime report (last N successful runs per DAG)
## Within tolerance ✓
| DAG | Expected | Avg | Min–Max |
|---|---|---|---|
...
## Notion overstates ✨
...
## Notion understates ⚠️
...
## No usable data
- `curated_cscmed_jobs` — 0 runs in history (explain why if known)
If the user didn't provide expected values, drop the "Expected" column and just show avg/min/max.
6. Flag special cases
- Highly variable DAGs (max > 2× avg, e.g.
enriched_sprintkid): call out explicitly, note the worst case, and suggest root-cause investigation. - DAGs with only failed runs recently: check
--state failedto see if the DAG is broken or the schedule just hasn't fired recently. Report failed-run durations as a sanity check on the expected runtime. - Brand-new DAGs with zero history: just say so; don't try to infer.
Known Notion-specific gotchas when comparing
- Notion entries of the form
pericalm4 (2 min)under a blue cell are IT loads, not our DAGs — don't match them to our Airflow DAG IDs. (1/4)on a Notion entry means monthly cadence. The DAG usually has 4× fewer runs to average over; N=5 might pull runs from a year back. Mention the staleness risk.- Some Notion names don't map 1:1 to dag_ids:
index_patient=curated_unic,warehouse=warehouse_unic,clinibase-ci=anonymized_clinibaseci. Ask the user if a mapping isn't obvious.