name: agent-health
type: workflow
description: "Reads production/traces/agent-metrics.jsonl and displays a per-agent performance summary table for the current or a specified session. Highlights agents with high error rates or OPEN circuit breaker state."
argument-hint: "[--session <branch>] [--agent <name>] [--since <YYYY-MM-DD>] [--log]"
user-invocable: true
allowed-tools: Read, Write, Bash
effort: 1
when_to_use: "Run at the end of a session, sprint, or after repeated agent failures to identify which agents are struggling. Also useful before dispatching a multi-agent workflow to check circuit breaker states."
Agent Health
Display a performance summary table from production/traces/agent-metrics.jsonl,
cross-referenced with production/session-state/circuit-state.json for live
circuit breaker states.
Steps
1. Parse arguments
| Flag | Default | Description |
|---|---|---|
| --session <branch> | current branch | Filter entries by session field |
| --agent <name> | all | Show only this agent |
| --since <date> | no limit | Only entries with date >= YYYY-MM-DD |
| --log | false | If set, append a fresh metrics snapshot to agent-metrics.jsonl |
Get current branch: git branch --show-current.
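A minimal sketch of how the flags in the table could be parsed, assuming a Python helper does the parsing. The function names `parse_args` and `current_branch` are hypothetical; only the flags and the `git branch --show-current` command come from this document. `--session` is left as `None` here so the caller can fall back to the current branch.

```python
import argparse
import subprocess
from datetime import date


def current_branch() -> str:
    """Return the checked-out branch name via `git branch --show-current`."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_args(argv):
    """Parse the four /agent-health flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="agent-health")
    parser.add_argument("--session", default=None,
                        help="branch name; caller falls back to the current branch")
    parser.add_argument("--agent", default=None, help="show only this agent")
    parser.add_argument("--since", type=date.fromisoformat, default=None,
                        help="only entries with date >= YYYY-MM-DD")
    parser.add_argument("--log", action="store_true",
                        help="append a fresh metrics snapshot")
    return parser.parse_args(argv)
```

A caller would then resolve the session with `args.session or current_branch()`.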
2. Read data sources
Read both files in parallel:
- production/traces/agent-metrics.jsonl: historical metrics per agent per session
- production/session-state/circuit-state.json: live circuit breaker states
If agent-metrics.jsonl contains only the schema header line (no actual entries):
📭 No agent metrics recorded yet for this session.
Metrics are written when agents use /agent-health --log
or at the end of a session via /save-state.
Circuit breaker states (live):
[show table from circuit-state.json only]
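The header-only check can be sketched as follows. The layout of the schema header line is not specified here, so this sketch assumes that real entries are JSON objects carrying an `agent` field and that the header does not; `parse_entries` and `has_metrics` are hypothetical names.

```python
import json
from typing import Iterable


def parse_entries(lines: Iterable[str]) -> list:
    """Return metric entries from JSONL lines, skipping blanks and the header."""
    entries = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        # Assumption: only real entries have an "agent" key; the schema header lacks it.
        if isinstance(record, dict) and "agent" in record:
            entries.append(record)
    return entries


def has_metrics(lines: Iterable[str]) -> bool:
    """True when at least one real entry exists; False means show the empty message."""
    return bool(parse_entries(lines))
```

When `has_metrics` returns False, the workflow would print the empty-state message and fall back to the circuit breaker table alone.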
3. Aggregate metrics
For each agent, compute across the filtered entries:
- total_tasks = tasks_completed + tasks_failed + tasks_blocked
- success_rate = tasks_completed / total_tasks * 100 (0 if no tasks)
- error_rate = latest error_rate field value
- circuit_state = from circuit-state.json (live, not from log)
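The aggregation rules above might look like this in Python. This is a sketch under assumptions: entries arrive in file order (so the last one seen is the latest), `circuit_states` is a plain agent-to-state mapping already extracted from circuit-state.json (the real file layout is not described here), and agents missing from that mapping are reported as `UNKNOWN`, a placeholder this sketch introduces.

```python
from collections import defaultdict

COUNT_FIELDS = ("tasks_completed", "tasks_failed", "tasks_blocked")


def aggregate(entries, circuit_states):
    """Sum task counts per agent and attach success rate and live circuit state."""
    totals = defaultdict(lambda: {"tasks_completed": 0, "tasks_failed": 0,
                                  "tasks_blocked": 0, "error_rate": 0.0})
    for entry in entries:
        agg = totals[entry["agent"]]
        for field in COUNT_FIELDS:
            agg[field] += entry.get(field, 0)
        # Latest entry wins for error_rate, per the rule above.
        agg["error_rate"] = entry.get("error_rate", agg["error_rate"])

    summary = {}
    for agent, agg in totals.items():
        total = sum(agg[field] for field in COUNT_FIELDS)
        success = agg["tasks_completed"] / total * 100 if total else 0.0
        summary[agent] = {
            **agg,
            "total_tasks": total,
            "success_rate": round(success, 1),
            # Live state comes from circuit_states, never from the log entries.
            "circuit_state": circuit_states.get(agent, "UNKNOWN"),
        }
    return summary
```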
4. Render health table
🏥 Agent Health Report — session: <branch> · <date range>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Agent                Tasks   ✅ Done   ❌ Failed   ⛔ Blocked  Success%   Circuit
──────────────────────────────────────────────────────────────────────────────
backend-developer        8        7          1           0     87.5%   🟢 CLOSED
frontend-developer       5        5          0           0    100.0%   🟢 CLOSED
qa-engineer              6        4          2           0     66.7%   🟡 HALF-OPEN
data-engineer            2        2          0           0    100.0%   🟢 CLOSED
diagnostics              1        0          1           0      0.0%   🔴 OPEN
──────────────────────────────────────────────────────────────────────────────
TOTAL                   22       18          4           0     81.8%
⚠️ Agents needing attention:
🔴 diagnostics — Circuit OPEN · fallback: surface to user
🟡 qa-engineer — Circuit HALF-OPEN · 2 failures this session
Circuit state icons:
- 🟢 CLOSED: healthy
- 🟡 HALF-OPEN: recovering, monitor closely
- 🔴 OPEN: bypassed, routed to fallback
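One way to render a table row with the icons listed above is sketched here. The column widths, the `format_row` name, the `❔` fallback icon, and the shape of the `row` dict (keys `total_tasks`, `tasks_completed`, `tasks_failed`, `tasks_blocked`, `success_rate`, `circuit_state`) are all assumptions made for illustration.

```python
CIRCUIT_ICONS = {"CLOSED": "🟢", "HALF-OPEN": "🟡", "OPEN": "🔴"}


def format_row(agent, row):
    """Format one agent's aggregated metrics as a fixed-width table line."""
    # Unknown states get a neutral placeholder icon (an assumption of this sketch).
    icon = CIRCUIT_ICONS.get(row["circuit_state"], "❔")
    return (
        f"{agent:<20}"
        f"{row['total_tasks']:>6}"
        f"{row['tasks_completed']:>9}"
        f"{row['tasks_failed']:>11}"
        f"{row['tasks_blocked']:>12}"
        f"{row['success_rate']:>9.1f}%"
        f"   {icon} {row['circuit_state']}"
    )
```

Emoji usually occupy two terminal cells, so header alignment may need adjusting per terminal.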
Flag agents as needing attention if:
- circuit_state is OPEN or HALF-OPEN
- success_rate < 70%
- tasks_failed >= 2
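The three attention criteria can be expressed as a small predicate. This is a sketch: `attention_reasons` is a hypothetical name, and `row` is assumed to be a dict with `circuit_state`, `success_rate` (a percentage), and `tasks_failed` keys. Under a literal reading, an agent with no tasks has a success rate of 0 and is flagged; whether that is intended is not stated here.

```python
def attention_reasons(row):
    """Return the reasons an agent needs attention; an empty list means healthy."""
    reasons = []
    if row["circuit_state"] in ("OPEN", "HALF-OPEN"):
        reasons.append(f"Circuit {row['circuit_state']}")
    if row["success_rate"] < 70:
        reasons.append(f"success rate {row['success_rate']:.1f}% is below 70%")
    if row["tasks_failed"] >= 2:
        reasons.append(f"{row['tasks_failed']} failures")
    return reasons
```

An agent is listed under "Agents needing attention" whenever the returned list is non-empty.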
5. Log snapshot (if --log)
If --log flag was passed, append one entry per active agent to
production/traces/agent-metrics.jsonl:
{"date":"<YYYY-MM-DD>","session":"<branch>","agent":"<agent>","tasks_completed":<N>,"tasks_failed":<N>,"tasks_blocked":<N>,"avg_tokens_est":<N>,"error_rate":<0.0-1.0>,"circuit_state":"CLOSED|OPEN|HALF-OPEN","notes":"<optional>"}
Get circuit_state from circuit-state.json. If no exact token data is
available, estimate avg_tokens_est as the decision ledger entry count × 800
tokens (a rough per-entry figure). The _est suffix marks the value as an estimate.
Print after logging:
✅ Metrics snapshot logged → production/traces/agent-metrics.jsonl
[N] agents recorded · <date>
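A sketch of building and appending one snapshot line in the format shown above. The helper names and parameters are hypothetical, `error_rate` is passed in because its derivation is not defined here, and the token figure uses the 800-per-ledger-entry estimate described above.

```python
import json
from datetime import date

TOKENS_PER_LEDGER_ENTRY = 800  # rough per-entry estimate, not a measured value


def snapshot_line(agent, session, counts, error_rate, circuit_state,
                  ledger_entries, notes="", today=None):
    """Serialize one agent's snapshot as a compact single-line JSON string."""
    entry = {
        "date": (today or date.today()).isoformat(),
        "session": session,
        "agent": agent,
        "tasks_completed": counts["tasks_completed"],
        "tasks_failed": counts["tasks_failed"],
        "tasks_blocked": counts["tasks_blocked"],
        "avg_tokens_est": ledger_entries * TOKENS_PER_LEDGER_ENTRY,
        "error_rate": error_rate,
        "circuit_state": circuit_state,
        "notes": notes,
    }
    return json.dumps(entry, separators=(",", ":"))


def append_snapshot(path, lines):
    """Append snapshot lines to the JSONL file, one entry per line."""
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Opening the file in append mode keeps earlier entries intact, matching the one-line-per-agent-per-session growth of the log.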
6. Suggest actions
After the table, if any agents need attention:
💡 Suggested actions:
• /resume-from <task_id> — recover failed task checkpoint
• /trace-history --risk High — audit high-risk decisions
• Check circuit-state.json — update OPEN agents once issue resolved
How metrics get into the file
Agents append entries in two ways:
- Manual: Run /agent-health --log at end of session
- Via /save-state: When saving state with a task_id, metrics for the active agent are appended automatically
The file grows one JSON line per agent per session. Use --since to filter
to recent sessions and avoid reading stale data from weeks ago.
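Filtering by --session, --agent, and --since could be sketched as below, assuming each entry is a dict with `date` (YYYY-MM-DD), `session`, and `agent` fields as in the logged format. `filter_entries` is a hypothetical helper; passing `None` for a filter disables it.

```python
from datetime import date


def filter_entries(entries, session=None, agent=None, since=None):
    """Keep entries matching every filter that is not None."""
    kept = []
    for entry in entries:
        if session is not None and entry.get("session") != session:
            continue
        if agent is not None and entry.get("agent") != agent:
            continue
        # since is a datetime.date; entries on that day are included (date >= since).
        if since is not None and date.fromisoformat(entry["date"]) < since:
            continue
        kept.append(entry)
    return kept
```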
Quick examples
# Summary for current session
/agent-health
# Check one agent across all time
/agent-health --agent qa-engineer
# Log a fresh snapshot and view it
/agent-health --log
# Review last 7 days
/agent-health --since 2026-04-09