---
name: scaffold-review
description: Analyze conversation history, find gaps and drift in AGENTS/CLAUDE instructions and skills, propose and apply targeted improvements.
user-invocable: true
---
# Scaffold Review
Analyze recent agent conversation history to find what's broken, stale, or missing in your scaffolding (AGENTS/CLAUDE instructions, skills, project configs). Propose changes, apply them, and record what you did.
The goal is convergence: each run brings the scaffold closer to how the user actually works.
## Step 1: Load State
Resolve the active agent home:
```bash
if [ -n "${AGENT_HOME:-}" ]; then
  :
elif [ -n "${CODEX_HOME:-}" ] || [ -n "${CODEX_THREAD_ID:-}" ] || [ -n "${CODEX_CI:-}" ]; then
  AGENT_HOME="${CODEX_HOME:-$HOME/.codex}"
elif [ -n "${CLAUDE_HOME:-}" ] || [ -n "${CLAUDECODE:-}" ] || [ -n "${CLAUDE_CODE:-}" ]; then
  AGENT_HOME="${CLAUDE_HOME:-$HOME/.claude}"
elif [ -d "$HOME/.codex/sessions" ] && [ ! -d "$HOME/.claude/projects" ]; then
  AGENT_HOME="$HOME/.codex"
elif [ -d "$HOME/.claude/projects" ] && [ ! -d "$HOME/.codex/sessions" ]; then
  AGENT_HOME="$HOME/.claude"
else
  echo "Unable to infer AGENT_HOME. Set AGENT_HOME explicitly." >&2
  exit 1
fi
```
Read the review ledger (memory of prior runs):
!cat "$AGENT_HOME/scaffold-review-ledger.json" 2>/dev/null || echo '{"runs": [], "deferred": [], "trends": []}'
Find conversations since last run (or last 14 days if first run), with sizes for budgeting:
!if [ -d "$AGENT_HOME/projects" ]; then find "$AGENT_HOME/projects" -name '*.jsonl' -not -path '*/subagents/*' -mtime -14 -size +10k -exec ls -lh {} \; ; elif [ -d "$AGENT_HOME/sessions" ]; then find "$AGENT_HOME/sessions" -name '*.jsonl' -mtime -14 -size +10k -exec ls -lh {} \; ; fi | awk '{print $5, $9}' | sort -k2
Budget check: If total JSONL exceeds 5MB, split the corpus across agents rather than having each read everything.
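Where the split needs to be concrete, a minimal sketch like the following works, assuming byte counts are piped in as `<bytes> <path>` lines (for example from `du -b` per file, which differs from the human-readable `ls -lh` output above):

```python
# Sketch: given "bytes path" lines on stdin, check the 5 MB budget and
# assign conversations round-robin to the three analyzers.
import sys

BUDGET = 5 * 1024 * 1024
files = []
for line in sys.stdin:
    parts = line.split(None, 1)
    if len(parts) != 2 or not parts[0].isdigit():
        continue
    files.append((int(parts[0]), parts[1].strip()))

total = sum(size for size, _ in files)
print(f"total: {total / 1e6:.1f} MB across {len(files)} conversations")

if total > BUDGET:
    chunks = {0: [], 1: [], 2: []}
    # Largest-first round-robin keeps the three slices roughly the same size
    for i, (_, path) in enumerate(sorted(files, reverse=True)):
        chunks[i % 3].append(path)
    for agent, paths in chunks.items():
        print(f"analyzer {agent + 1}: {len(paths)} files")
else:
    print("under budget: each analyzer can read the full corpus")
```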
## Step 2: Extract Signals
Use 3 focused analyzers in parallel. For Codex, these are parallel shell/Python extraction passes over JSONL, not separate agent sessions.
### Agent 1: Corrections & Friction
Scan user messages for:
- Explicit corrections ("no, I meant...", "that's not what I asked", "actually...")
- Behavioral directives ("don't do X", "always do Y")
- Frustration markers (short messages after long assistant responses, re-prompting the same thing)
- User doing something manually after the assistant offered to do it (trust failure)
For each correction, answer: Is there scaffold guidance for this? Was it followed? Was it wrong?
Output: list of corrections with root cause (missing guidance / stale guidance / buried guidance / wrong guidance).
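A minimal sketch of this scan, assuming the Codex record shape documented under "Conversation JSONL Format" below; the phrase list is illustrative and should be tuned to the corpus:

```python
# Sketch: flag user messages that look like corrections or behavioral directives.
import json
import re
import sys

PATTERNS = re.compile(
    r"\b(no,? i meant|that'?s not what i asked|actually,|don'?t |never |always |stop )",
    re.IGNORECASE,
)

for path in sys.argv[1:]:
    for line in open(path, errors="ignore"):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        payload = obj.get("payload", {})
        if obj.get("type") != "response_item" or payload.get("type") != "message":
            continue
        if payload.get("role") != "user":
            continue
        text = " ".join(b.get("text", "") for b in payload.get("content", [])
                        if b.get("type") == "input_text")
        if PATTERNS.search(text):
            print(f"{path}: {text[:160]}")
```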
### Agent 2: Usage Patterns & Drift
From assistant tool-call records, extract:
- File access heatmap: top 20 files by Read/Edit frequency. Compare against what AGENTS/CLAUDE instructions reference.
- Command frequency: top commands by prefix (git, python, cargo, etc.)
- Skill invocation rates: which skills are used, which are never used
- New tools/patterns: anything in recent conversations but not older ones
- Dead references: paths in AGENTS/CLAUDE instructions that no longer appear in conversations
Output: frequency tables + list of stale/missing references.
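One rough way to build the heatmap and dead-reference list, again assuming the Codex schema below; the path regex is a heuristic, so expect noise in both directions:

```python
# Sketch: count path-like tokens in tool-call arguments, then report which
# paths mentioned in the instructions file never show up in conversations.
import json
import re
import sys
from collections import Counter

PATH_RE = re.compile(r"[\w./~-]+/[\w./~-]+")  # crude "looks like a path" heuristic

scaffold_file, *session_files = sys.argv[1:]
seen = Counter()

for path in session_files:
    for line in open(path, errors="ignore"):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        payload = obj.get("payload", {})
        if payload.get("type") != "function_call":
            continue
        seen.update(PATH_RE.findall(payload.get("arguments", "")))

print("top touched paths:")
for p, n in seen.most_common(20):
    print(f"  {n:4d}  {p}")

scaffold_paths = set(PATH_RE.findall(open(scaffold_file, errors="ignore").read()))
dead = [p for p in scaffold_paths if p not in seen]
print("referenced in scaffold but never seen in conversations:", dead)
```

Pass the instructions file first, then the session files, e.g. `python3 drift.py "$AGENT_HOME/AGENTS.md" *.jsonl` (the script name is arbitrary).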
### Agent 3: Workflow & Structure
Look at multi-step patterns:
- Repeated sequences across conversations (e.g., kill server -> launch -> health check -> benchmark -> kill)
- Session preambles: first 3 user messages from each conversation. If the user explains the same thing across 2+ sessions, that's a scaffold gap.
- Things that disappeared: commands/files/patterns that used to appear but don't anymore
Classify patterns by stability:
- Crystallized (5+ conversations): codify into skill or AGENTS/CLAUDE instructions
- Stable (3-4): add as guidance, keep watching
- Emerging (2): note as trend, don't codify yet
Output: pattern list with stability ratings + gap analysis.
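A sketch for the preamble check, assuming the Codex message schema below; it prints the first three user messages of each session so repeated explanations stand out side by side:

```python
# Sketch: print the first 3 user messages per session to spot repeated preambles.
import json
import sys

for path in sys.argv[1:]:
    printed = 0
    print(f"== {path}")
    for line in open(path, errors="ignore"):
        if printed >= 3:
            break
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        payload = obj.get("payload", {})
        if obj.get("type") != "response_item" or payload.get("type") != "message":
            continue
        if payload.get("role") != "user":
            continue
        text = " ".join(b.get("text", "") for b in payload.get("content", [])
                        if b.get("type") == "input_text")
        if text.strip():
            print(f"  [{printed + 1}] {text[:200]}")
            printed += 1
```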
### Analyzer Rules (all analyzers)
- Never read a full JSONL file. Use `head -c 50000` or targeted grep extraction:

```bash
# Codex user messages
python3 - <<'PY'
import json
for line in open("file.jsonl", errors="ignore"):
    obj = json.loads(line)
    if obj.get("type") != "response_item":
        continue
    payload = obj.get("payload", {})
    if payload.get("type") != "message" or payload.get("role") != "user":
        continue
    parts = [block.get("text", "") for block in payload.get("content", [])
             if block.get("type") == "input_text"]
    text = " ".join(parts)
    if text:
        print(text[:200])
PY

# Codex tool usage counts
grep '"type":"function_call"' file.jsonl | grep -o '"name":"[^"]*"' | sort | uniq -c | sort -rn | head -20

# Command prefixes from exec_command calls
python3 - <<'PY'
import json
from collections import Counter
counts = Counter()
for line in open("file.jsonl", errors="ignore"):
    obj = json.loads(line)
    if obj.get("type") != "response_item":
        continue
    payload = obj.get("payload", {})
    if payload.get("type") != "function_call" or payload.get("name") != "exec_command":
        continue
    args = json.loads(payload.get("arguments", "{}"))
    cmd = args.get("cmd", "").strip().splitlines()
    if cmd:
        counts[cmd[0].split()[0]] += 1
for name, count in counts.most_common(20):
    print(count, name)
PY
```

- Max 15-20 conversations per analyzer. Sample by recency if there are more.
- Return structured findings in <300 lines. Conclusions, not data dumps.
### Codex-Specific Notes
- Codex session logs usually store user, assistant, and tool activity under `response_item.payload`.
- Commentary and final answers are both assistant messages; use `payload.phase` when you need to separate progress updates from final responses.
- Tool calls appear as `payload.type == "function_call"` with JSON-encoded `arguments`.
- `write_stdin` polling loops are common in remote or long-running jobs; treat them as one workflow, not separate tasks.
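When only final answers matter, a filter along these lines works; it assumes the schema documented under "Conversation JSONL Format" below and simply skips records whose phase is marked commentary:

```python
# Sketch: print only final assistant responses, skipping commentary/progress updates.
import json
import sys

for line in open(sys.argv[1], errors="ignore"):
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        continue
    payload = obj.get("payload", {})
    if obj.get("type") != "response_item" or payload.get("type") != "message":
        continue
    if payload.get("role") != "assistant" or payload.get("phase") == "commentary":
        continue
    text = " ".join(b.get("text", "") for b in payload.get("content", [])
                    if b.get("type") == "output_text")
    if text:
        print(text[:300])
```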
## Step 3: Synthesize & Compare
After all agents report, read the current scaffold:
"$AGENT_HOME/CLAUDE.md"(withAGENTS.mdsymlink for Codex)- All
"$AGENT_HOME/skills/*/SKILL.md" - Project-specific
CLAUDE.md(orAGENTS.mdsymlink) files (find via conversation paths)
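A small sketch that enumerates the global scaffold files; project-level instruction files still have to be located from the working directories observed in the conversations:

```python
# Sketch: list the global scaffold files to read before cross-referencing.
import os
from pathlib import Path

agent_home = Path(os.environ.get("AGENT_HOME", str(Path.home() / ".codex")))
candidates = [agent_home / "CLAUDE.md", agent_home / "AGENTS.md",
              *sorted(agent_home.glob("skills/*/SKILL.md"))]
for path in candidates:
    if path.exists():
        print(f"{path}  ({path.stat().st_size} bytes)")
```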
Cross-reference agent findings against the scaffold. Classify each finding:
| Status | Meaning | Action |
|---|---|---|
| Conflict | Scaffold says X, user corrects to Y | Fix immediately |
| Stale | Scaffold references dead path/tool | Update or remove |
| Gap | Repeated pattern, scaffold is silent | Add content |
| Buried | Info exists but in wrong place | Reorganize |
| Dead | Skill/section never used | Remove |
Also compare against ledger trends:
- Confirmed (seen 3+ runs): should have prominent scaffold placement
- Emerging (seen 2 runs): note, don't act yet
- Reversed (was trending, stopped): investigate why; either the scaffold fix worked, or the user gave up.
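A sketch of that trend bookkeeping; it assumes each trend entry carries a `name` and a `seen_in_runs` list of run timestamps, which is one possible shape rather than anything this skill mandates:

```python
# Sketch: classify ledger trends by how many runs they have appeared in.
import json
import os

ledger_path = os.path.join(os.environ.get("AGENT_HOME", os.path.expanduser("~/.codex")),
                           "scaffold-review-ledger.json")
ledger = json.load(open(ledger_path))
latest = ledger["runs"][-1]["timestamp"] if ledger.get("runs") else None

for trend in ledger.get("trends", []):
    seen = trend.get("seen_in_runs", [])  # assumed field, see note above
    name = trend.get("name", "?")
    if latest and len(seen) >= 2 and latest not in seen:
        print(f"{name}: reversed, was trending but missing from the latest run")
    elif len(seen) >= 3:
        print(f"{name}: confirmed ({len(seen)} runs), deserves prominent placement")
    elif len(seen) == 2:
        print(f"{name}: emerging, note it but don't codify yet")
    else:
        print(f"{name}: seen once, not yet a trend")
```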
## Step 4: Propose Changes
Organize proposals by type:
### Tier 1: Corrections
Things the user explicitly corrected. Highest confidence -- apply unless vetoed.
### Tier 2: Structural
Reorganizations: sections that should be split into skills, skills that overlap and should merge, info in the wrong file.
### Tier 3: New Content
Workflows, paths, patterns that belong in the scaffold but aren't there yet. Apply the necessity test: would this have prevented a specific observed failure? If the assistant would get it right without the guidance, don't add it.
### Tier 4: Deletions
Stale content, unused skills, dead references. Show evidence of staleness.
### Tier 5: New Skills
Only if a crystallized workflow (5+ conversations) would clearly benefit from being a dedicated skill. Don't create skills speculatively.
For each proposal, include:
- Evidence: which conversations, what frequency
- Current state: what the scaffold says now (quote it)
- Proposed change: the exact edit
- Confidence: high / medium / low
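For example, a single rendered proposal might look like the following; every name, count, and quote here is invented purely to show the shape:

```markdown
### Proposal 2 (Tier 1: correction, confidence: high)
- Evidence: corrected in 4 of 12 conversations analyzed this run
- Current state: CLAUDE.md says "Commit after every edit."
- Proposed change: replace with "Run the test suite, then commit." (exact diff shown)
```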
Present all proposals to the user before applying.
## Step 5: Apply & Record
For approved changes:
- Apply all edits
- Re-read modified files to check for internal consistency
- Update the ledger by appending a run record like:
```json
{
  "timestamp": "<now>",
  "conversations_analyzed": "<count>",
  "proposals": [
    {
      "description": "...",
      "tier": "<1-5>",
      "status": "applied|deferred|rejected",
      "confidence": "high|medium|low"
    }
  ],
  "trends_updated": ["..."]
}
```
Write ledger:
cat > "$AGENT_HOME/scaffold-review-ledger.json" << 'EOF'
<updated ledger content>
EOF
For deferred proposals, record the reason so a future run can reassess.
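A sketch of that update which appends to the existing ledger rather than overwriting it; the `reason` field on deferred entries is a suggested addition, not part of the schema above:

```python
# Sketch: append this run to the existing ledger instead of overwriting history.
import json
import os
from datetime import datetime, timezone

path = os.path.join(os.environ.get("AGENT_HOME", os.path.expanduser("~/.codex")),
                    "scaffold-review-ledger.json")
try:
    ledger = json.load(open(path))
except (FileNotFoundError, json.JSONDecodeError):
    ledger = {"runs": [], "deferred": [], "trends": []}

ledger["runs"].append({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "conversations_analyzed": 14,  # fill in the real count from Step 1
    "proposals": [
        {"description": "...", "tier": "3", "status": "applied", "confidence": "medium"},
    ],
    "trends_updated": [],
})
ledger["deferred"].append({
    "description": "...",
    "reason": "user wants another week of usage before deciding",  # suggested field
})

with open(path, "w") as fh:
    json.dump(ledger, fh, indent=2)
```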
## Conversation JSONL Format
Records differ by agent implementation.
For Codex, the common shape is:
User / assistant messages:
```json
{
  "type": "response_item",
  "payload": {
    "type": "message",
    "role": "user|assistant|developer",
    "content": [
      { "type": "input_text|output_text", "text": "..." }
    ],
    "phase": "commentary|final"
  }
}
```
Tool calls:
```json
{
  "type": "response_item",
  "payload": {
    "type": "function_call",
    "name": "exec_command",
    "arguments": "{\"cmd\":\"...\"}"
  }
}
```
Claude-style records may still appear in older logs or other agent homes. Prefer the Codex schema when ~/.codex/sessions is the source.
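When the source is ambiguous, a quick probe of record shapes beats guessing; this sketch inspects only top-level `type` values and assumes nothing about Claude-side schemas:

```python
# Sketch: peek at the first few records to see which schema a log uses.
import json
import sys
from collections import Counter

types = Counter()
with open(sys.argv[1], errors="ignore") as fh:
    for i, line in enumerate(fh):
        if i >= 50:
            break
        try:
            types[json.loads(line).get("type", "<none>")] += 1
        except json.JSONDecodeError:
            types["<unparsable>"] += 1

if types.get("response_item"):
    print("looks like a Codex session log (response_item records)")
else:
    print("not the Codex shape; top-level types seen:", dict(types))
```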