name: skill-evolution description: "Self-evolving skill system. Skills are scored after execution (0-100) on 5 dimensions. Score 90+ over 5 runs = crystallized (locked). Score below 30 = auto-repair attempted. Skills improve themselves through usage feedback."
Skill Evolution
Darwinian selection for skills. Skills that produce good outcomes are crystallized and protected. Skills that produce poor outcomes are repaired or archived. Every execution generates a score that drives the next generation of the skill.
The 5 Scoring Dimensions
Each skill execution is scored 0-100 on five dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Accuracy | 25% | Did the skill produce the correct result for the task? |
| Relevance | 20% | Was the skill content applicable to the actual use case? |
| Token Efficiency | 20% | Did the skill guide the agent without bloat or repetition? |
| User Satisfaction | 20% | Did the outcome meet or exceed user expectations? |
| Reusability | 15% | Could another agent use this skill in a similar situation? |
Composite score = weighted average of all five dimensions (0-100).
Scoring Rubric
90-100: Excellent -- candidate for crystallization
70-89: Good -- active skill, no action needed
50-69: Adequate -- flag for review after 3 more runs
30-49: Poor -- schedule auto-repair attempt
0-29: Critical -- immediate auto-repair or archive
Skill Lifecycle
DRAFT ACTIVE CRYSTALLIZED ARCHIVED
| | | |
New skill In regular use Proven stable Deprecated/replaced
| | | |
+-- first run ->+-- score >90 ++-- score <30 |
| for 5+ runs | (3 attempts) |
+-- score <30 -->+ auto-repair |
| auto-repair | fails 3x -->--+
+-- score >90 -->+
Draft
New skills enter as Draft. They receive no special protection and are evaluated critically on first use. A Draft skill that scores below 30 on its very first run is discarded rather than repaired.
Active
Skills in regular use. Scores are tracked in ~/.claude/skill-scores.jsonl. No action unless scores trend below 30 or above 90 over a rolling window of 5 runs.
Crystallized
A skill that maintains an average composite score above 90 over 5 or more consecutive runs is crystallized:
- Git tag applied:
skill/<name>/crystallized-v<N> - Read-only flag added to frontmatter:
locked: true - Skill is excluded from auto-repair
- Changes require explicit human unlock + PR
Archived
A skill that fails auto-repair 3 times is archived:
- Moved to
skills/_archived/<name>/ - Git tag applied:
skill/<name>/archived - Replacement skill drafted by
catalystagent if the capability is still needed
Score Storage Format
Append one record per execution to ~/.claude/skill-scores.jsonl:
{"skill":"experiment-loop","ts":"2026-04-07T10:00:00Z","session":"abc123","scores":{"accuracy":88,"relevance":92,"token_efficiency":75,"user_satisfaction":90,"reusability":85},"composite":86.5,"feedback":"Loop ran 4 iterations successfully, target nearly met"}
{"skill":"experiment-loop","ts":"2026-04-07T14:30:00Z","session":"def456","scores":{"accuracy":95,"relevance":90,"token_efficiency":82,"user_satisfaction":95,"reusability":88},"composite":90.4,"feedback":"Bundle size reduced 28%, target exceeded"}
Score CLI (quick check)
# Average scores for a skill (last 10 runs)
cat ~/.claude/skill-scores.jsonl | python3 -c "
import sys, json, statistics
skill = '$1'
runs = [json.loads(l) for l in sys.stdin if json.loads(l).get('skill') == skill][-10:]
if runs:
avg = statistics.mean(r['composite'] for r in runs)
print(f'{skill}: {avg:.1f} avg over {len(runs)} runs')
"
Crystallization Protocol
When a skill reaches 90+ composite score over 5+ consecutive runs:
- Verify scores in
~/.claude/skill-scores.jsonl-- confirm no outliers inflating the average - Add
locked: trueto the skill's frontmatter - Apply git tag:
git tag skill/<name>/crystallized-v1 -m "Crystallized: avg score 92.3 over 7 runs" git push origin skill/<name>/crystallized-v1 - Log the crystallization in
thoughts/SKILL-EVOLUTION.md - Notify via canavar cross-training so all agents know this skill is stable
Auto-Repair Protocol
When a skill's composite score drops below 30:
Diagnosis
- Identify the lowest-scoring dimension (the primary failure mode)
- Read the last 3 session feedback notes from
~/.claude/skill-scores.jsonl - Summarize what went wrong (specific, not vague)
Repair
The catalyst agent rewrites the failing section(s) of the skill:
- Only the sections relevant to the low-scoring dimension
- Preserve all high-scoring sections unchanged
- Add a concrete example for the repaired section
Validation
After repair, the skill is re-scored on a synthetic test case by the verifier agent:
- Synthetic score must be 50+ to proceed to Active state
- If synthetic score < 50, attempt 2 of 3 repairs begins
Escalation
After 3 failed auto-repairs:
- Archive the skill
- Alert via
thoughts/SKILL-EVOLUTION.md - Spawn
catalystto draft a replacement from scratch
Evolution Log Format
Append events to thoughts/SKILL-EVOLUTION.md:
## 2026-04-07
### skill: experiment-loop
- Status change: Active -> Crystallized
- Trigger: avg composite 91.2 over 6 consecutive runs
- Git tag: skill/experiment-loop/crystallized-v1
- Notable strength: Token Efficiency dimension consistently 85+
### skill: legacy-deploy-helper
- Status change: Active -> Auto-Repair (attempt 1/3)
- Trigger: composite 24 on last run
- Lowest dimension: Relevance (12) -- skill referenced outdated Heroku patterns
- Repair: catalyst rewrote "Deployment Targets" section with Vercel/Railway focus
- Post-repair synthetic score: 71 -- promoted back to Active
Integration with Canavar Cross-Training
Skill evolution data feeds into canavar's cross-training pipeline:
- A crystallized skill is injected into canavar's
skill-matrix.jsonwithtrust: locked - An archived skill is marked
trust: deprecated-- agents stop referencing it - Auto-repair failures are logged to
error-ledger.jsonlwithsource: skill-evolution - The canavar leaderboard tracks which agents most frequently produce high-scoring skill executions
# View crystallized skills
node ~/.claude/hooks/dist/canavar-cli.mjs leaderboard --filter crystallized
# View skills needing repair
cat ~/.claude/skill-scores.jsonl | python3 -c "
import sys, json, collections
runs = [json.loads(l) for l in sys.stdin]
low = {r['skill'] for r in runs if r['composite'] < 30}
print('Skills needing repair:', low)
"
Activation
This skill activates automatically when:
- A skill completes an execution (PostToolUse hook)
- A skill is referenced in a session that ends with user dissatisfaction
- The
verifieragent reports a skill-guided task as failed
Agents involved: catalyst (repair), verifier (validation), self-learner (feedback extraction), canavar (cross-training propagation).