Evolve Agents - Agent Self-Improvement Analysis

Analyze agent performance, identify capability gaps, and propose improvements using the Evolution System.

Usage

# Analyze all agents
.claude/commands/evolve-agents.md analyze

# Analyze specific agent
.claude/commands/evolve-agents.md analyze --agent-id=coder-agent

# Generate weekly evolution report
.claude/commands/evolve-agents.md report --period=weekly

# Check thresholds and recommend updates
.claude/commands/evolve-agents.md check-thresholds

# View capability gaps
.claude/commands/evolve-agents.md gaps

# View skill suggestions
.claude/commands/evolve-agents.md suggestions

# View prompt history
.claude/commands/evolve-agents.md prompt-history --agent-id=coder-agent

# Evolve agent prompts
.claude/commands/evolve-agents.md evolve --agent-id=coder-agent

# Run A/B test on prompts
.claude/commands/evolve-agents.md ab-test --agent-id=coder-agent

Command Actions

1. analyze - Comprehensive Agent Analysis

Analyze agent performance metrics including:

Success rate over time
Average task duration
Token efficiency (successful tasks per 1,000 tokens)
User feedback ratings
Performance trends (improving, stable, declining)

Output:

Performance summary table
Trend analysis
Comparison with previous periods
Recommendations for improvement

Example Output:

Agent Performance Analysis
========================

Agent: coder-agent
------------------
Success Rate:        87.5% (↑ 5.2% vs last week)
Avg Duration:        45.3s (↓ 8.1s vs last week)
Token Efficiency:    0.72 (↑ 0.08 vs last week)
User Rating:         4.2/5.0
Performance Trend:   IMPROVING

Tasks Completed:     120
Failures:           15
Retries:            8

Top Error Types:
1. timeout (5 occurrences)
2. validation_error (4 occurrences)
3. tool_limitation (3 occurrences)
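
The headline numbers in the example above can be derived from raw task records. The following is a minimal sketch, not the Evolution System's actual code: the `TaskRecord` shape and the `summarize` helper are hypothetical names introduced for illustration. It shows how 120 tasks with 15 failures yield an 87.5% success rate, and how token efficiency follows from the "successful tasks per 1,000 tokens" definition.

```typescript
// Hypothetical record shape; the real Evolution System schema may differ.
interface TaskRecord {
  success: boolean;
  durationMs: number;
  tokens: number;
}

interface AgentSummary {
  tasksCompleted: number;
  failures: number;
  successRatePct: number; // percentage of tasks that succeeded
  avgDurationSec: number; // mean task duration in seconds
  tokenEfficiency: number; // successful tasks per 1,000 tokens
}

function summarize(records: TaskRecord[]): AgentSummary {
  const tasksCompleted = records.length;
  const successes = records.filter((r) => r.success).length;
  const totalTokens = records.reduce((sum, r) => sum + r.tokens, 0);
  const totalMs = records.reduce((sum, r) => sum + r.durationMs, 0);

  return {
    tasksCompleted,
    failures: tasksCompleted - successes,
    // Guard against division by zero when there is no data yet.
    successRatePct: tasksCompleted === 0 ? 0 : (successes / tasksCompleted) * 100,
    avgDurationSec: tasksCompleted === 0 ? 0 : totalMs / tasksCompleted / 1000,
    tokenEfficiency: totalTokens === 0 ? 0 : successes / (totalTokens / 1000),
  };
}
```

Comparing two such summaries from consecutive periods gives the "vs last week" deltas shown in the report.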

2. report - Generate Evolution Report

Generate a comprehensive evolution report for the specified period:

Overall system performance
Per-agent performance summaries
Identified capability gaps
Skill suggestions
Prompt updates
Improvements deployed

Periods:

daily - Last 24 hours
weekly - Last 7 days (default)
monthly - Last 30 days
custom - Specify start/end dates

Output:

Executive summary
Detailed metrics
Visual trends
Actionable recommendations
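
The named periods listed above map to fixed look-back windows. As a rough sketch (the `periodRange` helper is a hypothetical name, not part of the documented API), the window can be computed like this, with `custom` left to the caller, who supplies explicit start and end dates:

```typescript
type Period = "daily" | "weekly" | "monthly";

// Look-back length in days for each named period, as listed above.
const PERIOD_DAYS: Record<Period, number> = { daily: 1, weekly: 7, monthly: 30 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the [start, end] window for a named period; weekly is the default.
function periodRange(
  period: Period = "weekly",
  now: Date = new Date(),
): { start: Date; end: Date } {
  const start = new Date(now.getTime() - PERIOD_DAYS[period] * MS_PER_DAY);
  return { start, end: now };
}
```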

3. check-thresholds - Auto-Evolution Checks

Check if any agents meet criteria for automatic evolution:

Success rate drop exceeds the threshold
Declining performance trend
High failure rate in specific task types

Thresholds:

Success rate drop: 10% or more versus the previous period
Minimum task count: 10 tasks
Declining trend: 2 or more consecutive periods

Output:

Evolution Threshold Analysis
===========================

Agents Requiring Attention:
---------------------------

1. coder-agent
   Current Version:    v3
   Success Rate:       72.5% (↓ 15.2% vs previous period)
   Threshold:          success_rate_drop
   Recommended Action: EVOLVE
   Reason:            Success rate dropped by 15.2%

2. tester-agent
   Current Version:    v2
   Success Rate:       stable
   Performance:        DECLINING trend for 14 days
   Recommended Action: AB_TEST
   Reason:            Consistent performance decline
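
The threshold logic can be sketched as a pure function over an agent's per-period history. This is an illustration under stated assumptions, not the system's implementation: the `PeriodStats` shape and `checkAgent` name are hypothetical, and the mapping of a sharp drop to EVOLVE and a sustained decline to AB_TEST is inferred only from the example output above.

```typescript
type Trend = "improving" | "stable" | "declining";

// Hypothetical per-period statistics for one agent, ordered oldest to newest.
interface PeriodStats {
  successRatePct: number;
  taskCount: number;
  trend: Trend;
}

interface Recommendation {
  action: "EVOLVE" | "AB_TEST" | "NONE";
  reason: string;
}

// Threshold values taken from the list above.
const MIN_SUCCESS_RATE_DROP = 10;
const MIN_TASK_COUNT = 10;
const MIN_DECLINING_PERIODS = 2;

function checkAgent(history: PeriodStats[]): Recommendation {
  const current = history[history.length - 1];
  const previous = history[history.length - 2];
  if (!current || !previous) {
    return { action: "NONE", reason: "Not enough history" };
  }
  if (current.taskCount < MIN_TASK_COUNT) {
    return { action: "NONE", reason: "Too few tasks to judge" };
  }

  const drop = previous.successRatePct - current.successRatePct;
  if (drop >= MIN_SUCCESS_RATE_DROP) {
    return { action: "EVOLVE", reason: `Success rate dropped by ${drop.toFixed(1)}%` };
  }

  // Count how many of the most recent periods were declining in a row.
  let declining = 0;
  for (let i = history.length - 1; i >= 0 && history[i]?.trend === "declining"; i--) {
    declining++;
  }
  if (declining >= MIN_DECLINING_PERIODS) {
    return { action: "AB_TEST", reason: "Consistent performance decline" };
  }
  return { action: "NONE", reason: "Within thresholds" };
}
```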

4. gaps - View Capability Gaps

Display identified capability gaps from task failures:

Gap category (missing_skill, tool_limitation, knowledge_gap, pattern_failure)
Severity (low, medium, high, critical)
Affected tasks
Frequency
Error patterns

Output:

Capability Gaps
==============

CRITICAL Gaps (2):
------------------

1. Gap ID: gap-1234567890
   Category:        tool_limitation
   Description:     Agent struggles with database query tasks requiring SQL execution
   Failure Count:   12
   Frequency:       3.4 failures/week
   Severity:        CRITICAL
   Affected Tasks:  12 tasks
   Error Pattern:   "No database client available"

2. Gap ID: gap-0987654321
   Category:        missing_skill
   Description:     Agent lacks capability for async/parallel task handling
   Failure Count:   8
   Frequency:       2.3 failures/week
   Severity:        CRITICAL
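
To make the gap fields concrete, here is a small sketch of a gap record and of an ordering that puts the most severe and most frequent gaps first. The `CapabilityGap` shape, the `observedDays` field, and the helper names are hypothetical and introduced only for illustration; the category and severity values are the ones listed above.

```typescript
type Severity = "low" | "medium" | "high" | "critical";
type GapCategory = "missing_skill" | "tool_limitation" | "knowledge_gap" | "pattern_failure";

// Hypothetical in-memory shape of a capability gap.
interface CapabilityGap {
  id: string;
  category: GapCategory;
  severity: Severity;
  failureCount: number;
  observedDays: number; // assumed: length of the observation window in days
}

// Lower rank sorts first.
const SEVERITY_RANK: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Failures normalized to a 7-day week.
function failuresPerWeek(gap: CapabilityGap): number {
  return gap.observedDays <= 0 ? 0 : (gap.failureCount * 7) / gap.observedDays;
}

// Sort by severity first, then by weekly failure frequency (highest first).
function prioritize(gaps: CapabilityGap[]): CapabilityGap[] {
  return [...gaps].sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
      failuresPerWeek(b) - failuresPerWeek(a),
  );
}
```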

5. suggestions - View Skill Suggestions

Display proposed skills to address capability gaps:

Skill name and description
Addressed gaps
Estimated impact
Implementation complexity
Required tools/training

Output:

Skill Suggestions
================

HIGH PRIORITY (3):
------------------

1. Enhanced Database Integration
   Category:            tool_usage
   Addresses Gaps:      gap-1234567890
   Estimated Impact:
     - Gaps Closed:              1
     - Tasks Unblocked:          12
     - Success Rate Improvement: +15%
   Implementation:      MEDIUM complexity
   Required Tools:      database-client, sql-executor

2. Async Task Manager
   Category:            specialized_skill
   Addresses Gaps:      gap-0987654321
   Estimated Impact:
     - Gaps Closed:              1
     - Tasks Unblocked:          8
     - Success Rate Improvement: +20%
   Implementation:      HIGH complexity
   Required Training:   async_patterns, concurrency_control

6. prompt-history - View Prompt Evolution

Display prompt version history for an agent:

Version number
Activation/deactivation dates
Performance summary
Improvement over previous version

Output:

Prompt History: coder-agent
===========================

v4 (ACTIVE)
-----------
Activated:      2024-01-15 14:30:00
Tasks:          45
Success Rate:   87.5%
Avg Duration:   45.3s
Token Efficiency: 0.72
Improvement:    +12.5% vs v3

v3 (ARCHIVED)
-----------
Activated:      2024-01-08 09:00:00
Deactivated:    2024-01-15 14:30:00
Tasks:          120
Success Rate:   75.0%
Avg Duration:   53.4s
Token Efficiency: 0.64
Improvement:    +5.0% vs v2
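
In the sample above, the Improvement figure matches the difference in success rate between a version and its predecessor, in percentage points (87.5 - 75.0 = 12.5). A minimal sketch of that reading, using a hypothetical `PromptVersion` shape and `improvements` helper:

```typescript
// Hypothetical summary of one prompt version.
interface PromptVersion {
  version: string;
  successRatePct: number;
}

interface VersionDelta {
  version: string;
  improvementPct: number | null; // null for the first version (nothing to compare against)
}

// Expects versions ordered oldest to newest.
function improvements(versions: PromptVersion[]): VersionDelta[] {
  return versions.map((v, i) => {
    const prev = i > 0 ? versions[i - 1] : undefined;
    return {
      version: v.version,
      improvementPct: prev ? v.successRatePct - prev.successRatePct : null,
    };
  });
}
```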

7. evolve - Trigger Agent Evolution

Manually trigger prompt evolution for an agent:

Analyze recent failures
Suggest prompt mutations
Generate new variant
Register for A/B testing

Process:

Analyze failure patterns
Identify mutation opportunities
Generate new prompt variant
Add to A/B testing pool
Report expected improvements

Output:

Agent Evolution: coder-agent
===========================

Current Version: v3

Failure Analysis:
-----------------
- timeout errors (5x) → Add time management constraints
- validation errors (4x) → Clarify output format requirements

Suggested Mutations:
-------------------
1. ADD_CONSTRAINT (system prompt)
   Confidence: 70%
   Description: Add time management and efficiency constraints

2. CLARIFY (user prompt)
   Confidence: 80%
   Description: Clarify output format requirements

New Variant Created: v4
-----------------------
Status:     TESTING
Trial Count: 0
UCB1 Score: 0.0

The new variant will be tested using UCB1 multi-armed bandit selection.
Expected to reach promotion threshold after 20 trials.
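
The failure-analysis step can be pictured as a lookup from error type to a candidate mutation. The sketch below is an assumption about how such a mapping might look, not the system's actual rules: the `MUTATION_RULES` table, the `minOccurrences` cutoff, and the `suggestMutations` name are all hypothetical, while the two mutation descriptions come from the sample output above.

```typescript
interface Mutation {
  type: "ADD_CONSTRAINT" | "CLARIFY";
  target: "system prompt" | "user prompt";
  description: string;
}

// Hypothetical rule table: error type -> mutation to try.
const MUTATION_RULES = new Map<string, Mutation>([
  [
    "timeout",
    {
      type: "ADD_CONSTRAINT",
      target: "system prompt",
      description: "Add time management and efficiency constraints",
    },
  ],
  [
    "validation_error",
    {
      type: "CLARIFY",
      target: "user prompt",
      description: "Clarify output format requirements",
    },
  ],
]);

// Returns mutations for error types seen at least `minOccurrences` times,
// most frequent error first. Error types without a rule are skipped.
function suggestMutations(
  errorCounts: Record<string, number>,
  minOccurrences = 3,
): Mutation[] {
  const byFrequency = Object.entries(errorCounts).sort(([, a], [, b]) => b - a);
  const suggestions: Mutation[] = [];
  for (const [errorType, count] of byFrequency) {
    const rule = MUTATION_RULES.get(errorType);
    if (rule && count >= minOccurrences) suggestions.push(rule);
  }
  return suggestions;
}
```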

8. ab-test - A/B Test Management

View and manage A/B testing of prompt variants:

Active variants
Trial counts
Success rates
UCB1 scores
Selection probability

Output:

A/B Testing Status: coder-agent
===============================

Active Variants:
---------------

v4 (TESTING)
  Trials:         15
  Success Rate:   85.0%
  Avg Duration:   42.1s
  UCB1 Score:     0.92
  Selection Prob: 65%

v3 (ACTIVE)
  Trials:         120
  Success Rate:   75.0%
  Avg Duration:   53.4s
  UCB1 Score:     0.78
  Selection Prob: 35%

Next Selection: v4 (UCB1 algorithm)

Promotion Status:
----------------
v4 needs 5 more trials before promotion consideration
Current improvement: +10.0% success rate vs v3
Threshold for promotion: +5.0% improvement
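
The promotion status lines follow from two checks: enough trials, and enough improvement over the active variant. A minimal sketch, assuming the 20-trial minimum and +5.0% threshold shown in this document (the `VariantStats` shape and `promotionStatus` helper are hypothetical names):

```typescript
// Hypothetical summary of one prompt variant under test.
interface VariantStats {
  trials: number;
  successRatePct: number;
}

interface PromotionStatus {
  trialsRemaining: number;
  improvementPct: number;
  eligible: boolean;
}

function promotionStatus(
  candidate: VariantStats,
  active: VariantStats,
  minTrials = 20,
  minImprovementPct = 5,
): PromotionStatus {
  const trialsRemaining = Math.max(0, minTrials - candidate.trials);
  const improvementPct = candidate.successRatePct - active.successRatePct;
  return {
    trialsRemaining,
    improvementPct,
    // Both conditions must hold before the candidate can replace the active variant.
    eligible: trialsRemaining === 0 && improvementPct >= minImprovementPct,
  };
}
```

With the figures from the sample output (v4 at 15 trials and 85.0%, v3 at 75.0%), this reports 5 trials remaining and a +10.0 improvement, so v4 is not yet eligible.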

Implementation Guide

1. Initialize Evolution System

import Database from 'better-sqlite3';
// Relative imports need a leading './'; adjust the path to your file's location
import { EvolutionSystem } from './.claude/orchestration/evolution';

// Initialize database
const db = new Database('.claude/orchestration/db/agents.db');

// Create evolution system
const evolution = new EvolutionSystem(db, {
  autoEvolutionEnabled: true,
  explorationParameter: 2.0,
  minTrialsBeforePromotion: 20,
});

2. Track Task Completion

// After task completes
await evolution.trackTaskCompletion({
  agentId: 'coder-agent',
  taskId: 'task-123',
  variantId: 'coder-agent-v4',
  success: true,
  duration: 45300, // ms
  tokens: 1250,
  userRating: 4.5,
});

3. Collect Implicit Feedback

// User retries task
evolution.feedbackLoop.trackRetry('task-123', 'coder-agent');

// User edits output
evolution.feedbackLoop.trackEdit('task-123', 'coder-agent', 'minor');

// User abandons task
evolution.feedbackLoop.trackAbandon('task-123', 'coder-agent');

4. Generate Reports

// Weekly report
const report = evolution.generateWeeklyReport();

console.log('Overall Success Rate:', report.summary.overallSuccessRate);
console.log('Total Tasks:', report.summary.totalTasks);

// Per-agent performance
for (const perf of report.agentPerformance) {
  // Show ↑ for a non-negative change and ↓ for a drop
  const arrow = perf.successRateChange >= 0 ? '↑' : '↓';
  console.log(`${perf.agentId}: ${perf.successRate}% (${arrow} ${Math.abs(perf.successRateChange)}%)`);
}

5. Check and Apply Evolution

// Check thresholds
const updates = evolution.feedbackLoop.checkThresholds();

// Apply recommended updates
for (const update of updates) {
  if (update.recommendedAction === 'evolve') {
    await evolution.evolveAgent(update.agentId);
  }
}

UCB1 Algorithm Explanation

The system uses the UCB1 (Upper Confidence Bound) algorithm for prompt variant selection, which balances:

Exploitation: Selecting variants with proven high success rates
Exploration: Testing new or under-tested variants

Formula:

UCB1 = avg_success_rate + c * sqrt(ln(total_trials) / variant_trials)

Where:

avg_success_rate: Historical success rate of variant
c: Exploration parameter (default: 2.0)
total_trials: Total trials across all variants
variant_trials: Trials for this specific variant

Selection Strategy:

Always select untried variants first (forced exploration)
Calculate UCB1 score for each variant
Select variant with highest UCB1 score
Update statistics after task completion
Promote variant to "active" after sufficient trials and proven improvement
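
The formula and the first three selection steps can be sketched in a few lines. This is an illustrative implementation of UCB1 as described above, not the Evolution System's own code; the `Variant` shape and the function names are hypothetical. Success rate is expressed as a fraction between 0 and 1.

```typescript
// Hypothetical per-variant counters.
interface Variant {
  id: string;
  trials: number;
  successes: number;
}

// UCB1 = avg_success_rate + c * sqrt(ln(total_trials) / variant_trials)
function ucb1Score(variant: Variant, totalTrials: number, c = 2.0): number {
  // Untried variants score infinitely high, which forces exploration first.
  if (variant.trials === 0) return Infinity;
  const avgSuccessRate = variant.successes / variant.trials;
  return avgSuccessRate + c * Math.sqrt(Math.log(totalTrials) / variant.trials);
}

// Picks the variant with the highest UCB1 score; ties keep the earlier variant.
function selectVariant(variants: Variant[], c = 2.0): Variant {
  if (variants.length === 0) throw new Error("No variants to select from");
  const totalTrials = variants.reduce((sum, v) => sum + v.trials, 0);
  return variants.reduce((best, v) =>
    ucb1Score(v, totalTrials, c) > ucb1Score(best, totalTrials, c) ? v : best,
  );
}
```

The exploration bonus shrinks as a variant accumulates trials, so a new variant with few trials is selected often at first and then has to hold its place on success rate alone.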

Configuration

Default configuration can be customized:

const evolution = new EvolutionSystem(db, {
  // Tracking
  trackingEnabled: true,
  metricsRetentionDays: 90,

  // A/B Testing
  abTestingEnabled: true,
  minTrialsBeforePromotion: 20,
  confidenceLevel: 0.95,
  explorationParameter: 2.0,

  // Auto-Evolution
  autoEvolutionEnabled: true,
  evolutionThreshold: {
    minSuccessRateDrop: 10, // 10% drop triggers evolution
    minTaskCount: 10,
  },

  // Feedback
  implicitFeedbackWeight: 0.3,
  feedbackDecayHalfLife: 7, // days

  // Reporting
  reportFrequency: 'weekly',
  reportRetentionCount: 12,
});
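
The two feedback settings are not explained further here. One plausible reading, offered as an assumption rather than documented behavior, is that `implicitFeedbackWeight` scales implicit signals relative to explicit ratings and `feedbackDecayHalfLife` halves the weight of a signal every 7 days. The sketch below illustrates that reading; `FeedbackSignal`, `decayWeight`, and `weightedFeedbackScore` are hypothetical names.

```typescript
// Hypothetical feedback signal, with score normalized to the range 0..1.
interface FeedbackSignal {
  score: number;
  ageDays: number;
  implicit: boolean;
}

// Exponential decay: the weight halves every `halfLifeDays`.
function decayWeight(ageDays: number, halfLifeDays = 7): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Weighted mean of feedback scores; returns null when there is no usable signal.
function weightedFeedbackScore(
  signals: FeedbackSignal[],
  implicitWeight = 0.3,
  halfLifeDays = 7,
): number | null {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const s of signals) {
    // Assumption: explicit feedback has weight 1, implicit feedback is scaled down.
    const weight = (s.implicit ? implicitWeight : 1) * decayWeight(s.ageDays, halfLifeDays);
    weightedSum += weight * s.score;
    totalWeight += weight;
  }
  return totalWeight === 0 ? null : weightedSum / totalWeight;
}
```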

Database Schema

All evolution data is stored in SQLite, in the following tables:

evolution_performance_metrics - Task completion metrics
evolution_user_feedback - Explicit and implicit feedback
evolution_task_failures - Detailed failure tracking
evolution_prompt_variants - Prompt versions and A/B testing
evolution_capability_gaps - Identified capability gaps
evolution_skill_suggestions - Proposed improvements
evolution_reports - Generated reports

See .claude/orchestration/db/evolution.sql for complete schema.

Integration with Orchestration System

The evolution system connects to the existing orchestration system at four points:

Automatic Tracking: All task completions are automatically tracked
Checkpoint Integration: Evolution state saved in checkpoints
Activity Logging: Evolution events logged to activity log
Obsidian Sync: Reports synced to Obsidian vault for review

Best Practices

Regular Reporting: Generate weekly reports to track trends
Review Gaps: Address critical capability gaps promptly
Monitor A/B Tests: Track variant performance during testing phase
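Feedback Collection: Collect explicit ratings as well as implicit signals (retries, edits, abandons) so evolution decisions rest on more data
Gradual Evolution: Evolve only a few agents at a time so the effect of each change can be told apart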
Version Control: Keep prompt history for rollback capability
