Benchmarks - AI Agent Guide
Agent guide for the LEAN4 benchmark subsystem. For the project overview and OpenMath skills workflow, see the root AGENTS.md.
Overview
A benchmarking system for evaluating LEAN4 theorem proving capabilities:
- Stores benchmark problems as LEAN4 files with `sorry` placeholders
- Runs AI proof generation (direct API or headless agent CLI)
- Verifies proofs with the LEAN compiler
- Tracks performance metrics (compilation time, memory usage)
sets/benchmark_metadata.json → BenchmarkManager (load/filter)
→ BenchmarkRunner (subprocess LEAN compiler + psutil memory monitor)
→ results/results_YYYYMMDD_HHMMSS.json
→ ResultsEvaluator (analysis and reports)
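The runner stage reduces to invoking the LEAN compiler in a subprocess while polling memory with psutil. The sketch below illustrates that pattern only; `run_one` and its return shape are hypothetical, not the actual `BenchmarkRunner` API, and it assumes compiler output stays small enough not to fill the pipe buffer.

```python
# Sketch of the BenchmarkRunner pattern: subprocess LEAN compile + psutil polling.
# `run_one` is a hypothetical helper, not the real API.
import subprocess
import time

import psutil

def run_one(lean_file: str, timeout_s: int = 60) -> dict:
    """Compile one benchmark file, recording wall time and peak RSS."""
    start = time.monotonic()
    proc = subprocess.Popen(["lean", lean_file],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    mem = psutil.Process(proc.pid)
    peak_mb = 0.0
    while proc.poll() is None:
        try:
            peak_mb = max(peak_mb, mem.memory_info().rss / 2**20)
        except psutil.NoSuchProcess:
            break
        if time.monotonic() - start > timeout_s:
            proc.kill()
            return {"status": "error", "error_message": "timeout"}
        time.sleep(0.05)
    return {
        "status": "success" if proc.returncode == 0 else "error",
        "compilation_time_ms": round((time.monotonic() - start) * 1000),
        "memory_usage_mb": round(peak_mb, 1),
        "compiler_output": proc.stdout.read().decode(),
    }
```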
Runtime and Privacy Notes
- `solve.py --prompt ...` requires `ANTHROPIC_API_KEY`. `solve.py --agent <name>` requires the selected CLI to be installed and already authenticated/configured locally.
- Agent-mode runs execute an external CLI in a temporary workspace. The benchmark prelude narrows the task in prompt text, but actual filesystem/network enforcement depends on the selected CLI runtime. Some backends, such as `claude-code`, use non-interactive full-permission flags so unattended runs can finish.
- The `codex` backend is currently a placeholder CLI builder and may need local customization before use.
- Generated artifacts under `answers/` and `results/*.json` are local benchmark outputs and are gitignored by default in this repository.
- Raw provider reasoning or agent transcript data is not persisted unless you explicitly opt in with `solve.py --save-diagnostics` and `check.py --include-diagnostics`.
Generating Proofs with an AI Agent
The primary workflow for automated proof generation is a two-phase pipeline:
1. Solve (`solve.py`) — default: `--agent claude-code --prelude skills/openmath-lean-benchmark/prelude/default.md`
2. Check (`check.py`) — default: latest run; use `--run-id` to target a specific one
Solve: Generate proofs (Phase 1) — agent mode by default, or a single direct LLM call with `--prompt`
# Default: agent + prelude (all benchmarks)
python3 skills/openmath-lean-benchmark/utils/solve.py
# Scope: single benchmark, difficulty, topic (can combine)
python3 skills/openmath-lean-benchmark/utils/solve.py --difficulty easy --topic algebra
python3 skills/openmath-lean-benchmark/utils/solve.py --benchmark-id easy_algebra_001 -v
# Direct LLM call
python3 skills/openmath-lean-benchmark/utils/solve.py --prompt "Complete the proof." --benchmark-id easy_algebra_001
# Agent mode (default is claude-code; override agent/prelude)
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code --benchmark-id easy_algebra_001
python3 skills/openmath-lean-benchmark/utils/solve.py --agent aider --difficulty easy
python3 skills/openmath-lean-benchmark/utils/solve.py --agent gemini --difficulty medium
python3 skills/openmath-lean-benchmark/utils/solve.py --agent opencode --difficulty easy
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code --difficulty hard --agent-timeout 1200
python3 skills/openmath-lean-benchmark/utils/solve.py --prelude skills/openmath-lean-benchmark/prelude/default.md --benchmark-id easy_algebra_001
# Stream agent output live (Rich panels, stream-json) and optional budget
python3 skills/openmath-lean-benchmark/utils/solve.py -v --benchmark-id easy_algebra_001
python3 skills/openmath-lean-benchmark/utils/solve.py --agent-max-budget-usd 0.5
# Optional: persist raw provider/agent diagnostics to trace files
python3 skills/openmath-lean-benchmark/utils/solve.py --save-diagnostics --benchmark-id easy_algebra_001
Check: Verify generated proofs (Phase 2)
# Check latest run
python3 skills/openmath-lean-benchmark/utils/check.py
# Check specific run
python3 skills/openmath-lean-benchmark/utils/check.py --run-id run_20260311_194832
python3 skills/openmath-lean-benchmark/utils/check.py -v
Run metadata schema (`answers/{run_id}/run_metadata.json`)
{
"run_id": "run_YYYYMMDD_HHMMSS",
"provider": "agent:claude-code",
"model": null,
"agent_tool": "claude-code",
"agent_timeout": 600,
"agent_max_budget_usd": 1.0,
"prelude": "prelude/default.md",
"diagnostics_saved": false,
"start_timestamp": "...",
"end_timestamp": "...",
"total_benchmarks": 5,
"completed": 5,
"errors": 0,
"total_input_tokens": 0,
"total_output_tokens": 0,
"total_diagnostic_tokens": 0,
"total_wall_time_seconds": 42.3
}
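A quick way to inspect the newest run's metadata from Python (a minimal sketch; it only assumes the `answers/` layout and field names documented above):

```python
# Summarize the most recent solve run from its run_metadata.json.
import json
from pathlib import Path

answers = Path("skills/openmath-lean-benchmark/answers")
# run_YYYYMMDD_HHMMSS names sort chronologically, so the lexical max is newest.
latest = max(p for p in answers.iterdir() if p.is_dir())
meta = json.loads((latest / "run_metadata.json").read_text())
print(f"{meta['run_id']}: {meta['completed']}/{meta['total_benchmarks']} completed, "
      f"{meta['errors']} errors, {meta['total_wall_time_seconds']}s wall time")
```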
Phase 2 verification output (`skills/openmath-lean-benchmark/results/verify_{run_id}_{timestamp}.json`) includes per-result token counts and `trace_file` by default. Raw diagnostic output is copied into that file only when `check.py --include-diagnostics` is used and the original solve run was created with `--save-diagnostics`.
Task 1: Adding a New Benchmark
Workflow:
1. Create the `.lean` file in the appropriate directory
2. Add a metadata entry to `skills/openmath-lean-benchmark/sets/benchmark_metadata.json`
3. Validate using `python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py`
4. Test execution with `python3 skills/openmath-lean-benchmark/utils/validate.py --benchmark-id <id>`
Example - Adding Easy Algebra Benchmark:
# Step 1: Create the LEAN file
# File: skills/openmath-lean-benchmark/sets/easy/algebra/associativity.lean
-- Benchmark: Associativity of Addition
-- Difficulty: easy
-- Topic: algebra
-- Description: Prove associativity of natural number addition
theorem nat_add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by
sorry
# Step 2: Update metadata
# Edit: skills/openmath-lean-benchmark/sets/benchmark_metadata.json
{
"benchmark_id": "easy_algebra_002",
"file_path": "easy/algebra/associativity.lean",
"difficulty": "easy",
"topic": "algebra",
"description": "Associativity of natural number addition",
"lemmas": ["nat_add_assoc"],
"expected_time_ms": 1000,
"tags": ["associativity", "addition", "natural-numbers"]
}
# Step 3: Validate
python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py
# Step 4: Test
python3 skills/openmath-lean-benchmark/utils/validate.py --benchmark-id easy_algebra_002 -v
Task 2: Running Benchmark Tests
Basic Execution:
# Validate that benchmark sets compile (expect failure, not error)
python3 skills/openmath-lean-benchmark/utils/validate.py
# Filter by difficulty
python3 skills/openmath-lean-benchmark/utils/validate.py --difficulty easy
# Filter by topic
python3 skills/openmath-lean-benchmark/utils/validate.py --topic algebra
# Specific benchmark
python3 skills/openmath-lean-benchmark/utils/validate.py --benchmark-id easy_algebra_001
# Verbose mode (detailed output)
python3 skills/openmath-lean-benchmark/utils/validate.py -v
Expected Outputs:
- Console: Progress updates, summary statistics
- File: `skills/openmath-lean-benchmark/results/results_YYYYMMDD_HHMMSS.json`
Task 3: Analyzing Results
Basic Analysis:
# View latest results
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py
# Compare multiple runs
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --compare
# Performance report
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --report performance
# Specific result files
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --pattern "results_20261103_*.json"
Task 4: Validating Benchmark Integrity
# Validate all benchmarks
python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py
Validation Checks:
- File existence
- Lemma name matching between file and metadata
- Presence of `sorry` in benchmark files
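Conceptually, the three checks amount to the following (an illustrative sketch; `check_benchmark` is hypothetical, not the `benchmark_manager.py` API):

```python
# Illustrative version of the validation checks; not the real implementation.
import re
from pathlib import Path

SETS = Path("skills/openmath-lean-benchmark/sets")

def check_benchmark(entry: dict) -> list[str]:
    """Return a list of problems for one benchmark_metadata.json entry."""
    path = SETS / entry["file_path"]
    if not path.exists():                                   # file existence
        return [f"missing file: {path}"]
    text = path.read_text()
    declared = set(re.findall(r"^theorem\s+(\S+)", text, re.MULTILINE))
    problems = [f"lemma {name!r} not found in {path.name}"  # name matching
                for name in entry["lemmas"] if name not in declared]
    if "sorry" not in text:                                 # sorry present
        problems.append(f"{path.name} has no sorry placeholder")
    return problems
```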
📐 Technical Specifications
LEAN4 Benchmark Format
Required Structure:
-- Benchmark: [Title]
-- Difficulty: [easy|medium|hard]
-- Topic: [algebra|combinatorics|logic]
-- Description: [Brief description]
theorem [theorem_name] [parameters] : [statement] := by
sorry
Important Constraints:
- Must contain at least one lemma with `sorry`
- Lemma names in the file must match the metadata
- File must be compilable by LEAN4 (e.g. `lean <file>.lean`)
Metadata Schema
Required Fields:
{
"benchmark_id": "string (unique identifier)",
"file_path": "string (relative to benchmark_sets/)",
"difficulty": "easy|medium|hard",
"topic": "string",
"description": "string",
"lemmas": ["array of lemma names"],
"expected_time_ms": "number (milliseconds)",
"tags": ["array of strings"]
}
Results Schema
Output Format:
{
"run_metadata": {
"timestamp": "ISO 8601 timestamp",
"lean_version": "string",
"total_benchmarks": "number",
"total_passed": "number",
"total_failed": "number",
"total_errors": "number",
"total_time_ms": "number",
"total_memory_mb": "number",
"wall_time_seconds": "number"
},
"results": [
{
"benchmark_id": "string",
"file_path": "string",
"difficulty": "string",
"topic": "string",
"status": "success|failure|error",
"compilation_time_ms": "number",
"memory_usage_mb": "number",
"lemmas_tested": ["array"],
"lemmas_solved": ["array"],
"has_sorry": "boolean",
"error_message": "string|null",
"compiler_output": "string",
"timestamp": "ISO 8601 timestamp"
}
]
}
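For example, a pass-rate summary over the newest results file (a minimal sketch relying only on the schema above):

```python
# Compute a quick pass/failure/error breakdown from the latest results file.
import json
from pathlib import Path

results_dir = Path("skills/openmath-lean-benchmark/results")
newest = max(results_dir.glob("results_*.json"), key=lambda p: p.name)
data = json.loads(newest.read_text())
statuses = [r["status"] for r in data["results"]]
print(f"{newest.name}: {statuses.count('success')}/{len(statuses)} success, "
      f"{statuses.count('failure')} failure, {statuses.count('error')} error")
```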
Status Definitions
- success: Compiled without errors, no `sorry` remains
- failure: Compiled successfully, but still contains `sorry`
- error: Compilation failed (syntax errors, type errors, timeout)
⚙️ Configuration
Important Configuration Variables (skills/openmath-lean-benchmark/utils/config.py)
# Executable
LEAN_EXECUTABLE = "lean" # Assumes lean is in PATH
# Timeouts (seconds)
EASY_TIMEOUT = 60 # 1 minute
MEDIUM_TIMEOUT = 180 # 3 minutes
HARD_TIMEOUT = 600 # 10 minutes
# Memory limit
MAX_MEMORY_MB = 4096 # 4GB
# Directory paths (auto-configured)
BENCHMARKS_ROOT = Path(__file__).parent.parent.resolve()
BENCHMARK_DIR = BENCHMARKS_ROOT / "sets"
RESULTS_DIR = BENCHMARKS_ROOT / "results"
ANSWERS_DIR = BENCHMARKS_ROOT / "answers"
# Direct API provider defaults
DEFAULT_PROVIDER = "claude"
DEFAULT_MODEL = "claude-sonnet-4-6"
DEFAULT_REASONING_BUDGET = 8000
# Agent-mode defaults
DEFAULT_AGENT_TIMEOUT = 600 # seconds the agent process may run
DEFAULT_AGENT_MAX_BUDGET_USD = 1.0 # max USD per agent run (e.g. claude --max-budget-usd); None = no limit
When to Modify:
- LEAN not in PATH: Update `LEAN_EXECUTABLE` in `skills/openmath-lean-benchmark/utils/config.py`
- Timeouts too short/long: Adjust the difficulty-specific timeouts or `DEFAULT_AGENT_TIMEOUT`
- Memory constraints: Modify `MAX_MEMORY_MB`
- Default agent duration: Modify `DEFAULT_AGENT_TIMEOUT`
- Agent spend cap: Modify `DEFAULT_AGENT_MAX_BUDGET_USD` or use `--agent-max-budget-usd`
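How the difficulty timeouts are consumed is up to the runner; a plausible lookup (hypothetical helper, assuming `config.py` is importable, e.g. when running from `utils/`) looks like:

```python
# Hypothetical helper mapping benchmark difficulty to its compile timeout.
from config import EASY_TIMEOUT, MEDIUM_TIMEOUT, HARD_TIMEOUT

def timeout_for(difficulty: str) -> int:
    """Return the compile timeout in seconds for a difficulty level."""
    return {"easy": EASY_TIMEOUT,
            "medium": MEDIUM_TIMEOUT,
            "hard": HARD_TIMEOUT}[difficulty]
```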
🎯 Best Practices for Agents
1. Before Making Changes
✅ DO:
- Read relevant README files first
- Validate existing benchmarks before adding new ones
- Check metadata schema compliance
- Test new benchmarks individually before bulk operations
- Keep generated artifacts local unless you have explicitly sanitized them
❌ DON'T:
- Modify existing benchmark files without backing them up
- Add benchmarks without metadata entries
- Run benchmarks without understanding timeout limits
- Commit results to version control (unless intentional)
- Persist raw provider diagnostics unless you explicitly need them
2. When Adding Benchmarks
✅ DO:
- Use descriptive benchmark IDs (format: `{difficulty}_{topic}_{number}`)
- Include comprehensive comments in LEAN files
- Set realistic expected_time_ms based on difficulty
- Add relevant tags for categorization
- Validate immediately after creation
❌ DON'T:
- Use duplicate benchmark IDs
- Create benchmarks without `sorry` placeholders
- Omit documentation comments
- Skip metadata updates
3. When Running Tests
✅ DO:
- Start with small test runs (single benchmark or easy difficulty)
- Use verbose mode (`-v`) for debugging
- Monitor system resources for long-running tests
- Review compiler output for errors
- Use `--save-diagnostics` only for targeted local debugging
❌ DON'T:
- Run all benchmarks on untested code
- Ignore timeout warnings
- Delete result files without archiving
- Interrupt running benchmarks abruptly
4. When Analyzing Results
✅ DO:
- Compare results over multiple runs for trends
- Investigate performance outliers
- Keep historical results for regression testing only if they are sanitized and intentionally stored
- Document unexpected behaviors
❌ DON'T:
- Rely on a single test run
- Ignore error messages
- Delete result files prematurely
- Skip validation of anomalies
🔍 Troubleshooting Guide
Common Issues and Solutions
Issue: "LEAN not found" or "Unknown command: lean"
Solution:
- Verify LEAN4 installation: `lean --version`
- If not installed, install LEAN4
- If installed but not in PATH, update `LEAN_EXECUTABLE` in `config.py`
Issue: "Import errors" when running Python scripts
Solution:
- Ensure you're in the project root directory
- Install dependencies: `pip3 install -r skills/openmath-lean-benchmark/utils/requirements.txt`
- Use `python3` (not `python`) on macOS/Linux
Issue: agent CLI missing or not authenticated
Solution:
- Install the selected CLI so it is available in `PATH`
- Complete the CLI's local sign-in or provider configuration before running `solve.py`
- For direct API mode, set `ANTHROPIC_API_KEY`
Issue: "Timeout errors" during benchmark execution
Solution:
- Check timeout settings in `skills/openmath-lean-benchmark/utils/config.py`
- Increase the timeout for the specific difficulty level
- Verify the benchmark doesn't have infinite loops
- Use `--verbose` to see where it hangs
Issue: "Mismatched lemmas" validation error
Solution:
- Verify lemma names in the `.lean` file match the metadata
- Check for typos in lemma names
- Ensure the file has correct LEAN syntax
- Re-validate after corrections
Issue: `unexpected identifier; expected command` on the `lemma` keyword
Solution:
- `lemma` is no longer a recognized keyword in Lean 4.29+; use `theorem` instead
- All benchmark files must use `theorem` for top-level declarations
- When adding new benchmarks, always use `theorem`, not `lemma`
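For example, renaming the declaration keyword is the whole fix:

```lean
-- Fails to parse here: lemma nat_add_comm (a b : Nat) : a + b = b + a := ...
-- Works: declare it with `theorem` instead.
theorem nat_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```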
Issue: Results showing all "failure" status
Solution:
- This is expected for benchmarks with `sorry` placeholders
- "failure" means the file compiled successfully but the proofs are incomplete
- Only completed proofs (no `sorry`) show "success"
📊 Performance Expectations
Typical Execution Times
| Difficulty | Lemmas per Benchmark | Avg Time/Benchmark | Total Time (10 benchmarks) |
|---|---|---|---|
| Easy | 1-5 lemmas | 100-1000ms | 1-10 seconds |
| Medium | 3-8 lemmas | 1-5 seconds | 10-50 seconds |
| Hard | 5-15 lemmas | 5-30 seconds | 50-300 seconds |
Memory Usage
- Easy: 20-100 MB per benchmark
- Medium: 50-500 MB per benchmark
- Hard: 100-2000 MB per benchmark
🚀 Quick Reference Commands
# Setup
pip3 install -r skills/openmath-lean-benchmark/utils/requirements.txt
# Validation
python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py
# Solve: generate proofs (default agent + prelude)
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code --difficulty easy --agent-timeout 300
python3 skills/openmath-lean-benchmark/utils/solve.py --agent aider --benchmark-id easy_algebra_001 -v
python3 skills/openmath-lean-benchmark/utils/solve.py --save-diagnostics --benchmark-id easy_algebra_001
# Check: verify generated proofs with LEAN compiler
python3 skills/openmath-lean-benchmark/utils/check.py
python3 skills/openmath-lean-benchmark/utils/check.py --run-id run_20260311_194832 -v
python3 skills/openmath-lean-benchmark/utils/check.py --run-id run_20260311_194832 --include-diagnostics
# Validate: verify benchmark sets compile (expect failure, not error)
python3 skills/openmath-lean-benchmark/utils/validate.py # All
python3 skills/openmath-lean-benchmark/utils/validate.py --difficulty easy # By difficulty
python3 skills/openmath-lean-benchmark/utils/validate.py --topic algebra # By topic
python3 skills/openmath-lean-benchmark/utils/validate.py -v # Verbose
# Analyze results
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py # Latest summary
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --compare # Compare runs
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --report performance # Performance details
📝 Notes for Agents
File Modification Guidelines
- skills/openmath-lean-benchmark/sets/ - Add new benchmarks, update metadata
- skills/openmath-lean-benchmark/results/ - Read-only generated outputs; gitignored by default
- skills/openmath-lean-benchmark/utils/ - Modify config.py for settings, avoid changing core logic unless necessary
- README files - Update when adding new features or workflows
Integration Points
- CI/CD: Results can be parsed for automated testing (see the gate sketch after this list)
- Analytics: JSON results ideal for data analysis pipelines
- Reporting: evaluate_results.py provides customizable reports
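As an example of the CI/CD hook, a minimal gate script (illustrative only; it assumes the results schema documented above):

```python
# Fail a CI job when the latest benchmark run recorded any errors.
import json
import sys
from pathlib import Path

newest = max(Path("skills/openmath-lean-benchmark/results").glob("results_*.json"),
             key=lambda p: p.name)
run = json.loads(newest.read_text())["run_metadata"]
sys.exit(1 if run["total_errors"] > 0 else 0)
```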
Extension Opportunities
Agents can extend this system by:
- Adding new topics (create subdirectories, update metadata)
- Adding new agent tool backends in `skills/openmath-lean-benchmark/utils/providers/agents/` (e.g. a new module plus an `AGENT_REGISTRY` entry; see the sketch after this list)
- Implementing custom result analyzers
- Creating visualization tools for results
- Building automated benchmark generators
- Integrating with proof assistants
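A sketch of what a new backend module might look like (the `build_command` signature and registry wiring are assumptions; mirror an existing module such as the `claude-code` backend for the actual contract):

```python
# Hypothetical skills/openmath-lean-benchmark/utils/providers/agents/my_cli.py
# The signature and AGENT_REGISTRY wiring below are illustrative assumptions.

def build_command(prompt_file: str, workspace: str, timeout: int) -> list[str]:
    """Build the argv for one non-interactive run of the external CLI."""
    return ["my-cli", "run", "--non-interactive",
            "--timeout", str(timeout),
            "--prompt-file", prompt_file, "--workdir", workspace]

# Registration, e.g. wherever AGENT_REGISTRY is defined:
# AGENT_REGISTRY["my-cli"] = build_command
```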
Additional Resources
- LEAN4 Documentation: lean-lang.org
- Benchmark Format Guide: `sets/README.md`
- Results Schema: `results/README.md`
- Utils Documentation: `utils/README.md`
Agent Checklist
Before completing benchmark work, verify:
- All new benchmarks have metadata entries
- Benchmark IDs are unique
- LEAN files compile without syntax errors
- Validation passes (`python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py`)
- Test runs complete successfully
- Results are properly formatted JSON
- No raw provider diagnostics in stored files unless explicitly intended