Benchmarks - AI Agent Guide
Agent guide for the LEAN4 benchmark subsystem. For the project overview and OpenMath skills workflow, see the root AGENTS.md.
Overview
A benchmarking system for evaluating LEAN4 theorem proving capabilities:
- Stores benchmark problems as LEAN4 files with `sorry` placeholders
- Runs AI proof generation (direct API or headless agent CLI)
- Verifies proofs with the LEAN compiler
- Tracks performance metrics (compilation time, memory usage)
sets/benchmark_metadata.json → BenchmarkManager (load/filter)
→ BenchmarkRunner (subprocess LEAN compiler + psutil memory monitor)
→ results/results_YYYYMMDD_HHMMSS.json
→ ResultsEvaluator (analysis and reports)
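The runner stage reduces to invoking the LEAN compiler in a subprocess while polling memory with psutil. The sketch below illustrates that pattern only; `run_one` and its return shape are hypothetical, not the actual `BenchmarkRunner` API, and it assumes compiler output stays small enough not to fill the pipe buffer.

```python
# Sketch of the BenchmarkRunner pattern: subprocess LEAN compile + psutil polling.
# `run_one` is a hypothetical helper, not the real API.
import subprocess
import time

import psutil

def run_one(lean_file: str, timeout_s: int = 60) -> dict:
    """Compile one benchmark file, recording wall time and peak RSS."""
    start = time.monotonic()
    proc = subprocess.Popen(["lean", lean_file],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    mem = psutil.Process(proc.pid)
    peak_mb = 0.0
    while proc.poll() is None:
        try:
            peak_mb = max(peak_mb, mem.memory_info().rss / 2**20)
        except psutil.NoSuchProcess:
            break
        if time.monotonic() - start > timeout_s:
            proc.kill()
            return {"status": "error", "error_message": "timeout"}
        time.sleep(0.05)
    return {
        "status": "success" if proc.returncode == 0 else "error",
        "compilation_time_ms": round((time.monotonic() - start) * 1000),
        "memory_usage_mb": round(peak_mb, 1),
        "compiler_output": proc.stdout.read().decode(),
    }
```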
Runtime and Privacy Notes
- `solve.py --prompt ...` requires `ANTHROPIC_API_KEY`. `solve.py --agent <name>` requires the selected CLI to be installed and already authenticated/configured locally.
- Agent-mode runs execute an external CLI in a temporary workspace. The benchmark prelude narrows the task in prompt text, but actual filesystem/network enforcement depends on the selected CLI runtime. Some backends, such as `claude-code`, use non-interactive full-permission flags so unattended runs can finish.
- The `codex` backend is currently a placeholder CLI builder and may need local customization before use.
- Generated artifacts under `answers/` and `results/*.json` are local benchmark outputs and are gitignored by default in this repository.
- Raw provider reasoning or agent transcript data is not persisted unless you explicitly opt in with `solve.py --save-diagnostics` and `check.py --include-diagnostics`.
Generating Proofs with an AI Agent
The primary workflow for automated proof generation is a two-phase pipeline:
1. Solve (`solve.py`) — default: `--agent claude-code --prelude skills/openmath-lean-benchmark/prelude/default.md`
2. Check (`check.py`) — default: latest run; use `--run-id` to target a specific one
Solve: Generate proofs (Phase 1) — agent mode by default, or a single direct LLM call with `--prompt`
# Default: agent + prelude (all benchmarks)
python3 skills/openmath-lean-benchmark/utils/solve.py
# Scope: single benchmark, difficulty, topic (can combine)
python3 skills/openmath-lean-benchmark/utils/solve.py --difficulty easy --topic algebra
python3 skills/openmath-lean-benchmark/utils/solve.py --benchmark-id easy_algebra_001 -v
# Direct LLM call
python3 skills/openmath-lean-benchmark/utils/solve.py --prompt "Complete the proof." --benchmark-id easy_algebra_001
# Agent mode (default is claude-code; override agent/prelude)
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code --benchmark-id easy_algebra_001
python3 skills/openmath-lean-benchmark/utils/solve.py --agent aider --difficulty easy
python3 skills/openmath-lean-benchmark/utils/solve.py --agent gemini --difficulty medium
python3 skills/openmath-lean-benchmark/utils/solve.py --agent opencode --difficulty easy
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code --difficulty hard --agent-timeout 1200
python3 skills/openmath-lean-benchmark/utils/solve.py --prelude skills/openmath-lean-benchmark/prelude/default.md --benchmark-id easy_algebra_001
# Stream agent output live (Rich panels, stream-json) and optional budget
python3 skills/openmath-lean-benchmark/utils/solve.py -v --benchmark-id easy_algebra_001
python3 skills/openmath-lean-benchmark/utils/solve.py --agent-max-budget-usd 0.5
# Optional: persist raw provider/agent diagnostics to trace files
python3 skills/openmath-lean-benchmark/utils/solve.py --save-diagnostics --benchmark-id easy_algebra_001
Check: Verify generated proofs (Phase 2)
# Check latest run
python3 skills/openmath-lean-benchmark/utils/check.py
# Check specific run
python3 skills/openmath-lean-benchmark/utils/check.py --run-id run_20260311_194832
python3 skills/openmath-lean-benchmark/utils/check.py -v
Run metadata schema (`answers/{run_id}/run_metadata.json`)
{
"run_id": "run_YYYYMMDD_HHMMSS",
"provider": "agent:claude-code",
"model": null,
"agent_tool": "claude-code",
"agent_timeout": 600,
"agent_max_budget_usd": 1.0,
"prelude": "prelude/default.md",
"diagnostics_saved": false,
"start_timestamp": "...",
"end_timestamp": "...",
"total_benchmarks": 5,
"completed": 5,
"errors": 0,
"total_input_tokens": 0,
"total_output_tokens": 0,
"total_diagnostic_tokens": 0,
"total_wall_time_seconds": 42.3
}
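A quick way to inspect the newest run's metadata from Python (a minimal sketch; it only assumes the `answers/` layout and field names documented above):

```python
# Summarize the most recent solve run from its run_metadata.json.
import json
from pathlib import Path

answers = Path("skills/openmath-lean-benchmark/answers")
# run_YYYYMMDD_HHMMSS names sort chronologically, so the lexical max is newest.
latest = max(p for p in answers.iterdir() if p.is_dir())
meta = json.loads((latest / "run_metadata.json").read_text())
print(f"{meta['run_id']}: {meta['completed']}/{meta['total_benchmarks']} completed, "
      f"{meta['errors']} errors, {meta['total_wall_time_seconds']}s wall time")
```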
Phase 2 verification output (`skills/openmath-lean-benchmark/results/verify_{run_id}_{timestamp}.json`) includes per-result token counts and `trace_file` by default. Raw diagnostic output is copied into that file only when `check.py --include-diagnostics` is used and the original solve run was created with `--save-diagnostics`.
Task 1: Adding a New Benchmark
Workflow:
1. Create the `.lean` file in the appropriate directory
2. Add a metadata entry to `skills/openmath-lean-benchmark/sets/benchmark_metadata.json`
3. Validate using `python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py`
4. Test execution with `python3 skills/openmath-lean-benchmark/utils/validate.py --benchmark-id <id>`
Example - Adding Easy Algebra Benchmark:
# Step 1: Create the LEAN file
# File: skills/openmath-lean-benchmark/sets/easy/algebra/associativity.lean
-- Benchmark: Associativity of Addition
-- Difficulty: easy
-- Topic: algebra
-- Description: Prove associativity of natural number addition
theorem nat_add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by
sorry
# Step 2: Update metadata
# Edit: skills/openmath-lean-benchmark/sets/benchmark_metadata.json
{
"benchmark_id": "easy_algebra_002",
"file_path": "easy/algebra/associativity.lean",
"difficulty": "easy",
"topic": "algebra",
"description": "Associativity of natural number addition",
"lemmas": ["nat_add_assoc"],
"expected_time_ms": 1000,
"tags": ["associativity", "addition", "natural-numbers"]
}
# Step 3: Validate
python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py
# Step 4: Test
python3 skills/openmath-lean-benchmark/utils/validate.py --benchmark-id easy_algebra_002 -v
Task 2: Running Benchmark Tests
Basic Execution:
# Validate that benchmark sets compile (expect failure, not error)
python3 skills/openmath-lean-benchmark/utils/validate.py
# Filter by difficulty
python3 skills/openmath-lean-benchmark/utils/validate.py --difficulty easy
# Filter by topic
python3 skills/openmath-lean-benchmark/utils/validate.py --topic algebra
# Specific benchmark
python3 skills/openmath-lean-benchmark/utils/validate.py --benchmark-id easy_algebra_001
# Verbose mode (detailed output)
python3 skills/openmath-lean-benchmark/utils/validate.py -v
Expected Outputs:
- Console: Progress updates, summary statistics
- File: `skills/openmath-lean-benchmark/results/results_YYYYMMDD_HHMMSS.json`
Task 3: Analyzing Results
Basic Analysis:
# View latest results
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py
# Compare multiple runs
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --compare
# Performance report
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --report performance
# Specific result files
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --pattern "results_20261103_*.json"
Task 4: Validating Benchmark Integrity
# Validate all benchmarks
python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py
Validation Checks:
- File existence
- Lemma name matching between file and metadata
- Presence of `sorry` in benchmark files
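Conceptually, the three checks amount to the following (an illustrative sketch; `check_benchmark` is hypothetical, not the `benchmark_manager.py` API):

```python
# Illustrative version of the validation checks; not the real implementation.
import re
from pathlib import Path

SETS = Path("skills/openmath-lean-benchmark/sets")

def check_benchmark(entry: dict) -> list[str]:
    """Return a list of problems for one benchmark_metadata.json entry."""
    path = SETS / entry["file_path"]
    if not path.exists():                                   # file existence
        return [f"missing file: {path}"]
    text = path.read_text()
    declared = set(re.findall(r"^theorem\s+(\S+)", text, re.MULTILINE))
    problems = [f"lemma {name!r} not found in {path.name}"  # name matching
                for name in entry["lemmas"] if name not in declared]
    if "sorry" not in text:                                 # sorry present
        problems.append(f"{path.name} has no sorry placeholder")
    return problems
```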
📐 Technical Specifications
LEAN4 Benchmark Format
Required Structure:
-- Benchmark: [Title]
-- Difficulty: [easy|medium|hard]
-- Topic: [algebra|combinatorics|logic]
-- Description: [Brief description]
theorem [theorem_name] [parameters] : [statement] := by
sorry
Important Constraints:
- Must contain at least one lemma with `sorry`
- Lemma names in the file must match the metadata
- File must be compilable by LEAN4 (e.g. `lean <file>.lean`)
Metadata Schema
Required Fields:
{
"benchmark_id": "string (unique identifier)",
"file_path": "string (relative to benchmark_sets/)",
"difficulty": "easy|medium|hard",
"topic": "string",
"description": "string",
"lemmas": ["array of lemma names"],
"expected_time_ms": "number (milliseconds)",
"tags": ["array of strings"]
}
Results Schema
Output Format:
{
"run_metadata": {
"timestamp": "ISO 8601 timestamp",
"lean_version": "string",
"total_benchmarks": "number",
"total_passed": "number",
"total_failed": "number",
"total_errors": "number",
"total_time_ms": "number",
"total_memory_mb": "number",
"wall_time_seconds": "number"
},
"results": [
{
"benchmark_id": "string",
"file_path": "string",
"difficulty": "string",
"topic": "string",
"status": "success|failure|error",
"compilation_time_ms": "number",
"memory_usage_mb": "number",
"lemmas_tested": ["array"],
"lemmas_solved": ["array"],
"has_sorry": "boolean",
"error_message": "string|null",
"compiler_output": "string",
"timestamp": "ISO 8601 timestamp"
}
]
}
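For example, a pass-rate summary over the newest results file (a minimal sketch relying only on the schema above):

```python
# Compute a quick pass/failure/error breakdown from the latest results file.
import json
from pathlib import Path

results_dir = Path("skills/openmath-lean-benchmark/results")
newest = max(results_dir.glob("results_*.json"), key=lambda p: p.name)
data = json.loads(newest.read_text())
statuses = [r["status"] for r in data["results"]]
print(f"{newest.name}: {statuses.count('success')}/{len(statuses)} success, "
      f"{statuses.count('failure')} failure, {statuses.count('error')} error")
```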
Status Definitions
- success: Compiled without errors, no `sorry` remains
- failure: Compiled successfully, but still contains `sorry`
- error: Compilation failed (syntax errors, type errors, timeout)
⚙️ Configuration
Important Configuration Variables (skills/openmath-lean-benchmark/utils/config.py)
# Executable
LEAN_EXECUTABLE = "lean" # Assumes lean is in PATH
# Timeouts (seconds)
EASY_TIMEOUT = 60 # 1 minute
MEDIUM_TIMEOUT = 180 # 3 minutes
HARD_TIMEOUT = 600 # 10 minutes
# Memory limit
MAX_MEMORY_MB = 4096 # 4GB
# Directory paths (auto-configured)
BENCHMARKS_ROOT = Path(__file__).parent.parent.resolve()
BENCHMARK_DIR = BENCHMARKS_ROOT / "sets"
RESULTS_DIR = BENCHMARKS_ROOT / "results"
ANSWERS_DIR = BENCHMARKS_ROOT / "answers"
# Direct API provider defaults
DEFAULT_PROVIDER = "claude"
DEFAULT_MODEL = "claude-sonnet-4-6"
DEFAULT_REASONING_BUDGET = 8000
# Agent-mode defaults
DEFAULT_AGENT_TIMEOUT = 600 # seconds the agent process may run
DEFAULT_AGENT_MAX_BUDGET_USD = 1.0 # max USD per agent run (e.g. claude --max-budget-usd); None = no limit
When to Modify:
- LEAN not in PATH: Update `LEAN_EXECUTABLE` in `skills/openmath-lean-benchmark/utils/config.py`
- Timeouts too short/long: Adjust the difficulty-specific timeouts or `DEFAULT_AGENT_TIMEOUT`
- Memory constraints: Modify `MAX_MEMORY_MB`
- Default agent duration: Modify `DEFAULT_AGENT_TIMEOUT`
- Agent spend cap: Modify `DEFAULT_AGENT_MAX_BUDGET_USD` or use `--agent-max-budget-usd`
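How the difficulty timeouts are consumed is up to the runner; a plausible lookup (hypothetical helper, assuming `config.py` is importable, e.g. when running from `utils/`) looks like:

```python
# Hypothetical helper mapping benchmark difficulty to its compile timeout.
from config import EASY_TIMEOUT, MEDIUM_TIMEOUT, HARD_TIMEOUT

def timeout_for(difficulty: str) -> int:
    """Return the compile timeout in seconds for a difficulty level."""
    return {"easy": EASY_TIMEOUT,
            "medium": MEDIUM_TIMEOUT,
            "hard": HARD_TIMEOUT}[difficulty]
```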
🎯 Best Practices for Agents
1. Before Making Changes
✅ DO:
- Read relevant README files first
- Validate existing benchmarks before adding new ones
- Check metadata schema compliance
- Test new benchmarks individually before bulk operations
- Keep generated artifacts local unless you have explicitly sanitized them
❌ DON'T:
- Modify existing benchmark files without backing them up
- Add benchmarks without metadata entries
- Run benchmarks without understanding timeout limits
- Commit results to version control (unless intentional)
- Persist raw provider diagnostics unless you explicitly need them
2. When Adding Benchmarks
✅ DO:
- Use descriptive benchmark IDs (format: `{difficulty}_{topic}_{number}`)
- Include comprehensive comments in LEAN files
- Set realistic expected_time_ms based on difficulty
- Add relevant tags for categorization
- Validate immediately after creation
❌ DON'T:
- Use duplicate benchmark IDs
- Create benchmarks without `sorry` placeholders
- Omit documentation comments
- Skip metadata updates
3. When Running Tests
✅ DO:
- Start with small test runs (single benchmark or easy difficulty)
- Use verbose mode (`-v`) for debugging
- Monitor system resources for long-running tests
- Review compiler output for errors
- Use `--save-diagnostics` only for targeted local debugging
❌ DON'T:
- Run all benchmarks on untested code
- Ignore timeout warnings
- Delete result files without archiving
- Interrupt running benchmarks abruptly
4. When Analyzing Results
✅ DO:
- Compare results over multiple runs for trends
- Investigate performance outliers
- Keep historical results for regression testing only if they are sanitized and intentionally stored
- Document unexpected behaviors
❌ DON'T:
- Rely on a single test run
- Ignore error messages
- Delete result files prematurely
- Skip validation of anomalies
🔍 Troubleshooting Guide
Common Issues and Solutions
Issue: "LEAN not found" or "Unknown command: lean"
Solution:
- Verify LEAN4 installation: `lean --version`
- If not installed, install LEAN4
- If installed but not in PATH, update `LEAN_EXECUTABLE` in `config.py`
Issue: "Import errors" when running Python scripts
Solution:
- Ensure you're in the project root directory
- Install dependencies: `pip3 install -r skills/openmath-lean-benchmark/utils/requirements.txt`
- Use `python3` (not `python`) on macOS/Linux
Issue: agent CLI missing or not authenticated
Solution:
- Install the selected CLI so it is available in `PATH`
- Complete the CLI's local sign-in or provider configuration before running `solve.py`
- For direct API mode, set `ANTHROPIC_API_KEY`
Issue: "Timeout errors" during benchmark execution
Solution:
- Check timeout settings in `skills/openmath-lean-benchmark/utils/config.py`
- Increase the timeout for the specific difficulty level
- Verify the benchmark doesn't have infinite loops
- Use `--verbose` to see where it hangs
Issue: "Mismatched lemmas" validation error
Solution:
- Verify lemma names in the `.lean` file match the metadata
- Check for typos in lemma names
- Ensure the file has correct LEAN syntax
- Re-validate after corrections
Issue: `unexpected identifier; expected command` on the `lemma` keyword
Solution:
- `lemma` is no longer a recognized keyword in Lean 4.29+; use `theorem` instead
- All benchmark files must use `theorem` for top-level declarations
- When adding new benchmarks, always use `theorem`, not `lemma`
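For example, renaming the declaration keyword is the whole fix:

```lean
-- Fails to parse here: lemma nat_add_comm (a b : Nat) : a + b = b + a := ...
-- Works: declare it with `theorem` instead.
theorem nat_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```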
Issue: Results showing all "failure" status
Solution:
- This is expected for benchmarks with `sorry` placeholders
- "failure" means the file compiled successfully but the proofs are incomplete
- Only completed proofs (no `sorry`) show "success"
📊 Performance Expectations
Typical Execution Times
| Difficulty | Lemmas per Benchmark | Avg Time/Benchmark | Total Time (10 benchmarks) |
|---|---|---|---|
| Easy | 1-5 lemmas | 100-1000ms | 1-10 seconds |
| Medium | 3-8 lemmas | 1-5 seconds | 10-50 seconds |
| Hard | 5-15 lemmas | 5-30 seconds | 50-300 seconds |
Memory Usage
- Easy: 20-100 MB per benchmark
- Medium: 50-500 MB per benchmark
- Hard: 100-2000 MB per benchmark
🚀 Quick Reference Commands
# Setup
pip3 install -r skills/openmath-lean-benchmark/utils/requirements.txt
# Validation
python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py
# Solve: generate proofs (default agent + prelude)
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code
python3 skills/openmath-lean-benchmark/utils/solve.py --agent claude-code --difficulty easy --agent-timeout 300
python3 skills/openmath-lean-benchmark/utils/solve.py --agent aider --benchmark-id easy_algebra_001 -v
python3 skills/openmath-lean-benchmark/utils/solve.py --save-diagnostics --benchmark-id easy_algebra_001
# Check: verify generated proofs with LEAN compiler
python3 skills/openmath-lean-benchmark/utils/check.py
python3 skills/openmath-lean-benchmark/utils/check.py --run-id run_20260311_194832 -v
python3 skills/openmath-lean-benchmark/utils/check.py --run-id run_20260311_194832 --include-diagnostics
# Validate: verify benchmark sets compile (expect failure, not error)
python3 skills/openmath-lean-benchmark/utils/validate.py # All
python3 skills/openmath-lean-benchmark/utils/validate.py --difficulty easy # By difficulty
python3 skills/openmath-lean-benchmark/utils/validate.py --topic algebra # By topic
python3 skills/openmath-lean-benchmark/utils/validate.py -v # Verbose
# Analyze results
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py # Latest summary
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --compare # Compare runs
python3 skills/openmath-lean-benchmark/utils/evaluate_results.py --report performance # Performance details
📝 Notes for Agents
File Modification Guidelines
- skills/openmath-lean-benchmark/sets/ - Add new benchmarks, update metadata
- skills/openmath-lean-benchmark/results/ - Read-only generated outputs; gitignored by default
- skills/openmath-lean-benchmark/utils/ - Modify config.py for settings, avoid changing core logic unless necessary
- README files - Update when adding new features or workflows
Integration Points
- CI/CD: Results can be parsed for automated testing (see the gate sketch after this list)
- Analytics: JSON results ideal for data analysis pipelines
- Reporting: evaluate_results.py provides customizable reports
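As an example of the CI/CD hook, a minimal gate script (illustrative only; it assumes the results schema documented above):

```python
# Fail a CI job when the latest benchmark run recorded any errors.
import json
import sys
from pathlib import Path

newest = max(Path("skills/openmath-lean-benchmark/results").glob("results_*.json"),
             key=lambda p: p.name)
run = json.loads(newest.read_text())["run_metadata"]
sys.exit(1 if run["total_errors"] > 0 else 0)
```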
Extension Opportunities
Agents can extend this system by:
- Adding new topics (create subdirectories, update metadata)
- Adding new agent tool backends in `skills/openmath-lean-benchmark/utils/providers/agents/` (e.g. a new module plus an `AGENT_REGISTRY` entry; see the sketch after this list)
- Implementing custom result analyzers
- Creating visualization tools for results
- Building automated benchmark generators
- Integrating with proof assistants
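A sketch of what a new backend module might look like (the `build_command` signature and registry wiring are assumptions; mirror an existing module such as the `claude-code` backend for the actual contract):

```python
# Hypothetical skills/openmath-lean-benchmark/utils/providers/agents/my_cli.py
# The signature and AGENT_REGISTRY wiring below are illustrative assumptions.

def build_command(prompt_file: str, workspace: str, timeout: int) -> list[str]:
    """Build the argv for one non-interactive run of the external CLI."""
    return ["my-cli", "run", "--non-interactive",
            "--timeout", str(timeout),
            "--prompt-file", prompt_file, "--workdir", workspace]

# Registration, e.g. wherever AGENT_REGISTRY is defined:
# AGENT_REGISTRY["my-cli"] = build_command
```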
Additional Resources
- LEAN4 Documentation: lean-lang.org
- Benchmark Format Guide: `sets/README.md`
- Results Schema: `results/README.md`
- Utils Documentation: `utils/README.md`
Agent Checklist
Before completing benchmark work, verify:
- All new benchmarks have metadata entries
- Benchmark IDs are unique
- LEAN files compile without syntax errors
- Validation passes (`python3 skills/openmath-lean-benchmark/utils/benchmark_manager.py`)
- Test runs complete successfully
- Results are properly formatted JSON
- No raw provider diagnostics in stored files unless explicitly intended