name: run-benchmark
description: Run an MCP evaluation using mcpbr on SWE-bench or other datasets.
Instructions
You are an expert at benchmarking AI agents using the mcpbr CLI. Your goal is to run valid, reproducible evaluations.
Critical Constraints (DO NOT IGNORE)
- Docker is Mandatory: Before running ANY mcpbr command, you MUST verify Docker is running (docker ps). If not, tell the user to start it.
- Config is Required: mcpbr run FAILS without a config file. Never guess flags.
  - IF no config exists: Run mcpbr init first to generate a template.
  - IF config exists: Read it (cat mcpbr.yaml or the specified config path) to verify the mcp_server command is valid for the user's environment (e.g., check if npx or uvx is installed).
- Workdir Placeholder: When generating configs, ensure args includes "{workdir}". Do not resolve this path yourself; mcpbr handles it.
- API Key Required: The ANTHROPIC_API_KEY environment variable must be set. Check for it before running evaluations.
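The constraints above can be sketched as a small pre-flight script. This is a minimal illustration, not part of mcpbr: the function names (docker_is_running, api_key_is_set, config_exists, preflight) are hypothetical, and only docker ps, ANTHROPIC_API_KEY, mcpbr.yaml, and mcpbr init come from this document. The key check tests for a non-empty value without printing it.

```shell
#!/bin/sh
# Hypothetical helper functions; none of these are provided by mcpbr.

# Succeeds only if the Docker daemon answers `docker ps`.
docker_is_running() {
    docker ps >/dev/null 2>&1
}

# Succeeds only if ANTHROPIC_API_KEY is set and non-empty.
# The value itself is never printed.
api_key_is_set() {
    [ -n "${ANTHROPIC_API_KEY:-}" ]
}

# Succeeds only if the given config path is a readable regular file.
config_exists() {
    [ -f "$1" ] && [ -r "$1" ]
}

# Prints one line per failed check; returns non-zero if any check failed.
# $1 = config path (defaults to mcpbr.yaml).
preflight() {
    config_path="${1:-mcpbr.yaml}"
    status=0
    docker_is_running || { echo "Docker is not running; start it first."; status=1; }
    api_key_is_set || { echo "ANTHROPIC_API_KEY is not set."; status=1; }
    config_exists "$config_path" || { echo "No config at $config_path; run: mcpbr init"; status=1; }
    return "$status"
}
```

One possible use is to gate the evaluation on the checks, for example preflight mcpbr.yaml && mcpbr run -c mcpbr.yaml.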
Common Pitfalls to Avoid
- DO NOT use the -m flag unless the user explicitly asks to override the model in the YAML.
- DO NOT hallucinate dataset names. Valid datasets include:
  - SWE-bench/SWE-bench_Lite (default for SWE-bench)
  - SWE-bench/SWE-bench_Verified
  - sunblaze-ucb/cybergym (for CyberGym benchmark)
  - MCPToolBench/MCPToolBenchPP (for MCPToolBench++)
- DO NOT hallucinate flags or options. Only use documented CLI flags.
- DO NOT forget to specify the config file with -c or --config.
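One way to guard against invented dataset names is to compare a candidate against the identifiers listed in this section before using it. The sketch below is illustrative only: is_known_dataset is a hypothetical helper, and it knows just the four names given here, so it would need updating if mcpbr accepts others.

```shell
# Hypothetical helper, not part of mcpbr.
# Returns 0 if "$1" is one of the dataset identifiers listed in this
# document, 1 otherwise. The match is exact and case-sensitive.
is_known_dataset() {
    case "$1" in
        SWE-bench/SWE-bench_Lite|SWE-bench/SWE-bench_Verified) return 0 ;;
        sunblaze-ucb/cybergym) return 0 ;;
        MCPToolBench/MCPToolBenchPP) return 0 ;;
        *) return 1 ;;
    esac
}
```

A near miss such as swe-bench/swe-bench_lite is rejected because of the case-sensitive match, which is the kind of slip this check is meant to catch.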
Supported Benchmarks
mcpbr supports three benchmarks:
- SWE-bench (default): Real GitHub issues requiring bug fixes
  - Dataset: SWE-bench/SWE-bench_Lite or SWE-bench/SWE-bench_Verified
  - Use: mcpbr run -c config.yaml or --benchmark swe-bench
- CyberGym: Security vulnerabilities requiring PoC exploits
  - Dataset: sunblaze-ucb/cybergym
  - Use: mcpbr run -c config.yaml --benchmark cybergym --level [0-3]
- MCPToolBench++: Large-scale tool use evaluation
  - Dataset: MCPToolBench/MCPToolBenchPP
  - Use: mcpbr run -c config.yaml --benchmark mcptoolbench
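The benchmark selection above can be sketched as a function that emits only the flag combinations this section documents. benchmark_flags is a hypothetical helper. It treats the CyberGym level as optional because this document does not say whether mcpbr applies a default when --level is omitted; that is an assumption of the sketch.

```shell
# Hypothetical helper, not part of mcpbr.
# Prints the flags for the requested benchmark; returns 1 on invalid input.
# $1 = benchmark name, $2 = optional CyberGym level (0-3).
benchmark_flags() {
    case "$1" in
        swe-bench|mcptoolbench)
            echo "--benchmark $1" ;;
        cybergym)
            case "${2:-}" in
                "") echo "--benchmark cybergym" ;;
                0|1|2|3) echo "--benchmark cybergym --level $2" ;;
                *) echo "CyberGym level must be 0, 1, 2, or 3" >&2; return 1 ;;
            esac ;;
        *)
            echo "unknown benchmark: $1" >&2; return 1 ;;
    esac
}
```

The output is meant to be appended to mcpbr run -c config.yaml; anything outside the three documented benchmark names is refused rather than passed through.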
Execution Steps
Follow these steps in order:
- Verify Prerequisites:
    # Check Docker is running
    docker ps
    # Verify API key is set
    echo $ANTHROPIC_API_KEY
- Check for Config File:
  - If mcpbr.yaml (or user-specified config) does NOT exist: Run mcpbr init to generate it.
  - If config exists: Read it to understand the configuration.
- Validate Config:
  - Ensure mcp_server.command is valid (e.g., npx, uvx, python are installed).
  - Ensure mcp_server.args includes "{workdir}" placeholder.
  - Verify model, dataset, and other parameters are correctly set.
- Construct the Command:
  - Base command: mcpbr run --config <path-to-config>
  - Add flags as needed based on user request:
    - -n <number> or --sample <number>: Override sample size
    - -v or -vv: Verbose output
    - -o <path>: Save JSON results
    - -r <path>: Save Markdown report
    - --log-dir <path>: Save per-instance logs
    - -M: MCP-only evaluation (skip baseline)
    - -B: Baseline-only evaluation (skip MCP)
    - --benchmark <name>: Override benchmark
    - --level <0-3>: Set CyberGym difficulty level
- Run the Command: Execute the constructed command and monitor the output.
- Handle Results:
  - If the run completes successfully, inform the user about the results.
  - If errors occur, diagnose and provide actionable feedback.
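The validation and command-construction steps can be approximated in a few lines of shell. Both function names below are hypothetical. The placeholder check is a plain text search rather than a YAML parse, so it only confirms that {workdir} appears somewhere in the file, not that it sits under mcp_server.args. The builder uses only the --config and -n flags listed in this section.

```shell
# Hypothetical helpers, not part of mcpbr.

# Succeeds if the config file contains the literal {workdir} placeholder.
# -F makes grep treat the braces as fixed text, -q suppresses output.
config_has_workdir() {
    grep -qF '{workdir}' "$1"
}

# Prints the mcpbr command line for a config path and an optional
# sample size. Paths containing spaces are not handled by this sketch.
# $1 = config path, $2 = optional sample size.
build_run_command() {
    config_path="$1"
    sample="${2:-}"
    cmd="mcpbr run --config $config_path"
    if [ -n "$sample" ]; then
        cmd="$cmd -n $sample"
    fi
    echo "$cmd"
}
```

Printing the command instead of executing it leaves room to show it to the user before the run starts.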
Example Commands
# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v
# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10
# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md
# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5
# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
Troubleshooting
If you encounter errors:
- Docker not running: Remind the user to start Docker Desktop or the Docker daemon.
- API key missing: Ask the user to set it with export ANTHROPIC_API_KEY="sk-ant-..."
- Config file invalid: Re-generate with mcpbr init or fix the YAML syntax.
- MCP server fails to start: Test the server command independently.
- Timeout issues: Suggest increasing timeout_seconds in config.
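When the MCP server fails to start, a quick first check is whether its launcher is on PATH at all. The sketch below assumes the launcher is a plain executable name such as npx or uvx. launcher_available and missing_launchers are hypothetical helpers, and they only test for presence, not whether the server itself runs correctly.

```shell
# Hypothetical helpers, not part of mcpbr.

# Succeeds if the named launcher (for example npx or uvx) is on PATH.
launcher_available() {
    command -v "$1" >/dev/null 2>&1
}

# Prints, one per line, each launcher from the argument list that is
# missing from PATH. Prints nothing when all are present.
missing_launchers() {
    for tool in "$@"; do
        launcher_available "$tool" || echo "$tool"
    done
}
```

For example, missing_launchers npx uvx lists whichever of the two is not installed, which gives the user something concrete to fix before re-running.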
Important Reminders
- Always read the config file before making assumptions about what's configured.
- Never modify the config file without explicit user permission.
- Use the mcpbr models command to check available models if needed.
- Use the mcpbr benchmarks command to list available benchmarks.