---
name: mcp-evaluation-skill
version: 1.0.0
category: mcp-development
description: Comprehensive evaluation creation for MCP servers - question generation, answer verification, and XML formatting for agent usability testing
triggers:
  - "create evaluations"
  - "evaluation questions"
  - "test MCP server"
  - "verify MCP tools"
  - "agent usability"
dependencies:
  - mcp-builder-skill
author: Engineering Standards Committee
last_updated: 2025-12-29
---
# MCP Evaluation Skill

## Description
This skill provides a systematic approach to creating comprehensive evaluation suites for MCP (Model Context Protocol) servers. Evaluations test whether AI agents can effectively use MCP tools to answer realistic, complex questions - the ultimate measure of MCP server quality.
Core Capabilities:
- Question generation methodology (simple → moderate → complex)
- Answer verification through manual solving
- XML format specification for evaluation frameworks
- Complexity distribution optimization (2-6-2 pattern)
- Independence and stability validation
- Real-world use case identification
## When to Use This Skill
Use this skill when you need to:
- Create evaluation suites for new MCP servers
- Validate MCP tool usability by AI agents
- Test complex multi-tool workflows
- Verify agent can discover and use tools correctly
- Generate realistic questions based on actual data
- Ensure stable, verifiable answers
Trigger Phrases:
- "Create 10 evaluation questions for this MCP server"
- "Generate evaluation suite"
- "Test if agents can use these tools"
- "Verify MCP server with evaluations"
- "Create XML evaluation file"
Don't use this skill for:
- Unit testing (use validator-role-skill instead)
- Integration testing (different testing methodology)
- Manual QA testing (evaluations are for automated agent testing)
- API documentation (use scribe-role-skill)
## Prerequisites

### Knowledge Requirements
- **MCP Protocol Understanding**
  - Tool, resource, and prompt concepts
  - Input schemas (Pydantic/Zod)
  - Response format best practices
  - Agent-centric design principles
- **Evaluation Theory**
  - Independence (no question dependencies)
  - Read-only operations (non-destructive)
  - Verifiability (string comparison)
  - Stability (answer doesn't change over time)
  - Complexity levels (simple, moderate, complex)
- **Domain Knowledge**
  - Understanding of the target API/service
  - Realistic use cases humans care about
  - Data relationships and patterns
  - Edge cases worth testing
### Environment Setup

```bash
# Ensure MCP server is running
npm run build
node dist/index.js &

# Or use evaluation harness (recommended)
# Harness manages server lifecycle automatically
```
### Project Context
- Phase 4 of MCP Development: Evaluations come after implementation (Phases 1-3)
- MCP Server Running: Must have working MCP server to explore data
- Tool Documentation: Understand what each tool does
- Read-Only Access: Evaluation questions must not modify data
## Workflow

### Phase 1: Tool Inspection and Understanding

#### 1.1 List All Available Tools
Objective: Understand the complete capability surface of the MCP server
```bash
# If using MCP inspector
mcp-inspector --server ./dist/index.js tools list

# Manual inspection via code
grep -r "@tool" src/mcp/tools/
```
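For programmatic inspection, a minimal sketch using the official TypeScript SDK (`@modelcontextprotocol/sdk`) might look like the following; the command and build-output path are assumptions about your project layout:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the built server over stdio and enumerate its tools.
const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/index.js"], // assumed build output path
});
const client = new Client({ name: "eval-inspector", version: "1.0.0" });
await client.connect(transport);

const { tools } = await client.listTools();
for (const tool of tools) {
  console.log(`${tool.name}: ${tool.description ?? "(no description)"}`);
}
await client.close();
```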
Document Each Tool:

| Tool Name | Purpose | Input Parameters | Output | Complexity |
|---|---|---|---|---|
| list_miners | Get all registered miners | { limit?, offset? } | { miners: [...] } | Simple |
| get_miner_status | Get detailed miner status | { minerId } | { status, hashrate, temp } | Simple |
| update_firmware | Update miner firmware | { minerId, version } | { jobId, status } | Complex |
| get_fleet_summary | Aggregated fleet metrics | { tenantId? } | { total, online, hashrate } | Moderate |
Key Insights to Capture:
- Which tools return lists vs single items?
- Which tools require IDs from other tools? (workflow chaining)
- Which tools have optional parameters?
- Which tools enable complex multi-step questions?
#### 1.2 Understand Tool Relationships

Pattern: Map Tool Dependencies

```text
list_miners → get_miner_status (requires minerId from list)
        ↓
update_firmware (requires minerId)
        ↓
check_job_status (requires jobId from update)
```
Workflow Chains to Test:

- Discovery → Detail: `list_miners` → `get_miner_status`
- Discovery → Action: `list_miners` → `update_firmware` → `check_job_status`
- Aggregation → Filter: `get_fleet_summary` → `list_miners` (with filters)
- Multi-Resource: `get_miner_status` + `get_pool_config` + `get_firmware_version`

A sketch of the Discovery → Detail chain follows.
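This hedged sketch reuses the SDK `client` from the inspection sketch in 1.1; the tool names and the JSON-in-text response shape are assumptions about this particular server:

```typescript
// Step 1: discover miners, then feed an ID into the detail tool.
const listResult = await client.callTool({
  name: "list_miners",
  arguments: { limit: 10 },
});
// Assumption: the server returns JSON inside a text content block.
const { miners } = JSON.parse((listResult.content as any)[0].text);

// Step 2: fetch detail for the first discovered miner.
const statusResult = await client.callTool({
  name: "get_miner_status",
  arguments: { minerId: miners[0].id },
});
console.log((statusResult.content as any)[0].text);
```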
### Phase 2: Content Exploration (Read-Only)

#### 2.1 Use READ-ONLY Tools to Explore Data

Critical Rule: Never use destructive operations during exploration.

Exploration Strategy:

```typescript
// Example: Explore miner fleet
const miners = await mcpServer.callTool("list_miners", { limit: 100 });
// Identify interesting miners: highest hashrate, highest temp, offline, etc.

const detailedStatus = await mcpServer.callTool("get_miner_status", {
  minerId: miners.miners[0].id,
});
// Understand status structure: what fields exist? What values?

const fleetSummary = await mcpServer.callTool("get_fleet_summary", {});
// Understand aggregated metrics: total miners, online count, average hashrate
```
Data Patterns to Identify:

- **Uniqueness**: Which fields uniquely identify entities?
  - Example: `minerId`, `serialNumber`, `ipAddress`
- **Relationships**: How do entities relate?
  - Example: Miners → Pools, Miners → Firmware Versions
- **Ranges**: What are typical value ranges?
  - Example: Temperature (40-80°C), Hashrate (90-100 TH/s)
- **Edge Cases**: Interesting outliers to test
  - Example: Offline miners, miners with errors, miners updating firmware
- **Aggregations**: What can be calculated?
  - Example: Total hashrate, average temperature, count by status
#### 2.2 Document Data Characteristics
Data Classification Matrix:
| Data Type | Change Frequency | Uniqueness | Suitable for Evaluation? |
|---|---|---|---|
| Miner ID | Never | Unique | ✅ Yes (stable reference) |
| Hashrate | Every 1-5s | Non-unique | ❌ No (too volatile) |
| Firmware version | Rarely | Non-unique | ✅ Yes (stable) |
| Temperature | Every 1-5s | Non-unique | ❌ No (too volatile) |
| Pool URL | Rarely | Non-unique | ✅ Yes (stable) |
| Error messages | Varies | Non-unique | ⚠️ Maybe (if persistent) |
Stable vs Volatile Data:
- Stable: Suitable for evaluation answers (firmware versions, pool URLs, miner counts)
- Volatile: Unsuitable (hashrate, temperature, current status)
### Phase 3: Question Generation

#### 3.1 Complexity Distribution (2-6-2 Pattern)
Target Distribution for 10 Questions:
- 2 Simple (1-2 tool calls, straightforward lookup)
- 6 Moderate (2-4 tool calls, some reasoning/filtering)
- 2 Complex (4+ tool calls, deep exploration, multi-step workflows)
#### 3.2 Simple Questions (Single Tool or Straightforward Workflow)
Characteristics:
- 1-2 tool calls
- Obvious solution path
- Direct lookup or simple filter
- Answer is immediate from tool output
Examples:

**Simple Discovery**

```xml
<question>How many miners are currently registered in the fleet?</question>
<answer>127</answer>
<!-- Tools: list_miners (count total) -->
```

**Simple Detail Lookup**

```xml
<question>What firmware version is miner-abc-123 running?</question>
<answer>2.5.1</answer>
<!-- Tools: get_miner_status(miner-abc-123) → firmware_version -->
```
#### 3.3 Moderate Questions (Multi-Tool, Filtering, Reasoning)
Characteristics:
- 2-4 tool calls
- Requires filtering or sorting
- Some logic to combine results
- May need to identify "best" or "worst"
Examples:

**Find by Characteristic**

```xml
<question>Which miner in the fleet has the highest hashrate? What is its IP address?</question>
<answer>192.168.1.157</answer>
<!-- Tools:
  1. list_miners
  2. get_miner_status for each (or use fleet summary)
  3. Identify max hashrate
  4. Return IP address
-->
```

**Aggregation with Filter**

```xml
<question>How many miners are currently offline in tenant 'prod-west'?</question>
<answer>3</answer>
<!-- Tools:
  1. list_miners({ tenantId: 'prod-west' })
  2. Filter by status === 'offline'
  3. Count results
-->
```

**Cross-Resource Query**

```xml
<question>Which pool URL is configured for the miner with serial number SN-7891? Include the pool priority.</question>
<answer>stratum+tcp://pool.example.com:3333 (priority: 0)</answer>
<!-- Tools:
  1. list_miners → find miner by serial number
  2. get_pool_config(minerId) → get pool configuration
  3. Extract URL and priority
-->
```
#### 3.4 Complex Questions (Deep Exploration, Multi-Step)
Characteristics:
- 4+ tool calls
- Requires exploring multiple layers
- Chained dependencies (output of one tool feeds next)
- Combines data from multiple sources
- May require finding relationships or patterns
Examples:

**Deep Workflow Exploration**

```xml
<question>Find the miner with the oldest firmware version in the fleet. What is its pool URL?</question>
<answer>stratum+tcp://old-pool.example.com:3333</answer>
<!-- Tools:
  1. list_miners (get all miners)
  2. get_miner_status for each (or batch query)
  3. Identify oldest firmware version
  4. get_pool_config for that specific miner
-->
```

**Multi-Condition Search**

```xml
<question>Among miners running firmware 2.5.x, which one has been online the longest? What is its uptime in hours?</question>
<answer>1847</answer>
<!-- Tools:
  1. list_miners
  2. get_miner_status for each
  3. Filter by firmware version (2.5.x regex)
  4. Identify max uptime
  5. Convert to hours and return
-->
```

**Pattern Discovery**

```xml
<question>Which firmware version is most commonly deployed across all miners in the 'prod' tenant? How many miners use it?</question>
<answer>2.5.1 (94 miners)</answer>
<!-- Tools:
  1. list_miners({ tenantId: 'prod' })
  2. get_miner_status for each
  3. Group by firmware version
  4. Find most common (mode)
  5. Return version + count
-->
```
#### 3.5 Question Quality Checklist
For each generated question, verify:
- Independent: Doesn't depend on answers from other questions
- Read-Only: Only uses non-destructive tools
- Verifiable: Has single, clear answer (string comparison)
- Stable: Answer won't change over time (no volatile data)
- Realistic: Based on actual use case humans care about
- Answerable: Agent can solve with available tools
- Clear: Unambiguous what's being asked
- Complete: Includes all context needed
Red Flags (Avoid These):
- ❌ "What is the current temperature of miner-123?" (too volatile)
- ❌ "Update firmware and tell me the result" (destructive)
- ❌ "Solve question 3 first, then answer this" (dependent)
- ❌ "Approximately how many miners..." (vague, not verifiable)
### Phase 4: Answer Verification

#### 4.1 Manually Solve Each Question

Critical Rule: You must solve every question yourself to verify the answer.

Verification Process:

```typescript
// For each question, document solving process:
// Question: "How many miners are in tenant 'prod-west'?"

// Step 1: Call list_miners
const miners = await mcpServer.callTool("list_miners", {
  tenantId: "prod-west",
});
// Result: { miners: [...], total: 47 }

// Step 2: Verify count
console.log(`Total miners: ${miners.total}`);
// Output: Total miners: 47

// Step 3: Document answer
// Answer: 47

// Step 4: Verify stability
// - Tenant membership rarely changes ✅
// - Answer won't be volatile ✅
// - Answer is deterministic ✅
```
#### 4.2 Answer Format Guidelines
String Comparison Requirements:
| Answer Type | Format | Example |
|---|---|---|
| Number | Plain number | 47 (not "47 miners") |
| String | Exact string | prod-west (not "Tenant: prod-west") |
| IP Address | Standard notation | 192.168.1.100 |
| URL | Full URL | stratum+tcp://pool.example.com:3333 |
| Version | Semantic version | 2.5.1 (not "v2.5.1") |
| Boolean | true or false | true (lowercase) |
| List | Comma-separated | miner-1,miner-2,miner-3 (no spaces) |
Multiple-Part Answers:

If a question asks for multiple pieces of information, format the answer with a clear delimiter between parts:

```xml
<question>What is the IP address and pool URL for miner-abc-123?</question>
<answer>192.168.1.100, stratum+tcp://pool.example.com:3333</answer>
<!-- Clear delimiter (comma + space) between parts -->
```
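In a harness, a small normalization helper can enforce these rules before exact comparison. This is a sketch of one possible policy, not part of any evaluation framework:

```typescript
// Normalize an answer per the format rules above before exact comparison.
function normalizeAnswer(raw: string): string {
  return raw
    .trim()
    .replace(/^v(?=\d)/i, "") // "v2.5.1" → "2.5.1"
    .replace(/\s*,\s*/g, ",") // tolerate "a, b" vs "a,b" in list answers
    .toLowerCase();           // "True" → "true"
}

function answersMatch(expected: string, actual: string): boolean {
  return normalizeAnswer(expected) === normalizeAnswer(actual);
}

// answersMatch("2.5.1", "v2.5.1")                     → true
// answersMatch("miner-1,miner-2", "miner-1, miner-2") → true
```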
#### 4.3 Stability Verification
Check Answer Stability:

- Re-run verification after 1 hour - the answer should be the same
- Identify dependencies - what would cause the answer to change?
- Avoid time-sensitive data - current status, real-time metrics
- Use historical or configuration data - firmware versions, pool URLs, miner IDs
Stable vs Unstable Examples:
| Question | Stability | Reason |
|---|---|---|
| "How many miners are registered?" | ✅ Stable | Rarely changes |
| "What is miner-123's hashrate?" | ❌ Unstable | Changes every second |
| "Which firmware version is on miner-abc?" | ✅ Stable | Only changes on update |
| "How many miners are currently online?" | ❌ Unstable | Changes frequently |
| "What pool URL is miner-xyz using?" | ✅ Stable | Configuration data |
### Phase 5: XML Output Generation

#### 5.1 XML Format Specification

Complete Evaluation File Structure:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<evaluation>
  <metadata>
    <name>Braiins OS MCP Server Evaluation</name>
    <version>1.0</version>
    <created>2025-12-29</created>
    <author>Engineering Team</author>
    <description>Comprehensive evaluation suite testing agent usability of Braiins OS MCP server</description>
  </metadata>

  <qa_pairs>
    <!-- Simple questions (2) -->
    <qa_pair>
      <id>eval-001</id>
      <difficulty>simple</difficulty>
      <question>How many miners are currently registered in the fleet?</question>
      <answer>127</answer>
      <tools_required>list_miners</tools_required>
      <expected_call_count>1</expected_call_count>
    </qa_pair>
    <qa_pair>
      <id>eval-002</id>
      <difficulty>simple</difficulty>
      <question>What firmware version is miner-abc-123 running?</question>
      <answer>2.5.1</answer>
      <tools_required>get_miner_status</tools_required>
      <expected_call_count>1</expected_call_count>
    </qa_pair>

    <!-- Moderate questions (6) -->
    <qa_pair>
      <id>eval-003</id>
      <difficulty>moderate</difficulty>
      <question>Which miner in the fleet has the highest hashrate? What is its IP address?</question>
      <answer>192.168.1.157</answer>
      <tools_required>list_miners, get_miner_status</tools_required>
      <expected_call_count>3-5</expected_call_count>
    </qa_pair>
    <!-- ... 5 more moderate questions ... -->

    <!-- Complex questions (2) -->
    <qa_pair>
      <id>eval-009</id>
      <difficulty>complex</difficulty>
      <question>Find the miner with the oldest firmware version in the fleet. What is its pool URL?</question>
      <answer>stratum+tcp://old-pool.example.com:3333</answer>
      <tools_required>list_miners, get_miner_status, get_pool_config</tools_required>
      <expected_call_count>5+</expected_call_count>
    </qa_pair>
    <qa_pair>
      <id>eval-010</id>
      <difficulty>complex</difficulty>
      <question>Which firmware version is most commonly deployed across all miners in the 'prod' tenant? How many miners use it?</question>
      <answer>2.5.1 (94 miners)</answer>
      <tools_required>list_miners, get_miner_status</tools_required>
      <expected_call_count>5+</expected_call_count>
    </qa_pair>
  </qa_pairs>

  <statistics>
    <total_questions>10</total_questions>
    <simple_count>2</simple_count>
    <moderate_count>6</moderate_count>
    <complex_count>2</complex_count>
    <total_tools>4</total_tools>
    <avg_tools_per_question>2.3</avg_tools_per_question>
  </statistics>
</evaluation>
```
#### 5.2 Metadata Best Practices
- Name: Descriptive name of MCP server being evaluated
- Version: Evaluation suite version (bump when questions change)
- Created: ISO 8601 date (YYYY-MM-DD)
- Author: Team or individual who created evaluations
- Description: Brief explanation of what's being tested
#### 5.3 QA Pair Best Practices

Required Fields:

- `<id>`: Unique identifier (eval-001, eval-002, ...)
- `<difficulty>`: simple | moderate | complex
- `<question>`: Clear, unambiguous question text
- `<answer>`: Verified answer (string comparison format)

Optional but Recommended Fields:

- `<tools_required>`: Comma-separated tool names needed
- `<expected_call_count>`: How many tool calls are expected (for performance testing)
- `<rationale>`: Why this question is valuable (internal documentation)
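For illustration, here is a qa_pair from the suite above with all optional fields filled in; the rationale text is illustrative:

```xml
<qa_pair>
  <id>eval-003</id>
  <difficulty>moderate</difficulty>
  <question>Which miner in the fleet has the highest hashrate? What is its IP address?</question>
  <answer>192.168.1.157</answer>
  <tools_required>list_miners, get_miner_status</tools_required>
  <expected_call_count>3-5</expected_call_count>
  <rationale>Tests discovery → detail chaining and max-by-field reasoning</rationale>
</qa_pair>
```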
## Examples

### Example 1: Complete Evaluation Creation Process
Target: Braiins OS MCP Server with 4 tools
Step 1: Tool Inspection

```text
Available tools:
1. list_miners({ limit?, offset?, tenantId? })
2. get_miner_status({ minerId })
3. get_fleet_summary({ tenantId? })
4. get_pool_config({ minerId })
```
Step 2: Data Exploration

```typescript
// Discover data patterns
const miners = await callTool("list_miners", { limit: 100 });
// Found: 127 miners total, IDs like "miner-abc-123"

const status = await callTool("get_miner_status", {
  minerId: miners.miners[0].id,
});
// Found: firmware version (stable), hashrate (volatile), temperature (volatile)

const summary = await callTool("get_fleet_summary", {});
// Found: total count, online count, total hashrate
```
Step 3: Generate 10 Questions

```xml
<evaluation>
  <!-- 2 Simple -->
  <qa_pair>
    <question>How many miners are registered?</question>
    <answer>127</answer>
  </qa_pair>
  <qa_pair>
    <question>What is miner-abc-123's firmware version?</question>
    <answer>2.5.1</answer>
  </qa_pair>

  <!-- 6 Moderate -->
  <qa_pair>
    <question>How many miners are registered in tenant 'prod-west'?</question>
    <answer>47</answer>
  </qa_pair>
  <!-- ... 5 more moderate ... -->

  <!-- 2 Complex -->
  <qa_pair>
    <question>Which miner has the oldest firmware? What is its pool URL?</question>
    <answer>stratum+tcp://old-pool.example.com:3333</answer>
  </qa_pair>
  <!-- ... 1 more complex ... -->
</evaluation>
```
Step 4: Verify All Answers

Manually solve each question, confirm answer stability, and document the solving process for future reference.
### Example 2: Question Evolution (Bad → Good)

❌ Bad Question (Volatile Answer):

```xml
<question>What is the current hashrate of miner-abc-123?</question>
<answer>95.7</answer>
<!-- Problem: Hashrate changes every second - unstable! -->
```

✅ Good Question (Stable Answer):

```xml
<question>What firmware version is miner-abc-123 running?</question>
<answer>2.5.1</answer>
<!-- Good: Firmware version only changes on updates - stable! -->
```

❌ Bad Question (Dependent):

```xml
<question>Using the miner ID from question 3, what is its temperature?</question>
<!-- Problem: Depends on question 3 - not independent! -->
```

✅ Good Question (Independent):

```xml
<question>What is the pool URL for miner-abc-123?</question>
<answer>stratum+tcp://pool.example.com:3333</answer>
<!-- Good: Self-contained, no dependencies -->
```
## Quality Standards

### Evaluation Quality Checklist

- **Coverage**
  - Tests all major tools at least once
  - Tests common workflows (list → detail)
  - Tests edge cases (empty results, errors)
  - Tests aggregation and filtering
- **Complexity Distribution**
  - 2 simple questions (20%)
  - 6 moderate questions (60%)
  - 2 complex questions (20%)
  - Total: 10 questions
- **Question Quality**
  - All questions are independent
  - All questions use read-only tools
  - All questions have verifiable answers
  - All questions have stable answers
  - All questions are realistic use cases
- **Answer Quality**
  - All answers manually verified
  - All answers use string comparison format
  - All answers are stable (re-verified after 1 hour)
  - All answers are unambiguous
- **XML Format**
  - Valid XML structure
  - Metadata complete
  - Statistics calculated
  - Consistent formatting
### Performance Targets
Agent Success Rates:
- Simple questions: 95%+ success rate
- Moderate questions: 80%+ success rate
- Complex questions: 60%+ success rate
- Overall: 75%+ success rate
Tool Call Efficiency:
- Simple: 1-2 tool calls on average
- Moderate: 3-4 tool calls on average
- Complex: 5-7 tool calls on average
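To check harness output against these targets, a small aggregation could compute per-difficulty success rates; the `EvalResult` shape here is an assumption about your harness:

```typescript
interface EvalResult {
  id: string;
  difficulty: "simple" | "moderate" | "complex";
  passed: boolean;
}

// Illustrative targets taken from the list above.
const TARGETS = { simple: 0.95, moderate: 0.8, complex: 0.6 } as const;

function reportSuccessRates(results: EvalResult[]): void {
  for (const difficulty of ["simple", "moderate", "complex"] as const) {
    const subset = results.filter((r) => r.difficulty === difficulty);
    if (subset.length === 0) continue;
    const rate = subset.filter((r) => r.passed).length / subset.length;
    const ok = rate >= TARGETS[difficulty] ? "✅" : "❌";
    console.log(`${ok} ${difficulty}: ${(rate * 100).toFixed(0)}% (target ${TARGETS[difficulty] * 100}%)`);
  }
}
```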
## Common Pitfalls

### ❌ Pitfall 1: Volatile Data in Answers

Problem: Using real-time metrics that change constantly.

```xml
<!-- BAD: Temperature changes every second -->
<question>What is miner-123's current temperature?</question>
<answer>65°C</answer>
```

Solution: Use stable configuration or historical data.

```xml
<!-- GOOD: Firmware version only changes on updates -->
<question>What firmware version is miner-123 running?</question>
<answer>2.5.1</answer>
```
### ❌ Pitfall 2: Dependent Questions

Problem: Questions that rely on previous answers.

```xml
<!-- BAD: Depends on identifying miner in previous question -->
<question>What is the pool URL for the miner from question 5?</question>
```

Solution: Make every question self-contained.

```xml
<!-- GOOD: Fully self-contained -->
<question>What is the pool URL for miner-abc-123?</question>
<answer>stratum+tcp://pool.example.com:3333</answer>
```
### ❌ Pitfall 3: Ambiguous Answers

Problem: Multiple valid interpretations.

```xml
<!-- BAD: Ambiguous format -->
<question>How many miners are offline?</question>
<answer>3 miners are offline</answer>
<!-- Agent might return just "3" or "three" or "3 miners" -->
```

Solution: Specify the exact format in the question or normalize the answer.

```xml
<!-- GOOD: Clear number format -->
<question>How many miners are offline?</question>
<answer>3</answer>
<!-- Clear: just the number -->
```
## Integration with Evaluation Harness

### Running Evaluations

Evaluation Harness Setup:
```bash
# Create evaluation harness script
cat > run-evaluation.ts <<'EOF'
// NOTE: MCPClient and client.ask() are an illustrative agent/client wrapper,
// not an official SDK API - adapt to your harness of choice.
import { MCPClient } from '@modelcontextprotocol/client';
import { parseEvaluation } from './eval-parser';

async function runEvaluation(evalPath: string) {
  const client = new MCPClient('./dist/index.js');
  const evaluation = parseEvaluation(evalPath);

  let passed = 0;
  let failed = 0;

  for (const qa of evaluation.questions) {
    try {
      const answer = await client.ask(qa.question);
      if (answer === qa.answer) {
        passed++;
        console.log(`✅ ${qa.id}: PASS`);
      } else {
        failed++;
        console.log(`❌ ${qa.id}: FAIL (expected: ${qa.answer}, got: ${answer})`);
      }
    } catch (error: any) {
      failed++;
      console.log(`❌ ${qa.id}: ERROR - ${error.message}`);
    }
  }

  const total = passed + failed;
  console.log(`\nResults: ${passed}/${total} passed (${((passed / total) * 100).toFixed(1)}%)`);
}

runEvaluation('./evaluations/braiins-os.xml');
EOF
```
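The `eval-parser` module imported above is not defined in this skill. A minimal sketch using the `fast-xml-parser` package (an assumed dependency) could look like this:

```typescript
// eval-parser.ts - minimal sketch (fast-xml-parser is an assumed dependency)
import { readFileSync } from "node:fs";
import { XMLParser } from "fast-xml-parser";

export interface QAPair {
  id: string;
  difficulty: string;
  question: string;
  answer: string;
}

export function parseEvaluation(path: string): { questions: QAPair[] } {
  const xml = readFileSync(path, "utf8");
  // Keep values as strings so answers stay in string-comparison form.
  const doc = new XMLParser({ parseTagValue: false }).parse(xml);
  const pairs = doc.evaluation.qa_pairs.qa_pair;
  // A single <qa_pair> parses as an object, several as an array - normalize.
  return { questions: Array.isArray(pairs) ? pairs : [pairs] };
}
```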
Usage:

```bash
npm run build
npm run evaluate
```
## References
- MCP Evaluation Guide: See mcp-builder-skill reference/evaluation.md
- Question Generation Theory: See mcp-builder-skill Phase 4
- Agent-Centric Design: MCP Best Practices (modelcontextprotocol.io)
- Braiins OS API: See braiins-os skill for domain knowledge
Version History:
- 1.0.0 (2025-12-29): Initial release - Question generation, answer verification, XML formatting