System for testing multi-agent behavior consistency across prompts, tools, skills, models, and agent configs.
Skills (SKILL.md) are configuration files that add specific capabilities to AI agents such as Claude Code, Cursor, and Codex.
EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan evaluations, generate test data, execute evaluations, and analyze results.
Measure model performance on test datasets. Use when assessing accuracy, precision, recall, and other metrics.
Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
Evaluate RAG systems with hit rate, MRR, faithfulness metrics and compare retrieval strategies. Use when testing retrieval quality, generating evaluation datasets, comparing embeddings or retrievers, A/B testing, or measuring production RAG performance.
Evaluate skills by executing them across sonnet, opus, and haiku models using sub-agents. Use when testing if a skill works correctly, comparing model performance, or finding the cheapest compatible model. Returns numeric scores (0-100) to differentiate model capabilities.
Evaluate agent systems with quality gates and LLM-as-judge. Use when you need to measure component quality or implement quality gates. Not for simple unit testing or binary pass/fail checks without nuance.
Build systematic evaluation frameworks for LLM applications.
Comprehensive EVE Online project management and ESI integration toolkit. Use when updating, auditing, or integrating ESI into EVE Online projects like EVE_Rebellion, EVE_Gatekeeper, EVE_Ships, or any EVE-related development. Triggers on project updates, ESI integration, compliance checking, asset management, or multi-project coordination.
Use when generating branded QR codes for ProductTank SF events - speaker LinkedIn profiles, sponsor websites, or Slack join links. Handles single/bulk generation, correct logo mapping, GDrive upload, and mandatory test-scanning.
Create new event scraping scripts for websites. Use when adding a new event source to the Asheville Event Feed. ALWAYS start by detecting the CMS/platform and trying known API endpoints first. Browser scraping is NOT supported (Vercel limitation). Handles API-based, HTML/JSON-LD, and hybrid patterns with comprehensive testing workflows.
Record domain events and dispatch to inbox handlers for side effects, audit trails, and activity feeds. Use when building activity logs, syncing external services, or decoupling event creation from processing. Triggers on event recording, audit trails, activity feeds, or inbox patterns.
Use when spec and code diverge. The AI analyzes mismatches, recommends whether to update the spec or fix the code (with reasoning), and handles evolution under user control or through auto-updates.
Retrieve and extract content from URLs with AI-powered summarization and structured data extraction. Use for scraping web pages, extracting specific information, summarizing articles, or crawling websites with subpages.
Web research using Exa AI search engine. Use when: user needs web search, finding articles, research papers, news, company info, or similar content. Triggers on: 'search for', 'find articles about', 'research', 'what's the latest on', 'find companies like', 'similar to [url]'.
Process exam/test paper documents from DOCX format into structured markdown. Use when Claude needs to: (1) Extract exam content from Word documents (.docx), (2) Analyze images in exam papers using vision tools, (3) Convert questions to structured markdown with proper image references, (4) Understand question context to match images with appropriate questions, (5) Create organized exam output with YAML frontmatter and sections
Example custom Skill demonstrating template generation and best practices. Use this as a reference when creating your own custom Skills.
Use when the user has an Excel file with incomplete data that needs to be populated by searching for and extracting specific information from external sources. This skill is triggered by requests to fill missing columns in spreadsheets, by batch data enrichment tasks, or by work on structured data that requires additional research to complete. It handles Excel file discovery, reading existing data, writing new data to specific cells, and maintaining data integrity across multiple rows.
Import Excel data into RVT projects. Update element parameters, create schedules, and sync external data sources.
Create or resume an execution plan - a design document that a coding agent can follow to deliver a working feature or system change
shannon-execution-verifier
Execute one feature (FEAT-XXX) at a time using docs/forge/ideas/<IDEA_ID>/latest/tasks.md as the source of truth. Creates a short workspace checklist and tracks progress so reruns continue automatically.
Execute approved task specifications sequentially with TDD, comprehensive testing, and validation. This skill should be used for implementing tasks from approved specs with full audit trail.
This skill should be used when executing tasks from ai-state/active/tasks.yaml sequentially. It loads tasks, gathers context, implements features with phase-appropriate testing, updates task status in tasks.yaml, organizes tests into ai-state/regressions/ folders, and logs all operations to operations.log. Use after write-plan creates tasks.yaml or when resuming development work.
Complete development lifecycle for GitHub/local issues - branch, implement, test, PR, merge with quality gates
Handle common execution failures with specialized recovery strategies. Fix syntax errors, import/dependency issues, path/file problems, permission denial, and connection timeouts. Use proactively when encountering errors or as automatic recovery mechanism.
Transform raw data from CSVs, Google Sheets, or databases into executive-ready reports with visualizations, key metrics, trend analysis, and actionable recommendations. Creates data-driven narratives for leadership. Use when users need to turn spreadsheets into executive summaries or board reports.
Transform poorly formatted executive summaries into professionally formatted documents matching a specific brand template. Use when the user provides an executive summary (text, markdown, docx, or PDF) and wants it reformatted with precise brand styling including Work Sans fonts, branded color scheme (red #DA291C accent, navy #032340 table headers), specific table formatting, header/footer with logo graphics, and consistent spacing. This skill ensures pixel-perfect replication of the template's typography, tables, bullets, and page layout.
Strategic leadership for GabeDA - refines requirements, orchestrates skills, makes architectural decisions, and ensures project coherence. Acts as CEO/CTO/PM to bridge vision and execution.
Code generation and file modification agent with delegation capabilities
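A specialized Skill that uses LLM-based analysis to examine the nature of a task in depth and select the appropriate executor (claudecode/codex/coderabbit/user). Replaces simple keyword-based selection.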
Designs deliberate practice exercises applying evidence-based learning strategies like retrieval practice, spaced repetition, and interleaving. Activate when educators need varied exercise types (fill-in-blank, debug-this, build-from-scratch, extend-code, AI-collaborative) targeting learning objectives with appropriate difficulty progression. Creates exercise sets that apply cognitive science principles to maximize retention and skill development. Use when designing practice activities for Python concepts, creating homework assignments, generating problem sets, or evaluating exercise quality.
Break a high-level backlog item into executable sub-items
UX/UI design and user experience
Selects the most relevant experiences, projects, awards, and credentials from the master context based on JD keywords.
experiment-tracker
Use when planning or executing thesis experiments. Covers the lifecycle from ideation through polishing, the tracking table, the SPEC.md format, and the stage structure.
Do experiment-driven research (hypotheses → minimal repros → evidence) and continuously improve research skills + tooling. Use when behavior is uncertain, contested, or performance-sensitive.
Domain expert routing. When the knowledge base cannot answer user questions, find and notify the corresponding expert based on the question domain. Only available in IM mode. Trigger condition: No results in 6-stage retrieval.
Comprehensive guidance for understanding, designing, and implementing expert systems using rule-based inference, knowledge representation, and the complete development lifecycle. Use when users need help with expert system concepts, architecture design, rule-based reasoning (forward/backward chaining), knowledge acquisition, development planning, or implementation strategies.
Design YAML expertise file structures for agent experts. Use when creating mental models for domain-specific agents, defining expertise schema, or structuring knowledge for Act-Learn-Reuse workflows.
Exploit researcher persona specializing in attack surface analysis, exploit scenario generation, and vulnerability chaining
Advanced exploratory testing techniques with Session-Based Test Management (SBTM), RST heuristics, and test tours. Use when planning exploration sessions, investigating bugs, or discovering unknown quality risks.
Find all files relevant to a query with orthogonal exploration for comprehensive coverage. Returns topic-specific overview + file list with line ranges. Uses parallel agents for thorough+ levels to ensure nothing is missed.
Explores data in a Bauplan lakehouse safely using the Bauplan Python SDK. Use to inspect namespaces, tables, schemas, samples, and profiling queries; and to export larger result sets to files. Read-only exploration only; no writes or pipeline runs.
You are an expert codebase exploration specialist with deep understanding of code patterns, architectural structures, and implementation details. Your expertise lies in efficiently navigating unfamili
Building and deploying Expo React Native apps to iOS. Use when configuring EAS Build, submitting to TestFlight, App Store deployment, managing certificates, or troubleshooting build issues.
Expo configuration expert. Use for React Native Expo, app.json, and EAS Build.