Skills Research

Skillsとは？

Skills（SKILL.md）は、AIエージェント（Claude Code、Cursor、Codexなど）に特定の能力を追加するための設定ファイルです。

詳しく見る →

リンク

検索に戻る

Enterprise Guardrails, Evaluation & Safety

https://github.com/udbfd68-cell/AURION-APP

★0更新: 3週間前Claude Code🔧 開発ツールテスト自動化コードレビュー自動化テスト Git・PREnglish

> Sources: OpenAI "Practical Guide to Building Agents" (2025), Google "Agents Companion v2" (Feb 2025)

Enterprise Guardrails, Evaluation & Safety

Sources: OpenAI "Practical Guide to Building Agents" (2025), Google "Agents Companion v2" (Feb 2025)

Guardrails — Layered Defense

No single guardrail provides sufficient protection. Use multiple specialized guardrails together.

Types of Guardrails

Type	Purpose	Implementation
Relevance classifier	Flag off-topic queries	LLM-based classifier checking scope
Safety classifier	Detect jailbreaks/prompt injections	Input validation model
PII filter	Prevent PII exposure	Regex + NER on outputs
Moderation	Flag hate/harassment/violence	Content classifier on inputs
Tool safeguards	Risk-rate each tool	Low/Medium/High based on reversibility, permissions, financial impact
Rules-based	Block known threats	Blocklists, input length limits, regex filters
Output validation	Ensure brand alignment	Prompt engineering + content checks

Tool Risk Assessment

Rate every tool the agent can access:

Low risk: Read-only, no side effects (fetch data, search)
Medium risk: Write access but reversible (update record, create draft)
High risk: Irreversible or financial impact (send email, process payment, delete data)

High-risk actions should require:

Human approval
Additional confirmation steps
Audit logging
Rate limiting

Building Guardrails — Heuristic

Start with data privacy and content safety
Add guardrails based on real-world edge cases and failures
Optimize for both security AND user experience
Iterate as your agent evolves

Evaluation Framework

Three Levels of Evaluation

Level 1 — Agent Capabilities

Can it understand instructions correctly?
Can it reason logically?
Can it use tools appropriately?
Benchmark with public leaderboards (BFCL, τ-bench, PlanBench, AgentBench)

Level 2 — Trajectory (Steps Taken)

Did it take the right steps?
Were the tool calls correct and efficient?
6 metrics: Exact match, In-order match, Any-order match, Precision, Recall, Single-tool use

Level 3 — Final Response

Does the output achieve the goal?
Is it accurate, relevant, correct?
Use an autorater (LLM-as-judge) with defined criteria

Human-in-the-Loop Evaluation

Essential because humans can evaluate:

Subjectivity: Creativity, common sense, nuance
Context: Broader implications of agent actions
Iterative improvement: Rich feedback for refinement
Evaluator calibration: Tune autoraters against human judgment

Methods: Direct assessment, comparative evaluation, user studies

Evaluation Method Comparison

Method	Strengths	Weaknesses
Human	Captures nuance, considers human factors	Subjective, expensive, doesn't scale
LLM-as-Judge	Scalable, efficient, consistent	May miss intermediate steps, limited by LLM
Automated Metrics	Objective, scalable, efficient	May not capture full capabilities, gameable

Success Metrics

Metric Hierarchy

Business KPIs (north star) — revenue, engagement, conversion
Goal completion rate — agent achieves its objective
Critical task success — key milestones completed
Application telemetry — latency, errors, throughput
Human feedback — 👍👎, surveys, in-context feedback
Trace/observability — full audit trail of agent decisions

Multi-Agent Metrics (Additional)

Cooperation & Coordination: How well do agents work together?
Planning & Task Assignment: Right plan? Did we stick to it?
Agent Utilization: How effectively do agents select the right peer?
Scalability: Does quality improve as more agents are added?

Human Intervention Design

When to Escalate to Human

Exceeding failure thresholds: Set limits on retries/actions; escalate when exceeded
High-risk actions: Sensitive, irreversible, or high-stakes operations

Intervention Principles

Critical early in deployment — helps identify failures and edge cases
Builds evaluation feedback cycle
Confidence grows → reduce frequency over time
Always maintain the option for user to regain control

Enterprise Security (OpenAI)

Data ownership: Enterprise retains full ownership
Encryption: In transit and at rest (SOC 2 Type 2, CSA STAR)
Access controls: Granular, role-based (who sees/manages data)
Flexible retention: Adjust logging/storage to organizational policies
Model isolation: Training data isolation from customer data

Practical Guardrail Checklist for Aurion Studio

Input Guardrails

Zod validation on all API inputs ✅ (already done)
Input length limits on user prompts
Rate limiting per user/IP ✅ (already done, but in-memory)
CSRF origin check ✅ (already done)
Sanitize HTML inputs ✅ (already done)

Output Guardrails

Sanitize generated code before preview ✅ (sanitizeForPreview)
CSP headers for preview iframe
Output length limits on AI responses
Content moderation on generated output

Tool Guardrails

Risk-rate all 46 API routes (read-only vs write vs deploy)
Human confirmation for deploy actions
Audit logging for sensitive operations
API key rotation strategy

Evaluation

Automated tests for API routes ✅ (201 tests)
E2E tests with Playwright (TODO)
Goal completion tracking for generated apps
User feedback mechanism (thumbs up/down)
Error tracing and observability

GitHub で開く Raw を見るマーケットで出品する →

ナビゲーション

Skillsとは？

リンク

Enterprise Guardrails, Evaluation & Safety

Enterprise Guardrails, Evaluation & Safety

Guardrails — Layered Defense

Types of Guardrails

Tool Risk Assessment

Building Guardrails — Heuristic

Evaluation Framework

Three Levels of Evaluation

Human-in-the-Loop Evaluation

Evaluation Method Comparison

Success Metrics

Metric Hierarchy

Multi-Agent Metrics (Additional)

Human Intervention Design

When to Escalate to Human

Intervention Principles

Enterprise Security (OpenAI)

Practical Guardrail Checklist for Aurion Studio

Input Guardrails

Output Guardrails

Tool Guardrails

Evaluation

関連スキル(🔧 開発ツール)