---
name: agent-observability
description: Strategies for agent observability (logging, tracing, metrics). Use this to instrument agents for debugging, performance tracking, and quality assurance.
---
# Agent Observability Strategies

## Goal
Move beyond simple monitoring ("Is it running?") to deep observability ("How is it thinking?"), enabling the diagnosis of complex failures in non-deterministic systems.
## The Three Pillars of Observability

### 1. Structured Logging (The Diary)
- Definition: Immutable, timestamped records of discrete events.
- Best Practice: Use structured JSON logs to capture the full context: prompt/response pairs, intermediate reasoning (Chain of Thought), and tool inputs/outputs.
- Pattern: Record the intent before an action and the outcome after to distinguish between decision failures and execution failures.
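A minimal sketch of the intent/outcome pattern in plain Python, assuming stdout as the log sink and a hypothetical `run_tool` dispatcher:

```python
import json
import time
import uuid


def log_event(event: str, **fields) -> None:
    """Emit one structured JSON log record (stdout stands in for a real sink)."""
    print(json.dumps({"ts": time.time(), "event": event, **fields}))


def run_tool(name: str, args: dict) -> str:
    """Hypothetical tool dispatcher; replace with your agent's real tool layer."""
    return f"stub result for {name}"


call_id = str(uuid.uuid4())  # correlates the intent record with its outcome

# 1. Record the *intent* before acting: what the agent decided and why.
log_event("tool.intent", call_id=call_id, tool="web_search",
          args={"query": "latest GDP figures"},
          reasoning="Answer requires data newer than the training cutoff")

# 2. Execute, then record the *outcome* under the same call_id.
try:
    result = run_tool("web_search", {"query": "latest GDP figures"})
    log_event("tool.outcome", call_id=call_id, status="ok",
              output_preview=result[:200])
except Exception as exc:
    log_event("tool.outcome", call_id=call_id, status="error", error=str(exc))
```

Joining the two records on `call_id` shows whether the agent chose a bad action (a decision failure visible in the intent record) or a good action failed in execution (visible in the outcome record).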
### 2. Distributed Tracing (The Narrative)
- Definition: A connected record that strings individual timed operations (spans) into a single end-to-end view of a task execution.
- Usage: Essential for root-cause analysis. A trace reveals whether a bad final answer was caused by a retrieval (RAG) failure, a tool error, or an LLM hallucination.
- Standard: Use OpenTelemetry to link spans across services.
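A sketch using the OpenTelemetry Python SDK; the span names (`agent.task`, `rag.retrieve`, `llm.generate`) are illustrative, not a fixed convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for local debugging; swap in an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

with tracer.start_as_current_span("agent.task") as task_span:
    task_span.set_attribute("task.input", "Summarize Q3 earnings")

    # Child spans share the parent's trace_id, linking the whole narrative.
    with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
        retrieve_span.set_attribute("rag.num_documents", 4)

    with tracer.start_as_current_span("llm.generate") as llm_span:
        llm_span.set_attribute("llm.model", "example-model")
        try:
            raise TimeoutError("provider timed out")  # simulate a failure
        except TimeoutError as exc:
            # Marking the span failed is what lets the trace pinpoint
            # this step, rather than the whole task, as the root cause.
            llm_span.record_exception(exc)
            llm_span.set_status(trace.Status(trace.StatusCode.ERROR))
```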
### 3. Metrics (The Scorecard)
Aggregated data points for tracking health over time. Separate these into two dashboards:
#### System Metrics (Operational Health)
- Audience: SREs / DevOps.
- Key Metrics: P99 Latency, Error Rate (traces with `error=true`), Token Consumption, and API Cost per Run.
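One way to record these with the OpenTelemetry metrics API; the instrument names here are assumptions, and P99 latency is derived from the histogram by the backend:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Console exporter for demonstration; point the reader at your backend in production.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("agent.system")

latency = meter.create_histogram("agent.run.latency", unit="ms",
                                 description="End-to-end run latency")
tokens = meter.create_counter("agent.tokens", unit="{token}",
                              description="Total tokens consumed")
cost = meter.create_counter("agent.cost", unit="USD",
                            description="API cost, summed per run")

# Record one completed run; attributes let dashboards slice by model.
attrs = {"model": "example-model"}
latency.record(1840, attributes=attrs)
tokens.add(2312, attributes=attrs)
cost.add(0.012, attributes=attrs)
```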
#### Quality Metrics (Decision Health)
- Audience: Product / Data Science.
- Key Metrics:
- Trajectory Adherence: Did the agent follow the ideal path?
- Hallucination Rate: Frequency of ungrounded statements.
- Task Completion Rate: Percentage of traces reaching a "success" state.
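Quality metrics are typically computed offline over logged traces. A sketch, assuming a simplified trace summary and measuring trajectory adherence as the longest in-order overlap with an assumed ideal path:

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    steps: list[str]        # ordered span/tool names observed in the trace
    reached_success: bool   # did the trace end in the "success" state?


IDEAL_PATH = ["plan", "retrieve", "generate", "verify"]  # assumed ideal trajectory


def completion_rate(traces: list[TraceSummary]) -> float:
    """Task Completion Rate: share of traces reaching the success state."""
    return sum(t.reached_success for t in traces) / len(traces)


def trajectory_adherence(trace: TraceSummary) -> float:
    """Longest common subsequence with the ideal path, as a fraction of it."""
    prev = [0] * (len(IDEAL_PATH) + 1)
    for step in trace.steps:
        curr = [0]
        for j, ideal in enumerate(IDEAL_PATH):
            curr.append(prev[j] + 1 if step == ideal else max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1] / len(IDEAL_PATH)


traces = [
    TraceSummary("t1", ["plan", "retrieve", "generate", "verify"], True),
    TraceSummary("t2", ["plan", "generate"], False),
]
print(completion_rate(traces))          # 0.5
print(trajectory_adherence(traces[1]))  # 0.5: "plan" and "generate" in order
```

Hallucination rate usually requires an LLM-as-judge or grounding check per response, so it is omitted here.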
## Operational Best Practices
- Dynamic Sampling: To control costs, log 100% of errors but sample only 10% of successful traces in production (see the sketch after this list).
- PII Redaction: Integrate PII scrubbing directly into the logging pipeline to sanitize user inputs before storage.
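A sketch combining both practices in a single emit path; the regexes are illustrative only, and real PII scrubbing warrants a dedicated library:

```python
import json
import random
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SUCCESS_SAMPLE_RATE = 0.10  # keep 10% of successful traces


def redact_pii(value: str) -> str:
    """Scrub obvious PII before the record ever reaches storage."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", value))


def emit(record: dict) -> None:
    # Dynamic sampling: every error is kept; successes are sampled.
    if record.get("status") != "error" and random.random() >= SUCCESS_SAMPLE_RATE:
        return
    sanitized = {k: redact_pii(v) if isinstance(v, str) else v
                 for k, v in record.items()}
    print(json.dumps(sanitized))  # stdout stands in for the log sink


# The "ok" record survives ~10% of runs; the error record always prints, redacted.
emit({"status": "ok", "user_input": "Email me at jane@example.com"})
emit({"status": "error", "error": "tool timeout", "user_input": "call 555-123-4567"})
```

Note that the keep/drop decision here is per record for brevity; in production you would decide per trace (tail-based sampling) so that a kept error arrives with its full context.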