name: 'datadog-analysis' description: 'Analyze Datadog error logs for Packmind production services (api-proprietary, mcp-proprietary, frontend-proprietary), group them into patterns, root-cause against the codebase, and produce a structured bug report. Triggers on Datadog, production logs, prod errors, service health, or periodic error reviews.'

Datadog Analysis

Analyze production error logs from Packmind Datadog services, group them into patterns, cross-reference stack traces with the codebase, and produce a structured markdown report with root causes and Datadog search patterns.

Prerequisites

The Datadog MCP server must be connected. If not connected, prompt the user to run /mcp first.
Read references/datadog_mcp.md before making any MCP tool calls for guidance on tool usage, gotchas, and known pitfalls.

Services

The analysis covers three production services. Each maps to a Datadog service name, a codebase location, and a Dockerfile:

Datadog service	App	Codebase	Dockerfile	Runtime
`api-proprietary`	API	`apps/api/` + all `packages/`	`dockerfile/Dockerfile.api`	Node.js (NestJS, TypeORM, Redis/ioredis, BullMQ)
`mcp-proprietary`	MCP Server	`apps/mcp-server/` + all `packages/`	`dockerfile/Dockerfile.mcp`	Node.js (tree-sitter, SSE)
`frontend-proprietary`	Frontend	`apps/frontend/`	`dockerfile/Dockerfile.frontend`	Nginx (static SPA serving)

Root cause analysis should trace errors back to source files in the monorepo. For Nginx (frontend), also check the Nginx configs in dockerfile/nginx.*.conf and the entrypoint dockerfile/nginx-entrypoint.sh.

Parameters

Parameter	Default	Description
Days to analyze	7	Number of past days to look at. Override by user request (e.g., "last 3 days")

Exclusions

The following log patterns should be discarded and not included in the report. Skip them during pattern discovery and do not count them as errors:

(node:1) [DEP0060] DeprecationWarning: The util._extend API is deprecated. Please use Object.assign() instead. -- Known Node.js deprecation from a transitive dependency. Noise, not actionable. Filter with -DEP0060.
Nginx stale asset 404s (open() "/usr/share/nginx/html/assets/..." failed (2: No such file or directory)) -- Expected SPA behavior after deployments. Browsers with a cached index.html request old hashed JS chunks that no longer exist. Not a bug. Filter with -"No such file or directory" -"/assets/" on frontend-proprietary.

When filtering in Phase 1, exclude these patterns from the analysis by appending the exclusion terms to Datadog queries, or remove them during report consolidation.

Workflow

Phase 1: Discover Error Patterns (all services in parallel)

For each of the three services, launch two parallel MCP calls (6 calls total, all in parallel). If rate-limited by the MCP server, fall back to batching 2 calls per service sequentially.

Every Datadog MCP call requires a telemetry object with an intent string describing the call's purpose (e.g., {"intent": "Discover error patterns for api-proprietary over last 7 days"}). Keep intents concise and avoid including PII or secrets.

Pattern discovery -- Use mcp__datadog-mcp__search_datadog_logs with:
- query: service:{service_name} status:(error OR critical OR emergency)
- from: now-{N}d (where N = number of days, default 7)
- use_log_patterns: true
- max_tokens: 10000
Error message counts -- Use mcp__datadog-mcp__analyze_datadog_logs with:
- filter: service:{service_name} status:(error OR critical OR emergency)
- sql_query: SELECT message, count(*) as cnt FROM logs GROUP BY message ORDER BY cnt DESC LIMIT 50
- from: now-{N}d
- max_tokens: 10000

From these results, identify the distinct error groups per service. If a service has zero errors in the period, mention "No issues found" in the report and skip Phases 2-3 for that service.

Phase 2: Deep Dive Each Error Group

For each distinct error group identified in Phase 1:

Fetch raw logs -- Use search_datadog_logs with a targeted query to get full stack traces and context. Use extra_fields: ["*"] for tag metadata when useful.
Get daily distribution -- Use analyze_datadog_logs with:
- sql_query: SELECT DATE_TRUNC('day', timestamp) as day, count(*) as cnt FROM logs WHERE message LIKE '%<pattern>%' GROUP BY DATE_TRUNC('day', timestamp) ORDER BY DATE_TRUNC('day', timestamp)
Count occurrences -- Use analyze_datadog_logs to get total unique occurrences grouped by message.

Parallelize independent MCP calls wherever possible to save time.

Frontend-Specific Notes

For frontend-proprietary, Nginx writes all error_log output (including [notice]) to stderr. Datadog classifies stderr as status:error. Filter out Nginx lifecycle noise:

Ignore patterns containing [notice] (worker start/stop, SIGQUIT, SIGCHLD, SIGIO) -- these are normal Nginx operations misclassified as errors
Focus on [error] (404s for missing files) and [alert] (permission issues, config errors)

Phase 3: Codebase Root Cause Analysis

For each application-level error (not infra/external):

Before grepping, consult references/known_patterns.md — if the error matches a catalogued pattern, jump straight to its entry point and skip to step 2.

Grep for the error class or message in the codebase using the Grep tool (e.g., SpaceMembershipRequiredError, Recipe.*not found)
Read the source files where the error is thrown
Trace the call chain: error class -> service/use case -> controller/adapter
Identify the root cause: missing error handling, wrong HTTP status, race condition, missing validation, Dockerfile misconfiguration, etc.

For frontend Nginx errors, check:

dockerfile/Dockerfile.frontend for permission/ownership issues
dockerfile/nginx.k8s.conf, dockerfile/nginx.k8s.no-ingress.conf, dockerfile/nginx.compose.conf for config issues
dockerfile/nginx-entrypoint.sh for entrypoint issues

Phase 4: Generate Report

Read references/report-template.md before writing the report for the output path, scaffold, severity ordering, occurrence labels, and final summary table.

ナビゲーション

Skillsとは？

リンク

'datadog-analysis'