---
name: langchain-prod-checklist
description: >
  Production readiness checklist for LangChain applications.
  Use when preparing for launch, validating deployment readiness,
  or auditing existing production LangChain systems.
  Trigger: "langchain production", "langchain prod ready",
  "deploy langchain", "langchain launch checklist", "go-live langchain".
allowed-tools: Read, Write, Edit, Bash(node:), Bash(npm:)
version: 1.0.0
license: MIT
author: Jeremy Longshore <jeremy@intentsolutions.io>
tags:
  - saas
  - langchain
  - deployment
  - audit
compatibility: Designed for Claude Code, also compatible with Codex and OpenClaw
---
# LangChain Production Checklist

## Overview

Comprehensive go-live checklist for deploying LangChain applications to production. Covers configuration, resilience, observability, performance, security, testing, deployment, and cost management.
## 1. Configuration & Secrets

- All API keys in a secrets manager (not `.env` in production)
- Environment-specific configs (dev/staging/prod) validated with Zod
- Startup validation fails fast on missing config
- `.env` files in `.gitignore`
```typescript
// Startup validation: fail fast if required production env vars are missing.
import { z } from "zod";

const ProdConfig = z.object({
  OPENAI_API_KEY: z.string().startsWith("sk-"),
  LANGSMITH_API_KEY: z.string().startsWith("lsv2_"),
  NODE_ENV: z.literal("production"),
});

try {
  ProdConfig.parse(process.env);
} catch (e) {
  console.error("Invalid production config:", e);
  process.exit(1);
}
```
## 2. Error Handling & Resilience

- `maxRetries` configured on all models (3-5)
- `timeout` set on all models (30-60s)
- Fallback models configured with `.withFallbacks()`
- Error responses return safe messages (no stack traces to users)
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";

// Primary model with retries and a timeout; Anthropic as fallback on outage.
const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 5,
  timeout: 30000,
}).withFallbacks({
  fallbacks: [new ChatAnthropic({ model: "claude-sonnet-4-20250514" })],
});
```
## 3. Observability

- LangSmith tracing enabled (`LANGSMITH_TRACING=true`)
- `LANGCHAIN_CALLBACKS_BACKGROUND=true` (non-serverless only)
- Structured logging on all LLM/tool calls
- Prometheus metrics exported (requests, latency, tokens, errors)
- Alerting rules configured (error rate >5%, P95 latency >5s)
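As a minimal, framework-agnostic sketch of the structured-logging item above (the `withCallLogging` wrapper and its field names are illustrative, not a LangChain API), each LLM call can emit one JSON log line with latency and outcome:

```typescript
// Structured-logging wrapper for LLM calls (illustrative; not a LangChain API).
// Emits one JSON line per call, which log pipelines can parse for metrics/alerts.
interface CallLog {
  event: "llm_call";
  model: string;
  latencyMs: number;
  status: "ok" | "error";
  error?: string;
}

async function withCallLogging<T>(model: string, call: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    const log: CallLog = { event: "llm_call", model, latencyMs: Date.now() - start, status: "ok" };
    console.log(JSON.stringify(log));
    return result;
  } catch (e: any) {
    const log: CallLog = {
      event: "llm_call",
      model,
      latencyMs: Date.now() - start,
      status: "error",
      error: String(e?.message ?? e).slice(0, 200), // truncate to keep log lines bounded
    };
    console.log(JSON.stringify(log));
    throw e; // rethrow so callers still see the failure
  }
}
```

In a real deployment, LangSmith traces cover per-run detail; a wrapper like this feeds the aggregate Prometheus counters and alerting rules.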
## 4. Performance

- Caching enabled for repeated queries (Redis or SQLite)
- `maxConcurrency` set on batch operations
- Streaming enabled for user-facing responses
- Connection pooling configured
- Prompt length optimized (no unnecessary verbosity)
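A minimal sketch of the caching item, using an in-memory TTL map as a stand-in for Redis or SQLite (the `PromptCache` class and `cachedInvoke` helper are illustrative, not library APIs):

```typescript
// In-memory TTL cache for repeated LLM queries (illustrative stand-in for Redis).
// Entries expire after ttlMs so stale completions are not served indefinitely.
interface CacheEntry { value: string; expiresAt: number; }

class PromptCache {
  private store = new Map<string, CacheEntry>();
  constructor(private ttlMs: number) {}

  get(prompt: string): string | undefined {
    const entry = this.store.get(prompt);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(prompt); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(prompt: string, value: string): void {
    this.store.set(prompt, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// Only hit the model on a cache miss.
async function cachedInvoke(
  cache: PromptCache,
  prompt: string,
  invoke: (p: string) => Promise<string>,
): Promise<string> {
  const hit = cache.get(prompt);
  if (hit !== undefined) return hit;
  const result = await invoke(prompt);
  cache.set(prompt, result);
  return result;
}
```

Note that LangChain also supports model-level caching natively; this sketch just makes the hit/miss logic explicit.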
## 5. Security
- User input isolated in human messages (never in system prompts)
- Input length limits enforced
- Prompt injection patterns logged/flagged
- Tools restricted to allowlisted operations
- LLM output validated before display (no PII/key leakage)
- Audit logging on all LLM and tool calls
- Rate limiting per user/IP
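The length-limit and injection-flagging items can be sketched as a small input guard (the limit, patterns, and `guardUserInput` helper here are illustrative assumptions, not a complete defense):

```typescript
// Input guard sketch: enforce a length limit and flag common prompt-injection
// phrasings for audit logging. The patterns are illustrative, not exhaustive.
const MAX_INPUT_CHARS = 4000;

const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /reveal (your )?(system prompt|instructions)/i,
  /you are now/i,
];

interface GuardResult { ok: boolean; reason?: string; flagged: boolean; }

function guardUserInput(input: string): GuardResult {
  if (input.length > MAX_INPUT_CHARS) {
    return { ok: false, reason: "input too long", flagged: false };
  }
  const flagged = INJECTION_PATTERNS.some((p) => p.test(input));
  // Flagged input is allowed through but logged for review, since pattern
  // matching alone produces false positives on legitimate questions.
  return { ok: true, flagged };
}
```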
## 6. Testing

- Unit tests for all chains (using `FakeListChatModel`, no API calls)
- Integration tests with real LLMs (gated behind CI secrets)
- RAG pipeline validation (retrieval relevance + no hallucination)
- Tool unit tests (valid input, invalid input, error cases)
- Load testing completed (concurrent users, batch operations)
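To illustrate the fake-model pattern without the dependency, here is a standalone stub that mimics `FakeListChatModel`'s behavior of returning canned responses in order (LangChain ships its own `FakeListChatModel` for real test suites; the `classifyAndAnswer` chain below is a hypothetical example):

```typescript
// Standalone sketch of the FakeListChatModel pattern: a stub that returns
// canned responses in order, so chain logic is unit-testable without API calls.
class FakeChatModel {
  private i = 0;
  constructor(private responses: string[]) {}

  async invoke(_input: string): Promise<string> {
    const response = this.responses[this.i % this.responses.length];
    this.i++;
    return response;
  }
}

// A trivial "chain" under test: classify the question, then answer it.
async function classifyAndAnswer(
  model: { invoke(s: string): Promise<string> },
  question: string,
) {
  const category = await model.invoke(`Classify: ${question}`);
  const answer = await model.invoke(`Answer (${category}): ${question}`);
  return { category, answer };
}
```

Because the chain only depends on the `invoke` shape, the fake drops in wherever the real model would.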
## 7. Deployment
- Health check endpoint returns LLM connectivity status
- Graceful shutdown handles in-flight requests
- Rolling deployment (zero downtime)
- Rollback procedure documented and tested
- Container resource limits set (memory, CPU)
```typescript
// Health check endpoint: reports LLM connectivity alongside server status.
app.get("/health", async (_req, res) => {
  const checks: Record<string, string> = { server: "ok" };
  try {
    await model.invoke("ping");
    checks.llm = "ok";
  } catch (e: any) {
    checks.llm = `error: ${e.message.slice(0, 100)}`;
  }
  const healthy = Object.values(checks).every((v) => v === "ok");
  res.status(healthy ? 200 : 503).json({ status: healthy ? "healthy" : "degraded", checks });
});
```

```typescript
// Graceful shutdown: stop accepting connections, then force-exit after 10s.
process.on("SIGTERM", async () => {
  console.log("Shutting down gracefully...");
  server.close(() => process.exit(0));
  setTimeout(() => process.exit(1), 10000); // force after 10s
});
```
## 8. Cost Management
- Token usage tracking callback attached
- Daily/monthly budget limits enforced
- Model tiering: cheap model for simple tasks, powerful for complex
- Cost alerts configured (Slack/email on threshold)
- Cost per user/tenant tracked
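A minimal sketch of budget enforcement (the `BudgetTracker` class is illustrative, and the per-million-token prices in the usage example are placeholders, not current provider rates):

```typescript
// Budget tracker sketch: accumulate per-call token cost and expose whether the
// daily budget is exhausted, so the app can refuse further LLM calls.
interface Pricing {
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

class BudgetTracker {
  private spentUsd = 0;
  constructor(private dailyLimitUsd: number) {}

  // Record one call's usage; returns the cost of that call in USD.
  record(inputTokens: number, outputTokens: number, price: Pricing): number {
    const cost =
      (inputTokens / 1_000_000) * price.inputPerM +
      (outputTokens / 1_000_000) * price.outputPerM;
    this.spentUsd += cost;
    return cost;
  }

  get spent(): number { return this.spentUsd; }
  get withinBudget(): boolean { return this.spentUsd < this.dailyLimitUsd; }
}
```

In practice the token counts would come from the usage metadata on each model response, and `withinBudget` would be checked before dispatching a call.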
## Pre-Launch Validation Script

```typescript
// Assumes `ProdConfig`, `model`, and a `fallback` model (e.g. the ChatAnthropic
// instance from section 2) are in scope, and the app listens on port 8000.
async function validateProduction() {
  const results: Record<string, string> = {};

  // 1. Config
  try {
    ProdConfig.parse(process.env);
    results["Config"] = "PASS";
  } catch { results["Config"] = "FAIL: missing env vars"; }

  // 2. LLM connectivity
  try {
    await model.invoke("ping");
    results["LLM"] = "PASS";
  } catch (e: any) { results["LLM"] = `FAIL: ${e.message.slice(0, 50)}`; }

  // 3. Fallback
  try {
    const fallbackModel = model.withFallbacks({ fallbacks: [fallback] });
    await fallbackModel.invoke("ping");
    results["Fallback"] = "PASS";
  } catch { results["Fallback"] = "FAIL"; }

  // 4. LangSmith
  results["LangSmith"] = process.env.LANGSMITH_TRACING === "true" ? "PASS" : "WARN: disabled";

  // 5. Health endpoint
  try {
    const res = await fetch("http://localhost:8000/health");
    results["Health"] = res.ok ? "PASS" : "FAIL";
  } catch { results["Health"] = "FAIL: not reachable"; }

  console.table(results);
  const allPass = Object.values(results).every((v) => v === "PASS");
  console.log(allPass ? "READY FOR PRODUCTION" : "ISSUES FOUND - FIX BEFORE LAUNCH");
  return allPass;
}
```
## Error Handling

| Issue | Cause | Fix |
|---|---|---|
| API key missing at startup | Secrets not mounted | Check deployment config |
| No fallback on outage | `.withFallbacks()` not configured | Add fallback model |
| LangSmith trace gaps | Background callbacks in serverless | Set `LANGCHAIN_CALLBACKS_BACKGROUND=false` |
| Cache miss storm | Redis down | Implement graceful degradation |
## Next Steps

After launch, use `langchain-observability` for monitoring and `langchain-incident-runbook` for incident response.