---
name: llm-app-patterns
type: reference
description: "Provides architectural patterns for LLM-powered applications and AI assistants, including prompt engineering, RAG, agent loops, conversation management, and evaluation. Use when building AI-based features, chatbots, or complex AI system architectures."
paths: ["**/*.py", "**/*.ts", "**/openai*", "**/anthropic*", "**/langchain*", "**/chatbot*", "**/assistant*"]
effort: 3
allowed-tools: Read, Glob, Grep, Write, Edit, Bash
user-invocable: true
when_to_use: "When designing LLM applications, building AI assistants/chatbots, implementing RAG pipelines, or setting up agent architectures."
---
# LLM Application & AI Assistant Patterns

## Resources

### Architecture decision matrix
| Pattern | Use when | Cost |
|---|---|---|
| Simple RAG | FAQ, docs Q&A | Low |
| Hybrid RAG (semantic + BM25) | Mixed query types | Medium |
| Function calling | Structured tool use | Low |
| ReAct agent | Multi-step reasoning | Medium |
| Plan-and-execute | Complex decomposable tasks | High |
| Multi-agent | Research, critique-refine | Very High |
### RAG: critical config numbers
```python
CHUNK_CONFIG = {
    "chunk_size": 512,    # tokens; sweet spot for most docs
    "chunk_overlap": 50,  # prevents context loss at chunk boundaries
    "separators": ["\n\n", "\n", ". ", " "],  # split at paragraphs first, then sentences
}
# Hybrid search alpha: 1.0 = semantic only, 0.0 = BM25 only, 0.5 = balanced
```
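To show how these numbers plug into a real splitter, here is a minimal sketch assuming the `langchain-text-splitters` and `tiktoken` packages are installed; token-based sizing comes from the `from_tiktoken_encoder` constructor:

```python
# Sketch only: assumes langchain-text-splitters and tiktoken are available.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",             # tokenizer used so chunk_size counts tokens
    chunk_size=CHUNK_CONFIG["chunk_size"],
    chunk_overlap=CHUNK_CONFIG["chunk_overlap"],
    separators=CHUNK_CONFIG["separators"],
)
chunks = splitter.split_text(document_text)  # document_text: your raw document string
```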
### RAG: retrieval strategies
```python
# Basic: semantic search
results = vector_db.similarity_search(embed(query), top_k=5)

# Better: hybrid (semantic + keyword, merged via reciprocal rank fusion)
def hybrid_search(query, alpha=0.5):
    return rrf_merge(vector_db.search(query), bm25_search(query), alpha)

# Best for recall: multi-query (3 LLM-generated variations, results deduplicated)
queries = llm.generate_variations(query, n=3)
results = deduplicate([r for q in queries for r in semantic_search(q)])  # flatten, then dedupe
```
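The `rrf_merge` helper above is left abstract; here is a minimal sketch of weighted Reciprocal Rank Fusion, assuming each input is an ordered list of document IDs (`k=60` is the conventional constant from the original RRF paper):

```python
def rrf_merge(semantic_ids, keyword_ids, alpha=0.5, k=60):
    """Weighted RRF: each ranking contributes weight / (k + rank) per document."""
    scores = {}
    for weight, ranking in ((alpha, semantic_ids), (1 - alpha, keyword_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```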
### RAG: generation prompt template
```python
RAG_PROMPT = """Answer based ONLY on the context below.
If the context is insufficient, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
```
### Agent: function calling loop
```python
def run_agent(question):
    messages = [{"role": "user", "content": question}]
    while True:
        response = llm.chat(messages=messages, tools=TOOLS, tool_choice="auto")
        if not response.tool_calls:
            return response.content
        # The assistant turn containing the tool calls must stay in the history
        messages.append(response.message)
        for call in response.tool_calls:
            result = execute_tool(call.name, call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
```
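The loop assumes a `TOOLS` schema and an `execute_tool` dispatcher; a minimal sketch using the OpenAI-style function schema (the `search_docs` tool is hypothetical, and `arguments` is treated as an already-parsed dict to match the pseudocode above):

```python
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal documentation for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def execute_tool(name, arguments):
    registry = {"search_docs": search_docs}  # map tool names to real callables
    return registry[name](**arguments)       # arguments: dict parsed from the model's JSON
```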
### Production: caching (temperature=0 responses only)
```python
import hashlib
import json

def get_or_generate(prompt, model, **kwargs):
    # Only deterministic (temperature=0) responses are safe to cache
    deterministic = kwargs.get("temperature", 1.0) == 0
    if deterministic:
        raw = f"{model}:{prompt}:{json.dumps(kwargs, sort_keys=True)}"
        key = hashlib.sha256(raw.encode()).hexdigest()
        if (cached := redis.get(key)) is not None:
            return cached
    response = llm.generate(prompt, model=model, **kwargs)
    if deterministic:
        redis.setex(key, 3600, response)  # 1-hour TTL
    return response
```
### Production: retry + fallback
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(5))
def call_llm(prompt):
    return llm.generate(prompt)

# Fallback chain: try the primary model, then each fallback in order
def generate_with_fallback(prompt, primary, fallbacks):
    for model in [primary] + fallbacks:
        try:
            return llm.generate(prompt, model=model)
        except (RateLimitError, APIError):
            continue
    raise RuntimeError("all models in the fallback chain failed")
```
### LLMOps: key metrics

- Latency: p50, p99 response time
- Quality: user satisfaction (thumbs up/down), task completion %, hallucination rate
- Cost: cost_per_request, tokens_per_request, cache_hit_rate
- Health: error_rate, timeout_rate, retry_rate

One way to capture these per-request signals is shown in the sketch below.
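This sketch assumes a hypothetical `metrics.emit` sink; swap in whatever telemetry client you actually use:

```python
import time

def tracked_generate(prompt, model):
    start = time.monotonic()
    error = None
    try:
        return llm.generate(prompt, model=model)
    except Exception as exc:
        error = type(exc).__name__
        raise
    finally:
        metrics.emit({                                    # hypothetical metrics sink
            "model": model,
            "latency_ms": (time.monotonic() - start) * 1000,
            "error": error,                               # None on success; feeds error_rate
        })
```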
### Embedding model selection

| Model | Dims | Cost (per 1M tokens) | Use |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Most cases |
| text-embedding-3-large | 3072 | $0.13 | High accuracy |
| bge-large (local) | 1024 | Free | Self-hosted |
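For reference, generating embeddings with the hosted models above via the OpenAI Python SDK (a sketch; assumes `OPENAI_API_KEY` is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["chunk one text", "chunk two text"],  # batch inputs to cut request overhead
)
vectors = [item.embedding for item in resp.data]  # one 1536-dim vector per input
```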