name: performant-ai
description: Strategies for high-performance AI/LLM systems (Context Management, Prompt Engineering, RAG, Inference Tuning).
triggers: [ai, llm, performance, context window, tokens, prompt engineering, rag, inference, latency]
tags: [coding, ai, architecture]
context_cost: medium
Performant AI Skill
Goal
Optimize the interaction speed and cost-effectiveness of LLM-based systems by mastering context management and inference strategies.
Capabilities
1. Context Window Engineering
- Context Pruning: Implement logic to remove irrelevant or redundant tokens from the prompt to fit within limits and reduce cost (see the pruning sketch after this list).
- Summarization Chains: Use "recursive summarization" for long conversations or documents.
- Observation Masking: Hide older or less critical data to keep the model's attention on the immediate task.
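A minimal context-pruning sketch in Python, assuming chat-style message dicts. The `count_tokens` helper is a crude stand-in for a real tokenizer (e.g., tiktoken), and the drop-oldest policy is illustrative:

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def prune_context(messages: List[Message], budget: int) -> List[Message]:
    """Drop the oldest non-system turns until the conversation fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total(msgs: List[Message]) -> int:
        return sum(count_tokens(m["content"]) for m in msgs)

    # Remove from the front (oldest) so the most recent context survives.
    while turns and total(system + turns) > budget:
        turns.pop(0)
    return system + turns
```

The system message is always preserved; only conversational turns are evicted, oldest first.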
2. Efficient Prompting (Latency & Cost)
- Few-Shot Optimization: Minimize the number of examples to the bare minimum needed for accuracy.
- Output Structuring: Use JSON mode or structured outputs to reduce parsing errors and retry loops (see the sketch after this list).
- Prompt Compression: Use tools or manual techniques to shorten instructions without losing semantic meaning.
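A sketch of structured output via the OpenAI Chat Completions API's JSON mode; the model name is illustrative, and note that JSON mode requires the word "JSON" to appear somewhere in the prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(ticket: str) -> dict:
    """Ask for JSON directly so the reply parses without regex or retry loops."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any JSON-mode-capable model works
        messages=[
            # JSON mode requires "JSON" to appear in the prompt text.
            {"role": "system", "content": "Reply in JSON with keys: summary, priority."},
            {"role": "user", "content": ticket},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```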
3. RAG Optimization (Retrieval-Augmented Generation)
- Chunking Strategy: Optimize chunk sizes and overlap for the specific domain (e.g., small chunks for semantic search, large for summaries).
- Hybrid Search: Combine vector search (semantic) with keyword search (BM25) for higher precision (a rank-fusion sketch follows this list).
- Re-ranking: Use a secondary, smaller model to re-rank the top-K results before sending them to the expensive LLM.
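One common way to fuse the two result sets is reciprocal rank fusion (RRF). This sketch assumes you already have the vector and BM25 results as ordered lists of document IDs:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(
    vector_hits: List[str], keyword_hits: List[str], k: int = 60
) -> List[str]:
    """Fuse two ranked doc-ID lists; k=60 is the commonly cited RRF constant."""
    scores: Dict[str, float] = defaultdict(float)
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Doc "c" ranks high in both lists, so it wins the fused ranking: c, b, a, d.
print(reciprocal_rank_fusion(["c", "a", "b"], ["c", "b", "d"]))
```

RRF needs no score normalization across the two retrievers, which is why it is a popular default before re-ranking.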
4. Inference & Routing Strategies
- Brain Mode Routing: Arbitrate between "Local" models (faster and cheaper) and "Remote" models (more capable but slower) based on task difficulty.
- Speculative Decoding: Where supported, use a smaller model to draft tokens that the larger model verifies, speeding up generation.
- Cache Hits: Implement semantic caching (e.g., in Redis) to reuse LLM responses for similar queries (see the sketch after this list).
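A minimal in-memory sketch of semantic caching. The `embed` stand-in is deliberately crude (swap in a real embedding model), the 0.92 threshold is illustrative, and in production the entries would live in a store like Redis with a vector index:

```python
import re
from typing import List, Optional, Tuple

def embed(text: str) -> List[float]:
    """Crude bag-of-words stand-in; replace with a real embedding model."""
    vec = [0.0] * 64
    for word in re.findall(r"[a-z]+", text.lower()):
        vec[hash(word) % 64] += 1.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Reuse a cached LLM response when a new query is 'close enough'."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        q = embed(query)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response  # cache hit: skip the expensive LLM call
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("How do I reset my password?", "Go to Settings > Security > Reset.")
print(cache.get("how do I reset my password"))  # near-duplicate -> cache hit
```

Tuning the threshold is the key design choice: too low and unrelated queries share answers; too high and the cache never hits.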
5. Architectural Patterns
- Self-Correction Loops: Build reflection phases into the agent flow to catch errors early.
- Asynchronous Agents: Run independent research or tool calls in parallel to reduce perceived latency (Loki Mode); see the sketch below.
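A sketch of the parallel pattern with `asyncio.gather`; the two tool calls are stand-ins, with `asyncio.sleep` simulating I/O latency:

```python
import asyncio

async def fetch_docs(query: str) -> str:
    await asyncio.sleep(1.0)  # stand-in for a real retrieval call
    return f"docs for {query!r}"

async def run_code_search(query: str) -> str:
    await asyncio.sleep(1.0)  # stand-in for a real tool call
    return f"code hits for {query!r}"

async def research(query: str) -> list[str]:
    # Independent calls run concurrently: ~1s total instead of ~2s sequentially.
    return await asyncio.gather(fetch_docs(query), run_code_search(query))

print(asyncio.run(research("vector index tuning")))
```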
Steps
- Token Audit: Trace the token count of typical requests to find "bloat" in system prompts (see the audit sketch after this list).
- Latency Mapping: Break down Time-to-First-Token (TTFT) and Total Generation Time.
- Retrieval Benchmark: Measure the Hit Rate and Recall of the RAG pipeline.
- Cost Projection: Estimate monthly burn based on different model providers and context sizes.
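A token-audit sketch using tiktoken; an assumption here is that the `cl100k_base` encoding matches the target model family (OpenAI-style), while other providers ship their own tokenizers:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # other providers ship their own tokenizers

def audit(messages: list[dict]) -> None:
    """Print per-message token counts to spot bloat in system prompts."""
    total = 0
    for m in messages:
        n = len(enc.encode(m["content"]))
        total += n
        print(f"{m['role']:>9}: {n:5d} tokens | {m['content'][:40]!r}")
    print(f"{'TOTAL':>9}: {total:5d} tokens")

audit([
    {"role": "system", "content": "You are a helpful assistant. " * 50},  # likely bloat
    {"role": "user", "content": "Summarize this ticket."},
])
```

Running this against real traffic usually reveals that the static system prompt, not the user input, dominates per-request cost.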
Deliverables
- COST_OPTIMIZATION_REPORT_TEMPLATE.md: Analysis of prompt efficiency and LLM token usage.
- ARCHITECTURE_REVIEW_TEMPLATE.md: Configuration for vector DB, chunking, and search weights.
- SCALABILITY_ANALYSIS_TEMPLATE.md: Logic table for local vs. remote model selection and context scaling.
Security & Guardrails
1. Data Privacy
- PII Masking: Ensure no personally identifiable information is sent to remote LLM providers without redaction or encryption (see the redaction sketch after this list).
- Data Leakage: Verify that RAG sources do not inadvertently expose unauthorized documents to the user.
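A minimal redaction sketch; the regex patterns are illustrative only, and a production system should use a vetted PII-detection library or service:

```python
import re

# Illustrative patterns only: real PII detection needs a vetted library or service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matches with typed placeholders before the text leaves your network."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-123-4567."))
```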
2. Reliability
- Hallucination Checks: Mandatory verification step for critical facts generated by the LLM.
- Fallback Logic: Always have a "conservative" fallback if the primary LLM fails or hits rate limits (see the sketch after this list).
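A fallback sketch with exponential backoff; `call_primary`, `call_fallback`, and `RateLimitError` are stand-ins for your provider's client and exception types:

```python
import time

class RateLimitError(Exception):
    """Stand-in for your provider's rate-limit exception."""

def call_primary(prompt: str) -> str:
    raise RateLimitError  # simulate the primary model being throttled

def call_fallback(prompt: str) -> str:
    return f"[conservative fallback answer for {prompt!r}]"

def complete(prompt: str, retries: int = 2, backoff: float = 1.0) -> str:
    """Try the primary model with backoff, then degrade to a cheaper fallback."""
    for attempt in range(retries):
        try:
            return call_primary(prompt)
        except RateLimitError:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return call_fallback(prompt)

print(complete("Explain TTFT.", backoff=0.1))
```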
3. Agent Guardrails
- No Infinite Loops: Implement strict limits on agent reflection or self-healing cycles (Max 5 attempts).
- Cost Ceiling: Set token or dollar limits per session to prevent runaway autonomous spending (both guardrails are sketched below).
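A sketch combining both guardrails; the attempt cap, token ceiling, per-call charge, and verification check are illustrative stand-ins:

```python
class BudgetExceeded(Exception):
    pass

class GuardedAgent:
    """Caps reflection cycles and per-session token spend."""

    MAX_ATTEMPTS = 5         # hard stop on self-correction loops
    TOKEN_CEILING = 200_000  # per-session budget; tune to your cost tolerance

    def __init__(self):
        self.tokens_used = 0

    def charge(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used > self.TOKEN_CEILING:
            raise BudgetExceeded(f"session spent {self.tokens_used} tokens")

    def solve(self, task: str) -> str:
        for attempt in range(1, self.MAX_ATTEMPTS + 1):
            self.charge(1_000)  # stand-in for real usage reported by the API
            draft = f"attempt {attempt} at {task!r}"
            if self.is_good_enough(draft):
                return draft
        return "escalate to a human"  # never loop forever

    def is_good_enough(self, draft: str) -> bool:
        return "attempt 3" in draft  # stand-in for a real verification step

print(GuardedAgent().solve("fix the failing test"))
```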