name: llm-cost-optimization
description: Reduce LLM API costs without sacrificing quality. Covers prompt caching (Anthropic), local response caching, prompt compression, debouncing triggers, and cost analysis. Use when building LLM-powered features, analyzing API costs, optimizing prompts, or implementing caching strategies.
LLM Cost Optimization
Practical techniques that, combined, can reduce LLM API costs by roughly 35-65%, depending on the workload.
Quick Reference
| Technique | Savings | When to Use | Reference |
|---|---|---|---|
| Prompt Caching | 25-45% | Same system prompt, frequent calls | caching.md |
| Response Cache | 100% per cache hit | Repeated identical requests | caching.md |
| Prompt Compression | 10-20% | Long system prompts | prompts.md |
| Debouncing | 50%+ | Duplicate triggers | triggers.md |
The 80/20 of LLM Costs
For short user inputs, the system prompt accounts for most of the input tokens, and therefore most of the input cost:
| Text Length | Total Input Tokens | System Prompt Share |
|---|---|---|
| Short (~100 chars) | ~250 | 80-87% |
| Medium (~500 chars) | ~450 | 44% |
| Long (~2000 chars) | ~900 | 22% |
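The shares in this table can be reproduced with simple arithmetic. The sketch below assumes a system prompt of about 200 tokens; that figure is not stated here and is only inferred from the table (44% of ~450 and 22% of ~900 both come to roughly 200). The function name is illustrative.

```rust
/// Percentage of the input tokens that belong to the system prompt.
/// Returns 0.0 for an empty request to avoid dividing by zero.
fn system_prompt_share(system_tokens: u32, total_input_tokens: u32) -> f64 {
    if total_input_tokens == 0 {
        return 0.0;
    }
    100.0 * f64::from(system_tokens) / f64::from(total_input_tokens)
}

/// Assumed system prompt size (inferred from the table, not measured).
const ASSUMED_SYSTEM_TOKENS: u32 = 200;
```

Calling `system_prompt_share(ASSUMED_SYSTEM_TOKENS, 250)` gives 80%, and the same call with 450 or 900 total tokens gives about 44% and 22%. Measure your own prompt before relying on these shares.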
Optimization priority:
1. Cache system prompts (biggest impact)
2. Cache identical requests (free repeats)
3. Debounce triggers (prevent waste)
4. Compress prompts (last resort)
Cost Estimation (Claude Haiku 3.5)
| Text Length | Est. Cost per Request |
|---|---|
| Short (~100 chars) | ~$0.0004 |
| Medium (~500 chars) | ~$0.0008 |
| Long (~2000 chars) | ~$0.002 |
Benchmark: 1000 translations ≈ $0.80 before optimization, in line with the medium-length estimate above (1000 × ~$0.0008).
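A per-request estimate like the ones above can be sketched as input tokens times the input price plus output tokens times the output price. The prices below are assumed list prices for Claude Haiku 3.5 in USD per million tokens and should be checked against the current pricing page. The output token counts are hypothetical, since this section only lists input tokens.

```rust
/// Assumed Claude Haiku 3.5 list prices (USD per million tokens).
/// Verify against the current pricing page before budgeting.
const INPUT_USD_PER_MTOK: f64 = 0.80;
const OUTPUT_USD_PER_MTOK: f64 = 4.00;

/// Estimated cost of one request in USD.
fn estimate_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    let input = input_tokens as f64 * INPUT_USD_PER_MTOK;
    let output = output_tokens as f64 * OUTPUT_USD_PER_MTOK;
    (input + output) / 1_000_000.0
}

/// Estimated cost of `requests` identical requests in USD.
fn estimate_batch_cost_usd(requests: u64, input_tokens: u64, output_tokens: u64) -> f64 {
    requests as f64 * estimate_cost_usd(input_tokens, output_tokens)
}
```

With a hypothetical 50 output tokens, `estimate_cost_usd(250, 50)` comes to $0.0004, which matches the short-text row. Output tokens cost more per token than input tokens under these prices, so long responses can outweigh the system prompt.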
Implementation Checklist
Before Building
- Add logging to every AI trigger point
- Verify triggers fire exactly once per user action
- Check for Pressed/Released event duplicates
Caching Strategy
- Enable Anthropic Prompt Caching for system prompts
- Implement local response cache (hash-based)
- Include model name in cache key
- Set reasonable cache limits (e.g., 500 entries LRU)
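The last three items of this checklist could be sketched as a small in-memory cache. Everything here is illustrative: `ResponseCache` and `cache_key` are hypothetical names, the model strings in the usage note are placeholders, and the recency list is a plain `VecDeque`, which is adequate for a few hundred entries but not for large caches.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, VecDeque};
use std::hash::{Hash, Hasher};

/// Hash the model name together with the input text, so the same text
/// sent to different models never shares a cache entry.
/// Note: a 64-bit hash carries a small collision risk, and DefaultHasher
/// output is not guaranteed stable across Rust releases, so use this
/// only for in-memory caches.
fn cache_key(model: &str, text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    model.hash(&mut hasher);
    text.hash(&mut hasher);
    hasher.finish()
}

struct ResponseCache {
    capacity: usize,
    entries: HashMap<u64, String>,
    order: VecDeque<u64>, // front = least recently used
}

impl ResponseCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position (O(n) scan).
    fn touch(&mut self, key: u64) {
        self.order.retain(|k| *k != key);
        self.order.push_back(key);
    }

    fn get(&mut self, model: &str, text: &str) -> Option<String> {
        let key = cache_key(model, text);
        let hit = self.entries.get(&key).cloned();
        if hit.is_some() {
            self.touch(key);
        }
        hit
    }

    fn put(&mut self, model: &str, text: &str, response: String) {
        if self.capacity == 0 {
            return;
        }
        let key = cache_key(model, text);
        self.entries.insert(key, response);
        self.touch(key);
        // Evict least recently used entries until we are within the limit.
        while self.entries.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => self.entries.remove(&oldest),
                None => break,
            };
        }
    }
}
```

`ResponseCache::new(500)` would match the limit suggested above. A lookup with a different model name for the same text misses, which is the point of including the model in the key.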
Prompt Design
- Measure current token count
- Identify critical rules (security, output format)
- Test quality after compression
- Document WHY for each rule kept
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Trigger fires twice | 2x cost | Check event.state and handle only Pressed |
| No prompt caching | Full price every call | Use cache_control |
| Aggressive prompt compression | Quality drops | Keep critical rules |
| Cache key missing model | Wrong results | Include model in key |
Quick Wins
1. Check for Duplicate Triggers
// Before ANY optimization, verify that each user action fires exactly one trigger
log::info!("AI trigger fired: {:?}", event);
if event.state != ShortcutState::Pressed {
    return; // Ignore Released events so one key press makes one API call
}
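The state check above handles Pressed/Released pairs, but not a trigger that fires several times in quick succession. One way to cover that case is a leading-edge debounce (sometimes called throttling): the first trigger passes and later ones inside a time window are dropped. This is a sketch, not part of the original code; `Debouncer` is a hypothetical helper and the 500 ms window in the usage note is an arbitrary choice.

```rust
use std::time::{Duration, Instant};

/// Lets the first trigger through and drops any trigger that arrives
/// within `window` of the last accepted one.
struct Debouncer {
    window: Duration,
    last_fired: Option<Instant>,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Self { window, last_fired: None }
    }

    /// Returns true if a trigger arriving at `now` should reach the API.
    /// Taking `now` as a parameter keeps the logic testable without sleeping.
    fn should_fire(&mut self, now: Instant) -> bool {
        match self.last_fired {
            Some(prev) if now.saturating_duration_since(prev) < self.window => false,
            _ => {
                self.last_fired = Some(now);
                true
            }
        }
    }
}
```

In a handler this might look like `if !debouncer.should_fire(Instant::now()) { return; }` with `Debouncer::new(Duration::from_millis(500))`. Dropped triggers do not extend the window, so a steady stream of events still produces one call per window.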
2. Enable Prompt Caching (Anthropic)
// Mark the system prompt as cacheable so repeat calls can reuse the cached prefix
let system = vec![SystemBlock {
    block_type: "text".to_string(),
    text: system_prompt,
    cache_control: CacheControl { cache_type: "ephemeral".to_string() },
}];
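Whether prompt caching pays off depends on how often the cached prefix is reused. The sketch below compares input cost with and without caching, assuming cache writes are billed at 1.25x and cache reads at 0.1x the base input price (the multipliers documented for the default short-lived cache; verify before relying on them). It also assumes every call after the first arrives while the cache entry is still alive. Anthropic documents a minimum cacheable prompt length that varies by model, and shorter prompts are processed without caching, so check that limit first.

```rust
/// Assumed billing multipliers relative to the base input price.
const CACHE_WRITE_MULTIPLIER: f64 = 1.25;
const CACHE_READ_MULTIPLIER: f64 = 0.10;

/// Input cost in USD for `calls` requests with no prompt caching.
fn uncached_input_cost(system_tokens: u64, user_tokens: u64, calls: u64, usd_per_mtok: f64) -> f64 {
    (system_tokens + user_tokens) as f64 * calls as f64 * usd_per_mtok / 1_000_000.0
}

/// Input cost in USD when the system prompt is written to the cache on
/// the first call and read from it on every later call.
fn cached_input_cost(system_tokens: u64, user_tokens: u64, calls: u64, usd_per_mtok: f64) -> f64 {
    if calls == 0 {
        return 0.0;
    }
    let write = system_tokens as f64 * CACHE_WRITE_MULTIPLIER;
    let reads = system_tokens as f64 * CACHE_READ_MULTIPLIER * (calls - 1) as f64;
    let user = user_tokens as f64 * calls as f64; // user text is never cached here
    (write + reads + user) * usd_per_mtok / 1_000_000.0
}
```

Under these assumptions a single call costs more with caching than without, because of the write surcharge, so caching only helps when the same prefix is reused. The savings grow with the share of input tokens taken by the system prompt.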
3. Add Response Cache
// Check cache before API call
if let Some(cached) = get_cached(&text, &model) {
    return Ok(cached); // Free!
}
// On a miss, call the API to produce `result`, then save it
save_to_cache(&text, &result, &model)?;
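The check-then-save flow above can be wrapped in one function so callers cannot forget either half. This is a sketch under assumptions: `translate_cached` is a hypothetical name, the API call is passed in as a closure, and a plain `HashMap` keyed by model and text stands in for whatever `get_cached` and `save_to_cache` use.

```rust
use std::collections::HashMap;

/// Return the cached response for (model, text) if present; otherwise
/// call the API once and store the result. Errors are not cached, so a
/// failed request is retried on the next call.
fn translate_cached<F>(
    cache: &mut HashMap<(String, String), String>,
    model: &str,
    text: &str,
    call_api: F,
) -> Result<String, String>
where
    F: FnOnce(&str, &str) -> Result<String, String>,
{
    let key = (model.to_string(), text.to_string());
    if let Some(cached) = cache.get(&key) {
        return Ok(cached.clone()); // cache hit: no API call, no cost
    }
    let result = call_api(model, text)?;
    cache.insert(key, result.clone());
    Ok(result)
}
```

A second call with the same model and text returns the stored response without invoking the closure. This version has no size limit, so a long-running process would want to bound it.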
Anti-Patterns
- TOON format for plain text - Only helps with structured data
- Caching without model key - Haiku and Sonnet give different results for the same input
- Prompt compression first - Optimize triggers and caching before touching prompts