---
name: llm-rankings
description: Comprehensive LLM model evaluation and ranking system. Use when users ask to compare language models, find the best model for a specific task, understand model capabilities, get pricing information, or need help selecting between GPT-4, Claude, Gemini, Llama, or other LLMs. Provides benchmark-based rankings, cost analysis, and use-case-specific recommendations across reasoning, code generation, long context, multimodal, and other capabilities.
---
# LLM Rankings Skill
Comprehensive evaluation and ranking system for comparing language models across performance, cost, and technical dimensions.
## Core Capabilities
This skill provides four main ranking methodologies:
- **Benchmark-Based Rankings** - Objective comparisons using MMLU, GSM8K, and HumanEval scores
- **Task-Specific Rankings** - Weighted recommendations for code generation, creative writing, reasoning, etc.
- **Cost-Effectiveness Rankings** - Performance-per-dollar analysis
- **Real-World Performance** - API reliability, documentation quality, ease of integration
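As a rough illustration of the cost-effectiveness methodology, a performance-per-dollar score can divide an aggregate benchmark score by blended token price. The model names, scores, and prices below are invented placeholders, not real figures:

```python
# Hypothetical sketch of a performance-per-dollar ranking.
# Scores and prices are placeholder values, not real benchmark data.

def perf_per_dollar(avg_score: float, price_per_mtok: float) -> float:
    """Average benchmark score divided by blended price per 1M tokens."""
    return avg_score / price_per_mtok

models = {
    "premium-model": {"score": 88.0, "price": 10.0},
    "budget-model": {"score": 82.0, "price": 1.0},
}

ranked = sorted(
    models,
    key=lambda m: perf_per_dollar(models[m]["score"], models[m]["price"]),
    reverse=True,
)
print(ranked)  # the cheaper model wins on value despite a lower raw score
```

Raw-score and value rankings often disagree, which is why this skill reports both.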
## Standard Workflows

### Simple Comparison Request
When the user asks "Which LLM is better for X?":
- Load relevant benchmark data from `references/benchmarks.md`
- Filter models matching requirements
- Calculate rankings with appropriate weighting
- Present top 3-5 recommendations with justification
- Include pricing information from `references/pricing.md`
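The filter-weight-rank steps above can be sketched roughly as follows. The benchmark figures and weights here are placeholder assumptions; real data would be loaded from `references/benchmarks.md`:

```python
# Hedged sketch of the simple-comparison workflow.
# Benchmark scores below are invented, not real results.

BENCHMARKS = {
    "model-a": {"mmlu": 86.0, "humaneval": 90.0},
    "model-b": {"mmlu": 88.0, "humaneval": 74.0},
    "model-c": {"mmlu": 79.0, "humaneval": 83.0},
}

def rank_models(weights: dict, top_n: int = 3) -> list:
    """Score each model by a weighted average of benchmarks, highest first."""
    scored = {
        model: sum(scores[bench] * w for bench, w in weights.items())
        for model, scores in BENCHMARKS.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# A code-focused query weights HumanEval more heavily than MMLU.
for model, score in rank_models({"mmlu": 0.3, "humaneval": 0.7}):
    print(f"{model}: {score:.1f}")
```

Adjusting the weights is what turns one benchmark table into many task-specific rankings.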
### Detailed Analysis Request
When the user asks for a comprehensive comparison:
- Load model specifications from `references/model-details.md`
- Generate a side-by-side comparison table
- Include benchmark scores across multiple tests
- Calculate cost projections for expected usage
- Provide deployment considerations
### Best Model for Task Query
When the user describes a specific use case:
- Parse task requirements (performance needs, budget, technical constraints)
- Map to capability dimensions
- Load task-specific rankings from `references/use-cases.md`
- Return top 3 models with detailed reasoning
- Include caveats and alternative suggestions
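One way to read the "map to capability dimensions" step is as a lookup from use case to a weight vector. The dimension names and preset values below are illustrative assumptions only, not weights defined by this skill:

```python
# Illustrative mapping from a parsed use case to capability weights.
# Dimension names and preset values are assumptions for this sketch.

DEFAULT_WEIGHTS = {"reasoning": 0.4, "coding": 0.2, "long_context": 0.2, "cost": 0.2}

PRESETS = {
    "code generation": {"coding": 0.6, "reasoning": 0.3, "cost": 0.1},
    "summarisation": {"long_context": 0.5, "reasoning": 0.2, "cost": 0.3},
}

def weights_for_use_case(use_case: str) -> dict:
    """Return capability weights for a use case, falling back to a default."""
    weights = PRESETS.get(use_case.strip().lower(), DEFAULT_WEIGHTS)
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return weights

print(weights_for_use_case("Code Generation"))
```

The default branch matters: unfamiliar use cases should still get a balanced ranking rather than an error.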
## Reference Resources
Load these files as needed to inform recommendations:
- `benchmarks.md` - Comprehensive benchmark scores (MMLU, GSM8K, HumanEval, MMMU, etc.)
- `model-details.md` - Technical specifications, context windows, API details, capabilities
- `use-cases.md` - Task-specific recommendations organised by common use cases
- `pricing.md` - Current pricing across all providers, cost optimisation strategies
## Output Formats

### Quick Recommendation
Present concise recommendations with model name, key strength, pricing snapshot, and one-sentence justification.
### Comparison Table
Use markdown tables comparing models across relevant dimensions (performance, context window, pricing, best use).
### Detailed Analysis
Structure as:
- Executive summary (2-3 sentences)
- Top recommendations (ranked with justification)
- Performance comparison (benchmark scores)
- Cost analysis (usage projections)
- Implementation considerations
- Alternative options
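The cost-analysis step comes down to simple token arithmetic. The request volume and per-million-token prices below are hypothetical, chosen only to make the calculation concrete:

```python
# Worked monthly cost projection. Prices are hypothetical USD per 1M tokens.

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float, days: int = 30) -> float:
    """Project monthly API spend from per-request token counts."""
    total_in = requests_per_day * input_tokens * days
    total_out = requests_per_day * output_tokens * days
    return (total_in * input_price + total_out * output_price) / 1_000_000

# 1,000 requests/day, 2,000 input + 500 output tokens each,
# at an assumed $3 input / $15 output per million tokens:
print(f"${monthly_cost(1000, 2000, 500, 3.0, 15.0):,.2f}/month")  # $405.00/month
```

Note that output tokens dominate spend at typical price ratios even when they are a minority of the token count, so projections should always price the two directions separately.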
## Key Principles
- **Evidence-Based** - Support all rankings with benchmark data or documented performance
- **Context-Aware** - Consider the user's specific requirements, budget, and technical environment
- **Transparent** - Explain weighting decisions and ranking criteria clearly
- **Current Information** - Use `web_search` to verify latest releases, pricing changes, and benchmark updates
- **Practical Focus** - Prioritise real-world usage factors over pure benchmark scores
- **Balanced** - Present strengths and weaknesses honestly for each model
## Important Considerations
- **Benchmark Limitations** - Benchmarks don't perfectly reflect real-world performance
- **Task Specificity** - A model's ranking varies significantly by use case
- **Pricing Volatility** - API pricing changes frequently; verify before important decisions
- **Access Availability** - Some models have waitlists or geographic restrictions
- **Trade-offs** - Larger context windows often mean slower processing
## Usage Notes
- Always verify current pricing and availability via web search for recent changes
- Consider user's deployment environment (API vs self-hosted)
- Account for additional costs (vision inputs, fine-tuning, enterprise features)
- Recommend testing on user's specific use case before committing
- Highlight when free tiers or trials are available
## Model Coverage
Provides comprehensive coverage of:
- **Anthropic**: Claude Opus 4.1/4, Sonnet 4.5/4, Haiku 4
- **OpenAI**: GPT-4 Turbo, GPT-4o, GPT-4o-mini, o1-preview, o1-mini
- **Google**: Gemini 1.5 Pro, Gemini 1.5 Flash
- **Meta**: Llama 3.1 (405B, 70B, 8B)
- **Mistral**: Large 2, Small
- **DeepSeek**: Coder V2
- Other providers as relevant to user queries