---
name: llm-rankings
description: Comprehensive LLM model evaluation and ranking system. Use when users ask to compare language models, find the best model for a specific task, understand model capabilities, get pricing information, or need help selecting between GPT-4, Claude, Gemini, Llama, or other LLMs. Provides benchmark-based rankings, cost analysis, and use-case-specific recommendations across reasoning, code generation, long context, multimodal, and other capabilities.
---
# LLM Rankings Skill
Comprehensive evaluation and ranking system for comparing language models across performance, cost, and technical dimensions.
## Core Capabilities
This skill provides four main ranking methodologies:
- **Benchmark-Based Rankings** - Objective comparisons using MMLU, GSM8K, and HumanEval scores
- **Task-Specific Rankings** - Weighted recommendations for code generation, creative writing, reasoning, etc.
- **Cost-Effectiveness Rankings** - Performance-per-dollar analysis
- **Real-World Performance** - API reliability, documentation quality, ease of integration
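As a rough illustration of the cost-effectiveness methodology, a performance-per-dollar score can divide an aggregate benchmark score by blended token price. The model names, scores, and prices below are invented placeholders, not real figures:

```python
# Hypothetical sketch of a performance-per-dollar ranking.
# Scores and prices are placeholder values, not real benchmark data.

def perf_per_dollar(avg_score: float, price_per_mtok: float) -> float:
    """Average benchmark score divided by blended price per 1M tokens."""
    return avg_score / price_per_mtok

models = {
    "premium-model": {"score": 88.0, "price": 10.0},
    "budget-model": {"score": 82.0, "price": 1.0},
}

ranked = sorted(
    models,
    key=lambda m: perf_per_dollar(models[m]["score"], models[m]["price"]),
    reverse=True,
)
print(ranked)  # the cheaper model wins on value despite a lower raw score
```

Raw-score and value rankings often disagree, which is why this skill reports both.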
## Standard Workflows

### Simple Comparison Request
When the user asks "Which LLM is better for X?":
- Load relevant benchmark data from `references/benchmarks.md`
- Filter models matching requirements
- Calculate rankings with appropriate weighting
- Present top 3-5 recommendations with justification
- Include pricing information from `references/pricing.md`
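The filter-weight-rank steps above can be sketched roughly as follows. The benchmark figures and weights here are placeholder assumptions; real data would be loaded from `references/benchmarks.md`:

```python
# Hedged sketch of the simple-comparison workflow.
# Benchmark scores below are invented, not real results.

BENCHMARKS = {
    "model-a": {"mmlu": 86.0, "humaneval": 90.0},
    "model-b": {"mmlu": 88.0, "humaneval": 74.0},
    "model-c": {"mmlu": 79.0, "humaneval": 83.0},
}

def rank_models(weights: dict, top_n: int = 3) -> list:
    """Score each model by a weighted average of benchmarks, highest first."""
    scored = {
        model: sum(scores[bench] * w for bench, w in weights.items())
        for model, scores in BENCHMARKS.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# A code-focused query weights HumanEval more heavily than MMLU.
for model, score in rank_models({"mmlu": 0.3, "humaneval": 0.7}):
    print(f"{model}: {score:.1f}")
```

Adjusting the weights is what turns one benchmark table into many task-specific rankings.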
### Detailed Analysis Request
When the user asks for a comprehensive comparison:
- Load model specifications from `references/model-details.md`
- Generate a side-by-side comparison table
- Include benchmark scores across multiple tests
- Calculate cost projections for expected usage
- Provide deployment considerations
### Best Model for Task Query
When the user describes a specific use case:
- Parse task requirements (performance needs, budget, technical constraints)
- Map to capability dimensions
- Load task-specific rankings from `references/use-cases.md`
- Return top 3 models with detailed reasoning
- Include caveats and alternative suggestions
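One way to read the "map to capability dimensions" step is as a lookup from use case to a weight vector. The dimension names and preset values below are illustrative assumptions only, not weights defined by this skill:

```python
# Illustrative mapping from a parsed use case to capability weights.
# Dimension names and preset values are assumptions for this sketch.

DEFAULT_WEIGHTS = {"reasoning": 0.4, "coding": 0.2, "long_context": 0.2, "cost": 0.2}

PRESETS = {
    "code generation": {"coding": 0.6, "reasoning": 0.3, "cost": 0.1},
    "summarisation": {"long_context": 0.5, "reasoning": 0.2, "cost": 0.3},
}

def weights_for_use_case(use_case: str) -> dict:
    """Return capability weights for a use case, falling back to a default."""
    weights = PRESETS.get(use_case.strip().lower(), DEFAULT_WEIGHTS)
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return weights

print(weights_for_use_case("Code Generation"))
```

The default branch matters: unfamiliar use cases should still get a balanced ranking rather than an error.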
## Reference Resources
Load these files as needed to inform recommendations:
- `benchmarks.md` - Comprehensive benchmark scores (MMLU, GSM8K, HumanEval, MMMU, etc.)
- `model-details.md` - Technical specifications, context windows, API details, capabilities
- `use-cases.md` - Task-specific recommendations organised by common use cases
- `pricing.md` - Current pricing across all providers, cost optimisation strategies
## Output Formats

### Quick Recommendation
Present concise recommendations with model name, key strength, pricing snapshot, and one-sentence justification.
### Comparison Table
Use markdown tables comparing models across relevant dimensions (performance, context window, pricing, best use).
### Detailed Analysis
Structure as:
- Executive summary (2-3 sentences)
- Top recommendations (ranked with justification)
- Performance comparison (benchmark scores)
- Cost analysis (usage projections)
- Implementation considerations
- Alternative options
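The cost-analysis step comes down to simple token arithmetic. The request volume and per-million-token prices below are hypothetical, chosen only to make the calculation concrete:

```python
# Worked monthly cost projection. Prices are hypothetical USD per 1M tokens.

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float, days: int = 30) -> float:
    """Project monthly API spend from per-request token counts."""
    total_in = requests_per_day * input_tokens * days
    total_out = requests_per_day * output_tokens * days
    return (total_in * input_price + total_out * output_price) / 1_000_000

# 1,000 requests/day, 2,000 input + 500 output tokens each,
# at an assumed $3 input / $15 output per million tokens:
print(f"${monthly_cost(1000, 2000, 500, 3.0, 15.0):,.2f}/month")  # $405.00/month
```

Note that output tokens dominate spend at typical price ratios even when they are a minority of the token count, so projections should always price the two directions separately.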
## Key Principles
- **Evidence-Based** - Support all rankings with benchmark data or documented performance
- **Context-Aware** - Consider the user's specific requirements, budget, and technical environment
- **Transparent** - Explain weighting decisions and ranking criteria clearly
- **Current Information** - Use `web_search` to verify latest releases, pricing changes, and benchmark updates
- **Practical Focus** - Prioritise real-world usage factors over pure benchmark scores
- **Balanced** - Present strengths and weaknesses honestly for each model
## Important Considerations
- **Benchmark Limitations** - Benchmarks don't perfectly reflect real-world performance
- **Task Specificity** - A model's ranking varies significantly by use case
- **Pricing Volatility** - API pricing changes frequently; verify before important decisions
- **Access Availability** - Some models have waitlists or geographic restrictions
- **Trade-offs** - Larger context windows often mean slower processing
## Usage Notes
- Always verify current pricing and availability via web search for recent changes
- Consider user's deployment environment (API vs self-hosted)
- Account for additional costs (vision inputs, fine-tuning, enterprise features)
- Recommend testing on user's specific use case before committing
- Highlight when free tiers or trials are available
## Model Coverage
Provides comprehensive coverage of:
- **Anthropic**: Claude Opus 4.1/4, Sonnet 4.5/4, Haiku 4
- **OpenAI**: GPT-4 Turbo, GPT-4o, GPT-4o-mini, o1-preview, o1-mini
- **Google**: Gemini 1.5 Pro, Gemini 1.5 Flash
- **Meta**: Llama 3.1 (405B, 70B, 8B)
- **Mistral**: Large 2, Small
- **DeepSeek**: Coder V2
- Other providers as relevant to user queries