Average score across Hugging Face's Open LLM Leaderboard benchmarks (IFEval, BBH, MATH Level 5, GPQA, MuSR, MMLU-PRO), a widely used community standard for open-source model evaluation.
A standardized harness run across thousands of open models enables direct apples-to-apples comparisons; the broad benchmark bundle reduces reliance on any single task family.
Coverage is primarily of open-weight models, so cross-ecosystem comparisons can be incomplete; leaderboard conditions may also differ from deployment settings such as tool use or safety policies.
Higher is better. The score is a percentage averaged across the six benchmarks; top open-source models score roughly 50-75%.
Max score: 100
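A minimal sketch of how the headline number is computed, assuming each benchmark score is already normalized to 0-100; the per-benchmark values below are illustrative placeholders, not real model results:

```python
# Sketch of the leaderboard average, assuming per-benchmark scores
# normalized to 0-100. Values are hypothetical, for illustration only.
scores = {
    "IFEval": 72.3,
    "BBH": 55.1,
    "MATH Level 5": 38.4,
    "GPQA": 31.0,
    "MuSR": 42.7,
    "MMLU-PRO": 48.9,
}

# Headline metric: unweighted mean of the six benchmark scores.
average = sum(scores.values()) / len(scores)
print(f"Open LLM Leaderboard average: {average:.1f} / 100")
```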
Based on ~5,000 samples, with ~500 input tokens and ~100 output tokens per sample (~600 tokens per sample, ~3M tokens total).
Estimates are from the Open LLM Leaderboard; actual costs vary with reasoning effort and multi-turn prompting.
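A back-of-the-envelope cost estimate from the sample and token counts above; the per-million-token prices are assumptions for illustration, since actual rates depend on the model and provider:

```python
# Rough evaluation-cost estimate from the counts stated above.
# PRICE_* values are hypothetical placeholders; substitute your
# provider's actual per-million-token rates.
SAMPLES = 5_000
INPUT_TOKENS_PER_SAMPLE = 500
OUTPUT_TOKENS_PER_SAMPLE = 100

PRICE_INPUT_PER_M = 0.50   # USD per 1M input tokens (assumed)
PRICE_OUTPUT_PER_M = 1.50  # USD per 1M output tokens (assumed)

input_tokens = SAMPLES * INPUT_TOKENS_PER_SAMPLE    # 2.5M tokens
output_tokens = SAMPLES * OUTPUT_TOKENS_PER_SAMPLE  # 0.5M tokens

cost = (input_tokens / 1e6) * PRICE_INPUT_PER_M \
     + (output_tokens / 1e6) * PRICE_OUTPUT_PER_M
print(f"Total tokens: {input_tokens + output_tokens:,}")   # 3,000,000
print(f"Estimated cost: ${cost:.2f}")
```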