Average score across Hugging Face's Open LLM Leaderboard benchmarks (IFEval, BBH, MATH Level 5, GPQA, MuSR, MMLU-PRO), a widely used community standard for open-source model evaluation.
A standardized harness run across thousands of open models enables direct apples-to-apples comparisons; the broad benchmark bundle reduces reliance on any single task family.
Coverage is primarily of open-weight models, so cross-ecosystem comparisons can be incomplete; leaderboard conditions may also differ from deployment settings such as tool use or safety policies.
Higher is better. The score is a percentage averaged across the six benchmarks; top open-source models score roughly 50-75%.
Max score: 100
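A minimal sketch of how the headline number is computed, assuming each benchmark score is already normalized to 0-100; the per-benchmark values below are illustrative placeholders, not real model results:

```python
# Sketch of the leaderboard average, assuming per-benchmark scores
# normalized to 0-100. Values are hypothetical, for illustration only.
scores = {
    "IFEval": 72.3,
    "BBH": 55.1,
    "MATH Level 5": 38.4,
    "GPQA": 31.0,
    "MuSR": 42.7,
    "MMLU-PRO": 48.9,
}

# Headline metric: unweighted mean of the six benchmark scores.
average = sum(scores.values()) / len(scores)
print(f"Open LLM Leaderboard average: {average:.1f} / 100")
```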
Based on ~5,000 samples, with ~500 input tokens and ~100 output tokens per sample (~600 tokens per sample, ~3M tokens total).
Estimates are from the Open LLM Leaderboard; actual costs vary with reasoning effort and multi-turn prompting.
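A back-of-the-envelope cost estimate from the sample and token counts above; the per-million-token prices are assumptions for illustration, since actual rates depend on the model and provider:

```python
# Rough evaluation-cost estimate from the counts stated above.
# PRICE_* values are hypothetical placeholders; substitute your
# provider's actual per-million-token rates.
SAMPLES = 5_000
INPUT_TOKENS_PER_SAMPLE = 500
OUTPUT_TOKENS_PER_SAMPLE = 100

PRICE_INPUT_PER_M = 0.50   # USD per 1M input tokens (assumed)
PRICE_OUTPUT_PER_M = 1.50  # USD per 1M output tokens (assumed)

input_tokens = SAMPLES * INPUT_TOKENS_PER_SAMPLE    # 2.5M tokens
output_tokens = SAMPLES * OUTPUT_TOKENS_PER_SAMPLE  # 0.5M tokens

cost = (input_tokens / 1e6) * PRICE_INPUT_PER_M \
     + (output_tokens / 1e6) * PRICE_OUTPUT_PER_M
print(f"Total tokens: {input_tokens + output_tokens:,}")   # 3,000,000
print(f"Estimated cost: ${cost:.2f}")
```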