LiveBench Overall

LiveBench Overall tracks broad model performance on recently refreshed questions with verifiable answers. It is used as a lower-contamination snapshot across major capability categories.

aggregate, livebench

Strengths

Monthly refreshes with verifiable answers reduce contamination relative to static sets; the aggregate blends math, coding, reasoning, language, and data analysis for a balanced top-line view.

Caveats

Monthly test-set changes make cross-month deltas imperfectly comparable; the aggregate score can hide category-specific weaknesses that matter for deployment fit.
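
To make the second caveat concrete, here is a minimal sketch of how two models can share the same overall score while differing sharply by category. The per-category numbers are hypothetical (not LiveBench results), and the unweighted mean is an assumption about how the global average is formed; `overall` and `weakest` are illustrative helper names.

```python
from statistics import mean

# Hypothetical per-category scores on a 0-100 scale (NOT real LiveBench data).
model_a = {"math": 70.0, "coding": 70.0, "reasoning": 70.0,
           "language": 70.0, "data_analysis": 70.0}
model_b = {"math": 95.0, "coding": 30.0, "reasoning": 85.0,
           "language": 75.0, "data_analysis": 65.0}


def overall(scores):
    """Unweighted mean across categories (assumed aggregation rule)."""
    return mean(scores.values())


def weakest(scores):
    """Return (category, score) for the lowest-scoring category."""
    name = min(scores, key=scores.get)
    return name, scores[name]


for label, scores in (("A", model_a), ("B", model_b)):
    category, low = weakest(scores)
    print(f"model {label}: overall={overall(scores):.1f}, weakest={category} ({low:.1f})")
```

Both hypothetical models average 70.0, yet model B would be a poor fit for a coding-heavy deployment, which is why the per-category breakdown is worth checking alongside the top-line number.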

How to interpret scores

Contamination risk: low

Freshness: periodic

Higher is better. The score is the global average across all categories. Top models score 60-80%.

Max score: 100

Relevant use cases

Overall model evaluation, Contamination-free assessment, Capability tracking, Multi-Domain Reasoning, Diverse Capability Assessment

Leaderboard

Last synced: Jul 10, 2026, 8:59 PM

#	Model	Provider	Score	Price/M
1	Gemma 3 270M	Google	98.5	$0.00
2	Qwen3.5 0.8B (Reasoning)	Alibaba	97.3	$0.00
3	Llama 2 Chat 7B	Meta	96.2	$0.10
4	AlfredPros: CodeLLaMa 7B Instruct Solidity	Alfredpros	86.8	$0.90
5	Mistral: Ministral 3 3B 2512	Mistral	86.8	$0.10
6	Gemma 3 1B Instruct	Google	84.6	$0.00
7	Mistral: Ministral 3 8B 2512	Mistral	84.4	$0.15
8	Granite 4.0 350M	IBM	83.1	$0.00
9	LFM2.5-1.2B-Instruct	Liquid AI	82.6	$0.00
10	Meta: Llama 3.2 1B Instruct	Meta	82.3	$0.07
11	Qwen3.5 0.8B (Non-reasoning)	Alibaba	82.0	$0.00
12	Mistral: Ministral 3 14B 2512	Mistral	81.7	$0.20
13	Ministral 3 3B	Mistral	81.4	$0.10
14	Llama 2 Chat 70B	Meta	80.8	$0.00
15	Gemini 3 Deep Think	Google	80.7	$0.00
16	GPT-5.4 (Non-reasoning)	OpenAI	80.3	$5.63
17	GPT-5.4 (xhigh)	OpenAI	80.3	$5.63
18	Gemini 3.1 Pro Preview	Google	79.9	$4.50
19	Google: Gemini 3.1 Pro Preview Custom Tools	Google	79.5	$4.50
20	LFM2 1.2B	Liquid AI	79.4	$0.00
21	Cohere: Command R+ (08-2024)	Cohere	78.3	$4.38
22	Cohere: Command R (08-2024)	Cohere	78.3	$0.26
23	Claude Opus 4.6 (Non-reasoning, High Effort)	Anthropic	78.2	$10.00
24	TNG: DeepSeek R1T2 Chimera (free)	Tngtech	78.1	$0.00
25	Cohere: Command R7B (12-2024)	Cohere	77.9	$0.07
26	LFM2.5-VL-1.6B	Liquid AI	77.7	$0.00
27	Mistral 7B Instruct	Mistral	77.0	$0.25
28	Qwen3.5 9B (Reasoning)	Alibaba	77.0	$0.11
29	Llama 3.2 Instruct 1B	Meta	76.8	$0.05
30	Llama 65B	Meta	76.6	$0.00
31	Mistral: Mistral 7B Instruct v0.1	Mistral	76.3	$0.13
32	GLM-5.2 (max)	Z AI	76.2	$2.15
33	Claude Opus 4.5 (Reasoning)	Anthropic	76.0	$10.00
34	Claude Opus 4.5 (Non-reasoning)	Anthropic	76.0	$10.00
35	KAT-Coder-Pro V1	KwaiKAT	76.0	$0.53
36	GPT-5.4 Pro (xhigh)	OpenAI	75.6	$67.50
37	Qwen3 Max Thinking	Alibaba	75.3	$1.56
38	DeepSeek R1 Distill Qwen 1.5B	DeepSeek	75.2	$0.00
39	Granite 4.0 H 350M	IBM	75.1	$0.00
40	Gemini 3.5 Flash (high)	Google	75.0	$3.38
41	Claude Opus 4.6 (Adaptive Reasoning, Max Effort)	Anthropic	75.0	$10.00
42	Aurora Alpha	Openrouter	74.7	$0.00
43	Claude Instant	Anthropic	74.6	$0.00
44	DeepSeek R1 Distill Llama 8B	DeepSeek	74.4	$0.00
45	GPT-5.2 Codex (xhigh)	OpenAI	74.3	$4.81
46	xAI: Grok 4.20 Multi-Agent Beta	SpaceXAI	74.2	$3.00
47	Nanbeige4.1-3B	Nanbeige	74.2	$0.00
48	Apertus 8B Instruct	Swiss AI Initiative	74.0	$0.13
49	Apertus 70B Instruct	Swiss AI Initiative	73.9	$1.34
50	DeepSeek V4 Pro (Reasoning, Max Effort)	DeepSeek	73.6	$0.54
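
For readers who want to work with the table programmatically, the following is a rough sketch of parsing tab-separated leaderboard rows and filtering them by price. The four rows are copied from the table above; the field names, the `load_rows` and `within_budget` helpers, and the reading of the Price/M column as a dollar amount are assumptions made for illustration.

```python
import csv
import io

# Four rows copied from the leaderboard above (rank, model, provider, score, price).
LEADERBOARD_ROWS = (
    "16\tGPT-5.4 (Non-reasoning)\tOpenAI\t80.3\t$5.63\n"
    "18\tGemini 3.1 Pro Preview\tGoogle\t79.9\t$4.50\n"
    "23\tClaude Opus 4.6 (Non-reasoning, High Effort)\tAnthropic\t78.2\t$10.00\n"
    "50\tDeepSeek V4 Pro (Reasoning, Max Effort)\tDeepSeek\t73.6\t$0.54\n"
)


def load_rows(text):
    """Parse tab-separated rows into dicts with typed fields."""
    rows = []
    for rank, model, provider, score, price in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "rank": int(rank),
            "model": model,
            "provider": provider,
            "score": float(score),
            "price": float(price.lstrip("$")),  # assumes a leading "$"
        })
    return rows


def within_budget(rows, max_price):
    """Rows priced at or below max_price, highest score first."""
    eligible = [r for r in rows if r["price"] <= max_price]
    return sorted(eligible, key=lambda r: r["score"], reverse=True)


rows = load_rows(LEADERBOARD_ROWS)
for row in within_budget(rows, max_price=5.00):
    print(f'{row["model"]}: score {row["score"]}, price ${row["price"]:.2f}')
```

Because the test set changes monthly, a filter like this is best applied to a single synced snapshot rather than to scores collected across different months.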