WeirdML

Tests LLMs on novel, unusual machine learning tasks — models must understand data properties and generate working PyTorch solutions, iterating over 5 debugging rounds within computational constraints.
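
As a rough illustration of the iterate-and-debug setup described above, the loop might look like the sketch below. The `generate` and `execute` hooks, the toy stand-ins, and the choice to keep the best round are all assumptions made for illustration, not the benchmark's actual harness.

```python
from typing import Callable, Optional, Tuple

MAX_ROUNDS = 5  # the description above mentions 5 debugging rounds


def run_task(
    generate: Callable[[str, Optional[str]], str],
    execute: Callable[[str], Tuple[float, str]],
    task_prompt: str,
    max_rounds: int = MAX_ROUNDS,
) -> float:
    """Return the best accuracy seen over up to `max_rounds` attempts.

    `generate(prompt, feedback)` and `execute(code)` are hypothetical hooks:
    the first asks a model for code, the second runs it under resource limits
    and returns (accuracy, log output). Keeping the best round is an assumption.
    """
    best = 0.0
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate(task_prompt, feedback)
        accuracy, feedback = execute(code)
        best = max(best, accuracy)
    return best


# Toy stand-ins so the sketch runs without a model or a sandbox.
def fake_generate(prompt: str, feedback: Optional[str]) -> str:
    return "attempt-1" if feedback is None else "attempt-2"


def fake_execute(code: str) -> Tuple[float, str]:
    return (0.4, "low accuracy") if code == "attempt-1" else (0.7, "ok")


best_accuracy = run_task(fake_generate, fake_execute, "classify the shapes")
print(best_accuracy)
```

In this toy run the first attempt is weak and the later attempts, which see the feedback, do better, so the returned value is the later accuracy.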

coding, weirdml

Strengths

Measures genuine ML problem-solving ability with novel tasks that can't be memorized. Tests the full pipeline: understanding, coding, debugging, and optimizing under constraints.

Caveats

Heavily coding-focused (PyTorch specifically). 11 of 17 tasks are hidden, so scores can't be fully analyzed. Cost varies widely across models.
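
To make the cost caveat concrete, one option is to divide each score by its listed price. The sketch below uses a handful of rows copied from the leaderboard; treating the Price/M column as a comparable per-unit price, and skipping $0.00 entries (which may mean free or simply unlisted), are assumptions.

```python
# (model, score, listed price) copied from a few leaderboard rows.
rows = [
    ("GPT-5.5 (xhigh)", 83.9, 11.25),
    ("GPT-5.3 Codex (xhigh)", 77.9, 4.81),
    ("Gemini 3 Flash Preview (Reasoning)", 73.5, 1.13),
    ("KAT-Coder-Pro V1", 72.5, 0.53),
    ("GLM-5-Turbo", 71.9, 0.00),
]


def points_per_dollar(score: float, price: float) -> float:
    """Score points per listed dollar (price must be positive)."""
    return score / price


# Skip zero-priced rows: the ratio is undefined for them.
ranked = sorted(
    ((name, points_per_dollar(score, price)) for name, score, price in rows if price > 0),
    key=lambda item: item[1],
    reverse=True,
)

for name, ratio in ranked:
    print(f"{name}: {ratio:.1f} points per dollar")
```

A ratio like this only orders models by listed price; it says nothing about how many tokens each model actually spends per task.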

How to interpret scores

Contamination risk: moderate
Freshness: periodic

Higher is better. The score is the average accuracy across the 17 tasks, displayed as a percentage. Top models on the leaderboard score roughly 70-84%.

Max score: 100
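
The headline number can be read as a plain mean of per-task accuracies scaled to 0-100. The sketch below assumes an unweighted mean and uses made-up per-task values; the function name is hypothetical.

```python
from typing import Sequence


def benchmark_score(task_accuracies: Sequence[float]) -> float:
    """Unweighted mean of per-task accuracies (each in [0, 1]) as a percentage."""
    if not task_accuracies:
        raise ValueError("need at least one task accuracy")
    return 100.0 * sum(task_accuracies) / len(task_accuracies)


# Made-up accuracies for 17 tasks: 5 solved perfectly, 12 at 50%.
example_accuracies = [1.0] * 5 + [0.5] * 12
score = benchmark_score(example_accuracies)
print(f"{score:.1f}")
```

Because every task counts equally under this assumption, a model that fails a few tasks outright can still land mid-table if it does well on the rest.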

Relevant use cases

ML engineering, Code generation, Scientific computing, Data analysis

Leaderboard

Last synced: Jun 1, 2026, 9:46 AM

#	Model	Provider	Score (%)	Price/M (USD)
1	GPT-5.5 (xhigh)	OpenAI	83.9	$11.25
2	Claude Opus 4.6 (Adaptive Reasoning, Max Effort)	Anthropic	78.0	$10.94
3	GPT-5.3 Codex (xhigh)	OpenAI	77.9	$4.81
4	GPT-5.2 Codex (xhigh)	OpenAI	77.8	$4.81
5	GPT-5.4 (Non-reasoning)	OpenAI	77.7	$5.63
6	Claude Opus 4.7	Anthropic	75.5	$11.67
7	Qwen3 Max Thinking	Alibaba	74.1	$2.40
8	Gemini 3 Flash Preview (Reasoning)	Google	73.5	$1.13
9	KAT-Coder-Pro V1	KwaiKAT	72.5	$0.53
10	GPT-5.2 (xhigh)	OpenAI	72.2	$4.81
11	Google: Gemini 3.1 Pro Preview Custom Tools	Google	72.1	$4.50
12	Gemini 3.1 Pro Preview	Google	72.1	$4.50
13	GLM-5-Turbo	Z AI	71.9	$0.00
14	Qwen3.5 122B A10B (Reasoning)	Alibaba	71.4	$1.10
15	Gemini 3 Pro Preview (high)	Google	70.6	$4.50
16	AlfredPros: CodeLLaMa 7B Instruct Solidity	Alfredpros	70.1	$0.90
17	Google: Gemini 3 Pro Preview	Google	69.9	$4.50
18	Gemini 3 Pro Preview (low)	Google	69.9	$4.50
19	GPT-5.1 Codex (high)	OpenAI	69.6	$3.44
20	OpenAI: GPT-5.3-Codex	OpenAI	69.0	$4.81
21	MiMo-V2-Flash (Reasoning)	Xiaomi	69.0	$0.15
22	Grok 4.20 0309 (Reasoning)	xAI	68.7	$3.00
23	MiMo-V2-Pro	Xiaomi	68.2	$1.50
24	Llama 65B	Meta	68.2	$0.00
25	Sao10K: Llama 3.1 70B Hanami x1	Sao10k	68.1	$3.00
26	Qwen3.5 35B A3B (Reasoning)	Alibaba	67.4	$0.69
27	Google: Gemma 2 9B	Google	66.8	$0.04
28	Gemma 4 31B (Reasoning)	Google	66.2	$0.00
29	Claude Sonnet 4.6 (Non-reasoning, High Effort)	Anthropic	66.1	$6.56
30	Anthropic: Claude Sonnet 4	Anthropic	66.1	$6.00
31	Claude Opus 4.6 (Non-reasoning, High Effort)	Anthropic	65.9	$10.94
32	Anthropic: Claude Opus 4	Anthropic	65.9	$30.00
33	Qwen3.5 397B A17B (Reasoning)	Alibaba	65.6	$1.35
34	KAT Coder Pro V2	KwaiKAT	65.5	$0.53
35	MiMo-V2-Flash (Feb 2026)	Xiaomi	65.3	$0.15
36	Gemma 4 26B A4B (Reasoning)	Google	64.8	$0.20
37	Gemini 3 Flash Preview (Non-reasoning)	Google	64.8	$1.13
38	Google: Gemma 2 27B	Google	63.9	$0.65
39	Doubao Seed 2.0 lite (Reasoning)	ByteDance Seed	63.9	$0.00
40	Claude Opus 4.5 (Reasoning)	Anthropic	63.7	$10.94
41	Claude Opus 4.5 (Non-reasoning)	Anthropic	63.7	$10.94
42	Step 3.5 Flash	StepFun	63.5	$0.15
43	MiMo-V2-Omni	Xiaomi	63.1	$0.00
44	Mistral: Ministral 3 3B 2512	Mistral	63.1	$0.10
45	Gemini 3.5 Flash (high)	Google	62.6	$3.38
46	Qwen Chat 14B	Alibaba	62.5	$0.00
47	Claude 4.5 Sonnet (Reasoning)	Anthropic	62.4	$6.56
48	MiniMax-M2.5	MiniMax	62.0	$0.52
49	Google: Gemini 3 Flash Preview	Google	61.6	$1.13
50	GLM-4.7 (Reasoning)	Z AI	61.5	$1.00