Tests LLMs on novel, unusual machine learning tasks. Models must understand the properties of the data and generate a working PyTorch solution, iterating over 5 debugging rounds within computational constraints.
Measures genuine ML problem-solving ability with novel tasks that can't be memorized. Tests the full pipeline: understanding, coding, debugging, and optimizing under constraints.
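The generate, run, and debug cycle described above can be sketched roughly as follows. This is only an illustration, not the benchmark's actual harness: `run_rounds`, `generate`, and `execute` are hypothetical names, and the sketch assumes the 5 rounds are 5 attempts in total and that a failed run's log is passed back to the model as feedback.

```python
from typing import Callable, Optional, Tuple

MAX_ROUNDS = 5  # assumption: the 5 debugging rounds are 5 attempts in total


def run_rounds(
    generate: Callable[[Optional[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[Optional[str], int]:
    """Request a solution, run it, and feed failures back until it works.

    generate(feedback) -> candidate source code (feedback is None at first)
    execute(code)      -> (succeeded, log text)
    Returns (working code or None, number of rounds used).
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = generate(feedback)
        ok, log = execute(code)
        if ok:
            return code, round_no
        feedback = log  # the failure log becomes the next round's feedback
    return None, max_rounds
```

In a real setup, `generate` would wrap a model call and `execute` would run the PyTorch code under the task's time and memory limits; both are left as plain callables here so the loop itself stays easy to read.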
Heavily coding-focused (PyTorch specifically). 11 of 17 tasks are hidden, so scores can't be fully analyzed. Cost varies widely across models.
Higher is better. The score is the average accuracy across all 17 tasks, displayed as a percentage. Top models score 70-80%.
Max score: 100
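As a rough sketch of how such a score could be computed, assuming an unweighted mean of per-task accuracies in the range 0 to 1 (the function name `benchmark_score` and the sample values are hypothetical, not real results):

```python
from typing import Sequence


def benchmark_score(task_accuracies: Sequence[float]) -> float:
    """Unweighted mean of per-task accuracies (each in [0, 1]), as a percentage."""
    if not task_accuracies:
        raise ValueError("need at least one task accuracy")
    return 100.0 * sum(task_accuracies) / len(task_accuracies)


# Illustrative values only: 17 made-up task accuracies, not benchmark data.
example = [0.5] * 10 + [1.0] * 7
print(f"{benchmark_score(example):.1f}%")
```

A model that solved every task perfectly would reach the maximum of 100 under this formula.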