LiveBench Overall tracks broad model performance on recently refreshed questions with verifiable answers. It is used as a lower-contamination snapshot across major capability categories.
Monthly refreshes with verifiable answers reduce contamination relative to static sets; the aggregate blends math, coding, reasoning, language, and data analysis for a balanced top-line view.
Monthly test-set changes make cross-month deltas only approximately comparable; the aggregate score can hide category-specific weaknesses that matter for deployment fit.
Higher is better. The score is the global average across all categories. Top models score 60-80%.
Max score: 100
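As a rough illustration of how a global average can mask a weak category, here is a minimal sketch. It assumes the overall score is an unweighted mean of per-category scores on a 0-100 scale; the function name `overall_score` and the category values are hypothetical and are not taken from LiveBench results or code.

```python
from statistics import mean


def overall_score(category_scores):
    """Unweighted mean of per-category scores (assumed aggregation, 0-100 scale)."""
    if not category_scores:
        raise ValueError("at least one category score is required")
    return mean(category_scores.values())


# Hypothetical per-category scores, for illustration only.
scores = {
    "math": 70.0,
    "coding": 60.0,
    "reasoning": 65.0,
    "language": 55.0,
    "data_analysis": 50.0,
}

overall = overall_score(scores)
# The weakest category is invisible in the top-line number alone.
weakest = min(scores, key=scores.get)
print(f"overall={overall:.1f}, weakest={weakest} ({scores[weakest]:.1f})")
```

Reporting the weakest category next to the aggregate is one simple way to surface the category-specific gaps that the top-line score can hide.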