What each benchmark measures, why it matters, and how to interpret scores.
Transparent composite indices computed by AgMoDB using percentile-rank normalization across core domain benchmarks, with both full and confidence-oriented variants.
Weighted blend of Reasoning (22%), Coding (22%), Math (14%), Agentic (14%), Robustness (18%), and Document Intelligence (10%) domain indices using observed benchmark data only (BenchPress predictions excluded). Uses percentile-rank normalization across the full model population.
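As a sketch of the mechanics (function names and data layout here are illustrative assumptions, not AgMoDB's actual code): each benchmark's raw scores are converted to percentile ranks over the model population, each domain index averages the percentile ranks of its benchmarks, and the composite is the weighted blend above. The per-domain indices described further below use the same mean-of-percentile-ranks construction.

```python
import numpy as np
from scipy.stats import rankdata

# Illustrative sketch only: names, data layout, and the exact
# normalization are assumptions, not AgMoDB's published code.

def percentile_rank(scores: np.ndarray) -> np.ndarray:
    """Map raw benchmark scores to percentile ranks in [0, 100],
    with tied scores receiving the average of their ranks."""
    ranks = rankdata(scores)  # 1..n, low score -> low rank
    return 100.0 * (ranks - 1) / (len(scores) - 1)

# Observed-data-only weights from the description above.
WEIGHTS = {"reasoning": 0.22, "coding": 0.22, "math": 0.14,
           "agentic": 0.14, "robustness": 0.18, "doc_intel": 0.10}

def agmobench(benchmarks: dict[str, list[np.ndarray]]) -> np.ndarray:
    """benchmarks maps each domain to a list of per-benchmark score
    vectors (one value per model). Each domain index is the mean of
    its benchmarks' percentile ranks; the composite is the weighted
    blend of the domain indices."""
    domain_index = {
        d: np.mean([percentile_rank(b) for b in vecs], axis=0)
        for d, vecs in benchmarks.items()
    }
    return sum(w * domain_index[d] for d, w in WEIGHTS.items())
```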
Trust-weighted blend of domain scores that emphasizes reasoning and coding signal. Uses prediction-inclusive data. Weights: Reasoning (27%), Coding (27%), Math (14%), Agentic (9%), Robustness (13%), and Document Intelligence (10%).
Prediction-inclusive variant of AgMoBench that includes BenchPress ML-predicted benchmark cells alongside observed data. Uses confidence adjustment, shrinking each domain's weight toward a 25% floor when that domain is prediction-heavy.
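The shrinkage rule is described only loosely above; one plausible reading, with the exact functional form as an assumption, is to scale each nominal weight by the domain's observed-data fraction, never dropping below 25% of the nominal weight, and then renormalize:

```python
# One plausible reading of the confidence adjustment described
# above; the shrinkage function is an assumption, not AgMoDB's spec.

FLOOR = 0.25  # minimum retained share of a domain's nominal weight

def adjusted_weights(nominal: dict[str, float],
                     observed_frac: dict[str, float]) -> dict[str, float]:
    """Scale each domain's nominal weight by how much of its data is
    observed rather than ML-predicted: a fully predicted domain keeps
    25% of its weight, a fully observed one keeps 100%. Weights are
    renormalized to sum to 1 afterwards."""
    scaled = {d: w * (FLOOR + (1 - FLOOR) * observed_frac[d])
              for d, w in nominal.items()}
    total = sum(scaled.values())
    return {d: w / total for d, w in scaled.items()}
```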
Composite reasoning score averaging percentile ranks across knowledge and reasoning benchmarks.
Composite coding score averaging percentile ranks across code generation and software engineering benchmarks.
Composite math score averaging percentile ranks across mathematical reasoning benchmarks.
Composite agentic score averaging percentile ranks across tool use, web browsing, and autonomous task benchmarks.
Composite factual accuracy score averaging percentile ranks across factuality and critical thinking benchmarks — how well models resist nonsense, avoid fabrication, and know what they don't know.
Composite document processing score averaging percentile ranks across OCR, parsing, table understanding, and document retrieval benchmarks — how well models handle enterprise document workflows.
Proprietary composite scores from Artificial Analysis. Methodology is not publicly disclosed.
Artificial Analysis's proprietary composite intelligence score aggregating multiple benchmarks. Methodology is not publicly disclosed.
Artificial Analysis's proprietary composite coding ability score.
Artificial Analysis's proprietary composite mathematical reasoning score.
Third-party composite scores and leaderboard aggregates.
Artificial Analysis's composite Intelligence Index as reported in the LLM Benchmark Matrix. Aggregates multiple evals into a single capability score.
Apex Agents benchmark scores from Epoch AI's benchmark data collection.
ARC-AGI-2 benchmark scores from Epoch AI's benchmark data collection.
Chatbot Arena Elo measures relative user preference from blind head-to-head votes in LMArena. It is a live human-judgment signal for conversational quality under real prompts.
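For intuition, the classic online Elo update on a single vote looks like the sketch below. Note that LMArena's published ratings are fit over all votes at once (a Bradley-Terry-style model) rather than updated one vote at a time, and the K-factor here is illustrative:

```python
# Classic online Elo update for one head-to-head vote; the K-factor
# is an illustrative assumption, not LMArena's actual setting.

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins the vote, 0.0 if it loses,
    and 0.5 for a tie. Returns updated ratings for both models."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```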
Epoch Capabilities Index (ECI) scores from Epoch AI's benchmark data collection.
Artificial Analysis's GDPval metric, an aggregate measure combining quality and value. Specific methodology is not publicly documented.
GDPval benchmark scores from Epoch AI's benchmark data collection.
Humanity's Last Exam (HLE) benchmark scores from Epoch AI's benchmark data collection.
LiveBench Overall tracks broad model performance on recently refreshed questions with verifiable answers. It is used as a lower-contamination snapshot across major capability categories.
Average score across HuggingFace's Open LLM Leaderboard benchmarks (IFEval, BBH, MATH Level 5, GPQA, MuSR, MMLU-Pro), a widely used community standard for open-source model evaluation.
Total number of trainable parameters in the model, as catalogued by Epoch AI's frontier model database.
PostTrainBench benchmark scores from Epoch AI's benchmark data collection.
Total floating-point operations (FLOP) used during model training, as estimated by Epoch AI from public disclosures, hardware counts, and training duration.
Estimated total cost to train the model in 2024 US dollars, including compute hardware rental or amortization.
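Neither estimate's exact inputs are reproduced here, but a common back-of-the-envelope chain runs: FLOP ≈ 6 × parameters × training tokens for dense transformers, then GPU-hours at an assumed hardware utilization, then dollars at an assumed rental rate. Every constant in the sketch below is an illustrative assumption, not Epoch AI's actual figure for any model:

```python
# Back-of-the-envelope sketch of how such estimates are commonly
# derived; all constants are illustrative assumptions.

def training_flop(n_params: float, n_tokens: float) -> float:
    """Standard dense-transformer approximation: ~6 FLOP per
    parameter per training token (forward + backward pass)."""
    return 6.0 * n_params * n_tokens

def training_cost_usd(flop: float,
                      peak_flops: float = 1e15,   # per-GPU peak FLOP/s (assumed)
                      mfu: float = 0.4,           # model FLOP utilization (assumed)
                      usd_per_gpu_hour: float = 2.0) -> float:
    """Convert total FLOP to GPU-hours at an assumed utilization,
    then price them at an assumed hourly rental rate."""
    gpu_hours = flop / (peak_flops * mfu * 3600.0)
    return gpu_hours * usd_per_gpu_hour

# Example: a hypothetical 70B-parameter model trained on 15T tokens.
flop = training_flop(70e9, 15e12)  # ~6.3e24 FLOP
print(f"{flop:.2e} FLOP, ~${training_cost_usd(flop):,.0f}")
```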