What each benchmark measures, why it matters, and how to interpret scores.
Transparent composite indices computed by AgMoDB using percentile-rank normalization across core domain benchmarks, with both full and confidence-oriented variants.
Weighted blend of Reasoning (22%), Coding (22%), Math (14%), Agentic (14%), Robustness (18%), and Document Intelligence (10%) domain indices using observed benchmark data only (BenchPress predictions excluded). Uses percentile-rank normalization across the full model population.
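As a sketch of the mechanics (function names and data layout here are illustrative assumptions, not AgMoDB's actual code): each benchmark's raw scores are converted to percentile ranks over the model population, each domain index averages the percentile ranks of its benchmarks, and the composite is the weighted blend above. The per-domain indices described further below use the same mean-of-percentile-ranks construction.

```python
import numpy as np
from scipy.stats import rankdata

# Illustrative sketch only: names, data layout, and the exact
# normalization are assumptions, not AgMoDB's published code.

def percentile_rank(scores: np.ndarray) -> np.ndarray:
    """Map raw benchmark scores to percentile ranks in [0, 100],
    with tied scores receiving the average of their ranks."""
    ranks = rankdata(scores)  # 1..n, low score -> low rank
    return 100.0 * (ranks - 1) / (len(scores) - 1)

# Observed-data-only weights from the description above.
WEIGHTS = {"reasoning": 0.22, "coding": 0.22, "math": 0.14,
           "agentic": 0.14, "robustness": 0.18, "doc_intel": 0.10}

def agmobench(benchmarks: dict[str, list[np.ndarray]]) -> np.ndarray:
    """benchmarks maps each domain to a list of per-benchmark score
    vectors (one value per model). Each domain index is the mean of
    its benchmarks' percentile ranks; the composite is the weighted
    blend of the domain indices."""
    domain_index = {
        d: np.mean([percentile_rank(b) for b in vecs], axis=0)
        for d, vecs in benchmarks.items()
    }
    return sum(w * domain_index[d] for d, w in WEIGHTS.items())
```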
Trust-weighted blend of domain scores that emphasizes reasoning and coding signal. Uses prediction-inclusive data. Weights: Reasoning (27%), Coding (27%), Math (14%), Agentic (9%), Robustness (13%), and Document Intelligence (10%).
Prediction-inclusive variant of AgMoBench that includes BenchPress ML-predicted benchmark cells alongside observed data. Uses confidence adjustment, shrinking each domain's weight toward a 25% floor when that domain is prediction-heavy.
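The shrinkage rule is described only loosely above; one plausible reading, with the exact functional form as an assumption, is to scale each nominal weight by the domain's observed-data fraction, never dropping below 25% of the nominal weight, and then renormalize:

```python
# One plausible reading of the confidence adjustment described
# above; the shrinkage function is an assumption, not AgMoDB's spec.

FLOOR = 0.25  # minimum retained share of a domain's nominal weight

def adjusted_weights(nominal: dict[str, float],
                     observed_frac: dict[str, float]) -> dict[str, float]:
    """Scale each domain's nominal weight by how much of its data is
    observed rather than ML-predicted: a fully predicted domain keeps
    25% of its weight, a fully observed one keeps 100%. Weights are
    renormalized to sum to 1 afterwards."""
    scaled = {d: w * (FLOOR + (1 - FLOOR) * observed_frac[d])
              for d, w in nominal.items()}
    total = sum(scaled.values())
    return {d: w / total for d, w in scaled.items()}
```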
Composite reasoning score averaging percentile ranks across knowledge and reasoning benchmarks.
Composite coding score averaging percentile ranks across code generation and software engineering benchmarks.
Composite math score averaging percentile ranks across mathematical reasoning benchmarks.
Composite agentic score averaging percentile ranks across tool use, web browsing, and autonomous task benchmarks.
Composite factual accuracy score averaging percentile ranks across factuality and critical thinking benchmarks — how well models resist nonsense, avoid fabrication, and know what they don't know.
Composite document processing score averaging percentile ranks across OCR, parsing, table understanding, and document retrieval benchmarks — how well models handle enterprise document workflows.
Proprietary composite scores from Artificial Analysis. Methodology is not publicly disclosed.
Artificial Analysis's proprietary composite intelligence score aggregating multiple benchmarks. Methodology is not publicly disclosed.
Artificial Analysis's proprietary composite coding ability score.
Artificial Analysis's proprietary composite mathematical reasoning score.
Third-party composite scores and leaderboard aggregates.
Artificial Analysis's composite Intelligence Index as reported in the LLM Benchmark Matrix. Aggregates multiple evals into a single capability score.
Apex Agents benchmark scores from Epoch AI's benchmark data collection.
ARC-AGI-2 benchmark scores from Epoch AI's benchmark data collection.
Chatbot Arena Elo measures relative user preference from blind head-to-head votes in LMArena. It is a live human-judgment signal for conversational quality under real prompts.
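For intuition, the classic online Elo update on a single vote looks like the sketch below. Note that LMArena's published ratings are fit over all votes at once (a Bradley-Terry-style model) rather than updated one vote at a time, and the K-factor here is illustrative:

```python
# Classic online Elo update for one head-to-head vote; the K-factor
# is an illustrative assumption, not LMArena's actual setting.

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins the vote, 0.0 if it loses,
    and 0.5 for a tie. Returns updated ratings for both models."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```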
Epoch Capabilities Index (ECI) scores from Epoch AI's benchmark data collection.
Artificial Analysis's GDPval metric, an aggregate measure combining quality and value. Specific methodology is not publicly documented.
GDPval benchmark scores from Epoch AI's benchmark data collection.
Humanity's Last Exam (HLE) benchmark scores from Epoch AI's benchmark data collection.
LiveBench Overall tracks broad model performance on recently refreshed questions with verifiable answers. It is used as a lower-contamination snapshot across major capability categories.
Average score across HuggingFace's Open LLM Leaderboard benchmarks (IFEval, BBH, MATH Level 5, GPQA, MuSR, MMLU-Pro), a widely used community standard for open-source model evaluation.
Total number of trainable parameters in the model, as catalogued by Epoch AI's frontier model database.
PostTrainBench benchmark scores from Epoch AI's benchmark data collection.
Total floating-point operations (FLOP) used during model training, as estimated by Epoch AI from public disclosures, hardware counts, and training duration.
Estimated total cost to train the model in 2024 US dollars, including compute hardware rental or amortization.
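Neither estimate's exact inputs are reproduced here, but a common back-of-the-envelope chain runs: FLOP ≈ 6 × parameters × training tokens for dense transformers, then GPU-hours at an assumed hardware utilization, then dollars at an assumed rental rate. Every constant in the sketch below is an illustrative assumption, not Epoch AI's actual figure for any model:

```python
# Back-of-the-envelope sketch of how such estimates are commonly
# derived; all constants are illustrative assumptions.

def training_flop(n_params: float, n_tokens: float) -> float:
    """Standard dense-transformer approximation: ~6 FLOP per
    parameter per training token (forward + backward pass)."""
    return 6.0 * n_params * n_tokens

def training_cost_usd(flop: float,
                      peak_flops: float = 1e15,   # per-GPU peak FLOP/s (assumed)
                      mfu: float = 0.4,           # model FLOP utilization (assumed)
                      usd_per_gpu_hour: float = 2.0) -> float:
    """Convert total FLOP to GPU-hours at an assumed utilization,
    then price them at an assumed hourly rental rate."""
    gpu_hours = flop / (peak_flops * mfu * 3600.0)
    return gpu_hours * usd_per_gpu_hour

# Example: a hypothetical 70B-parameter model trained on 15T tokens.
flop = training_flop(70e9, 15e12)  # ~6.3e24 FLOP
print(f"{flop:.2e} FLOP, ~${training_cost_usd(flop):,.0f}")
```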