Data Sources
Where the numbers come from and how I organize them.
Collection
I collect benchmark results manually from primary sources. No scrapers, no APIs. Just me reading announcements and updating the data. This means:
- Model providers: Official blogs, papers, and posts from OpenAI, Anthropic, Google, xAI, etc.
- Benchmark platforms: LiveBench, Arena (Chat and Code), Aider, SWE-Bench. Data comes directly from their official results pages.
Each model shows a "last updated" timestamp so you know how fresh the data is. New models are added as soon as I see them announced.
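For illustration, one model entry could be organized as a small record like the sketch below. The field names and values are hypothetical, not the site's actual schema; the point is just that every model carries its scores alongside a last-updated date.

```python
from datetime import date

# Hypothetical example of how a single model entry might be stored.
# Field names and numbers are illustrative, not the real schema or data.
model_entry = {
    "name": "Example-Model",            # placeholder model name
    "provider": "Example Provider",     # e.g. OpenAI, Anthropic, Google, xAI
    "last_updated": date(2025, 1, 1),   # when these scores were last checked
    "scores": {
        "swe_bench": 62.5,              # illustrative percentage
        "gpqa_diamond": 71.0,           # illustrative percentage
        "aime": 85.0,                   # illustrative percentage
    },
}
```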
Benchmarks
I focus on benchmarks that measure capabilities people actually care about. The key ones:
- Coding: SWE-Bench (real GitHub issue resolution), Aider Polyglot (practical code editing), Arena Code, LiveBench Coding, Terminal-Bench
- Reasoning: Humanity's Last Exam (very hard reasoning), GPQA Diamond (graduate-level science), LiveBench Reasoning
- Math: AIME competitions, MathArena leaderboards, LiveBench Math
- General: Arena Chat (human preferences), LiveBench Average, MMLU
- Multimodal: MMMU (multimodal understanding), LiveBench multimodal categories
- Agentic: BrowseComp (web browsing agents), SWE-Bench Pro, agentic coding benchmarks
Plus instruction-following evaluations, hallucination detection, and more. See the leaderboard for the full list. I skip older benchmarks that have been superseded (like HumanEval) and avoid benchmarks that are easily gamed or contaminated.
RankedAGI Scores
I also calculate composite scores that blend related benchmarks. The RankedAGI Coding score, for example, combines results from leading coding benchmarks, normalized to a common scale and weighted differently. These scores are experimental, and I'm still tuning the formulas to better reflect real-world performance.
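As a rough sketch of the idea (the actual weights, benchmark choices, and normalization are still being tuned and aren't published here), a composite like the Coding score can be thought of as a weighted average of normalized benchmark results:

```python
# Minimal sketch of a weighted composite score, assuming each benchmark
# reports a percentage-style result on a 0-100 scale. The weights and
# benchmark names below are illustrative, not the actual RankedAGI formula.

CODING_WEIGHTS = {
    "swe_bench": 0.4,
    "aider_polyglot": 0.3,
    "livebench_coding": 0.3,
}

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually has scores for.

    Missing benchmarks are skipped and the remaining weights renormalized,
    so a model isn't penalized just because one result hasn't been reported yet.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        raise ValueError("no overlapping benchmarks to score")
    return sum(scores[name] * w for name, w in available.items()) / total_weight

# Example with made-up numbers: one benchmark missing, weights renormalized.
print(composite_score({"swe_bench": 65.0, "livebench_coding": 72.0}, CODING_WEIGHTS))
```

Renormalizing over available benchmarks is just one possible design choice; treating a missing result as zero, or excluding the model entirely, would give different rankings.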
Corrections
I make mistakes. If you spot incorrect data, missing models, or outdated scores, let me know:
@tavlean