Sources and Methodology

Where RankedAGI's model data comes from, how benchmarks are selected, and how composite scores should be interpreted.

Data collection

I collect benchmark results manually from primary public sources. That usually means official model-provider blogs, papers, technical reports, release posts, model cards, and benchmark leaderboards. Benchmark platforms include LiveBench, Arena, Aider, SWE-Bench, Terminal-Bench, MathArena, and similar public result pages.

RankedAGI does not own the underlying benchmark data. The site organizes public results for comparison, research, and discovery.

Benchmark inclusion criteria

Benchmarks are included when they help answer practical model-comparison questions and have publicly checkable results. I prioritize current evaluations for coding, reasoning, math, general preference, multimodal understanding, and agentic tasks.

Older benchmarks may be hidden or de-emphasized when they become saturated, superseded, hard to compare across releases, or less useful for ranking current frontier models.

Benchmark index

The public table columns are generated from benchmark metadata. Similar names can represent different evaluations or releases, so the subtitle and description matter. For example, MMLU and MMLU-Pro are separate benchmarks, not interchangeable aliases.

MMLU-Pro is tracked separately from standard MMLU when usable public scores are available. If a harder or pro benchmark variant is not shown for a model, it means RankedAGI does not currently have a usable public score recorded for that model on that benchmark.
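A minimal sketch of that distinction, assuming a hypothetical in-memory score table. The model names, benchmark keys, and values below are made up for illustration and are not RankedAGI's data model. The point is that a missing score stays missing (None) instead of being treated as 0, and that MMLU and MMLU-Pro are stored under separate keys.

```python
from typing import Optional

# Hypothetical score table: model -> benchmark -> score.
# An absent benchmark key means "no usable public score recorded",
# which is different from a recorded score of 0.
SCORES: dict[str, dict[str, float]] = {
    "example-model-a": {"mmlu": 88.0, "mmlu-pro": 74.5},
    "example-model-b": {"mmlu": 85.2},  # no MMLU-Pro score recorded
}


def lookup_score(model: str, benchmark: str) -> Optional[float]:
    """Return the recorded score, or None when nothing is recorded."""
    return SCORES.get(model, {}).get(benchmark)


def display_cell(score: Optional[float]) -> str:
    """Render a table cell; a missing score shows as a dash, not as 0."""
    return "-" if score is None else f"{score:.1f}"
```

Under these assumptions, display_cell(lookup_score("example-model-b", "mmlu-pro")) renders an empty-looking cell rather than a zero, so a gap in public data cannot be misread as a poor result.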

Coding

Code RankedAGI

RankedAGI Coding Score

SWEBench Pro

Diverse Agentic Coding

SWEBench Multilingual

SWEBench Verified

Agentic Coding

Source

Terminal Bench 2.1

Agentic Terminal Coding

Source

Terminal Bench 2.0

Agentic Terminal Coding

Source

Code Arena

Code Arena Elo Score

Source

DeepSWE

Measuring frontier coding agents on original, long-horizon engineering tasks

Source

Svelte Bench

SvelteBench - Benchmark for Svelte

Source

Agents

Agentic RankedAGI

RankedAGI Agentic Score

Browse Comp

A benchmark for browsing agents

Source

OSWorld Verified

Vending Bench 2

Benchmark for measuring AI model performance on running a business over long time horizons. Models are tasked with running a simulated vending machine business over a year and scored on their bank account balance at the end.

Source

𝜏²-Bench Telecom

Agentic Tool Use

GDPval AA

Office Tasks (Artificial Analysis)

Safety

Cyber Gym

Reasoning

Reason RankedAGI

RankedAGI Reasoning Score

HLE

Multidisciplinary Reasoning (no tools)

HLE w/ Tools

Multidisciplinary Reasoning (with tools)

GPQA Diamond

Graduate-Level Google-Proof Q&A (PhD-level reasoning)

NYT Connections

NYT Connections Extended Version

Source

ARC AGI 2.0

Abstract Reasoning Puzzles (Public)

General

Text Arena

Chatbot Arena (LMSYS) Elo Score

Source

RAGI RankedAGI

Overall RankedAGI score

Math

AIME 2026

AIME 2026 Competition Math

Source

Imaging

MMMU

Multimodal Understanding: college-level visual problem-solving

MMMU Pro

Multimodal Understanding

MMMU Pro w/ Tools

Design

Composite scores and simulated data

The RankedAGI composite benchmarks for Coding, Agentic, Reasoning, Math, and Overall, together with the simulated benchmark data that makes them accurate despite sparse public coverage, are documented in detail on The RankedAGI Engine.

Thinking model variants

For older model families with explicit thinking and non-thinking variants, RankedAGI records separate model entries when public benchmark data distinguishes those modes. If a benchmark source publishes a thinking-mode score, that score is recorded for the relevant thinking variant; if it publishes a non-thinking or standard-mode score, that score is recorded separately when the model naming makes the distinction clear.

RankedAGI records the public benchmark result as reported by the source. Reasoning traces are not scored separately unless the benchmark or source itself publishes a separate trace-level score.

Human preference benchmark uncertainty

Arena-style human preference benchmarks are useful signals, but they are not a top-priority input in composite scores, and a weak Arena result does not act as a dominant penalty. Their contribution is weighted by practical relevance and by how well the benchmark appears to translate to real-world model quality.
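To make the weighting idea concrete, here is a minimal sketch of a weighted mean in which a preference benchmark counts for less than task benchmarks. This is an assumption-laden illustration, not the RankedAGI formula: the benchmark keys, the weights, and the premise that scores are already normalized to a common 0-100 scale are all hypothetical.

```python
from typing import Optional

# Hypothetical weights, NOT RankedAGI's actual values. The only property
# illustrated is that the preference benchmark gets a smaller weight.
WEIGHTS: dict[str, float] = {
    "swe_bench_verified": 1.0,
    "terminal_bench": 1.0,
    "code_arena": 0.5,  # Arena-style human preference signal, down-weighted
}


def weighted_composite(
    scores: dict[str, float], weights: dict[str, float]
) -> Optional[float]:
    """Weighted mean over the benchmarks that have a score.

    Scores are assumed to be normalized to a common 0-100 scale already.
    Returns None when no weighted benchmark has a score.
    """
    used = [(scores[name], w) for name, w in weights.items() if name in scores]
    total_weight = sum(w for _, w in used)
    if total_weight == 0:
        return None
    return sum(s * w for s, w in used) / total_weight
```

With scores of 80, 60, and 40 on the three benchmarks, the sketch yields (80 + 60 + 0.5 * 40) / 2.5 = 64, whereas an unweighted mean would give 60. Dividing by the weight actually used also keeps a missing benchmark from dragging the result toward zero.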

RankedAGI checks public Arena-style leaderboards weekly and updates changed scores when the source updates. Confidence intervals and sample sizes are not independently recalculated unless the source publishes them in a way that can be represented consistently on the site.

Model update frequency

New public model data is generally added within 24 hours of release, often within 5 to 6 hours when the release and benchmark sources are clear. Each model carries a last-updated value so freshness is visible on the site.

Model versioning

Models are listed under their public names, which can change over time. Duplicate-looking names can exist when a provider releases a new dated build, preview, thinking variant, or API-visible version with different benchmark behavior.

A suffix such as -old means the row represents an earlier model entry kept for comparison or historical continuity, not the preferred current listing. The older pattern was to move a replaced same-name model to an -old slug and keep the newer release at the main slug.

The newer standard is to add a date or version marker to the slug when it avoids ambiguity. A parameter-count or size marker is added for open-model families that release multiple sizes under the same model name. If a model has only one relevant size and there is no naming ambiguity, the size is usually omitted from the slug.
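The naming rules above can be sketched as a small slug builder. The function name make_slug, its parameters, and the exact character rules are hypothetical; the site's real slug format may differ in ordering and punctuation.

```python
import re
from typing import Optional


def make_slug(
    name: str, version: Optional[str] = None, size: Optional[str] = None
) -> str:
    """Build a lowercase, hyphen-separated slug from a model name.

    version: optional date or version marker, e.g. "2025-05-14"
    size:    optional parameter-count marker, e.g. "236B"
    Both markers are omitted when they are not needed to avoid ambiguity.
    """
    parts = [name]
    if size:
        parts.append(size)
    if version:
        parts.append(version)
    slug = "-".join(parts).lower()
    # Collapse any run of characters other than a-z, 0-9, and "." into one hyphen.
    slug = re.sub(r"[^a-z0-9.]+", "-", slug)
    return slug.strip("-")
```

For example, make_slug("DeepSeek 2.5") gives "deepseek-2.5", while make_slug("DeepSeek 2.5", size="236B") gives "deepseek-2.5-236b", so two sizes released under one name end up at distinct slugs.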

Context window sourcing

Context-window values are provider- or developer-reported unless otherwise marked. They may represent API limits, product limits, or documented model-card limits depending on what the source publishes.

Custom comparison behavior

Users can compare models by filtering the model table and showing or hiding benchmark columns. These controls change the view in the browser; they do not change the underlying public data.

Corrections and revisions

When source data changes or errors are found, RankedAGI updates the current public data rather than preserving a separate public revision log for every score. Per-score source attribution is planned as a larger provenance/data-model project.

Model pages usually include one or more global source links, such as an official release blog, model card, Hugging Face page, or public announcement. For third-party benchmarks, data is collected from the official leaderboard or result page for that benchmark, and many of those benchmark pages are linked from the benchmark index above.

Data access FAQ

Is there a public API for RankedAGI model data?

No. RankedAGI does not currently provide a public API for programmatic access to model data. The public website is the supported interface for browsing, filtering, and comparing models.

What are the API rate limits?

There are no API rate limits because there is no public API. The site is static and crawlable through normal public pages, robots.txt, llms.txt, and the sitemap.
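For readers who crawl the public pages, here is a minimal sketch of checking robots.txt rules and listing sitemap URLs with the Python standard library. The file contents and the example.com URLs are placeholders for illustration; they are not copies of RankedAGI's actual robots.txt or sitemap.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Sample file contents for illustration only; the real site's files may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/methodology</loc></url>
</urlset>
"""

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]
```

A real crawler would download both files first and should still pace its requests politely, since the absence of API rate limits does not mean the pages are meant for aggressive scraping.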

Can I export the model table as CSV or JSON?

Not currently. RankedAGI does not yet offer CSV or JSON export for the full comparison table. Users can filter models and show or hide benchmark columns in the browser, but downloadable exports are not part of the current public site.

Are entries like DeepSeek 2.5, DeepSeek 2.5-old, and DeepSeek 2.5-236B-old duplicates?

They are separate RankedAGI entries used to preserve release or size distinctions. A -old suffix marks an earlier entry kept for continuity, while a size marker such as 236B identifies a specific size of an open-weight model when the same family has multiple sizes or naming would otherwise be ambiguous.

Do individual benchmark scores show provider versus independent source type?

Not yet at the score level. Current model pages usually include model-level source links, and benchmark-specific third-party scores are collected from official benchmark leaderboards. Per-score source attribution is planned so each benchmark value can point to the exact provider, independent lab, model card, or leaderboard source used for that value.

Planned data improvements

A price-performance view is in progress so users can compare capability against API pricing without calculating every ratio manually.
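Until that view exists, the ratio can be computed by hand. The sketch below shows one possible definition, score divided by price, and is not necessarily the metric the planned view will use. The ModelRow type, the model names, the scores, and the prices are all hypothetical.

```python
from typing import NamedTuple


class ModelRow(NamedTuple):
    name: str
    score: float         # composite or single-benchmark score
    usd_per_mtok: float  # hypothetical API price per million tokens


def score_per_dollar(row: ModelRow) -> float:
    """Score points per dollar of API price (per million tokens)."""
    if row.usd_per_mtok <= 0:
        raise ValueError("price must be positive to compute a ratio")
    return row.score / row.usd_per_mtok


def rank_by_value(rows: list[ModelRow]) -> list[str]:
    """Model names ordered from best to worst score-per-dollar."""
    return [r.name for r in sorted(rows, key=score_per_dollar, reverse=True)]


ROWS = [
    ModelRow("example-large", score=80.0, usd_per_mtok=10.0),  # 8.0 points per dollar
    ModelRow("example-small", score=60.0, usd_per_mtok=2.0),   # 30.0 points per dollar
]
```

Here rank_by_value(ROWS) puts "example-small" first even though its raw score is lower, which is the kind of trade-off a price-performance view is meant to surface.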

Per-score source attribution is also planned. The goal is to link each benchmark value for each model to the specific source used for that value, rather than relying only on model-level and benchmark-level source links.