The RankedAGI Engine

RankedAGI publishes its own composite benchmarks for Coding, Agentic, Reasoning, Math, and an Overall score. The engine behind them turns sparse public benchmark data into a score on which every model can be compared, by carefully estimating the results that no lab has published.

16% of all model-and-benchmark pairs have been benchmarked. The engine estimates the rest, so every model is scored on the same footing.

212 models · 72 benchmarks

Benchmarked: 16% · Simulated: 84%

The five RankedAGI benchmarks

Each composite blends the public benchmarks most relevant to one capability into a single comparable score. Public testing is sparse, so much of each score is completed by simulation, which is exactly why the simulation has to be trustworthy.

Coding: 19% benchmarked
Agentic: 16% benchmarked
Reasoning: 24% benchmarked
Math: 18% benchmarked
Overall: 52% benchmarked

How the engine estimates

To estimate a missing value, the engine combines two views. The first is local: it finds the models most similar to the target model, judged by how closely they agree on the benchmarks they share, and learns from how those models performed on the target benchmark. The second is global: it fits a single statistical model across the entire grid at once, capturing each model's overall strength along with the patterns that make some model families stronger on some kinds of benchmarks than others.

The global view sets a sensible baseline level, so a clearly strong model is never dragged down by weaker neighbours, and the local view then refines that level. Every estimate also carries a confidence: higher when many similar models and a well-covered benchmark support it, lower when the evidence is thin.

Evidence

Actual benchmark results, plus simulated estimates for the gaps.

Normalize and weight

Benchmark results take priority, estimates count for less, and low-confidence estimates are left out.

Composite score

One comparable score per capability, plus an overall.

Details for nerds

The data

Picture a grid with one row per model and one column per benchmark. Only about a sixth of the cells hold an actual result; the engine fills the rest. Benchmarks use different formats such as percentages, rating scales, and currencies, so every value is first mapped into a common normalized space before anything is compared.
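As a rough illustration of what a common normalized space can look like, the sketch below maps raw scores to a per-benchmark percentile and back. This section does not say which mapping the engine uses at this step, so the percentile choice and the function names `to_percentile` and `from_percentile` are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def to_percentile(value, observed):
    """Map a raw score to its mid-rank percentile among one benchmark's observed scores."""
    if not observed:
        raise ValueError("need at least one observed score")
    ranked = sorted(observed)
    below = bisect_left(ranked, value)           # scores strictly below value
    below_or_equal = bisect_right(ranked, value)  # scores at or below value
    return (below + below_or_equal) / (2 * len(ranked))


def from_percentile(p, observed):
    """Approximate inverse: interpolate linearly between sorted observed scores."""
    if not observed:
        raise ValueError("need at least one observed score")
    ranked = sorted(observed)
    pos = min(max(p, 0.0), 1.0) * (len(ranked) - 1)
    i = int(pos)
    if i >= len(ranked) - 1:
        return ranked[-1]
    frac = pos - i
    return ranked[i] + frac * (ranked[i + 1] - ranked[i])


# Usage: a percentage benchmark and a rating-scale benchmark land in the same space.
accuracy_scores = [50, 60, 70, 80]
rating_scores = [1000, 1100, 1200, 1300]
print(to_percentile(70, accuracy_scores))   # third of four scores
print(to_percentile(1200, rating_scores))   # same rank, same percentile
```

The two functions are not exact inverses of each other (mid-rank forward, linear interpolation back); the point is only that scores on different scales become comparable once each is expressed relative to its own benchmark.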

Two estimators

Local, from the nearest models. For a missing cell, the engine ranks every other model by how closely it agrees with the target model on the benchmarks they both have. The closest models, weighted by agreement, effectively vote on the target benchmark, with a signed correction for how much stronger or weaker the target tends to be. This is sharpest where a model has many results and the benchmark is widely covered.
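A minimal sketch of the local idea, not the engine's actual code: the agreement measure (inverse of mean absolute difference on shared benchmarks), the neighbour count `k`, and the function name `local_estimate` are all assumptions chosen to keep the example short.

```python
def local_estimate(grid, target_model, target_bench, k=3):
    """Estimate grid[target_model][target_bench] from the k most similar models.

    grid maps model name -> {benchmark name: normalized score}.
    Returns None when no other model has both the target benchmark
    and at least one benchmark in common with the target model.
    """
    target = grid[target_model]
    votes = []
    for other, scores in grid.items():
        if other == target_model or target_bench not in scores:
            continue
        shared = [b for b in scores if b in target and b != target_bench]
        if not shared:
            continue
        diffs = [target[b] - scores[b] for b in shared]
        # Signed correction: how much stronger (+) or weaker (-) the target tends to be.
        offset = sum(diffs) / len(diffs)
        # Assumed agreement measure: closer shared scores give a larger weight.
        distance = sum(abs(d) for d in diffs) / len(diffs)
        agreement = 1.0 / (1.0 + distance)
        votes.append((agreement, scores[target_bench] + offset))
    if not votes:
        return None
    nearest = sorted(votes, reverse=True)[:k]
    total = sum(weight for weight, _ in nearest)
    return sum(weight * vote for weight, vote in nearest) / total


# Usage: model "A" has no result on "b3"; "B" tracks it closely, "C" does not.
grid = {
    "A": {"b1": 0.8, "b2": 0.6},
    "B": {"b1": 0.7, "b2": 0.5, "b3": 0.4},
    "C": {"b1": 0.2, "b2": 0.9, "b3": 0.1},
}
print(local_estimate(grid, "A", "b3"))
```

Because "B" agrees with "A" far more closely than "C" does, its corrected vote carries more weight in the result.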

Global, from matrix completion. Separately, a single statistical model is fit to the whole grid at once. Writing z for a normalized score, it learns z[model, benchmark] ≈ bias[model] + θ[model] · V[benchmark]. The bias term is the model's general strength and is barely regularized, so even a thinly tested model gets a sensible level. The short vectors θ and V capture interactions through their dot product, for example that one family tends to over-perform on a particular kind of benchmark; they are regularized more strongly, so a model with little data cannot be thrown to an extreme. The fit uses alternating least squares with a fixed seed and a fixed number of passes, so the output is fully reproducible. It works in per-benchmark percentile space and maps predictions back through an empirical-quantile inverse, which handles saturated and bimodal benchmarks without per-benchmark tuning. A final pass adds a small correction learned from the residuals of the global fit.
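The shape of such a fit can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: θ and V are reduced to single numbers (rank 1) rather than short vectors, the regularization strengths `reg_bias` and `reg_factor` are invented values, and the percentile transform and the final residual correction are left out. The function names are hypothetical.

```python
import random


def fit_global(cells, n_models, n_benchmarks, passes=30,
               reg_bias=0.01, reg_factor=0.5, seed=0):
    """Fit z[m, b] ~ bias[m] + theta[m] * v[b] by alternating ridge updates.

    cells is a list of (model_index, benchmark_index, normalized_score).
    """
    rng = random.Random(seed)  # fixed seed: identical output on every run
    bias = [0.0] * n_models
    theta = [rng.uniform(-0.1, 0.1) for _ in range(n_models)]
    v = [rng.uniform(-0.1, 0.1) for _ in range(n_benchmarks)]

    by_model = [[] for _ in range(n_models)]
    by_bench = [[] for _ in range(n_benchmarks)]
    for m, b, z in cells:
        by_model[m].append((b, z))
        by_bench[b].append((m, z))

    for _ in range(passes):  # fixed number of passes, no convergence test
        for m, row in enumerate(by_model):
            if not row:
                continue
            # Lightly regularized bias: the model's general level.
            bias[m] = sum(z - theta[m] * v[b] for b, z in row) / (len(row) + reg_bias)
            # More strongly regularized interaction term for the model.
            theta[m] = (sum(v[b] * (z - bias[m]) for b, z in row)
                        / (sum(v[b] ** 2 for b, _ in row) + reg_factor))
        for b, col in enumerate(by_bench):
            if not col:
                continue
            # More strongly regularized interaction term for the benchmark.
            v[b] = (sum(theta[m] * (z - bias[m]) for m, z in col)
                    / (sum(theta[m] ** 2 for m, _ in col) + reg_factor))
    return bias, theta, v


def predict(bias, theta, v, m, b):
    """Estimated normalized score for model m on benchmark b."""
    return bias[m] + theta[m] * v[b]


# Usage: model 2 has a single result, yet still gets a sensible level on benchmark 1.
cells = [(0, 0, 0.9), (0, 1, 0.8), (1, 0, 0.3), (1, 1, 0.2), (2, 0, 0.6)]
bias, theta, v = fit_global(cells, n_models=3, n_benchmarks=2)
print(predict(bias, theta, v, 2, 1))
```

Each update is the closed-form minimizer of the regularized squared error for one parameter with the others held fixed, which is what makes the alternating scheme cheap and deterministic.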

Combining them

The global view sets the level and the local view sharpens it. The blend leans on the local view when a benchmark is widely covered and on the global view when it is sparse. A floor stops a high-strength model from being pulled below its global estimate. That was the weakness of the older nearest-models-only method, in which weaker neighbours dragged strong models down.
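One way to picture the blend and the floor is the sketch below. The linear weighting by coverage, the `strong_cutoff` value, and the function name `blend` are assumptions for illustration; the text above does not give the actual weighting rule or the definition of a high-strength model.

```python
def blend(local_est, global_est, coverage, strength, strong_cutoff=0.8):
    """Combine the two estimates for one missing cell.

    coverage: assumed to be the fraction of models with a real result
              on the target benchmark, between 0 and 1.
    strength: the model's general level from the global fit, between 0 and 1.
    """
    if local_est is None:
        # No usable neighbours: fall back to the global view alone.
        return global_est
    w_local = min(max(coverage, 0.0), 1.0)
    combined = w_local * local_est + (1.0 - w_local) * global_est
    if strength >= strong_cutoff:
        # Floor: a strong model is never pulled below its global estimate.
        combined = max(combined, global_est)
    return combined


# Usage: weaker neighbours suggest 0.4, but the strong model keeps its global 0.9.
print(blend(local_est=0.4, global_est=0.9, coverage=0.5, strength=0.95))
# A mid-strength model is blended normally.
print(blend(local_est=0.4, global_est=0.6, coverage=0.5, strength=0.3))
```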

Confidence and how estimates enter a score

Each estimate gets a confidence between 0 and 1 from how much evidence supports it: how many similar models exist, how well the target benchmark is covered, and how cleanly the model fits. Confidence then governs how the estimate is used. Each source is normalized to a 0 to 1 score; an actual result has trust 1.0, while a simulated value must clear a confidence of 0.35 and is used with trust = confidence × 0.75, so a real result always counts for more than an estimate. The composite then combines its sources:

signal_weight    = (source_relevance / 50) ^ 1.25
effective_weight = signal_weight * trust * sampling_reliability

family_score     = sum(source_score * effective_weight) / sum(effective_weight)
family_evidence  = min(sum(effective_weight), strongest_source_weight * 1.2)

observed_score   = sum(family_score * family_evidence)
evidence         = sum(family_evidence)
final_score      = (prior_score * prior_weight + observed_score) / (prior_weight + evidence)

The family cap stops several versions of one benchmark from counting as if they were independent evidence. In the leaderboard, each simulated value shows its confidence as a colored dot (green, amber, or red), and anything below the 0.35 threshold is left out of scoring entirely.
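The formulas above translate almost line for line into code. The sketch below is an illustration, not the production scorer: the dictionary layout of a source, the reading of strongest_source_weight as the largest effective weight in a family, and the caller-supplied `prior_score` and `prior_weight` are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.35
SIMULATED_TRUST_FACTOR = 0.75
FAMILY_CAP = 1.2


def trust(is_actual, confidence=1.0):
    """Trust of one source: 1.0 for real results, discounted for estimates."""
    if is_actual:
        return 1.0
    if confidence < CONFIDENCE_THRESHOLD:
        return 0.0  # below the threshold: left out of scoring
    return confidence * SIMULATED_TRUST_FACTOR


def composite(families, prior_score, prior_weight):
    """families: list of families, each a list of source dicts with keys
    score (0 to 1), relevance, is_actual, confidence, sampling_reliability."""
    observed_score = 0.0
    evidence = 0.0
    for sources in families:
        weighted = []
        for s in sources:
            signal_weight = (s["relevance"] / 50) ** 1.25
            effective_weight = (signal_weight
                                * trust(s["is_actual"], s.get("confidence", 1.0))
                                * s.get("sampling_reliability", 1.0))
            if effective_weight > 0:
                weighted.append((s["score"], effective_weight))
        if not weighted:
            continue
        total = sum(w for _, w in weighted)
        family_score = sum(score * w for score, w in weighted) / total
        # Cap: variants of one benchmark cannot add up to unlimited evidence.
        strongest = max(w for _, w in weighted)
        family_evidence = min(total, strongest * FAMILY_CAP)
        observed_score += family_score * family_evidence
        evidence += family_evidence
    if prior_weight + evidence == 0:
        return prior_score
    return (prior_score * prior_weight + observed_score) / (prior_weight + evidence)


# Usage: two variants of the same benchmark count as 1.2 sources, not 2.
variant = {"score": 0.8, "relevance": 50, "is_actual": True}
print(composite([[variant, dict(variant)]], prior_score=0.5, prior_weight=1.0))
```

With the cap, the two identical variants contribute evidence of 1.2 instead of 2.0, so the result moves less far from the prior than two independent benchmarks would move it.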

How we know it works

Accuracy is measured by hiding values we already have and predicting them back, across several patterns: a few values hidden, many values hidden, a whole benchmark thinned out, a single model thinned out, and the hardest case, hiding a benchmark's top scorer to see whether the engine can still place it near the top. A change ships only when it lowers these held-out errors. Actual results always override estimates, a simulated value is never used to generate another, and the full set of estimates regenerates deterministically whenever the data changes.
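The hide-and-predict procedure can be sketched as follows. This is a minimal illustration, assuming mean absolute error as the metric and a random mask for the simple case; the function names, the `predict_fn` interface, and the error metric are assumptions, since the text does not specify them.

```python
import random


def holdout_error(cells, predict_fn, hide_fraction=0.2, seed=0):
    """Hide a random share of known cells, predict them back, return mean absolute error.

    cells maps (model, benchmark) -> normalized score.
    predict_fn(visible_cells, model, benchmark) returns an estimate or None.
    """
    rng = random.Random(seed)  # fixed seed so the same cells are hidden every run
    keys = sorted(cells)
    n_hidden = max(1, int(len(keys) * hide_fraction))
    hidden = set(rng.sample(keys, n_hidden))
    visible = {k: v for k, v in cells.items() if k not in hidden}
    errors = []
    for model, bench in sorted(hidden):
        estimate = predict_fn(visible, model, bench)
        if estimate is not None:
            errors.append(abs(estimate - cells[(model, bench)]))
    return sum(errors) / len(errors) if errors else None


def hide_top_scorer(cells, bench):
    """The hardest case: remove the best known result on one benchmark."""
    column = {k: v for k, v in cells.items() if k[1] == bench}
    top = max(column, key=column.get)
    visible = {k: v for k, v in cells.items() if k != top}
    return top, visible


# Usage with a trivial baseline predictor: the mean of the visible column.
def column_mean(visible, model, bench):
    values = [v for (_, b), v in visible.items() if b == bench]
    return sum(values) / len(values) if values else None


cells = {("m1", "b1"): 0.9, ("m1", "b2"): 0.8, ("m2", "b1"): 0.4,
         ("m2", "b2"): 0.3, ("m3", "b1"): 0.6}
print(holdout_error(cells, column_mean))
print(hide_top_scorer(cells, "b1")[0])
```

Comparing the error of a candidate predictor against a baseline like `column_mean` on the same hidden cells is what makes a rule such as "ship only when held-out error drops" checkable.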

How we keep it honest

Estimates are validated the way you would test any prediction. We hide a result RankedAGI already has, ask the engine to guess it, and compare against the truth, across thousands of held-out results, including the hardest case of recovering a benchmark's very top score. Changes ship only when they measurably improve accuracy.

It remains an estimate. It is most reliable for models with a solid base of results and for benchmarks that behave like general capability, and least reliable for brand-new models with little data or for unusual benchmarks. That is why a confidence value travels with every estimate, real results always outweigh estimates, and missing data is never read as a zero.

FAQ

What is simulated benchmark data?

A simulated value is RankedAGI's best estimate of how a model would score on a benchmark it has not been publicly tested on, inferred from the results it does have. It is always shown as an estimate and never presented as an actual benchmark result.

Why estimate at all? Why not just leave gaps?

No lab tests every model on every benchmark. Across everything we track, only about a sixth of model-and-benchmark pairings have an actual result. A composite built only on whatever happens to be published would swing wildly depending on which benchmarks a model was tested on. Estimating the gaps lets every model be compared on the same footing.

Does simulated data inflate scores?

No. Benchmark results always take priority over estimates, estimates carry less weight, and low-confidence estimates are left out of scoring entirely. A model cannot be pushed above what its results on related benchmarks support.

Can a model rank first on simulated data alone?

Not on thin air. The engine places a model using all of its benchmark results at once, so a strong estimate has to be earned by genuinely strong performance on related benchmarks. A model with few results carries low confidence and contributes weakly.

How accurate is it?

We test it directly. We hide a result we already have, ask the engine to predict it, and compare the prediction against the truth, across thousands of held-out values. The current engine beats the previous approach on every test slice while covering more of the grid. It is a careful estimate rather than an oracle, so a confidence value travels with every estimate.

Why does a value look faint or italic in the table?

That styling marks a simulated value, with a confidence dot beside it. Actual benchmark results render normally. Simulated cells appear only when you turn on the simulated-data toggle.

Is this just made-up data?

No. Estimates are derived only from actual benchmark results, and they are never written back into the model database as benchmark results. A simulated value is never used as input to generate another simulated value.

How often is it refreshed?

The estimates are regenerated whenever the underlying data changes, such as new models, new benchmark results, or scoring updates, so they stay consistent with the latest evidence.