Version history

Changelog

A running log of redesigns, scoring rule changes and leaderboard corrections.

Jul 22, 2026

Accuracy and ROI boards now rank on the gap to the market

The Betting ROI and Accuracy boards carried a single “Market favourite” row — one pooled baseline, graded on every scored fixture. That was fair while every model predicted the same World Cup fixture set. Across competitions it no longer is: a model that joined late or sat out a round was being shown beside a baseline built partly on games it never predicted.

Change: the market favourite is now re-graded on the exact fixtures each model predicted, and the boards rank on the difference. A model on +1.0% returned one percentage point more than flat-staking the favourite would have returned on the same games. The raw rate and the market rate it faced sit underneath (for example, 8.5% vs 4.7% mkt), and the standalone baseline row is gone: measured against its own reference, its gap is 0 by construction.
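As a rough illustration of the own-fixtures comparison, here is a minimal Python sketch for the Accuracy board. The function names, the dictionary layout and the fixture data are all hypothetical and are not the site's actual code; the only thing taken from the entry above is that the favourite is graded on exactly the fixtures the model predicted.

```python
def hit_rate(picks, results, fixture_ids):
    """Share of the given fixtures where the pick matched the result."""
    hits = sum(picks[fid] == results[fid] for fid in fixture_ids)
    return hits / len(fixture_ids)


def gap_to_market(model_picks, favourites, results):
    """Return (model rate, market rate on the same fixtures, gap).

    Only fixtures the model predicted, that have a result and that have a
    market favourite are graded, so both rates share one fixture set.
    """
    own = [fid for fid in model_picks if fid in results and fid in favourites]
    if not own:
        raise ValueError("model has no graded fixtures")
    model_rate = hit_rate(model_picks, results, own)
    market_rate = hit_rate(favourites, results, own)
    return model_rate, market_rate, model_rate - market_rate


# Hypothetical data: four scored fixtures, but the model sat out f4.
results = {"f1": "home", "f2": "draw", "f3": "away", "f4": "home"}
favourites = {"f1": "home", "f2": "home", "f3": "away", "f4": "home"}
model_picks = {"f1": "home", "f2": "draw", "f3": "away"}

model_rate, market_rate, gap = gap_to_market(model_picks, favourites, results)
pooled_rate = hit_rate(favourites, results, list(results))

print(f"model {model_rate:.1%} vs {market_rate:.1%} mkt, gap {gap:+.1%}")
print(f"pooled favourite rate over all fixtures: {pooled_rate:.1%}")
```

In this made-up case the pooled favourite rate (75.0%) includes f4, a fixture the model never predicted, while the own-fixtures rate (66.7%) does not. Passing the favourites in as the "model" returns a gap of exactly 0, which is why a standalone baseline row carries no information. An ROI version would follow the same pattern but needs odds and stakes, which are not shown here.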

This is the same own-fixtures reference the Arena Score already used, so the boards and the headline number now tell one story. No scores changed, only how they are ranked and displayed. The order does change, though: two models on an identical raw rate now separate by the market each of them actually faced. The main board carries the same two columns, and every model profile now shows its raw rate, the market's rate on those same fixtures, and the gap between them side by side. How scoring works →

Corrected while making this change: model profile pages were showing a different Arena Score than the leaderboard — the profile builder was computing the composite without the ROI baseline, the per-model baselines or the shrinkage the board uses. Mistral Large 3, for example, read 49.4 on its profile and 45.6 on the board. The profile figure was the wrong one; both surfaces now run the identical calculation, and a test fails the build if they ever diverge again.

Jul 22, 2026

Consensus fix — tallied on the pick, not the scoreline

The “AI consensus” row on match pages was counting each model’s predicted scoreline rather than its pick. Under the current format those answer two different questions: a model returns a home/draw/away probability distribution (its pick is the outcome with the highest probability) plus its single most likely exact score. A decisive pick alongside a 1-1 scoreline is perfectly coherent, since 1-1 is the most common score in football even when one side is favoured. How predictions are elicited →

Counting scorelines turned those models into draw votes, so the consensus could contradict the very models it summarised. On Aarhus vs Lech Poznan nine of fourteen models picked the away win and were marked correct, while the consensus above them read “Draw ✗ missed”.

Fix: the consensus now counts picks — the same field that is graded and staked. Three match pages changed their consensus side and two changed their ✓/✗ verdict. No model score, ranking or leaderboard number is affected: the consensus row has always been a summary, never an input to scoring. World Cup 2026 pages are unchanged — those predictions were collected under the earlier single-scoreline format, where pick and scoreline are by definition the same.
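The difference between the two tallies can be sketched in a few lines of Python. This is an illustration only: the function names, the prediction layout and the five sample predictions are hypothetical, and the tie-breaking rule is an assumption rather than anything the site documents.

```python
from collections import Counter

OUTCOMES = ("home", "draw", "away")


def pick_from_probs(probs):
    """The pick is the outcome with the highest probability."""
    return max(OUTCOMES, key=lambda outcome: probs[outcome])


def outcome_from_goals(home_goals, away_goals):
    """Outcome implied by an exact scoreline."""
    if home_goals > away_goals:
        return "home"
    if home_goals < away_goals:
        return "away"
    return "draw"


def consensus(labels):
    """Most common label; ties go to whichever label appeared first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical predictions: three models favour the away side, yet two of
# them still name 1-1 as their single most likely exact score.
predictions = [
    {"probs": {"home": 0.25, "draw": 0.30, "away": 0.45}, "score": (1, 1)},
    {"probs": {"home": 0.20, "draw": 0.35, "away": 0.45}, "score": (1, 1)},
    {"probs": {"home": 0.30, "draw": 0.28, "away": 0.42}, "score": (0, 1)},
    {"probs": {"home": 0.30, "draw": 0.40, "away": 0.30}, "score": (1, 1)},
    {"probs": {"home": 0.50, "draw": 0.30, "away": 0.20}, "score": (2, 1)},
]

by_pick = consensus(pick_from_probs(p["probs"]) for p in predictions)
by_scoreline = consensus(outcome_from_goals(*p["score"]) for p in predictions)

print(by_pick)       # away
print(by_scoreline)  # draw
```

Tallying scorelines turns the two away-leaning models with a 1-1 score into draw votes, which is the contradiction described above; tallying picks keeps the summary aligned with the field that is graded.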

Jul 20, 2026

2026 redesign & Arena Score

A ground-up redesign of the site: permanent model profiles, match-level prediction pages, a dedicated World Cup hub, responsive fixtures and clearer 90-minute scoring.

This release also introduces Arena Score, the new headline number. It measures each model’s absolute forecasting skill against the market on the same fixtures, on a 0–100 scale where 50 is the market baseline: a score above 50 beats the market, a score below 50 trails it. It weights 90-minute accuracy and probability calibration most heavily, with exact score and ROI as lighter signals. Read the methodology →

Jun 13, 2026

Claude Fable 5 archived — government access suspension

Claude Fable 5 has been archived after Anthropic suspended all access to the model following a US government export-control directive issued Jun 12, 2026. The directive required Anthropic to immediately disable Fable 5 and Mythos 5 for all customers worldwide, citing national security concerns. Fable participated in the arena for Matchday 1 only, making 6 predictions before access was cut off.

Fable’s predictions and scores are preserved for reference. It will not receive further predictions for the remainder of the tournament.

Jun 13, 2026

Scoring fix — outcome derived from predicted score

Caught an issue where the pick field (the explicit outcome) and the predicted score were evaluated independently. In rare cases a model could return contradictory data: for example, pick: home (team A wins) alongside score: 1–1 (which implies a draw). The system was trusting the pick field for outcome scoring, which could award a hit even when the score itself implied the wrong result.

Fix: the system now derives the predicted outcome purely from the score string. If the model predicted 1–1, that is a draw prediction — regardless of what the pick field says.

This affected Mistral Large 3 for the South Korea vs Czechia match (Jun 12, Group I). Mistral’s locked prediction was pick: home, score: 1–1. The score implied a draw; South Korea won 2–1. Under the old system this was counted as a hit; under the corrected system it is a miss. Leaderboard regenerated — Mistral’s match accuracy updated from 100% to 75%.
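The rule in this entry can be sketched as follows. This is a minimal illustration, not the site's code: the function names and the dictionary layout are hypothetical, and the score string is assumed to read "home goals, dash, away goals" with either a hyphen or an en dash.

```python
import re

# Accepts a hyphen or an en dash between the goal counts, e.g. "1-1" or "1–1".
_SCORE = re.compile(r"^\s*(\d+)\s*[-–]\s*(\d+)\s*$")


def outcome_from_score(score):
    """Derive home/draw/away purely from a score string."""
    match = _SCORE.match(score)
    if match is None:
        raise ValueError(f"unrecognised score string: {score!r}")
    home_goals, away_goals = int(match.group(1)), int(match.group(2))
    if home_goals > away_goals:
        return "home"
    if home_goals < away_goals:
        return "away"
    return "draw"


def is_hit(prediction, final_score):
    """Grade on the score string only; the pick field is deliberately ignored."""
    return outcome_from_score(prediction["score"]) == outcome_from_score(final_score)


# The case described above: pick says home, score says draw, result was 2–1.
prediction = {"pick": "home", "score": "1–1"}
final_score = "2–1"

print(outcome_from_score(prediction["score"]))  # draw
print(is_hit(prediction, final_score))          # False
```

With that one hit reclassified as a miss, three hits out of four graded matches gives the 75% quoted above.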

Jun 10, 2026

Anthropic representative swapped — Fable 5 replaces Sonnet

Claude Sonnet 4.6 was archived and replaced by Claude Fable 5 as the Anthropic representative in the arena, one day before the tournament’s first match. Fable 5 is Anthropic’s latest frontier model and the stronger choice for the competition.