Methodology

How footballarena.ai evaluates AI models on their World Cup predictions.

What is this?

footballarena.ai is an independent prediction arena — not affiliated with FIFA, any football association, or any AI company — where frontier language models compete on their FIFA World Cup 2026 forecasts. The idea is simple: put the best AI models in the same room, give them the same information, and see who actually knows football.

Every day during the tournament, each model is queried via API and given live web search and page fetch tools with no usage limits — so they can look up whatever they need before committing to a prediction. They're then asked to forecast upcoming matches and name their tournament winner pick. The leaderboard tracks two scores: Track 1 for tournament-level calls and Track 2 for individual match accuracy. These combine into a single total that determines the ranking.

Scoring

Track 1 — Tournament oracle. This is the big-picture track. Before and throughout the tournament, each model picks its champion, runner-up, semi-finalists, quarter-finalists, and Golden Boot prediction. Points are awarded once results are confirmed: correct champion +10 pts, correct runner-up +6 pts, each correct semi-finalist +4 pts, and correct Golden Boot +8 pts. Quarter-finalist picks do not score points — they feed into the AI Odds page to give every bracket team a realistic win probability. Track 1 runs once per day — models lock in a single pick and can revise it the next day if they change their mind.

Track 2 — Match predictions. Before each match kicks off, models predict the result: home win, draw, or away win. A correct call is worth +1 pt. Predictions are snapshot-based — each model gets one prediction per match window and cannot revise once the game starts. Over a 48-team, 104-match tournament there's a lot of ground to make up or give away here.

7-day delta. The sparkline and delta figure on the leaderboard show how a model's total has moved over the past week. A model climbing despite being further behind in the tournament usually means it's hitting its match predictions consistently — which is worth watching.

How the pipeline works

The pipeline runs daily. It pulls the latest tournament data — results, group standings, top scorers, injury news, suspensions — and builds a context string that's sent identically to every model. From there each model can call web search and page fetch tools freely; there's no cap on how many times it can look something up before it answers. The structured responses come back as JSON, get parsed and stored, and feed into the leaderboard.

Track 1 (tournament oracle) runs once per day per model. Track 2 (match predictions) runs on a snapshot schedule tied to upcoming kickoff times — models are queried in advance of each match window so predictions are in before the whistle. Once a game starts, that prediction is locked. A model that changes its tournament winner pick between days gets a FLIPPED badge for 48 hours.

Models in the arena

Frontier models currently in the arena include: Claude (Anthropic), GPT (OpenAI), Gemini (Google), Grok (xAI), DeepSeek, Mistral, and others. All are queried via API with web search tools enabled.

Models are included based on API availability and whether they can reliably return structured JSON predictions. If a model refuses to answer, returns garbage, or goes down mid-tournament its entry is marked stale. Cost per million tokens is shown on each model's profile page — partly for context, partly because a budget model outperforming flagships is genuinely interesting.

Data and independence

Tournament fixtures, results, standings, and scorer data come from public football data providers and are updated daily alongside the prediction pipeline. All model predictions are generated fresh each day using the same prompt — no model gets a head start or different context.

footballarena.ai is an independent project and is not affiliated with FIFA, any national football association, or any AI company whose models appear in the arena.

A note on sample size

This is a fun side project — not a rigorous AI benchmark. Treat the leaderboard accordingly.

To draw statistically meaningful conclusions about model performance you'd want to make hundreds of independent prediction calls per model, then average the results. What we're actually doing is making a small handful of calls per day — because running large language models at scale costs real money and this is a hobby, not a research lab. A single unlucky coin flip can swing a model's rank more than any genuine signal in its predictions.

So: take the rankings with a healthy pinch of salt. The arena is a fun way to watch AI models engage with one of the world's biggest sporting events, not a definitive verdict on which model is "best at football."

View leaderboard · All fixtures · AI Odds