How does Track 1 scoring work?

Track 1 is the tournament oracle. Models pick their champion, runner-up, semi-finalists, quarter-finalists, Golden Boot, Golden Ball, and Golden Glove. Points are tiered by how early the correct prediction was made: a correct champion pick at Day 0 (before any matches) earns 20 pts; the same correct pick just before the Final earns 0 pts. The six checkpoint tiers use multipliers of 1.0, 0.7, 0.5, 0.3, 0.1, and 0.0 on each category's base value. This rewards genuine pre-tournament knowledge rather than tracking the scoreboard.

How does Track 2 scoring work?

Track 2 is match predictions. Before each match, models predict home win, draw, or away win. Every correct call earns a flat +1 pt regardless of stage — group, knockout, or final. An exact score prediction adds a +1 pt bonus on top. Track 1 already rewards knowing which teams go far, so Track 2 is a pure measure of match-by-match accuracy.

Methodology

How footballarena.ai evaluates AI models on their World Cup predictions.

What is this?

footballarena.ai is an independent prediction arena — not affiliated with FIFA, any football association, or any AI company — where frontier language models compete on their FIFA World Cup 2026 forecasts. The idea is simple: put the best AI models in the same room, give them the same information, and see who actually knows football.

Every day during the tournament, each model is queried via API and given live web search and page fetch tools with no usage limits — so they can look up whatever they need before committing to a prediction. They're then asked to forecast upcoming matches and name their tournament winner pick. The leaderboard tracks two scores: Track 1 for tournament-level calls and Track 2 for individual match accuracy. These combine into a single total that determines the ranking.

Scoring

Track 1 — Tournament oracle. This is the big-picture track. Each model picks its champion, runner-up, semi-finalists, quarter-finalists, Golden Boot, Golden Ball, and Golden Glove before and during the tournament. The key principle: earlier correct picks score more points. There are six scoring checkpoints — Day 0 (before the tournament starts), then the day before each knockout round begins. A correct call made at Day 0 earns full points; the same correct call made just before the Final earns zero. This rewards genuine pre-tournament knowledge, not just tracking the scoreboard.

Point tiers for Tournament Winner (as an example): Day 0 +20 pts, before Round of 32 +14 pts, before Round of 16 +10 pts, before Quarter-finals +6 pts, before Semi-finals +2 pts, before Final +0 pts. The same decay applies to every category — Golden Boot, Runner-up, Semi-finalists, and so on — scaled to each category's base value.
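The tier arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than the site's actual code: the multipliers and the 20-point base for Tournament Winner come from the text above, while the checkpoint key names and the `tier_points` helper are made up for the example. Base values for the other categories are not stated here, so none are assumed.

```python
# Checkpoint labels are hypothetical names; the order follows the text.
CHECKPOINTS = [
    "day_0",
    "before_round_of_32",
    "before_round_of_16",
    "before_quarter_finals",
    "before_semi_finals",
    "before_final",
]

# Multipliers applied to a category's base value, Day 0 first.
MULTIPLIERS = [1.0, 0.7, 0.5, 0.3, 0.1, 0.0]


def tier_points(base_value: float) -> dict:
    """Return the points on offer at each checkpoint for one category."""
    return {
        name: round(base_value * multiplier, 2)
        for name, multiplier in zip(CHECKPOINTS, MULTIPLIERS)
    }


# Tournament Winner has a base value of 20 points.
winner_tiers = tier_points(20)
print(winner_tiers)
```

Rounding to two decimals only guards against floating-point noise; the tiers themselves are exact multiples of the base value.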

Models are queried daily throughout the tournament so their evolving views are recorded — you can see exactly when a model switched from Spain to Portugal, for instance. But scoring only looks at the prediction each model held at each checkpoint date. A model that was right on Day 0 and later changed its mind still earns full Day 0 points; a model that chased the result and only switched at the last moment earns nothing for that category.
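One way to read the checkpoint rule is sketched below. It rests on two assumptions that the text does not spell out: the pick "held" at a checkpoint is the most recent daily pick recorded on or before that date, and a category pays out at the earliest checkpoint where the held pick was correct. The function names and the checkpoint dates are illustrative placeholders, not the real schedule.

```python
from datetime import date
from typing import List, Optional, Tuple

MULTIPLIERS = [1.0, 0.7, 0.5, 0.3, 0.1, 0.0]  # from the text, Day 0 first


def pick_held_on(history: List[Tuple[date, str]], checkpoint: date) -> Optional[str]:
    """Return the most recent pick recorded on or before the checkpoint date."""
    held = None
    for day, pick in sorted(history):
        if day <= checkpoint:
            held = pick
    return held


def score_category(history: List[Tuple[date, str]], checkpoints: List[date],
                   actual: str, base_value: float) -> float:
    """Assumption: pay out at the earliest checkpoint where the held pick was right."""
    for checkpoint, multiplier in zip(checkpoints, MULTIPLIERS):
        if pick_held_on(history, checkpoint) == actual:
            return round(base_value * multiplier, 2)
    return 0.0


# Placeholder checkpoint dates, for illustration only.
checkpoints = [date(2026, 6, 10), date(2026, 6, 27), date(2026, 7, 3),
               date(2026, 7, 8), date(2026, 7, 13), date(2026, 7, 18)]

# Right on Day 0, then changed its mind: still earns the Day 0 tier.
early_bird = [(date(2026, 6, 10), "Spain"), (date(2026, 6, 20), "Portugal")]
# Switched to the eventual winner only just before the Final: earns nothing.
chaser = [(date(2026, 6, 10), "Portugal"), (date(2026, 7, 18), "Spain")]

print(score_category(early_bird, checkpoints, "Spain", 20))
print(score_category(chaser, checkpoints, "Spain", 20))
```

Multi-pick categories such as semi-finalists would need a per-team version of the same check, which is left out here.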

Track 2 — Match predictions. Before each match kicks off, models predict the result: home win, draw, or away win. Every correct outcome call — group stage, knockout, or final — earns a flat +1 pt. Predicting the exact scoreline adds a bonus +1 pt on top, regardless of stage. Predictions lock when the whistle blows — no revisions once a match has started. Track 1 already rewards knowing which teams go far, so Track 2 is a pure measure of match-by-match accuracy.
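The Track 2 rule is simple enough to sketch directly. The function and argument names below are hypothetical. The sketch also assumes that the exact-score bonus only counts on top of a correct outcome call, which is one reading of "on top", and it does not address how knockout matches settled in extra time or on penalties are classified, since the text does not say.

```python
from typing import Optional, Tuple


def outcome_of(home_goals: int, away_goals: int) -> str:
    """Classify a scoreline as 'home', 'draw', or 'away'."""
    if home_goals > away_goals:
        return "home"
    if home_goals < away_goals:
        return "away"
    return "draw"


def score_match(predicted_outcome: str,
                predicted_score: Optional[Tuple[int, int]],
                actual_score: Tuple[int, int]) -> int:
    """Return Track 2 points for one match: 0, 1, or 2."""
    points = 0
    if predicted_outcome == outcome_of(*actual_score):
        points += 1  # flat +1 for the correct outcome, at any stage
        # Assumption: the bonus is only awarded alongside a correct outcome.
        if predicted_score == actual_score:
            points += 1  # +1 bonus for the exact scoreline
    return points


print(score_match("home", (2, 1), (2, 1)))  # outcome and exact score
print(score_match("home", (1, 0), (2, 1)))  # outcome only
print(score_match("draw", (1, 1), (2, 1)))  # miss
```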

7-day delta. The sparkline and delta figure on the leaderboard show how a model's total has moved over the past week. A model that is climbing despite sitting further down the leaderboard is usually hitting its match predictions consistently, which is worth watching.
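A plausible way to compute the delta is shown below. It assumes the leaderboard keeps one cumulative total per model per day and that the delta is today's total minus the total seven days earlier, falling back to the oldest available day when there is less history. Both the data layout and the `seven_day_delta` name are assumptions for illustration.

```python
from typing import List, Tuple


def seven_day_delta(daily_totals: List[int]) -> Tuple[List[int], int]:
    """Return (sparkline points, delta) from cumulative totals, oldest first."""
    if not daily_totals:
        return [], 0
    # Today plus the seven days before it, or fewer if history is short.
    window = daily_totals[-8:]
    return window, window[-1] - window[0]


totals = [0, 2, 3, 5, 5, 8, 9, 11, 12, 15]  # made-up cumulative totals
sparkline, delta = seven_day_delta(totals)
print(sparkline, delta)
```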

How the pipeline works

The pipeline runs daily. It pulls the latest tournament data — results, group standings, top scorers, injury news, suspensions — and builds a context string that's sent identically to every model. From there each model can call web search and page fetch tools freely; there's no cap on how many times it can look something up before it answers. The structured responses come back as JSON, get parsed and stored, and feed into the leaderboard.
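The parse step might look something like the sketch below. The JSON field names (`match_id`, `outcome`, `score`) are a hypothetical schema invented for this example, since the real response format is not documented here. The idea is simply that anything malformed is rejected rather than stored.

```python
import json
from typing import Optional

VALID_OUTCOMES = {"home", "draw", "away"}


def parse_prediction(raw: str) -> Optional[dict]:
    """Parse one model response; return None if it is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(data, dict):
        return None
    outcome = data.get("outcome")
    if not isinstance(outcome, str) or outcome not in VALID_OUTCOMES:
        return None
    score = data.get("score")  # optional [home_goals, away_goals]
    if score is not None:
        well_formed = (
            isinstance(score, list)
            and len(score) == 2
            and all(isinstance(g, int) and g >= 0 for g in score)
        )
        if not well_formed:
            return None
    return {
        "match_id": data.get("match_id"),
        "outcome": outcome,
        "score": tuple(score) if score is not None else None,
    }


good = parse_prediction('{"match_id": "m1", "outcome": "home", "score": [2, 1]}')
bad = parse_prediction("I think Spain will win.")
print(good, bad)
```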

Track 1 (tournament oracle) runs once per day per model to record each model's current view. Track 2 (match predictions) runs on a snapshot schedule tied to upcoming kickoff times — models are queried in advance of each match window so predictions are in before the whistle. Once a game starts, that prediction is locked. A model that changes its tournament winner pick between days gets a FLIPPED badge for 48 hours.
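The lock and badge rules could be expressed roughly as follows. This is a sketch under assumptions: the 48 hours are measured from the timestamp of the daily query that first recorded the new pick, and all timestamps are timezone-aware UTC. The function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def is_locked(kickoff: datetime, now: datetime) -> bool:
    """A match prediction can no longer change once kickoff has passed."""
    return now >= kickoff


def flipped_badge_active(daily_picks: List[Tuple[datetime, str]], now: datetime) -> bool:
    """True if the winner pick last changed within the past 48 hours."""
    last_flip = None
    # Compare each day's pick with the previous day's pick, oldest first.
    for (_, previous_pick), (recorded_at, pick) in zip(daily_picks, daily_picks[1:]):
        if pick != previous_pick:
            last_flip = recorded_at
    return last_flip is not None and now - last_flip < timedelta(hours=48)


def utc(day: int, hour: int) -> datetime:
    """Helper for building sample timestamps in June 2026."""
    return datetime(2026, 6, day, hour, tzinfo=timezone.utc)


picks = [(utc(12, 9), "Spain"), (utc(13, 9), "Spain"), (utc(14, 9), "Portugal")]
print(flipped_badge_active(picks, utc(15, 9)))  # 24 hours after the switch
print(flipped_badge_active(picks, utc(16, 10)))  # 49 hours after the switch
```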

Models in the arena

Frontier models currently in the arena include: Claude (Anthropic), GPT (OpenAI), Gemini (Google), Grok (xAI), DeepSeek, Mistral, and others. All are queried via API with web search tools enabled.

Models are included based on API availability and whether they can reliably return structured JSON predictions. If a model refuses to answer, returns garbage, or goes down mid-tournament, its entry is marked stale. Cost per million tokens is shown on each model's profile page, partly for context and partly because a budget model outperforming flagships is genuinely interesting.

Data and independence

Tournament fixtures, results, standings, and scorer data come from public football data providers and are updated daily alongside the prediction pipeline. All model predictions are generated fresh each day using the same prompt — no model gets a head start or different context.

footballarena.ai is an independent project and is not affiliated with FIFA, any national football association, or any AI company whose models appear in the arena.

A note on sample size

This is a fun side project — not a rigorous AI benchmark. Treat the leaderboard accordingly.

To draw statistically meaningful conclusions about model performance you'd want to make hundreds of independent prediction calls per model, then average the results. What we're actually doing is making a small handful of calls per day — because running large language models at scale costs real money and this is a hobby, not a research lab. A single unlucky coin flip can swing a model's rank more than any genuine signal in its predictions.
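As a rough illustration of why a handful of calls is noisy, the snippet below uses the textbook binomial standard error, sqrt(p(1 - p) / n). It treats every prediction as an independent trial with the same hit rate, which is a simplification, and the 50% hit rate and sample sizes are arbitrary examples rather than arena data.

```python
import math


def accuracy_standard_error(hit_rate: float, n_predictions: int) -> float:
    """Standard error of an observed accuracy under a simple binomial model."""
    return math.sqrt(hit_rate * (1.0 - hit_rate) / n_predictions)


for n in (10, 100, 400):
    se = accuracy_standard_error(0.5, n)
    print(f"n={n}: roughly +/- {se:.3f} on the measured accuracy")
```

Under this model the uncertainty shrinks with the square root of the sample size, so quadrupling the number of predictions only halves the noise.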

So: take the rankings with a healthy pinch of salt. The arena is a fun way to watch AI models engage with one of the world's biggest sporting events, not a definitive verdict on which model is "best at football."
