Extended Leaderboard
AI models vs. rule-based reference strategies · FIFA World Cup 2026
The central question is whether frontier LLMs add genuine predictive value beyond simple, rule-based strategies. Baselines are non-AI reference strategies that put the AI scores in context: a model that loses to "Always Home Win" performs worse than a strategy that uses zero domain knowledge.
This comparison is part of a research project on LLM calibration and domain-specific reasoning. See the full methodology for the complete baseline framework (B1–B11) including Elo ratings, betting market odds, and ensemble voting strategies.
AI Models + Baselines
11 models · 1 consensus · 3 baselines
Master ranking: every track combined into one score.
| # | MODEL / STRATEGY | TYPE | GAMES | HIT % | EXACT % | VS BEST BL | OUTCOME PTS | TOTAL PTS |
|---|---|---|---|---|---|---|---|---|
| 1 | Mistral Large 3 (mistralai/mistral-large-2512) | AI | 32 | 63% | 13% | +4% | 20 | 24 |
| 2 | GLM-5.1 (z-ai/glm-5.1) | AI | 32 | 59% | 16% | — | 19 | 24 |
| 3 | Grok 4.3 (x-ai/grok-4.3) | AI | 32 | 56% | 13% | -3% | 18 | 22 |
| 4 | Gemma 4 31B (google/gemma-4-31b-it) | AI | 32 | 56% | 13% | -3% | 18 | 22 |
| 5 | MiMo v2.5-Pro (xiaomi/mimo-v2.5-pro) | AI | 32 | 56% | 9% | -3% | 18 | 21 |
| 6 | AI Consensus: Majority vote across all AI models for each match. | ENSEMBLE | 32 | 56% | 9% | -3% | 18 | 21 |
| 7 | Kimi K2.6 (moonshotai/kimi-k2.6) | AI | 32 | 53% | 13% | -6% | 17 | 21 |
| 8 | Claude Opus 4.8 (anthropic/claude-opus-4-8) | AI | 32 | 53% | 9% | -6% | 17 | 20 |
| 9 | Gemini 3.1 Pro (google/gemini-3.1-pro-preview) | AI | 32 | 53% | 9% | -6% | 17 | 20 |
| 10 | GPT-5.5 High (openai/gpt-5.5) | AI | 32 | 53% | 9% | -6% | 17 | 20 |
| 11 | DeepSeek V4 Pro (deepseek/deepseek-v4-pro) | AI | 32 | 53% | 9% | -6% | 17 | 20 |
| 12 | Squad Value: Picks the team with higher Transfermarkt squad value (frozen at tournament start). | STRUCTURED | 32 | 59% | — | BEST BL | 19 | 19 |
| 13 | Odds Favorite: Picks the outcome with the lowest closing odds (highest implied probability) per match. | MARKET | 32 | 59% | — | BEST BL | 19 | 19 |
| 14 | Gemini 3.5 Flash (google/gemini-3.5-flash) | AI | 32 | 53% | 6% | -6% | 17 | 19 |
| 15 | Always Home Win: Predicts the home team wins every match, regardless of opponent or relative strength. | NAIVE | 32 | 53% | — | — | 17 | 17 |
AI: 32 matches · Baselines: up to 32 results · TOTAL PTS = outcome + exact score bonus
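
The columns above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the project's actual scoring code: it assumes one point per correct outcome and one bonus point per exact score, which appears consistent with the table (for example, 20 outcome points plus 4 exact scores gives 24). The names `summarize`, `vs_best_baseline`, `outcome_pts`, and `exact_bonus` are hypothetical.

```python
from typing import Dict, List, Tuple

Score = Tuple[int, int]  # (home goals, away goals)


def outcome(score: Score) -> str:
    """Reduce a scoreline to home win (H), draw (D), or away win (A)."""
    home, away = score
    if home > away:
        return "H"
    if home < away:
        return "A"
    return "D"


def summarize(
    predictions: List[Score],
    results: List[Score],
    outcome_pts: int = 1,  # assumed: 1 point per correct outcome
    exact_bonus: int = 1,  # assumed: 1 extra point per exact score
) -> Dict[str, float]:
    """Compute the leaderboard columns for one model or strategy."""
    if not results or len(predictions) != len(results):
        raise ValueError("need one prediction per played match")
    pairs = list(zip(predictions, results))
    hits = sum(outcome(p) == outcome(r) for p, r in pairs)
    exact = sum(p == r for p, r in pairs)
    games = len(results)
    return {
        "games": games,
        "hit_pct": 100 * hits / games,
        "exact_pct": 100 * exact / games,
        "outcome_pts": hits * outcome_pts,
        "total_pts": hits * outcome_pts + exact * exact_bonus,
    }


def vs_best_baseline(model_hit_pct: float, baseline_hit_pcts: List[float]) -> float:
    """Hit-rate gap, in percentage points, to the strongest baseline."""
    return model_hit_pct - max(baseline_hit_pcts)


if __name__ == "__main__":
    # Toy data, not taken from the tournament.
    preds = [(2, 1), (1, 1), (0, 2), (3, 0)]
    actual = [(2, 1), (0, 0), (1, 0), (1, 0)]
    print(summarize(preds, actual))
```

The sketch reports unrounded percentage-point differences. The leaderboard shows whole-number percentages, so its VS BEST BL values may differ by a point from the unrounded gap depending on where rounding is applied.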