1. Headline & intro
AI has started to write code, pass professional exams, and draft legal documents. But put it up against a Premier League bookmaker and it still looks surprisingly clumsy. A new study that pitted leading AI models from Google, OpenAI, Anthropic, and xAI against a season of English football betting ends with one clear verdict: the house still wins.
In this piece, we’ll look at what this experiment really tells us about the limits of current large language models, why long‑term decision‑making is so hard for them, and what this means for finance, European regulators, and anyone dreaming of an “AI hedge fund in a box.”
2. The news in brief
According to Ars Technica, summarising research first reported by the Financial Times, London startup General Reasoning ran a benchmark called “KellyBench” that simulated betting through the 2023–24 Premier League season. Eight leading AI systems were given detailed historical statistics and match data. Their task: design betting strategies to maximise returns while managing risk, without live Internet access or external help.
Each model started with a virtual £100,000 bankroll and could place bets on match outcomes and goal counts. They were allowed three full attempts at the season. The best performer, Anthropic’s Claude Opus 4.6, still lost money overall, averaging a loss of about 11 percent across its attempts, though it came close to break‑even in its best run.
OpenAI’s GPT‑5.4 also lost money on average. Google’s Gemini models swung wildly between healthy profits and total bankruptcy. xAI’s Grok, along with another system, managed to burn through its entire bankroll in every completed attempt. The paper is not yet peer‑reviewed, but the pattern is clear: in this setup, frontier models consistently underperformed competent human bettors.
3. Why this matters
On the surface, this is a fun story about AI losing to bookmakers. Underneath, it is a sharp stress test of one of AI’s weakest points: acting sensibly in a messy, dynamic world over long time horizons.
Most of the benchmarks that dominate AI leaderboards are static. You get a question, you answer it once, you’re scored, and you move on. Real work in finance, operations, or strategy looks nothing like that. It’s closer to sports betting: you make a decision, wait for noisy feedback, the world changes, you update your beliefs, and you repeat this hundreds of times while trying not to go bust.
In that environment, the failure modes seen in KellyBench are extremely relevant:
- Overconfidence and poor calibration. LLMs are trained to be decisive and fluent. In betting, that often means taking prices that look “obvious” but are actually slightly negative‑EV, again and again, until ruin.
- Weak risk management. Maximising expected return is easy; surviving a long season with drawdowns, streaks, and variance is much harder. Humans with experience instinctively hedge; generic models clearly struggled.
- Non‑stationary reality. Player injuries, tactical shifts, mid‑season form swings – the underlying process keeps changing. A model that largely relies on static patterns from the past is at a structural disadvantage.
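The interaction between a small negative edge and aggressive stake sizing is easy to see in a toy simulation. The sketch below is purely illustrative and has nothing to do with the paper’s actual methodology: a bettor repeatedly stakes a fixed fraction of their bankroll on even‑money bets that each carry a slight negative expected value. All parameter values (a 2 percent negative edge, 380 bets, the stake fractions) are invented for the example.

```python
import random

def simulate_bettor(n_bets=380, edge=-0.02, stake_frac=0.25,
                    bankroll=100_000.0, seed=None):
    """Toy bettor: stakes a fixed fraction of bankroll on even-money
    (decimal odds 2.0) bets with a slight negative edge.
    A bankroll below £10 counts as ruin."""
    rng = random.Random(seed)
    p_win = (1.0 + edge) / 2.0   # EV per unit staked equals `edge`
    for _ in range(n_bets):
        if bankroll < 10.0:
            return 0.0
        stake = bankroll * stake_frac
        bankroll += stake if rng.random() < p_win else -stake
    return bankroll

def ruin_rate(stake_frac, trials=1_000):
    """Fraction of simulated 380-bet 'seasons' ending in ruin."""
    runs = [simulate_bettor(stake_frac=stake_frac, seed=i)
            for i in range(trials)]
    return sum(r == 0.0 for r in runs) / trials

print(f"aggressive 25% staking, ruin rate: {ruin_rate(0.25):.2f}")
print(f"cautious 2% staking, ruin rate: {ruin_rate(0.02):.2f}")
```

The point of the contrast: with the same slightly negative edge, aggressive fixed-fraction staking goes bust in most simulated seasons, while cautious staking almost never does. Sizing, not just edge, determines survival, which is exactly the discipline the models appeared to lack.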
Who wins and who loses from this result?
- Winners: Human domain experts in finance, sports analytics, and operations. Their jobs are not about to be handed to a chatbot. Bookmakers and trading desks can also breathe easier: their edge is not trivially automated.
- Losers: The narrative that “GPT‑level models will run hedge funds by themselves any day now.” The study is a public reminder that general‑purpose LLMs are not plug‑and‑play decision engines.
In short, this experiment attacks the most overhyped claim in corporate slide decks: that you can simply drop a frontier model into a complex, adversarial environment and let it run the show.
4. The bigger picture
KellyBench slots into several broader storylines in AI.
First, it pushes back against the current fashion for “AI agents” that can supposedly trade, negotiate, and operate companies with minimal oversight. Many of those demos are either simulated in friendly sandboxes or quietly hand‑held by humans. Football betting over a whole season is closer to how real markets behave: information is partial, adversaries adapt, and luck dominates short runs.
Second, it highlights the difference between symbolic success and economic value. Recent models from OpenAI, Google, and Anthropic genuinely are much better at coding, reasoning on exams, and passing synthetic benchmarks. But making money is a different benchmark entirely. It requires calibration, patience, and deep modelling of uncertainty – traits that standard next‑token prediction does not guarantee.
Historically, we’ve seen similar illusions. Early quant trading systems looked unbeatable in backtests, only to blow up when regimes shifted. Here, the backtest is done for the models, and they still go bust. That’s a sign that generative models – even when wrapped in clever “agent” scaffolding – are not yet robust controllers for open‑ended, high‑stakes systems.
Third, the results quietly vindicate the more boring, engineering‑heavy approach many serious financial institutions already take: hybrid systems. Those combine traditional quantitative models (probabilities, the Kelly criterion, risk limits) with AI for specific tasks like news summarisation or scenario generation. KellyBench essentially shows what happens when you skip the hard quant work and rely on generic, text‑only intelligence: inconsistent performance and a high chance of ruin.
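For readers unfamiliar with the benchmark’s namesake: the Kelly criterion is a classic formula for sizing bets in proportion to your edge. A minimal sketch, using the standard textbook formula (not anything specified in the paper):

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll for a bet at `decimal_odds`
    that wins with probability `p_win`. `b` is the net payout per unit
    staked; the formula is f = (b*p - q) / b, with q = 1 - p."""
    b = decimal_odds - 1.0
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)   # never bet when the edge is non-positive

# A 55% chance at even money justifies staking 10% of bankroll;
# a 45% chance at the same price justifies staking nothing.
print(kelly_fraction(0.55, 2.0))
print(kelly_fraction(0.45, 2.0))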
Finally, the model‑to‑model spread is revealing. Claude Opus’s relatively graceful losses and Gemini’s wild swings suggest that training choices – particularly around conservatism, calibration, and tool‑use – can significantly change behaviour under risk. Expect vendors to quietly optimise for this benchmark in the coming year.
5. The European / regional angle
For European readers, this is not just an AI curiosity; it touches on three sensitive areas: gambling, finance, and regulation.
Europe hosts one of the world’s largest regulated sports‑betting markets, with the UK, Italy, Spain, and others generating billions annually. If general‑purpose AI struggles this much in a well‑studied, data‑rich domain like Premier League football, that has implications for how EU financial institutions and betting operators should deploy these tools.
Under the upcoming EU AI Act, AI systems used to influence financial decisions or interact with consumers in high‑risk contexts face strict requirements around transparency, robustness, and human oversight. While sports betting itself may sit at the edge of these categories, any automated “AI tipster” that encourages consumers to gamble based on supposedly superior predictions is likely to draw regulatory attention, especially in countries with strong consumer‑protection cultures like Germany or France.
For European banks, insurers, and asset managers experimenting with LLMs, KellyBench is a timely warning. Using models for research, document analysis, or generating scenarios is one thing; giving them direct control over trading or risk allocation is quite another.
There is also an opportunity side. European startups in sports analytics and fintech can position themselves with domain‑specific AI, built on top of solid probabilistic modelling rather than generic chatbots. In markets from London and Berlin to Ljubljana and Zagreb, that combination of hard quant plus careful AI could become a distinctive regional strength.
6. Looking ahead
Where does this leave the AI roadmap?
In the short term, expect two reactions. First, major labs will likely treat KellyBench and similar long‑horizon benchmarks as new scoreboards to climb. We will see models fine‑tuned explicitly for calibration, risk‑sensitive decision‑making, and tool‑use with proper statistical back‑ends. Some of the gap exposed here is fixable engineering, not a fundamental limit.
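Calibration, unlike fluency, is straightforward to measure. One common tool is the Brier score: the mean squared error between predicted probabilities and what actually happened. The example below uses invented numbers purely to show how an overconfident forecaster scores worse than a calibrated one even when both “lean the right way” on every match.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and binary
    outcomes. Lower is better; always predicting 0.5 scores 0.25."""
    return sum((p - o) ** 2
               for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Five matches (1 = home win), with two forecasters who agree on
# direction but differ in confidence. Numbers are illustrative.
outcomes      = [1, 0, 1, 1, 0]
overconfident = [0.95, 0.90, 0.99, 0.97, 0.85]
calibrated    = [0.70, 0.35, 0.75, 0.70, 0.30]
print(f"overconfident: {brier_score(overconfident, outcomes):.3f}")
print(f"calibrated:    {brier_score(calibrated, outcomes):.3f}")
```

Proper scoring rules like this give labs a concrete target if they do start optimising for risk-sensitive benchmarks: reward models for honest probabilities, not confident-sounding ones.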
Second, regulators and risk officers will have fresh ammunition. When a vendor claims their model can autonomously trade, allocate capital, or set odds, studies like this will be Exhibit A in the pushback: show us real‑world evidence, not just cherry‑picked demos.
Over the next two to five years, the most realistic trajectory is hybridisation, not replacement. LLMs will sit alongside human experts and traditional models, providing ideas, explanations, and alternative scenarios – but not pressing the “bet” button alone. The winners will be organisations that treat AI as a fallible junior partner with encyclopedic recall, not as an oracle.
The big open questions are uncomfortable ones:
- Can a text‑first architecture ever be reliably trusted with capital without heavy guardrails?
- How do we benchmark systems in ways that truly mirror messy human reality, not exam sheets?
- And socially, will retail users still over‑trust “AI betting bots” even when the evidence says they shouldn’t?
There is also a risk of overcorrecting. Dismissing AI entirely in quantitative domains would be a mistake; specialised models already support trading and risk analysis today. The lesson is not “AI can’t handle finance,” but “this specific family of generic models is not yet a full‑stack decision maker.”
7. The bottom line
KellyBench doesn’t prove that AI is stupid; it proves that being eloquent is not the same as being consistently right under risk. For now, the bookmaker, the portfolio manager, and the operations director are safer than the hype suggests. The smart move is to treat LLMs as powerful but error‑prone tools, not autonomous gamblers.
The next time someone tries to sell you an AI that “can’t lose” on the markets or on matchday, a simple question is in order: show me the season‑long track record – and what happens when the form table flips.



