Google’s AI Overviews Are 90% Accurate – And That’s the Problem

April 7, 2026
5 min read
[Image: Google search results page showing an AI-generated overview above traditional blue links]

Google quietly turned search into an AI product, and most people never explicitly agreed to be part of the experiment. Now, new testing suggests that Google’s AI Overviews are wrong roughly once in every ten answers. At Google scale, that’s not a bug count – it’s an information climate. The question is no longer whether AI makes mistakes (it does) but whether a system that confidently produces millions of wrong answers per hour deserves to sit above the “blue links” we used to trust. This piece looks at what 90% accuracy really means when one company is effectively the front door to the web.


The news in brief

According to Ars Technica, citing analysis from The New York Times and AI startup Oumi, Google’s AI Overviews currently answer factual questions correctly roughly 90–91% of the time on a widely used benchmark.

Oumi used SimpleQA, a public dataset of over 4,000 questions with verifiable answers originally released by OpenAI in 2024, to automatically query AI Overviews and score the responses. When tested last year on an earlier model, Google’s system hit about 85% accuracy; with Gemini 3 in the mix, it improved to around 91%.
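To make the methodology concrete, here is a minimal sketch of what a SimpleQA-style scoring loop looks like. Everything in it is illustrative: the `ask` callable stands in for however Oumi actually drove live AI Overview queries, the row format is assumed, and real graders typically use an LLM judge rather than the crude substring match shown here.

```python
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace. Real graders are far more
    forgiving and often use an LLM judge instead of string matching."""
    return " ".join(text.lower().split())

def score(rows: list[dict], ask: Callable[[str], str]) -> float:
    """Fraction of rows whose reference answer appears in the response."""
    correct = 0
    for row in rows:  # each row: {"question": ..., "answer": ...}
        response = ask(row["question"])
        if normalize(row["answer"]) in normalize(response):
            correct += 1
    return correct / len(rows) if rows else 0.0

# Mock run with canned responses standing in for live AI Overviews.
rows = [
    {"question": "What year was SimpleQA released?", "answer": "2024"},
    {"question": "Who released SimpleQA?", "answer": "OpenAI"},
]
canned = {
    "What year was SimpleQA released?": "It was released in 2024.",
    "Who released SimpleQA?": "Google released it.",  # deliberately wrong
}
print(f"accuracy: {score(rows, canned.get):.0%}")  # -> accuracy: 50%
```

The point is only that grading runs automated end to end: feed in questions with known answers, capture whatever the system generates, and count the matches across thousands of queries.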

Ars notes that if you extrapolate that miss rate to Google’s search volume, AI Overviews could be generating tens of millions of incorrect statements per day. Google disputes the methodology, arguing that SimpleQA contains errors and does not represent typical user searches. The company says it prefers a smaller, internally vetted variant called SimpleQA Verified and emphasizes that AI Overviews dynamically select different Gemini models (from fast “Flash” versions to Pro) depending on the query and performance constraints.
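For a sense of the arithmetic behind that extrapolation, here is the back-of-envelope version. All three inputs are illustrative assumptions, not figures from the Ars report: the daily search volume is a commonly cited ballpark, and the share of queries that trigger an Overview is a guess.

```python
# Back-of-envelope only; every input below is an assumption for illustration.
daily_searches = 8_500_000_000   # widely cited ballpark for Google's daily volume
overview_share = 0.10            # assumed fraction of queries that show an Overview
miss_rate      = 0.09            # ~91% accuracy implies roughly a 9% miss rate

wrong_per_day = daily_searches * overview_share * miss_rate
print(f"~{wrong_per_day / 1e6:.0f} million incorrect overviews per day")
# -> ~76 million: tens of millions, even under conservative assumptions
```

Change any input and the headline number moves, but it is hard to pick plausible values that pull the total out of eight-figure territory.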


Why this matters

A 90% score sounds impressive on a model leaderboard. It is disastrous as the default interface to reality.

Search is no longer just another app; it is infrastructure. In many countries, including most of Europe, “searching the web” is synonymous with “using Google.” When infrastructure is wrong 10% of the time, that error rate ripples into health decisions, financial choices, political opinions, and everyday safety.

Crucially, AI Overviews sit above everything else. They inherit Google’s hard‑won trust and visual authority: big box, confident prose, neat citations. Most users will not treat that as a probabilistic guess from a fallible model; they will treat it as the answer. If you’re tired, on mobile, or in a hurry, you will not scroll down and compare sources every time.

Who benefits? Google, primarily. AI Overviews keep users on Google’s page, potentially increase ad impressions, and help the company counter the narrative that OpenAI, Perplexity, or others are the real “answer engines.” There’s also a strategic moat: if the default way to access information becomes a proprietary AI summary, the open web is reduced to a training and citation layer.

Who loses? Users who mistake fluency for accuracy. Websites that invest in quality content yet see less traffic and more misrepresentation. And, more broadly, an information ecosystem in which we can no longer easily distinguish between retrieval (showing us documents) and invention (hallucinating plausible-sounding facts).

This is the core problem: Google is treating generative AI’s current reliability level as “good enough” for mass deployment, largely because the competitive and business incentives push in that direction.


The bigger picture

The AI industry has spent years dancing around the hallucination problem. Every model release comes with benchmark slides: MMLU scores, math reasoning charts, factuality percentages. Each company increasingly uses its own tests, tuned to show its models in the best possible light. The split between SimpleQA and SimpleQA Verified is just one example of this benchmarking fragmentation.

But the deeper shift is product‑level, not model‑level. We’re moving from search engines that pointed you at documents to answer engines that synthesize an opinionated narrative on your behalf. Bing tried this with its AI chat integration; Perplexity builds its entire identity on it. OpenAI has signalled it wants to be a “default interface to the internet.” Google, the incumbent, cannot sit still.

Historically, when Google surfaced a wrong fact – say, via featured snippets – the damage was limited by the feature’s narrow scope. Now that generative summaries are stapled to the top of a huge proportion of queries, the scale of impact changes. A 10% error rate across billions of interactions a day is not a corner case; it’s a systemic characteristic.

There’s also a tricky psychological effect. Classic search results are visibly messy: multiple links, different headlines, conflicting answers. That messiness invites skepticism and comparison. AI Overviews and similar tools create a single, polished, conversational answer – and that smoothness hides uncertainty. We humans are wired to trust confident storytellers.

Compare that with truly safety‑critical domains: aviation, medicine, industrial control. In those fields, a 90% reliability level would be laughable. We demand redundancies, audits, certification, and logs. Search doesn’t directly kill people, but it meaningfully shapes behaviour in domains that do affect life and death. That’s why “90% is pretty good, given how hard the problem is” is not an adequate justification.

The industry trend is clear: deploy first, improvise guardrails later. And Google, which once built its reputation on conservative, incremental changes to search quality, is now following the same playbook.


The European / regional angle

For European users, this is not an abstract US‑centric debate. In many EU countries, Google’s search share hovers around or above 90%. When AI Overviews go wrong, they don’t just mislead “some users” – they mislead almost everyone who searches for that topic.

European regulators have already labelled platforms like Google as “gatekeepers” under the Digital Markets Act (DMA) and imposed new responsibilities under the Digital Services Act (DSA). On top of that, the EU AI Act, whose obligations are phasing in through 2026 and 2027, specifically targets high‑impact AI systems with requirements around risk management, transparency, and human oversight.

AI Overviews sit awkwardly across these regimes. They are part of a core gatekeeper service (search), they have obvious systemic‑risk potential (DSA territory), and they are powered by general‑purpose AI models (AI Act territory). At some point, Brussels will have to answer a concrete question: is a mass‑market AI answer box with a 10% factual error rate acceptable on a dominant platform?

There’s also a linguistic and cultural dimension. Most benchmarks, including SimpleQA, are English‑heavy. For smaller European languages, training data is thinner and local context more nuanced. It’s reasonable to suspect that accuracy in, say, Slovene, Croatian, or Basque is worse than in English – yet those users see the same confident UI.

European publishers, already locked in long‑running battles with Google over snippets and neighbouring rights, face a new challenge: if AI Overviews summarize or distort their reporting and users never click through, the economic model of independent journalism in Europe weakens further. Alternatives like Qwant, Ecosia, or regional search projects may get more attention – but they lack Google’s scale and default status.


Looking ahead

Three things are likely to happen next.

First, Google will quietly tighten the reins. We already see hints of the company limiting AI Overviews for sensitive queries (health, finance, elections) and adding more hedging language and disclaimers. Expect more invisible rules about when Overviews appear at all, and more robust retrieval‑augmented setups that lean heavily on trusted sources for high‑risk topics.
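What might such invisible rules look like? Here is a deliberately naive sketch of query gating, assuming a hypothetical keyword classifier; Google’s actual triggering logic is unpublished and certainly far more sophisticated.

```python
from enum import Enum, auto

class Risk(Enum):
    LOW = auto()
    SENSITIVE = auto()  # health, finance, elections, and similar

# Hypothetical keyword gate. A production system would use a trained
# classifier over the query and its retrieval results, not substring checks.
SENSITIVE_TERMS = ("dose", "dosage", "invest", "ballot", "symptom", "mortgage")

def classify(query: str) -> Risk:
    q = query.lower()
    return Risk.SENSITIVE if any(term in q for term in SENSITIVE_TERMS) else Risk.LOW

def should_show_overview(query: str) -> bool:
    """Suppress the AI answer box for sensitive queries; plain results only."""
    return classify(query) is Risk.LOW

print(should_show_overview("ibuprofen dose for a child"))   # False -> links only
print(should_show_overview("how tall is the eiffel tower")) # True  -> Overview allowed
```

A gate like this could also degrade gracefully instead of suppressing the box outright – for example, restricting generation to a whitelist of vetted sources for anything flagged as sensitive.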

Second, we should anticipate regulatory interest, especially in Europe. The DSA obliges very large platforms to assess and mitigate systemic risks, including disinformation and impacts on fundamental rights. An AI layer that demonstrably produces large volumes of incorrect statements is an obvious target for risk‑mitigation plans, audits, or even design changes mandated by regulators.

Third, users will adapt faster than Google expects. We are already seeing a split behaviour pattern: AI chatbots for brainstorming and coding, traditional search or specialist sites for critical facts. If enough people start to treat AI Overviews as “nice to skim, but not to trust,” then Google will have injected a layer of skepticism into its own product. That erosion of trust is hard to reverse.

Watch for a few concrete signals over the next 12–24 months:

  • Does Google publish independent, third‑party‑audited metrics on AI Overview accuracy by domain and language?
  • Do browsers or OS vendors give users more explicit toggles to disable AI layers in search?
  • Do vertical services (health portals, finance tools, legal databases) market themselves explicitly as “non‑AI, verified” alternatives?

The biggest open question is liability. When AI Overviews confidently misstate a medical dose or defame an individual, who is ultimately responsible – and under which jurisdiction’s rules? Courts and regulators have not yet fully answered that.


The bottom line

A system that is wrong 10% of the time has no business presenting itself as the definitive answer on the world’s most powerful search engine. Google’s AI Overviews turn inevitable model errors into a structural feature of how billions of people access information. Unless Google – and regulators – treat this as an infrastructure‑level risk rather than a product experiment, users will keep paying the price in quiet, invisible ways. The real question is: how many confident mistakes are you willing to accept from the box at the top of your screen?
