AI Overviews hallucinate millions of times per hour: how to become the trusted reference source that doesn’t lie


In short: Oumi’s April 2026 study tested 4,326 queries on Google AI Overviews using OpenAI’s SimpleQA benchmark. Verdict: 9 to 15% false answers, meaning tens of millions of errors per hour across the 5 trillion annual searches Google processes. An e-commerce brand that structures its product sheets as verifiable factual data (specs, precise FAQs, Schema.org, cited sources, versioning) becomes the source that LLMs prefer to cite. This article details the method for turning your product sheets into mini-encyclopedias of fact that ChatGPT, Perplexity, and AI Overviews recommend first.
9 to 15% of AI Overviews responses containing factual errors (Oumi study, April 2026)
4,326 queries tested by Oumi using the SimpleQA benchmark on Gemini 2 then Gemini 3
56% of correct answers not anchored in the cited sources (Gemini 3, February 2026)

The scope of the problem: AI Overviews invent at industrial scale

On April 7, 2026, the New York Times published the analysis by Oumi, an open-source AI startup. The team tested 4,326 queries on Google AI Overviews using SimpleQA, a benchmark created by OpenAI in 2024 to measure the factuality of generative models. Slashdot, Popular Science, Yahoo Tech, and Hacker News picked up the study. A Reddit post on r/nottheonion gathered 9,613 upvotes within days.

The numbers are stark. And brutal.

  • 9 to 10% factual errors on Gemini 3 responses (February 2026), versus 15% on Gemini 2 (October 2025)
  • 56% of correct answers aren’t actually anchored in the cited sources — the model is right by accident, not by reasoning
  • Scaled to the 5 trillion annual searches Google processes, this represents tens of millions of false answers per hour, hundreds of thousands per minute
  • Among the 5,380 sources cited in the study, Facebook ranks 2nd and Reddit 4th — two platforms where factual verification is, at best, patchy
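The per-hour figure is easy to sanity-check. This minimal Python sketch redoes the arithmetic under the same simplifying assumption the scaling above makes: the error rate is applied to all ~5 trillion annual searches, although AI Overviews appear on only part of them, so the result is an upper bound rather than a measurement.

```python
# Back-of-envelope check of the "tens of millions per hour" scaling.
# Assumption: the error rate applies to all ~5 trillion annual searches,
# although AI Overviews only appear on a subset of queries (upper bound).
ANNUAL_SEARCHES = 5_000_000_000_000
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def errors_per_period(error_rate: float) -> tuple[float, float]:
    """Return (wrong answers per hour, wrong answers per minute)."""
    per_hour = ANNUAL_SEARCHES / HOURS_PER_YEAR * error_rate
    return per_hour, per_hour / 60


for rate in (0.09, 0.10, 0.15):
    per_hour, per_minute = errors_per_period(rate)
    print(f"{rate:.0%}: ~{per_hour / 1e6:.0f}M per hour, ~{per_minute / 1e3:.0f}k per minute")
```

At 9 to 10%, this gives roughly 51 to 57 million wrong answers per hour, or about 850,000 to 950,000 per minute; at 15%, it passes 1.4 million per minute.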

Google’s spokesperson, Ned Adriance, responded that « AI search features rely on the same ranking and safety protections that block the vast majority of spam. » The phrasing says it all: we’re talking about spam filtering, not factual validation.

The Bob Marley case, the Yo-Yo Ma case

The examples documented by Oumi are telling. Question: what year did Bob Marley’s house become a museum? The model picks the wrong year from Wikipedia, ignoring a primary source that gave the correct one. Question: was Yo-Yo Ma inducted into the Classical Music Hall of Fame? The model answers « there is no Classical Music Hall of Fame »… while citing the official Classical Music Hall of Fame page that confirms the induction. The model contradicts itself in the same sentence. Without noticing.

This isn’t a model bug. It’s the nature of LLMs themselves: they produce statistically plausible text, not factually verified text. As Lily Ray showed in January 2026 with « The AI Slop Loop » on Substack, you only need to publish a fictional article on a personal blog for AI Overviews to pick up the information as factual once a handful of AI sites repeat it. The citation threshold is terribly low.

The key number: according to the Artificial Analysis Intelligence Index, Gemini 3.1 Pro Preview shows a hallucination rate of 50% on open-ended questions, versus 88% for Gemini 3 Pro. Even after the largest single-update improvement of 2025-2026, a mainstream LLM still gets open-ended questions wrong half the time.

Why this is a strategic opportunity for your brand

The dominant reading on LinkedIn and in tech media is panic: « AI Overviews lie, we need to flee. » That’s consumer thinking. The thinking of an e-commerce operator is radically different.

Here’s the market reality in April 2026:

  • Consumers consult AI Overviews, ChatGPT, Perplexity, and Gemini before clicking on a merchant site. 90% accuracy is enough for most to stop clicking the organic result.
  • LLMs need sources they can cite. When a brand publishes precise factual data, it becomes the source to cite, not to compete with.
  • The public errors of AI Overviews (« eat rocks, » « put glue on your pizza ») have created operational distrust: users increasingly seek to verify AI answers. When an LLM cites your product sheet, the verification click comes to you.

The market is dividing into two categories

On one side, sites that produce generic, approximate content, filled with « generally, » « roughly, » « in most cases. » These sites become invisible to LLMs because they offer nothing the LLM can’t already generate itself.

On the other side, sites that publish verifiable, versioned, structured data with sources backing it up. These sites become the ground truth of the web for LLMs. They’re cited in first position, referenced in long-form answers, used as proof.

What I observe across the 650+ clients I support is that this transition is happening now. Brands that industrialized their product data structuring in 2024-2025 are reaping 30 to 50% of their traffic from LLM citations today. The others watch their organic visibility collapse without understanding why.

Neuroscientific mechanism: A user who verifies an AI answer and lands on your factual source experiences a micro-dopamine reward (« I was right to verify ») associated with your brand. This association strengthens with each verification. Trust, chemically, is oxytocin released by repeated accuracy. You’re not selling a product; you’re building a verification reflex.

Seven techniques to transform your content into ground truth

Here’s the method I apply. No theory: 1,300+ semantic clusters deployed since 2016, 650 clients, direct observation of what LLMs cite—or ignore.

1. Spec data in structured tables, never in prose

An LLM reads a table better than a paragraph. Dimensions, compositions, tolerances, compatibilities, charge times, consumption rates, compatible references: everything goes in a semantic HTML table with explicit headers. No marketing paragraph saying « approximately 12 hours of battery life. » Be precise: « 11 h 47 min in mixed use (protocol X, 50% brightness, internal measurement April 15, 2026). »
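As an illustration, here is a minimal Python sketch that renders specs as a semantic table: a caption, column headers marked `scope="col"`, and each spec name as a row header marked `scope="row"`. The function name and the three-column layout (spec, value, measurement conditions) are assumptions made for this example, not a required format.

```python
from html import escape


def spec_table(caption: str, specs: list[tuple[str, str, str]]) -> str:
    """Render (spec, value, conditions) rows as a semantic HTML table with explicit headers."""
    rows = "\n".join(
        f'    <tr><th scope="row">{escape(name)}</th>'
        f"<td>{escape(value)}</td><td>{escape(conditions)}</td></tr>"
        for name, value, conditions in specs
    )
    return (
        "<table>\n"
        f"  <caption>{escape(caption)}</caption>\n"
        "  <thead>\n"
        '    <tr><th scope="col">Specification</th><th scope="col">Value</th>'
        '<th scope="col">Measurement conditions</th></tr>\n'
        "  </thead>\n"
        f"  <tbody>\n{rows}\n  </tbody>\n"
        "</table>"
    )


# The battery row reuses the example above; the product name is hypothetical.
print(spec_table("Example tablet, 2026 model", [
    ("Battery life", "11 h 47 min",
     "Mixed use, protocol X, 50% brightness, internal measurement April 15, 2026"),
]))
```

Putting the measurement conditions in their own column keeps the value itself clean while still carrying the protocol and date next to it.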

2. FAQs with surgical answers

Zero « generally, » zero « roughly, » zero « in most cases. » FAQs that become ground truth answer with a number, a date, a precise condition. Question: « Is this product compatible with model X? » Bad answer: « It’s compatible with most recent models. » Good answer: « Compatible with models X-100 to X-240 manufactured after January 2024. Incompatible with X-90 (different sensor) and X-300 (proprietary connector). »
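This rule can be checked automatically before publishing. The sketch below is a hypothetical linter, not an existing tool: it flags banned hedging phrases and uses "contains at least one digit" as a crude proxy for a surgical answer. The phrase list is an assumption to extend for your own catalog.

```python
import re

# Hedging phrases banned from FAQ answers; extend the list for your own catalog.
VAGUE_PHRASES = ["generally", "roughly", "in most cases", "approximately", "most recent models"]


def vague_phrases(answer: str) -> list[str]:
    """Return the banned phrases found in an answer (case-insensitive, whole words)."""
    return [p for p in VAGUE_PHRASES
            if re.search(rf"\b{re.escape(p)}\b", answer, flags=re.IGNORECASE)]


def has_hard_fact(answer: str) -> bool:
    """Crude proxy: a surgical answer contains at least one digit (model number, date, value)."""
    return re.search(r"\d", answer) is not None


bad = "It's compatible with most recent models."
good = ("Compatible with models X-100 to X-240 manufactured after January 2024. "
        "Incompatible with X-90 (different sensor) and X-300 (proprietary connector).")
print(vague_phrases(bad), has_hard_fact(bad))    # ['most recent models'] False
print(vague_phrases(good), has_hard_fact(good))  # [] True
```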

3. Explicit fact-checks « what’s true, what’s false »

Create dedicated blocks on each sheet: « Common misconceptions. » List false claims circulating about the product and correct them with sources. LLMs love citing these blocks—they resolve the factual ambiguity the LLM is trying to clear up.

4. Cited sources visible and clickable

Every contestable claim comes with an external source: scientific paper, ISO standard, manufacturer datasheet, independent test. Not in an invisible footer, but in the body text, with a link. This signals to the LLM that your content is itself anchored in verifiable sources.

5. Versioning and update dates explicit

Each page carries a visible update date. Each sensitive piece of data (price, spec, compatibility) carries a version note: « data valid for 2026 model, edition 3. » LLMs heavily weight freshness: content dated 3 months ago beats content dated 2 years ago, all else equal.
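One way to keep version notes consistent is to store each sensitive fact with its validity scope and last verification date, then generate the note from that record. This is a minimal sketch; the `VersionedFact` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VersionedFact:
    """A sensitive product fact with its validity scope and last verification date."""
    name: str
    value: str
    valid_for: str         # e.g. "2026 model, edition 3"
    last_verified: date

    def version_note(self) -> str:
        """Human-readable note to display next to the value on the page."""
        return (f"{self.name}: {self.value} (data valid for {self.valid_for}, "
                f"last verified {self.last_verified.isoformat()})")

    def age_days(self, today: date) -> int:
        """Days since the fact was last verified, used to spot stale data."""
        return (today - self.last_verified).days


fact = VersionedFact("Battery life", "11 h 47 min", "2026 model, edition 3", date(2026, 4, 15))
print(fact.version_note())
print(fact.age_days(date(2026, 7, 15)))  # 91
```

Keeping the verification date on the fact itself, rather than only on the page, lets you see which individual values are aging.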

6. Comparison tables against alternatives

Systematically create a « this product vs alternatives » table. Honestly. If your product is weaker on one dimension, say so. Radical honesty in comparison is the strongest trust signal for an LLM. I often tell clients in meetings: « retention is my weak point, » and what should cost me a contract has the opposite effect. Same mechanism for content.

7. Mini-encyclopedia per product (not product sheet)

Stop thinking « 500-word product sheet. » Think « 2,500 to 4,000-word encyclopedia page » with history, context, use cases, limitations, alternatives, and an exhaustive FAQ. This is the only way to become the reference source LLMs cite when asked about the category. Your competitors publishing 300-word sheets become invisible.

Real case: a pro-audio hardware e-commerce client transformed 180 generic product sheets into 3,000-word encyclopedia pages in 4 months. Measured results: +287% Perplexity citations, +42% organic traffic in GSC, and 23 B2B inquiries per month from queries where an LLM had cited the page as the reference source.

Schema.org: the language LLMs read first

LLMs don’t read your page like a human. They parse structured data (JSON-LD) first, then semantic HTML, then text content. A product sheet without Schema.org? The LLM guesses. Sometimes well. Often less well.

Here are the schema types that signal « ground truth » to LLMs, in priority order:

Product + additionalProperty

The basic Product schema is not enough. Enrich it with additionalProperty for each measurable spec: height, weight, supply voltage, temperature range, certification standard. Each property is a PropertyValue with its name, value, and unit. LLMs reproduce these properties as-is in their answers.
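A minimal Python sketch of that structure: it builds Product JSON-LD with one `PropertyValue` per spec. It uses `unitText` for readability; schema.org also accepts `unitCode` with UN/CEFACT codes. The product, SKU, and values are hypothetical.

```python
import json


def product_jsonld(name: str, sku: str, specs: list[tuple[str, float, str]]) -> str:
    """Build Product JSON-LD with one PropertyValue per measurable spec."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        # Each spec becomes a PropertyValue carrying name, value and unit.
        "additionalProperty": [
            {"@type": "PropertyValue", "name": spec, "value": value, "unitText": unit}
            for spec, value, unit in specs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Hypothetical product and values, for illustration only.
print(product_jsonld("Example studio monitor", "EX-001", [
    ("Height", 32.5, "cm"),
    ("Weight", 6.8, "kg"),
    ("Supply voltage", 230, "V"),
]))
```

The output goes in a `<script type="application/ld+json">` tag on the product page.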

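FAQPage or QAPage for product FAQs

Use FAQPage when the brand itself writes both the questions and the answers and users cannot submit alternatives: that is the normal case for a product FAQ. Reserve QAPage for a page built around a single question where users can post answers, with the validated one marked as acceptedAnswer. In both cases, the markup signals an answer validated by the publisher, which carries more weight than unverified user content.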
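A minimal sketch for the common case where the brand writes both question and answer. Google's structured data documentation recommends FAQPage for site-authored FAQs and reserves QAPage for single-question pages with user-submitted answers, so this sketch emits FAQPage. The same Question/acceptedAnswer shape sits inside QAPage's `mainEntity`. The function name is hypothetical.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build FAQPage JSON-LD: one Question with its acceptedAnswer per (question, answer) pair."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


print(faq_jsonld([
    ("Is this product compatible with model X?",
     "Compatible with models X-100 to X-240 manufactured after January 2024. "
     "Incompatible with X-90 (different sensor) and X-300 (proprietary connector)."),
]))
```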

HowTo for procedures

Installation, maintenance, care, troubleshooting: each procedure becomes a HowTo schema with HowToStep, HowToTool, HowToSupply. When a user asks ChatGPT « how to install X, » the LLM cites pages with this schema first. It can structure its answer directly from your data.
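A minimal Python sketch of a HowTo block using the three types named above, with steps numbered through `position`. The procedure itself is hypothetical.

```python
import json


def howto_jsonld(name: str, tools: list[str], supplies: list[str], steps: list[str]) -> str:
    """Build HowTo JSON-LD with HowToTool, HowToSupply and numbered HowToStep items."""
    data = {
        "@context": "https://schema.org",
        "@type": "HowTo",
        "name": name,
        "tool": [{"@type": "HowToTool", "name": t} for t in tools],
        "supply": [{"@type": "HowToSupply", "name": s} for s in supplies],
        "step": [
            {"@type": "HowToStep", "position": i, "text": text}
            for i, text in enumerate(steps, start=1)
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Hypothetical installation procedure.
print(howto_jsonld(
    "Install the X-200 wall bracket",
    tools=["Drill", "6 mm masonry bit"],
    supplies=["4 wall plugs", "4 screws"],
    steps=["Mark the four holes using the template.",
           "Drill and insert the wall plugs.",
           "Screw the bracket to the wall."],
))
```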

ClaimReview for fact-checks

Underused in e-commerce. A mistake. The ClaimReview schema lets you formally state: « here’s a claim in circulation (example: this product contains lead), here’s our evaluation (false), here’s the source. » LLMs treat ClaimReview with near-absolute priority. It’s literally the schema designed to fight misinformation.
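A minimal sketch of the three parts named above: the claim (`claimReviewed`), the evaluation (a `Rating` whose `alternateName` carries the verdict), and the source (`citation`). The 1-to-5 scale and all URLs are assumptions for illustration.

```python
import json


def claim_review_jsonld(page_url: str, claim: str, verdict: str, rating: int,
                        publisher: str, source_url: str, date_published: str) -> str:
    """Build ClaimReview JSON-LD: the claim, the verdict as a Rating, and the supporting source."""
    data = {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": page_url,
        "datePublished": date_published,
        "claimReviewed": claim,
        "itemReviewed": {"@type": "Claim"},
        "author": {"@type": "Organization", "name": publisher},
        # Numeric rating on an assumed 1-5 scale, with the readable verdict in alternateName.
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": rating,
            "worstRating": 1,
            "bestRating": 5,
            "alternateName": verdict,
        },
        "citation": source_url,
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Hypothetical misconception block on a product page.
print(claim_review_jsonld(
    "https://www.example.com/x-200#misconceptions",
    "The X-200 contains lead",
    "False", 1, "Example Brand",
    "https://www.example.com/docs/x-200-lab-report.pdf",
    "2026-04-15",
))
```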

Dataset for public technical data

If you publish benchmarks, comparative tests, compatibility grids, wrap them in Dataset schema with explicit reuse license. An LLM finding a reusable Dataset cites it systematically. It knows it can use it without legal risk.
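A minimal sketch of a Dataset block with an explicit `license` and a downloadable CSV `distribution`. The CC BY 4.0 license, the dataset, and the CSV URL are hypothetical choices for this example.

```python
import json


def dataset_jsonld(name: str, description: str, license_url: str,
                   csv_url: str, publisher: str) -> str:
    """Build Dataset JSON-LD with an explicit license and a downloadable CSV distribution."""
    data = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        "creator": {"@type": "Organization", "name": publisher},
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": csv_url,
        }],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


print(dataset_jsonld(
    "X-series compatibility grid",
    "Compatibility of X-series accessories with X-90 to X-300 models, tested in-house.",
    "https://creativecommons.org/licenses/by/4.0/",
    "https://www.example.com/data/x-compatibility.csv",  # hypothetical URL
    "Example Brand",
))
```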

Organization with extended sameAs

The entity publishing the content must be traceable: an Organization schema with sameAs pointing to Wikidata, the official LinkedIn page, the SIREN registry entry, and the LEI (Legal Entity Identifier) if applicable. A brand identified without ambiguity is a brand the LLM cites without hesitation. A brand with a fuzzy entity gets replaced by « a specialized site » in AI answers. You lose the credit.
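A minimal sketch of such an Organization block. schema.org has a `leiCode` property but no dedicated SIREN property, so this sketch exposes the SIREN as a generic `identifier` (the `propertyID` label is an assumption). All values are placeholders.

```python
import json
from typing import Optional


def organization_jsonld(name: str, url: str, same_as: list[str],
                        siren: Optional[str] = None, lei: Optional[str] = None) -> str:
    """Build Organization JSON-LD with sameAs links and optional registry identifiers."""
    data: dict = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,
    }
    if siren:
        # No dedicated SIREN property: expose it as a generic identifier.
        data["identifier"] = {"@type": "PropertyValue", "propertyID": "SIREN", "value": siren}
    if lei:
        data["leiCode"] = lei  # schema.org property for the Legal Entity Identifier
    return json.dumps(data, indent=2, ensure_ascii=False)


# Placeholder values; replace with your real Wikidata item, LinkedIn page and SIREN.
print(organization_jsonld(
    "Example Brand",
    "https://www.example.com",
    ["https://www.wikidata.org/wiki/QXXXXXX", "https://www.linkedin.com/company/example-brand"],
    siren="000000000",
))
```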

Concretely, on hi-commerce.fr, I deployed a full schema layer via the Hi-Commerce AI Search plugin (FAQPage, Article, BreadcrumbList, Person, Organization with sameAs Wikidata). Measured result: from 0 to more than 80 Perplexity citations per month in 6 months, without changing a single line of text content.

DOSE and trust: the chemistry of the reference source

Trust isn’t an abstract marketing concept. It’s a precise neurochemical process I’ve studied for several years in the context of the DOSE model (Dopamine, Oxytocin, Serotonin, Endorphin) applied to SEO and conversion.

Oxytocin: the molecule of repeated reliability

Oxytocin is released during repeated, confirmed trust experiences. Applied to content: each time a user verifies information from you and finds it correct, their brain releases a micro-dose of oxytocin associated with your brand. After 10-15 successful verifications, a reflex builds: « if I want reliable info on this category, I go to X. »

This reflex is infinitely more solid than any branding campaign. It’s not based on a slogan, it’s based on a verifiable history of accuracy. Result: in technical B2B, some sites have become near-monopolistic references in their category. The ecosystem—clients, journalists, ChatGPT, trainers—cites them reflexively.

Dopamine: the reward of finding precision

Dopamine is released both in anticipation of a reward AND when the reward arrives. A user searching for « what’s the exact difference between X and Y » who finds a precise, numbered, sourced comparison table experiences a dopamine spike. That experience leaves a strong imprint. They’ll return. They’ll recommend you.

The asymmetric effect: one detected lie kills trust

Here’s the trap: oxytocin builds slowly (10-15 positive experiences to anchor the reflex) but collapses instantly. A single « detected lie » (a false spec, a wrong date, an outdated price) triggers the inverse mechanism: cortisol, distrust, avoidance.

Factual rigor is non-negotiable. A site aiming for reference source status can’t accept even a 1% error rate. The standard must be zero. Holding that standard takes process: review, versioning, automatic notification of supplier changes, and a quarterly audit of sensitive pages.
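The supplier-change and quarterly-audit steps can be partly automated. This hypothetical sketch compares published specs with a supplier feed (both assumed to be simple name-to-value mappings) and flags a review older than 90 days, used here as an approximation of "quarterly".

```python
from datetime import date, timedelta

AUDIT_INTERVAL = timedelta(days=90)  # "quarterly" approximated as 90 days


def audit(published: dict[str, str], supplier: dict[str, str],
          last_reviewed: date, today: date) -> list[str]:
    """Compare published specs with the supplier feed and flag an overdue review."""
    issues = []
    for key, supplier_value in supplier.items():
        if key not in published:
            issues.append(f"missing: {key} (supplier says {supplier_value})")
        elif published[key] != supplier_value:
            issues.append(f"changed: {key} published={published[key]!r} supplier={supplier_value!r}")
    for key in sorted(published.keys() - supplier.keys()):
        issues.append(f"not in supplier feed: {key}")
    if today - last_reviewed > AUDIT_INTERVAL:
        issues.append(f"review overdue since {(last_reviewed + AUDIT_INTERVAL).isoformat()}")
    return issues


# Hypothetical data: the supplier changed the weight and added a warranty line.
for issue in audit(
    published={"Weight": "6.8 kg", "Supply voltage": "230 V"},
    supplier={"Weight": "7.1 kg", "Supply voltage": "230 V", "Warranty": "3 years"},
    last_reviewed=date(2026, 1, 5),
    today=date(2026, 4, 20),
):
    print(issue)
```

Run on a schedule, a script like this turns "automatic notification of supplier changes" into a concrete list of pages to fix.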

Serotonin: the status of recognized source

Serotonin regulates the sense of status. When your content is cited by a mainstream LLM—ChatGPT, Perplexity, Gemini—your customers, prospects, and partners see it. The mechanism is the same as academic citations: being cited validates status. E-commerce leaders underestimate this effect because it’s diffuse but cumulative.

Endorphin: effort rewarded

Publishing 180 encyclopedia sheets of 3,000 words is work. It’s exactly why it’s a moat. Your competitors who keep publishing 300-word sheets won’t catch you in 6 months. The effort you invest today creates a barrier to entry that protects you for years.

Measuring your trusted source status: KPIs that actually matter

Traditional SEO measured Google rankings. LLM visibility is measured differently. Most SaaS dashboards lag on the subject. Here are the metrics I use to track reference source status.

Perplexity and ChatGPT citations monthly

Tools: Otterly.ai, Profound, Peec.ai, or a custom script querying the Perplexity API and ChatGPT Search API against 50 to 200 pivot queries. The KPI: number of citations per month, and average position in the list of cited sources. A site in growth phase targets 20-50 citations/month. A mature site in a technical category can exceed 1,000.
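Whatever tool collects the answers, the two KPIs reduce to simple counting. This sketch assumes the results have already been gathered as a mapping from pivot query to the ordered list of cited URLs; it makes no API calls. It counts each answer once, at the position of its first citation of your domain.

```python
from typing import Optional
from urllib.parse import urlparse


def citation_kpis(results: dict[str, list[str]], domain: str) -> tuple[int, Optional[float]]:
    """From {pivot query: cited URLs in display order}, return
    (number of answers citing `domain`, average 1-based position of its first citation)."""
    positions = []
    for cited_urls in results.values():
        for position, url in enumerate(cited_urls, start=1):
            host = urlparse(url).hostname or ""
            if host == domain or host.endswith("." + domain):
                positions.append(position)
                break  # count each answer once, at its first citation of the domain
    average = sum(positions) / len(positions) if positions else None
    return len(positions), average


# Hypothetical results collected for three pivot queries.
results = {
    "x-200 battery life": ["https://www.example.com/x-200", "https://other.org/review"],
    "x-200 vs y-10": ["https://other.org/compare", "https://example.com/x-200#compare"],
    "best studio monitor": ["https://other.org/top10"],
}
print(citation_kpis(results, "example.com"))  # (2, 1.5)
```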

Answer anchoring rate

The hidden KPI that makes all the difference. An LLM can cite you without reproducing your data faithfully—it cites your URL but generates an answer not actually drawn from your page. To measure this, compare your content with the LLM's generated answer and calculate what percentage of the answer comes literally from your page. Target: above 60%.
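"Literally from your page" can be approximated in several ways. This sketch uses one crude proxy, chosen here as an assumption: the share of the answer's word trigrams that also appear verbatim on the page. It ignores paraphrase, so treat the score as a lower bound on true anchoring.

```python
import re


def _word_trigrams(text: str) -> list[tuple[str, ...]]:
    """Lowercased word trigrams; punctuation and hyphens split words."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def anchoring_rate(answer: str, page: str) -> float:
    """Share (0.0 to 1.0) of the answer's word trigrams that also appear on the page."""
    answer_grams = _word_trigrams(answer)
    if not answer_grams:
        return 0.0
    page_grams = set(_word_trigrams(page))
    return sum(gram in page_grams for gram in answer_grams) / len(answer_grams)


page = "Compatible with models X-100 to X-240 manufactured after January 2024."
answer = "It is compatible with models X-100 to X-240."
print(anchoring_rate(answer, page))  # 0.75
```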

Verification traffic

Google Search Console gives you organic queries. But visits from post-LLM verification are different: they arrive with no visible query, often via empty referrer or directly from Perplexity/ChatGPT. Check GA4 for pages with high direct traffic + high time on page + low bounce: these are often post-LLM verifications.
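The same heuristic can be applied to exported session data. In this sketch, the list of LLM referrer hostnames and the 60-second engagement threshold are assumptions to check against your own GA4 data, not GA4 defaults.

```python
from urllib.parse import urlparse

# Assumed LLM referrer hostnames; check them against the referrers you actually see in GA4.
LLM_REFERRERS = {
    "perplexity.ai": "Perplexity",
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}


def classify_session(referrer: str, engagement_seconds: float, min_engaged: float = 60.0) -> str:
    """Label a session as LLM-referred, a possible post-LLM verification, or other."""
    host = urlparse(referrer).hostname or ""
    for domain, label in LLM_REFERRERS.items():
        if host == domain or host.endswith("." + domain):
            return f"llm:{label}"
    # Empty referrer plus long engagement: the verification pattern.
    if not host and engagement_seconds >= min_engaged:
        return "possible post-LLM verification"
    return "other"


print(classify_session("https://www.perplexity.ai/search?q=x-200", 40))
print(classify_session("", 180))
```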

Volume of newly acquired editorial backlinks

When your content becomes ground truth, other sites start citing you as a source. Not via paid linking: via natural linking, because you're the best public source on the topic. Monitor monthly evolution of unique referring domains in Haloscan or Ahrefs. A typical takeoff: from 12 domains/month to 40-60/month in 6 months post-implementation.
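If your backlink tool exports referring domains with a first-seen date, the monthly count of new unique domains takes a few lines. The export shape (domain, first-seen date) is an assumption, and this sketch treats `www.` and bare domains as the same site.

```python
from collections import Counter
from datetime import date


def new_domains_per_month(first_seen: list[tuple[str, date]]) -> dict[str, int]:
    """Count unique referring domains by the month ("YYYY-MM") they were first seen."""
    earliest: dict[str, date] = {}
    for domain, seen in first_seen:
        key = domain.lower().removeprefix("www.")  # treat www.a.com and a.com as one domain
        if key not in earliest or seen < earliest[key]:
            earliest[key] = seen
    counts = Counter(f"{d.year}-{d.month:02d}" for d in earliest.values())
    return dict(sorted(counts.items()))


# Hypothetical rows from a backlink-tool export: (referring domain, first-seen date).
rows = [
    ("www.blog-a.com", date(2026, 1, 3)),
    ("blog-a.com", date(2026, 2, 1)),
    ("forum-b.org", date(2026, 1, 20)),
    ("media-c.net", date(2026, 2, 14)),
]
print(new_domains_per_month(rows))  # {'2026-01': 2, '2026-02': 1}
```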

Branded queries tied to factual questions

In GSC, filter queries containing your brand name. Spot technical questions: "brand X spec Y," "brand X compatibility Z." Growth in this query volume is the most reliable indicator you're becoming the technical reference in your category.
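GSC's query filter accepts regular expressions, but filtering an exported query list gives more control. This sketch keeps queries that contain the brand and at least one technical term; the brand, the term list, and the queries are hypothetical.

```python
import re


def branded_technical(queries: list[str], brand: str, tech_terms: list[str]) -> list[str]:
    """Keep queries that mention the brand AND at least one technical term (whole words)."""
    brand_re = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    tech_re = re.compile("|".join(rf"\b{re.escape(t)}\b" for t in tech_terms), re.IGNORECASE)
    return [q for q in queries if brand_re.search(q) and tech_re.search(q)]


# Hypothetical brand, terms and exported GSC queries.
queries = ["acme x-200 compatibility", "acme store near me", "acme specs weight", "headphones compatibility"]
print(branded_technical(queries, "Acme", ["spec", "specs", "compatibility", "compatible", "voltage"]))
# ['acme x-200 compatibility', 'acme specs weight']
```

Tracking the monthly count of matching queries gives the growth curve described above.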

Verifier NPS

In post-purchase or post-contact surveys, add: "Did you verify information found on our site with another source before buying? If yes, what source, and what was your conclusion?" The verbatims you gather show exactly when your content won or lost the trust battle.

Consolidated dashboard

In practice, I consolidate these six metrics into a monthly Google Sheet showing their 3-month and 12-month rolling evolution, with qualitative comments. An e-commerce leader tracking these numbers gets a clear view of their LLM trajectory, far beyond what Google Search Console alone can show.
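"Rolling evolution" can be computed several ways. This sketch takes one interpretation, chosen as an assumption: each month's percent change versus 3 (or 12) months earlier, left empty when there is not enough history or the base month is zero.

```python
from typing import Optional


def change_over(values: list[float], months: int) -> list[Optional[float]]:
    """Percent change of each month versus `months` earlier (None without history or a zero base)."""
    out: list[Optional[float]] = []
    for i, value in enumerate(values):
        if i < months or values[i - months] == 0:
            out.append(None)
        else:
            base = values[i - months]
            out.append((value - base) / base * 100)
    return out


# Hypothetical monthly Perplexity citation counts.
citations = [10, 12, 15, 20, 25]
print(change_over(citations, 3))
```

The same function applied with `months=12` gives the year-over-year column.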

Conclusion: the shift now, or progressive invisibility

AI Overviews hallucinate because of how they are built: they predict the next token; they do not verify reality. This is not a bug that gets fixed in 2026, nor a detail solved in 2027. It's structural to mainstream LLMs: hallucination rates are falling (88% to 50% on open-ended questions in one model generation) but will never disappear entirely.

Facing this reality, two paths:

Path 1: wait for stabilization. Keep shipping short sheets, generic content, no deep Schema.org, no versioning, no cited sources. In 18 months, your brand will be absent from LLM answers or mentioned vaguely ("a specialized site"). Your competitors who made the shift will have built an unbreachable moat.

Path 2: become ground truth now. Transform each product sheet into a mini-encyclopedia of fact. Deploy Product (with additionalProperty specs), FAQPage or QAPage, HowTo, ClaimReview, and Dataset schemas. Install a process for versioning and factual audit. Measure monthly progress in LLM citations. In 18 months, you'll be the source ChatGPT, Perplexity, and Gemini cite first in your category.

The good news? The timing is perfect. 90% of your competitors haven't started—most think "AI SEO" means adding a ChatGPT prompt to their workflow. While they waste time generating mediocre content at scale, you build—slowly, rigorously—the factual reputation that will make you the reference in 3 years.

Precision is a strategic choice. Accents matter. Numbers matter. Dates matter. Sources matter. Across 1,300+ semantic clusters deployed, I see the same dynamic repeat: brands that respect precision win; the others slowly fade from AI answers.

Make the shift while the competitive cost of getting there is still close to zero.

Factuality audit and reference source positioning

I conduct a live 30-minute audit of your site: Schema.org structure, factual quality of sheets, current LLM citations, priorities to become ground truth in your category. No pitch, no slides. Direct demonstration on your site with a numbered action plan.

Book a strategic call — 45 min

Frequently Asked Questions

How long does it take for a site to become "ground truth" for LLMs?

Between 4 and 9 months depending on work density. First signals (Perplexity citations, AI Overviews appearance) typically arrive 60-90 days after Schema.org structure and factual content deployment. Consolidation as reference source in a technical category takes 6 to 9 months, sometimes longer in highly competitive sectors. Speed depends mainly on three variables: density of published factual content, rigor of Schema.org implementation, and pre-existing E-E-A-T signals of the publishing entity.

Do we need to rewrite everything, or can we enrich existing content?

Enrich, in 90% of cases. On product sheets with existing content, the approach is to add sections (detailed specs, surgical FAQ, honest comparison, cited sources, versioning) rather than rewrite everything. On category pages and guides, though, an encyclopedia-style rewrite is often necessary to move from 500-800 words to 2,500-4,000 structured words. An initial audit is essential to decide page by page.

Will AI Overviews stop hallucinating soon?

Hallucinations will decrease (88% to 50% on Gemini in one generation) but won't disappear. LLMs predict the next token on statistical grounds, a mechanism that inherently produces plausible but false statements. RAG (Retrieval-Augmented Generation) architectures lower the rate without bringing it to zero. Planning your strategy on the assumption that LLMs will be perfectly factual by 2027 is risky; planning on the assumption that factuality remains rare is much more prudent.

What budget for transforming 100 product sheets into encyclopedia pages?

Realistic budget: €15,000 to €45,000 excluding tax, depending on the sector's technicality and the desired Schema.org depth. This covers the initial audit, rewriting (3,000-4,000 words per page), Schema.org deployment, versioning, and 90 days of KPI tracking. The savings of "we'll do it in-house" are usually illusory: the total cost (internal time, training, factual errors) typically exceeds the budget for specialized support.

How do we verify if a brand is actually cited by LLMs?

Three complementary methods. One: manually test 30 to 50 pivot queries on ChatGPT, Perplexity, and Gemini, and note the citations. Two: deploy a specialized tool (Otterly.ai, Profound, Peec.ai) that automates tracking across hundreds of queries. Three: analyze referrers in GA4 to spot visits coming directly from Perplexity or ChatGPT Search. Combining all three gives a reliable view that no single tool provides.

Stéphane Jambu


SEO & AI Engineer

I build growth systems / AI / Neuroscience | 650+ clients · 80 LinkedIn testimonials · 30 years of expertise · 15 years of systems running without me.
