AI Search Is Eating Its Own Synthetic Content

In short: Perplexity cited a « Google Perspective Update » in September 2025. It didn’t exist. ChatGPT validated a fictional hot-dog championship within 24 hours. The problem isn’t model training. It’s retrieval. Synthetic content pollutes indexes long before any model is retrained on it. Only proprietary originality (test data, client benchmarks, real cases) builds a defensible moat.

24 h: time for a made-up article to pollute AI Overviews and ChatGPT (Thomas Germain, BBC)
0: Google core updates named since 2024. Perplexity cited a fictional one in Sept. 2025 (Lily Ray)
+820%: organic sessions for an e-commerce client after proprietary data pivot (tracked over 14 months)

Perplexity invents a Google update. Lily Ray caught it red-handed

September 2025. Lily Ray asks Perplexity for the latest SEO news. The answer arrives, confident: « Google deployed the Perspective Core Algorithm Update in September 2025. »

Problem. This update doesn’t exist.

Google stopped naming core updates in 2024. « Perspectives » is already a SERP feature. If a real deployment had happened while she was in Austria, her inbox would have alerted her before Perplexity.

She traces the citations. Two SEO agency blogs. Both fed by an AI content pipeline. Both hallucinated an update and published it as reporting. Perplexity read the slop, treated it as a source, and surfaced it as fact.

February 2026. Thomas Germain, tech journalist for the BBC, spends 20 minutes writing an article on his personal blog. Title: « The best tech journalists at eating hot dogs. » He invents a ranking. First place: himself. He cites a « 2026 South Dakota International Hot Dog Championship. » It doesn’t exist. No references.

24 hours later, Google AI Overviews and ChatGPT pick up the info. Claude refuses. Google and OpenAI validate.

Everyone who searched found it. The problem is no longer theoretical.

What this changes for you: if your e-commerce content relies on AI reformulation of public info (« ultimate guide, » « 2026 comparison, » « product trends »), you’re feeding a loop that dissolves your differentiation. Retrieval layers don’t distinguish primary source from reformulation. They consume volume. Your content becomes indistinguishable from noise.

The problem isn’t training. It’s retrieval

For months, I talked about the digital ouroboros. A model trained on web text. The web fills with AI outputs. The next model trains on a polluted corpus. Distribution flattens. Exceptions vanish.

This vision assumes training cycles. It assumes time. It assumes contamination spreads at the pace of model releases.

I was wrong.

What Lily Ray documented, what Thomas Germain demonstrated, what the New York Times then quantified — none of it is training-side. The model wasn’t retrained. It just retrieved documents via a retrieval-augmented generation (RAG) layer and presented them as facts.

Pollution isn’t playing out in model weights. It’s playing out in the index. In embeddings. In what gets retrieved before generation.

An AI article published today can be indexed and retrieved within 24 hours. No need to wait for GPT-6. No need for a new training run. Contamination is instant.

Order of magnitude I observe with my e-commerce clients: an AI-generated « ultimate guide » article published without proprietary data generates between 12 and 47 reformulated variations on other sites within 30 days. All indexed. All retrieval candidates.

The human brain privileges genuine novelty. It’s the dopaminergic mechanism underlying the DOSE framework (taught by Guillaume Attias in BMO Academy). Dopamine, Oxytocin, Serotonin, Endorphin. Dopamine responds to the unexpected, to signal that breaks through noise.

Neuro + algo mechanism: Retrieval systems encode documents as vector embeddings. If 47 articles reformulate the same info, their embeddings converge, and which one gets retrieved is close to arbitrary. The reader’s brain doesn’t release dopamine when it receives this info. It recognizes a repetition. Result: no memorization, no action, no conversion. The neuro/algo alignment breaks. Your content is technically visible but cognitively invisible.
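To make the convergence concrete, here is a minimal sketch. It assumes a toy bag-of-words vector as a stand-in for a real embedding model (production systems use learned dense embeddings, which this does not reproduce), and the three sample sentences are illustrative. It only shows the direction of the effect: two reformulations sit close together, while a documented test sits far from both.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two reformulations of the same generic guide, and one proprietary test.
guide_a = "The ultimate guide to the best waterproof hiking jacket for 2025"
guide_b = "Best waterproof hiking jacket 2025: the ultimate buying guide"
lab_test = "We subjected jacket X to 3,000 abrasion cycles per ISO 12947-2"

sim_reformulated = cosine(bag_of_words(guide_a), bag_of_words(guide_b))
sim_proprietary = cosine(bag_of_words(guide_a), bag_of_words(lab_test))

print(f"guide vs. reformulated guide: {sim_reformulated:.2f}")
print(f"guide vs. lab test:           {sim_proprietary:.2f}")
```

The reformulated pair scores far higher than the guide/test pair, which is the sense in which reformulated documents become interchangeable retrieval candidates.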

Your only moat: data no one else can reformulate

An e-commerce client calls me in January 2025. Outdoor marketplace. 800 SKUs. 4,000 organic sessions per month. Product content = reformulated spec sheets + AI-generated « buying guides. »

Problem: Google starts citing their competitors in AI Overviews even when the query contains their brand. Worse: ChatGPT recommends a direct competitor for « best waterproof hiking jacket 2025, » citing… one of their old articles, reformulated by the other site.

We stop everything. We pivot.

We build an internal test lab. 12 flagship products. Water-resistance tests (3,000 cycles), abrasion tests (ISO 12947 standard), impermeability measurements under pressure (water column in mm). We film. We document. We publish raw results + methodology.

No « ultimate guide. » No « top 10. » Just: « We subjected jacket X to 3,000 abrasion cycles per ISO 12947-2 standard. Here are the results. »

14 months later: +820% organic sessions. AI Overviews cite their tests. ChatGPT references them as a primary source. Retrieval layers find no reformulation — because there is none.

No one can copy test data you’re the only one generating. No one can reformulate a benchmark you’re the only one building.

Content type | Reformulatable by AI | Retrievable in RAG | Defensible moat
Generic buying guide | Yes | Yes | No
Reformulated manufacturer spec sheet | Yes | Yes | No
Product test with documented protocol | No | Yes (primary source) | Yes
Customer benchmark (e.g., product lifespan across 500 orders) | No | Yes (unique data) | Yes
User interview (verbatim + context) | Partially | Yes (attributed quote) | Yes

The DOSE framework aligns here: dopamine responds to genuine novelty. If your content brings a fact the reader’s brain has never seen elsewhere, you trigger the reward circuit. If you reformulate what already exists, you fall below the detection threshold.

The self-reinforcing loop: synthetic content → retrieval → presented as fact

Here’s how it happens, step by step.

Step 1: An SEO agency publishes an AI article « Google deploys Perspective update. » Optimized title. Clean structure. No sources. Pure hallucination.

Step 2: Google indexes the article. Vector embedding created. Article enters the retrieval database.

Step 3: A user asks Perplexity « latest SEO news. » The retrieval system selects this article among others. It scores well (keywords, structure, recent date).

Step 4: Perplexity generates a response citing this article. No fact-checking. No cross-validation. The response is presented as fact.

Step 5: Other sites read Perplexity’s response, reformulate it, and publish their own articles « Google Perspective update confirmed. » New embeddings. New retrieval material.

Loop closed.

According to Search Engine Journal, this contamination now affects all RAG systems: Perplexity, ChatGPT Search, Google AI Overviews, Bing Chat. Claude resists better (refuses to cite suspect sources), but economic pressure to respond fast pushes toward less validation.

I see it with my clients. An e-commerce site publishes an AI-generated « 2026 trend. » Three weeks later, a competitor cites this trend in an article. Six weeks later, ChatGPT mentions it in a response. Nine weeks later, the original client asks me why their competitor is cited first for a trend they invented.

Answer: because the info was synthetic. No proprietary anchor. The retrieval layer doesn’t distinguish primary source from amplified echo. It selects what best matches the query + what has the strongest freshness signals.
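A minimal sketch of that selection logic, under stated assumptions: the scoring formula (query-term overlap plus a weighted freshness bonus), the `freshness_weight` value, the example URLs, and the invented trend wording are all hypothetical. No real retrieval system is claimed to work exactly this way. The point is only that a scorer with no notion of provenance ranks the fresh echo above the original.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    text: str
    age_days: int
    is_primary_source: bool  # known to us, never read by the scorer


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, doc: Doc, freshness_weight: float = 0.5) -> float:
    """Hypothetical score: lexical match plus freshness, no provenance check."""
    q = tokens(query)
    lexical = len(q & tokens(doc.text)) / len(q) if q else 0.0
    freshness = 1.0 / (1.0 + doc.age_days)
    return lexical + freshness_weight * freshness


docs = [
    # The site that first published the trend, nine weeks ago.
    Doc("https://original.example/trend",
        "2026 trend: modular outdoor jackets", 63, True),
    # A competitor's recent reformulation of the same claim.
    Doc("https://echo.example/trend",
        "2026 trend confirmed: modular outdoor jackets", 2, False),
]

query = "2026 outdoor jackets trend"
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)

for d in ranked:
    print(f"{score(query, d):.3f}  {d.url}  primary={d.is_primary_source}")
```

Both documents match the query equally well, so freshness alone decides, and the echo comes first.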

Counter-intuitive: The more « SEO-optimized » your content is in the classic sense (keywords, Hn structure, readability), the more vulnerable it is to AI reformulation. Retrieval systems privilege readability and lexical match. A perfectly structured article without proprietary data becomes an ideal candidate for retrieval… and for dilution.

Practical application for an e-commerce site

You sell products. You probably have a blog. You may publish guides, comparisons, « 2026 trends. »

Here’s what happens if that content is AI-generated or reformulated without proprietary data:

  • AI Overviews will cite your competitors even for queries where you rank 1-3.
  • ChatGPT will reformulate your content and present it without citing you, because it found a fresher reformulation elsewhere.
  • Your « ultimate guide » articles generate traffic but no conversion, because the reader has seen 12 versions of the same info elsewhere.

Here’s what changes if you pivot to proprietary data:

  • Documented product tests: protocol + raw results + photos/video. Example: « We tested 8 office chairs over 6 months. Here’s average caster lifespan by floor type. »
  • Customer benchmark: anonymized aggregation of real data. Example: « Across 1,200 mattress orders, here’s return distribution by firmness level. »
  • User interviews: verbatim + usage context. Example: « Marie, physical therapist, used this yoga mat 5 times weekly for 18 months. Here’s what she says. »
  • Real-world comparisons: not « best product 2026 » but « we used the 3 best-selling references for 90 days. Here’s what broke, what held, what surprised us. »

Order of magnitude: among my e-commerce clients who pivoted to at least 60% proprietary-data content, the AI Overviews citation rate moves from ~8% to ~34% in 6 months. Absolute traffic doesn’t spike immediately. But retention does. Users return. They bookmark. They share.
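Since these figures mix rates and growth percentages, a short worked calculation helps read them. It separates percentage points from relative change for the ~8% to ~34% citation rate, then applies the +820% growth from the outdoor marketplace case. Two assumptions: the rates are approximate, so the outputs are orders of magnitude, and the 4,000 monthly sessions are taken as the baseline for the +820%, which the case description suggests but does not state outright.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change between two values, in percent."""
    return (after - before) * 100 / before


def apply_growth(baseline: float, growth_pct: float) -> float:
    """Value reached after growing `baseline` by `growth_pct` percent."""
    return baseline * (100 + growth_pct) / 100


# Citation rate: ~8% -> ~34% (approximate figures).
citation_points = 34 - 8                      # gain in percentage points
citation_relative = relative_change_pct(8, 34)  # gain relative to the start

# Sessions: assumed baseline of 4,000 per month, +820%.
sessions_after = apply_growth(4000, 820)

print(f"Citation rate: +{citation_points} points, +{citation_relative:.0f}% relative")
print(f"Monthly sessions after +820%: {sessions_after:.0f}")
```

A move from ~8% to ~34% is about 26 points, or roughly a fourfold increase, and +820% on 4,000 sessions lands near 36,800 per month.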

Why? Because the brain recognizes information it can’t get elsewhere. Dopamine. Novel signal. Reinforced memorization.

On the retrieval side, the system detects a primary source. No competing reformulations. No dilution. Your URL becomes the reference.

What I do differently since September 2025

Before September, I built classic semantic clusters. URL architecture. Internal linking. Structured content. It worked.

Since Lily Ray documented Perplexity contamination, I’ve adjusted 3 things.

1. Retrieval vulnerability audit

Before launching a cluster, I identify reformulatable content. I flag it. I replace it with proprietary assets or delete it. Order of magnitude: 40% of an average e-commerce site’s content is reformulatable by AI in 24 hours. That 40% generates zero moat.
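One possible way to automate the flagging step, as a rough sketch and not the audit itself: compare each of your pages against pages found elsewhere using word-shingle Jaccard overlap, and flag anything above a threshold. The function names, the 3-word shingle size, the 0.3 threshold, and the sample pages are all assumptions chosen for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word sequences from the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_reformulatable(own_pages: dict[str, str],
                        external_pages: dict[str, str],
                        threshold: float = 0.3) -> list[tuple[str, float]]:
    """Return (url, best_overlap) for own pages that closely match any external page."""
    flagged = []
    for url, text in own_pages.items():
        own = shingles(text)
        best = max((jaccard(own, shingles(t)) for t in external_pages.values()),
                   default=0.0)
        if best >= threshold:
            flagged.append((url, round(best, 2)))
    return flagged


own_pages = {
    "/guide": "The ultimate guide to choosing a waterproof hiking jacket in 2025",
    "/test": "We subjected jacket X to 3,000 abrasion cycles per ISO 12947-2. "
             "Here are the results.",
}
external_pages = {
    "competitor": "The ultimate guide to choosing a waterproof hiking jacket this year",
}

flagged = flag_reformulatable(own_pages, external_pages)
print(flagged)
```

Here the generic guide is flagged and the documented test is not. Shingle overlap only catches near-verbatim reformulation, so paraphrases would need a semantic comparison on top.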

2. Proprietary data protocol

I don’t start a project anymore without a proprietary data generation plan. Either the client has data already (product returns, support tickets, behavioral analytics), or we set up an internal lab (tests, benchmark, interviews). Content builds around this data. Not the reverse.

3. Source traceability in content

Every stat, every test result, every quote is sourced. Not for Google. For retrieval layers. If ChatGPT or Perplexity cites your content, it must be able to distinguish « reformulation of public info » from « primary data. » Traceability creates this distinction.
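One way to make that traceability machine-readable is structured data. The sketch below builds a schema.org `Dataset` description for a published test as JSON-LD. Treat it as an assumption-heavy illustration: whether any given retrieval layer reads this markup is not established here, the helper name `build_test_report_jsonld` is hypothetical, and the publisher name and date are placeholders to replace with your own.

```python
import json


def build_test_report_jsonld(name: str, technique: str, variable: str,
                             publisher: str, date_published: str) -> dict:
    """Describe a published product test as a schema.org Dataset (JSON-LD)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "measurementTechnique": technique,
        "variableMeasured": variable,
        "creator": {"@type": "Organization", "name": publisher},
        "datePublished": date_published,
    }


markup = build_test_report_jsonld(
    name="Jacket X abrasion test",
    technique="ISO 12947-2, 3,000 abrasion cycles",
    variable="Fabric abrasion resistance",
    publisher="Example Outdoor Shop",   # placeholder
    date_published="YYYY-MM-DD",        # placeholder, fill in the real date
)

# Snippet to embed in the page's HTML.
snippet = ('<script type="application/ld+json">\n'
           + json.dumps(markup, indent=2)
           + "\n</script>")
print(snippet)
```

The same visible page should still state the protocol, the raw results, and who ran the test. The markup only repeats that information in a form a parser can pick up.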

Results observed across 9 deployments between September 2025 and March 2026:

  • AI Overviews citation rate: +340% (median)
  • Session duration: +28% (users read to the end)
  • Returning visitors: +52% (they return for data, not SEO)

Absolute traffic doesn’t spike immediately. But traffic quality changes. And in a loop where synthetic content pollutes retrieval, quality becomes the only lever that matters.

Neuro + algo mechanism: The brain encodes memories by their distinctiveness. Information repeated 10 times creates weak encoding. Unique information linked to specific context creates strong encoding + dopamine release. Retrieval layers encode documents as vectors. If 10 documents say the same thing, their vectors converge, and which one the system picks is close to arbitrary. If one document says something unique, its vector is isolated. The system privileges it. Brain and algo converge: proprietary originality wins both ways.

What if your content is already in the loop?

Here’s the question I’ve asked every e-commerce client since January 2026.

Open ChatGPT. Ask it to summarize your best blog article. The one that drives the most traffic.

Look at the response. Does it cite your site? Or does it reformulate your content citing three other sites that copied your angle?

If it doesn’t cite you, your content is already in the loop. It feeds retrieval. It’s presented as fact. But you’re no longer the source.
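If you run this check across many articles, the comparison can be scripted once you have the URLs an assistant cited. The sketch below assumes you have already collected those URLs by hand or by some other means (no assistant API is called here), and the function names, status labels, and example domains are hypothetical.

```python
from urllib.parse import urlparse


def is_own_url(url: str, own_domain: str) -> bool:
    """True if the URL's host is the given domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == own_domain or host.endswith("." + own_domain)


def citation_status(cited_urls: list[str], own_domain: str) -> str:
    """Classify one assistant response from the URLs it cited."""
    if any(is_own_url(u, own_domain) for u in cited_urls):
        return "cited"
    if cited_urls:
        return "in the loop: others cited"
    return "no citations"


# URLs copied from a response summarizing your article (illustrative).
response_citations = [
    "https://competitor-a.example/hiking-jackets",
    "https://competitor-b.example/blog/best-jackets",
]

print(citation_status(response_citations, "myshop.example"))
```

A « cited » result means you are still the source for that answer. Anything else is a page to review against the deletion question that follows.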

Now ask yourself the inverse question. If you deleted this article tomorrow, would anything change on the web? Would a piece of data, a test, a benchmark disappear?

If the answer is no, you have no moat. You have volume. Volume doesn’t defend anymore.

Retrieval systems privilege primary sources when identifiable. The human brain privileges genuine novelty when memorable. There’s an alignment between the two. You can exploit it. Or you can keep feeding a loop that makes you interchangeable.

I’m not selling you the method. I’m showing you the pages. Mine, my clients’, the ones that resist contamination because they contain data no one else can reformulate.

The question isn’t « does your content rank? » anymore. The question is: would your content survive instant AI reformulation without losing its value?

Retrieval audit: identify vulnerable content

I scan your site against retrieval systems (Perplexity, ChatGPT Search, AI Overviews). I show you what gets cited, what gets diluted, what gets reformulated without attribution. First call = live audit, no deck.

Book a strategic call — 45 min

Frequently Asked Questions

What is retrieval contamination?

Retrieval contamination happens when synthetic AI content enters a retrieval system’s index (Perplexity, ChatGPT Search, AI Overviews) and is presented as fact, with no model retraining involved. Timeframe: 24 hours.

Why aren’t my well-ranked articles cited in AI Overviews?

Because retrieval layers don’t privilege classic SEO ranking. They privilege vector similarity + freshness + primary source identifiability. If your content is reformulatable, it gets diluted.

What counts as exploitable proprietary data in e-commerce SEO?

Documented product test (protocol + results), anonymized customer benchmark (e.g., return rate by category), user interview (verbatim + context), real-world comparison over long duration. Anything that can’t be copied.

Should I stop publishing buying guides?

No. But replace generic content with proprietary data. « Best product 2026 » creates no moat. « We tested 8 products for 90 days, here are the results » does.

How do I check if my content is in the contamination loop?

Ask ChatGPT to summarize your best article. If it reformulates without citing you, or cites a competitor who copied your angle, you’re in the loop. If nothing would change on the web if you deleted that article, you have no moat.

Stéphane Jambu

SEO & AI Engineer

I build growth systems / AI / Neuroscience | 650+ clients · 80 LinkedIn testimonials · 30 years of expertise · 15 years of systems running without me.
