AI Search Is Eating Its Own Synthetic Content
Perplexity invents a Google update. Lily Ray caught it red-handed
September 2025. Lily Ray asks Perplexity for the latest SEO news. The answer arrives, confident: « Google deployed the Perspective Core Algorithm Update in September 2025. »
Problem. This update doesn’t exist.
Google stopped naming core updates in 2024. « Perspectives » is already a SERP feature. If a real update had rolled out while she was in Austria, her inbox would have alerted her before Perplexity did.
She traces the citations. Two SEO agency blogs. Both fed by an AI content pipeline. Both hallucinated an update and published it as reporting. Perplexity read the slop, treated it as a source, and surfaced it as fact.
February 2026. Thomas Germain, tech journalist for the BBC, spends 20 minutes writing an article on his personal blog. Title: « The best tech journalists at eating hot dogs. » He invents a ranking. First place: himself. He cites a « 2026 South Dakota International Hot Dog Championship. » It doesn’t exist. No references.
24 hours later, Google AI Overviews and ChatGPT pick up the info. Claude refuses. Google and OpenAI validate.
Everyone who searched found it. The problem is no longer theoretical.
The problem isn’t training. It’s retrieval
For months, I talked about the digital ouroboros. A model trained on web text. The web fills with AI outputs. The next model trains on a polluted corpus. Distribution flattens. Exceptions vanish.
This vision assumes training cycles. It assumes time. It assumes contamination spreads at the pace of model releases.
I was wrong.
What Lily Ray documented, what Thomas Germain demonstrated, what the New York Times then quantified — none of it is training-side. The model wasn’t retrained. It just retrieved documents via a retrieval-augmented generation (RAG) layer and presented them as facts.
Pollution isn’t playing out in model weights. It’s playing out in the index. In embeddings. In what gets retrieved before generation.
An AI article published today can be indexed and retrieved within 24 hours. No need to wait for GPT-6. No need for a new training run. Contamination is instant.
Order of magnitude I observe with my e-commerce clients: an AI-generated « ultimate guide » article published without proprietary data generates between 12 and 47 reformulated variations on other sites within 30 days. All indexed. All retrieval candidates.
The human brain favors genuine novelty. That is the dopaminergic mechanism behind the DOSE framework (taught by Guillaume Attias in BMO Academy): Dopamine, Oxytocin, Serotonin, Endorphin. Dopamine responds to the unexpected, to the signal that breaks through the noise.
Your only moat: data no one else can reformulate
An e-commerce client calls me in January 2025. Outdoor marketplace. 800 SKUs. 4,000 organic sessions per month. Product content = reformulated spec sheets + AI-generated « buying guides. »
Problem: Google starts citing their competitors in AI Overviews even when the query contains their brand. Worse: ChatGPT recommends a direct competitor for « best waterproof hiking jacket 2025, » citing… one of their old articles, reformulated by the other site.
We stop everything. We pivot.
We build an internal test lab. 12 flagship products. Abrasion tests (3,000 cycles per ISO 12947-2), water-resistance tests, impermeability measurements under pressure (water column in mm). We film. We document. We publish raw results + methodology.
No « ultimate guide. » No « top 10. » Just: « We subjected jacket X to 3,000 abrasion cycles per ISO 12947-2 standard. Here are the results. »
14 months later: +820% organic sessions. AI Overviews cite their tests. ChatGPT references them as a primary source. Retrieval layers find no reformulation — because there is none.
No one can copy test data you’re the only one generating. No one can reformulate a benchmark you’re the only one building.
| Content type | Reformulatable by AI | Retrievable in RAG | Defensible moat |
|---|---|---|---|
| Generic buying guide | Yes | Yes | No |
| Reformulated manufacturer spec sheet | Yes | Yes | No |
| Product test with documented protocol | No | Yes (primary source) | Yes |
| Customer benchmark (e.g., product lifespan across 500 orders) | No | Yes (unique data) | Yes |
| User interview (verbatim + context) | Partially | Yes (attributed quote) | Yes |
The DOSE framework aligns here: dopamine responds to genuine novelty. If your content brings a fact the reader’s brain has never seen elsewhere, you trigger the reward circuit. If you reformulate what already exists, you fall below the detection threshold.
The self-reinforcing loop: synthetic content → retrieval → presented as fact
Here’s how it happens, step by step.
Step 1: An SEO agency publishes an AI article « Google deploys Perspective update. » Optimized title. Clean structure. No sources. Pure hallucination.
Step 2: Google indexes the article. Vector embedding created. Article enters the retrieval database.
Step 3: A user asks Perplexity « latest SEO news. » The retrieval system selects this article among others. It scores well (keywords, structure, recent date).
Step 4: Perplexity generates a response citing this article. No fact-checking. No cross-validation. The response is presented as fact.
Step 5: Other sites read Perplexity’s response, reformulate it, and publish their own articles « Google Perspective update confirmed. » New embeddings. New retrieval material.
Loop closed.
According to Search Engine Journal, this contamination now affects all RAG systems: Perplexity, ChatGPT Search, Google AI Overviews, Bing Chat. Claude resists better (refuses to cite suspect sources), but economic pressure to respond fast pushes toward less validation.
I see it with my clients. An e-commerce site publishes an AI-generated « 2026 trend. » Three weeks later, a competitor cites this trend in an article. Six weeks later, ChatGPT mentions it in a response. Nine weeks later, the original client asks me why their competitor is cited first for a trend they invented.
Answer: because the info was synthetic. No proprietary anchor. The retrieval layer doesn’t distinguish primary source from amplified echo. It selects what best matches the query + what has the strongest freshness signals.
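To make that selection logic concrete, here is a toy scorer in Python. It is a sketch under stated assumptions, not how Perplexity, ChatGPT or Google actually rank: the `Doc` class, the `score` formula (query-term coverage multiplied by an exponential freshness decay), the 30-day half-life, and the example URLs, texts and dates are all hypothetical. It only illustrates why an echo published three weeks after the original can come out on top when both match the query equally.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    url: str
    text: str
    published: date


def terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, doc: Doc, today: date, half_life_days: float = 30.0) -> float:
    """Hypothetical score: share of query terms found in the doc, times freshness."""
    q = terms(query)
    coverage = len(q & terms(doc.text)) / len(q) if q else 0.0
    age_days = (today - doc.published).days
    freshness = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return coverage * freshness


def rank(query: str, docs: list[Doc], today: date) -> list[Doc]:
    """Best-scoring document first."""
    return sorted(docs, key=lambda d: score(query, d, today), reverse=True)


today = date(2026, 3, 2)
# The client's original article, published nine weeks ago (made-up data).
original = Doc(
    "https://client.example/trend",
    "Our 2026 trend: modular hiking jackets",
    date(2025, 12, 29),
)
# A competitor's reformulation, published three weeks after the original.
echo = Doc(
    "https://competitor.example/trend",
    "Modular hiking jackets, the 2026 trend everyone is talking about",
    date(2026, 1, 19),
)

ranked = rank("2026 trend hiking jackets", [original, echo], today)
for doc in ranked:
    print(doc.url, round(score("2026 trend hiking jackets", doc, today), 3))
```

Nothing in this score rewards being first. With equal query coverage, the later copy wins on freshness alone.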
Practical application for an e-commerce site
You sell products. You probably have a blog. You maybe publish guides, comparisons, « 2026 trends. »
Here’s what happens if that content is AI-generated or reformulated without proprietary data:
- AI Overviews will cite your competitors even for queries where you rank 1-3.
- ChatGPT will reformulate your content and present it without citing you, because it found a fresher reformulation elsewhere.
- Your « ultimate guide » articles generate traffic but no conversion, because the reader has seen 12 versions of the same info elsewhere.
Here’s what changes if you pivot to proprietary data:
- Documented product tests: protocol + raw results + photos/video. Example: « We tested 8 office chairs over 6 months. Here’s average caster lifespan by floor type. »
- Customer benchmark: anonymized aggregation of real data. Example: « Across 1,200 mattress orders, here’s return distribution by firmness level. »
- User interviews: verbatim + usage context. Example: « Marie, physical therapist, used this yoga mat 5 times weekly for 18 months. Here’s what she says. »
- Real-world comparisons: not « best product 2026 » but « we used the 3 best-selling references for 90 days. Here’s what broke, what held, what surprised us. »
Order of magnitude: among my e-commerce clients who moved at least 60% of their content to proprietary data, the AI Overviews citation rate rises from ~8% to ~34% within 6 months. Absolute traffic doesn’t spike immediately. But retention does. Users return. They bookmark. They share.
Why? Because the brain recognizes information it can’t get elsewhere. Dopamine. Novel signal. Reinforced memorization.
On the retrieval side, the system detects a primary source. No competing reformulations. No dilution. Your URL becomes the reference.
What I do differently since September 2025
Before September, I built classic semantic clusters. URL architecture. Internal linking. Structured content. It worked.
Since Lily Ray documented Perplexity contamination, I’ve adjusted 3 things.
1. Retrieval vulnerability audit
Before launching a cluster, I identify reformulatable content. I flag it. I replace it with proprietary assets or delete it. Order of magnitude: 40% of an average e-commerce site’s content is reformulatable by AI in 24 hours. That 40% generates zero moat.
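The article does not describe how this audit is run. As a rough first filter, the sketch below swaps in a standard near-duplicate measure, Jaccard similarity over 3-word shingles: a page whose wording already overlaps heavily with pages on other sites is treated as reformulatable. The function names, the 0.3 threshold and the sample texts are assumptions made for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles (consecutive word windows) of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_reformulatable(
    pages: dict[str, str],
    external_texts: list[str],
    threshold: float = 0.3,  # assumed cut-off, tune on your own corpus
) -> list[tuple[str, float]]:
    """Return (url, best_similarity) for pages that overlap with external text."""
    flagged = []
    for url, text in pages.items():
        own = shingles(text)
        best = max((jaccard(own, shingles(t)) for t in external_texts), default=0.0)
        if best >= threshold:
            flagged.append((url, round(best, 2)))
    return flagged


# Made-up sample data: one generic guide, one documented test.
pages = {
    "/guide": "The best waterproof hiking jacket keeps you dry in heavy rain",
    "/test": "We subjected jacket X to 3,000 abrasion cycles and measured fabric loss",
}
external_texts = [
    "The best waterproof hiking jacket keeps you dry in heavy rain and wind",
]

flagged = flag_reformulatable(pages, external_texts)
print(flagged)
```

Paraphrases share few exact shingles, so this only catches the closest copies. An embedding-based comparison would reach further, at the cost of an external model.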
2. Proprietary data protocol
I no longer start a project without a plan for generating proprietary data. Either the client already has data (product returns, support tickets, behavioral analytics), or we set up an internal lab (tests, benchmarks, interviews). Content is built around this data. Not the reverse.
3. Source traceability in content
Every stat, every test result, every quote is sourced. Not for Google. For retrieval layers. If ChatGPT or Perplexity cites your content, it must be able to distinguish « reformulation of public info » from « primary data. » Traceability creates this distinction.
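One possible way to make that traceability machine-readable is schema.org markup. The sketch below builds a `Dataset` record as JSON-LD in Python, using the jacket test mentioned earlier in the article. The organization name, dataset name and publication date are placeholders, and whether any given retrieval layer actually reads this markup is an assumption, not something the article establishes.

```python
import json

# Placeholder values throughout; only the test figures (3,000 abrasion
# cycles, ISO 12947-2) come from the article itself.
test_record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Jacket X abrasion test",
    "description": "Jacket X subjected to 3,000 abrasion cycles per ISO 12947-2.",
    "measurementTechnique": "ISO 12947-2",
    "variableMeasured": "Abrasion cycles completed",
    "creator": {"@type": "Organization", "name": "Example Outdoor Marketplace"},
    "datePublished": "2025-06-01",  # placeholder date
}


def to_json_ld(record: dict) -> str:
    """Serialize a record as JSON-LD text, ready to embed in a page."""
    return json.dumps(record, indent=2, ensure_ascii=False)


markup = to_json_ld(test_record)
print(markup)
```

The output would go inside a `<script type="application/ld+json">` element on the page that publishes the test results.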
Results observed across 9 deployments between September 2025 and March 2026:
- AI Overviews citation rate: +340% (median)
- Session duration: +28% (users read to the end)
- Returning visitors: +52% (they return for data, not SEO)
Absolute traffic doesn’t spike immediately. But traffic quality changes. And in a loop where synthetic content pollutes retrieval, quality becomes the only lever that matters.
What if your content is already in the loop?
Here’s the question I’ve asked every e-commerce client since January 2026.
Open ChatGPT. Ask it to summarize your best blog article. The one that drives the most traffic.
Look at the response. Does it cite your site? Or does it reformulate your content citing three other sites that copied your angle?
If it doesn’t cite you, your content is already in the loop. It feeds retrieval. It’s presented as fact. But you’re no longer the source.
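If you paste the answer into a text file, the check can be roughed out in a few lines of Python. This is a minimal sketch that assumes the answer contains its citations as plain URLs; the function names, the sample answer and the domains are hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to whitespace or a closing bracket/quote.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def cited_domains(answer: str) -> list[str]:
    """Unique hostnames cited in an answer, in order of appearance."""
    domains = []
    for url in URL_RE.findall(answer):
        host = (urlparse(url.rstrip(".,;")).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in domains:
            domains.append(host)
    return domains


def citation_check(answer: str, own_domain: str) -> dict:
    """Report whether own_domain (or a subdomain) is cited, and who else is."""
    own = own_domain.lower()

    def is_own(host: str) -> bool:
        return host == own or host.endswith("." + own)

    domains = cited_domains(answer)
    return {
        "cites_you": any(is_own(d) for d in domains),
        "other_sources": [d for d in domains if not is_own(d)],
    }


# Made-up answer text standing in for a pasted ChatGPT response.
answer = (
    "Waterproof ratings are measured in mm of water column "
    "(https://www.competitor-a.example/guide). "
    "See also https://competitor-b.example/jackets."
)

result = citation_check(answer, "client.example")
print(result)
```

If `cites_you` is false while `other_sources` lists sites that reuse your angle, the article is feeding the loop without being credited.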
Now ask yourself the inverse question. If you deleted this article tomorrow, would anything change on the web? Would a piece of data, a test, a benchmark disappear?
If the answer is no, you have no moat. You have volume. Volume doesn’t defend anymore.
Retrieval systems favor primary sources when they are identifiable. The human brain favors genuine novelty when it is memorable. The two align. You can exploit that alignment. Or you can keep feeding a loop that makes you interchangeable.
I’m not selling you the method. I’m showing you the pages. Mine, my clients’, the ones that resist contamination because they contain data no one else can reformulate.
The question isn’t « does your content rank? » anymore. The question is: would your content survive instant AI reformulation without losing its value?
Retrieval audit: identify vulnerable content
I scan your site against retrieval systems (Perplexity, ChatGPT Search, AI Overviews). I show you what gets cited, what gets diluted, what gets reformulated without attribution. First call = live audit, no deck.
Book a strategic call (45 min)
Frequently Asked Questions
What is retrieval contamination?
It happens when synthetic AI content enters a retrieval system’s index (Perplexity, ChatGPT Search, AI Overviews) and is presented as fact, with no model retraining involved. Timeframe: as little as 24 hours.
Why aren’t my well-ranked articles cited in AI Overviews?
Because retrieval layers don’t favor classic SEO ranking. They favor vector similarity + freshness + primary source identifiability. If your content is reformulatable, it gets diluted.
What counts as exploitable proprietary data in e-commerce SEO?
Documented product test (protocol + results), anonymized customer benchmark (e.g., return rate by category), user interview (verbatim + context), real-world comparison over long duration. Anything that can’t be copied.
Should I stop publishing buying guides?
No. But replace generic guides with proprietary data. « Best product 2026 » creates no moat. « We tested 8 products for 90 days, here are the results » does.
How do I check if my content is in the contamination loop?
Ask ChatGPT to summarize your best article. If it reformulates without citing you, or cites a competitor who copied your angle, you’re in the loop. If nothing would change on the web if you deleted that article, you have no moat.

