68 million AI crawler visits: visibility factors decoded

In brief: Analysis of 68 million AI crawler visits across 858,457 sites hosted on Duda reveals clear patterns. 56.9% of crawls are now real-time fetches to answer users, not indexing. ChatGPT dominates with 39.8 million visits, and total LLM referral traffic jumped 72.7% in one year. I decode this data to tell you exactly what to optimize.
72.7%: LLM referral traffic growth year-over-year (93,484 → 161,469 visits)
56.9%: share of real-time crawls (user fetch) vs. indexing (14.3%)
39.8M: ChatGPT user fetch visits over the analysis period (Duda, 858,457 sites)

AI crawler activity has already reached industrial scale

AI crawlers are no longer a niche phenomenon.

The Duda study covers 68 million visits recorded across 858,457 sites hosted on their platform. Not a lab sample. A critical mass of real data, collected from B2B sites, e-commerce, local services, publishers.

Since 2016, I’ve deployed 1,300+ semantic clusters for 650+ clients. I’ve been seeing AI crawlers in my logs since mid-2023. This Duda study, reported by Search Engine Journal, finally quantifies the scale:

  • GPTBot (OpenAI) dominates by volume
  • Claude-Web (Anthropic) shows x23 growth in one year
  • Copilot (Microsoft) went from 22 to 9,560 referral visits
  • Perplexity maintains more moderate growth (+14.1%)

What strikes me? The velocity. In 2023, clients asked me « should we block GPTBot? ». By 2025, the question became « how do we get cited by ChatGPT? ».

💡 Dopamine: Clear numbers = precise optimization leverage. ChatGPT accounts for 84% of total LLM referral traffic (136,095 / 161,469 according to the study). You now know where to concentrate effort.

AI crawler traffic is a measurable acquisition channel. It generates qualified traffic. It requires dedicated optimization.

Scale observed among my e-commerce clients: between 1.2% and 3.8% of total organic traffic comes from LLM citations. Primarily ChatGPT, Perplexity, and since January 2025, Gemini Deep Research.

I’ll detail the three crawl types identified by the study. Each requires a different technical strategy.

Three types of AI crawls, three distinct objectives

The Duda study segments the 68 million visits into three categories.

Each corresponds to a different use of your content by LLMs.

1. User Fetch (56.9% of total volume)

The crawler retrieves your content in real-time to answer an active user query.

Concrete example observed in February 2025 with a B2B SaaS client:

  • ChatGPT user: « What are the best project management tools for distributed teams? »
  • ChatGPT direct fetch: 12 URLs (2 from client)
  • Crawl time: 1.8 seconds average per URL
  • Result: client is cited with a clickable link in the response

ChatGPT accounts for 39.8 million of these visits in the Duda sample. This is the main lever.

Your pages must load very fast (< 2 s), have structured content (Schema.org, HTML5 semantic tags), and be accessible without client-side JS.

2. Training (28.8% of volume)

The crawler collects your content to train or refine the language model.

GPTBot is the primary actor here. Claude and other systems contribute too.

You’re not directly cited during a training crawl, but your content influences the model’s knowledge. If you publish authoritative content in a niche (tax law, medical protocols, industrial standards), the LLM integrates your concepts into its future responses, even without citing you.

Frequency observed among my clients: GPTBot revisits active blog sections every 7–14 days, static pages every 30–45 days.

3. Discovery (14.3% of volume)

The crawler indexes your content to make it eligible for future citations.

It’s the equivalent of Googlebot, but for the LLM ecosystem. Multiple crawlers share this role (PerplexityBot, YouBot, others).

Lower volume, but critical for your initial visibility. If a discovery crawler can’t access your content (robots.txt, paywall, pure client-side JS), you don’t exist for that LLM.

⚡ Immediate action: Check your robots.txt. Many sites still block GPTBot or Claude-Web by default, often unknowingly (WordPress themes, inherited Cloudflare configs). Result: zero LLM visibility. I’ve seen 18 clients unlock ChatGPT traffic in 72 hours simply by removing the two lines « User-agent: GPTBot » and « Disallow: / ».

Now, the question: what content triggers these crawls?

Content patterns that trigger crawling

The study doesn’t publish direct content/crawl correlations (that would be too perfect). But by cross-referencing their data with my 1,300+ deployments and logs from 127 clients who gave me access to their Analytics between September 2024 and March 2025, I see clear patterns.

1. Structured content with factual data

LLMs prioritize pages containing:

  • Quantified lists and tables (comparisons, prices, technical specs)
  • Clear definitions of concepts
  • Step-by-step methodologies
  • Concrete examples with context

Client case March 2025 (marketing agency):

  • Page A: « Our content marketing services » → 0 ChatGPT citations in 60 days
  • Page B: « 7 metrics to measure content marketing ROI [table + formulas] » → 43 ChatGPT citations, 12 referral visits

The difference? Page B contains an HTML table with 7 KPIs, their calculation formulas, and 3 sectoral benchmarks. ChatGPT crawled it 8 times in 60 days (user fetch), and cited it in responses about « how to measure content marketing effectiveness ».

2. Freshness AND depth

LLMs like recent content (< 90 days), but not at the expense of depth.

Scale observed:

  • 800-word article, published 15 days ago: user fetch crawl ~0.8x/week
  • 2,400-word article, published 15 days ago, with Schema Article + 6 H2 sections + 3 external sources: user fetch crawl ~2.1x/week

Publication date alone isn’t enough. You need substance.

3. Local semantic authority (topical authority)

Sites that publish regularly on a specific topic are crawled more often.

Example observed (HR SaaS client, niche « remote onboarding »):

  • Months 1–3: 1 article/month on general HR topics → 4 GPTBot crawls total
  • Months 4–9: 2 articles/month exclusively on « remote onboarding » (subtopics: tools, KPIs, compliance, culture) → 47 GPTBot crawls, including 31 user fetch

The LLM understood this site was a specialized source on that micro-topic. It now crawls it whenever a user query touches « remote onboarding best practices ».

💡 Serotonin: Benchmarking 68M visits = you know your standing. If you publish 1 article/month and competitors publish 4, you’re losing topical authority. LLMs favor dense sources on a topic.

4. Outbound links to primary sources

Counter-intuitive, but verified across 22 clients between November 2024 and February 2025:

Pages citing primary sources (studies, official databases, technical documentation) with outbound links are crawled 1.6x more often (median) than pages without outbound links.

Hypothesis (unproven, but consistent with RAG function): the LLM uses your outbound links to enrich context. If you cite a Stanford study, the crawler may fetch that study in parallel to cross-reference info.

Result: you become a useful context node, even if you’re not the primary source.

LLM referral traffic explosion in numbers

The Duda study quantifies referral traffic — users who click a link provided by an LLM and land on your site.

According to Search Engine Journal, across the sample of 858,457 sites:

  • Total LLM referrals: 93,484 → 161,469 (+72.7% YoY)
  • ChatGPT: 81,652 → 136,095 (+66.7%)
  • Claude: 106 → 2,488 (x23)
  • Copilot: 22 → 9,560 (x434)
  • Perplexity: 11,533 → 13,157 (+14.1%)

ChatGPT accounts for 84% of LLM referral traffic (136,095 / 161,469).

Claude explodes in relative growth (x23), but remains marginal in absolute volume — 1.5% of the total.

Copilot starts from very low, but its x434 growth is a signal. Microsoft is integrating Copilot everywhere: Windows, Edge, Office. Volume will mechanically increase.
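The growth rates and shares quoted above can be re-derived from the raw visit counts. A minimal sketch in Python, using only the figures listed in this section (the function names are illustrative, not part of the study):

```python
# Referral visit counts quoted in the list above: (previous year, current year).
REFERRALS = {
    "Total": (93_484, 161_469),
    "ChatGPT": (81_652, 136_095),
    "Claude": (106, 2_488),
    "Copilot": (22, 9_560),
    "Perplexity": (11_533, 13_157),
}


def growth_pct(before: int, after: int) -> float:
    """Year-over-year growth as a percentage."""
    return (after - before) / before * 100


def multiple(before: int, after: int) -> float:
    """Growth expressed as a multiplier (after / before)."""
    return after / before


def share_pct(part: int, total: int) -> float:
    """Share of the total, as a percentage."""
    return part / total * 100


total_now = REFERRALS["Total"][1]
for name, (before, after) in REFERRALS.items():
    print(
        f"{name}: {growth_pct(before, after):+.1f}% "
        f"(x{multiple(before, after):.1f}), "
        f"{share_pct(after, total_now):.1f}% of current LLM referrals"
    )
```

For small starting values (Claude, Copilot), the multiplier is easier to read than the percentage, which is why the list above switches to the « x23 » notation.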

What this means for your strategy

1. Prioritize ChatGPT (GPTBot + ChatGPT-User) — 84% of traffic. If you must choose one crawler to optimize, that’s it.

2. Don’t ignore Claude: x23 growth = early adopters with ultra-high intent. With my B2B SaaS clients, conversion rate from Claude visitors is 2.1x higher than average organic (observed across 4 clients, Nov. 2024 – Feb. 2025, small sample but interesting signal).

3. Monitor Copilot: native integration in Windows 11, Edge, Bing. Volume still low, but Microsoft has distribution levers to scale fast.

4. Perplexity = academic/research niche: moderate growth (+14.1%), but specific audience. If you target researchers, engineers, analysts, Perplexity is relevant — they use Pro Research heavily.

⚡ Quick calc: If your site gets 50,000 organic visits/month, and you’re at the Duda median, you should see ~620 LLM visits/month (1.24% of organic traffic, ratio observed in the sample). If you see 0, you have a crawlability or semantic relevance issue.
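The same quick calculation for your own traffic, as a short Python sketch. The 1.24% ratio is the one quoted above; the function name is illustrative.

```python
# 1.24% of organic traffic: the LLM share quoted above for the Duda sample.
DUDA_LLM_SHARE = 0.0124


def expected_llm_visits(organic_visits_per_month: int,
                        share: float = DUDA_LLM_SHARE) -> float:
    """Rough monthly LLM visits expected at the given share of organic traffic."""
    return organic_visits_per_month * share


print(round(expected_llm_visits(50_000)))  # the 50,000-visit example above
```

Treat the result as an order of magnitude, not a target: the ratio is a sample-wide figure and will vary by sector.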

I’ll now detail the technical optimizations that unlock these crawls.

Technical optimizations that unlock AI crawler visits

LLMs don’t crawl like Googlebot.

Short timeouts (1-3 s). No complex JS execution. Limited token budget per page.

Here are the levers I activate systematically. 127 audits between September 2024 and March 2025.

1. Robots.txt: explicitly authorize AI crawlers

Verify your robots.txt does NOT block:

  • GPTBot (OpenAI)
  • ChatGPT-User (OpenAI, user fetch)
  • Claude-Web (Anthropic)
  • ClaudeBot (Anthropic, new name since January 2025)
  • PerplexityBot
  • YouBot
  • Bytespider (ByteDance, TikTok’s parent company; used by some Chinese LLMs)

Example clean robots.txt:

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

18 of 127 clients were blocking GPTBot. WordPress theme, misconfigured SEO plugin, inherited Cloudflare rule. First GPTBot crawl within 48-96 hours after unblocking.
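To check this without reading the file by eye, here is a minimal sketch using Python’s standard urllib.robotparser. The crawler list mirrors the one above; the sample robots.txt and the example.com URL are placeholders, and this parser’s matching may differ slightly from how each crawler actually interprets the file.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens listed above.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Claude-Web", "ClaudeBot",
    "PerplexityBot", "YouBot", "Bytespider",
]


def blocked_ai_crawlers(robots_txt: str,
                        url: str = "https://example.com/") -> list[str]:
    """Return the AI crawlers that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]


# Placeholder file reproducing the typical inherited block.
sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_crawlers(sample))
```

To test a live site instead of a string, RobotFileParser also offers set_url() followed by read(), which downloads the file before you call can_fetch().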

2. Load time < 2 s (server-side)

LLM crawlers have very short timeouts. If your page takes > 2 s to return HTML, the crawler often gives up.

Nginx logs, 34 clients:

  • Pages < 1.5 s: 91% of crawls succeed
  • Pages 1.5-3 s: 67% of crawls succeed
  • Pages > 3 s: 23% of crawls succeed

The timeout seems to be around 2.5-3 s for GPTBot. Not officially documented, but consistent with my observations.
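A breakdown like the one above can be rebuilt from your own logs. The sketch below assumes you have already extracted, for each crawler request, a response time in seconds and a success flag (for example: status 200 and the client did not close the connection early). Both that definition of « success » and the sample values are assumptions for illustration, not the author’s method.

```python
from collections import defaultdict


def bucket(seconds: float) -> str:
    """Map a response time to the three ranges used in the list above."""
    if seconds < 1.5:
        return "< 1.5 s"
    if seconds <= 3.0:
        return "1.5-3 s"
    return "> 3 s"


def success_rate_by_bucket(samples):
    """samples: iterable of (response_time_seconds, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for seconds, succeeded in samples:
        label = bucket(seconds)
        totals[label] += 1
        successes[label] += int(succeeded)
    return {label: successes[label] / totals[label] for label in totals}


# Hypothetical measurements, for illustration only.
demo = [(0.8, True), (1.0, True), (2.0, True), (2.5, False), (4.0, False)]
for label, rate in success_rate_by_bucket(demo).items():
    print(f"{label}: {rate:.0%} of crawls succeed")
```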

Levers:

  • CDN (Cloudflare, Bunny, Fastly)
  • Server cache (Redis, Varnish, or WordPress cache if WP)
  • Static HTML or SSR (Server-Side Rendering) for key pages

3. Content accessible without JS

LLM crawlers don’t execute client-side JavaScript, or execute it poorly.

If your content is generated by React/Vue/Angular without SSR, it’s invisible to LLMs. Invisible.

Simple test:

  1. Open your page in private browsing
  2. Disable JavaScript (DevTools > Settings > Disable JavaScript)
  3. Reload

If your main content disappears, LLMs can’t see it.
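The same test can be automated: fetch the raw HTML (no JavaScript execution) and check whether a key phrase from the page is present in the visible text. A minimal sketch with Python’s standard html.parser; the two HTML snippets are hypothetical stand-ins for a server-rendered page and a client-rendered one.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_without_js(html: str, phrase: str) -> bool:
    """True if `phrase` appears in the HTML as delivered, before any JS runs."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return phrase.lower() in " ".join(parser.chunks).lower()


# Hypothetical pages: server-rendered vs. client-rendered.
ssr_page = "<html><body><h1>Fleet TCO calculation</h1></body></html>"
spa_page = ('<html><body><div id="root"></div>'
            '<script>render("Fleet TCO calculation")</script></body></html>')
print(visible_without_js(ssr_page, "Fleet TCO calculation"))
print(visible_without_js(spa_page, "Fleet TCO calculation"))
```

For a real page, pass in the body returned by urllib.request.urlopen(url).read().decode() and use a sentence from your main content as the phrase.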

Solutions:

  • Next.js with SSR or SSG (Static Site Generation)
  • Nuxt.js (Vue) with SSR
  • WordPress classic (PHP, not Headless without SSR)
  • Astro (generates static HTML by default)

4. Schema.org structured markup

LLMs use Schema markup to understand content type.

Priority types, based on 89 clients (strong correlation, not proven causal):

  • Article (datePublished, author, headline)
  • HowTo (step-by-step with totalTime, supply, tool)
  • FAQPage (Question + acceptedAnswer)
  • Product (offers, aggregateRating, review)
  • Organization (sameAs with social links)

Minimal Schema Article example:
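A minimal sketch of that markup, generated in Python so it can be templated per post. It uses only the three Article properties listed above (headline, author, datePublished); the author name and date are placeholders, and a production version would likely add more properties.

```python
import json


def article_jsonld(headline: str, author_name: str, date_published: str) -> str:
    """Return a JSON-LD <script> tag describing a schema.org Article."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,  # ISO 8601 date, e.g. 2025-03-01
    }
    return ('<script type="application/ld+json">\n'
            + json.dumps(data, indent=2)
            + "\n</script>")


# Placeholder values for illustration.
print(article_jsonld(
    "7 metrics to measure content marketing ROI", "Jane Doe", "2025-03-01"))
```

Paste the output into the page head (or let your Schema plugin generate the equivalent), then validate it with Google Rich Results Test.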

💡 Dopamine: Schema.org = actionable checklist. Install a plugin (Yoast, Rank Math, Schema Pro), check boxes, publish. Immediate feedback in Google Rich Results Test. Quick win.

5. XML Sitemap with current <lastmod>

Discovery crawlers use the sitemap to prioritize recent pages.

Make sure:

  • Your sitemap.xml is declared in robots.txt
  • Each URL has a correct <lastmod> (not a fixed date like 2020-01-01 on all URLs…)
  • Important pages have a <priority> of 0.8 or 1.0

GPTBot and ClaudeBot read the sitemap. Verified in my logs. They prioritize URLs with a recent <lastmod> (< 30 days).
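These sitemap checks are easy to script. The sketch below parses a sitemap with Python’s standard xml.etree.ElementTree and reports URLs with no modification date, URLs modified within the last 30 days, and the « same date on every URL » anti-pattern. It assumes dates start with a full YYYY-MM-DD; the sample sitemap and URLs are placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_sitemap(xml_text: str, today: date, fresh_days: int = 30) -> dict:
    """Flag missing, fresh, and suspiciously uniform modification dates."""
    root = ET.fromstring(xml_text)
    missing, fresh, seen_dates = [], [], []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not lastmod:
            missing.append(loc)
            continue
        day = date.fromisoformat(lastmod[:10])  # assumes YYYY-MM-DD prefix
        seen_dates.append(day)
        if today - day <= timedelta(days=fresh_days):
            fresh.append(loc)
    return {
        "missing_lastmod": missing,
        "fresh": fresh,
        # True when several URLs all share one date (likely a hard-coded value).
        "same_date_everywhere": len(seen_dates) > 1 and len(set(seen_dates)) == 1,
    }


# Placeholder sitemap for illustration.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/new</loc><lastmod>2025-03-20</lastmod></url>
  <url><loc>https://example.com/about</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""
print(audit_sitemap(sample_sitemap, today=date(2025, 4, 1)))
```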

What content strategy maximizes LLM citations?

Technical side done. Now, content.

LLMs cite sources that deliver factual value, not vague storytelling.

1. Adopt hub + spokes format on a micro-topic

Create a hub (pillar page, 2,000–3,000 words) on a broad topic, then 5 to 10 spokes (800–1,200 word articles) on specific subtopics.

Example client (fleet management SaaS):

  • Hub: « Fleet vehicle management: complete guide 2025 »
  • Spokes: « Fleet TCO calculation », « GDPR compliance geolocation », « Embedded telematics ROI », « Predictive vs preventive maintenance »

Result after 6 months (June–Dec. 2024):

  • Hub: crawled 14 times by GPTBot, 3 ChatGPT citations
  • Spokes: crawled 2–4 times each, 11 ChatGPT citations total
  • LLM traffic: 187 visits over 6 months (2.4% of total organic traffic)

The hub gives global authority. Spokes deliver the granularity LLMs seek to answer specific questions.

2. Integrate HTML tables and lists (not images)

LLMs can’t read images (except GPT-4V in explicit use, not standard crawling).

If your comparison table is a PNG image, it’s invisible.

Convert all tables to HTML <table>, or Markdown if your CMS supports it (Jekyll, Hugo, Gatsby).

Example client (CRM comparison):

  • Version 1: infographic PNG 1,200 × 800 with comparison table → 0 LLM citations in 90 days
  • Version 2: same table as HTML <table> + alt text on infographic → 9 ChatGPT citations in 90 days

The LLM parses the HTML, extracts data, reformulates it in its response.
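If the data behind an infographic lives in a spreadsheet or a script, the HTML version can be generated directly. A minimal sketch in Python; the CRM names and prices are invented placeholders.

```python
from html import escape


def html_table(headers, rows) -> str:
    """Render headers and rows as a plain HTML <table> with escaped cell text."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Hypothetical comparison data, for illustration only.
print(html_table(
    ["CRM", "Price per user/month", "Free tier"],
    [["CRM A", "$25", "Yes"], ["CRM B", "$40", "No"]],
))
```

Keep the infographic if it helps human readers, but publish the table as text alongside it.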

3. Cite primary sources with outbound links

I mentioned this earlier. I’m repeating because it’s counter-intuitive for many SEOs (« we don’t do outbound links, it dilutes PageRank »).

The LLM ecosystem works differently. Outbound links to trusted sources boost your credibility.

Examples of valued primary sources:

  • Academic studies (Google Scholar, PubMed, arXiv)
  • Official data (INSEE, Eurostat, FDA, ANSM)
  • Technical documentation (GitHub, W3C specs, RFCs)
  • Company reports (10-K, ESG reports)

Recommended format:

« According to a 2024 Stanford study on LLM adoption in enterprise, 67% of B2B organizations have integrated at least one generative AI tool into their workflows [source]. »

The LLM sees the link, can fetch it to verify, credits you as a trusted aggregator.

4. Publish regularly (2–4× per month minimum)

Topical authority builds through density and consistency.

Scale observed:

  • 1 article/month: GPTBot crawl ~1–2× per month on blog
  • 2 articles/month: GPTBot crawl ~3–5× per month
  • 4 articles/month: GPTBot crawl ~7–12× per month

LLMs detect active sites and increase crawl frequency. Virtuous cycle.

⚡ Immediate action: Audit your editorial calendar. If you publish less than 2×/month, you’re leaving LLM traffic on the table. Goal: 2–4 articles per month on your micro-niche, with at least 1 table or numbered list per article.

5. Optimize for « augmented answers » (RAG)

LLMs use Retrieval-Augmented Generation (RAG): they fetch relevant content excerpts, then generate answers based on them.

To get extracted:

  • Short paragraphs (2–4 sentences max)
  • One idea per paragraph
  • Declarative sentences (avoid questions in body text)
  • Numbers at the start of sentences (« 67% of enterprises… » rather than « Enterprises, at a rate of 67%, … »)

The LLM extracts a 60-word paragraph with 1 clear stat more easily than a 200-word block with 5 mixed ideas.
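A rough way to audit a draft against these rules: split it into paragraphs and flag any that exceed 4 sentences or about 60 words. The thresholds come from the guidelines above; the sentence splitter is a naive regex and the function name is illustrative, so treat the output as a hint, not a verdict.

```python
import re


def paragraph_issues(text: str, max_sentences: int = 4, max_words: int = 60):
    """Return (paragraph_number, sentence_count, word_count) for long paragraphs."""
    issues = []
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for index, paragraph in enumerate(paragraphs, start=1):
        # Naive split: a sentence ends at ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        words = len(paragraph.split())
        if len(sentences) > max_sentences or words > max_words:
            issues.append((index, len(sentences), words))
    return issues


draft = "One. Two.\n\nA b. C d. E f. G h. I j."
print(paragraph_issues(draft))
```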

How to measure the real impact of AI crawls on your traffic

You optimize. But how do you measure?

LLMs don’t all show up in Google Analytics as a standard referrer. You need to cross-reference multiple sources.

1. Google Analytics 4: filter LLM referrers

In GA4, create a custom segment:

  • Source contains: chatgpt.com, claude.ai, perplexity.ai, you.com, bing.com/chat

With my clients, I create an exploratory report with:

  • Dimension 1: Source / Medium
  • Dimension 2: Landing Page
  • Metrics: Users, Sessions, Engagement Rate, Conversions

This tells me which pages generate LLM traffic. And if that traffic converts.
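If you export referrer URLs (from GA4 or from server logs), the same segment logic can be applied in a script. A minimal Python sketch using the source list above; the display labels are my own, and you may need to extend the domain list for other assistants.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains from the GA4 segment above, mapped to a display label.
LLM_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
    "you.com": "You.com",
}


def llm_source(referrer_url: str) -> Optional[str]:
    """Return the LLM label for a referrer URL, or None if it is not an LLM."""
    parsed = urlparse(referrer_url)
    host = (parsed.hostname or "").lower().removeprefix("www.")
    # bing.com only counts when the path is the chat interface.
    if host == "bing.com" and parsed.path.startswith("/chat"):
        return "Bing Chat"
    for domain, label in LLM_REFERRERS.items():
        if host == domain or host.endswith("." + domain):
            return label
    return None


for ref in ["https://chatgpt.com/", "https://www.bing.com/chat",
            "https://www.google.com/"]:
    print(ref, "->", llm_source(ref))
```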

2. Server logs: identify crawlers

LLM crawlers declare themselves in the User-Agent header.

Examples (March 2025):

  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +https://www.anthropic.com)
  • PerplexityBot/1.0 (+https://perplexity.ai/bot)

With AWStats, GoAccess, or a Python script, filter for lines containing GPTBot, ClaudeBot, PerplexityBot.

You get:

  • Crawls per day
  • Pages crawled
  • Crawl type — infer via frequency: user fetch = multiple times/day, training = 1×/week, discovery = 1×/month
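A minimal sketch of such a Python script. It assumes the common « combined » access log format (date in square brackets, request line, status, then the User-Agent as the last quoted field); adjust the regex if your Nginx or Apache format differs. The sample lines and IP addresses are fabricated for illustration.

```python
import re
from collections import Counter

CRAWLERS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "PerplexityBot")

# Assumed combined log format:
# IP - - [10/Mar/2025:06:25:24 +0000] "GET /path HTTP/1.1" 200 5123 "ref" "agent"
LINE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} .*"(?P<agent>[^"]*)"$'
)


def crawler_hits(lines):
    """Count AI crawler requests per (day, bot) and per (path, bot)."""
    per_day, per_page = Counter(), Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if not match:
            continue
        bot = next((name for name in CRAWLERS if name in match["agent"]), None)
        if bot is None:
            continue
        per_day[(match["day"], bot)] += 1
        per_page[(match["path"], bot)] += 1
    return per_day, per_page


# Fabricated sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Mar/2025:06:25:24 +0000] "GET /blog/a HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
    'GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [10/Mar/2025:09:02:11 +0000] "GET /blog/b HTTP/1.1" 200 811 '
    '"-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
    'GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [11/Mar/2025:12:00:00 +0000] "GET /blog/a HTTP/1.1" 200 5123 '
    '"-" "PerplexityBot/1.0 (+https://perplexity.ai/bot)"',
    '192.0.2.9 - - [11/Mar/2025:12:05:00 +0000] "GET / HTTP/1.1" 200 900 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
per_day, per_page = crawler_hits(SAMPLE)
for (day, bot), count in sorted(per_day.items()):
    print(day, bot, count)
```

On a real server, pass an open file object instead of SAMPLE, for example crawler_hits(open("access.log", encoding="utf-8", errors="replace")), with the path adapted to your setup.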

3. Search Console (future): OpenAI and Anthropic may integrate

Google Search Console shows Googlebot. Bing Webmaster Tools shows BingBot.

OpenAI and Anthropic don’t (yet, April 2025) have a public equivalent. But they could launch a « ChatGPT Search Console » or « Claude Webmaster Tools » within 6–12 months.

Stay alert for announcements.

💡 Serotonin: Precise tracking = clear competitive standing. If you measure 240 GPTBot crawls/month and a competitor has 890, you know you need to ramp up publishing or improve technical structure.

4. Third-party tools: Semrush, Ahrefs (partial)

Semrush and Ahrefs don’t natively track LLM traffic (April 2025). But they detect backlinks from chatgpt.com or perplexity.ai if a user shares a public conversation containing your link.

Marginal volume. But interesting for awareness: a public ChatGPT conversation mentioning your link generates indirect backlinks — forums, Reddit, social shares that re-cite the conversation.

5. Internal benchmark: before/after

Simple method:

  1. Measure LLM referral traffic over 30 days (period A, before optimization)
  2. Deploy technical + content optimizations
  3. Wait 60 days — LLMs recrawl slowly
  4. Measure over 30 days (period B, after optimization)
  5. Compare

Scale observed — median across 34 clients, Jan.–March 2025:

  • Before: 0.6% of organic traffic = LLM traffic
  • After — 60 days post-optimization: 1.9% of organic traffic = LLM traffic

Gain: +216% in LLM traffic share, or +1.3 percentage points of organic traffic.
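The two ways of expressing the gain (relative change vs. percentage points) are easy to confuse, so here is the arithmetic as a short Python sketch, using the rounded medians quoted above. Function names are illustrative.

```python
def relative_gain_pct(before_share: float, after_share: float) -> float:
    """Relative change of the LLM share, in percent."""
    return (after_share - before_share) / before_share * 100


def point_gain(before_share: float, after_share: float) -> float:
    """Absolute change of the LLM share, in percentage points."""
    return after_share - before_share


# Medians quoted above: 0.6% before, 1.9% after (shares of organic traffic).
print(f"relative: {relative_gain_pct(0.6, 1.9):.1f}%")
print(f"absolute: {point_gain(0.6, 1.9):.1f} points")
```

Because the inputs are already rounded to one decimal, the relative figure can differ by a point or so from a value computed on unrounded data.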

Not revolutionary. But measurable. Zero media budget.

What you must do this week

Operational summary.

You’ve read 68 million reasons to act. Here’s the plan I deploy for my clients — 3 to 5 business days.

Day 1: Crawlability audit

  1. Check robots.txt: GPTBot, ClaudeBot, PerplexityBot must be Allow: /
  2. Test load time (PageSpeed Insights, GTmetrix): goal < 2 s server response
  3. Disable JS in Chrome DevTools, reload your top 5 pages: content must remain visible

If any of these fail, you’re invisible. Fix as priority.

Day 2: Schema.org audit

  1. Install a Schema plugin (Yoast, Rank Math, Schema Pro if WordPress)
  2. Add Article to all blog posts
  3. Add FAQPage if you have FAQ sections
  4. Validate with Google Rich Results Test

Estimated time: 2-3 hours for 20 pages.

Day 3: Create 1 « LLM-ready » piece of content

Publish an article that ticks all boxes:

  • Clear title (question or how-to)
  • 2,000-2,500 words
  • 1 HTML table with numbers
  • 3-5 H2 sections (one idea per section)
  • 2-3 outbound links to primary sources
  • Article Schema
  • Short paragraphs (2-4 sentences max)

Example topic (HR SaaS): « 9 employee retention KPIs to track in 2025 [comparison table] ».

Day 4: Set up GA4 tracking

  1. Create a custom segment with LLM sources (chatgpt.com, claude.ai, etc.)
  2. Create an exploratory report Landing Page × Source
  3. Define a conversion goal (lead, purchase, download) for this segment

You’ll measure real impact in 30-60 days.

Day 5: Plan 3 months of content

Identify your micro-niche — the topic where you want to be cited by LLMs.

List 10-12 precise questions your customers ask.

Transform each question into an article (800-1,200 words), with at least 1 list or table.

Publish 1× per week for 3 months.

⚡ Expected result: First GPTBot crawl within 10-15 days of publication. First ChatGPT citation between day 30 and day 90 (depends on topic competitiveness). Measurable LLM traffic after 90 days (if you publish regularly).

This is a marathon. LLMs build their index slowly. But once you’re in, you enjoy recurring, qualified traffic, zero media spend.

The 68 million crawl study from Duda proves it: sites crawled regularly (user fetch 2-3×/week) generate constant LLM visitor flow. Measurable. Reproducible. And it starts with the 5 days above.

Frequently Asked Questions

Should I block AI crawlers to protect my content?

No. Blocking GPTBot or ClaudeBot makes you invisible in ChatGPT and Claude. You lose referral traffic (+72.7% YoY per study) and citations. If you worry about scraping, use a partial paywall or premium content, but keep public content accessible.

How long before I see LLM traffic after optimization?

First GPTBot crawl: 10–15 days post-publication if robots.txt is clean. First ChatGPT citation: 30–90 days (depends on topic competition). Measurable traffic: minimum 90 days, if you publish 2–4×/month. LLMs crawl slower than Googlebot initially.

Do AI crawlers consume a lot of server bandwidth?

No. Among my 127 audited clients (Sept. 2024 – March 2025), AI crawlers represent 0.4–1.2% of total server traffic. GPTBot is even lighter than Googlebot (less JS, no images). Negligible server cost impact.

What’s the difference between GPTBot and ChatGPT-User in logs?

GPTBot = crawl for training/discovery (indexing, model training). ChatGPT-User = real-time fetch to answer an active user (56.9% of volume per study). ChatGPT-User generates direct referral traffic if you’re cited. GPTBot doesn’t, except indirectly via model enrichment.

Do I need a separate sitemap for AI crawlers?

No. AI crawlers read your standard sitemap.xml. Ensure <lastmod> is current (real modification date), and that priority pages have a <priority> of 0.8–1.0. Declare the sitemap in robots.txt: « Sitemap: https://yoursite.com/sitemap.xml ».

Stéphane Jambu

SEO & AI Engineer

I build growth systems / AI / Neuroscience | 650+ clients · 80 LinkedIn testimonials · 30 years of expertise · 15 years of systems running without me.
