Claude Code, Uber, and AI budgets exploding: lessons for e-commerce 2026

Summarize this article with AI

In short: Uber burned through its entire 2026 annual AI budget in four months due to massive Claude Code adoption across 5,000 engineers. AI agents consume 5 to 100 times more tokens than typical human usage. For an e-commerce deploying support, search, or recommendation agents, budget arbitrage becomes a profitability lever as critical as customer acquisition cost.
4 monthsto burn through Uber’s annual AI budget on Claude Code
5 to 100xtokens consumed by an agent vs human usage
95%savings possible with caching + batch combined

The Uber case: an annual budget burned in 4 months

April 15, 2026. Praveen Neppalli Naga, Uber’s CTO, announces the annual AI budget is exhausted. In four months. The cause: Claude Code and Cursor deployed to 5 000 engineers. December 2025: stable usage. April: doubled, then doubled again. Uber’s AI costs have climbed sixfold since 2024, reaching 3.4 billion dollars in R&D — an AI share never seen before.

The CTO is clear: « I’m back at the drawing board, because the budget I thought necessary is already blown to pieces. » Not a failure. A signal. Agentic tools are reshaping the physics of software costs — and standard budget plans aren’t keeping pace.

Internal figures (The Information, Yahoo Finance): 95% of engineers use AI every month. 70% of committed code comes from AI assistance. Individual API bill: 500 to 2,000 dollars per developer per month. For 5,000 engineers, that’s 30 to 120 million dollars per year — just coding assistants alone, excluding production agents.

Clarification. The original anecdote comes from a viral Reddit post (April 20, 2026, r/GenAI4all), later confirmed by The Information, Benzinga, Yahoo Finance. The precise figures (« 3.4 billion », « 6x since 2024 ») come from the CTO himself. Reliable order of magnitude, not audited accounting.

For an e-commerce, the parallel holds. An AI assistant hooked to your catalog, support, or product search can follow exactly the same curve: slow adoption for two months, viral explosion on team and customer side, bill doubling every 30 days if nobody’s watching.

Why AI agents consume 100 times more than a human

A standard human conversation with an LLM consumes between 500 and 3,000 tokens per exchange. An AI agent — Claude Code, Cursor, Devin, Factory, or a custom agent hooked to your store — consumes between 5,000 and 1,000,000 tokens per task. The ratio ranges from 5 to 1 for a simple agent to 100 or even 1,000 for a multi-step agent.

The reason is structural. An agent performs five operations a human never does:

  • It re-reads its full context at each step. An agent chaining 20 tool calls re-reads the previous conversation 20 times. The human remembers. The machine must receive context in plain text.
  • It consults documentation. Claude Code reads the CLAUDE.md file at each session, plus the project content. On an average repo: 30,000 to 80,000 tokens before the first response.
  • It executes and re-reads its outputs. A failed unit test generates 200 lines of stack trace that the agent re-ingests to fix. A human visually scans three lines.
  • It reasons aloud. Reasoning models expose their thinking as thinking tokens, billed at the same price as final output. A 15-step plan can burn 8,000 tokens in pure thought before producing one line of code.
  • It retries. Recent academic work (OpenReview, 2026) shows an agent consumes an average of 1 to 3.5 million tokens per task including retries. Retries are the norm.

This mechanic changes the economic nature of an LLM. A chat at 0.01 dollars per interaction becomes an agent at 1 dollar per task. Multiply by an e-commerce’s traffic serving 200,000 sessions a month, and you shift from a line item to a strategic cost center.

Concrete e-commerce case: a semantic search assistant on a 15,000 SKU catalog calling Claude Sonnet on every visitor request typically represents 4,000 to 8,000 tokens per session. At 200,000 sessions a month, you’re between 800 million and 1.6 billion tokens monthly. At 3 dollars per million input: 2,400 to 4,800 dollars per month — before the recommendation layer.

That’s what happened to Uber at engineering scale. That’s what awaits any e-commerce deploying agents to the customer side without monitoring from day one.

The 5 levers that actually change the bill

The goal isn’t to reduce usage — AI creates value. The goal is to ensure one euro invested in tokens generates three euros of additional margin. Here are the five levers I apply systematically since the Uber explosion.

1. Model routing by task complexity

The first lever is simplest and most underused: not all prompts deserve Opus. A request rewrite passes through Haiku at 0.25 dollars per million tokens. Abandoned cart analysis runs on Sonnet at 3 dollars. Only strategic synthesis deserves Opus at 15 dollars.

A well-tuned router divides the bill by four with no perceived degradation. The trap: developers tend to put everything on Opus « to be safe ». Discipline means defining three task catégories at project start and auditing routing monthly.

2. Anthropic prompt caching: -90% on system prompts

Anthropic bills cache reads at 10% of standard input price. For an e-commerce sending its system prompt on each request (tone guidelines, product catalog summary, business rules), the gain is immediate. An 8,000-token system prompt repeated 100,000 times monthly costs 2,400 dollars without cache, 240 dollars with cache.

The condition: structure your prompt so the stable part is first, followed by dynamic content. Cache lasts 5 minutes by default, 1 hour in extended config. On continuous e-commerce traffic, cache holds permanently.

3. Context compression

Agents maintaining conversation history see context balloon with each turn. At 50 turns, context exceeds 100,000 tokens with 90% redundancy. Summarizing history every ten turns drastically cuts cost with zero expérience damage. Claude Code does this natively via auto-compaction; for in-house agents, it’s a two-hour implementation.

4. Self-hosted for repetitive volume

Gemma 3, Llama 4, Mistral Small: 2026’s open-source models run on an A10 GPU rented for 0.50 dollars per hour. For repetitive, low-ambiguity tasks (review classification, spam detection, query rewriting), an 8B self-hosted model costs 10 to 50 times less than Claude or GPT calls — provided you have volume to amortize the GPU.

The empirical breakeven rule: below 2 million tokens per day, stay on managed API. Above that, seriously evaluate self-hosted.

5. Batch API for non-real-time

Anthropic and OpenAI offer 50% discount on batch processing (response within 24 hours). For anything not needing real-time — product description generation, catalog enrichment, bulk customer review analysis, daily support ticket sorting — batch halves the bill. Combined with caching, Anthropic claims up to 95% savings on eligible workloads.

Neuroscience angle: the urge to upgrade to the premium model fires dopamine from the new toy feeling. Haiku does 80% of the work, Sonnet 95%, Opus 99%. Opus’s 4-point gain over Sonnet costs five times more. Budget discipline means resisting the auto-upgrade reflex and asking: is that 4-point gain worth the 5x bill?

Prompt caching + Batch API: the combo for 95% savings

Anthropic publishes precise documentation on combining both mechanisms. The point most teams miss: the two discounts stack. Caching alone: -90% on repeated system tokens. Batch alone: -50% overall. Combined on an eligible workload, theoretical savings reach 95%.

The table below summarizes Anthropic’s pricing grid on Claude Sonnet 4 at April 2026 rates (public data from platform.claude.com/docs):

Token typeStandard priceWith cache hitWith batchCache + batch
Input3 $ / M tokens0.30 $ / M1.50 $ / M0.15 $ / M
Output15 $ / M tokens15 $ / M7.50 $ / M7.50 $ / M
Cache write 5 min3.75 $ / M1.875 $ / M

Concrete example: a nightly pipeline enriching 10,000 product sheets with AI (SEO descriptions + technical specs + variants). Each sheet consumes 4,000 input tokens (3,000 of which are identical system prompt) and generates 1,500 output tokens.

  • Without optimization: 40 M input × 3 $ + 15 M output × 15 $ = 345 $.
  • With caching (90% input stable): 4 M × 3 + 36 M × 0.30 + 15 M × 15 = 248 $.
  • With caching + batch: 4 M × 1.5 + 36 M × 0.15 + 15 M × 7.5 = 124 $.

Same volume, same output quality, you drop from 345 to 124 dollars. That’s 64% savings on a real task. Across a monthly catalog refresh, it’s thousands of euros annually recovered. Infrastructure unchanged.

The trap that erases everything

Caching only works if your system prompt is byte-for-byte identical across requests. A timestamp in the prompt, a variable user name, a dynamic date: any of that breaks the cache. You repay full price. The correct structure puts stable parts first, dynamic parts last. It’s an implementation detail that means the difference between 90% savings and 0%.

Self-hosted vs managed API: the real decision grid

The self-hosted debate resurfaces with every API price hike. In 2026, open-source models made a huge qualitative leap. Gemma 3 27B rivals GPT-4o-mini on most benchmarks. Llama 4 Maverick approaches Claude Sonnet on code. Mistral Small 3.1 runs on consumer GPUs.

Yet most e-commerce shops would be wrong to switch entirely. Here’s the decision grid I apply:

CriteriaManaged APISelf-hosted
Daily volume< 2 M tokens/day> 5 M tokens/day on stable task
Load variabilityUnpredictable peaksContinuous load
Data sensitivityAnthropic zero-retention OKStrict GDPR, health, finance constraints
Quality requiredPremium tier (reasoning, nuance)Discrete, bounded tasks
Team time availableZero ops0.5 FTE MLOps minimum
Latency target200-800 ms acceptable< 100 ms required

Practical rule: an e-commerce serving less than 10,000 AI sessions per day gains time staying on API. Beyond that, serious cost-benefit analysis is mandatory. The classic trap: switch to self-hosted to save 2,000 dollars monthly, then hire an engineer at 8,000 to maintain the stack.

The winning hybrid approach

Most of my clients end up on a mix. Self-hosted for the high-volume repetitive layer — query rewriting, intent detection, review classification. Managed API for the intelligent layer — final response generation, synthesis, nuanced decisions. This two-tier architecture divides total bill by three. While keeping Claude or GPT quality where it matters.

Measuring real LLM ROI in e-commerce

Cutting costs is half the job. The other half: proving AI returns more than it costs. Too many teams claim « 70% of requests use AI » without ever building the P&L.

The formula that holds in the boardroom:

LLM ROI = (Revenue from AI) − (Token API cost) − (Infrastructure cost) − (Human maintenance cost)

For e-commerce, additional revenue breaks down across four axes:

  • Conversion lift: conversion rate difference between sessions using the assistant vs sessions not using it. A/B test. Not correlation.
  • Average order value: impact of AI recommendations on mean order amount.
  • Support cost reduction: tickets resolved by AI × average cost per human ticket.
  • SEO and GEO: additional organic traffic from Google and LLM citations (ChatGPT, Perplexity) thanks to enriched descriptions.
Metric to install day one: cost per conversation and cost per active user. Only way to spot a customer turning deficient — typically free-tier power users abusing an expensive assistant. Real-time dashboard on OpenRouter or Anthropic’s dashboard prevents end-of-month surprises.

The « Claude nerfed » case: rigor vs perception

Since February 2026, controversy among Claude Opus 4.6 power users. Multiple posts on r/Anthropic and r/ClaudeAI report quality decline: shorter responses, shallower reasoning, instructions less well followed. An independent study on 6,800 Claude Code sessions notes 67% drop in reasoning depth late February.

Anthropic acknowledged tweaking default thinking budget settings to optimize latency and cost. The viral BridgeBench post claiming Claude dropped from 2nd to 10th place in hallucinations faced wide methodological criticism.

Business lesson: your internal benchmarks are the only ground truth. Build a set of 20 to 50 prompts representing your real e-commerce usage. Run them weekly on each candidate model. Track drift. If your quality drops 10% but cost drops 40%, good trade. If quality drops 30% for 10% cost savings, switch models. Without measurement, you’re just spectating lab communications.

What the Uber story changes for e-commerce in 2026

The typical misread? « Too expensive. » Uber never says that. Uber says: ROI is such that we replan, we don’t cut. 70% of code committed with AI assist. Measurable engineer productivity spike. Developer satisfaction through the roof. The problem is a growth problem.

Four actions for an e-commerce, this week:

  1. Install monitoring before deploying. Dashboard for tokens per endpoint, per user, per model — as critical as Google Analytics. Budget alerts at 50%, 80%, 100% are mandatory.
  2. Structure prompts for caching. Five minutes of refactoring yields 90% savings. Best available ROI on Claude API today.
  3. Define model routing at project open. Three task catégories, three models. Haiku for simple. Sonnet for business logic. Opus for strategic. Monthly audit.
  4. Benchmark continuously. The « Claude nerfed » controversy reminds us quality isn’t delegated to lab communications. A representative prompt set run weekly — one engineer hour per month.

AI is a standalone budget line. Strong growth for at least two years. E-commerce shops treating this line with the same rigor as Google Ads or Meta spend will have structural advantage. Those letting it drift will relive Uber — with less margin to absorb the shock.

Audit your AI consumption and agent ROI in 30 minutes

Deploying an AI assistant on your store or already running one? I’ll show you live the three levers that cut your current bill by three. Plus the ROI measurement grid tailored to your volume. No pitch — just a live analysis of your real usage.

Book a strategic call — 45 min

Frequently Asked Questions

What’s the ballpark for an AI assistant on an e-commerce store in 2026?

For a site serving 100,000 to 300,000 sessions monthly with a Claude Sonnet-based search assistant, budget 800 to 3,500 dollars monthly in API costs before optimization. With proper caching and model routing, you typically drop to 250–1,000 dollars for the same traffic. The key variable is system prompt size and average conversation length.

Should a 500K euro/month boutique host its own LLM?

Not necessarily. The switch to self-hosted becomes interesting above 5 million tokens daily on a stable, well-defined task. Below that, human operations cost exceeds API savings. The hybrid approach (self-hosted for high-volume repetitive work, managed API for intelligent layers) is the best compromise for most e-commerce shops.

How do I objectively measure if Claude Opus degraded?

Build a set of 20–50 prompts from your actual usage, with validated expected responses. Run them weekly on each model candidate and score quality on a fixed rubric. In three weeks, you have objective drift data replacing Reddit anecdotes with numbers for your decisions.

What’s the costliest mistake with Anthropic prompt caching?

Putting variable data (timestamp, user ID, today’s date) in the section meant to cache. Each variation breaks the cache and you repay full price. Rule: stable parts first (instructions, catalog, rules), dynamic parts last. Check Anthropic logs that cache hit rate exceeds 85% on recurring requests.

Can Batch API work with real-time usage?

No. Batch API gives 50% discount in exchange for processing delays up to 24 hours. Perfect for nightly catalog enrichment, bulk product description generation, customer review analysis, or daily support ticket sorting. For anything touching the visitor in-session, you need synchronous standard API.

Stéphane Jambu

Stéphane Jambu

SEO & AI Engineer

I build growth systems / AI / Neuroscience | 650+ clients · 80 LinkedIn testimonials · 30 years of expertise · 15 years of systems running without me.

Follow on LinkedIn

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée. Les champs obligatoires sont indiqués avec *