Claude Code, Uber, and AI budgets exploding: lessons for e-commerce 2026
Summarize this article with AI
The Uber case: an annual budget burned in 4 months
April 15, 2026. Praveen Neppalli Naga, Uber’s CTO, announces the annual AI budget is exhausted. In four months. The cause: Claude Code and Cursor deployed to 5 000 engineers. December 2025: stable usage. April: doubled, then doubled again. Uber’s AI costs have climbed sixfold since 2024, reaching 3.4 billion dollars in R&D — an AI share never seen before.
The CTO is clear: « I’m back at the drawing board, because the budget I thought necessary is already blown to pieces. » Not a failure. A signal. Agentic tools are reshaping the physics of software costs — and standard budget plans aren’t keeping pace.
Internal figures (The Information, Yahoo Finance): 95% of engineers use AI every month. 70% of committed code comes from AI assistance. Individual API bill: 500 to 2,000 dollars per developer per month. For 5,000 engineers, that’s 30 to 120 million dollars per year — just coding assistants alone, excluding production agents.
Clarification. The original anecdote comes from a viral Reddit post (April 20, 2026, r/GenAI4all), later confirmed by The Information, Benzinga, Yahoo Finance. The precise figures (« 3.4 billion », « 6x since 2024 ») come from the CTO himself. Reliable order of magnitude, not audited accounting.
For an e-commerce, the parallel holds. An AI assistant hooked to your catalog, support, or product search can follow exactly the same curve: slow adoption for two months, viral explosion on team and customer side, bill doubling every 30 days if nobody’s watching.
Why AI agents consume 100 times more than a human
A standard human conversation with an LLM consumes between 500 and 3,000 tokens per exchange. An AI agent — Claude Code, Cursor, Devin, Factory, or a custom agent hooked to your store — consumes between 5,000 and 1,000,000 tokens per task. The ratio ranges from 5 to 1 for a simple agent to 100 or even 1,000 for a multi-step agent.
The reason is structural. An agent performs five operations a human never does:
- It re-reads its full context at each step. An agent chaining 20 tool calls re-reads the previous conversation 20 times. The human remembers. The machine must receive context in plain text.
- It consults documentation. Claude Code reads the
CLAUDE.mdfile at each session, plus the project content. On an average repo: 30,000 to 80,000 tokens before the first response. - It executes and re-reads its outputs. A failed unit test generates 200 lines of stack trace that the agent re-ingests to fix. A human visually scans three lines.
- It reasons aloud. Reasoning models expose their thinking as thinking tokens, billed at the same price as final output. A 15-step plan can burn 8,000 tokens in pure thought before producing one line of code.
- It retries. Recent academic work (OpenReview, 2026) shows an agent consumes an average of 1 to 3.5 million tokens per task including retries. Retries are the norm.
This mechanic changes the economic nature of an LLM. A chat at 0.01 dollars per interaction becomes an agent at 1 dollar per task. Multiply by an e-commerce’s traffic serving 200,000 sessions a month, and you shift from a line item to a strategic cost center.
That’s what happened to Uber at engineering scale. That’s what awaits any e-commerce deploying agents to the customer side without monitoring from day one.
The 5 levers that actually change the bill
The goal isn’t to reduce usage — AI creates value. The goal is to ensure one euro invested in tokens generates three euros of additional margin. Here are the five levers I apply systematically since the Uber explosion.
1. Model routing by task complexity
The first lever is simplest and most underused: not all prompts deserve Opus. A request rewrite passes through Haiku at 0.25 dollars per million tokens. Abandoned cart analysis runs on Sonnet at 3 dollars. Only strategic synthesis deserves Opus at 15 dollars.
A well-tuned router divides the bill by four with no perceived degradation. The trap: developers tend to put everything on Opus « to be safe ». Discipline means defining three task catégories at project start and auditing routing monthly.
2. Anthropic prompt caching: -90% on system prompts
Anthropic bills cache reads at 10% of standard input price. For an e-commerce sending its system prompt on each request (tone guidelines, product catalog summary, business rules), the gain is immediate. An 8,000-token system prompt repeated 100,000 times monthly costs 2,400 dollars without cache, 240 dollars with cache.
The condition: structure your prompt so the stable part is first, followed by dynamic content. Cache lasts 5 minutes by default, 1 hour in extended config. On continuous e-commerce traffic, cache holds permanently.
3. Context compression
Agents maintaining conversation history see context balloon with each turn. At 50 turns, context exceeds 100,000 tokens with 90% redundancy. Summarizing history every ten turns drastically cuts cost with zero expérience damage. Claude Code does this natively via auto-compaction; for in-house agents, it’s a two-hour implementation.
4. Self-hosted for repetitive volume
Gemma 3, Llama 4, Mistral Small: 2026’s open-source models run on an A10 GPU rented for 0.50 dollars per hour. For repetitive, low-ambiguity tasks (review classification, spam detection, query rewriting), an 8B self-hosted model costs 10 to 50 times less than Claude or GPT calls — provided you have volume to amortize the GPU.
The empirical breakeven rule: below 2 million tokens per day, stay on managed API. Above that, seriously evaluate self-hosted.
5. Batch API for non-real-time
Anthropic and OpenAI offer 50% discount on batch processing (response within 24 hours). For anything not needing real-time — product description generation, catalog enrichment, bulk customer review analysis, daily support ticket sorting — batch halves the bill. Combined with caching, Anthropic claims up to 95% savings on eligible workloads.
Prompt caching + Batch API: the combo for 95% savings
Anthropic publishes precise documentation on combining both mechanisms. The point most teams miss: the two discounts stack. Caching alone: -90% on repeated system tokens. Batch alone: -50% overall. Combined on an eligible workload, theoretical savings reach 95%.
The table below summarizes Anthropic’s pricing grid on Claude Sonnet 4 at April 2026 rates (public data from platform.claude.com/docs):
| Token type | Standard price | With cache hit | With batch | Cache + batch |
|---|---|---|---|---|
| Input | 3 $ / M tokens | 0.30 $ / M | 1.50 $ / M | 0.15 $ / M |
| Output | 15 $ / M tokens | 15 $ / M | 7.50 $ / M | 7.50 $ / M |
| Cache write 5 min | 3.75 $ / M | — | 1.875 $ / M | — |
Concrete example: a nightly pipeline enriching 10,000 product sheets with AI (SEO descriptions + technical specs + variants). Each sheet consumes 4,000 input tokens (3,000 of which are identical system prompt) and generates 1,500 output tokens.
- Without optimization: 40 M input × 3 $ + 15 M output × 15 $ = 345 $.
- With caching (90% input stable): 4 M × 3 + 36 M × 0.30 + 15 M × 15 = 248 $.
- With caching + batch: 4 M × 1.5 + 36 M × 0.15 + 15 M × 7.5 = 124 $.
Same volume, same output quality, you drop from 345 to 124 dollars. That’s 64% savings on a real task. Across a monthly catalog refresh, it’s thousands of euros annually recovered. Infrastructure unchanged.
The trap that erases everything
Caching only works if your system prompt is byte-for-byte identical across requests. A timestamp in the prompt, a variable user name, a dynamic date: any of that breaks the cache. You repay full price. The correct structure puts stable parts first, dynamic parts last. It’s an implementation detail that means the difference between 90% savings and 0%.
Self-hosted vs managed API: the real decision grid
The self-hosted debate resurfaces with every API price hike. In 2026, open-source models made a huge qualitative leap. Gemma 3 27B rivals GPT-4o-mini on most benchmarks. Llama 4 Maverick approaches Claude Sonnet on code. Mistral Small 3.1 runs on consumer GPUs.
Yet most e-commerce shops would be wrong to switch entirely. Here’s the decision grid I apply:
| Criteria | Managed API | Self-hosted |
|---|---|---|
| Daily volume | < 2 M tokens/day | > 5 M tokens/day on stable task |
| Load variability | Unpredictable peaks | Continuous load |
| Data sensitivity | Anthropic zero-retention OK | Strict GDPR, health, finance constraints |
| Quality required | Premium tier (reasoning, nuance) | Discrete, bounded tasks |
| Team time available | Zero ops | 0.5 FTE MLOps minimum |
| Latency target | 200-800 ms acceptable | < 100 ms required |
Practical rule: an e-commerce serving less than 10,000 AI sessions per day gains time staying on API. Beyond that, serious cost-benefit analysis is mandatory. The classic trap: switch to self-hosted to save 2,000 dollars monthly, then hire an engineer at 8,000 to maintain the stack.
The winning hybrid approach
Most of my clients end up on a mix. Self-hosted for the high-volume repetitive layer — query rewriting, intent detection, review classification. Managed API for the intelligent layer — final response generation, synthesis, nuanced decisions. This two-tier architecture divides total bill by three. While keeping Claude or GPT quality where it matters.
Measuring real LLM ROI in e-commerce
Cutting costs is half the job. The other half: proving AI returns more than it costs. Too many teams claim « 70% of requests use AI » without ever building the P&L.
The formula that holds in the boardroom:
LLM ROI = (Revenue from AI) − (Token API cost) − (Infrastructure cost) − (Human maintenance cost)
For e-commerce, additional revenue breaks down across four axes:
- Conversion lift: conversion rate difference between sessions using the assistant vs sessions not using it. A/B test. Not correlation.
- Average order value: impact of AI recommendations on mean order amount.
- Support cost reduction: tickets resolved by AI × average cost per human ticket.
- SEO and GEO: additional organic traffic from Google and LLM citations (ChatGPT, Perplexity) thanks to enriched descriptions.
The « Claude nerfed » case: rigor vs perception
Since February 2026, controversy among Claude Opus 4.6 power users. Multiple posts on r/Anthropic and r/ClaudeAI report quality decline: shorter responses, shallower reasoning, instructions less well followed. An independent study on 6,800 Claude Code sessions notes 67% drop in reasoning depth late February.
Anthropic acknowledged tweaking default thinking budget settings to optimize latency and cost. The viral BridgeBench post claiming Claude dropped from 2nd to 10th place in hallucinations faced wide methodological criticism.
Business lesson: your internal benchmarks are the only ground truth. Build a set of 20 to 50 prompts representing your real e-commerce usage. Run them weekly on each candidate model. Track drift. If your quality drops 10% but cost drops 40%, good trade. If quality drops 30% for 10% cost savings, switch models. Without measurement, you’re just spectating lab communications.
What the Uber story changes for e-commerce in 2026
The typical misread? « Too expensive. » Uber never says that. Uber says: ROI is such that we replan, we don’t cut. 70% of code committed with AI assist. Measurable engineer productivity spike. Developer satisfaction through the roof. The problem is a growth problem.
Four actions for an e-commerce, this week:
- Install monitoring before deploying. Dashboard for tokens per endpoint, per user, per model — as critical as Google Analytics. Budget alerts at 50%, 80%, 100% are mandatory.
- Structure prompts for caching. Five minutes of refactoring yields 90% savings. Best available ROI on Claude API today.
- Define model routing at project open. Three task catégories, three models. Haiku for simple. Sonnet for business logic. Opus for strategic. Monthly audit.
- Benchmark continuously. The « Claude nerfed » controversy reminds us quality isn’t delegated to lab communications. A representative prompt set run weekly — one engineer hour per month.
AI is a standalone budget line. Strong growth for at least two years. E-commerce shops treating this line with the same rigor as Google Ads or Meta spend will have structural advantage. Those letting it drift will relive Uber — with less margin to absorb the shock.
Audit your AI consumption and agent ROI in 30 minutes
Deploying an AI assistant on your store or already running one? I’ll show you live the three levers that cut your current bill by three. Plus the ROI measurement grid tailored to your volume. No pitch — just a live analysis of your real usage.
Book a strategic call — 45 minFrequently Asked Questions
What’s the ballpark for an AI assistant on an e-commerce store in 2026?
For a site serving 100,000 to 300,000 sessions monthly with a Claude Sonnet-based search assistant, budget 800 to 3,500 dollars monthly in API costs before optimization. With proper caching and model routing, you typically drop to 250–1,000 dollars for the same traffic. The key variable is system prompt size and average conversation length.
Should a 500K euro/month boutique host its own LLM?
Not necessarily. The switch to self-hosted becomes interesting above 5 million tokens daily on a stable, well-defined task. Below that, human operations cost exceeds API savings. The hybrid approach (self-hosted for high-volume repetitive work, managed API for intelligent layers) is the best compromise for most e-commerce shops.
How do I objectively measure if Claude Opus degraded?
Build a set of 20–50 prompts from your actual usage, with validated expected responses. Run them weekly on each model candidate and score quality on a fixed rubric. In three weeks, you have objective drift data replacing Reddit anecdotes with numbers for your decisions.
What’s the costliest mistake with Anthropic prompt caching?
Putting variable data (timestamp, user ID, today’s date) in the section meant to cache. Each variation breaks the cache and you repay full price. Rule: stable parts first (instructions, catalog, rules), dynamic parts last. Check Anthropic logs that cache hit rate exceeds 85% on recurring requests.
Can Batch API work with real-time usage?
No. Batch API gives 50% discount in exchange for processing delays up to 24 hours. Perfect for nightly catalog enrichment, bulk product description generation, customer review analysis, or daily support ticket sorting. For anything touching the visitor in-session, you need synchronous standard API.

