Why your AI agents miss the mark: the RAG problem nobody explains in e-commerce



What RAG is — no jargon

A customer asks your chatbot: « Is model X compatible with reference Y? »

Without RAG, the AI model answers from what it knows. What it knows is what it learned during training. Millions of texts. Nothing about your catalog.

Result: a confident answer. Wrong.

RAG stands for Retrieval-Augmented Generation. Before generating a response, the agent retrieves relevant information from an external data source (your catalog, your product sheets, your real-time inventory) and anchors its response in that verifiable data.
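To make the retrieve-then-generate idea concrete, here is a minimal sketch. It assumes a three-line toy catalog and uses plain word overlap as a stand-in for real semantic similarity; `CATALOG`, `retrieve` and `build_prompt` are hypothetical names, and the final call to an LLM is left out.

```python
import re

# Hypothetical catalog snippets; in practice these come from your product database.
CATALOG = [
    "Lens X: mount type A, focal length 50 mm.",
    "Body Y: mount type B, 24 megapixel sensor.",
    "Tripod Z: aluminium, maximum load 5 kg.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question (stand-in for embeddings)."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, documents: list[str]) -> str:
    """Put the retrieved snippets in front of the question before calling the model."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Answer using only the context below. If it is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

prompt = build_prompt("Is lens X compatible with body Y?", CATALOG)
print(prompt)
```

With the two mount types in the prompt, the model has what it needs to see that the mounts differ instead of guessing.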

The difference between a salesperson who improvises and a salesperson who checks the system before answering.

63% of AI agent deployments in e-commerce suffer from hallucinations on product data due to a lack of structured RAG (source: Gartner, 2025).

Why e-commerce agents get it wrong

The problem isn’t the AI model. The problem is the architecture.

Most e-commerce agents are built in two steps. Step one: pick an LLM (GPT-4, Claude, Gemini). Step two: give it a system prompt with a few instructions about the store. That’s it.

The model has access to neither the catalog, nor the inventory, nor the current prices. It responds based on what it imagines your shop might sell.

Concrete example. A photo equipment e-commerce store deploys a chatbot without RAG. A customer asks if lens X is compatible with body Y. The model says yes. It gets the mount type wrong. The customer orders. Returns the product. Support dispute. Reputation damaged.

This isn’t hypothetical. It’s what I see in the majority of first deployments I audit.

The 3 most common RAG mistakes

Error 01

Chunks that are too large or too small

RAG works by breaking documents into pieces (chunks) that are then searched by semantic similarity. Chunks that are too large drown relevant information in useless text. Chunks that are too small lose the context needed for understanding. For a product catalog, the right balance is one chunk per functional attribute (description, dimensions, compatibility, materials) — not one chunk per entire product page.

Error 02

A database that isn’t kept up to date

RAG is only as reliable as the data it consults. A catalog indexed three months ago contains obsolete prices, out-of-stock products, discontinued references. The agent answers with confidence about outdated information. Updates must be automated and synchronized with inventory and pricing systems.
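A simple safeguard is a freshness check that runs alongside your sync jobs. The sketch below assumes each indexed chunk stores a `last_updated` timestamp; the chunk ids, dates and the one-day limit are illustrative values, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

def stale_chunk_ids(chunks: list[dict], now: datetime, max_age: timedelta) -> list[str]:
    """Return the ids of chunks whose last update is older than max_age."""
    return [c["chunk_id"] for c in chunks if now - c["last_updated"] > max_age]

# Illustrative index entries (hypothetical ids and dates).
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
index = [
    {"chunk_id": "sku-1:price", "last_updated": datetime(2025, 5, 31, 12, tzinfo=timezone.utc)},
    {"chunk_id": "sku-2:price", "last_updated": datetime(2025, 3, 1, tzinfo=timezone.utc)},
]

stale = stale_chunk_ids(index, now, max_age=timedelta(days=1))
print(stale)
```

Anything this check returns should be re-synced or excluded from retrieval until it is.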

Error 03

Missing or inconsistent metadata

The RAG retrieval system relies on metadata to filter and refine results. Without category, without product reference, without structured attributes, the system returns the wrong chunks. The agent has access to data — but can’t find the right chunks. The result is identical to having no RAG.

How to structure your product knowledge base

An effective product knowledge base for RAG follows simple logic: each unit of information is self-contained, identifiable, and updated automatically.

1. Normalize product attributes

Start with an audit of existing product sheets. In most e-commerce catalogs, the same attributes have three different names depending on suppliers, categories, or publication years. « Color », « shade », « tint »: three fields for the same information. RAG can’t synthesize what the database itself can’t resolve.
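Normalization can start as a plain alias table applied before indexing. This is a sketch under assumptions: the alias list and the `normalize_attributes` helper are hypothetical, and a real catalog will need a much longer table built from the audit.

```python
# Hypothetical alias table: every supplier spelling maps to one canonical name.
ATTRIBUTE_ALIASES = {
    "color": "color",
    "shade": "color",
    "tint": "color",
}

def normalize_attributes(raw: dict[str, str]) -> dict[str, str]:
    """Rename attributes to their canonical name and refuse conflicting values."""
    normalized: dict[str, str] = {}
    for name, value in raw.items():
        key = name.strip().lower()
        canonical = ATTRIBUTE_ALIASES.get(key, key)
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for '{canonical}'")
        normalized[canonical] = value
    return normalized

print(normalize_attributes({"Shade": "navy", "Weight": "1.2 kg"}))
```

Raising on conflicts is a deliberate choice here: two different colors on the same sheet is a data problem to fix at the source, not something to hide from the index.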

2. Separate static data from dynamic data

Technical features (dimensions, materials, compatibility) rarely change. Prices and inventory change daily. These two types of data must be indexed separately and updated at different frequencies. Grouping price and technical specs in the same chunks forces you to re-index everything when prices change.
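The payoff of the split shows up at update time. In the sketch below, the field lists and the `indexes_to_refresh` helper are assumptions for illustration; the point is that a price change only touches the dynamic index.

```python
# Assumed field split; adapt the two sets to your own catalog schema.
STATIC_FIELDS = {"dimensions", "materials", "compatibility"}
DYNAMIC_FIELDS = {"price", "stock"}

def indexes_to_refresh(old: dict, new: dict) -> set[str]:
    """Compare two versions of a product and name the indexes that need re-indexing."""
    changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
    targets: set[str] = set()
    if changed & STATIC_FIELDS:
        targets.add("static")
    if changed & DYNAMIC_FIELDS:
        targets.add("dynamic")
    return targets

before = {"dimensions": "45 x 30 x 12 cm", "materials": "aluminium", "price": 79.0, "stock": 12}
after = {**before, "price": 74.0, "stock": 9}

print(indexes_to_refresh(before, after))
```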

3. Build a validation system

Quality RAG cites its sources. Every response generated must be traceable back to the chunk used. This traceability mechanism lets you detect drift and correct bad data at the source. Without traceability, you’ll never know where an incorrect answer came from.
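One way to get that traceability is to log every answer together with the ids of the chunks it was built from. The structure below is a hypothetical sketch (`TracedAnswer`, `record_answer` and `answers_using` are illustrative names), not a specific tool's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TracedAnswer:
    question: str
    text: str
    source_chunk_ids: tuple[str, ...]

def record_answer(log: list, question: str, text: str, used_chunk_ids: list[str]) -> TracedAnswer:
    """Store an answer with its sources; an answer without sources is rejected."""
    if not used_chunk_ids:
        raise ValueError("refusing to log an answer with no source chunk")
    entry = TracedAnswer(question, text, tuple(used_chunk_ids))
    log.append(entry)
    return entry

def answers_using(log: list, chunk_id: str) -> list:
    """Find every logged answer that relied on a given chunk."""
    return [a for a in log if chunk_id in a.source_chunk_ids]

log: list = []
record_answer(log, "Is lens X compatible with body Y?", "No, the mounts differ.",
              ["lens-x:technical", "body-y:technical"])
record_answer(log, "How heavy is tripod Z?", "It supports up to 5 kg.", ["tripod-z:technical"])

affected = answers_using(log, "lens-x:technical")
print(len(affected))
```

When a chunk turns out to be wrong, `answers_using` tells you which past answers to review.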

Is your AI agent anchored in your real data?

A RAG audit identifies structural gaps and hallucination risks in your current system. Take 30 minutes to verify this before your customers discover it.


RAG isn’t a solution. It’s a foundation.

An agent without RAG? A brilliant salesperson who doesn’t know your catalog. He improvises. He persuades. And sometimes he promises what you can’t deliver.

An agent with well-structured RAG? Something else: an expert on your offer, available 24/7, answering with the precision of a good salesperson and the consistency of a system.

But RAG isn’t magic. It’s only as reliable as the data you give it to digest.

Your AI agent’s quality is capped by your product knowledge base quality.

Good news: that’s entirely under your control. And that’s where real e-commerce competitiveness is decided in 2026.

Catalog chunking: how to break up your product sheets so RAG finds them

RAG retrieves text fragments. Not entire documents. How well it finds what you need depends directly on how you divided your data upstream.

In e-commerce, catalog chunking is the main source of degraded performance. Not the model. Not the retrieval pipeline. The chunking.

The problem with monolithic product sheets

The typical product sheet looks like this in a database: one block of text containing the product name, short description, long description, technical specs, care instructions, legal mentions, and sometimes customer reviews. All in one field.

When this block is naively chunked — cut every 512 tokens for example — you get incoherent fragments. One fragment might contain the end of the long description and the start of technical specs. Neither is complete. The agent retrieves these fragments and must reconstruct an answer from half-truncated data.

Observable result: the agent answers a question about fabric composition by citing care instructions. Or answers a sizing question by mixing product dimensions with packaging dimensions.
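The effect is easy to reproduce. The sketch below cuts a made-up product sheet every 8 words; words stand in for tokens and 8 stands in for 512, purely to keep the example readable.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` words, ignoring section boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Made-up monolithic sheet: description, specs and care instructions in one field.
sheet = (
    "DESCRIPTION Lightweight jacket for spring hikes. "
    "SPECS Fabric: 100% polyester. Weight: 300 g. "
    "CARE Machine wash cold."
)

chunks = fixed_size_chunks(sheet, 8)
for number, chunk in enumerate(chunks):
    print(number, "|", chunk)
```

The first chunk stops right after « Fabric: », and the fabric composition lands in the second chunk next to the start of the care instructions. A question about composition can now retrieve a fragment that mixes both.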

61% of incorrect e-commerce agent responses in testing come from a chunking problem — not a model or prompt problem. Source: internal analysis on 12 e-commerce RAG deployments between 2024 and 2025.

Recommended chunking structure for e-commerce

Each product sheet is split into self-contained chunks before indexing. One chunk = one specific question. No need to read other chunks to understand it.

6 chunks per product:

Identity chunk: name, brand, reference, category, subcategory. Included systematically in every retrieval. It contextualizes the others.
Functional description chunk: what the product does, for whom, in what context. 150 to 300 tokens.
Technical specs chunk: dimensions, weight, materials, colors, compatibility. Structured text format (« Dimensions: L 45 cm × W 30 cm × H 12 cm »).
Availability and logistics chunk: stock, delivery time, returns, warranty. Updated in real-time or daily.
Differentiation chunk: why this product over another, what makes it stand out. Often missing from standard sheets. Yet it’s the most useful for a conversational agent.
Social proof chunk: review summary, average rating, strengths and improvement areas. Updated monthly.
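The six-chunk structure can be sketched as a small builder that runs before indexing. The field names in `CHUNK_FIELDS` and the sample product are assumptions; each chunk is prefixed with a one-line identity header so it stays readable on its own.

```python
# Assumed mapping from chunk type to source fields; adjust to your schema.
CHUNK_FIELDS = {
    "identity": ["name", "brand", "reference", "category"],
    "description": ["description"],
    "technical": ["dimensions", "weight", "materials", "compatibility"],
    "logistics": ["stock", "delivery_time", "returns", "warranty"],
    "differentiation": ["differentiators"],
    "social_proof": ["review_summary", "average_rating"],
}

def build_chunks(product: dict) -> list[dict]:
    """Split one structured product into typed chunks, skipping empty ones."""
    header = f"{product['brand']} {product['name']} (ref. {product['reference']})"
    chunks = []
    for chunk_type, fields in CHUNK_FIELDS.items():
        lines = []
        for field in fields:
            if product.get(field) is not None:
                label = field.replace("_", " ").capitalize()
                lines.append(f"{label}: {product[field]}")
        if lines:
            chunks.append({
                "product_id": product["reference"],
                "chunk_type": chunk_type,
                "text": header + "\n" + "\n".join(lines),
            })
    return chunks

# Hypothetical product with no differentiation text yet.
product = {
    "name": "Trail 2", "brand": "Acme", "reference": "ACM-TR2", "category": "running shoes",
    "description": "Breathable shoe for daily road running.",
    "weight": "240 g", "materials": "mesh upper",
    "stock": 14, "delivery_time": "48 h",
    "review_summary": "Comfortable, runs slightly small.", "average_rating": 4.4,
}

chunks = build_chunks(product)
print([c["chunk_type"] for c in chunks])
```

A missing chunk type in the output (here, differentiation) is a useful signal in itself: it shows which content the product sheet still lacks.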

Metadata enrichment: the layer that changes everything

Each chunk carries structured metadata. The retriever filters before ranking.

Minimum metadata per chunk:

product_id: unique product reference
chunk_type: identity | description | technical | logistics | differentiation | social_proof
category: category and subcategory
price_range: budget / mid / premium, so the retriever can filter on a query like « cheap shoes »
last_updated: timestamp of last update
in_stock: boolean, updated in real time

Typical question: « Do you have any breathable running shoes under 80 euros available now? » The retriever filters on category=running, price_range=budget, in_stock=true, chunk_type=technical before any semantic scoring. The candidate pool shrinks from 50,000 chunks to 200, and the relevance of the final results improves sharply.
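Filter first, rank second: a minimal sketch of that order of operations. The four sample chunks are invented, and word overlap again stands in for the semantic score a vector database would compute; most vector stores expose an equivalent metadata filter, but the exact syntax depends on the product you use.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filter_then_rank(chunks: list[dict], filters: dict, query: str) -> list[dict]:
    """Keep only chunks whose metadata matches every filter, then rank the survivors."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    q = tokens(query)
    return sorted(candidates, key=lambda c: len(q & tokens(c["text"])), reverse=True)

def meta(product_id: str, price_range: str, in_stock: bool) -> dict:
    """Helper building the metadata fields listed above for a technical chunk."""
    return {"product_id": product_id, "category": "running", "price_range": price_range,
            "in_stock": in_stock, "chunk_type": "technical"}

chunks = [
    {"text": "Breathable mesh upper, 240 g.", "metadata": meta("A", "budget", True)},
    {"text": "Breathable knit upper, 210 g.", "metadata": meta("B", "premium", True)},
    {"text": "Waterproof leather upper, 480 g.", "metadata": meta("C", "budget", True)},
    {"text": "Breathable mesh upper, 250 g.", "metadata": meta("D", "budget", False)},
]
filters = {"category": "running", "price_range": "budget", "in_stock": True, "chunk_type": "technical"}

results = filter_then_rank(chunks, filters, "breathable running shoes")
print([c["metadata"]["product_id"] for c in results])
```

Product B is excluded on price and product D on stock before any scoring happens, which is exactly what keeps an out-of-stock item out of the answer.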

Practical rule: a chunk makes sense without reading the other chunks of the same product. That’s correct chunking. Test it by taking a random chunk and asking someone unfamiliar with the product if the text is coherent and complete.

Evaluating your RAG quality: the 4 metrics to track

A RAG that « seems to work well » isn’t enough. Quality is measured. Numbers. Test sets representative of your real customer queries.

The 4 standard metrics come from the RAGAS (RAG Assessment) framework. They’re complementary — each measures a different dimension of quality.

Metric 1: Faithfulness

Measure: is every claim in the agent’s response supported by the retrieved chunks?

Protocol: 50 representative questions. Compare response claims against retrieved sources. Ratio of supported claims / total claims.

Acceptable threshold: 0.85 minimum. Below that, the agent hallucinates — it invents information absent from chunks.

Main cause of low score: the LLM fills chunk gaps with its general knowledge. Fine for a generalist LLM. Catastrophic for a product agent.
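The ratio itself is simple arithmetic once each claim has been judged. In this sketch the supported/unsupported verdicts are given as input (in practice a reviewer or a judge model produces them), and the 18-out-of-20 figures are made up.

```python
FAITHFULNESS_MIN = 0.85  # threshold stated above

def faithfulness(claims_supported: list[bool]) -> float:
    """Ratio of claims supported by the retrieved chunks over all claims."""
    if not claims_supported:
        raise ValueError("no claims to evaluate")
    return sum(claims_supported) / len(claims_supported)

# Made-up evaluation: 20 claims extracted from the answers, 18 judged supported.
verdicts = [True] * 18 + [False] * 2
score = faithfulness(verdicts)
passes = score >= FAITHFULNESS_MIN
print(round(score, 2), passes)
```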

Metric 2: Answer Relevance

Measure: does the response actually answer the question asked?

Protocol: generate alternative questions from each response produced. If these alternatives resemble the original question, the response is relevant. If they drift toward other topics, the response has veered off.

Acceptable threshold: 0.80 minimum.

Main cause of low score: retrieved chunks are tangentially related to the question but don’t directly address it. Chunking or embedding problem.
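The protocol can be sketched as an average similarity between the original question and the questions regenerated from the answer. Two loud assumptions: the regenerated questions are hard-coded here instead of being produced by an LLM, and a bag-of-words cosine replaces a real embedding model, so the absolute values are not comparable to the 0.80 threshold.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer_relevance(question: str, regenerated: list[str]) -> float:
    """Mean similarity between the original question and the regenerated ones."""
    if not regenerated:
        raise ValueError("no regenerated questions to compare")
    q = bag_of_words(question)
    return sum(cosine(q, bag_of_words(r)) for r in regenerated) / len(regenerated)

question = "Is lens X compatible with body Y?"
on_topic = ["Is lens X compatible with body Y?", "Does lens X fit body Y?"]
off_topic = ["What is the delivery time?", "How do I return a product?"]

print(answer_relevance(question, on_topic) > answer_relevance(question, off_topic))
```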

0.72: the average faithfulness score observed on e-commerce RAGs deployed without formal evaluation. After chunking and system prompt optimization, this score averages 0.91.

Metric 3: Context Precision

Measure: what share of the retrieved chunks is actually useful for answering the question?

Ratio of useful chunks / total chunks retrieved. A retriever that brings back 10 chunks where only 2 serve the response scores 0.20 context precision.

Acceptable quality threshold: 0.70 minimum.

Main cause of low score: embeddings don’t capture your domain semantics. Technical e-commerce terms — SKU, cross-sell, bundle — or specific product names demand an embedding model fine-tuned on your catalog.

Metric 4: Context Recall

Measure: were all necessary chunks to answer the question actually retrieved?

Evaluates whether the retriever missed relevant information that existed in the base but wasn’t surfaced.

Acceptable quality threshold: 0.80 minimum.

Main cause of low score: the vector database is too large, chunks are too numerous, and the retriever surfaces the most semantically similar chunks rather than the most informative ones.
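Both retrieval metrics reduce to set arithmetic once you have labeled which chunks matter for a question. The sketch follows the simple ratios described in this article; the RAGAS library's own implementation may differ in detail (for instance by taking ranking into account), and the chunk ids below are invented.

```python
def context_precision(retrieved: list[str], useful: set[str]) -> float:
    """Useful chunks among those retrieved, divided by all retrieved chunks."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in useful) / len(retrieved)

def context_recall(retrieved: list[str], needed: set[str]) -> float:
    """Needed chunks that were actually retrieved, divided by all needed chunks."""
    if not needed:
        raise ValueError("label at least one needed chunk per question")
    return len(needed & set(retrieved)) / len(needed)

# The retriever brings back 10 chunks; 2 of them serve the response.
retrieved = [f"chunk-{i}" for i in range(10)]
useful = {"chunk-0", "chunk-3"}
# A third chunk was needed but never surfaced.
needed = {"chunk-0", "chunk-3", "chunk-42"}

precision = context_precision(retrieved, useful)
recall = context_recall(retrieved, needed)
print(round(precision, 2), round(recall, 2))
```

The same retrieval can score badly on both at once: too much noise brought back, and one necessary chunk still missing.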

Optimization priority: start with faithfulness — hallucination is direct business risk — then context precision — noisy retrieval degrades response quality — then context recall — missing information means incomplete responses — and finally answer relevance — topic drift frustrates users.
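That priority order can be turned into a small triage helper. The thresholds are the ones given in this article; the `next_metric_to_fix` function and the sample scores (apart from the 0.72 faithfulness average quoted earlier) are illustrative.

```python
# Minimum thresholds from the sections above, listed in optimization priority order.
THRESHOLDS = {
    "faithfulness": 0.85,
    "context_precision": 0.70,
    "context_recall": 0.80,
    "answer_relevance": 0.80,
}

def next_metric_to_fix(scores: dict[str, float]):
    """Return the highest-priority metric below its threshold, or None if all pass."""
    for metric, minimum in THRESHOLDS.items():
        if scores[metric] < minimum:
            return metric
    return None

scores = {
    "faithfulness": 0.72,
    "context_precision": 0.65,
    "context_recall": 0.84,
    "answer_relevance": 0.90,
}
print(next_metric_to_fix(scores))
```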

RAG vs fine-tuning: when to choose one over the other for your e-commerce

This is the most frequent question after « my RAG isn’t working well. » The answer depends on the nature of the problem, not a technology preference.

Choose RAG when

RAG is the right solution for information that changes regularly or is specific to your catalog:

Price, availability, product specs — this data changes too fast for fine-tuning integration.
Content exclusive to your company — descriptions, policies, internal FAQ. Fine-tuning on this is expensive and the model doesn’t generalize well.
Large and growing volume of data — a 50,000-reference catalog that expands weekly. RAG adapts immediately.
Traceability required — you need to cite the source of every claim. RAG gives you source chunks. Fine-tuning doesn’t.

Choose fine-tuning when

Fine-tuning improves performance when the issue isn’t knowledge but behavior:

The model doesn’t understand your sector’s vocabulary — technical terms, brand names, internal abbreviations. Fine-tuning on your glossary solves this.
Response tone must match your communication charter exactly — not just prompt instructions but the deep style of formulations.
Customer queries follow very specific patterns that the generalist model handles poorly — for example, highly structured technical comparison requests in your domain.
Latency is critical — a smaller fine-tuned model can respond 3 to 5 times faster than a large model with RAG on standard queries.

The optimal combination for e-commerce

The best architecture for most e-commerce stores isn’t RAG alone or fine-tuning alone. It’s a model fine-tuned on behavior and vocabulary, fed by a RAG pipeline for factual and dynamic data.

Concrete implementation:

Fine-tune on 500 to 1,000 representative question/response pairs for your use case — to calibrate behavior, tone, and domain vocabulary understanding.
Connect this fine-tuned model to your RAG pipeline — for real-time access to product data.
Result: an agent that answers in the right register, understands your technical terms, and cites factually accurate, up-to-date information.

3.4×: the user satisfaction score improvement observed on a combined RAG + fine-tuning architecture versus RAG alone with a generalist model, on specialized e-commerce use cases (professional electronics and home improvement).

RAG or fine-tuning? Not a technology choice. A strategic one. The real question: « What specific problem am I trying to solve? » The answer dictates the architecture.


Stéphane Jambu

SEO & AI Engineer

I build growth systems / AI / Neuroscience | 650+ clients · 80 LinkedIn testimonials · 30 years of expertise · 15 years of systems running without me.
