
RAG Systems in Production: What Most Teams Get Wrong

Most RAG implementations fail silently. Here are the retrieval, chunking, and architecture mistakes teams make and how to fix them.

Most RAG failures aren't dramatic. They're slow burns. Retrieval degrades. Hallucinations increase. Users complain. Teams investigate and discover the system was never retrieving the right documents in the first place.

The core problem is straightforward: RAG is fundamentally limited by retrieval quality. A better generation model cannot fix poor retrieval. Yet teams obsess over which LLM to use while treating retrieval as solved. It's not. This guide covers what actually works in production.

The Retrieval Quality Problem

Teams typically fail at RAG because they never treat retrieval quality as a measurement problem. They don't know whether their system is retrieving relevant documents because they never built an evaluation framework.

Build one early. Start with 20-50 representative queries paired with documents that should be retrieved. Measure precision at k (what percentage of top-k results are relevant), mean reciprocal rank, and end-to-end generation quality. See AI evaluation and testing for how to set this up rigorously.
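As a minimal sketch of the two retrieval metrics named above, assuming each query is labeled with a set of relevant document IDs and the retriever returns a ranked list of IDs. The names `precision_at_k`, `reciprocal_rank`, and `evaluate`, and the toy data, are illustrative rather than taken from any library.

```python
from typing import Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int = 5
) -> Dict[str, float]:
    """Average precision@k and mean reciprocal rank over every labeled query."""
    n = len(labels)
    p_at_k = sum(precision_at_k(runs[q], labels[q], k) for q in labels) / n
    mrr = sum(reciprocal_rank(runs[q], labels[q]) for q in labels) / n
    return {"precision_at_k": p_at_k, "mrr": mrr}


# Toy evaluation set: two queries with hand-labeled relevant documents.
labels = {"q1": {"d1", "d4"}, "q2": {"d9"}}
# Ranked retriever output for the same queries (made-up IDs).
runs = {"q1": ["d1", "d2", "d4", "d7"], "q2": ["d3", "d9", "d5", "d6"]}

metrics = evaluate(runs, labels, k=2)
print(metrics)
```

Running the same `evaluate` call against each candidate retriever configuration gives a like-for-like comparison on your own queries.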

Once you measure, you'll often discover a common problem: many teams pick generic embedding models straight from the MTEB leaderboard. These rank well on general-purpose benchmarks but can perform poorly on your specific data. Test embedding models against your evaluation set. A smaller, domain-specific model often outperforms a larger general-purpose one on in-domain queries, with the bonus of being faster and cheaper.

Chunking Strategies That Actually Work

Fixed-size chunks (512 or 1024 tokens) are convenient but destructive. They cut sentences mid-concept, fragment related information, and force arbitrary boundaries that don't align with your data structure.

Three better approaches:

Semantic chunking: Measure the distance between consecutive sentences using embeddings. Split when distance exceeds a threshold. This keeps related information together but adds preprocessing cost.

Structure-aware chunking: Align boundaries with your domain. Code repositories chunk by function. Medical records by encounter. Customer support by issue thread. Technical documentation by section and code block.

Hierarchical chunking: Create chunks at multiple abstraction levels. Short chunks for fine-grained retrieval, longer summary chunks for context. This helps the retriever find specific answers while understanding broader scope.
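The semantic chunking approach above can be sketched as follows. This is an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.8 distance threshold is a placeholder you would tune on your own data.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in embedding (word counts). Swap in a real embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # assumed value, tune against your evaluation set
) -> List[str]:
    """Start a new chunk whenever consecutive sentences are too far apart."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev_vec, cur_vec) > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The cache stores query results.",
    "The cache evicts old query results.",
    "Billing runs monthly.",
    "Billing invoices are emailed monthly.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With the toy embedding, the two cache sentences stay together and the billing sentences form a second chunk, because the only large jump in distance falls between the second and third sentence.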

Include metadata in every chunk—source, document type, date, author, category. Make this metadata searchable so you can filter results by context. See Pinecone's chunking guide for implementation patterns.
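One way to carry that metadata is a small record type with the listed fields, plus a filter applied to retrieved results. The `Chunk` class, the `filter_chunks` helper, and the sample values below are hypothetical. Most vector databases expose their own metadata filters, which you would normally use instead of filtering in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    doc_type: str
    published: date
    author: str
    category: str


def filter_chunks(
    chunks: List[Chunk], published_after: Optional[date] = None, **equals: str
) -> List[Chunk]:
    """Keep chunks newer than a date whose metadata matches every key=value."""
    kept = []
    for chunk in chunks:
        if published_after is not None and chunk.published <= published_after:
            continue
        if all(getattr(chunk, key) == value for key, value in equals.items()):
            kept.append(chunk)
    return kept


# Made-up sample chunks for illustration.
chunks = [
    Chunk("Restart the listener.", "runbooks/db.md", "runbook",
          date(2024, 5, 1), "ops", "database"),
    Chunk("Old restart steps.", "runbooks/db-old.md", "runbook",
          date(2021, 1, 10), "ops", "database"),
    Chunk("Refund policy.", "policies/refunds.md", "policy",
          date(2024, 2, 3), "finance", "billing"),
]

recent_db_runbooks = filter_chunks(
    chunks, published_after=date(2023, 1, 1), doc_type="runbook", category="database"
)
print([c.source for c in recent_db_runbooks])
```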

Hybrid Search: When Vector-Only Fails

Semantic search alone misses domain-specific terminology, proper nouns, and exact matches. A query for "error code ORA-12514" won't retrieve well using embeddings alone.

Hybrid search combines vector search (for semantic similarity) with BM25 keyword search (for exact matches). For each query, retrieve candidates from both, combine results intelligently, and rerank. The implementation is straightforward: normalize scores from both retrieval methods and weight them (try 0.6 semantic / 0.4 keyword, then tune to your data).
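A minimal sketch of the normalize-and-weight step, assuming each retriever returns a mapping from document ID to raw score. Min-max normalization is one reasonable choice here, not the only one, and the function names and scores are illustrative.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_rank(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    semantic_weight: float = 0.6,  # starting point from the text, tune it
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a missing score counts as 0."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    keyword_weight = 1.0 - semantic_weight
    fused = {
        doc_id: semantic_weight * vec.get(doc_id, 0.0)
        + keyword_weight * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up raw scores: cosine similarities and BM25 scores live on
# different scales, which is why normalization comes first.
vector_scores = {"a": 0.9, "b": 0.7, "c": 0.5}
keyword_scores = {"c": 12.0, "d": 4.0, "a": 8.0}

ranked = hybrid_rank(vector_scores, keyword_scores)
print([doc_id for doc_id, _ in ranked])
```

Document `c` ranks last on vector similarity but rises to second place after fusion because it is the strongest keyword match, which is the behavior you want for queries like exact error codes.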

The benefit is substantial. Keyword search catches terminology and names. Vector search catches conceptual matches. Together they improve recall, and because the two retrievals can run in parallel, the added latency is usually small.

Reranking and Two-Stage Retrieval

After retrieval, you have 50-100 candidates. Most are relevant-ish. You need the best ones for your LLM.

This is reranking. Run your initial retrieval with a fast bi-encoder. Rerank the top candidates with a cross-encoder (a model specifically trained to score query-document relevance). Keep the top 5-10. Apply metadata filters. Pass to the LLM.

Cross-encoder reranking consistently outperforms bi-encoder retrieval alone. The latency cost (100-300ms) is justified by precision gains. This two-stage pattern is current best practice for production systems.
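The two-stage flow can be sketched with pluggable scoring functions. In this illustration `overlap_score` stands in for a fast bi-encoder and `jaccard_score` for a cross-encoder. Both are toy scorers chosen so the example runs without model downloads, and all names here are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, str]
Scorer = Callable[[str, Doc], float]


def words(text: str) -> set:
    return set(text.lower().split())


def overlap_score(query: str, doc: Doc) -> float:
    """Toy first-stage scorer: number of shared words (bi-encoder stand-in)."""
    return float(len(words(query) & words(doc["text"])))


def jaccard_score(query: str, doc: Doc) -> float:
    """Toy second-stage scorer: Jaccard overlap (cross-encoder stand-in)."""
    q, d = words(query), words(doc["text"])
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def two_stage_retrieve(
    query: str,
    corpus: List[Doc],
    fast_score: Scorer,
    rerank_score: Scorer,
    candidates: int = 50,
    keep: int = 5,
    doc_filter: Optional[Callable[[Doc], bool]] = None,
) -> List[Doc]:
    # Stage 1: cheap scoring over the whole corpus, keep a wide shortlist.
    shortlist = sorted(corpus, key=lambda d: fast_score(query, d), reverse=True)
    shortlist = shortlist[:candidates]
    # Stage 2: expensive scoring over the shortlist only, keep the best few.
    reranked = sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)
    reranked = reranked[:keep]
    # Metadata filtering happens last, following the order described above.
    if doc_filter is not None:
        reranked = [d for d in reranked if doc_filter(d)]
    return reranked


corpus = [
    {"id": "1", "text": "reset password from the login page"},
    {"id": "2", "text": "password policy and rotation rules"},
    {"id": "3", "text": "quarterly revenue summary"},
]
results = two_stage_retrieve(
    "reset password", corpus, overlap_score, jaccard_score, candidates=2, keep=1
)
print([d["id"] for d in results])
```

The design point is that the expensive scorer only ever sees `candidates` documents, so its cost is bounded regardless of corpus size.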

Context Length and Latency Tradeoffs

Many modern LLMs offer context windows of 100k tokens or more. Resist the urge to fill them. Most question-answering tasks see diminishing returns after 3,000-5,000 tokens of context. More tokens mean higher latency and cost with minimal quality improvement. Treat that range as a starting hypothesis and measure it on your evaluation set.

Be deliberate about what goes into context. Format clearly. Include metadata and source information. A well-formatted 4,000 tokens outperforms unstructured 8,000.
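A minimal sketch of budgeted context assembly, assuming the chunks arrive ranked best-first and each carries a `source` field. `approx_tokens` counts whitespace-separated words, which is only a rough proxy. In practice you would pass in the tokenizer that matches your model.

```python
from typing import Callable, Dict, List


def approx_tokens(text: str) -> int:
    """Crude token estimate (whitespace words). Replace with a real tokenizer."""
    return len(text.split())


def pack_context(
    ranked_chunks: List[Dict[str, str]],
    max_tokens: int = 4000,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> str:
    """Add chunks in rank order, each labeled with its source, until the
    next chunk would exceed the token budget."""
    blocks: List[str] = []
    used = 0
    for index, chunk in enumerate(ranked_chunks, start=1):
        block = f"[{index}] source: {chunk['source']}\n{chunk['text']}"
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)


ranked_chunks = [
    {"source": "a.md", "text": "one two three"},
    {"source": "b.md", "text": "four five six seven"},
]
tight = pack_context(ranked_chunks, max_tokens=10)
roomy = pack_context(ranked_chunks, max_tokens=13)
print(tight)
```

Sweeping `max_tokens` while scoring answers on the evaluation set is one way to find where quality stops improving for your workload.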

When RAG Fails and Alternatives

RAG works best for dynamic data where you want current information. It is the wrong tool when one of these alternatives fits better:

Fine-tuning is better: You have a large, high-quality dataset and want permanent domain knowledge. More expensive upfront but potentially cheaper long-term.

Prompt engineering suffices: Your task is well-defined and the model's base knowledge is enough. Structured data extraction often falls here.

Long context replaces it: With 200k-token context windows, such as those Anthropic's models offer, dumping all documents into context and letting the model find what matters is sometimes faster and cheaper than building retrieval infrastructure.

Before investing in RAG, measure the cost-quality tradeoff against alternatives. See scaling AI products for architecture decisions at scale.
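The cost side of that comparison is simple arithmetic. The sketch below counts only input tokens plus a fixed monthly infrastructure cost. Every number in it (the price per million tokens, token counts, query volumes, and the fixed cost) is a placeholder assumption, not a quoted price or a measurement.

```python
def monthly_cost(
    queries_per_month: int,
    input_tokens_per_query: int,
    price_per_million_input_tokens: float,
    fixed_monthly_cost: float = 0.0,
) -> float:
    """Input-token spend plus any fixed infrastructure cost, per month."""
    token_cost = (
        queries_per_month
        * input_tokens_per_query
        * price_per_million_input_tokens
        / 1_000_000
    )
    return token_cost + fixed_monthly_cost


PRICE = 3.0  # placeholder price per million input tokens

for volume in (100, 10_000):
    # Long context: send the whole 150k-token document set on every query.
    long_context = monthly_cost(volume, 150_000, PRICE)
    # RAG: send ~4k tokens of retrieved context, pay a fixed infrastructure cost.
    rag = monthly_cost(volume, 4_000, PRICE, fixed_monthly_cost=200.0)
    print(f"{volume} queries: long-context={long_context:.2f} rag={rag:.2f}")
```

Under these placeholder numbers long context is cheaper at 100 queries a month and RAG is cheaper at 10,000, which is why the comparison has to be run with your own volumes and prices, and alongside a quality measurement.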

Data Access and MCP

The Model Context Protocol provides a standard way to expose retrieval as discrete tools. Instead of baking retrieval into application logic, define MCP-compatible search and retrieve operations. This enables multi-step agent workflows where agents can iteratively retrieve documents, reason, and request more specific information.
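To make "retrieval as a discrete tool" concrete, here is a sketch that uses no MCP SDK. The descriptor follows the `name` / `description` / `inputSchema` shape that MCP uses when listing tools. The tool name `search_documents`, the in-memory corpus, and the `call_tool` dispatcher are hypothetical stand-ins for a real server.

```python
from typing import Any, Callable, Dict, List

# Tool descriptor: the input schema is ordinary JSON Schema.
SEARCH_TOOL: Dict[str, Any] = {
    "name": "search_documents",
    "description": "Search the knowledge base and return matching passages.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1, "default": 5},
        },
        "required": ["query"],
    },
}

# Hypothetical in-memory corpus standing in for a real retriever.
CORPUS = [
    {"id": "kb-1", "text": "ORA-12514 means the listener does not know the service."},
    {"id": "kb-2", "text": "Rotate API keys every 90 days."},
]


def search_documents(query: str, top_k: int = 5) -> List[Dict[str, str]]:
    """Placeholder retrieval: case-insensitive substring match."""
    matches = [doc for doc in CORPUS if query.lower() in doc["text"].lower()]
    return matches[:top_k]


HANDLERS: Dict[str, Callable[..., Any]] = {"search_documents": search_documents}


def call_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """Dispatch a tool call by name, as an agent runtime would."""
    if name not in HANDLERS:
        raise KeyError(f"unknown tool: {name}")
    return HANDLERS[name](**arguments)


hits = call_tool("search_documents", {"query": "ORA-12514", "top_k": 3})
print([doc["id"] for doc in hits])
```

Because the agent sees only the descriptor and the results, the retrieval implementation behind `search_documents` can change (hybrid search, reranking) without touching agent logic.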

Structure your retrieval operations as clear, reusable tools and you're ready for agent-based applications as MCP adoption grows. See context engineering for how RAG fits into broader context strategy.

Production Implementation

Start with proven tools. LangChain's RAG tutorial walks through the pattern. Use a vector database (Weaviate, Qdrant, Pinecone) for similarity search. Implement two-stage retrieval: fast bi-encoder retrieval, then cross-encoder reranking.

Monitor retrieval quality in production. Track precision at k, mean reciprocal rank, and generation quality. Set alerts for degradation. When metrics slip, investigate chunking staleness, embedding drift, or data distribution shifts.
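A degradation alert can be as simple as comparing a rolling average against a baseline. In this sketch the `MetricMonitor` class, the window size, and the 10% allowed drop are all assumptions for illustration. A production setup would more likely emit the metric to an existing monitoring system and alert there.

```python
from collections import deque


class MetricMonitor:
    """Flags when the rolling mean of a metric falls too far below baseline."""

    def __init__(self, baseline: float, window: int = 100, max_drop: float = 0.1):
        self.threshold = baseline * (1.0 - max_drop)
        self.values = deque(maxlen=window)

    def record(self, value: float) -> bool:
        """Store one observation. Returns True if an alert should fire."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data for a stable average yet
        rolling_mean = sum(self.values) / len(self.values)
        return rolling_mean < self.threshold


# Baseline precision@k of 0.8 measured offline (made-up number).
monitor = MetricMonitor(baseline=0.8, window=3, max_drop=0.1)
alerts = [monitor.record(v) for v in [0.8, 0.8, 0.8, 0.5, 0.5]]
print(alerts)
```

The same class can track mean reciprocal rank or a generation-quality score by creating one monitor per metric.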

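Optimize cost: document embeddings are computed once at indexing time and recomputed only when content or the embedding model changes, while generation cost recurs on every query. Reduce context length and batch queries when possible. For architectural choices at scale, see AI for technical leaders.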

Summary

The teams shipping high-quality RAG systems treat retrieval as seriously as generation. They measure from day one. They test embedding models on actual data. They implement hybrid search and reranking. They chunk thoughtfully, aligned with domain structure. They optimize context length empirically, not aspirationally. They monitor production metrics continuously.

RAG feels simple—embed documents, retrieve similar ones, generate answers. Production RAG is harder. The difference between systems that work in demos and systems that scale is taking retrieval seriously.

Tags: RAG, retrieval augmented generation, AI architecture, vector search, production AI