
RAG in Production: Caching, Costs, and Scaling Retrieval

  • ShiftQuality Contributor
  • May 24, 2025
  • 5 min read

You built a RAG system. It retrieves relevant documents, feeds them to an LLM, and generates grounded responses. In your demo, it works beautifully.

Production is not your demo. Production is 10,000 queries per hour, customers who notice when answers take 4 seconds, monthly bills that need justification, and failure modes you never encountered with 50 test documents.

The architecture that works at demo scale breaks at production scale in predictable ways. Understanding those failure modes before they hit is the difference between a production system and a demo that got deployed.

The Cost Architecture of RAG

Every RAG query has three cost components, and they compound.

Embedding cost. Converting the user's query into a vector for similarity search. This is cheap per query (fractions of a cent) but adds up at volume. More importantly, your document corpus needs to be embedded too — and re-embedded when documents change.

Retrieval cost. Vector similarity search against your index. Managed vector databases (Pinecone, Weaviate Cloud, Qdrant Cloud) charge by storage and queries. Self-hosted solutions (pgvector, Milvus, Qdrant) cost infrastructure. Either way, this scales with corpus size and query volume.

LLM generation cost. The dominant cost. Each retrieved document added to the context increases input token count. If you retrieve 5 documents averaging 500 tokens each, that's 2,500 tokens of context per query — before the system prompt and user question. At scale, this is where the bill lives.

The multiplier effect: If naive RAG retrieves more documents than necessary (hedging for relevance), every extra document adds input tokens to every query. A system that retrieves 10 documents when 3 would suffice pays for roughly 3x the retrieved context on every query, in exchange for a marginal relevance improvement.
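
To make the arithmetic concrete, here is a rough cost estimator. The prices, the 300-token overhead for the system prompt and question, and the 300-token answer length are all assumptions for illustration; substitute your provider's real rates and your own measurements.

```python
# Hypothetical prices for illustration only; replace with your provider's rates.
INPUT_PRICE_PER_1K_TOKENS = 0.003   # USD, assumed
OUTPUT_PRICE_PER_1K_TOKENS = 0.015  # USD, assumed


def query_cost(num_docs, tokens_per_doc=500, overhead_tokens=300, output_tokens=300):
    """Estimate the LLM cost in USD of one RAG query.

    overhead_tokens covers the system prompt and user question (assumed value).
    """
    input_tokens = num_docs * tokens_per_doc + overhead_tokens
    input_cost = input_tokens / 1000 * INPUT_PRICE_PER_1K_TOKENS
    output_cost = output_tokens / 1000 * OUTPUT_PRICE_PER_1K_TOKENS
    return input_cost + output_cost


def monthly_cost(num_docs, queries_per_hour=10_000, hours=24 * 30):
    """Scale the per-query estimate to a month of steady traffic."""
    return query_cost(num_docs) * queries_per_hour * hours


if __name__ == "__main__":
    for k in (3, 5, 10):
        print(f"{k} docs: ${query_cost(k):.4f}/query, ${monthly_cost(k):,.0f}/month")
```

The retrieved-context portion of the bill grows linearly with the number of documents, while the prompt overhead and output tokens stay fixed, so the total grows somewhat more slowly than the context itself.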

Cost Control Strategies

Retrieve less, retrieve better. Invest in retrieval quality so you can reduce the number of retrieved documents while maintaining answer quality. Better chunking, better embeddings, and re-ranking (using a smaller model to score and filter retrieved documents before sending to the LLM) all help.
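
A minimal sketch of the re-ranking step: over-retrieve candidates, score each against the query, and keep only the best few. The `overlap_score` function is a toy stand-in so the example runs without a model; in practice you would pass in a real re-ranking model as `scorer`. The `top_n` and `min_score` defaults are assumptions, not recommendations.

```python
def overlap_score(query, text):
    """Toy stand-in for a re-ranking model: fraction of query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(query, candidates, scorer=overlap_score, top_n=3, min_score=0.2):
    """Score every candidate, drop weak ones, and return the top_n as (score, text)."""
    scored = [(scorer(query, text), text) for text in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_n]


if __name__ == "__main__":
    docs = [
        "reset your password from the account settings page",
        "our office hours are nine to five",
        "password rules require twelve characters",
    ]
    for score, text in rerank("reset password", docs):
        print(f"{score:.2f}  {text}")
```

Only the documents that survive this filter go to the LLM, which is where the token savings come from.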

Compress retrieved context. Summarize or extract relevant sections from retrieved documents rather than passing them whole. A 2,000-token document might contain 200 tokens of relevant information. Extracting that relevant portion before sending to the LLM cuts costs dramatically.

Use tiered models. Route simple queries to a smaller, cheaper model. Reserve the expensive model for complex queries that need stronger reasoning. A classifier that costs fractions of a cent per query can pay for itself many times over by keeping simple queries off the expensive model.
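
A routing layer can start as a few heuristics before you invest in a trained classifier. The sketch below is one hypothetical version: the model names, the keyword list, and the thresholds are all placeholders to tune against your own traffic.

```python
# Placeholder model names; substitute whatever tiers you actually deploy.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

# Assumed signals that a query needs multi-step reasoning.
REASONING_HINTS = ("compare", "why", "explain", "trade-off", "difference between")


def choose_model(query, num_retrieved_docs, max_simple_words=15, max_simple_docs=3):
    """Pick a model tier from cheap-to-compute features of the query."""
    text = query.lower()
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    is_long = len(text.split()) > max_simple_words
    has_much_context = num_retrieved_docs > max_simple_docs
    if needs_reasoning or is_long or has_much_context:
        return LARGE_MODEL
    return SMALL_MODEL
```

Log which tier each query went to alongside user feedback, so you can tell whether the router is sending hard questions to the small model.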

Semantic Caching

If 30% of your queries are semantically similar to ones you've already answered, you're paying for the full RAG pipeline on nearly a third of your traffic that a cache could have served.

Semantic caching stores query-response pairs and returns cached responses for queries that are semantically similar to previous ones. Unlike exact-match caching, semantic caching handles paraphrases — "How do I reset my password?" and "I forgot my password, help" hit the same cache entry.

Implementation: Embed each incoming query and check similarity against cached query embeddings. If similarity exceeds a threshold (typically 0.92-0.95), return the cached response. If not, run the full RAG pipeline and cache the result.

The tradeoff: Higher similarity thresholds mean fewer cache hits but more accurate responses. Lower thresholds mean more cache hits but risk returning irrelevant cached answers. Start conservative (0.95) and lower the threshold as you verify quality.

Cache invalidation: When underlying documents change, cached responses that were grounded in those documents become stale. Tag cache entries with the document IDs they were derived from, and invalidate when those documents update.
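
The lookup, threshold check, and document-tagged invalidation described above fit in a small class. This is a sketch under assumptions: embeddings are supplied by the caller (no embedding model is called here), the class and method names are illustrative, and the linear scan over entries would be replaced by a vector index at real volume.

```python
import math
from dataclasses import dataclass


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    embedding: list
    response: str
    doc_ids: set  # documents the cached response was grounded in


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []

    def get(self, query_embedding):
        """Return the cached response of the most similar entry, or None on a miss."""
        best_entry, best_score = None, -1.0
        for entry in self.entries:
            score = cosine_similarity(query_embedding, entry.embedding)
            if score > best_score:
                best_entry, best_score = entry, score
        if best_entry is not None and best_score >= self.threshold:
            return best_entry.response
        return None

    def put(self, query_embedding, response, doc_ids):
        """Store a response together with the IDs of the documents behind it."""
        self.entries.append(CacheEntry(list(query_embedding), response, set(doc_ids)))

    def invalidate_document(self, doc_id):
        """Drop every entry grounded in doc_id; return how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if doc_id not in e.doc_ids]
        return before - len(self.entries)
```

Call `get` before running the pipeline, `put` after a miss, and `invalidate_document` from whatever hook fires when a source document changes.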

Scaling the Vector Index

When pgvector Isn't Enough

PostgreSQL with pgvector is the simplest starting point — your vectors live alongside your relational data, no additional infrastructure needed. It works well up to a few million vectors.

Beyond that, query latency tends to climb: pgvector's index types (IVFFlat, HNSW) become slower to build and need more memory as the vector count grows. At this point, you're choosing between:

Managed vector databases (Pinecone, Weaviate Cloud): Operational simplicity. Someone else handles scaling, replication, and index optimization. Cost scales with usage.

Self-hosted specialized databases (Milvus, Qdrant): More control, potentially lower cost at scale, but you own the operations. Reasonable choice if your team has infrastructure expertise.

Hybrid search: Whichever database you choose, consider combining vector similarity with keyword search (BM25). Some queries are better served by exact keyword matching; others need semantic similarity. Hybrid search runs both and merges the results into a single ranking. This improves retrieval quality significantly for queries that contain specific terms, product names, or technical jargon that embedding models don't handle well.
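
One common way to merge the two result lists is reciprocal rank fusion (RRF). Note that this is a substitution for illustration: RRF combines ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. The constant `k=60` is the conventional default; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_ranking = ["a", "b", "c"]   # from similarity search
    keyword_ranking = ["c", "a", "d"]  # from BM25
    print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

Documents that rank well in both lists rise to the top, while a document found by only one method still makes the fused list.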

Index Maintenance

Your vector index isn't static. Documents get added, updated, and deleted. The index needs to reflect these changes without full rebuilds.

Incremental updates: Add new document embeddings as documents are created. Delete embeddings when documents are removed. Most vector databases support this natively.

Re-embedding on document changes: When a document's content changes, the old embedding is stale. Re-embed the updated content and replace the vector. Batch these updates in short windows to avoid constant small writes, but keep the delay bounded so the index doesn't drift out of date.

Periodic reindexing: Some index types (IVFFlat in pgvector) benefit from periodic full rebuilds as the data distribution changes. Schedule this during low-traffic periods.
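
A simple way to decide what needs re-embedding is to store a content hash next to each vector and compare it against the current documents. The sketch below assumes you keep a `doc_id -> hash` mapping for everything in the index; the function names are illustrative, and the actual embedding and upsert calls are left to your vector database client.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_updates(documents, indexed_hashes):
    """Compare current documents with what the index holds.

    documents: dict of doc_id -> current text
    indexed_hashes: dict of doc_id -> hash stored when the doc was last embedded
    Returns (to_upsert, to_delete) as sorted lists of doc IDs.
    """
    to_upsert = sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return to_upsert, to_delete


def batched(items, batch_size):
    """Yield fixed-size slices so re-embedding runs as a few larger writes."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

New and changed documents both land in `to_upsert`, so one code path handles additions and updates.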

Failure Modes in Production

Retrieval Misses

The system retrieves documents that aren't relevant, and the LLM generates a confident but wrong answer grounded in irrelevant context. This is the most common failure mode and the hardest to detect automatically.

Mitigation: Add a relevance threshold — if no retrieved document scores above a minimum similarity, return "I don't have enough information to answer this" instead of generating from irrelevant context. The LLM will happily work with bad context. Your system should refuse to let it.
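
The guard itself is only a few lines. In this sketch, `results` is a list of `(document, similarity)` pairs from your retriever and `generate` is whatever function calls your LLM; both are stand-ins, and the `0.75` default is an assumed starting point that depends heavily on your embedding model.

```python
FALLBACK = "I don't have enough information to answer this."


def filter_relevant(results, min_score=0.75):
    """Keep only (document, score) pairs at or above the similarity floor."""
    return [(doc, score) for doc, score in results if score >= min_score]


def answer(query, results, generate, min_score=0.75):
    """Refuse to generate when nothing relevant was retrieved."""
    relevant = filter_relevant(results, min_score)
    if not relevant:
        return FALLBACK  # never hand the LLM irrelevant context
    return generate(query, [doc for doc, _ in relevant])
```

Log every fallback along with the query: a cluster of refusals on one topic usually points to a gap in the corpus.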

Context Window Overflow

As your corpus grows, you might retrieve more context than the model can handle. Long contexts also degrade generation quality — models tend to lose information in the middle of long contexts.

Mitigation: Hard-limit the total context tokens. Prioritize by relevance score — include the most relevant documents first, truncate the rest. Consider summarizing retrieved documents to fit more information in fewer tokens.
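
A hard token budget with relevance-first packing can be sketched as follows. The whitespace-based `count_tokens` is a crude approximation used only to keep the example self-contained; pass in your model's real tokenizer in practice. Skipping an oversized document and continuing to smaller ones is one design choice; stopping at the first document that doesn't fit is an equally valid one.

```python
def count_tokens(text):
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(docs, max_tokens, token_counter=count_tokens):
    """Select documents in descending relevance order without exceeding max_tokens.

    docs: list of (text, relevance_score) pairs
    Returns (selected_texts, tokens_used).
    """
    selected, used = [], 0
    for text, _score in sorted(docs, key=lambda d: d[1], reverse=True):
        cost = token_counter(text)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a smaller doc may still fit
        selected.append(text)
        used += cost
    return selected, used
```

Because the most relevant documents are placed first, anything dropped for budget reasons is the material least likely to matter.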

Latency Spikes

The RAG pipeline has multiple sequential steps: embed query, search index, retrieve documents, generate response. Each step adds latency, and any one of them can spike.

Mitigation: Set timeouts on each step. If embedding takes longer than 500ms, something's wrong. If generation takes longer than your acceptable latency, use streaming to show partial responses while the model generates. Monitor P95 latency, not just averages — the tail is where users experience pain.
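
Here is one way to put a deadline on each pipeline step and to compute tail latency, using only the standard library. The helper names are illustrative. One caveat worth stating plainly: a thread that has already started can't be killed, so the timeout stops you from waiting but doesn't stop the work itself.

```python
import concurrent.futures
import math

# Shared pool so a timed-out step doesn't block the caller on shutdown.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class StepTimeout(Exception):
    """Raised when a pipeline step exceeds its deadline."""


def run_with_timeout(step_name, func, timeout_s, *args):
    """Run func(*args) and give up waiting after timeout_s seconds."""
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only effective if the task has not started yet
        raise StepTimeout(f"{step_name} exceeded {timeout_s}s") from None


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for P95."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 126, 2400]
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p95: ", percentile(latencies_ms, 95))
```

In the sample data a single slow request drags the mean far above the median, and P95 surfaces that request directly, which is why the tail is the number to alert on.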

Stale Data

Documents change but the index isn't updated. Users get answers grounded in outdated information and lose trust.

Mitigation: Build index update into your document update pipeline. When a document changes in your CMS or database, the embedding update should trigger automatically, not rely on a batch job that runs "eventually."

Monitoring RAG Quality

In production, you need ongoing quality measurement, not just launch-day evaluation.

Retrieval relevance: Sample queries regularly and evaluate whether retrieved documents are relevant. Automated metrics (NDCG, MRR) plus periodic human review.
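
Both metrics are short enough to compute yourself on a sampled query set. This sketch assumes you have relevance judgments per query (a set of relevant IDs for MRR, graded scores for NDCG) and uses the linear-gain form of DCG; some tools use an exponential gain instead, so compare numbers only within one convention.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per sampled query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(gain / math.log2(position + 1) for position, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, relevance, k):
    """relevance: dict of doc_id -> graded relevance (missing IDs count as 0)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Track these per topic as well as overall: an aggregate score can stay flat while one category of questions quietly degrades.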

Answer accuracy: Compare generated answers against known-good answers for a test set. Run this on a schedule to catch degradation.

User feedback: Thumbs up/down on answers. This is noisy but reveals patterns — if a particular topic consistently gets negative feedback, the retrieval or document quality for that topic needs work.

Cost per query: Track and trend it. If cost per query creeps up, something changed — more documents retrieved, longer context, more expensive model routing. Catch it before the invoice does.

Key Takeaway

RAG in production is an engineering problem, not just an AI problem. Control costs by retrieving less but better, compressing context, and using tiered models. Implement semantic caching for repeated query patterns. Scale the vector index as your corpus grows. Handle failure modes — retrieval misses, context overflow, latency spikes, stale data — with explicit mitigations. Monitor retrieval quality, answer accuracy, and cost continuously.

This completes the Building RAG Systems learning path. You've covered RAG architecture, embedding strategies, retrieval tuning, and production operations. The throughline: RAG is about getting the right information to the model — and in production, doing that reliably, affordably, and at scale.
