Retrieval Tuning: Making RAG Actually Find the Right Stuff

Contributor
Mar 1

Updated: Jun 22

The previous posts in this path covered RAG architecture and embedding strategies. This post covers the work that determines whether a RAG system actually works: retrieval tuning — the systematic process of ensuring that when a user asks a question, the retrieval step finds the documents that contain the answer.

A RAG system that retrieves the wrong documents produces hallucinated, irrelevant, or incomplete answers — regardless of how good the LLM is. The generation model can only work with what it is given. If the retrieval step hands it three paragraphs about the company's 2019 marketing strategy when the user asked about the 2024 refund policy, the response will be confidently wrong. The LLM is not the bottleneck. The retrieval is.

Measuring Retrieval Quality

Before tuning anything, you need a way to measure whether retrieval is working. This requires an evaluation dataset: a set of queries paired with the documents that should be retrieved for each query.

Building this dataset is the hardest part of retrieval tuning. You need real user queries (or realistic synthetic ones) and human-labeled relevant documents. "For this query, documents 4, 17, and 23 are relevant. Documents 8 and 31 are partially relevant. Everything else is irrelevant."

The standard metrics are recall@k (what fraction of relevant documents appear in the top k results) and precision@k (what fraction of the top k results are actually relevant). Mean Reciprocal Rank (MRR) averages, across queries, the reciprocal of the rank at which the first relevant result appears, so a higher MRR means relevant results surface earlier.
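These metrics are short enough to write by hand. Here is a minimal sketch, assuming binary relevance labels (partially relevant documents are counted as not relevant, which is a simplification) and hypothetical document IDs such as "d4":

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k result slots filled by relevant documents."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical labels for one query: d4, d17 and d23 are relevant.
relevant_docs = {"d4", "d17", "d23"}
retrieved_docs = ["d8", "d4", "d31", "d17", "d2"]

print(recall_at_k(retrieved_docs, relevant_docs, k=5))     # 2 of 3 relevant found
print(precision_at_k(retrieved_docs, relevant_docs, k=5))  # 2 of 5 slots relevant
print(reciprocal_rank(retrieved_docs, relevant_docs))      # first hit at rank 2
```

The functions assume each document ID appears at most once in the retrieved list.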

For RAG systems, recall usually matters more than precision. A retrieved document that is irrelevant wastes context window space and can occasionally distract the model, but the LLM will often ignore it. A relevant document that is not retrieved means the answer is missing from the context, which causes hallucination or incomplete responses.

The target: recall@10 above 90% for your evaluation set. If fewer than 90% of the relevant documents appear in the top 10 results, the retrieval step is losing information that the generation step needs.

Hybrid Search: Best of Both Worlds

Pure semantic search (embedding-based) excels at finding conceptually similar documents but struggles with exact matches — specific product names, error codes, policy numbers, and jargon. Pure keyword search (BM25) excels at exact matches but misses conceptual relationships — a query about "employee termination" will not find a document about "workforce reduction."

Hybrid search combines both approaches. Each query runs through both a semantic search and a keyword search. The results are merged using a fusion algorithm that balances the strengths of each approach.

Reciprocal Rank Fusion (RRF) is one of the simplest fusion methods and works well in practice. It scores each document by its rank in each result list, adding 1/(k + rank) for every list the document appears in (k is a smoothing constant, commonly 60), so documents that appear in both lists are rewarded. With that constant, a document ranked 3rd in semantic search and 5th in keyword search receives a higher fused score than a document ranked 1st in semantic search but absent from keyword search results.
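To make the ranking example concrete, here is a minimal sketch of RRF. It assumes the commonly used constant k = 60 and hypothetical document IDs; the two input lists stand in for whatever your semantic and keyword searches return.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# "both" is 3rd in semantic search and 5th in keyword search.
# "top_semantic" is 1st in semantic search and absent from keyword search.
semantic_results = ["top_semantic", "s2", "both", "s4", "s5"]
keyword_results = ["k1", "k2", "k3", "k4", "both"]

fused = reciprocal_rank_fusion([semantic_results, keyword_results])
print(fused)
```

With k = 60, "both" scores 1/63 + 1/65 (about 0.031), while "top_semantic" scores 1/61 (about 0.016), so the document found by both searches comes out ahead.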

The weight between semantic and keyword results is a tuning parameter. Plain RRF treats both lists equally, so weighting means scaling each list's contribution before the scores are summed. For technical documentation with precise terminology, keyword search should be weighted more heavily. For conversational queries where users describe concepts rather than using exact terms, semantic search should dominate. The evaluation dataset tells you which weighting produces the best recall for your specific corpus and query patterns.
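One way to expose that weight is a weighted variant of rank fusion, swept against the evaluation set. This is a sketch under assumptions: the `semantic_weight` parameter, the weighted formula, and the toy data are all illustrative, not a standard API.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def weighted_rrf(semantic: Sequence[str], keyword: Sequence[str],
                 semantic_weight: float = 0.5, k: int = 60) -> List[str]:
    """Rank fusion where each list's contribution is scaled by a weight."""
    keyword_weight = 1.0 - semantic_weight
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in ((semantic_weight, semantic), (keyword_weight, keyword)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def sweep_weights(eval_set: List[Tuple[Sequence[str], Sequence[str], Set[str]]],
                  weights: Sequence[float], top_k: int) -> Dict[float, float]:
    """Mean recall@top_k for each candidate semantic weight."""
    results: Dict[float, float] = {}
    for weight in weights:
        recalls = []
        for semantic, keyword, relevant in eval_set:
            fused = weighted_rrf(semantic, keyword, semantic_weight=weight)
            recalls.append(len(set(fused[:top_k]) & relevant) / len(relevant))
        results[weight] = sum(recalls) / len(recalls)
    return results


# Toy case: the relevant document is only found by keyword search.
toy_eval = [(["s1", "s2", "s3"], ["target", "k2", "k3"], {"target"})]
print(sweep_weights(toy_eval, weights=[0.2, 0.8], top_k=1))
```

In this toy case a keyword-heavy weighting surfaces the relevant document first and a semantic-heavy one does not; on a real corpus the sweep would run over the full evaluation set.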

Chunking Revisited: Size Matters

The previous post covered embedding strategies, but chunking decisions directly affect retrieval quality and deserve a second look during tuning.

Too-small chunks lose context. A chunk that says "the return period is 30 days" without specifying which product category the policy applies to is ambiguous. The LLM receives it, has no way to know which products it covers, and may apply it incorrectly.

Too-large chunks dilute relevance. A 2000-token chunk that mentions the refund policy in one paragraph but spends the rest discussing shipping procedures will have an embedding that blends all of its topics. A query about refund policy may not surface this chunk because the embedding is dominated by the shipping content.

The tuning approach: try multiple chunk sizes (256, 512, 1024 tokens) with overlap (64-128 tokens) and measure retrieval quality on your evaluation set. The optimal size depends on your content structure: dense policy documents tend to benefit from smaller chunks, while narrative documentation tends to benefit from larger ones.
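A sliding-window chunker makes it easy to generate those variants. The sketch below assumes whitespace-separated words as a stand-in for real tokens; in practice you would count tokens with your embedding model's tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Stand-in document of 1000 "tokens" (an assumption for illustration).
document_tokens = ("word " * 1000).split()

# Candidate (chunk_size, overlap) settings to compare on the evaluation set.
for chunk_size, overlap in [(256, 64), (512, 64), (1024, 128)]:
    chunks = chunk_tokens(document_tokens, chunk_size, overlap)
    print(chunk_size, overlap, len(chunks))
```

Each configuration would then be embedded, indexed, and scored with the same recall metrics, so the comparison is driven by measurements rather than intuition.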

Parent-child chunking offers a hybrid: embed small chunks for precise retrieval, but return the larger parent chunk to the LLM for generation. The small chunk gets retrieved because it is semantically precise. The parent chunk gives the LLM enough context to generate a complete answer.
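The bookkeeping behind parent-child chunking is a mapping from each small chunk back to its parent. Here is a minimal sketch, assuming word-based splitting and hypothetical names (`ChildChunk`, `build_children`, `expand_to_parents`); the search step itself is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChildChunk:
    child_id: str
    parent_id: str
    text: str


def build_children(parents: Dict[str, str], child_size: int) -> List[ChildChunk]:
    """Split every parent into small child chunks that remember their parent."""
    children: List[ChildChunk] = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            child_id = f"{parent_id}#{start // child_size}"
            children.append(ChildChunk(child_id, parent_id, " ".join(words[start:start + child_size])))
    return children


def expand_to_parents(hits: List[ChildChunk], parents: Dict[str, str]) -> List[str]:
    """Replace retrieved children with their parents, keeping order, no duplicates."""
    seen = set()
    context: List[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context


parent_chunks = {
    "returns": "Electronics may be returned within 30 days of delivery with a receipt",
    "shipping": "Orders ship within two business days",
}
child_chunks = build_children(parent_chunks, child_size=4)

# Pretend the index matched the child chunk that mentions "30 days".
matched = [c for c in child_chunks if "30" in c.text]
print(expand_to_parents(matched, parent_chunks))
```

Only the child chunks would be embedded and indexed; the parent text is looked up after retrieval and passed to the LLM.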

Reranking: The Quality Gate

Initial retrieval casts a wide net — fetch the top 50 or 100 candidates. Reranking then applies a more sophisticated model to reorder those candidates by relevance before passing the top results to the LLM.

Cross-encoder rerankers are typically much more accurate than the bi-encoder models used for initial retrieval. A bi-encoder encodes the query and document separately, then compares their embeddings. A cross-encoder processes the query and document together, allowing it to capture fine-grained interactions between query terms and document content. This is too expensive for searching millions of documents but practical for reranking 50-100 candidates.

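The reranking step typically improves recall@5 by 10-20% over retrieval alone. Reranking only reorders the candidates, so it cannot recover a document the initial retrieval missed; the gain comes from moving relevant documents into the few positions that are actually passed on. That improvement translates directly to better RAG responses, because the LLM receives more relevant context.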

The practical implementation: retrieve 50-100 candidates using fast semantic and keyword search, rerank with a cross-encoder, and pass the top 5-10 to the LLM. The reranking step typically adds 100-300ms of latency, depending on model size and hardware, which is a worthwhile trade for significantly better retrieval quality.
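The shape of that pipeline can be sketched with a pluggable scoring function. Everything here is illustrative: `toy_overlap_score` is a word-overlap stand-in, not a cross-encoder, and in a real system `score_pair` would wrap a cross-encoder model that scores the query and document together.

```python
from typing import Callable, List, Sequence

# Takes (query, document) and returns a relevance score; higher is better.
ScoreFn = Callable[[str, str], float]


def rerank(query: str, candidates: Sequence[str], score_pair: ScoreFn, top_n: int) -> List[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query: str, document: str) -> float:
    """Stand-in scorer: fraction of query words that occur in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Candidates as they might come back from the first-stage retrieval.
first_stage = [
    "2019 marketing strategy overview",
    "shipping procedures for large items",
    "refund policy for 2024 orders",
]
top_docs = rerank("refund policy 2024", first_stage, toy_overlap_score, top_n=1)
print(top_docs)
```

Because the scorer is a plain function argument, the first-stage retrieval and the reranking model can be swapped independently while the evaluation harness stays the same.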

Query Understanding

Sometimes the problem is not the retrieval pipeline — it is the query. User queries are ambiguous, incomplete, or expressed in vocabulary that does not match the document corpus.

Query expansion reformulates the user's query to improve retrieval coverage. A simple approach: use the LLM to generate 2-3 alternative phrasings of the query and run retrieval on all of them. A query about "how to cancel my subscription" might expand to include "subscription cancellation process," "unsubscribe from service," and "end membership." Each reformulation captures different vocabulary that might match different documents.
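A minimal sketch of multi-query retrieval follows. The `paraphrase` and `retrieve` callables are assumptions: here they are stubbed with canned phrasings and a toy lookup table, where a real system would call the LLM and the search index. The per-query results are merged with reciprocal rank fusion, which is one reasonable choice among several.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def expanded_retrieve(query: str,
                      paraphrase: Callable[[str], List[str]],
                      retrieve: Callable[[str], List[str]],
                      top_k: int, k: int = 60) -> List[str]:
    """Retrieve with the original query plus its paraphrases, then fuse by rank."""
    queries = [query] + paraphrase(query)
    scores: Dict[str, float] = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_k]


def fake_paraphrase(query: str) -> List[str]:
    """Stub for the LLM call that would generate alternative phrasings."""
    return ["subscription cancellation process", "unsubscribe from service", "end membership"]


# Toy index: each phrasing matches different hypothetical documents.
TOY_INDEX = {
    "how to cancel my subscription": ["faq-billing"],
    "subscription cancellation process": ["doc-cancel", "faq-billing"],
    "unsubscribe from service": ["doc-unsubscribe"],
    "end membership": ["doc-cancel"],
}


def fake_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


expanded = expanded_retrieve("how to cancel my subscription", fake_paraphrase, fake_retrieve, top_k=5)
print(expanded)
```

In the toy data the original phrasing alone finds one document, while the expanded set reaches three, which is the coverage gain the technique is after.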

Hypothetical Document Embeddings (HyDE) takes this further. Instead of searching with the query, ask the LLM to generate a hypothetical answer to the query, then use that hypothetical answer as the search input. Even if its details are wrong, the hypothetical answer is often closer in embedding space to the actual documents than the short query is, which tends to produce better retrieval results.
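The flow can be sketched end to end with stand-ins. The assumptions are loud here: `embed` is a toy bag-of-words vector rather than a real embedding model, and `fake_llm_answer` is a canned string in place of an LLM call.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query: str, generate_answer: Callable[[str], str],
                docs: Dict[str, str], top_k: int) -> List[str]:
    """Search with the embedding of a hypothetical answer instead of the query."""
    hypothetical = generate_answer(query)
    search_vector = embed(hypothetical)
    ranked = sorted(docs, key=lambda doc_id: cosine(search_vector, embed(docs[doc_id])), reverse=True)
    return ranked[:top_k]


def fake_llm_answer(query: str) -> str:
    """Stub for the LLM call; the claimed facts may be wrong, the vocabulary is what matters."""
    return "Items can be returned within 30 days for a refund"


DOCS = {
    "refunds": "Products may be returned within 30 days for a full refund",
    "shipping": "Orders ship within two business days",
}
user_query = "how long is the return window"

print(cosine(embed(user_query), embed(DOCS["refunds"])))  # no shared words with the raw query
print(hyde_search(user_query, fake_llm_answer, DOCS, top_k=1))
```

With the toy embedding the raw query shares no words with the refund document, while the hypothetical answer overlaps heavily with it; a real embedding model would show a softer version of the same effect.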

Query decomposition handles complex, multi-part queries by breaking them into sub-queries. "Compare the return policies for electronics and clothing" becomes two retrieval queries: one for electronics return policy and one for clothing return policy. The results from both sub-queries are combined and passed to the LLM for synthesis.
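A sketch of that flow, with hypothetical stubs in place of the LLM-based decomposition and the search index, might look like this. Results are kept grouped by sub-query so the generation prompt can show which passage supports which side of the comparison.

```python
from typing import Callable, Dict, List


def decomposed_retrieve(query: str,
                        decompose: Callable[[str], List[str]],
                        retrieve: Callable[[str], List[str]],
                        top_k: int) -> Dict[str, List[str]]:
    """Run one retrieval per sub-query and keep the results grouped."""
    return {sub_query: retrieve(sub_query)[:top_k] for sub_query in decompose(query)}


def build_context(results: Dict[str, List[str]]) -> str:
    """Format grouped results into a single context string for the LLM."""
    sections = []
    for sub_query, passages in results.items():
        sections.append(f"[{sub_query}]\n" + "\n".join(passages))
    return "\n\n".join(sections)


def fake_decompose(query: str) -> List[str]:
    """Stub for the LLM call that would split the query into sub-queries."""
    return ["electronics return policy", "clothing return policy"]


# Toy passages, invented for illustration only.
TOY_PASSAGES = {
    "electronics return policy": ["Electronics: hypothetical passage A"],
    "clothing return policy": ["Clothing: hypothetical passage B"],
}


def fake_search(sub_query: str) -> List[str]:
    return TOY_PASSAGES.get(sub_query, [])


grouped = decomposed_retrieve(
    "Compare the return policies for electronics and clothing",
    fake_decompose, fake_search, top_k=3,
)
print(build_context(grouped))
```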

Iterative Tuning Workflow

Retrieval tuning is not a one-time optimization. It is an iterative process driven by the evaluation dataset.

Baseline. Measure current retrieval performance on the evaluation set. Record recall@5, recall@10, MRR, and precision@5.

Identify failure cases. Examine the queries where retrieval fails — relevant documents do not appear in the top results. Categorize the failure modes: vocabulary mismatch (keyword problem), concept mismatch (embedding problem), chunking issue (relevant content split across chunks), or missing content (the answer is not in the corpus).

Apply targeted fixes. For vocabulary mismatches, add keyword search or boost keyword weights. For concept mismatches, try a different embedding model or add query expansion. For chunking issues, adjust chunk sizes or add parent-child chunking. For missing content, identify and add the missing source documents.

Measure again. After each change, re-run the evaluation. Verify that the targeted fix improved the failure cases without degrading overall performance. Accept changes that improve recall without significant precision loss. Revert changes that do not demonstrate improvement.
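The loop above can be captured in a small harness. This is a sketch under assumptions: the retrievers are stubbed with lookup tables, the document IDs are hypothetical, and the 0.05 precision tolerance in `should_accept` is an arbitrary illustrative threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class EvalReport:
    recall: float
    precision: float
    failures: List[str] = field(default_factory=list)  # queries missing a relevant doc


def evaluate(retriever: Callable[[str], List[str]],
             eval_set: Dict[str, Set[str]], k: int) -> EvalReport:
    """Mean recall@k and precision@k, plus the queries to inspect by hand."""
    recalls, precisions, failures = [], [], []
    for query, relevant in eval_set.items():
        hits = len(set(retriever(query)[:k]) & relevant)
        recalls.append(hits / len(relevant))
        precisions.append(hits / k)
        if hits < len(relevant):
            failures.append(query)
    n = len(eval_set)
    return EvalReport(sum(recalls) / n, sum(precisions) / n, failures)


def should_accept(baseline: EvalReport, candidate: EvalReport,
                  max_precision_drop: float = 0.05) -> bool:
    """Accept a change that raises recall without a large precision loss."""
    precision_drop = baseline.precision - candidate.precision
    return candidate.recall > baseline.recall and precision_drop <= max_precision_drop


eval_queries = {"refund policy 2024": {"d4", "d17"}, "cancel subscription": {"d23"}}

baseline_runs = {"refund policy 2024": ["d4", "d8", "d31"], "cancel subscription": ["d8", "d2", "d5"]}
candidate_runs = {"refund policy 2024": ["d4", "d17", "d8"], "cancel subscription": ["d23", "d2", "d5"]}

baseline_report = evaluate(lambda q: baseline_runs[q], eval_queries, k=3)
candidate_report = evaluate(lambda q: candidate_runs[q], eval_queries, k=3)

print(baseline_report.failures)   # queries to categorize by failure mode
print(should_accept(baseline_report, candidate_report))
```

The `failures` list is the input to the manual step: each query in it gets read and assigned a failure mode before the next fix is chosen.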

The Takeaway

Retrieval tuning is the highest-leverage work in a RAG system. Hybrid search covers both semantic and keyword matching. Reranking adds a quality gate that significantly improves the results sent to the LLM. Query understanding handles the gap between how users express questions and how documents express answers. And an evaluation dataset with recall metrics provides the feedback loop that makes systematic improvement possible.

The LLM can only answer with what it is given. Give it the right documents and mediocre prompts will produce good answers. Give it the wrong documents and perfect prompts will produce hallucinations. Invest in retrieval.

Next in the "Building RAG Systems" learning path: We'll cover RAG evaluation and testing — systematic approaches to measuring end-to-end RAG quality from retrieval through generation.

ShiftQuality