Embedding Strategies That Make or Break Retrieval

Feb 20
Updated: Jun 22

The previous post in this path covered RAG architecture — the retrieve-augment-generate loop and the components that make it work. This post goes deeper into the retrieval side, because retrieval is where RAG systems are won or lost.

A perfect LLM generating from irrelevant documents produces wrong answers. An adequate LLM generating from the right documents produces useful ones. The embedding strategy — how you convert your documents into vectors that a search system can match — is the single biggest determinant of whether the right documents are found.

What Embeddings Actually Do

An embedding model converts text into a high-dimensional numerical vector — a list of hundreds or thousands of numbers that captures the semantic meaning of the text. Texts with similar meanings produce vectors that are close together in this high-dimensional space. Texts with different meanings produce vectors that are far apart.

When a user asks a question, the question is embedded using the same model, and the resulting vector is compared against the stored document vectors, usually by cosine similarity or a related distance measure. The closest matches are the most semantically similar, and ideally the most relevant.
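To make the comparison step concrete, here is a minimal sketch of similarity search using cosine similarity. The three-dimensional vectors, the document IDs, and the `top_k` helper are all invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and production systems usually use an approximate index rather than scoring every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real model output (assumption for illustration).
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "how do I get my money back?"

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The query vector points in nearly the same direction as the "refund-policy" vector, so that document ranks first even though the two share no keywords in this toy setup.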

This works remarkably well for a broad category of queries. It also fails in predictable ways: on domain-specific vocabulary, on poorly sized chunks, and on queries that hinge on metadata such as version or date. Understanding those failure modes is the key to building retrieval that works in practice.

Choosing an Embedding Model

Not all embedding models are created equal, and the differences matter more than most teams expect.

Domain matters. A general-purpose embedding model trained on internet text captures broad semantic relationships. But "interest rate" means something different in finance than in psychology. "Transformer" means something different in ML than in electrical engineering. If your documents are domain-specific, evaluate embedding models on your actual queries, not on general benchmarks.

Dimensionality is a tradeoff. Higher-dimensional embeddings can capture more nuance. They also require more storage, more memory, and more compute for similarity search. A 1536-dimensional embedding has room for more semantic detail than a 384-dimensional one and takes four times the space per vector, but for many use cases the smaller embedding retrieves effectively and runs faster. Benchmark on your data before defaulting to the largest model available.
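A quick back-of-the-envelope calculation shows the storage side of this tradeoff. The sketch below assumes vectors are stored as 32-bit floats and uses a hypothetical corpus of one million chunks; it counts raw vector bytes only and ignores index overhead and metadata.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def index_size_gb(num_chunks, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in gigabytes (10**9 bytes), excluding index overhead."""
    return num_chunks * dimensions * bytes_per_value / 1e9

num_chunks = 1_000_000  # hypothetical corpus size
for dims in (384, 1536):
    print(f"{dims} dims: {index_size_gb(num_chunks, dims):.2f} GB")
# 384 dims: 1.54 GB
# 1536 dims: 6.14 GB
```

The 4x difference in dimensions carries straight through to storage and memory, which is why it is worth checking whether the smaller model already retrieves well enough on your queries.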

Asymmetric models often outperform symmetric ones for question-answering. Symmetric embedding models treat the query and the document identically. Asymmetric models use different encoding strategies for queries (short, question-form) versus documents (long, declarative). For RAG, where queries are questions and documents are passages, asymmetric models typically retrieve more relevant results.
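One common way models expose this asymmetry is through input prefixes: the same model encodes text differently depending on a short marker at the start. The sketch below uses prefixes in the style of the E5 model family as an example; the helper names are hypothetical, and other models use different markers, separate encoders, or nothing at all, so check the documentation for whichever model you choose.

```python
# Prefixes in the style of the E5 model family. Treat these strings as an
# assumption: other models use different conventions or none at all.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def prepare_query(text):
    """Mark text as a search query before it is sent to the embedding model."""
    return QUERY_PREFIX + text.strip()

def prepare_passage(text):
    """Mark text as a document passage before it is sent to the embedding model."""
    return PASSAGE_PREFIX + text.strip()

print(prepare_query("How do I reset my password?"))
print(prepare_passage("Passwords can be reset from the account settings page."))
```

The important part is consistency: passages must be prepared the same way at indexing time as the model expects, and queries the same way at search time.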

The practical advice: start with a well-regarded general-purpose model, evaluate on a sample of your actual queries, and switch to a domain-specific or asymmetric model only if the general-purpose model underperforms. Embedding model selection is an optimization, not a prerequisite. Good chunking with an adequate model usually outperforms bad chunking with the best model.

Chunking: The Most Underrated Decision

Before documents are embedded, they must be split into chunks. The chunk is the unit of retrieval — when a query matches, the system returns chunks, not whole documents. Chunk size determines the granularity and quality of retrieval.

Too large (entire documents or long sections): the embedding captures the average meaning of the entire text. A document about both contract negotiation and contract termination produces an embedding that is moderately similar to queries about either topic but highly similar to neither. The retrieved chunk contains the answer buried in paragraphs of irrelevant context, and the LLM must extract the signal from the noise.

Too small (individual sentences): the embedding captures precise meaning but loses context. A sentence like "The policy was updated in March" is meaningless without knowing which policy. The chunk matches the query semantically but fails to provide enough context for the LLM to generate a useful answer.

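The sweet spot is typically 200-500 tokens, with overlap between adjacent chunks. The overlap makes it likely that information spanning a chunk boundary appears intact in at least one chunk, as long as that information is shorter than the overlap. With 400-token chunks and a 50-token overlap, each new chunk starts 350 tokens after the previous one, so a chunk shares its first 50 tokens with the chunk before it and its last 50 tokens with the chunk after it.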
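The sliding-window idea can be sketched in a few lines. This version assumes the text has already been tokenized into a list; the `chunk_tokens` name and the synthetic token list are illustrative, and in practice you would count tokens with the tokenizer that matches your embedding model.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size chunks that overlap their neighbors."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Synthetic 1000-token document (stand-in for real tokenizer output).
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)

print([len(c) for c in chunks])          # [400, 400, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True
```

The final chunk is shorter than the rest. Whether to keep it, pad it, or merge it into the previous chunk is a design choice worth testing against your evaluation set.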

But the sweet spot varies by document type. Dense technical documentation benefits from smaller chunks — each paragraph is self-contained. Narrative documents benefit from larger chunks — context builds across paragraphs. Legal documents benefit from section-level chunking — each clause is a logical unit.

The practical recommendation: start with a reasonable default (400 tokens, 50 overlap), evaluate on your actual queries, and adjust per document type if evaluation shows retrieval gaps.

Metadata: The Retrieval Multiplier

Embeddings capture semantic meaning. They do not capture metadata: the document title, the section heading, the date published, the author, the document type, the product version. This metadata is often the difference between relevant and irrelevant retrieval.

A query about "authentication timeout configuration" should prioritize documentation for the current product version. A query about "Q3 revenue" should prioritize financial reports from Q3. Without metadata filtering, the vector search returns the most semantically similar chunks regardless of version, date, or context — and the most similar chunk from version 2.0 documentation might be wrong for a user running version 4.0.

The implementation: store metadata alongside the embedding in your vector database. At query time, apply metadata filters before or alongside the vector similarity search. "Find the most similar chunks where document_type='configuration' AND product_version='4.0'" produces far more precise results than unfiltered similarity search, because chunks from the wrong version or document type never enter the ranking.
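Here is a minimal in-memory sketch of filtering before ranking. The chunk records, two-dimensional vectors, and `filtered_search` helper are invented for illustration; a real vector database would express the same idea through its own filter syntax.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query_vec, chunks, filters, k=3):
    """Keep chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates,
                    key=lambda c: cosine_similarity(query_vec, c["vector"]),
                    reverse=True)
    return ranked[:k]

# Hypothetical chunk records with toy vectors.
chunks = [
    {"id": "auth-timeout-v2", "vector": [0.9, 0.1],
     "metadata": {"document_type": "configuration", "product_version": "2.0"}},
    {"id": "auth-timeout-v4", "vector": [0.8, 0.2],
     "metadata": {"document_type": "configuration", "product_version": "4.0"}},
    {"id": "release-notes-v4", "vector": [0.2, 0.8],
     "metadata": {"document_type": "release_notes", "product_version": "4.0"}},
]
query_vec = [0.9, 0.1]  # pretend embedding of "authentication timeout configuration"

unfiltered = filtered_search(query_vec, chunks, {})
hits = filtered_search(
    query_vec, chunks,
    {"document_type": "configuration", "product_version": "4.0"},
)
print([c["id"] for c in unfiltered])
print([c["id"] for c in hits])
```

Without filters, the version 2.0 chunk ranks first because its vector is closest to the query. With filters, only the version 4.0 configuration chunk is eligible.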

Metadata is cheap to store and enormously valuable for precision. Every chunk should carry at minimum: source document, section heading, document date, and any domain-specific attributes that queries might need to filter on.

Evaluating Retrieval Quality

You cannot improve what you do not measure. Retrieval evaluation requires a test set: a collection of queries paired with the documents that should be retrieved for each query.

Building this test set is manual work, and it is the most valuable investment you can make in retrieval quality. Sample real user queries. For each query, identify the documents that contain the answer. This is your ground truth.

Measure retrieval with standard information retrieval metrics. Recall@k: of the relevant documents, what fraction appear in the top k results? Precision@k: of the top k results, what fraction are relevant? Mean Reciprocal Rank (MRR): the reciprocal of the rank of the first relevant document, averaged over all queries, so a first-place hit scores 1 and a third-place hit scores 1/3.
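These three metrics are short enough to implement directly. The sketch below assumes a test set of (retrieved IDs in rank order, set of relevant IDs) pairs; the function names, document IDs, and data are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant (k must be positive)."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(test_set):
    """Average reciprocal rank over all queries in the test set."""
    return sum(reciprocal_rank(r, rel) for r, rel in test_set) / len(test_set)

# Hypothetical test set: (retrieved IDs in rank order, ground-truth relevant IDs).
test_set = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d9", "d4"], {"d5"}),        # first relevant hit at rank 1
]

for retrieved, relevant in test_set:
    print(recall_at_k(retrieved, relevant, 3), precision_at_k(retrieved, relevant, 3))
print(mean_reciprocal_rank(test_set))  # 0.75
```

In the first query only one of the two relevant documents is retrieved, so recall@3 is 0.5 even though the ranking looks reasonable. That is the kind of gap these metrics are meant to expose.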

Run these metrics whenever you change the embedding model, the chunking strategy, or the metadata schema. A change that improves precision but destroys recall — or vice versa — is visible in the metrics before it reaches users.

Without evaluation, embedding and chunking decisions are guesses. With evaluation, they are experiments with measurable outcomes.

The Takeaway

Retrieval quality is determined by three decisions: embedding model, chunking strategy, and metadata enrichment. Each decision has a measurable impact, and each should be evaluated against your specific queries and documents — not against benchmarks or best-practice blog posts.

Start with reasonable defaults. Build an evaluation set from real queries. Measure. Adjust. Measure again. This loop — not a one-time architectural decision — is how retrieval systems improve from demo quality to production quality.

The LLM can only generate from what retrieval gives it. Make sure retrieval gives it the right thing.

Next in the "Building RAG Systems" learning path: We'll cover production RAG infrastructure — vector database selection, indexing strategies, and the operational concerns that determine whether your RAG system scales.

ShiftQuality