RAG Architecture: When Your LLM Needs Real Data
- ShiftQuality Contributor
- Nov 13, 2025
Large language models know a lot about the world in general. They know very little about your company's internal documentation, your product's release notes from last Tuesday, or the compliance policies that changed this quarter. Ask an LLM a question about your specific domain and it will either hallucinate a confident-sounding wrong answer or tell you it doesn't have that information.
Retrieval-Augmented Generation — RAG — solves this by giving the LLM real documents to reference before it generates an answer. Instead of relying on whatever the model absorbed during training, you retrieve the relevant context from your own data and inject it into the prompt. The model generates from your information, not its memory.
The concept is simple. The architecture that makes it work reliably at scale is not. This post covers the components of a production RAG system, where each one breaks, and the design decisions that separate a demo from a deployable system.
The Basic RAG Loop
Every RAG system follows the same three-step cycle:
Retrieve. Given a user's question, find the most relevant documents or passages from your knowledge base. This is a search problem.
Augment. Take the retrieved documents and insert them into the prompt alongside the user's question. This gives the LLM the context it needs to answer accurately.
Generate. The LLM produces an answer grounded in the retrieved documents rather than its general training knowledge.
In a demo, this takes thirty lines of code. Embed the documents, store them in a vector database, embed the query, find the nearest neighbors, stuff them into a prompt, and call the LLM. It works surprisingly well for simple cases. It falls apart for anything more complex, and the reasons are instructive.
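To make the loop concrete, here is a minimal sketch of the demo version. It is not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, and `call_llm` is a hypothetical placeholder for whatever model client you use.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Retrieve: rank every document by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Augment: number the passages and place them ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def call_llm(prompt):
    """Generate: hypothetical placeholder; swap in a real model client here."""
    return f"(model output for a {len(prompt)}-character prompt)"

DOCS = [
    "Refunds are issued within 14 days of a valid return.",
    "The office is closed on public holidays.",
    "A valid return requires the original receipt.",
]

question = "How long do refunds take?"
passages = retrieve(question, DOCS, k=2)
prompt = build_prompt(question, passages)
answer = call_llm(prompt)
```

Even in this toy, the second retrieved passage is chosen from a tie at zero similarity, which hints at the retrieval problems discussed next.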
The Retrieval Problem
Retrieval is where most RAG systems fail, and it is not because vector search is broken. It is because the retrieval step has to answer a surprisingly hard question: given this query, which documents are actually relevant?
Chunking matters more than you think. Before documents are embedded and stored, they need to be split into chunks. Too large and the chunks contain irrelevant context that dilutes the useful signal. Too small and the chunks lose the context needed to make sense. A paragraph about a refund policy is useless if the preceding paragraph — which defines what counts as a valid return — was split into a different chunk.
There is no universal right answer for chunk size. It depends on your documents, your queries, and your domain. The teams that get retrieval right treat chunking as an engineering problem that requires iteration, not a parameter to set once and forget.
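One common starting point is to split on sentence boundaries and repeat a little text between neighboring chunks, so a definition and the rule that depends on it are more likely to land together. The sketch below is one such approach under simple assumptions: the regex sentence splitter is naive, and `max_chars` and `overlap` are illustrative parameters you would tune against your own documents.

```python
import re

def chunk_sentences(text, max_chars=200, overlap=1):
    """Pack whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one. A chunk can exceed max_chars if a single sentence is long.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

policy = (
    "A valid return is an unopened item sent back within 30 days. "
    "Refunds are issued for valid returns only. "
    "Refunds go to the original payment method."
)
chunks = chunk_sentences(policy, max_chars=110, overlap=1)
```

With these settings the middle sentence appears in both chunks, which is the point of the overlap: the second chunk still carries the sentence that ties refunds to valid returns.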
Semantic search is not keyword search. Vector embeddings capture semantic similarity, which is powerful but imprecise. A query about "how to handle employee termination" might retrieve documents about "handling contract termination" because the embedding space puts those concepts close together. They are semantically similar. They are not the same thing.
Hybrid search, which combines vector similarity with keyword matching, addresses this. The semantic component finds conceptually related documents. The keyword component ensures that documents containing the exact terms the user asked about are not missed. In practice, hybrid retrieval often outperforms either approach alone, particularly in enterprise settings where exact terms such as product names, policy numbers, and error codes matter.
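One widely used way to combine the two result lists is reciprocal rank fusion (RRF), which merges rankings without needing the two scoring scales to be comparable. This is a sketch of that technique, not something the approach above requires: the document IDs are made up for illustration, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results for the query "how to handle employee termination".
vector_ranking = ["contract-termination", "employee-termination", "offboarding-checklist"]
keyword_ranking = ["employee-termination", "severance-policy", "contract-termination"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Here the vector search ranks the contract document first, but the employee document ranks well in both lists, so it comes out on top after fusion.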
Re-ranking improves precision. The initial retrieval casts a wide net. A re-ranker, a separate model that scores the retrieved documents against the original query, filters the results down to the ones most likely to contain the answer. This adds latency but can noticeably improve answer quality, especially when the initial retrieval returns documents that are related but not directly relevant.
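The two-stage shape can be sketched as a function that takes the wide candidate set and a pluggable scorer. In a real system the scorer would be a re-ranking model; the `overlap_score` function here is only a toy lexical stand-in so the example runs on its own, and all names are illustrative.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, passage):
    """Toy relevance score: fraction of query terms that appear in the passage."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0

def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Score each candidate against the query and keep the top_n best."""
    scored = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    return scored[:top_n]

# Hypothetical output of a wide first-stage retrieval.
candidates = [
    "Contract termination requires 30 days written notice.",
    "Employee termination must be reviewed by HR before notice is given.",
    "Employees accrue vacation monthly.",
]
top = rerank("employee termination review", candidates, top_n=1)
```

Keeping the scorer as a parameter means the first-stage retriever and the re-ranker can be evaluated and swapped independently.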
The Context Window Problem
LLMs have a maximum context window — the total number of tokens they can process in a single request. Your retrieved documents, the system prompt, and the user's question all need to fit within that window.
This creates a tension. More context usually means better answers. But more context also means higher latency, higher cost, and eventually hitting the window limit. And there is a subtler problem: LLMs do not attend to all parts of the context equally. Information in the middle of a long context is often given less weight than information at the beginning or end.
This means you cannot just retrieve twenty documents and dump them all into the prompt. You need to curate. The re-ranking step helps. So does summarization — condensing long documents into shorter representations that preserve the essential information while reducing token count.
The design decision is not "how much context can I fit?" It is "what is the minimum context the model needs to answer this question accurately?" Less is often more, as long as the less is the right less.
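A simple way to enforce that discipline is an explicit token budget for retrieved context. The sketch below greedily packs already-ranked passages under a budget. It assumes a crude word-count estimate in place of a real tokenizer, and `budget_tokens` is an illustrative parameter, not a recommended value.

```python
def estimate_tokens(text):
    """Rough estimate: whitespace-separated word count. A real tokenizer will differ."""
    return len(text.split())

def pack_context(ranked_passages, budget_tokens):
    """Take passages in ranked order, skipping any that would exceed the budget."""
    selected, used = [], 0
    for passage in ranked_passages:
        cost = estimate_tokens(passage)
        if used + cost > budget_tokens:
            continue  # too big; a shorter, lower-ranked passage may still fit
        selected.append(passage)
        used += cost
    return selected, used

ranked = [
    "Refunds are issued within 14 days.",                                   # 6 words
    "Shipping costs vary by destination country and by the carrier used.",  # 11 words
    "Returns need a receipt.",                                              # 4 words
]
selected, used = pack_context(ranked, budget_tokens=12)
```

The second passage is skipped because it would overflow the budget, while the shorter third passage still fits.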
Grounding and Citation
A RAG system that generates an answer without indicating where the answer came from is only marginally better than a hallucinating LLM. The user still has to trust the output on faith.
Grounding means structuring your system so the LLM cites its sources. This is a prompt engineering challenge: you instruct the model to reference the specific documents it used, ideally with enough specificity that the user can verify the claim. "According to the Q3 2025 policy update..." is verifiable. An unsourced assertion is not.
Some teams go further and implement citation verification — programmatically checking that the claims in the generated answer actually appear in the retrieved documents. This catches the cases where the model synthesizes something plausible but not actually supported by the context. It adds complexity, but for high-stakes domains — legal, medical, compliance — it is worth the investment.
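As a rough illustration of what such a check might look like, the sketch below assumes the model was instructed to cite passages with bracketed numbers such as `[1]`. It flags any sentence that cites nothing, or whose content words mostly do not appear in the passages it cites. The lexical overlap test and the `min_overlap` threshold are simplifying assumptions; a real verifier would need something stronger, such as an entailment model.

```python
import re

def content_words(text):
    """Lowercased words longer than three characters, as a crude content filter."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def verify_citations(answer, passages, min_overlap=0.5):
    """Return (sentence, ok) pairs for each sentence in the answer.

    A sentence passes if it cites at least one existing passage [n] and at
    least min_overlap of its content words appear in the cited passages.
    """
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        valid = [n for n in cited if 1 <= n <= len(passages)]
        claim = content_words(re.sub(r"\[\d+\]", "", sentence))
        support = set()
        for n in valid:
            support |= content_words(passages[n - 1])
        ratio = len(claim & support) / len(claim) if claim else 0.0
        results.append((sentence, bool(valid) and ratio >= min_overlap))
    return results

passages = [
    "Refunds are issued within 14 days of a valid return.",
    "Shipping costs depend on the destination country.",
]
answer = "Refunds are issued within 14 days [1]. Shipping is always free worldwide [2]."
results = verify_citations(answer, passages)
```

The first sentence is supported by passage 1, while the second cites passage 2 but asserts something that passage does not say, so it is flagged.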
When RAG Is Not the Right Pattern
RAG is powerful. It is not universal. Knowing when not to use it is as important as knowing how.
RAG is the wrong pattern when the answer requires reasoning across your entire knowledge base, not just a few retrieved documents. "Summarize all customer complaints from the last quarter" requires processing hundreds or thousands of documents. Retrieval gives you a handful.
RAG is the wrong pattern when the data changes faster than your indexing pipeline can keep up. If documents are updated hourly but your embeddings are regenerated daily, the system returns stale results. For rapidly changing data, direct database queries or live API calls are more appropriate than pre-embedded document retrieval.
RAG is the wrong pattern when the question is not about information retrieval at all. "Write me a marketing email" does not need retrieved context. "Calculate the tax on this order" needs a function call, not a document search. Overusing RAG — treating it as a universal adapter between users and data — leads to systems that are slow, expensive, and less accurate than simpler approaches.
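The stale-index case is the easiest of these to detect mechanically. A minimal sketch, assuming each document record carries hypothetical `updated_at` and `indexed_at` timestamps, compares the two to find documents whose embeddings lag behind the source:

```python
from datetime import datetime

def stale_ids(docs):
    """Return IDs of documents whose source changed after they were last indexed."""
    return [d["id"] for d in docs if d["updated_at"] > d["indexed_at"]]

# Hypothetical records: a nightly index run at 02:00 and one edit later that morning.
docs = [
    {"id": "pricing", "updated_at": datetime(2025, 11, 13, 9, 0), "indexed_at": datetime(2025, 11, 13, 2, 0)},
    {"id": "handbook", "updated_at": datetime(2025, 10, 1, 12, 0), "indexed_at": datetime(2025, 11, 13, 2, 0)},
]
stale = stale_ids(docs)
```

If this list is routinely non-empty for the documents users ask about, that is a signal to re-index more often or to query the source system directly.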
The Takeaway
RAG is an architecture pattern, not a product you install. Building a production RAG system means making deliberate decisions about chunking strategy, retrieval method, re-ranking, context window management, and grounding. Each decision affects accuracy, latency, and cost.
The retrieval step is where quality is won or lost. A perfect LLM generating from the wrong documents produces wrong answers. An adequate LLM generating from the right documents produces useful ones. Invest your engineering effort in retrieval first. The generation step is the easiest part to improve later.
Start with the simplest version that answers real questions from your users. Measure where it fails. Improve the specific component that is failing. This is not a system you design once. It is a system you iterate toward.
Next in the "Building RAG Systems" learning path: We'll cover embedding strategies — how the choice of embedding model, chunk size, and metadata affects retrieval quality, and how to evaluate whether your retrieval is actually finding the right documents.


