How to implement a RAG system: the seven-stage pipeline, end to end

To implement RAG (retrieval-augmented generation) you stand up two pipelines: an offline index that turns your documents into searchable vectors, and an online query path that retrieves, augments, and generates a grounded answer. This guide covers every stage of a RAG implementation, the decisions that move accuracy, how to evaluate retrieval and generation separately, and the seven production failure points to budget for before launch.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineeringPublished Mar 10, 2026Updated Mar 10, 202612 min read

Key takeaways

The short version

RAG couples a parametric LLM with an external knowledge store: at inference, relevant passages are retrieved from your corpus and supplied as context, so generation is grounded in evidence rather than weights. The architecture comes from Lewis et al., NeurIPS 2020.
Think of it as two pipelines. An offline index (ingest, chunk, embed, store) and an online query path (retrieve, augment, generate), with evaluation running across both. That is the standard LlamaIndex and LangChain framing.
The cleanest mental model for RAG versus fine-tuning: fine-tuning changes behavior, style, and format; RAG changes knowledge. Use RAG for proprietary or fast-changing facts and when citations matter; the two combine well in production.
The decisions that move accuracy are chunk size and overlap, embedding model, top-k, and reranking. A common starting point is roughly 400 to 512-token chunks with 10 to 20% overlap, retrieving 10 to 50 candidates and reranking to 3 to 5, all requiring tuning on your own data; none are fixed law.
Evaluate retrieval and generation separately: context precision and recall grade the retriever, faithfulness and answer relevancy grade the generator, e.g. with Ragas. And budget for the seven documented RAG failure points before they surface in production.

What a RAG system is, and when to use it

A RAG system couples a parametric large language model with an external, non-parametric knowledge store: at inference, relevant text passages are retrieved from your corpus and supplied to the model as context, so the answer is grounded in retrieved evidence instead of relying solely on the model's weights. The architecture comes from Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020), which paired a dense passage retriever with a sequence-to-sequence generator over a vector index and reported more specific, diverse, and factual output than a parametric-only baseline.¹

Teams reach for RAG because it injects up-to-date or proprietary knowledge without retraining, grounds answers in cited sources to reduce hallucination, lets one base model serve many domains by swapping the data source, and stays debuggable: you can inspect exactly which chunks were retrieved.² Enterprise adoption has accelerated fast: MarketsandMarkets valued the RAG market at USD 1.94 billion in 2025 and projects USD 9.86 billion by 2030 at a 38.4% CAGR, driven by demand for context-aware AI applications across every sector.¹³ The cleanest way to decide between RAG and fine-tuning is to remember that fine-tuning changes behavior, style, and format, while RAG changes knowledge.³ Use RAG when the answer depends on facts that are proprietary, large, or change often, and when provenance matters; use fine-tuning to shape tone, format, or reasoning style. They are not mutually exclusive, and production systems often do both. If you are building for an agentic use case, our RAG development team and the broader AI agents guide cover the permission-aware retrieval patterns most enterprise teams need.

How to implement RAG: the seven-stage pipeline

To implement RAG you split the work into two pipelines: an offline indexing pipeline that runs once per document change (ingest and chunk, embed, store) and an online query pipeline that runs on every request (retrieve, augment, generate), with evaluation cutting across both. That seven-stage decomposition is the standard framing in the LlamaIndex and LangChain documentation.⁴ The table below is the reference map: stage, which phase it belongs to, what happens, and the key decision each stage forces.

The seven-stage RAG pipeline

Stages one to three run offline, once per document change. Stages four to six run online, on every query. Evaluation is cross-cutting. Each stage forces one key decision, covered in the next section.

The RAG pipeline, stage by stage
Stage	Phase	What happens	Key decision
1. Ingest and chunk	Offline index	Load source documents, normalize, and split into retrievable units of meaning	Chunk size and overlap
2. Embed	Offline index	Convert each chunk to a dense vector with one consistent embedding model	Embedding model
3. Store	Offline index	Write vectors plus source metadata into a vector database or index	Vector store and index type
4. Retrieve	Online query	Embed the query, run approximate-nearest-neighbor search, optionally add keyword search and a reranker	top-k, hybrid search, reranking
5. Augment	Online query	Insert the retrieved chunks into a prompt template with the query and system instructions	Context assembly
6. Generate	Online query	The LLM answers grounded in the supplied context, ideally with inline citations	Citations, format constraints
7. Evaluate	Cross-cutting	Score retrieval and generation against a labeled set, then monitor in production	Faithfulness, relevancy, precision, recall

Source: pipeline decomposition per LlamaIndex and LangChain documentation. Architecture per Lewis et al., NeurIPS 2020.

A few stages deserve a closer read. Chunking matters because embedding a whole document averages too many concepts into one vector, so chunks should be coherent, retrievable units; fixed-size, recursive-character, sentence, and semantic splitters are all standard options.⁵ Storage is where source metadata earns its keep: doc id, title, URL, and timestamp drive filtering, citation, and freshness, and later they become the only thing that scopes a retrieval to the right tenant. Retrieval is often a two-stage affair, broad recall first and precise ranking second, which is the single highest-leverage upgrade for most systems. The full version of this pipeline, with the index and the permission-aware retrieval layer, is what our RAG development team builds.

The decisions that actually move accuracy

Four decisions move RAG accuracy more than model choice: chunk size and overlap, the embedding model, top-k, and whether you rerank. A common starting point is roughly 400 to 512-token chunks with 10 to 20% overlap, retrieving top-k of 10 to 50 candidates and reranking down to 3 to 5 for the final context. Treat those as starting points to tune on your own data, because the right values vary by document type and workload and rarely hold as fixed constants.

Chunk size and overlap. Smaller chunks give more granular, precise matches but risk losing context; larger chunks carry more context but dilute relevance and add retrieval noise. Pinecone's guidance is to start with fixed-size chunking and iterate only if recall is insufficient, testing a range such as 128 or 256 tokens for granular matching against 512 or 1024 for context.⁵ The widely repeated 400 to 512-token default with 10 to 20% overlap is community consensus, useful as a first pass and nothing more.

Embedding model. Choose by retrieval performance, language coverage, dimensionality, cost, and whether you self-host or call an API. The MTEB leaderboard is the standard benchmark, though it tests mostly single-language text retrieval and not long-document or cross-lingual cases, so read it with that caveat.⁶ Keep one embedding model consistent across the entire corpus, because vectors from different models are not comparable. Leaderboard rankings shift often, so verify current standings before committing to a fixed pick.

top-k and reranking. A reliable pattern is to retrieve 10 to 50 candidates with approximate-nearest-neighbor search, optionally fuse dense vector search with BM25 keyword search for hybrid retrieval, then apply a cross-encoder reranker to cut the candidate set down to roughly 3 to 5 for the prompt.⁷ A reranker jointly encodes the query and each chunk for higher precision, and it earns its place when the right documents keep landing in positions three to eight instead of one or two. Keep the candidate set modest, under about fifty, to hold latency down. The two-stage "retrieve broadly, rank precisely" approach improves answer accuracy in benchmarks; specific percentage gains are vendor and blog claims, so trust the direction, not a fixed number.

How to evaluate a RAG system

Evaluate retrieval and generation separately. Context precision and context recall grade the retriever, while faithfulness and answer relevancy grade the generator. Ragas is the standard reference-light framework for this. Build a small labeled evaluation set first, score against it offline, then add online monitoring once the system is live.

The four core metrics split cleanly across the pipeline.⁸ Faithfulness measures the fraction of claims in the answer that are actually supported by the retrieved context, which is your direct hallucination check. Answer relevancy measures how well the answer addresses the question asked. Context precision measures how much of the retrieved context was genuinely relevant, and context recall measures how much of the information the answer needed was successfully retrieved. The split is what makes RAG debuggable: a faithful but irrelevant answer is a generation problem, while a faithful answer that misses available facts is almost always a recall problem in the retriever. Diagnose the layer first, then fix the stage, instead of tuning the whole system blind.

Production concerns: cost, latency, freshness, and security

In production, four concerns dominate a RAG system: latency, cost, freshness, and security. Budget latency end to end across query encoding, vector search, reranking, and generation as a single measurement; cut cost by reranking to a tighter top-k so fewer tokens reach the model; re-ingest documents on change so answers stay fresh; and treat retrieved text as an untrusted attack surface, because indirect prompt injection and embedding poisoning are real risks.

Latency and cost. Vector queries are a meaningful share of RAG latency, spanning query encoding, network round trips, and the approximate-nearest-neighbor search itself, before generation even starts; exact numbers are workload-dependent, so budget the whole path end to end.⁹ Reranking a small candidate set and tightening top-k is the lever that does double duty here, cutting both latency drift and the token count sent to the model. Freshness is the quiet killer: stale documents produce confidently wrong answers, so run re-ingestion on document change, store timestamps, and expire or flag documents past an age threshold.

Security. The retrieval layer is a genuine attack surface. Because RAG often treats retrieved text as trusted, an attacker who plants instructions inside an ingested document can hijack generation through indirect prompt injection, the prompt-injection class OWASP ranks first in its Top 10 for LLM applications.¹¹ Embedding or data poisoning of the vector store can steer what gets retrieved in the first place.¹⁰ The defenses are concrete: allow-list sources, sanitize on ingestion, enforce access control and metadata-based tenant isolation at the retrieval step, and treat every retrieved chunk as untrusted input, never as instructions.

One last reality check is worth keeping on the wall. Barnett et al. catalog seven failure points in production RAG: content missing from the corpus, the right document missed by top-k ranking, a retrieved document dropped during context consolidation, an answer present in context but not extracted, the wrong output format, incorrect specificity, and an incomplete answer.¹² The fixes map straight back onto the pipeline: improve chunking and coverage for missing content, raise top-k and add reranking for missed documents, fix context assembly for dropped chunks, and tighten prompts and format constraints for the extraction and formatting failures. For the agentic variant, where retrieval becomes a tool the model decides when to call, see our companion guide on agentic RAG.

Frequently asked

How to implement a RAG system: common questions

What is a RAG system in simple terms?

A RAG system retrieves relevant documents from your own data and feeds them to a large language model, so answers are grounded in your sources instead of the model’s memory. The architecture comes from Lewis et al. (NeurIPS 2020), which paired a passage retriever with a generator over a vector index. The practical benefit is that you can add proprietary or fresh knowledge without retraining the model, and you can cite exactly which sources an answer came from.

What are the main steps to build a RAG system?

There are seven stages across two pipelines. The offline index covers ingest and chunk, embed, and store, which you run once per document change. The online query path covers retrieve, augment, and generate, which runs on every request, with retrieval optionally adding hybrid keyword search and a reranker. Evaluation cuts across both, scoring retrieval and generation separately. That decomposition follows the standard LlamaIndex and LangChain framing.

RAG or fine-tuning, which one should I use?

Use RAG to change knowledge and fine-tuning to change behavior. RAG fits when answers depend on proprietary, large, or fast-changing facts and when citations matter, because it injects that knowledge at inference without retraining. Fine-tuning fits when you need to shape tone, output format, domain jargon, or reasoning style that prompting cannot reliably elicit. They are not mutually exclusive, and production systems often fine-tune for voice and use RAG for knowledge.

What chunk size and top-k should I start with?

Start with fixed-size chunking and iterate only if recall is insufficient, per Pinecone. A common community starting point is roughly 400 to 512-token chunks with 10 to 20% overlap, retrieving top-k of 10 to 50 candidates and reranking down to about 3 to 5 for the final context. Treat all of these as starting points to tune on your own data, since the right values shift with document type and workload.

How do you evaluate a RAG system?

Evaluate retrieval and generation separately. Context precision and context recall grade the retriever, measuring how relevant the retrieved context was and how much needed information it captured. Faithfulness and answer relevancy grade the generator, measuring whether claims are supported by context and whether the answer addresses the question. Ragas is the standard reference-light framework. Build a small labeled evaluation set first, then add online monitoring once the system is live.

How do I implement RAG in Python?

The fastest way to implement RAG in Python is to pick a framework that handles the pipeline plumbing. LlamaIndex and LangChain are the two dominant options: both give you loaders for common document types, text splitters for chunking, connector classes for vector stores such as Pinecone, Weaviate, or pgvector, and chain abstractions that wire retrieval to generation. At a minimum you need four libraries: a document loader, an embedding model client (such as OpenAI Embeddings or a sentence-transformers model), a vector store client, and an LLM client. Wire them in the order of the pipeline: load, chunk, embed, store, then on query: embed the query, retrieve top-k chunks, insert into a prompt template, and call the LLM. LlamaIndex's Introduction to RAG and LangChain's Build a RAG App tutorials are the reference starting points.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her engineering pods ship retrieval-augmented systems on proprietary corpora, from the offline index to the permission-aware retrieval layer that decides what an LLM is allowed to read. She has sat through the eval reviews where a system that demoed well fell apart on context recall, and the security reviews where retrieved text turned out to be an attack surface, which is the vantage point behind this guide.

Resourcifi on LinkedIn →