How to implement a RAG system: the seven-stage pipeline, end to end
To implement RAG (retrieval-augmented generation) you stand up two pipelines: an offline index that turns your documents into searchable vectors, and an online query path that retrieves, augments, and generates a grounded answer. This guide covers every stage of a RAG implementation, the decisions that move accuracy, how to evaluate retrieval and generation separately, and the seven production failure points to budget for before launch.

The short version
- RAG couples a parametric LLM with an external knowledge store: at inference, relevant passages are retrieved from your corpus and supplied as context, so generation is grounded in evidence rather than weights. The architecture comes from Lewis et al., NeurIPS 2020.
- Think of it as two pipelines. An offline index (ingest, chunk, embed, store) and an online query path (retrieve, augment, generate), with evaluation running across both. That is the standard LlamaIndex and LangChain framing.
- The cleanest mental model for RAG versus fine-tuning: fine-tuning changes behavior, style, and format; RAG changes knowledge. Use RAG for proprietary or fast-changing facts and when citations matter; the two combine well in production.
- The decisions that move accuracy are chunk size and overlap, embedding model, top-k, and reranking. A common starting point is roughly 400 to 512-token chunks with 10 to 20% overlap, retrieving 10 to 50 candidates and reranking to 3 to 5, all requiring tuning on your own data; none are fixed law.
- Evaluate retrieval and generation separately: context precision and recall grade the retriever, faithfulness and answer relevancy grade the generator, e.g. with Ragas. And budget for the seven documented RAG failure points before they surface in production.
What a RAG system is, and when to use it
A RAG system couples a parametric large language model with an external, non-parametric knowledge store: at inference, relevant text passages are retrieved from your corpus and supplied to the model as context, so the answer is grounded in retrieved evidence instead of relying solely on the model's weights. The architecture comes from Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020), which paired a dense passage retriever with a sequence-to-sequence generator over a vector index and reported more specific, diverse, and factual output than a parametric-only baseline.1
Teams reach for RAG because it injects up-to-date or proprietary knowledge without retraining, grounds answers in cited sources to reduce hallucination, lets one base model serve many domains by swapping the data source, and stays debuggable: you can inspect exactly which chunks were retrieved.2 Enterprise adoption has accelerated fast: MarketsandMarkets valued the RAG market at USD 1.94 billion in 2025 and projects USD 9.86 billion by 2030 at a 38.4% CAGR, driven by demand for context-aware AI applications across every sector.13 The cleanest way to decide between RAG and fine-tuning is to remember that fine-tuning changes behavior, style, and format, while RAG changes knowledge.3 Use RAG when the answer depends on facts that are proprietary, large, or change often, and when provenance matters; use fine-tuning to shape tone, format, or reasoning style. They are not mutually exclusive, and production systems often do both. If you are building for an agentic use case, our RAG development team and the broader AI agents guide cover the permission-aware retrieval patterns most enterprise teams need.
How to implement RAG: the seven-stage pipeline
To implement RAG you split the work into two pipelines: an offline indexing pipeline that runs once per document change (ingest and chunk, embed, store) and an online query pipeline that runs on every request (retrieve, augment, generate), with evaluation cutting across both. That seven-stage decomposition is the standard framing in the LlamaIndex and LangChain documentation.4 The table below is the reference map: stage, which phase it belongs to, what happens, and the key decision each stage forces.
| Stage | Phase | What happens | Key decision |
|---|---|---|---|
| 1. Ingest and chunk | Offline index | Load source documents, normalize, and split into retrievable units of meaning | Chunk size and overlap |
| 2. Embed | Offline index | Convert each chunk to a dense vector with one consistent embedding model | Embedding model |
| 3. Store | Offline index | Write vectors plus source metadata into a vector database or index | Vector store and index type |
| 4. Retrieve | Online query | Embed the query, run approximate-nearest-neighbor search, optionally add keyword search and a reranker | top-k, hybrid search, reranking |
| 5. Augment | Online query | Insert the retrieved chunks into a prompt template with the query and system instructions | Context assembly |
| 6. Generate | Online query | The LLM answers grounded in the supplied context, ideally with inline citations | Citations, format constraints |
| 7. Evaluate | Cross-cutting | Score retrieval and generation against a labeled set, then monitor in production | Faithfulness, relevancy, precision, recall |
A few stages deserve a closer read. Chunking matters because embedding a whole document averages too many concepts into one vector, so chunks should be coherent, retrievable units; fixed-size, recursive-character, sentence, and semantic splitters are all standard options.5 Storage is where source metadata earns its keep: doc id, title, URL, and timestamp drive filtering, citation, and freshness, and later they become the only thing that scopes a retrieval to the right tenant. Retrieval is often a two-stage affair, broad recall first and precise ranking second, which is the single highest-leverage upgrade for most systems. The full version of this pipeline, with the index and the permission-aware retrieval layer, is what our RAG development team builds.
The decisions that actually move accuracy
Four decisions move RAG accuracy more than model choice: chunk size and overlap, the embedding model, top-k, and whether you rerank. A common starting point is roughly 400 to 512-token chunks with 10 to 20% overlap, retrieving top-k of 10 to 50 candidates and reranking down to 3 to 5 for the final context. Treat those as starting points to tune on your own data, because the right values vary by document type and workload and rarely hold as fixed constants.
Chunk size and overlap. Smaller chunks give more granular, precise matches but risk losing context; larger chunks carry more context but dilute relevance and add retrieval noise. Pinecone's guidance is to start with fixed-size chunking and iterate only if recall is insufficient, testing a range such as 128 or 256 tokens for granular matching against 512 or 1024 for context.5 The widely repeated 400 to 512-token default with 10 to 20% overlap is community consensus, useful as a first pass and nothing more.
Embedding model. Choose by retrieval performance, language coverage, dimensionality, cost, and whether you self-host or call an API. The MTEB leaderboard is the standard benchmark, though it tests mostly single-language text retrieval and not long-document or cross-lingual cases, so read it with that caveat.6 Keep one embedding model consistent across the entire corpus, because vectors from different models are not comparable. Leaderboard rankings shift often, so verify current standings before committing to a fixed pick.
top-k and reranking. A reliable pattern is to retrieve 10 to 50 candidates with approximate-nearest-neighbor search, optionally fuse dense vector search with BM25 keyword search for hybrid retrieval, then apply a cross-encoder reranker to cut the candidate set down to roughly 3 to 5 for the prompt.7 A reranker jointly encodes the query and each chunk for higher precision, and it earns its place when the right documents keep landing in positions three to eight instead of one or two. Keep the candidate set modest, under about fifty, to hold latency down. The two-stage "retrieve broadly, rank precisely" approach improves answer accuracy in benchmarks; specific percentage gains are vendor and blog claims, so trust the direction, not a fixed number.
How to evaluate a RAG system
Evaluate retrieval and generation separately. Context precision and context recall grade the retriever, while faithfulness and answer relevancy grade the generator. Ragas is the standard reference-light framework for this. Build a small labeled evaluation set first, score against it offline, then add online monitoring once the system is live.
The four core metrics split cleanly across the pipeline.8 Faithfulness measures the fraction of claims in the answer that are actually supported by the retrieved context, which is your direct hallucination check. Answer relevancy measures how well the answer addresses the question asked. Context precision measures how much of the retrieved context was genuinely relevant, and context recall measures how much of the information the answer needed was successfully retrieved. The split is what makes RAG debuggable: a faithful but irrelevant answer is a generation problem, while a faithful answer that misses available facts is almost always a recall problem in the retriever. Diagnose the layer first, then fix the stage, instead of tuning the whole system blind.
Production concerns: cost, latency, freshness, and security
In production, four concerns dominate a RAG system: latency, cost, freshness, and security. Budget latency end to end across query encoding, vector search, reranking, and generation as a single measurement; cut cost by reranking to a tighter top-k so fewer tokens reach the model; re-ingest documents on change so answers stay fresh; and treat retrieved text as an untrusted attack surface, because indirect prompt injection and embedding poisoning are real risks.
Latency and cost. Vector queries are a meaningful share of RAG latency, spanning query encoding, network round trips, and the approximate-nearest-neighbor search itself, before generation even starts; exact numbers are workload-dependent, so budget the whole path end to end.9 Reranking a small candidate set and tightening top-k is the lever that does double duty here, cutting both latency drift and the token count sent to the model. Freshness is the quiet killer: stale documents produce confidently wrong answers, so run re-ingestion on document change, store timestamps, and expire or flag documents past an age threshold.
Security. The retrieval layer is a genuine attack surface. Because RAG often treats retrieved text as trusted, an attacker who plants instructions inside an ingested document can hijack generation through indirect prompt injection, the prompt-injection class OWASP ranks first in its Top 10 for LLM applications.11 Embedding or data poisoning of the vector store can steer what gets retrieved in the first place.10 The defenses are concrete: allow-list sources, sanitize on ingestion, enforce access control and metadata-based tenant isolation at the retrieval step, and treat every retrieved chunk as untrusted input, never as instructions.
One last reality check is worth keeping on the wall. Barnett et al. catalog seven failure points in production RAG: content missing from the corpus, the right document missed by top-k ranking, a retrieved document dropped during context consolidation, an answer present in context but not extracted, the wrong output format, incorrect specificity, and an incomplete answer.12 The fixes map straight back onto the pipeline: improve chunking and coverage for missing content, raise top-k and add reranking for missed documents, fix context assembly for dropped chunks, and tighten prompts and format constraints for the extraction and formatting failures. For the agentic variant, where retrieval becomes a tool the model decides when to call, see our companion guide on agentic RAG.
How to implement a RAG system: common questions
What is a RAG system in simple terms?
What are the main steps to build a RAG system?
RAG or fine-tuning, which one should I use?
What chunk size and top-k should I start with?
How do you evaluate a RAG system?
How do I implement RAG in Python?
Sources
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS (2020).
- Pinecone, Retrieval-Augmented Generation.
- Red Hat, RAG vs. fine-tuning.
- LlamaIndex, Introduction to RAG; LangChain, Build a RAG app.
- Pinecone, Chunking Strategies for LLM Applications.
- MTEB, Massive Text Embedding Benchmark leaderboard.
- Weaviate, Hybrid search; Cohere, Rerank overview.
- Ragas, List of available metrics.
- Introl, RAG infrastructure for production.
- Lasso Security, RAG Security.
- OWASP, Top 10 for Large Language Model Applications.
- Barnett et al., Seven Failure Points When Engineering a RAG System, CAIN (2024).
- MarketsandMarkets, Retrieval-augmented Generation (RAG) Market worth $9.86 billion by 2030 (2025).
Building AI
AI Copilots for SaaS: Build vs Buy Guide
AI copilot vs AI agent for SaaS: a copilot assists, an agent acts. How an in-app copilot works, the RAG and multi-tenant...
Read guide →
Building AI
How to Add AI to Your SaaS Product: A Production-First Playbook
Learn how to build an AI SaaS product: the build-order playbook (prompt, RAG, fine-tune, agents), multi-tenant isolation...
Read guide →
Building AI
How to Build a Domain-Specific LLM
How to build a domain-specific LLM: RAG for facts, LoRA fine-tuning for behavior. Practical guide with compute costs from...
Read guide →
Building AI
How to Build an AI Copilot
Learn how to make an AI assistant: eight steps covering RAG, tool calling, guardrails, evals, and telemetry, backed by Mi...
Read guide →
Building AI
How to Build an AI SaaS Product
How to build a SaaS product with AI: the 5-phase build path, stack, margin reality, and pricing models. Trusted by 200+ e...
Read guide →
Building AI
How to Train a Custom Model
How to train an AI model: when to train vs. use an API, the 7-stage workflow, classical ML vs LLM fine-tuning, and the pi...
Read guide →
Agents & RAG
Agentic RAG: When to Use It and How to Build It
Agentic RAG explained: how it differs from naive and advanced RAG, the key patterns like corrective RAG and self-RAG, the...
Read guide →
Agents & RAG
AI Agent for Fintech: Risk, Compliance, Ops, Customer
AI agents in finance: fraud, AML, KYC and servicing use cases, how to build with money-movement guardrails and human appr...
Read guide →
Agents & RAG
AI Agent for Healthcare: Use Cases, Governance & Implementation
AI agents in healthcare: the use cases that pay off first, how to build one HIPAA-safe on FHIR with clinician review, and...
Read guide →
