
RAG System Architecture

Retrieval-Augmented Generation (RAG) grounds LLM responses in your own data, reducing hallucinations and keeping answers current without fine-tuning.

Basic Pipeline

User Query
    │
    ▼
[Embedding Model] ──► Query Vector
    │
    ▼
[Vector Store] ──► Top-K Relevant Chunks
    │
    ▼
[LLM] + Context ──► Response
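The pipeline above can be sketched end to end. Here `embed`, `search`, and `generate` are placeholders for your embedding model, vector store, and LLM — they are illustrative parameters, not real library APIs:

```python
def answer(query, embed, search, generate, k=3):
    """Minimal RAG loop: embed the query, fetch top-k chunks, prompt the LLM."""
    query_vector = embed(query)
    chunks = search(query_vector, k)             # top-k relevant chunks
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

In practice each placeholder would wrap a real component — e.g. an embedding API call, a vector-store query, and a chat-completion call.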

Indexing Phase

1. Document Chunking

Chunking strategy has the largest impact on retrieval quality.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(docs)

Guidelines:

  • Chunk size 256–512 tokens works well for most use cases
  • Overlap of 10–15% prevents context from being cut at boundaries
  • Preserve document metadata (source, page, section) with each chunk
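The size/overlap interplay can be illustrated with a toy token-level splitter. This is a simplification of what RecursiveCharacterTextSplitter does (it splits on character separators, not fixed token windows):

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Fixed-size sliding window: each chunk repeats the last `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

With size=512 and overlap=64 (~12%), a sentence near a boundary appears in two adjacent chunks, so retrieval can match it in either one.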

2. Embedding Models

Model                     Dimensions  Notes
text-embedding-3-small    1536        Good default, OpenAI
text-embedding-3-large    3072        Higher quality, higher cost
BAAI/bge-large-en-v1.5    1024        Strong open-source option
nomic-embed-text          768         Fast, runs locally

3. Vector Stores

Store     Best For
FAISS     In-memory, prototyping
Chroma    Local development
Qdrant    Production, filtering support
Pinecone  Managed, serverless
pgvector  Already using PostgreSQL
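All of these stores implement the same core operation: nearest-neighbor search over embedding vectors. A brute-force pure-Python sketch of that operation (conceptually what a flat FAISS index does, minus all the optimizations):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector); returns the k ids most similar to query_vec."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

Real stores replace the linear scan with approximate indexes (HNSW, IVF) to stay fast at millions of vectors.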

Retrieval Phase

Combine dense (vector) and sparse (BM25/keyword) retrieval for better coverage:

# Weighted fusion of dense and sparse scores; hybrid_search is illustrative
results = hybrid_search(query, alpha=0.7)  # alpha = weight toward dense retrieval
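Reciprocal Rank Fusion (RRF) is an alternative to alpha-weighted score fusion: it merges ranked lists using only rank positions, so the dense and sparse scores never need to share a scale. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists; k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Usage: `fused = rrf([dense_result_ids, bm25_result_ids])` — documents ranked highly by both retrievers rise to the top.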

Re-ranking

After retrieval, re-rank with a cross-encoder for higher precision:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, chunk) for chunk in retrieved_chunks]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, retrieved_chunks), key=lambda x: x[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:3]]

Common Failure Modes

Problem                Symptom                          Fix
Chunks too large       LLM ignores retrieved context    Reduce chunk size
Poor embeddings        Irrelevant results               Switch embedding model, try hybrid search
Lost in the middle     LLM uses first/last chunks only  Put most relevant chunks first
Stale index            Outdated answers                 Automate re-indexing pipeline
No metadata filtering  Results from wrong source        Add pre-filtering by doc type/date

Advanced Patterns

  • HyDE (Hypothetical Document Embeddings): Embed a generated answer to find similar real docs
  • Multi-query retrieval: Generate query variants to increase recall
  • Contextual compression: Trim retrieved chunks to only the relevant sentences before passing to LLM
  • Agentic RAG: LLM decides when and what to retrieve in a loop
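Of these, multi-query retrieval is the simplest to sketch. Here `retrieve` stands in for any single-query retriever, and the query variants would normally come from an LLM paraphrase prompt:

```python
def multi_query_retrieve(variants, retrieve, k=5):
    """Run each query variant, then merge results, deduplicating by doc id."""
    seen, merged = set(), []
    for query in variants:
        for doc in retrieve(query, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

The union of results trades a few extra retrieval calls for higher recall; re-ranking (above) then restores precision over the merged set.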