
RAG System Architecture

Retrieval-Augmented Generation (RAG) grounds LLM responses in your own data, reducing hallucinations and keeping answers current without fine-tuning.

Basic Pipeline

User Query
    │
    ▼
[Embedding Model] ──► Query Vector
    │
    ▼
[Vector Store] ──► Top-K Relevant Chunks
    │
    ▼
[LLM] + Context ──► Response
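The pipeline above can be sketched end to end. Here `embed`, `search`, and `generate` are placeholders for your embedding model, vector store, and LLM — they are illustrative parameters, not real library APIs:

```python
def answer(query, embed, search, generate, k=3):
    """Minimal RAG loop: embed the query, fetch top-k chunks, prompt the LLM."""
    query_vector = embed(query)
    chunks = search(query_vector, k)             # top-k relevant chunks
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

In practice each placeholder would wrap a real component — e.g. an embedding API call, a vector-store query, and a chat-completion call.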

Indexing Phase

1. Document Chunking

Chunking strategy has the largest impact on retrieval quality.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(docs)

Guidelines:

  • Chunk size 256–512 tokens works well for most use cases
  • Overlap of 10–15% prevents context from being cut at boundaries
  • Preserve document metadata (source, page, section) with each chunk
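The size/overlap interplay can be illustrated with a toy token-level splitter. This is a simplification of what RecursiveCharacterTextSplitter does (it splits on character separators, not fixed token windows):

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Fixed-size sliding window: each chunk repeats the last `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

With size=512 and overlap=64 (~12%), a sentence near a boundary appears in two adjacent chunks, so retrieval can match it in either one.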

2. Embedding Models

Model                     Dimensions  Notes
text-embedding-3-small    1536        Good default, OpenAI
text-embedding-3-large    3072        Higher quality, higher cost
BAAI/bge-large-en-v1.5    1024        Strong open-source option
nomic-embed-text          768         Fast, runs locally

3. Vector Stores

Store     Best For
FAISS     In-memory, prototyping
Chroma    Local development
Qdrant    Production, filtering support
Pinecone  Managed, serverless
pgvector  Already using PostgreSQL
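All of these stores implement the same core operation: nearest-neighbor search over embedding vectors. A brute-force pure-Python sketch of that operation (conceptually what a flat FAISS index does, minus all the optimizations):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector); returns the k ids most similar to query_vec."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

Real stores replace the linear scan with approximate indexes (HNSW, IVF) to stay fast at millions of vectors.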

Retrieval Phase

Combine dense (vector) and sparse (BM25/keyword) retrieval for better coverage:

# Weighted fusion of dense and sparse scores; hybrid_search is illustrative
results = hybrid_search(query, alpha=0.7)  # alpha = weight toward dense retrieval
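Reciprocal Rank Fusion (RRF) is an alternative to alpha-weighted score fusion: it merges ranked lists using only rank positions, so the dense and sparse scores never need to share a scale. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists; k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Usage: `fused = rrf([dense_result_ids, bm25_result_ids])` — documents ranked highly by both retrievers rise to the top.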

Re-ranking

After retrieval, re-rank with a cross-encoder for higher precision:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, chunk) for chunk in retrieved_chunks]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, retrieved_chunks), key=lambda x: x[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:3]]

Common Failure Modes

Problem                Symptom                          Fix
Chunks too large       LLM ignores retrieved context    Reduce chunk size
Poor embeddings        Irrelevant results               Switch embedding model, try hybrid search
Lost in the middle     LLM uses first/last chunks only  Put most relevant chunks first
Stale index            Outdated answers                 Automate re-indexing pipeline
No metadata filtering  Results from wrong source        Add pre-filtering by doc type/date

Advanced Patterns

  • HyDE (Hypothetical Document Embeddings): Embed a generated answer to find similar real docs
  • Multi-query retrieval: Generate query variants to increase recall
  • Contextual compression: Trim retrieved chunks to only the relevant sentences before passing to LLM
  • Agentic RAG: LLM decides when and what to retrieve in a loop
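Of these, multi-query retrieval is the simplest to sketch. Here `retrieve` stands in for any single-query retriever, and the query variants would normally come from an LLM paraphrase prompt:

```python
def multi_query_retrieve(variants, retrieve, k=5):
    """Run each query variant, then merge results, deduplicating by doc id."""
    seen, merged = set(), []
    for query in variants:
        for doc in retrieve(query, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

The union of results trades a few extra retrieval calls for higher recall; re-ranking (above) then restores precision over the merged set.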