# RAG System Architecture
Retrieval-Augmented Generation (RAG) grounds LLM responses in your own data, reducing hallucinations and keeping answers current without fine-tuning.
## Basic Pipeline
```
User Query
     │
     ▼
[Embedding Model] ──► Query Vector
     │
     ▼
[Vector Store] ──► Top-K Relevant Chunks
     │
     ▼
[LLM] + Context ──► Response
```
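The whole pipeline can be sketched in pure Python. This is a toy illustration, not a real implementation: `embed` here is a bag-of-words stand-in for an embedding model, and `retrieve` is a brute-force cosine scan standing in for a vector store.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector.
    A real pipeline would call an embedding model here."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Stand-in for the vector store: score every chunk, keep top-k."""
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]


chunks = [
    "RAG grounds LLM answers in retrieved documents.",
    "Chunk size strongly affects retrieval quality.",
    "Bananas are rich in potassium.",
]
context = retrieve("how does RAG ground answers?", chunks)
prompt = "Answer using this context:\n" + "\n".join(context)  # fed to the LLM
```

The shape is the same in production; only the pieces change: the toy `embed` becomes a model call, and the linear scan becomes an indexed vector store.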
## Indexing Phase
### 1. Document Chunking
Chunking strategy often has the largest single impact on retrieval quality.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(docs)
```
Guidelines:
- Chunk size 256–512 tokens works well for most use cases
- Overlap of 10–15% prevents context from being cut at boundaries
- Preserve document metadata (source, page, section) with each chunk
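The guidelines above can be made concrete with a minimal sliding-window chunker. This is a sketch under simplifying assumptions: whitespace-separated words stand in for model tokens, and the metadata fields (`source`, `offset`) are illustrative.

```python
def chunk_document(text: str, source: str,
                   chunk_size: int = 512, overlap: int = 64) -> list[dict]:
    """Sliding-window chunking over whitespace tokens (a rough proxy
    for model tokens). Each chunk keeps its source metadata."""
    assert chunk_size > overlap, "overlap must be smaller than chunk size"
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        window = words[i:i + chunk_size]
        chunks.append({"text": " ".join(window), "source": source, "offset": i})
        if i + chunk_size >= len(words):
            break  # avoid a trailing chunk fully contained in the previous one
    return chunks


# With chunk_size=100 and overlap=10, consecutive chunks share 10 words
chunks = chunk_document("word " * 1000, "handbook.pdf",
                        chunk_size=100, overlap=10)
```

Libraries like LangChain's `RecursiveCharacterTextSplitter` add smarter boundary selection (preferring paragraph and sentence breaks), but the size/overlap mechanics are the same.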
### 2. Embedding Models
| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-small | 1536 | Good default, OpenAI |
| text-embedding-3-large | 3072 | Higher quality, higher cost |
| BAAI/bge-large-en-v1.5 | 1024 | Strong open-source option |
| nomic-embed-text | 768 | Fast, runs locally |
### 3. Vector Stores
| Store | Best For |
|---|---|
| FAISS | In-memory, prototyping |
| Chroma | Local development |
| Qdrant | Production, filtering support |
| Pinecone | Managed, serverless |
| pgvector | Already using PostgreSQL |
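Conceptually, every store in the table computes the same thing: top-k nearest vectors. A minimal in-memory version with NumPy makes that explicit; this is roughly what a flat (exact) index does, while production stores replace the brute-force scan with approximate-nearest-neighbor indexes. The class and field names here are illustrative, not any store's API.

```python
import numpy as np


class FlatVectorStore:
    """Minimal in-memory store: exact top-k by cosine similarity.
    FAISS/Qdrant/etc. swap the linear scan for ANN index structures."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads: list[dict] = []

    def add(self, vecs: np.ndarray, payloads: list[dict]) -> None:
        # Normalize so the inner product below equals cosine similarity
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, vecs.astype(np.float32)])
        self.payloads.extend(payloads)

    def search(self, query: np.ndarray, k: int = 5) -> list[tuple[float, dict]]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q          # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.payloads[i]) for i in top]


store = FlatVectorStore(dim=3)
store.add(
    np.array([[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0]], dtype=np.float32),
    [{"id": "a"}, {"id": "b"}, {"id": "c"}],
)
results = store.search(np.array([1, 0, 0], dtype=np.float32), k=2)
```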
## Retrieval Phase
### Hybrid Search
Combine dense (vector) and sparse (BM25/keyword) retrieval for better coverage:
```python
# hybrid_search is illustrative; alpha-weighted score blending is one
# fusion strategy, rank-based fusion (RRF) is another
results = hybrid_search(query, alpha=0.7)  # 0.7 = weight toward dense
```
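Reciprocal Rank Fusion, the rank-based alternative to alpha weighting, is simple enough to sketch in full. Each document scores `1 / (k + rank)` summed across the ranked lists it appears in; `k = 60` is the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one ranking.
    A doc ranked highly by multiple retrievers rises to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d3", "d1", "d2"]    # ranking from vector search
sparse = ["d1", "d4", "d3"]   # ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, not raw scores, it needs no score normalization between the dense and sparse retrievers, which is its main practical advantage over alpha weighting.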
### Re-ranking
After retrieval, re-rank with a cross-encoder for higher precision:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
# Sort by score only; bare tuple sorting would compare chunks on ties
ranked = sorted(zip(scores, retrieved_chunks), key=lambda p: p[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:3]]
```
## Common Failure Modes
| Problem | Symptom | Fix |
|---|---|---|
| Chunks too large | LLM ignores retrieved context | Reduce chunk size |
| Poor embeddings | Irrelevant results | Switch embedding model, try hybrid search |
| Lost in the middle | LLM uses first/last chunks only | Put most relevant chunks first |
| Stale index | Outdated answers | Automate re-indexing pipeline |
| No metadata filtering | Results from wrong source | Add pre-filtering by doc type/date |
## Advanced Patterns
- HyDE (Hypothetical Document Embeddings): Embed a generated answer to find similar real docs
- Multi-query retrieval: Generate query variants to increase recall
- Contextual compression: Trim retrieved chunks to only the relevant sentences before passing to LLM
- Agentic RAG: LLM decides when and what to retrieve in a loop
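Of the patterns above, multi-query retrieval is easy to sketch. The variants would normally come from an LLM prompt like "rewrite this question three different ways"; here they are supplied directly, and `retrieve` is any function returning ranked doc IDs for a query.

```python
def multi_query_retrieve(query: str, variants: list[str],
                         retrieve, k: int = 5) -> list[str]:
    """Run the original query plus rephrased variants, merging results
    in rank order and de-duplicating by doc ID to boost recall."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in [query] + variants:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:k]


# Stand-in retriever: a fixed query -> ranked-IDs mapping
fake_index = {
    "how do I reset my password": ["d1", "d2"],
    "password reset steps": ["d2", "d3"],
}
merged = multi_query_retrieve(
    "how do I reset my password",
    ["password reset steps"],
    retrieve=fake_index.__getitem__,
)
```

Pairing this with RRF instead of first-seen ordering is a common refinement when the per-query rankings disagree.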