The Problem: Hallucination & Staleness
LLMs (GPT-4, Claude) are frozen in time.
- Staleness: They don't know today's news.
- Private Data: They don't know your company's internal wiki.
- Hallucination: They confidently make up facts.
RAG solves this by fetching relevant data before meaningful generation.
The Architecture
graph TD
User[User Question] --> App[RAG Application]
subgraph Ingestion [1. Ingestion Phase]
Docs[PDF/Wiki] --> Chunk[HTML/Text Splitter]
Chunk --> Emb[Embedding Model]
Emb --> VecDB[(Vector Database)]
end
subgraph Retrieval [2. Retrieval Phase]
App -->|Embed Query| Emb
Emb -->|Vector| VecDB
VecDB -->|Top K Chunks| App
end
subgraph Generation [3. Generation Phase]
App -->|Context + Query| LLM[LLM (GPT-4)]
LLM -->|Answer| User
end
1. Ingestion (Offline)
- Load: Read PDFs, Slack history, Notion.
- Split: Break text into chunks (e.g., 500 tokens).
- Embed: Convert text to vectors using an Embedding Model (OpenAI
text-embedding-3). - Store: Save vectors + text in a Vector DB (Pinecone, Milvus, pgvector).
2. Retrieval (Online)
- User asks: "What is the vacation policy?"
- Convert question to vector:
[0.1, 0.5, -0.9...] - Perform Semantic Search (Cosine Similarity) in Vector DB.
- Get top 3 chunks: "Policy 2024: 20 days off..."
3. Generation (Online)
- Prompt Engineering:
text
System: Answer using only the Context below. Context: "Policy 2024: 20 days off..." User: What is the vacation policy?
Click to expand code... - LLM generates accurate answer grounded in fact.
Vector Search Internals
How do we search 1 billion vectors in milliseconds? We can't compare every vector (). We use Approximate Nearest Neighbor (ANN) algorithms.
HNSW (Hierarchical Navigable Small World)
Think of it like a skip-list for graphs.
- Data Structure: Multi-layered graph.
- Search: Start at top layer (sparse), drill down to bottom layer (dense).
- Complexity: .
graph TD
Entry point --> Node1
Node1 --> Node2
Node2 --> Target
Hybrid Search
Semantic search (Vectors) is great for concepts ("dog" matches "puppy"), but bad for keywords ("Error 504" might match "Error 404").
Solution: Combine Vector Search + Keyword Search (BM25).
RRF (Reciprocal Rank Fusion)merges the two result sets.
Code Example: Simple RAG pipeline
import os
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
# 1. Setup Retrieval
embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})
# 2. Define Chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
prompt = PromptTemplate.from_template(
"Context: {context}\n\nQuestion: {question}\nAnswer:"
)
def ask(question):
# A. Retrieve
docs = retriever.get_relevant_documents(question)
context_text = "\n\n".join([d.page_content for d in docs])
# B. Augment
final_prompt = prompt.format(context=context_text, question=question)
# C. Generate
return llm.predict(final_prompt)
# print(ask("How do I reset my password?"))
Advanced Techniques
Corrective RAG (CRAG)
If vector search returns low confidence score:
- Fallback: Use Web Search (Google API).
- Filter: LLM grades retrieved documents for relevance. Discard irrelevant ones.
Multi-Query Retrieval
User asks complex question?
- LLM rewrites query into 3 sub-queries.
- Execute all 3.
- Deduplicate results.
Interview Tips š”
- "Context Window Limit" ā LLMs can process 128k+ tokens now, so why RAG?
- Cost: Processing 1M tokens per query is expensive ($$).
- Latency: Takes seconds to process huge context.
- Accuracy: "Lost in the Middle" phenomenon.
- "Chunking Strategy" ā Too small? Missing context. Too big? Noise. Overlapping chunks helps.
Related Concepts
- Vector Databases
- Database Indexing (HNSW vs B-Tree)
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
ML Model Serving
Taking models from Jupyter Notebooks to Production. Inference patterns (Real-time vs Batch), Batching strategies, and optimization techniques (Quantization, KV Caching).
Vector Databases (AI/Embeddings)
Complete guide to vector databases optimized for AI embeddings and semantic search, covering vector similarity, approximate nearest neighbor algorithms (HNSW, IVF), and production implementations in Pinecone, Milvus, Weaviate, and pgvector powering LLM applications.