The Problem: Hallucination & Staleness

LLMs (GPT-4, Claude) are frozen in time.

Staleness: They don't know today's news.
Private Data: They don't know your company's internal wiki.
Hallucination: They confidently make up facts.

RAG solves this by fetching relevant data before meaningful generation.

The Architecture

mermaid

graph TD
    User[User Question] --> App[RAG Application]
    
    subgraph Ingestion [1. Ingestion Phase]
        Docs[PDF/Wiki] --> Chunk[HTML/Text Splitter]
        Chunk --> Emb[Embedding Model]
        Emb --> VecDB[(Vector Database)]
    end
    
    subgraph Retrieval [2. Retrieval Phase]
        App -->|Embed Query| Emb
        Emb -->|Vector| VecDB
        VecDB -->|Top K Chunks| App
    end
    
    subgraph Generation [3. Generation Phase]
        App -->|Context + Query| LLM[LLM (GPT-4)]
        LLM -->|Answer| User
    end

Click to expand code...

1. Ingestion (Offline)

Load: Read PDFs, Slack history, Notion.
Split: Break text into chunks (e.g., 500 tokens).
Embed: Convert text to vectors using an Embedding Model (OpenAI text-embedding-3).
Store: Save vectors + text in a Vector DB (Pinecone, Milvus, pgvector).

2. Retrieval (Online)

User asks: "What is the vacation policy?"
Convert question to vector: [0.1, 0.5, -0.9...]
Perform Semantic Search (Cosine Similarity) in Vector DB.
Get top 3 chunks: "Policy 2024: 20 days off..."

3. Generation (Online)

Prompt Engineering:

text

System: Answer using only the Context below.
Context: "Policy 2024: 20 days off..."
User: What is the vacation policy?

Click to expand code...

LLM generates accurate answer grounded in fact.

Vector Search Internals

How do we search 1 billion vectors in milliseconds? We can't compare every vector ( $O(N)$ ). We use Approximate Nearest Neighbor (ANN) algorithms.

HNSW (Hierarchical Navigable Small World)

Think of it like a skip-list for graphs.

Data Structure: Multi-layered graph.
Search: Start at top layer (sparse), drill down to bottom layer (dense).
Complexity: $O(\log N)$ .

mermaid

graph TD
    Entry point --> Node1
    Node1 --> Node2
    Node2 --> Target

Click to expand code...

Hybrid Search

Semantic search (Vectors) is great for concepts ("dog" matches "puppy"), but bad for keywords ("Error 504" might match "Error 404").

Solution: Combine Vector Search + Keyword Search (BM25).

RRF (Reciprocal Rank Fusion) merges the two result sets.

Code Example: Simple RAG pipeline

python

import os
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# 1. Setup Retrieval
embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})

# 2. Define Chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
prompt = PromptTemplate.from_template(
    "Context: {context}\n\nQuestion: {question}\nAnswer:"
)

def ask(question):
    # A. Retrieve
    docs = retriever.get_relevant_documents(question)
    context_text = "\n\n".join([d.page_content for d in docs])
    
    # B. Augment
    final_prompt = prompt.format(context=context_text, question=question)
    
    # C. Generate
    return llm.predict(final_prompt)

# print(ask("How do I reset my password?"))

Click to expand code...

Advanced Techniques

Corrective RAG (CRAG)

If vector search returns low confidence score:

Fallback: Use Web Search (Google API).
Filter: LLM grades retrieved documents for relevance. Discard irrelevant ones.

Multi-Query Retrieval

User asks complex question?

LLM rewrites query into 3 sub-queries.
Execute all 3.
Deduplicate results.

Interview Tips 💡

"Context Window Limit" — LLMs can process 128k+ tokens now, so why RAG?
- Cost: Processing 1M tokens per query is expensive ($$).
- Latency: Takes seconds to process huge context.
- Accuracy: "Lost in the Middle" phenomenon.
"Chunking Strategy" — Too small? Missing context. Too big? Noise. Overlapping chunks helps.

Related Concepts

Vector Databases
Database Indexing (HNSW vs B-Tree)

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Advanced

ML Model Serving

Taking models from Jupyter Notebooks to Production. Inference patterns (Real-time vs Batch), Batching strategies, and optimization techniques (Quantization, KV Caching).

AIMLOpsInference

Advanced

Vector Databases (AI/Embeddings)

Complete guide to vector databases optimized for AI embeddings and semantic search, covering vector similarity, approximate nearest neighbor algorithms (HNSW, IVF), and production implementations in Pinecone, Milvus, Weaviate, and pgvector powering LLM applications.

AIDatabasesSearch

RAG Architecture (LLMs)

The Problem: Hallucination & Staleness

The Architecture

1. Ingestion (Offline)

2. Retrieval (Online)

3. Generation (Online)

Vector Search Internals

HNSW (Hierarchical Navigable Small World)

Hybrid Search

Code Example: Simple RAG pipeline

Advanced Techniques

Corrective RAG (CRAG)

Multi-Query Retrieval

Interview Tips 💡

Related Concepts

About ScaleWiki

Related Articles

ML Model Serving

Vector Databases (AI/Embeddings)