
Vector Databases (AI/Embeddings)

A guide to vector databases for AI embeddings and semantic search: vector similarity measures, approximate nearest neighbor algorithms (HNSW, IVF, LSH), and production use with Pinecone, Weaviate, and pgvector in LLM applications.

Search by Meaning, Not Keywords

Traditional SQL:
  Query: "terrible experience"
  SELECT * FROM reviews WHERE text LIKE '%terrible%'
  Result: ✓ Found
  
  Query: "awful service"  
  SELECT * FROM reviews WHERE text LIKE '%awful%'
  Result: ❌ Nothing (different words!)

Vector Database:
  Query: "awful service"
  Embedding: [0.82, 0.11, ...]
  Similar to: "terrible experience" [0.81, 0.09, ...]
  Distance: 0.05 (very close!)
  Result: ✓ Found by meaning, not keywords

What are Vector Databases?

Vector databases store and query high-dimensional vectors (embeddings) representing semantic meaning, enabling similarity search based on concepts rather than exact matches.

The Problem They Solve

Traditional databases match exact text. Vector databases match meaning.

User searches: "puppy"

Traditional:
  - Matches: "puppy", "puppies"
  - Misses: "dog", "canine", "pet"

Vector DB:
  - Matches all related concepts
  - "puppy" ≈ "dog" ≈ "pet" ≈ "canine"
  - Based on semantic similarity

Embeddings: Text → Numbers

Embedding models (such as OpenAI's text-embedding-3-small or BERT-based sentence encoders) convert text or images into fixed-length vectors.

python
from openai import OpenAI

client = OpenAI(api_key="...")

# Convert text to vector
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox"
)

embedding = response.data[0].embedding
# Result: [0.123, -0.456, 0.789, ..., 0.321]
# Dimension: 1536 (for text-embedding-3-small)

Embedding Examples

"puppy"     → [0.1,  0.9,  0.2, -0.1, ...]
"dog"       → [0.12, 0.88, 0.21, -0.09, ...]
"car"       → [0.85, 0.05, 0.15, 0.92, ...]
"automobile"→ [0.83, 0.06, 0.14, 0.91, ...]

Notice:
- "puppy" and "dog" are CLOSE (similar meaning)
- "car" and "automobile" are CLOSE
- "puppy" and "car" are FAR (different concepts)

Similarity Measurement

1. Cosine Similarity

Most common for text embeddings.

python
import numpy as np

def cosine_similarity(v1, v2):
    """
    Measures angle between vectors
    Range: [-1, 1]
      1 = identical direction
      0 = orthogonal
     -1 = opposite direction
    """
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# Example
puppy = np.array([0.1, 0.9, 0.2])
dog = np.array([0.12, 0.88, 0.21])
car = np.array([0.85, 0.05, 0.15])

print(cosine_similarity(puppy, dog))  # ≈ 0.9996 (very similar!)
print(cosine_similarity(puppy, car))  # ≈ 0.1996 (different)

2. Euclidean Distance

Straight-line distance in vector space.

python
def euclidean_distance(v1, v2):
    """
    L2 distance
    Range: [0, ∞)
      0 = identical
      larger = more different
    """
    return np.linalg.norm(v1 - v2)

print(euclidean_distance(puppy, dog))  # ≈ 0.03 (close)
print(euclidean_distance(puppy, car))  # ≈ 1.135 (far)

3. Dot Product

python
def dot_product_similarity(v1, v2):
    """
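    Cheaper than cosine: skips the division by the two norms.
    Equals cosine similarity only when both vectors are already
    normalized to unit length; otherwise longer vectors score higher.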
    """
    return np.dot(v1, v2)
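The relationship between the three measures is easiest to see on unit-length vectors. The following is a small sketch using only the standard library; the helper names (`normalize`, `dot`, `cosine`) are illustrative and not part of any library. It checks that, after normalization, the plain dot product equals cosine similarity and the squared Euclidean distance equals 2 - 2·cosine, so all three measures rank neighbors in the same order.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

puppy = [0.1, 0.9, 0.2]
car = [0.85, 0.05, 0.15]

puppy_unit, car_unit = normalize(puppy), normalize(car)

cos_raw = cosine(puppy, car)                       # cosine on the raw vectors
dot_unit = dot(puppy_unit, car_unit)               # dot product after normalizing
sq_l2_unit = math.dist(puppy_unit, car_unit) ** 2  # squared L2 after normalizing

print(f"cosine:            {cos_raw:.6f}")
print(f"dot (unit):        {dot_unit:.6f}")
print(f"2 - 2*cosine:      {2 - 2 * cos_raw:.6f}")
print(f"squared L2 (unit): {sq_l2_unit:.6f}")
```

This is why many systems normalize embeddings once at write time and then index with the dot product.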

The Billion-Vector Problem

Naive approach:

python
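def find_similar(query_vector, all_vectors, k=10):
    """Compare query to every vector - TOO SLOW at scale!"""
    similarities = []
    
    for i, vector in enumerate(all_vectors):  # 1 billion iterations!
        sim = cosine_similarity(query_vector, vector)
        similarities.append((i, sim))  # Keep the index so we know which vector matched
    
    # Return top k as (index, similarity) pairs
    return sorted(similarities, key=lambda pair: -pair[1])[:k]

# Time: O(N * D) per query, where N = number of vectors, D = dimensions
# For 1B vectors of 1536 dims: ~1.5 trillion multiply-adds and ~6 TB of
# float32 data to scan for every single query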
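A back-of-the-envelope calculation shows where the cost of exhaustive search comes from. The vector count and dimension are the ones used above; the two hardware throughput figures are assumptions chosen only for illustration, so the resulting times indicate an order of magnitude, not a benchmark.

```python
n_vectors = 1_000_000_000
dims = 1536
bytes_per_float32 = 4

multiply_adds = n_vectors * dims                     # one per vector component
index_bytes = n_vectors * dims * bytes_per_float32   # raw vectors, no metadata

# Assumed hardware figures, purely illustrative
assumed_multiply_adds_per_sec = 50e9
assumed_memory_bandwidth = 50e9   # bytes per second

compute_seconds = multiply_adds / assumed_multiply_adds_per_sec
scan_seconds = index_bytes / assumed_memory_bandwidth

print(f"multiply-adds per query: {multiply_adds:.3e}")
print(f"raw vector storage:      {index_bytes / 1e12:.2f} TB")
print(f"compute-bound estimate:  {compute_seconds:.0f} s per query")
print(f"scan-bound estimate:     {scan_seconds:.0f} s per query")
```

Under these assumptions a single query takes tens of seconds to minutes, and the raw vectors alone do not fit in one machine's memory.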

Solution: Approximate Nearest Neighbor (ANN) algorithms, which give up a small amount of recall to avoid scanning every vector.


ANN Algorithms

1. HNSW (Hierarchical Navigable Small World)

Graph-based index - often the best recall/latency trade-off when the index fits in memory.

Concept: Multi-layer graph
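- Layer 0: All vectors
- Layer 1: Random subset (about 1/M of layer 0, M = max connections per node)
- Layer 2: Smaller subset (about 1/M of layer 1)
- ...and so on; the top layer holds only a handful of nodes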

Search:
1. Start at top layer
2. Navigate to approximate region
3. Drop to lower layer
4. Refine search
5. Repeat until bottom layer

Time: roughly O(log N) per query in practice (not a worst-case guarantee)
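How a vector ends up in a given layer can be sketched with the level-sampling rule from the HNSW paper: draw a uniform random number and take `floor(-ln(u) / ln(M))`, which gives P(level >= l) = M^(-l). The function name `random_level` and the sample size are illustrative choices; with M = 16, roughly 1/16 of nodes should reach layer 1 and roughly 1/256 should reach layer 2.

```python
import math
import random
from collections import Counter

def random_level(M, rng):
    """Draw a node's top layer; P(level >= l) = M ** -l."""
    m_l = 1.0 / math.log(M)   # level normalization factor
    return int(-math.log(1.0 - rng.random()) * m_l)

rng = random.Random(42)
M = 16
levels = Counter(random_level(M, rng) for _ in range(100_000))
n_total = sum(levels.values())

fraction_reaching = {}
for level in sorted(levels):
    reached = sum(count for l, count in levels.items() if l >= level)
    fraction_reaching[level] = reached / n_total
    print(f"layer {level}: {fraction_reaching[level]:.4%} of nodes")
```

Because each layer is exponentially smaller than the one below, the number of layers grows only logarithmically with the number of vectors.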

Implementation concept:

python
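class HNSWIndex:
    """Structural sketch only: get_random_layer, search_layer and
    connect_nodes are placeholders that are not implemented here.
    search_layer(query, layer, entry) is assumed to return the nodes
    nearest to query on that layer, closest first."""

    def __init__(self, dim, M=16, ef_construction=200):
        self.dim = dim
        self.M = M  # Max connections per node
        self.ef_construction = ef_construction  # Candidates examined per insert
        self.layers = []  # Multi-layer graph
        self.vectors = {}
        self.entry_point = None  # Node in the top layer where every search starts
        
    def insert(self, vector_id, vector):
        """Insert vector into index"""
        self.vectors[vector_id] = vector
        
        # Determine the highest layer this vector appears in
        layer = self.get_random_layer()
        
        # Link to nearest neighbors on that layer and on every layer below it
        entry = self.entry_point
        for l in range(layer, -1, -1):
            neighbors = self.search_layer(vector, l, entry)
            self.connect_nodes(vector_id, neighbors, l)
            entry = neighbors[0]
    
    def search(self, query_vector, k=10):
        """Find k nearest neighbors"""
        # Start from top layer
        current = self.entry_point
        
        # Move greedily toward the query on each upper layer
        for layer in range(len(self.layers) - 1, 0, -1):
            current = self.search_layer(query_vector, layer, current)[0]
        
        # Final, wider search in layer 0, where every vector lives
        return self.search_layer(query_vector, 0, current)[:k]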
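For readers who want something executable, here is a deliberately simplified, pure-Python sketch of the same idea. It is not the reference algorithm: the class name `TinyHNSW`, the parameter defaults, the use of Euclidean distance, and the crude pruning rule (keep the closest 2·M links on every layer, with no neighbor-selection heuristic) are all assumptions made to keep the code short. It does keep the two defining pieces: random exponentially distributed levels and a greedy descent followed by a best-first search on layer 0.

```python
import heapq
import math
import random


class TinyHNSW:
    """Simplified HNSW: layered graph, greedy descent, best-first base search."""

    def __init__(self, M=8, ef_construction=32, seed=0):
        self.M = M                          # links created per new node per layer
        self.ef_construction = ef_construction
        self.rng = random.Random(seed)
        self.vectors = {}                   # node id -> vector
        self.layers = []                    # layers[l]: node id -> set of neighbor ids
        self.entry_point = None

    def _dist(self, query, node_id):
        return math.dist(query, self.vectors[node_id])

    def _random_level(self):
        # Exponentially decaying: P(level >= l) = M ** -l
        return int(-math.log(1.0 - self.rng.random()) / math.log(self.M))

    def _search_layer(self, query, entry_ids, ef, level):
        """Return up to ef (distance, id) pairs from one layer, nearest first."""
        graph = self.layers[level]
        visited = set(entry_ids)
        frontier = [(self._dist(query, e), e) for e in entry_ids]  # min-heap
        heapq.heapify(frontier)
        best = [(-d, e) for d, e in frontier]                      # max-heap via negation
        heapq.heapify(best)
        while frontier:
            d, node = heapq.heappop(frontier)
            if len(best) >= ef and d > -best[0][0]:
                break                       # nothing left that can improve the result
            for nb in graph[node]:
                if nb in visited:
                    continue
                visited.add(nb)
                d_nb = self._dist(query, nb)
                if len(best) < ef or d_nb < -best[0][0]:
                    heapq.heappush(frontier, (d_nb, nb))
                    heapq.heappush(best, (-d_nb, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
        return sorted((-neg_d, n) for neg_d, n in best)

    def insert(self, node_id, vector):
        self.vectors[node_id] = vector
        level = self._random_level()
        if self.entry_point is None:        # first node opens every layer it reaches
            self.layers = [{node_id: set()} for _ in range(level + 1)]
            self.entry_point = node_id
            return
        top = len(self.layers) - 1
        entry = [self.entry_point]
        for l in range(top, level, -1):     # above the node's level: only move closer
            entry = [self._search_layer(vector, entry, 1, l)[0][1]]
        for l in range(min(top, level), -1, -1):
            found = self._search_layer(vector, entry, self.ef_construction, l)
            neighbors = [n for _, n in found[: self.M]]
            self.layers[l][node_id] = set(neighbors)
            for n in neighbors:
                links = self.layers[l][n]
                links.add(node_id)
                if len(links) > 2 * self.M:  # crude pruning: keep the closest links
                    keep = sorted(links, key=lambda x: self._dist(self.vectors[n], x))
                    self.layers[l][n] = set(keep[: 2 * self.M])
            entry = [n for _, n in found]
        for _ in range(top + 1, level + 1):  # node reaches above the current top
            self.layers.append({node_id: set()})
        if level > top:
            self.entry_point = node_id

    def search(self, query, k=10, ef=64):
        entry = [self.entry_point]
        for l in range(len(self.layers) - 1, 0, -1):
            entry = [self._search_layer(query, entry, 1, l)[0][1]]
        found = self._search_layer(query, entry, max(ef, k), 0)
        return [(n, d) for d, n in found[:k]]


rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
index = TinyHNSW()
for i, v in enumerate(data):
    index.insert(i, v)

recalls = []
for _ in range(20):
    q = [rng.gauss(0, 1) for _ in range(8)]
    results = index.search(q, k=5)
    approx = {n for n, _ in results}
    exact = set(sorted(range(len(data)), key=lambda i: math.dist(q, data[i]))[:5])
    recalls.append(len(approx & exact) / 5)
mean_recall = sum(recalls) / len(recalls)
print(f"mean recall@5 over 20 queries: {mean_recall:.2f}")
```

Raising `ef` at query time widens the layer-0 search and trades latency for recall, which is the main tuning knob exposed by production HNSW implementations.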

2. IVF (Inverted File Index)

Clustering-based - good for massive scale.

1. Cluster vectors into N groups (k-means)
2. Query:
   a. Find nearest cluster centroids
   b. Search only those clusters
   c. Return top results

Speedup: Only the probed clusters are scanned, typically a few percent of all vectors (about n_probe / n_clusters)

Implementation:

python
class IVFIndex:
    def __init__(self, dim, n_clusters=1000):
        self.dim = dim
        self.n_clusters = n_clusters
        self.centroids = None
        self.clusters = {i: [] for i in range(n_clusters)}
        
    def train(self, training_vectors):
        """Cluster vectors with k-means"""
        from sklearn.cluster import KMeans
        
        kmeans = KMeans(n_clusters=self.n_clusters)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_
    
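    def find_nearest_cluster(self, vector):
        """Index of the centroid most similar to vector"""
        scores = [cosine_similarity(vector, c) for c in self.centroids]
        return int(np.argmax(scores))
    
    def add(self, vector_id, vector):
        """Add vector to nearest cluster (call train() first)"""
        cluster_id = self.find_nearest_cluster(vector)
        self.clusters[cluster_id].append((vector_id, vector))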
    
    def search(self, query_vector, k=10, n_probe=5):
        """Search n_probe nearest clusters"""
        # Find nearest cluster centroids
        cluster_scores = [
            (i, cosine_similarity(query_vector, centroid))
            for i, centroid in enumerate(self.centroids)
        ]
        top_clusters = sorted(cluster_scores, key=lambda x: -x[1])[:n_probe]
        
        # Search those clusters
        candidates = []
        for cluster_id, _ in top_clusters:
            for vector_id, vector in self.clusters[cluster_id]:
                sim = cosine_similarity(query_vector, vector)
                candidates.append((vector_id, sim))
        
        # Return top k
        return sorted(candidates, key=lambda x: -x[1])[:k]
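The effect of `n_probe` on accuracy can be measured directly. The sketch below is a standard-library-only variant of the idea, written under a few assumptions: it uses Euclidean distance throughout (the metric k-means itself minimizes) rather than cosine, a hand-rolled Lloyd's k-means instead of scikit-learn, and small random data. The function names `kmeans`, `build_ivf`, and `ivf_search` are illustrative.

```python
import math
import random

def kmeans(points, k, iterations, rng):
    """Plain Lloyd's k-means; returns the list of centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(points, centroids):
    """Inverted lists: centroid index -> ids of the points assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for i, p in enumerate(points):
        nearest = min(lists, key=lambda c: math.dist(p, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, points, centroids, lists, k, n_probe):
    """Scan only the n_probe inverted lists whose centroids are closest."""
    probe = sorted(lists, key=lambda c: math.dist(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, points[i]))[:k]

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids = kmeans(points, k=20, iterations=5, rng=rng)
lists = build_ivf(points, centroids)

queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
recall_by_probe = {}
for n_probe in (1, 5, 20):
    hits = 0
    for q in queries:
        exact = set(sorted(range(len(points)), key=lambda i: math.dist(q, points[i]))[:10])
        approx = set(ivf_search(q, points, centroids, lists, k=10, n_probe=n_probe))
        hits += len(exact & approx)
    recall_by_probe[n_probe] = hits / (10 * len(queries))
    print(f"n_probe={n_probe:2d}  recall@10={recall_by_probe[n_probe]:.2f}")
```

Probing every cluster degenerates to exhaustive search and recovers the exact result; smaller `n_probe` values scan less data but can miss neighbors that fall just across a cluster boundary.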

3. LSH (Locality Sensitive Hashing)

Hash-based - fast but less accurate.

python
import numpy as np

class LSHIndex:
    def __init__(self, dim, n_hash_tables=10, n_hash_functions=20):
        self.dim = dim
        self.n_hash_tables = n_hash_tables
        self.hash_tables = [{} for _ in range(n_hash_tables)]
        
        # Random hyperplanes for hashing
        self.hyperplanes = [
            [np.random.randn(dim) for _ in range(n_hash_functions)]
            for _ in range(n_hash_tables)
        ]
    
    def hash_vector(self, vector, table_id):
        """Hash vector to binary code"""
        hash_code = []
        for hyperplane in self.hyperplanes[table_id]:
            # If dot product > 0, bit = 1, else bit = 0
            bit = 1 if np.dot(vector, hyperplane) > 0 else 0
            hash_code.append(str(bit))
        
        return ''.join(hash_code)
    
    def add(self, vector_id, vector):
        """Add to all hash tables"""
        for i, table in enumerate(self.hash_tables):
            hash_code = self.hash_vector(vector, i)
            if hash_code not in table:
                table[hash_code] = []
            table[hash_code].append((vector_id, vector))
    
    def search(self, query_vector, k=10):
        """Find similar by hash collision"""
        # Keyed by id: NumPy arrays are unhashable, so (id, vector)
        # tuples cannot be stored in a set
        candidates = {}
        
        # Check all hash tables
        for i, table in enumerate(self.hash_tables):
            hash_code = self.hash_vector(query_vector, i)
            for vid, vec in table.get(hash_code, []):
                candidates[vid] = vec
        
        # Rank candidates by exact similarity
        results = [
            (vid, cosine_similarity(query_vector, vec))
            for vid, vec in candidates.items()
        ]
        
        return sorted(results, key=lambda x: -x[1])[:k]
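Random-hyperplane hashing works because the chance that two vectors land on the same side of a random hyperplane depends only on the angle θ between them: P(same bit) = 1 - θ/π. The sketch below checks this empirically with the standard library. The helper name `random_hyperplane_bits` and the number of hyperplanes are illustrative choices, and the two vectors are the toy "puppy" and "car" vectors used earlier.

```python
import math
import random

def random_hyperplane_bits(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return [1 if sum(x * h for x, h in zip(vector, plane)) > 0 else 0
            for plane in hyperplanes]

rng = random.Random(0)
dim, n_planes = 3, 20_000
hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

puppy = [0.1, 0.9, 0.2]
car = [0.85, 0.05, 0.15]

bits_puppy = random_hyperplane_bits(puppy, hyperplanes)
bits_car = random_hyperplane_bits(car, hyperplanes)
observed = sum(a == b for a, b in zip(bits_puppy, bits_car)) / n_planes

dot = sum(x * y for x, y in zip(puppy, car))
cos = dot / (math.hypot(*puppy) * math.hypot(*car))
expected = 1 - math.acos(cos) / math.pi

print(f"observed bit agreement: {observed:.3f}")
print(f"1 - angle/pi:           {expected:.3f}")
```

With b bits per table, two vectors share a bucket with probability (1 - θ/π)^b, which is why more bits per table make buckets more selective and more tables are needed to keep recall up.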

Production Vector Databases

1. Pinecone (Managed)

python
import pinecone

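# Initialize (pinecone-client 2.x interface; releases from 3.0 on replace
# pinecone.init() with a Pinecone(api_key=...) client object)
pinecone.init(api_key="...", environment="us-west1-gcp")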

# Create index
pinecone.create_index(
    "my-index",
    dimension=1536,
    metric="cosine"
)

index = pinecone.Index("my-index")

# Insert vectors
index.upsert(vectors=[
    ("doc1", [0.1, 0.2, ...], {"text": "The quick brown fox"}),
    ("doc2", [0.3, 0.4, ...], {"text": "Lazy dog sleeps"}),
])

# Query
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=10,
    include_metadata=True
)

for match in results['matches']:
    print(f"Score: {match['score']}, Text: {match['metadata']['text']}")

2. Weaviate (Open Source)

python
import weaviate

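# weaviate-client v3 interface (the v4 client connects with
# weaviate.connect_to_local() and uses a collections API instead)
client = weaviate.Client("http://localhost:8080")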

# Create schema
schema = {
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]}
    ]
}

client.schema.create_class(schema)

# Insert
client.data_object.create(
    {"title": "AI News", "content": "Latest AI developments..."},
    "Article"
)

# Semantic search
result = client.query.get(
    "Article", ["title", "content"]
).with_near_text({
    "concepts": ["artificial intelligence breakthroughs"]
}).with_limit(10).do()

3. pgvector (PostgreSQL Extension)

sql
-- Install extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Create index (IVFFlat); build it after loading data so the lists
-- are trained on real vectors
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Insert
INSERT INTO documents (content, embedding)
VALUES ('AI is amazing', '[0.1, 0.2, ..., 0.5]');

-- Semantic search (<=> is cosine distance, so 1 - distance = cosine similarity)
SELECT content, 
       1 - (embedding <=> '[0.15, 0.25, ..., 0.45]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ..., 0.45]'
LIMIT 10;

Real-World Applications

1. ChatGPT/OpenAI (RAG Pattern)

python
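# Retrieval-Augmented Generation
from openai import OpenAI

client = OpenAI(api_key="...")

def openai_embed(text):
    """Embed text with the same model that was used to index the documents"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def answer_question(question):
    # vector_db is a placeholder for any client whose search(embedding, k)
    # returns objects with a .content attribute
    # 1. Convert question to embedding
    q_embedding = openai_embed(question)
    
    # 2. Find relevant docs from vectorDB
    relevant_docs = vector_db.search(q_embedding, k=5)
    
    # 3. Build context
    context = "\n".join(doc.content for doc in relevant_docs)
    
    # 4. Ask LLM with context
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return answer.choices[0].message.content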
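The retrieval half of this pattern can be exercised without any external service. The sketch below uses a hypothetical `InMemoryVectorDB` class (exact search over a Python list, standing in for a real vector database client) and hand-written three-dimensional "embeddings" in place of calls to an embedding model, then assembles the prompt in the same format as above. The class, its method names, and the sample documents are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Document:
    content: str
    embedding: list

class InMemoryVectorDB:
    """Hypothetical stand-in for a vector database client (exact search)."""

    def __init__(self):
        self.documents = []

    def add(self, content, embedding):
        self.documents.append(Document(content, embedding))

    def search(self, embedding, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        ranked = sorted(self.documents,
                        key=lambda d: cosine(embedding, d.embedding),
                        reverse=True)
        return ranked[:k]

def build_prompt(question, relevant_docs):
    """Assemble retrieved passages and the question into one prompt."""
    context = "\n".join(doc.content for doc in relevant_docs)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"

vector_db = InMemoryVectorDB()
vector_db.add("Puppies need several short walks a day.", [0.1, 0.9, 0.2])
vector_db.add("Dogs should be fed twice daily.", [0.12, 0.88, 0.21])
vector_db.add("Change your car's oil regularly.", [0.85, 0.05, 0.15])

question = "How do I look after my dog?"
q_embedding = [0.11, 0.89, 0.2]   # made up; a real system would call the embedding model
relevant_docs = vector_db.search(q_embedding, k=2)
prompt = build_prompt(question, relevant_docs)
print(prompt)
```

Only the two dog-related passages reach the prompt, which is the point of the retrieval step: the LLM sees a small, relevant context instead of the whole corpus.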

2. Recommendation Systems

python
# Find similar products
product_embedding = get_product_embedding(product_id)
similar_products = vector_db.search(product_embedding, k=20)

3. Image Search

python
# Text-to-image search (CLIP embeds text and images into the same space;
# `clip` here stands for a loaded CLIP model wrapper)
text = "sunset over mountains"
text_embedding = clip.encode_text(text)

# Search image database
matching_images = vector_db.search(text_embedding, k=10)

Interview Tips 💡

When discussing vector databases in interviews:

  1. Problem: "Traditional search can't find 'terrible' when user searches 'awful'..."
  2. Embeddings: "ML models convert text to vectors representing meaning..."
  3. Similarity: "Use cosine similarity to find semantically similar vectors..."
  4. ANN: "Can't compare against a billion vectors on every query, so use HNSW (roughly O(log N)) or IVF (scan only a few clusters)..."
  5. Use cases: "RAG for LLM applications, recommendations, semantic search..."
  6. Tools: "Pinecone for managed, Weaviate for self-hosted, pgvector for PostgreSQL..."
