Search by Meaning, Not Keywords
Traditional SQL:
  Query: "terrible experience"
  SELECT * FROM reviews WHERE text LIKE '%terrible%'
  Result: ✓ Found

  Query: "awful service"
  SELECT * FROM reviews WHERE text LIKE '%awful%'
  Result: ❌ Nothing (different words!)

Vector Database:
  Query: "awful service"
  Embedding: [0.82, 0.11, ...]
  Similar to: "terrible experience" [0.81, 0.09, ...]
  Distance: 0.05 (very close!)
  Result: ✓ Found by meaning, not keywords
What are Vector Databases?
Vector databases store and query high-dimensional vectors (embeddings) representing semantic meaning, enabling similarity search based on concepts rather than exact matches.
The Problem They Solve
Traditional databases match exact text. Vector databases match meaning.
User searches: "puppy"

Traditional:
- Matches: "puppy", "puppies"
- Misses: "dog", "canine", "pet"

Vector DB:
- Matches all related concepts
- "puppy" ≈ "dog" ≈ "pet" ≈ "canine"
- Based on semantic similarity
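The gap is easy to reproduce without a database. The following is a minimal sketch in which a Python substring test stands in for SQL LIKE; the review texts and the `keyword_search` helper are made up for illustration.

```python
# Hypothetical review texts, used only for illustration.
reviews = [
    "Cute puppy, arrived healthy",
    "Our dog loves this toy",
    "Great food for any canine",
    "Fast car, smooth ride",
]

def keyword_search(query, texts):
    """Case-insensitive substring match, similar in spirit to LIKE '%query%'."""
    q = query.lower()
    return [t for t in texts if q in t.lower()]

hits = keyword_search("puppy", reviews)
print(hits)  # ['Cute puppy, arrived healthy']
```

The reviews that mention "dog" and "canine" never surface for this query, even though a person would consider them relevant. Closing that gap requires comparing meaning rather than characters.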
Embeddings: Text → Numbers
Embedding models (such as OpenAI's text-embedding-3-small or BERT-based encoders) convert text or images into vectors.
from openai import OpenAI
client = OpenAI(api_key="...")
# Convert text to vector
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox"
)
embedding = response.data[0].embedding
# Result: [0.123, -0.456, 0.789, ..., 0.321]
# Dimension: 1536 (for text-embedding-3-small)
Embedding Examples
"puppy"      → [0.1, 0.9, 0.2, -0.1, ...]
"dog"        → [0.12, 0.88, 0.21, -0.09, ...]
"car"        → [0.85, 0.05, 0.15, 0.92, ...]
"automobile" → [0.83, 0.06, 0.14, 0.91, ...]

Notice:
- "puppy" and "dog" are CLOSE (similar meaning)
- "car" and "automobile" are CLOSE
- "puppy" and "car" are FAR (different concepts)
Similarity Measurement
1. Cosine Similarity
Most common for text embeddings.
import numpy as np

def cosine_similarity(v1, v2):
    """
    Measures the angle between vectors
    Range: [-1, 1]
    1 = identical direction
    0 = orthogonal
    -1 = opposite direction
    """
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# Example
puppy = np.array([0.1, 0.9, 0.2])
dog = np.array([0.12, 0.88, 0.21])
car = np.array([0.85, 0.05, 0.15])
print(cosine_similarity(puppy, dog))  # ~0.9996 (very similar!)
print(cosine_similarity(puppy, car))  # ~0.200 (different)
2. Euclidean Distance
Straight-line distance in vector space.
def euclidean_distance(v1, v2):
    """
    L2 distance
    Range: [0, ∞)
    0 = identical
    larger = more different
    """
    return np.linalg.norm(v1 - v2)

print(euclidean_distance(puppy, dog))  # ~0.03 (close)
print(euclidean_distance(puppy, car))  # ~1.13 (far)
3. Dot Product
def dot_product_similarity(v1, v2):
    """
    Cheaper than cosine: skips computing the norms at query time.
    Matches cosine similarity only when both vectors are
    already normalized to unit length.
    """
    return np.dot(v1, v2)
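The three measures agree once vectors are scaled to unit length. The sketch below reuses the toy puppy and car vectors from above (the `normalize` helper is introduced here for illustration) and checks two identities: the dot product equals cosine similarity, and squared Euclidean distance equals 2 - 2 * cosine.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

puppy = normalize(np.array([0.1, 0.9, 0.2]))
car = normalize(np.array([0.85, 0.05, 0.15]))

cosine = np.dot(puppy, car) / (np.linalg.norm(puppy) * np.linalg.norm(car))
dot = np.dot(puppy, car)
squared_l2 = np.sum((puppy - car) ** 2)

# For unit vectors: dot == cosine, and ||a - b||^2 == 2 - 2 * cosine.
print(np.isclose(dot, cosine))                 # True
print(np.isclose(squared_l2, 2 - 2 * cosine))  # True
```

Because of these identities, all three metrics produce the same ranking on normalized embeddings, so normalizing once at insert time lets an index use the cheapest one.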
The Billion-Vector Problem
Naive approach:
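def find_similar(query_vector, all_vectors, k=10):
    """Compare query to every vector - TOO SLOW at scale!"""
    similarities = []
    for i, vector in enumerate(all_vectors):  # 1 billion iterations!
        sim = cosine_similarity(query_vector, vector)
        similarities.append((i, sim))  # keep the index so results are identifiable
    # Return top k as (index, similarity) pairs
    return sorted(similarities, key=lambda x: -x[1])[:k]

# Time: O(N * D) where N = number of vectors, D = dimensions
# For 1B vectors of 1536 dims: ~1.5 trillion multiply-adds per query,
# on the order of hours in a pure-Python loop like this one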
Solution: Approximate Nearest Neighbor (ANN) algorithms
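Before reaching for ANN, it helps to have an exact baseline, both for small collections and as ground truth when measuring the recall of an approximate index. The following is a minimal sketch of a vectorized brute-force search in NumPy; `exact_top_k` and the random toy data are assumptions made for this example.

```python
import numpy as np

def exact_top_k(query, vectors, k=10):
    """Exact cosine top-k by scanning every row: O(N * D) per query."""
    # Normalize so that a dot product equals cosine similarity.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query
    # argpartition selects the k best in O(N); only those k are then sorted.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 64)).astype(np.float32)  # toy data
ids, scores = exact_top_k(vectors[42], vectors, k=5)
print(ids[0])  # 42: a vector is its own nearest neighbor

# Raw storage for 1 billion float32 vectors of 1536 dimensions, in terabytes.
print(1_000_000_000 * 1536 * 4 / 1e12)  # 6.144
```

At a billion vectors the scan is limited by memory as much as by arithmetic: the raw vectors alone occupy about 6 TB, which is why large deployments combine ANN indexes with compression or sharding.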
ANN Algorithms
1. HNSW (Hierarchical Navigable Small World)
Graph-based index - often the best speed/recall trade-off when the index fits in memory.
Concept: Multi-layer graph
- Layer 0: All vectors
- Layer 1: Subset (0.1x)
- Layer 2: Subset (0.01x)
- etc.

Search:
1. Start at top layer
2. Navigate to approximate region
3. Drop to lower layer
4. Refine search
5. Repeat until bottom layer

Time: roughly O(log N) per query
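The layer sizes come from a random draw made when each vector is inserted. The sketch below uses the level rule described in the HNSW paper, floor(-ln(U) * mL), with mL = 1 / ln(M) as the commonly recommended setting. The function name, the seed, and the sample size are assumptions for this example. With this rule each layer holds roughly 1/M of the layer below, so the 0.1x ratios above should be read as illustrative.

```python
import math
import random
from collections import Counter

def random_level(M, rng):
    """Draw a top layer for a new node: floor(-ln(U) * mL), with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(M)
    u = 1.0 - rng.random()  # in (0, 1], which avoids log(0)
    return int(-math.log(u) * m_l)

rng = random.Random(0)
M = 16
levels = Counter(random_level(M, rng) for _ in range(100_000))

for level in sorted(levels):
    # A node whose top layer is L is present in every layer 0..L.
    present = sum(count for lvl, count in levels.items() if lvl >= level)
    print(f"layer {level}: {present} nodes")
```

The exponential decay is what keeps the number of layers near log N: with M = 16, about one node in 16 reaches layer 1, one in 256 reaches layer 2, and so on.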
Implementation concept:
class HNSWIndex:
    # Conceptual outline only: get_random_layer, search_layer,
    # connect_nodes, and get_top_k are not defined here.
    def __init__(self, dim, M=16, ef_construction=200):
        self.dim = dim
        self.M = M  # Max connections per node
        self.ef_construction = ef_construction  # Candidate list size while building
        self.layers = []  # Multi-layer graph
        self.vectors = {}
        self.entry_point = None  # Node in the top layer where searches start

    def insert(self, vector_id, vector):
        """Insert vector into index"""
        # Determine the highest layer for this vector
        layer = self.get_random_layer()
        # Find nearest neighbors at each layer, from that layer down to 0
        for lvl in range(layer, -1, -1):
            neighbors = self.search_layer(vector, lvl, self.entry_point)
            self.connect_nodes(vector_id, neighbors, lvl)
        self.vectors[vector_id] = vector

    def search(self, query_vector, k=10):
        """Find k nearest neighbors"""
        # Start from top layer
        current = self.entry_point
        # Navigate down layers, reusing each result as the next entry point
        for layer in range(len(self.layers) - 1, -1, -1):
            current = self.search_layer(query_vector, layer, current)
        # Final candidates come from layer 0
        return self.get_top_k(current, k)
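The outline above leaves search_layer undefined. The following is a minimal sketch of the simplest version of that step: a greedy walk on one layer that hops to whichever neighbor is closest to the query and stops when none is closer. The grid graph, the function name, and its return value are assumptions chosen so the result can be checked against an exact scan. HNSW implementations typically track a list of candidates rather than a single node, and build their links from the data instead of a grid.

```python
import numpy as np

def greedy_search_layer(query, vectors, neighbors, entry):
    """Walk one graph layer greedily: hop to the closest neighbor until stuck."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    hops = 0
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = np.linalg.norm(vectors[n] - query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:  # no neighbor is closer: local minimum reached
            return current, hops
        current, current_dist = best, best_dist
        hops += 1

# Toy layer: a 20x20 grid of 2-D points, each linked to its 4 grid neighbors.
side = 20
vectors = np.array([[x, y] for x in range(side) for y in range(side)], dtype=float)
neighbors = {
    x * side + y: [
        (x + dx) * side + (y + dy)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if 0 <= x + dx < side and 0 <= y + dy < side
    ]
    for x in range(side)
    for y in range(side)
}

query = np.array([14.2, 3.9])
node, hops = greedy_search_layer(query, vectors, neighbors, entry=0)
exact = int(np.argmin(np.linalg.norm(vectors - query, axis=1)))
print(node, hops, exact)  # 284 18 284
```

On this single layer the walk needs 18 short hops to cross the grid. The upper layers of HNSW exist to shorten that trip: they contain few nodes with long-range links, so the search covers most of the distance before it reaches layer 0.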
2. IVF (Inverted File Index)
Clustering-based - good for massive scale.
1. Cluster vectors into N groups (k-means)
2. Query:
   a. Find nearest cluster centroids
   b. Search only those clusters
   c. Return top results

Speedup: Only search 1-5% of vectors
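How much of the collection a query touches depends on the ratio between the number of clusters probed and the total number of clusters. The following is a back-of-the-envelope sketch, assuming equally sized clusters (real clusters are uneven); `ivf_scan_fraction` and the one-million-vector collection are made up for illustration.

```python
def ivf_scan_fraction(n_vectors, n_clusters, n_probe):
    """Estimate per-query work for IVF, assuming equally sized clusters."""
    centroid_comparisons = n_clusters              # step a: rank all centroids
    vectors_scanned = n_vectors * n_probe // n_clusters  # step b: scan probed clusters
    fraction = (centroid_comparisons + vectors_scanned) / n_vectors
    return vectors_scanned, fraction

for n_probe in (1, 5, 10, 50):
    scanned, fraction = ivf_scan_fraction(1_000_000, 1000, n_probe)
    print(f"n_probe={n_probe:>2}: scans {scanned:>6} vectors, "
          f"{fraction:.2%} of brute-force work")
```

Raising the number of probed clusters increases recall, because a true neighbor may sit just across a cluster boundary, but the cost grows linearly with it. That single parameter is the main speed/accuracy dial for IVF.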
Implementation:
from sklearn.cluster import KMeans

class IVFIndex:
    def __init__(self, dim, n_clusters=1000):
        self.dim = dim
        self.n_clusters = n_clusters
        self.centroids = None
        self.clusters = {i: [] for i in range(n_clusters)}

    def train(self, training_vectors):
        """Cluster vectors with k-means"""
        # k-means groups by Euclidean distance; normalize the vectors first
        # if results are ranked by cosine similarity, as they are below
        kmeans = KMeans(n_clusters=self.n_clusters)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_

    def find_nearest_cluster(self, vector):
        """Return the id of the centroid most similar to the vector"""
        scores = [cosine_similarity(vector, c) for c in self.centroids]
        return int(np.argmax(scores))

    def add(self, vector_id, vector):
        """Add vector to nearest cluster"""
        cluster_id = self.find_nearest_cluster(vector)
        self.clusters[cluster_id].append((vector_id, vector))

    def search(self, query_vector, k=10, n_probe=5):
        """Search n_probe nearest clusters"""
        # Find nearest cluster centroids
        cluster_scores = [
            (i, cosine_similarity(query_vector, centroid))
            for i, centroid in enumerate(self.centroids)
        ]
        top_clusters = sorted(cluster_scores, key=lambda x: -x[1])[:n_probe]
        # Search those clusters
        candidates = []
        for cluster_id, _ in top_clusters:
            for vector_id, vector in self.clusters[cluster_id]:
                sim = cosine_similarity(query_vector, vector)
                candidates.append((vector_id, sim))
        # Return top k
        return sorted(candidates, key=lambda x: -x[1])[:k]
3. LSH (Locality Sensitive Hashing)
Hash-based - fast but less accurate.
import numpy as np

class LSHIndex:
    def __init__(self, dim, n_hash_tables=10, n_hash_functions=20):
        self.dim = dim
        self.n_hash_tables = n_hash_tables
        self.hash_tables = [{} for _ in range(n_hash_tables)]
        # Random hyperplanes for hashing
        self.hyperplanes = [
            [np.random.randn(dim) for _ in range(n_hash_functions)]
            for _ in range(n_hash_tables)
        ]

    def hash_vector(self, vector, table_id):
        """Hash vector to binary code"""
        hash_code = []
        for hyperplane in self.hyperplanes[table_id]:
            # If dot product > 0, bit = 1, else bit = 0
            bit = 1 if np.dot(vector, hyperplane) > 0 else 0
            hash_code.append(str(bit))
        return ''.join(hash_code)

    def add(self, vector_id, vector):
        """Add to all hash tables"""
        for i, table in enumerate(self.hash_tables):
            hash_code = self.hash_vector(vector, i)
            if hash_code not in table:
                table[hash_code] = []
            table[hash_code].append((vector_id, vector))

    def search(self, query_vector, k=10):
        """Find similar by hash collision"""
        # Keyed by vector_id: NumPy arrays are unhashable, so collecting
        # (id, vector) tuples in a set would raise TypeError
        candidates = {}
        # Check all hash tables
        for i, table in enumerate(self.hash_tables):
            hash_code = self.hash_vector(query_vector, i)
            for vid, vec in table.get(hash_code, []):
                candidates[vid] = vec
        # Rank candidates by exact similarity
        results = [
            (vid, cosine_similarity(query_vector, vec))
            for vid, vec in candidates.items()
        ]
        return sorted(results, key=lambda x: -x[1])[:k]
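Random-hyperplane hashing works because the chance that two vectors receive the same bit depends only on the angle θ between them: it equals 1 - θ/π. The sketch below estimates that probability empirically and compares it with the formula. The vectors, the seed, and the number of hyperplanes are arbitrary choices for this example; the 20 bits and 10 tables match the defaults in the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 20_000

a = rng.standard_normal(dim)
b = a + 0.5 * rng.standard_normal(dim)  # noisy copy of a, so the angle is small

# One random hyperplane per row; each yields one hash bit per vector.
planes = rng.standard_normal((n_planes, dim))
bits_a = planes @ a > 0
bits_b = planes @ b > 0

observed = np.mean(bits_a == bits_b)
theta = np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
expected = 1 - theta / np.pi

print(f"observed bit agreement: {observed:.3f}")
print(f"1 - theta/pi:           {expected:.3f}")

# Chance that all 20 bits match in one table, and in at least one of 10 tables.
p_table = expected ** 20
p_any = 1 - (1 - p_table) ** 10
print(f"same bucket in one table: {p_table:.3f}, in any of 10: {p_any:.3f}")
```

More bits per table make buckets purer but shrink the chance that true neighbors share one. More tables win back recall at the cost of memory. This tension is why LSH is described as fast but less accurate.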
Production Vector Databases
1. Pinecone (Managed)
# Legacy pinecone-client v2 syntax. Client v3 and later removed
# pinecone.init() in favor of a Pinecone(api_key=...) client object.
import pinecone

# Initialize
pinecone.init(api_key="...", environment="us-west1-gcp")

# Create index
pinecone.create_index(
    "my-index",
    dimension=1536,
    metric="cosine"
)
index = pinecone.Index("my-index")

# Insert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    ("doc1", [0.1, 0.2, ...], {"text": "The quick brown fox"}),
    ("doc2", [0.3, 0.4, ...], {"text": "Lazy dog sleeps"}),
])

# Query
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=10,
    include_metadata=True
)
for match in results['matches']:
    print(f"Score: {match['score']}, Text: {match['metadata']['text']}")
2. Weaviate (Open Source)
# weaviate-client v3 syntax. The v4 client uses a different API
# (for example, weaviate.connect_to_local()).
import weaviate

client = weaviate.Client("http://localhost:8080")

# Create schema
schema = {
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]}
    ]
}
client.schema.create_class(schema)

# Insert (the configured vectorizer embeds the object on the server)
client.data_object.create(
    {"title": "AI News", "content": "Latest AI developments..."},
    "Article"
)

# Semantic search
result = client.query.get(
    "Article", ["title", "content"]
).with_near_text({
    "concepts": ["artificial intelligence breakthroughs"]
}).with_limit(10).do()
3. pgvector (PostgreSQL Extension)
-- Install extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Create index (IVF)
CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Insert
INSERT INTO documents (content, embedding)
VALUES ('AI is amazing', '[0.1, 0.2, ..., 0.5]');

-- Semantic search (<=> is cosine distance, so 1 - distance = similarity)
SELECT content,
       1 - (embedding <=> '[0.15, 0.25, ..., 0.45]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ..., 0.45]'
LIMIT 10;
Real-World Applications
1. ChatGPT/OpenAI (RAG Pattern)
# Retrieval-Augmented Generation
# openai_embed and vector_db are placeholders for an embedding helper and
# a vector store; client is the OpenAI client created earlier.
def answer_question(question):
    # 1. Convert question to embedding
    q_embedding = openai_embed(question)
    # 2. Find relevant docs from the vector DB
    relevant_docs = vector_db.search(q_embedding, k=5)
    # 3. Build context
    context = "\n".join(doc.content for doc in relevant_docs)
    # 4. Ask LLM with context
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return answer.choices[0].message.content
2. Recommendation Systems
# Find similar products
product_embedding = get_product_embedding(product_id)
similar_products = vector_db.search(product_embedding, k=20)
3. Image Search
# Text-to-image search (CLIP embeddings)
text = "sunset over mountains"
text_embedding = clip.encode_text(text)
# Search image database
matching_images = vector_db.search(text_embedding, k=10)
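The snippets in this section all call vector_db.search(...) without defining it. For experimenting locally, a small in-memory stand-in is enough. The following is a minimal sketch, not any product's API: the `InMemoryVectorDB` class, the `Doc` record, and the toy 3-dimensional embeddings are assumptions made for this example.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Doc:
    content: str
    score: float

class InMemoryVectorDB:
    """Exact cosine search over a small in-memory collection (illustrative only)."""

    def __init__(self):
        self._contents = []
        self._vectors = []

    def add(self, content, embedding):
        v = np.asarray(embedding, dtype=float)
        self._vectors.append(v / np.linalg.norm(v))  # store unit vectors
        self._contents.append(content)

    def search(self, embedding, k=5):
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self._vectors) @ q  # dot product of unit vectors = cosine
        order = np.argsort(-scores)[:k]
        return [Doc(self._contents[i], float(scores[i])) for i in order]

vector_db = InMemoryVectorDB()
# Toy 3-d embeddings reused from the similarity section, not real model output.
vector_db.add("puppy", [0.1, 0.9, 0.2])
vector_db.add("dog", [0.12, 0.88, 0.21])
vector_db.add("car", [0.85, 0.05, 0.15])

results = vector_db.search([0.11, 0.9, 0.2], k=2)
print([doc.content for doc in results])  # ['puppy', 'dog']
```

Each result exposes a content attribute, matching what the RAG example reads. Swapping this class for a production client changes the storage and indexing, while the retrieve-then-use flow stays the same.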
Interview Tips 💡
When discussing vector databases in interviews:
- Problem: "Traditional search can't find 'terrible' when user searches 'awful'..."
- Embeddings: "ML models convert text to vectors representing meaning..."
- Similarity: "Use cosine similarity to find semantically similar vectors..."
- ANN: "Can't compare against a billion vectors per query - use HNSW (roughly O(log N) search) or IVF (scans only a few clusters)..."
- Use cases: "RAG for ChatGPT, recommendations, semantic search..."
- Tools: "Pinecone for managed, Weaviate for self-hosted, pgvector for PostgreSQL..."
Related Concepts
- Embeddings — Vector representations
- Semantic Search — Meaning-based retrieval
- RAG — Retrieval-Augmented Generation
- ANN Algorithms — Approximate nearest neighbor
- LLM Applications — AI system architecture