Why Indexing Matters
An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space.
Without an index, the database must perform a Full Table Scan (), checking every row to find matches. With an index, lookup becomes .
B-Tree Structure
Most relational databases (PostgreSQL, MySQL, Oracle) use B-Trees (specifically B+ Trees) for indexing.
Key Characteristics
- Balanced: All leaf nodes are at the same depth.
- Sorted: Data is kept capable of range queries.
- High Fanout: Nodes have many children (unlike binary trees), minimizing disk I/O.
graph TD
Root[Root Node] --> I1[Internal Node 1]
Root --> I2[Internal Node 2]
I1 --> L1[Leaf: 1-10]
I1 --> L2[Leaf: 11-20]
I2 --> L3[Leaf: 21-30]
I2 --> L4[Leaf: 31-40]
Clustered vs Non-Clustered
Understanding this distinction is the single most important concept in physical database design.
| Feature | Clustered Index | Non-Clustered (Secondary) Index |
|---|---|---|
| Data Storage | Leaf nodes store the actual row data | Leaf nodes store a pointer (or PK) to the row |
| Quantity | Only 1 per table (organizes the table itself) | Many per table |
| Size | Large (contains all columns) | Small (contains indexed cols + PK) |
| Lookup | Direct access | 2 steps: Index Lookup → Key Lookup (in Clustered) |
graph LR
subgraph NonClustered [Secondary Index (Name)]
direction TB
IdxNode[Index Node: 'Alice'] --> Ptr[Pointer to PK: 101]
end
subgraph Clustered [Clustered Index (Primary Key)]
direction TB
PK101[PK: 101] --> RowData["{ID: 101, Name: 'Alice', Age: 30}"]
end
Ptr -.-> PK101
[!TIP] Primary Keys are clustered indexes by default in MySQL (InnoDB) and SQL Server. In PostgreSQL, all indexes are secondary (heap table organization), though
CLUSTERcommand exists.
Composite Indexes
An index on multiple columns: CREATE INDEX idx_name_age ON users(last_name, age).
Leftmost Prefix Rule
The order matters! The database can use the index if the query uses the columns from left to right.
Given Index on (A, B, C):
| Query Condition | Uses Index? |
|---|---|
WHERE A=1 | ✅ Yes |
WHERE A=1 AND B=2 | ✅ Yes |
WHERE A=1 AND B=2 AND C=3 | ✅ Yes |
WHERE B=2 | ❌ No (skipped A) |
WHERE A=1 AND C=3 | ⚠️ Partially (only for A) |
Visualizing Multi-Column Sort
Think of a phone book. It's sorted by Last Name, then First Name.
- Finding "Smith" is easy.
- Finding "John" (without knowing Last Name) is impossible without scanning everything.
Covering Indexes
A super-optimization. If the index contains all the columns needed for the query, the DB doesn't need to look up the table row at all!
Query: SELECT age FROM users WHERE last_name = 'Smith'
Index: (last_name, age)
The DB finds 'Smith' in the index, sees 'age' is already there, and returns it. Zero table access.
Indexing Strategies
1. Cardinality
Index columns with high cardinality (many unique values like UserID, Email).
Avoid indexing low cardinality columns like Gender or IsActive (boolean). The B-Tree won't save much time vs a scan.
2. Is NULL
B-Trees handle NULLs differently per DB.
- Oracle: NULLs are not indexed.
- PostgreSQL/MySQL/SQL Server: NULLs are indexed.
3. Write Penalty
Every INSERT, UPDATE, DELETE requires updating all indexes.
- Too many indexes = Slow writes.
- Too few indexes = Slow reads.
Code Examples
SQL: Analyzing Execution Plans
-- 1. Create Table
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255),
age INT,
created_at TIMESTAMP
);
-- 2. Populate 1M rows...
-- 3. Analyze Query without Index
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Result: Seq Scan on users (cost=0.00..18334.00 rows=1)
-- Time: 150ms
-- 4. Create Index
CREATE INDEX idx_email ON users(email);
-- 5. Analyze Again
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Result: Index Scan using idx_email on users
-- Time: 0.5ms (300x faster!)
Hash Index (Key-Value)
Good for equality checks (=), useless for ranges (>, <).
Structure: Hash Table. lookup.
CREATE INDEX idx_session ON sessions USING HASH (session_id);
Interview Tips 💡
- "How does a B-Tree work?" — Explain "Balanced, Sorted, High Fanout". Differentiate from Binary Tree (depth).
- "What is the difference between Clustered and Non-Clustered?" — Clustered IS the table. Non-clustered points to the table.
- "What is a Composite Index?" — Explain the Leftmost Prefix Rule.
- "Why not index every column?" — Write amplification. Storage cost.
- "LSM Tree vs B-Tree?" — LSM (Cassandra, RocksDB) is faster for writes (append-only). B-Tree is faster for reads.
Related Concepts
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Consistent Hashing
How to add/remove servers without moving every single key. The Ring, Virtual Nodes, and real-world usage in Cassandra, DynamoDB, and Discord.
Database Sharding
How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.
Vector Databases (AI/Embeddings)
Complete guide to vector databases optimized for AI embeddings and semantic search, covering vector similarity, approximate nearest neighbor algorithms (HNSW, IVF), and production implementations in Pinecone, Milvus, Weaviate, and pgvector powering LLM applications.