Why Indexing Matters

An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space.

Without an index, the database must perform a Full Table Scan ( $O(N)$ ), checking every row to find matches. With an index, lookup becomes $O(\log N)$ .

B-Tree Structure

Most relational databases (PostgreSQL, MySQL, Oracle) use B-Trees (specifically B+ Trees) for indexing.

Key Characteristics

Balanced: All leaf nodes are at the same depth.
Sorted: Data is kept capable of range queries.
High Fanout: Nodes have many children (unlike binary trees), minimizing disk I/O.

mermaid

graph TD
    Root[Root Node] --> I1[Internal Node 1]
    Root --> I2[Internal Node 2]
    
    I1 --> L1[Leaf: 1-10]
    I1 --> L2[Leaf: 11-20]
    
    I2 --> L3[Leaf: 21-30]
    I2 --> L4[Leaf: 31-40]

Click to expand code...

Clustered vs Non-Clustered

Understanding this distinction is the single most important concept in physical database design.

Feature	Clustered Index	Non-Clustered (Secondary) Index
Data Storage	Leaf nodes store the actual row data	Leaf nodes store a pointer (or PK) to the row
Quantity	Only 1 per table (organizes the table itself)	Many per table
Size	Large (contains all columns)	Small (contains indexed cols + PK)
Lookup	Direct access	2 steps: Index Lookup → Key Lookup (in Clustered)

mermaid

graph LR
    subgraph NonClustered [Secondary Index (Name)]
        direction TB
        IdxNode[Index Node: 'Alice'] --> Ptr[Pointer to PK: 101]
    end
    
    subgraph Clustered [Clustered Index (Primary Key)]
        direction TB
        PK101[PK: 101] --> RowData["{ID: 101, Name: 'Alice', Age: 30}"]
    end
    
    Ptr -.-> PK101

Click to expand code...

[!TIP] Primary Keys are clustered indexes by default in MySQL (InnoDB) and SQL Server. In PostgreSQL, all indexes are secondary (heap table organization), though CLUSTER command exists.

Composite Indexes

An index on multiple columns: CREATE INDEX idx_name_age ON users(last_name, age).

Leftmost Prefix Rule

The order matters! The database can use the index if the query uses the columns from left to right.

Given Index on (A, B, C):

Query Condition	Uses Index?
`WHERE A=1`	✅ Yes
`WHERE A=1 AND B=2`	✅ Yes
`WHERE A=1 AND B=2 AND C=3`	✅ Yes
`WHERE B=2`	❌ No (skipped A)
`WHERE A=1 AND C=3`	⚠️ Partially (only for A)

Visualizing Multi-Column Sort

Think of a phone book. It's sorted by Last Name, then First Name.

Finding "Smith" is easy.
Finding "John" (without knowing Last Name) is impossible without scanning everything.

Covering Indexes

A super-optimization. If the index contains all the columns needed for the query, the DB doesn't need to look up the table row at all!

Query: SELECT age FROM users WHERE last_name = 'Smith' Index: (last_name, age)

The DB finds 'Smith' in the index, sees 'age' is already there, and returns it. Zero table access.

Indexing Strategies

1. Cardinality

Index columns with high cardinality (many unique values like UserID, Email). Avoid indexing low cardinality columns like Gender or IsActive (boolean). The B-Tree won't save much time vs a scan.

2. Is NULL

B-Trees handle NULLs differently per DB.

Oracle: NULLs are not indexed.
PostgreSQL/MySQL/SQL Server: NULLs are indexed.

3. Write Penalty

Every INSERT, UPDATE, DELETE requires updating all indexes.

Too many indexes = Slow writes.
Too few indexes = Slow reads.

Code Examples

SQL: Analyzing Execution Plans

sql

-- 1. Create Table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    age INT,
    created_at TIMESTAMP
);

-- 2. Populate 1M rows...

-- 3. Analyze Query without Index
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Result: Seq Scan on users (cost=0.00..18334.00 rows=1)
-- Time: 150ms

-- 4. Create Index
CREATE INDEX idx_email ON users(email);

-- 5. Analyze Again
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Result: Index Scan using idx_email on users
-- Time: 0.5ms (300x faster!)

Click to expand code...

Hash Index (Key-Value)

Good for equality checks (=), useless for ranges (>, <). Structure: Hash Table. $O(1)$ lookup.

sql

CREATE INDEX idx_session ON sessions USING HASH (session_id);

Click to expand code...

Interview Tips 💡

"How does a B-Tree work?" — Explain "Balanced, Sorted, High Fanout". Differentiate from Binary Tree (depth).
"What is the difference between Clustered and Non-Clustered?" — Clustered IS the table. Non-clustered points to the table.
"What is a Composite Index?" — Explain the Leftmost Prefix Rule.
"Why not index every column?" — Write amplification. Storage cost.
"LSM Tree vs B-Tree?" — LSM (Cassandra, RocksDB) is faster for writes (append-only). B-Tree is faster for reads.

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Advanced

Consistent Hashing

How to add/remove servers without moving every single key. The Ring, Virtual Nodes, and real-world usage in Cassandra, DynamoDB, and Discord.

DatabasesDistributed SystemsAlgorithms

Advanced

Database Sharding

How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.

DatabasesScalabilitySystem Design

Advanced

Vector Databases (AI/Embeddings)

Complete guide to vector databases optimized for AI embeddings and semantic search, covering vector similarity, approximate nearest neighbor algorithms (HNSW, IVF), and production implementations in Pinecone, Milvus, Weaviate, and pgvector powering LLM applications.

AIDatabasesSearch

Database Indexing

Why Indexing Matters

B-Tree Structure

Key Characteristics

Clustered vs Non-Clustered

Composite Indexes

Leftmost Prefix Rule

Visualizing Multi-Column Sort

Covering Indexes

Indexing Strategies

1. Cardinality

2. Is NULL

3. Write Penalty

Code Examples

SQL: Analyzing Execution Plans

Hash Index (Key-Value)

Interview Tips 💡

Related Concepts

About ScaleWiki

Related Articles

Consistent Hashing

Database Sharding

Vector Databases (AI/Embeddings)