Back to All Concepts
DatabasesPerformanceSQLData StructuresAdvanced

Database Indexing

Deep dive into database indexing internals. How B-Trees work, Clustered vs Non-Clustered indexes, Composite Index best practices, and covering indexes.

Why Indexing Matters

An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space.

Without an index, the database must perform a Full Table Scan (O(N)O(N)), checking every row to find matches. With an index, lookup becomes O(logN)O(\log N).

B-Tree Structure

Most relational databases (PostgreSQL, MySQL, Oracle) use B-Trees (specifically B+ Trees) for indexing.

Key Characteristics

  • Balanced: All leaf nodes are at the same depth.
  • Sorted: Data is kept capable of range queries.
  • High Fanout: Nodes have many children (unlike binary trees), minimizing disk I/O.
mermaid
graph TD
    Root[Root Node] --> I1[Internal Node 1]
    Root --> I2[Internal Node 2]
    
    I1 --> L1[Leaf: 1-10]
    I1 --> L2[Leaf: 11-20]
    
    I2 --> L3[Leaf: 21-30]
    I2 --> L4[Leaf: 31-40]
Click to expand code...

Clustered vs Non-Clustered

Understanding this distinction is the single most important concept in physical database design.

FeatureClustered IndexNon-Clustered (Secondary) Index
Data StorageLeaf nodes store the actual row dataLeaf nodes store a pointer (or PK) to the row
QuantityOnly 1 per table (organizes the table itself)Many per table
SizeLarge (contains all columns)Small (contains indexed cols + PK)
LookupDirect access2 steps: Index Lookup → Key Lookup (in Clustered)
mermaid
graph LR
    subgraph NonClustered [Secondary Index (Name)]
        direction TB
        IdxNode[Index Node: 'Alice'] --> Ptr[Pointer to PK: 101]
    end
    
    subgraph Clustered [Clustered Index (Primary Key)]
        direction TB
        PK101[PK: 101] --> RowData["{ID: 101, Name: 'Alice', Age: 30}"]
    end
    
    Ptr -.-> PK101
Click to expand code...

[!TIP] Primary Keys are clustered indexes by default in MySQL (InnoDB) and SQL Server. In PostgreSQL, all indexes are secondary (heap table organization), though CLUSTER command exists.

Composite Indexes

An index on multiple columns: CREATE INDEX idx_name_age ON users(last_name, age).

Leftmost Prefix Rule

The order matters! The database can use the index if the query uses the columns from left to right.

Given Index on (A, B, C):

Query ConditionUses Index?
WHERE A=1✅ Yes
WHERE A=1 AND B=2✅ Yes
WHERE A=1 AND B=2 AND C=3✅ Yes
WHERE B=2❌ No (skipped A)
WHERE A=1 AND C=3⚠️ Partially (only for A)

Visualizing Multi-Column Sort

Think of a phone book. It's sorted by Last Name, then First Name.

  • Finding "Smith" is easy.
  • Finding "John" (without knowing Last Name) is impossible without scanning everything.

Covering Indexes

A super-optimization. If the index contains all the columns needed for the query, the DB doesn't need to look up the table row at all!

Query: SELECT age FROM users WHERE last_name = 'Smith' Index: (last_name, age)

The DB finds 'Smith' in the index, sees 'age' is already there, and returns it. Zero table access.

Indexing Strategies

1. Cardinality

Index columns with high cardinality (many unique values like UserID, Email). Avoid indexing low cardinality columns like Gender or IsActive (boolean). The B-Tree won't save much time vs a scan.

2. Is NULL

B-Trees handle NULLs differently per DB.

  • Oracle: NULLs are not indexed.
  • PostgreSQL/MySQL/SQL Server: NULLs are indexed.

3. Write Penalty

Every INSERT, UPDATE, DELETE requires updating all indexes.

  • Too many indexes = Slow writes.
  • Too few indexes = Slow reads.

Code Examples

SQL: Analyzing Execution Plans

sql
-- 1. Create Table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    age INT,
    created_at TIMESTAMP
);

-- 2. Populate 1M rows...

-- 3. Analyze Query without Index
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Result: Seq Scan on users (cost=0.00..18334.00 rows=1)
-- Time: 150ms

-- 4. Create Index
CREATE INDEX idx_email ON users(email);

-- 5. Analyze Again
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Result: Index Scan using idx_email on users
-- Time: 0.5ms (300x faster!)
Click to expand code...

Hash Index (Key-Value)

Good for equality checks (=), useless for ranges (>, <). Structure: Hash Table. O(1)O(1) lookup.

sql
CREATE INDEX idx_session ON sessions USING HASH (session_id);
Click to expand code...

Interview Tips 💡

  • "How does a B-Tree work?" — Explain "Balanced, Sorted, High Fanout". Differentiate from Binary Tree (depth).
  • "What is the difference between Clustered and Non-Clustered?" — Clustered IS the table. Non-clustered points to the table.
  • "What is a Composite Index?" — Explain the Leftmost Prefix Rule.
  • "Why not index every column?" — Write amplification. Storage cost.
  • "LSM Tree vs B-Tree?" — LSM (Cassandra, RocksDB) is faster for writes (append-only). B-Tree is faster for reads.

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles