Back to All Concepts
DatabaseData StructuresStorage EnginePerformancePro

LSM Trees (Database Internals)

Log-Structured Merge Trees: The data structure powering write-heavy databases like Cassandra, RocksDB, and DynamoDB.

Why B-Trees are Slow for Writes

Most relational databases (PostgreSQL, MySQL) use B+ Trees.

  • Read: Fast
  • Write: Slow. A write requires finding the correct page on disk and modifying it. This is Random I/O.

To handle millions of writes per second (e.g., IoT data, Chat logs), we need a data structure optimized for Sequential I/O. Enter the LSM Tree (Log-Structured Merge Tree).

Components

MemTable (In-Memory)

When a write comes in (PUT key=value), it is written to an in-memory structure called the MemTable.

  • Structure: Usually a Red-Black Tree or SkipList.
  • Why?: Keeps data sorted in RAM. Writes are fast (O(logN)O(log N) memory op).
  • WAL: To prevent data loss if power fails, we write to a sequential Write Ahead Log on disk first.

SSTable (Sorted String Table) (On-Disk)

When the MemTable gets big (e.g., 64MB), it is flushed to disk as an SSTable.

  • Immutable: Once written, an SSTable is never modified.
  • Sorted: Keys are sorted, allowing binary search scanning.

Bloom Filter

Problem: If we have 100 SSTables on disk, do we search all of them for a key? Solution: Every SSTable has a Bloom Filter.

  • It tells us "The key is definitely NOT in this file" or "Maybe in this file".
  • Saves 99% of unnecessary disk seeks.

The Write Path

  • Append to WAL (for durability).
  • Insert into MemTable (RAM).
  • Ack 200 OK.
  • Background: If MemTable > Limit, flush to new SSTable.

The Read Path

  • Check MemTable (Is it in RAM?).
  • Check Block Cache (Is it in LRU Cache?).
  • Check Bloom Filter of SSTable L0 (Most recent).
  • Check Bloom Filter of SSTable L... until found

Note: "Read Amplification". Reads are slower than writes because we check multiple levels.

Compaction (The Cleanup)

Over time, you have duplicates.

  • SSTable: Key A = 10
  • SSTable 5: Key A = 20 (Newer)

To save space and speed up reads, we run Compaction.

  • Read multiple SSTables.
  • Merge them into one new larger SSTable (like Merge Sort).
  • Discard old values and Deleted keys (Tombstones).

Strategies

  • Size-Tiered: Optimized for Write throughput (Cassandra default). Fast writes slower reads.
  • Leveled: Optimized for Read throughput (LevelDB/RocksDB). More aggressive merging means fewer files to search.

B-Tree vs LSM Tree

FeatureB-Tree (MySQL)LSM Tree (Cassandra)
Write SpeedSlow (Random I/O)Fast (Sequential I/O)
Read SpeedFastSlower (Check multiple levels)
SpaceFragmentationCompact (Compressed)
Use CaseFinancial, SQLLogs, Time-Series, Chat

Production Examples

RocksDB (Meta/Facebook)

Used for social graph storage and MyRocks MySQL storage engine. Handles 100,000+ writes/sec per instance.

Apache Cassandra

Powers Netflix recommendations, Apple iCloud, Instagram messages. 1,000+ node clusters handling millions of writes/sec.

DynamoDB (AWS)

Uses LSM-based storage engine optimized for consistent single-digit millisecond latency. Handles 10+ million requests/sec.

Summary

  • MemTable: Buffers writes in RAM (Sorted).
  • SSTable: Immutable files on disk (Sorted).
  • Bloom Filter: Optimization to avoid disk reads.
  • Compaction: Background merge sort to cleanup garbage.
  • Use Case: Write-heavy workloads like time-series, logs, analytics

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles