The Durability Problem
When you COMMIT a transaction, the database promises that your data is safe. But writing data to disk is slow (random I/O). Writing to memory is fast but volatile (lost on power failure).
Write-Ahead Logging (WAL) solves this by writing changes to a sequential log file on disk before applying them to the main data files.
[!IMPORTANT] Rule: No modification goes to the data file until it has been written to the WAL file.
How WAL Works
- Transaction Start: User begins modifying data.
- Log Record: DB creates a log entry describing the change (e.g., "Page 42, Offset 10: Change 'A' to 'B'").
- Append to WAL: This small record is appended to the
wal_bufferin memory. - Commit: User commits.
- Flush: The
wal_bufferisfsync()ed to disk. Transaction is now durable. - Apply to Data: Later (background process), the change is written to the actual data file (checkpointing).
Why is this faster?
Append-only writes are sequential I/O, which is significantly faster than the random I/O required to update data pages scattered across the disk.
Crash Recovery
What happens if the server crashes right after Step 5 but before Step 6?
- Restart: DB starts up and sees the data file is old.
- Replay: It reads the WAL from the last Checkpoint.
- Redo: It reapplies all changes found in the WAL to the data pages.
- Consistent State: The database is restored to the exact state at the moment of the crash.
sequenceDiagram
participant C as Client
participant W as WAL (Disk)
participant M as Memory Buffer
participant D as Data File (Disk)
C->>M: UPDATE accounts SET balance = 100
M->>W: Append Log Record (Sequential Write)
W-->>C: ACK (Commit Successful)
Note over C,D: --- Crash Here! ---
M--x D: (Data never reached disk)
Note over W: Recovery Process
W->>D: Replay Log (Redo)
D-->>C: Data Restored!
Key Concepts
LSN (Log Sequence Number)
Every WAL record has a unique ID, the LSN. It strictly increases. Data pages store the LSN of the last change that modified them.
- If
PageLSN < WAL_LSN, the page is stale and must be updated. - If
PageLSN >= WAL_LSN, the page is up-to-date.
Checkpointing
The WAL grows forever. We can't replay everything on startup. Checkpointing is the process of:
- Flushing all dirty pages from memory to the data file.
- Recording a "Checkpoint" record in the WAL.
- Deleting/Archiving old WAL segments before this point.
Tradeoff:
- Frequent Checkpoints: Fast recovery, but slower runtime performance (more disk I/O).
- Infrequent Checkpoints: Fast runtime, but slow recovery (scanning GBs of logs).
WAL in Replication
WAL isn't just for crash recovery. It's the standard way to replicate data.
Primary-Replica Replication:
- Primary writes to local WAL.
- Primary sends WAL records to Replica.
- Replica applies WAL records to its own data files.
This is basically "Continuous Recovery" on the Replica.
Code Example: Simplified WAL Implementation
import os
import struct
class WriteAheadLog:
def __init__(self, log_file="db.wal"):
self.log_file = log_file
self.file = open(self.log_file, "ab") # Append binary mode
def log_change(self, transaction_id, key, value):
# Create a log record: (TransID, KeyLen, Key, ValLen, Value)
# Using struct to pack binary data
key_bytes = key.encode('utf-8')
val_bytes = value.encode('utf-8')
record = struct.pack(
f"I I {len(key_bytes)}s I {len(val_bytes)}s",
transaction_id,
len(key_bytes),
key_bytes,
len(val_bytes),
val_bytes
)
# 1. Write to OS buffer
self.file.write(record)
# 2. Force flush to disk (fsync) - Critical for durability!
self.file.flush()
os.fsync(self.file.fileno())
print(f"Logged Transaction {transaction_id}: {key}={value}")
def close(self):
self.file.close()
# Usage
wal = WriteAheadLog()
wal.log_change(101, "user:1:balance", "500")
# If power fails now, we can recover transaction 101 from "db.wal"
Production Tuning
PostgreSQL wal_sync_method
fsync: Default. Safe. Flushes to disk hardware.fdatasync: Faster. Only flushes data, not metadata (mtime).open_datasync: UsesO_DSYNCflag.open_sync: UsesO_SYNCflag.
MySQL (InnoDB) innodb_flush_log_at_trx_commit
1(Default): Flush to disk on every commit. ACID compliant. Slowest.0: Flush every 1 second. Fast. Can lose 1s of data on crash.2: Write to OS cache on commit, flush every 1s. Can lose data on OS crash, but safe if DB crashes.
Interview Tips š”
- "Why is WAL faster than writing directly?" ā Sequential I/O vs Random I/O.
- "What is a Checkpoint?" ā Explain it as a way to truncate the log and speed up recovery.
- "Can I disable WAL?" ā Yes (e.g., Unlogged Tables in Postgres), for temporary data where performance > durability.
- "How does replication relate to WAL?" ā It's just shipping the log. Log Shipping.
Related Concepts
- ACID vs BASE
- Database Replication
- LSM Trees (Use WAL for MemTable durability)
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Consistent Hashing
How to add/remove servers without moving every single key. The Ring, Virtual Nodes, and real-world usage in Cassandra, DynamoDB, and Discord.
Database Indexing
Deep dive into database indexing internals. How B-Trees work, Clustered vs Non-Clustered indexes, Composite Index best practices, and covering indexes.
Database Sharding
How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.