The Durability Problem

When you COMMIT a transaction, the database promises that your data is safe. But writing data to disk is slow (random I/O). Writing to memory is fast but volatile (lost on power failure).

Write-Ahead Logging (WAL) solves this by writing changes to a sequential log file on disk before applying them to the main data files.

[!IMPORTANT] Rule: No modification goes to the data file until it has been written to the WAL file.

How WAL Works

Transaction Start: User begins modifying data.
Log Record: DB creates a log entry describing the change (e.g., "Page 42, Offset 10: Change 'A' to 'B'").
Append to WAL: This small record is appended to the wal_buffer in memory.
Commit: User commits.
Flush: The wal_buffer is fsync()ed to disk. Transaction is now durable.
Apply to Data: Later (background process), the change is written to the actual data file (checkpointing).

Why is this faster?

Append-only writes are sequential I/O, which is significantly faster than the random I/O required to update data pages scattered across the disk.

Crash Recovery

What happens if the server crashes right after Step 5 but before Step 6?

Restart: DB starts up and sees the data file is old.
Replay: It reads the WAL from the last Checkpoint.
Redo: It reapplies all changes found in the WAL to the data pages.
Consistent State: The database is restored to the exact state at the moment of the crash.

mermaid

sequenceDiagram
    participant C as Client
    participant W as WAL (Disk)
    participant M as Memory Buffer
    participant D as Data File (Disk)
    
    C->>M: UPDATE accounts SET balance = 100
    M->>W: Append Log Record (Sequential Write)
    W-->>C: ACK (Commit Successful)
    
    Note over C,D: --- Crash Here! ---
    
    M--x D: (Data never reached disk)
    
    Note over W: Recovery Process
    W->>D: Replay Log (Redo)
    D-->>C: Data Restored!

Click to expand code...

Key Concepts

LSN (Log Sequence Number)

Every WAL record has a unique ID, the LSN. It strictly increases. Data pages store the LSN of the last change that modified them.

If PageLSN < WAL_LSN, the page is stale and must be updated.
If PageLSN >= WAL_LSN, the page is up-to-date.

Checkpointing

The WAL grows forever. We can't replay everything on startup. Checkpointing is the process of:

Flushing all dirty pages from memory to the data file.
Recording a "Checkpoint" record in the WAL.
Deleting/Archiving old WAL segments before this point.

Tradeoff:

Frequent Checkpoints: Fast recovery, but slower runtime performance (more disk I/O).
Infrequent Checkpoints: Fast runtime, but slow recovery (scanning GBs of logs).

WAL in Replication

WAL isn't just for crash recovery. It's the standard way to replicate data.

Primary-Replica Replication:

Primary writes to local WAL.
Primary sends WAL records to Replica.
Replica applies WAL records to its own data files.

This is basically "Continuous Recovery" on the Replica.

Code Example: Simplified WAL Implementation

python

import os
import struct

class WriteAheadLog:
    def __init__(self, log_file="db.wal"):
        self.log_file = log_file
        self.file = open(self.log_file, "ab") # Append binary mode
    
    def log_change(self, transaction_id, key, value):
        # Create a log record: (TransID, KeyLen, Key, ValLen, Value)
        # Using struct to pack binary data
        key_bytes = key.encode('utf-8')
        val_bytes = value.encode('utf-8')
        
        record = struct.pack(
            f"I I {len(key_bytes)}s I {len(val_bytes)}s", 
            transaction_id, 
            len(key_bytes), 
            key_bytes, 
            len(val_bytes), 
            val_bytes
        )
        
        # 1. Write to OS buffer
        self.file.write(record)
        
        # 2. Force flush to disk (fsync) - Critical for durability!
        self.file.flush()
        os.fsync(self.file.fileno())
        
        print(f"Logged Transaction {transaction_id}: {key}={value}")

    def close(self):
        self.file.close()

# Usage
wal = WriteAheadLog()
wal.log_change(101, "user:1:balance", "500")
# If power fails now, we can recover transaction 101 from "db.wal"

Click to expand code...

Production Tuning

PostgreSQL `wal_sync_method`

fsync: Default. Safe. Flushes to disk hardware.
fdatasync: Faster. Only flushes data, not metadata (mtime).
open_datasync: Uses O_DSYNC flag.
open_sync: Uses O_SYNC flag.

MySQL (InnoDB) `innodb_flush_log_at_trx_commit`

1 (Default): Flush to disk on every commit. ACID compliant. Slowest.
0: Flush every 1 second. Fast. Can lose 1s of data on crash.
2: Write to OS cache on commit, flush every 1s. Can lose data on OS crash, but safe if DB crashes.

Interview Tips 💡

"Why is WAL faster than writing directly?" — Sequential I/O vs Random I/O.
"What is a Checkpoint?" — Explain it as a way to truncate the log and speed up recovery.
"Can I disable WAL?" — Yes (e.g., Unlogged Tables in Postgres), for temporary data where performance > durability.
"How does replication relate to WAL?" — It's just shipping the log. Log Shipping.

Related Concepts

ACID vs BASE
Database Replication
LSM Trees (Use WAL for MemTable durability)

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Advanced

Consistent Hashing

How to add/remove servers without moving every single key. The Ring, Virtual Nodes, and real-world usage in Cassandra, DynamoDB, and Discord.

DatabasesDistributed SystemsAlgorithms

Advanced

Database Indexing

Deep dive into database indexing internals. How B-Trees work, Clustered vs Non-Clustered indexes, Composite Index best practices, and covering indexes.

DatabasesPerformanceSQL

Advanced

Database Sharding

How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.

DatabasesScalabilitySystem Design

The Durability Problem

How WAL Works

Why is this faster?

Crash Recovery

Key Concepts

LSN (Log Sequence Number)

Checkpointing

WAL in Replication

Code Example: Simplified WAL Implementation

Production Tuning

PostgreSQL wal_sync_method

MySQL (InnoDB) innodb_flush_log_at_trx_commit

Interview Tips 💡

Related Concepts

About ScaleWiki

Related Articles

Consistent Hashing

Database Indexing

Database Sharding

PostgreSQL `wal_sync_method`

MySQL (InnoDB) `innodb_flush_log_at_trx_commit`