Distributed Transactions: The Consistency Challenge

In a monolith with a single database, transactions are straightforward: BEGIN TRANSACTION → Update User → Update Order → COMMIT. If anything fails, you ROLLBACK, and the database guarantees atomicity.

In microservices, where each service owns its database, you face a fundamental challenge: How do you maintain data consistency across multiple independent databases when a business transaction spans multiple services?

This is the distributed transaction problem.

The Problem Statement

Imagine an e-commerce order flow:

Order Service creates an order
Payment Service charges the customer
Inventory Service reserves items
Shipping Service schedules delivery

If Payment succeeds but Inventory fails (out of stock), you've charged the customer for items you can't deliver. You need a way to ensure all services succeed together or all fail together.

Traditional ACID transactions don't work across network boundaries and independent databases.

Solution 1: Two-Phase Commit (2PC)

Two-Phase Commit is a distributed algorithm that extends ACID semantics across multiple databases.

How It Works

Phase 1: Prepare (Voting)

Coordinator sends PREPARE to all participants
Each participant performs the operation but doesn't commit
Each participant locks the affected rows/resources
Each participant responds: VOTE_COMMIT (ready) or VOTE_ABORT (can't do it)

Phase 2: Commit/Abort (Decision)

If ALL voted COMMIT: Coordinator sends GLOBAL_COMMIT to everyone
If ANY voted ABORT: Coordinator sends GLOBAL_ABORT to everyone
Each participant commits or rolls back accordingly
Participants release locks

Sequence Diagram

mermaid

sequenceDiagram
    participant C as Coordinator
    participant A as Service A
    participant B as Service B
    
    Note over C: Phase 1: Prepare
    C->>A: PREPARE transaction
    C->>B: PREPARE transaction
    A-->>C: VOTE_COMMIT
    B-->>C: VOTE_COMMIT
    
    Note over C: All voted YES
    Note over C: Phase 2: Commit
    C->>A: GLOBAL_COMMIT
    C->>B: GLOBAL_COMMIT
    A-->>C: ACK
    B-->>C: ACK

Click to expand code...

The Fatal Flaw: Blocking

Scenario: Coordinator crashes after Phase 1 but before Phase 2.

Result: All participants are stuck holding locks indefinitely. They don't know if they should commit or abort. This is called the blocking problem.

Impact:

Database locks cause contention and deadlocks
Services become unavailable
System throughput drops to zero

When to Use 2PC

✅ Use when:

Strict consistency is absolutely required (banking, financial transactions)
Transactions are short-lived (< 100ms)
Network is reliable and low-latency (same data center)

❌ Avoid when:

Cross-internet transactions
Long-running business processes
High-availability is critical

Real-World Usage

MySQL XA Transactions: Distributed transactions across multiple MySQL databases
Java JTA: Enterprise Java beans with distributed transactions
Modern adoption: Extremely rare in microservices due to availability concerns

Solution 2: SAGA Pattern

Instead of one atomic distributed transaction, SAGA uses a sequence of local transactions, with compensating transactions for rollback.

Two SAGA Approaches

1. Choreography (Event-Driven)

Services publish events, and other services react:

mermaid

sequenceDiagram
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    participant S as Shipping Service
    
    O->>O: Create Order
    O->>P: OrderCreated Event
    P->>P: Charge Payment
    P->>I: PaymentSucceeded Event
    I->>I: Reserve Inventory
    I->>S: InventoryReserved Event
    S->>S: Schedule Shipping
    
    Note over S: SUCCESS!

Click to expand code...

Failure Scenario:

mermaid

sequenceDiagram
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    
    O->>O: Create Order
    O->>P: OrderCreated Event
    P->>P: Charge Payment
    P->>I: PaymentSucceeded Event
    I->>I: Check Inventory
    Note over I: OUT OF STOCK!
    I->>P: InventoryFailed Event
    P->>P: Refund Payment
    P->>O: PaymentRefunded Event
    O->>O: Cancel Order

Click to expand code...

Pros:

Decoupled services
No single point of failure
Highly scalable

Cons:

Hard to understand the overall flow
Difficult to debug and monitor
Cyclic dependencies possible

2. Orchestration (Centralized)

A Saga Orchestrator coordinates the entire flow:

python

class OrderSagaOrchestrator:
    def execute_order(self, order_id):
        try:
            # Step 1: Create Order
            order = order_service.create_order(order_id)
            
            # Step 2: Charge Payment
            payment = payment_service.charge(order.customer_id, order.total)
            
            # Step 3: Reserve Inventory
            inventory = inventory_service.reserve(order.items)
            
            # Step 4: Schedule Shipping
            shipping = shipping_service.schedule(order.id)
            
            return {"status": "SUCCESS"}
            
        except PaymentFailedException as e:
            # Payment failed, rollback order
            order_service.cancel_order(order_id)
            raise
            
        except InventoryFailedException as e:
            # Inventory failed, rollback payment and order
            payment_service.refund(payment.id)
            order_service.cancel_order(order_id)
            raise
            
        except ShippingFailedException as e:
            # Shipping failed, rollback everything
            inventory_service.release(inventory.id)
            payment_service.refund(payment.id)
            order_service.cancel_order(order_id)
            raise

Click to expand code...

Pros:

Explicit business logic in one place
Easy to understand and debug
Centralized monitoring

Cons:

Orchestrator is a single point of failure
Orchestrator can become a bottleneck
Tight coupling to orchestrator

Compensating Transactions

Critical concept: You can't "undo" a committed transaction. Instead, you run a new transaction that reverses the effect.

Example:

Forward Transaction: INSERT INTO orders (id, status) VALUES (123, 'PENDING')
Compensating Transaction: UPDATE orders SET status = 'CANCELLED' WHERE id = 123

Challenges:

Non-Idempotent Operations: Sending an email can't be "unsent"
Semantic Compensation: Refunding money isn't identical to "not charging" (audit trail, fees, timing)
Partial Compensation: What if the compensation itself fails?

Idempotency is Critical

Services must be idempotent (calling twice has same effect as calling once):

javascript

// BAD: Not idempotent
async function chargePayment(orderId, amount) {
  const charge = await stripe.charge({ amount });
  return charge;
}

// GOOD: Idempotent using idempotency key
async function chargePayment(orderId, amount) {
  const charge = await stripe.charge({ 
    amount,
    idempotency_key: `order-${orderId}` // Prevents duplicate charges
  });
  return charge;
}

Click to expand code...

Solution 3: Outbox Pattern

Ensures transactional messaging: Database write + Event publish must be atomic.

The Problem

python

# NOT ATOMIC!
def create_order(order):
    db.insert_order(order)  # Step 1: DB write
    kafka.publish("OrderCreated", order)  # Step 2: Event publish
    # What if Kafka fails? Order is created but no event published!

Click to expand code...

The Solution: Outbox Table

sql

BEGIN TRANSACTION;
  INSERT INTO orders (id, customer_id, total) VALUES (123, 456, 99.99);
  INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '{"order_id": 123}');
COMMIT;

Click to expand code...

A background worker reads the outbox table and publishes events:

python

while True:
    events = db.execute("SELECT * FROM outbox WHERE published = false LIMIT 100")
    for event in events:
        kafka.publish(event.event_type, event.payload)
        db.execute("UPDATE outbox SET published = true WHERE id = ?", event.id)
    time.sleep(1)

Click to expand code...

Benefits:

Atomic: Database write and event "intent" in same transaction
Reliable: Events guaranteed to be published eventually
Polling: Worker can retry on Kafka failures

Comparison Table

Pattern	Consistency	Availability	Complexity	Use Case
2PC	Strong	Low (blocking)	Medium	Banking, financial systems
SAGA (Choreography)	Eventual	High	High	E-commerce, social platforms
SAGA (Orchestration)	Eventual	High	Medium	Order processing, workflows
Outbox	Eventual	High	Low	Any event-driven system

Real-World Examples

Uber Eats: Order Placement

SAGA Orchestration:

Place Order (Order Service)
Authorize Payment (Payment Service)
Notify Restaurant (Restaurant Service)
Assign Driver (Dispatch Service)

Compensations:

If restaurant rejects → Refund payment, cancel order
If no driver available → Cancel restaurant, refund payment, cancel order

Amazon: Inventory Management

Outbox Pattern:

Product purchase → Local DB transaction + Outbox record
Background worker → Publishes InventoryReduced event
Warehouse Service → Consumes event and updates physical inventory

Interview Tips 💡

When discussing distributed transactions in system design interviews:

Start with the problem: "We have Order Service and Payment Service with separate databases..."
Avoid 2PC: Mention it exists but explain why you won't use it (blocking, availability)
Explain SAGA: Walk through a concrete example (e-commerce order)
Mention trade-offs: "We'll have eventual consistency, so users might see 'Processing' status briefly"
Discuss idempotency: "Each service must handle duplicate requests gracefully"
Bring up Outbox: "To ensure we never lose events, we'll use the Outbox pattern"

Common Pitfalls

⚠️ Timeout Confusion: If a service times out, did it succeed or fail? Use idempotent operations and unique request IDs.

⚠️ Compensation Order: Reverse the order of compensations (LIFO). If you booked flight → hotel → car, cancel car → hotel → flight.

⚠️ Testing: Test failure scenarios! Use chaos engineering to randomly fail services and verify compensations work.

Related Concepts

ACID vs BASE — Consistency models in databases
Event Sourcing & CQRS — Alternative to distributed transactions
Message Queues — Reliable event delivery
Microservices — Service architecture patterns
CAP Theorem — Fundamental trade-offs in distributed systems

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Intermediate

ACID vs BASE: Consistency Models

The two philosophies of database transaction handling: Strict guarantees (ACID) versus flexible availability (BASE). Deep dive into isolation levels, transaction anomalies, and hybrid approaches.

DatabaseTransactionsConsistency

Intermediate

Database Replication

The process of copying and maintaining database objects in multiple databases to improve reliability, fault-tolerance, and accessibility.

DatabaseAvailabilityDistributed Systems

Advanced

Database Sharding

How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.

DatabasesScalabilitySystem Design

Distributed Transactions: The Consistency Challenge

The Problem Statement

Solution 1: Two-Phase Commit (2PC)

How It Works

Sequence Diagram

The Fatal Flaw: Blocking

When to Use 2PC

Real-World Usage

Solution 2: SAGA Pattern

Two SAGA Approaches

1. Choreography (Event-Driven)

2. Orchestration (Centralized)

Compensating Transactions

Idempotency is Critical

Solution 3: Outbox Pattern

The Problem

The Solution: Outbox Table

Comparison Table

Real-World Examples

Uber Eats: Order Placement

Amazon: Inventory Management

Interview Tips 💡

Common Pitfalls

Related Concepts

About ScaleWiki

Related Articles

ACID vs BASE: Consistency Models

Database Replication

Database Sharding