Back to All Concepts
MicroservicesConsistencyDatabaseSystem DesignAdvanced

Distributed Transactions

Managing data consistency across multiple services where a single operation must either fully succeed or fully fail. Deep dive into SAGA, 2PC, and modern patterns.

Distributed Transactions: The Consistency Challenge

In a monolith with a single database, transactions are straightforward: BEGIN TRANSACTION → Update User → Update Order → COMMIT. If anything fails, you ROLLBACK, and the database guarantees atomicity.

In microservices, where each service owns its database, you face a fundamental challenge: How do you maintain data consistency across multiple independent databases when a business transaction spans multiple services?

This is the distributed transaction problem.

The Problem Statement

Imagine an e-commerce order flow:

  1. Order Service creates an order
  2. Payment Service charges the customer
  3. Inventory Service reserves items
  4. Shipping Service schedules delivery

If Payment succeeds but Inventory fails (out of stock), you've charged the customer for items you can't deliver. You need a way to ensure all services succeed together or all fail together.

Traditional ACID transactions don't work across network boundaries and independent databases.

Solution 1: Two-Phase Commit (2PC)

Two-Phase Commit is a distributed algorithm that extends ACID semantics across multiple databases.

How It Works

Phase 1: Prepare (Voting)

  1. Coordinator sends PREPARE to all participants
  2. Each participant performs the operation but doesn't commit
  3. Each participant locks the affected rows/resources
  4. Each participant responds: VOTE_COMMIT (ready) or VOTE_ABORT (can't do it)

Phase 2: Commit/Abort (Decision)

  1. If ALL voted COMMIT: Coordinator sends GLOBAL_COMMIT to everyone
  2. If ANY voted ABORT: Coordinator sends GLOBAL_ABORT to everyone
  3. Each participant commits or rolls back accordingly
  4. Participants release locks

Sequence Diagram

mermaid
sequenceDiagram
    participant C as Coordinator
    participant A as Service A
    participant B as Service B
    
    Note over C: Phase 1: Prepare
    C->>A: PREPARE transaction
    C->>B: PREPARE transaction
    A-->>C: VOTE_COMMIT
    B-->>C: VOTE_COMMIT
    
    Note over C: All voted YES
    Note over C: Phase 2: Commit
    C->>A: GLOBAL_COMMIT
    C->>B: GLOBAL_COMMIT
    A-->>C: ACK
    B-->>C: ACK
Click to expand code...

The Fatal Flaw: Blocking

Scenario: Coordinator crashes after Phase 1 but before Phase 2.

Result: All participants are stuck holding locks indefinitely. They don't know if they should commit or abort. This is called the blocking problem.

Impact:

  • Database locks cause contention and deadlocks
  • Services become unavailable
  • System throughput drops to zero

When to Use 2PC

Use when:

  • Strict consistency is absolutely required (banking, financial transactions)
  • Transactions are short-lived (< 100ms)
  • Network is reliable and low-latency (same data center)

Avoid when:

  • Cross-internet transactions
  • Long-running business processes
  • High-availability is critical

Real-World Usage

  • MySQL XA Transactions: Distributed transactions across multiple MySQL databases
  • Java JTA: Enterprise Java beans with distributed transactions
  • Modern adoption: Extremely rare in microservices due to availability concerns

Solution 2: SAGA Pattern

Instead of one atomic distributed transaction, SAGA uses a sequence of local transactions, with compensating transactions for rollback.

Two SAGA Approaches

1. Choreography (Event-Driven)

Services publish events, and other services react:

mermaid
sequenceDiagram
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    participant S as Shipping Service
    
    O->>O: Create Order
    O->>P: OrderCreated Event
    P->>P: Charge Payment
    P->>I: PaymentSucceeded Event
    I->>I: Reserve Inventory
    I->>S: InventoryReserved Event
    S->>S: Schedule Shipping
    
    Note over S: SUCCESS!
Click to expand code...

Failure Scenario:

mermaid
sequenceDiagram
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    
    O->>O: Create Order
    O->>P: OrderCreated Event
    P->>P: Charge Payment
    P->>I: PaymentSucceeded Event
    I->>I: Check Inventory
    Note over I: OUT OF STOCK!
    I->>P: InventoryFailed Event
    P->>P: Refund Payment
    P->>O: PaymentRefunded Event
    O->>O: Cancel Order
Click to expand code...

Pros:

  • Decoupled services
  • No single point of failure
  • Highly scalable

Cons:

  • Hard to understand the overall flow
  • Difficult to debug and monitor
  • Cyclic dependencies possible

2. Orchestration (Centralized)

A Saga Orchestrator coordinates the entire flow:

python
class OrderSagaOrchestrator:
    def execute_order(self, order_id):
        try:
            # Step 1: Create Order
            order = order_service.create_order(order_id)
            
            # Step 2: Charge Payment
            payment = payment_service.charge(order.customer_id, order.total)
            
            # Step 3: Reserve Inventory
            inventory = inventory_service.reserve(order.items)
            
            # Step 4: Schedule Shipping
            shipping = shipping_service.schedule(order.id)
            
            return {"status": "SUCCESS"}
            
        except PaymentFailedException as e:
            # Payment failed, rollback order
            order_service.cancel_order(order_id)
            raise
            
        except InventoryFailedException as e:
            # Inventory failed, rollback payment and order
            payment_service.refund(payment.id)
            order_service.cancel_order(order_id)
            raise
            
        except ShippingFailedException as e:
            # Shipping failed, rollback everything
            inventory_service.release(inventory.id)
            payment_service.refund(payment.id)
            order_service.cancel_order(order_id)
            raise
Click to expand code...

Pros:

  • Explicit business logic in one place
  • Easy to understand and debug
  • Centralized monitoring

Cons:

  • Orchestrator is a single point of failure
  • Orchestrator can become a bottleneck
  • Tight coupling to orchestrator

Compensating Transactions

Critical concept: You can't "undo" a committed transaction. Instead, you run a new transaction that reverses the effect.

Example:

  • Forward Transaction: INSERT INTO orders (id, status) VALUES (123, 'PENDING')
  • Compensating Transaction: UPDATE orders SET status = 'CANCELLED' WHERE id = 123

Challenges:

  1. Non-Idempotent Operations: Sending an email can't be "unsent"
  2. Semantic Compensation: Refunding money isn't identical to "not charging" (audit trail, fees, timing)
  3. Partial Compensation: What if the compensation itself fails?

Idempotency is Critical

Services must be idempotent (calling twice has same effect as calling once):

javascript
// BAD: Not idempotent
async function chargePayment(orderId, amount) {
  const charge = await stripe.charge({ amount });
  return charge;
}

// GOOD: Idempotent using idempotency key
async function chargePayment(orderId, amount) {
  const charge = await stripe.charge({ 
    amount,
    idempotency_key: `order-${orderId}` // Prevents duplicate charges
  });
  return charge;
}
Click to expand code...

Solution 3: Outbox Pattern

Ensures transactional messaging: Database write + Event publish must be atomic.

The Problem

python
# NOT ATOMIC!
def create_order(order):
    db.insert_order(order)  # Step 1: DB write
    kafka.publish("OrderCreated", order)  # Step 2: Event publish
    # What if Kafka fails? Order is created but no event published!
Click to expand code...

The Solution: Outbox Table

sql
BEGIN TRANSACTION;
  INSERT INTO orders (id, customer_id, total) VALUES (123, 456, 99.99);
  INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '{"order_id": 123}');
COMMIT;
Click to expand code...

A background worker reads the outbox table and publishes events:

python
while True:
    events = db.execute("SELECT * FROM outbox WHERE published = false LIMIT 100")
    for event in events:
        kafka.publish(event.event_type, event.payload)
        db.execute("UPDATE outbox SET published = true WHERE id = ?", event.id)
    time.sleep(1)
Click to expand code...

Benefits:

  • Atomic: Database write and event "intent" in same transaction
  • Reliable: Events guaranteed to be published eventually
  • Polling: Worker can retry on Kafka failures

Comparison Table

PatternConsistencyAvailabilityComplexityUse Case
2PCStrongLow (blocking)MediumBanking, financial systems
SAGA (Choreography)EventualHighHighE-commerce, social platforms
SAGA (Orchestration)EventualHighMediumOrder processing, workflows
OutboxEventualHighLowAny event-driven system

Real-World Examples

Uber Eats: Order Placement

SAGA Orchestration:

  1. Place Order (Order Service)
  2. Authorize Payment (Payment Service)
  3. Notify Restaurant (Restaurant Service)
  4. Assign Driver (Dispatch Service)

Compensations:

  • If restaurant rejects → Refund payment, cancel order
  • If no driver available → Cancel restaurant, refund payment, cancel order

Amazon: Inventory Management

Outbox Pattern:

  • Product purchase → Local DB transaction + Outbox record
  • Background worker → Publishes InventoryReduced event
  • Warehouse Service → Consumes event and updates physical inventory

Interview Tips 💡

When discussing distributed transactions in system design interviews:

  1. Start with the problem: "We have Order Service and Payment Service with separate databases..."
  2. Avoid 2PC: Mention it exists but explain why you won't use it (blocking, availability)
  3. Explain SAGA: Walk through a concrete example (e-commerce order)
  4. Mention trade-offs: "We'll have eventual consistency, so users might see 'Processing' status briefly"
  5. Discuss idempotency: "Each service must handle duplicate requests gracefully"
  6. Bring up Outbox: "To ensure we never lose events, we'll use the Outbox pattern"

Common Pitfalls

⚠️ Timeout Confusion: If a service times out, did it succeed or fail? Use idempotent operations and unique request IDs.

⚠️ Compensation Order: Reverse the order of compensations (LIFO). If you booked flight → hotel → car, cancel car → hotel → flight.

⚠️ Testing: Test failure scenarios! Use chaos engineering to randomly fail services and verify compensations work.

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles