Distributed Transactions: The Consistency Challenge
In a monolith with a single database, transactions are straightforward: BEGIN TRANSACTION → Update User → Update Order → COMMIT. If anything fails, you ROLLBACK, and the database guarantees atomicity.
In microservices, where each service owns its database, you face a fundamental challenge: How do you maintain data consistency across multiple independent databases when a business transaction spans multiple services?
This is the distributed transaction problem.
The Problem Statement
Imagine an e-commerce order flow:
- Order Service creates an order
- Payment Service charges the customer
- Inventory Service reserves items
- Shipping Service schedules delivery
If Payment succeeds but Inventory fails (out of stock), you've charged the customer for items you can't deliver. You need a way to ensure all services succeed together or all fail together.
Traditional ACID transactions don't work across network boundaries and independent databases.
Solution 1: Two-Phase Commit (2PC)
Two-Phase Commit is a distributed algorithm that extends ACID semantics across multiple databases.
How It Works
Phase 1: Prepare (Voting)
- Coordinator sends
PREPAREto all participants - Each participant performs the operation but doesn't commit
- Each participant locks the affected rows/resources
- Each participant responds:
VOTE_COMMIT(ready) orVOTE_ABORT(can't do it)
Phase 2: Commit/Abort (Decision)
- If ALL voted COMMIT: Coordinator sends
GLOBAL_COMMITto everyone - If ANY voted ABORT: Coordinator sends
GLOBAL_ABORTto everyone - Each participant commits or rolls back accordingly
- Participants release locks
Sequence Diagram
sequenceDiagram
participant C as Coordinator
participant A as Service A
participant B as Service B
Note over C: Phase 1: Prepare
C->>A: PREPARE transaction
C->>B: PREPARE transaction
A-->>C: VOTE_COMMIT
B-->>C: VOTE_COMMIT
Note over C: All voted YES
Note over C: Phase 2: Commit
C->>A: GLOBAL_COMMIT
C->>B: GLOBAL_COMMIT
A-->>C: ACK
B-->>C: ACK
The Fatal Flaw: Blocking
Scenario: Coordinator crashes after Phase 1 but before Phase 2.
Result: All participants are stuck holding locks indefinitely. They don't know if they should commit or abort. This is called the blocking problem.
Impact:
- Database locks cause contention and deadlocks
- Services become unavailable
- System throughput drops to zero
When to Use 2PC
✅ Use when:
- Strict consistency is absolutely required (banking, financial transactions)
- Transactions are short-lived (< 100ms)
- Network is reliable and low-latency (same data center)
❌ Avoid when:
- Cross-internet transactions
- Long-running business processes
- High-availability is critical
Real-World Usage
- MySQL XA Transactions: Distributed transactions across multiple MySQL databases
- Java JTA: Enterprise Java beans with distributed transactions
- Modern adoption: Extremely rare in microservices due to availability concerns
Solution 2: SAGA Pattern
Instead of one atomic distributed transaction, SAGA uses a sequence of local transactions, with compensating transactions for rollback.
Two SAGA Approaches
1. Choreography (Event-Driven)
Services publish events, and other services react:
sequenceDiagram
participant O as Order Service
participant P as Payment Service
participant I as Inventory Service
participant S as Shipping Service
O->>O: Create Order
O->>P: OrderCreated Event
P->>P: Charge Payment
P->>I: PaymentSucceeded Event
I->>I: Reserve Inventory
I->>S: InventoryReserved Event
S->>S: Schedule Shipping
Note over S: SUCCESS!
Failure Scenario:
sequenceDiagram
participant O as Order Service
participant P as Payment Service
participant I as Inventory Service
O->>O: Create Order
O->>P: OrderCreated Event
P->>P: Charge Payment
P->>I: PaymentSucceeded Event
I->>I: Check Inventory
Note over I: OUT OF STOCK!
I->>P: InventoryFailed Event
P->>P: Refund Payment
P->>O: PaymentRefunded Event
O->>O: Cancel Order
Pros:
- Decoupled services
- No single point of failure
- Highly scalable
Cons:
- Hard to understand the overall flow
- Difficult to debug and monitor
- Cyclic dependencies possible
2. Orchestration (Centralized)
A Saga Orchestrator coordinates the entire flow:
class OrderSagaOrchestrator:
def execute_order(self, order_id):
try:
# Step 1: Create Order
order = order_service.create_order(order_id)
# Step 2: Charge Payment
payment = payment_service.charge(order.customer_id, order.total)
# Step 3: Reserve Inventory
inventory = inventory_service.reserve(order.items)
# Step 4: Schedule Shipping
shipping = shipping_service.schedule(order.id)
return {"status": "SUCCESS"}
except PaymentFailedException as e:
# Payment failed, rollback order
order_service.cancel_order(order_id)
raise
except InventoryFailedException as e:
# Inventory failed, rollback payment and order
payment_service.refund(payment.id)
order_service.cancel_order(order_id)
raise
except ShippingFailedException as e:
# Shipping failed, rollback everything
inventory_service.release(inventory.id)
payment_service.refund(payment.id)
order_service.cancel_order(order_id)
raise
Pros:
- Explicit business logic in one place
- Easy to understand and debug
- Centralized monitoring
Cons:
- Orchestrator is a single point of failure
- Orchestrator can become a bottleneck
- Tight coupling to orchestrator
Compensating Transactions
Critical concept: You can't "undo" a committed transaction. Instead, you run a new transaction that reverses the effect.
Example:
- Forward Transaction:
INSERT INTO orders (id, status) VALUES (123, 'PENDING') - Compensating Transaction:
UPDATE orders SET status = 'CANCELLED' WHERE id = 123
Challenges:
- Non-Idempotent Operations: Sending an email can't be "unsent"
- Semantic Compensation: Refunding money isn't identical to "not charging" (audit trail, fees, timing)
- Partial Compensation: What if the compensation itself fails?
Idempotency is Critical
Services must be idempotent (calling twice has same effect as calling once):
// BAD: Not idempotent
async function chargePayment(orderId, amount) {
const charge = await stripe.charge({ amount });
return charge;
}
// GOOD: Idempotent using idempotency key
async function chargePayment(orderId, amount) {
const charge = await stripe.charge({
amount,
idempotency_key: `order-${orderId}` // Prevents duplicate charges
});
return charge;
}
Solution 3: Outbox Pattern
Ensures transactional messaging: Database write + Event publish must be atomic.
The Problem
# NOT ATOMIC!
def create_order(order):
db.insert_order(order) # Step 1: DB write
kafka.publish("OrderCreated", order) # Step 2: Event publish
# What if Kafka fails? Order is created but no event published!
The Solution: Outbox Table
BEGIN TRANSACTION;
INSERT INTO orders (id, customer_id, total) VALUES (123, 456, 99.99);
INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '{"order_id": 123}');
COMMIT;
A background worker reads the outbox table and publishes events:
while True:
events = db.execute("SELECT * FROM outbox WHERE published = false LIMIT 100")
for event in events:
kafka.publish(event.event_type, event.payload)
db.execute("UPDATE outbox SET published = true WHERE id = ?", event.id)
time.sleep(1)
Benefits:
- Atomic: Database write and event "intent" in same transaction
- Reliable: Events guaranteed to be published eventually
- Polling: Worker can retry on Kafka failures
Comparison Table
| Pattern | Consistency | Availability | Complexity | Use Case |
|---|---|---|---|---|
| 2PC | Strong | Low (blocking) | Medium | Banking, financial systems |
| SAGA (Choreography) | Eventual | High | High | E-commerce, social platforms |
| SAGA (Orchestration) | Eventual | High | Medium | Order processing, workflows |
| Outbox | Eventual | High | Low | Any event-driven system |
Real-World Examples
Uber Eats: Order Placement
SAGA Orchestration:
- Place Order (Order Service)
- Authorize Payment (Payment Service)
- Notify Restaurant (Restaurant Service)
- Assign Driver (Dispatch Service)
Compensations:
- If restaurant rejects → Refund payment, cancel order
- If no driver available → Cancel restaurant, refund payment, cancel order
Amazon: Inventory Management
Outbox Pattern:
- Product purchase → Local DB transaction + Outbox record
- Background worker → Publishes
InventoryReducedevent - Warehouse Service → Consumes event and updates physical inventory
Interview Tips 💡
When discussing distributed transactions in system design interviews:
- Start with the problem: "We have Order Service and Payment Service with separate databases..."
- Avoid 2PC: Mention it exists but explain why you won't use it (blocking, availability)
- Explain SAGA: Walk through a concrete example (e-commerce order)
- Mention trade-offs: "We'll have eventual consistency, so users might see 'Processing' status briefly"
- Discuss idempotency: "Each service must handle duplicate requests gracefully"
- Bring up Outbox: "To ensure we never lose events, we'll use the Outbox pattern"
Common Pitfalls
⚠️ Timeout Confusion: If a service times out, did it succeed or fail? Use idempotent operations and unique request IDs.
⚠️ Compensation Order: Reverse the order of compensations (LIFO). If you booked flight → hotel → car, cancel car → hotel → flight.
⚠️ Testing: Test failure scenarios! Use chaos engineering to randomly fail services and verify compensations work.
Related Concepts
- ACID vs BASE — Consistency models in databases
- Event Sourcing & CQRS — Alternative to distributed transactions
- Message Queues — Reliable event delivery
- Microservices — Service architecture patterns
- CAP Theorem — Fundamental trade-offs in distributed systems
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
ACID vs BASE: Consistency Models
The two philosophies of database transaction handling: Strict guarantees (ACID) versus flexible availability (BASE). Deep dive into isolation levels, transaction anomalies, and hybrid approaches.
Database Replication
The process of copying and maintaining database objects in multiple databases to improve reliability, fault-tolerance, and accessibility.
Database Sharding
How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.