Designing a Payment System
In most system design interviews, availability (the "AP" side of the CAP trade-off) is king. In payment systems, consistency (CP) takes priority: losing data or double-charging a user is unacceptable.
1. Requirements
Functional
- Pay: User transfers money to a merchant.
- Refund: Refund a transaction.
- History: View transaction logs.
Non-Functional
- Zero Data Loss: Every accepted payment must be durably recorded. Use a database with ACID transactions.
- Exactly-Once Processing: Network retries should not cause double charges.
- Consistency: Account balances must always sum to zero (or the correct total).
2. The Core Problem: Network Failure
The internet is unreliable.
- Client sends POST /pay $10.
- Server charges credit card.
- Server sends 200 OK. (Packets lost here)
- Client sees "Network Error".
- Client retries POST /pay $10.
- Customer is charged $20. Disaster.
Solution: Idempotency Keys
Every request must include a unique client-generated ID (Idempotency Key).
- Header: Idempotency-Key: v4-uuid-gen-by-client
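On the client side, the important detail is that the key is generated once per payment attempt and reused on every retry. A minimal sketch, assuming a hypothetical `send` callable that performs the HTTP call and raises `ConnectionError` when the response is lost (the request shape and function names are illustrative, not a real client library):

```python
import uuid


def new_payment_request(amount_cents):
    """Build one logical payment request with a fresh idempotency key."""
    return {
        "method": "POST",
        "path": "/pay",
        # Generated once here; every retry of this request reuses it.
        "headers": {"Idempotency-Key": str(uuid.uuid4())},
        "body": {"amount_cents": amount_cents},
    }


def send_with_retries(request, send, max_attempts=3):
    """Retry the same request object so the key never changes.

    `send` is a hypothetical transport function supplied by the caller.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(request)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Generating a new UUID inside the retry loop would defeat the purpose, because the server would treat each retry as a new payment.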
Server Logic:
- Receive Request.
- Check DB: SELECT status FROM transactions WHERE idempotency_key = 'abc'.
- If Exists: Return the stored result. Do not execute logic again.
- If New: Execute logic -> Write to DB -> Return Result.
- Note: The check-and-insert must be atomic; otherwise two concurrent retries can both see the key as new. Use INSERT IGNORE (backed by a unique constraint on idempotency_key) or database transactions.
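The server logic above can be sketched with SQLite, whose `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`. This is a minimal illustration under assumptions: the table layout, the `PENDING` status, and the `charge_card` stub are hypothetical, and a real service would call its payment provider there.

```python
import sqlite3

psp_calls = []  # records every call to the fake payment provider


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE transactions (
               idempotency_key TEXT PRIMARY KEY,  -- uniqueness makes the claim atomic
               amount_cents    INTEGER NOT NULL,
               status          TEXT NOT NULL
           )"""
    )
    return conn


def charge_card(amount_cents):
    """Hypothetical stand-in for the external card charge."""
    psp_calls.append(amount_cents)
    return "SUCCEEDED"


def pay(conn, idempotency_key, amount_cents):
    # Step 1: try to claim the key. Only one request can insert the row.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO transactions "
            "(idempotency_key, amount_cents, status) VALUES (?, ?, 'PENDING')",
            (idempotency_key, amount_cents),
        )
        claimed = cur.rowcount == 1

    # Step 2: the key already exists, so return the stored result.
    if not claimed:
        row = conn.execute(
            "SELECT status FROM transactions WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]

    # Step 3: new key, so execute the logic and store the outcome.
    status = charge_card(amount_cents)
    with conn:
        conn.execute(
            "UPDATE transactions SET status = ? WHERE idempotency_key = ?",
            (status, idempotency_key),
        )
    return status
```

In this sketch a retry that arrives while the first request is still in flight gets `PENDING` back rather than triggering a second charge; how the client should handle that status is a design decision the sketch leaves open.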
3. Storage Pattern: Double-Entry Ledger
Never, ever store a user's balance only as a single mutable value: User.balance = 100.
- Why? If two threads update it at once, you get race conditions.
- Audit? You can't prove how the balance got there.
The Ledger Table
Instead, record movements of money. Every transaction has two entries: a debit and a credit.
| Txn ID | Account | Type | Amount |
|---|---|---|---|
| 1 | Alice | DEBIT | -50 |
| 1 | Bob | CREDIT | +50 |
Calculating Balance
SELECT SUM(amount) FROM ledger WHERE account = 'Alice'
- Pros: Immutable history. Easy to audit.
- Cons: Summing millions of rows is slow.
- Optimization: Use a "Snapshot" table that caches the balance every night. Current Balance = Snapshot + Sum(Today's Ledger).
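The ledger and snapshot idea can be sketched as follows, again with SQLite. The schema and function names are assumptions for illustration. One deliberate simplification: instead of a nightly date cutoff, the snapshot remembers the last ledger row it covered, which expresses the same "snapshot plus newer entries" formula.

```python
import sqlite3


def make_ledger():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ledger (
               txn_id     INTEGER NOT NULL,
               account    TEXT    NOT NULL,
               entry_type TEXT    NOT NULL CHECK (entry_type IN ('DEBIT', 'CREDIT')),
               amount     INTEGER NOT NULL
           )"""
    )
    conn.execute(
        """CREATE TABLE snapshot (
               account     TEXT PRIMARY KEY,
               balance     INTEGER NOT NULL,
               up_to_rowid INTEGER NOT NULL
           )"""
    )
    return conn


def transfer(conn, txn_id, src, dst, amount):
    # Both entries are written in one transaction, so they succeed or fail together.
    with conn:
        conn.executemany(
            "INSERT INTO ledger (txn_id, account, entry_type, amount) VALUES (?, ?, ?, ?)",
            [(txn_id, src, "DEBIT", -amount), (txn_id, dst, "CREDIT", amount)],
        )


def take_snapshot(conn, account):
    # Cache the balance and remember the last ledger row it includes.
    with conn:
        total, last = conn.execute(
            "SELECT COALESCE(SUM(amount), 0), COALESCE(MAX(rowid), 0) "
            "FROM ledger WHERE account = ?",
            (account,),
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO snapshot VALUES (?, ?, ?)", (account, total, last)
        )


def balance(conn, account):
    # Current balance = snapshot + sum of entries written after the snapshot.
    snap = conn.execute(
        "SELECT balance, up_to_rowid FROM snapshot WHERE account = ?", (account,)
    ).fetchone()
    base, last = snap if snap else (0, 0)
    delta = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = ? AND rowid > ?",
        (account, last),
    ).fetchone()[0]
    return base + delta
```

Because every transfer writes a matching debit and credit, `SELECT SUM(amount) FROM ledger` over all accounts stays at zero, which gives a cheap consistency check.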
4. Architecture
The Synchronous Phase
- Payment Service: Validates request. Checks fraud.
- Risk Engine: Is this IP blocked? Is the velocity too high?
- PSP Gateway: Calls Stripe/PayPal API. This is the external point of failure.
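As an illustration of the "velocity" question the Risk Engine asks, here is a minimal sliding-window counter. The class name, the thresholds, and the choice to key by user or IP are assumptions; real risk engines combine many more signals.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Allow at most `max_attempts` payments per key within `window_seconds`."""

    def __init__(self, max_attempts, window_seconds):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # key -> timestamps of allowed attempts

    def allow(self, key, now):
        attempts = self._attempts[key]
        # Drop timestamps that have fallen out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # velocity too high, reject before calling the PSP
        attempts.append(now)
        return True
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the check deterministic and easy to test.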
The Asynchronous Phase (Reconciliation)
What if Stripe charges the card, but our database crashes before we record it? We now have a "Ghost Charge".
The Reconciler (Cron Job):
- Download "Yesterday's Settlement File" from Stripe.
- Iterate through every row.
- Match against our local database.
- Mismatch Found?
- If Stripe has it, but we don't: We missed a success message. Record the charge in our DB.
- If we have it, but Stripe doesn't: The charge did not go through. Mark the transaction as failed in our DB.
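The matching step of the reconciler can be sketched as a pure function over two mappings from charge ID to amount. The input shape, the `Mismatch` type, and the action names are assumptions for illustration; parsing the actual settlement file is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    charge_id: str
    action: str


def reconcile(settlement, local):
    """Compare the provider's settlement rows with our local records.

    Both arguments map charge_id -> amount in cents.
    """
    mismatches = []
    # Provider has the charge, we do not: we missed a success message.
    for charge_id in settlement:
        if charge_id not in local:
            mismatches.append(Mismatch(charge_id, "RECORD_SUCCESS"))
    # We have the charge, the provider does not: treat it as failed.
    for charge_id in local:
        if charge_id not in settlement:
            mismatches.append(Mismatch(charge_id, "MARK_FAILED"))
    return mismatches
```

Returning a list of proposed actions, instead of updating the database directly, lets the job log or review the mismatches before applying them.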
5. Database Choice
- NoSQL (Cassandra/Mongo)?: Risky. Eventual consistency is dangerous here.
- SQL (Postgres/MySQL): Yes. Strong ACID transactions are required.
- NewSQL (CockroachDB/Spanner): Good for global scale with ACID properties.
6. Distributed Transactions (Two-Phase Commit)
If you have a microservice architecture (Payment Service + Wallet Service):
- 2PC: Too slow; locks are held too long.
- Saga Pattern: Preferred.
- Payment Service charges card.
- Wallet Service credits user.
- If Wallet fails? Trigger a "Compensating Transaction" (Refund card) in Payment Service.
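The saga steps above can be sketched with two in-memory stand-ins for the services. The class and method names are hypothetical, and a production saga would also persist its progress so it can resume after a crash, which this sketch omits.

```python
class WalletError(Exception):
    """Raised when the wallet service cannot apply a credit."""


class PaymentService:
    def __init__(self):
        self.log = []  # ordered record of card operations

    def charge(self, user, amount):
        self.log.append(("charge", user, amount))

    def refund(self, user, amount):
        self.log.append(("refund", user, amount))


class WalletService:
    def __init__(self, fail=False):
        self.fail = fail
        self.balances = {}

    def credit(self, user, amount):
        if self.fail:
            raise WalletError("wallet unavailable")
        self.balances[user] = self.balances.get(user, 0) + amount


def top_up_saga(payments, wallet, user, amount):
    # Step 1: local transaction in the Payment Service.
    payments.charge(user, amount)
    try:
        # Step 2: local transaction in the Wallet Service.
        wallet.credit(user, amount)
    except WalletError:
        # Compensating transaction: undo step 1 by refunding the card.
        payments.refund(user, amount)
        return "COMPENSATED"
    return "COMPLETED"
```

Unlike 2PC, neither service holds a lock while waiting for the other. The cost is that the system is briefly inconsistent between the charge and the refund.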
Summary
- Idempotency: The client generates a UUID to prevent double-processing.
- Double-Entry: Store immutable transaction logs, not mutable balances.
- Reconciliation: The ultimate source of truth is the external bank; always sync with them nightly.