Designing a Real-Time Chat System (WhatsApp/Telegram)
WhatsApp handles billions of messages per day. The core challenges are strict low latency, high delivery reliability (messages must not be lost), and privacy.
1. Requirements
Functional Requirements
- 1:1 Chat: Real-time message delivery between two users.
- Group Chat: Fan-out of messages to group members (with a capped group size, e.g. 256 or 1024 members).
- Status: Online/Offline/Last Seen.
- Delivery Receipts: Sent (Tick 1), Delivered (Tick 2), Read (Blue Ticks).
Non-Functional Requirements
- Low Latency: p99 delivery latency under 100 ms.
- Availability: High.
- Durability: Messages must persist until delivered.
2. High-Level Architecture
We cannot rely on standard HTTP polling: each poll is a fresh request, so delivery is delayed until the next poll, and frequent polling drains mobile batteries. Instead, clients hold a persistent, bidirectional WebSocket connection to the server.
Components
- Chat Service (Gateway): The stateful server that holds millions of WebSocket connections.
- Presence Service: Tracks whether each user is online or offline using heartbeats.
- Group Service: Manages group memberships.
- Message Queue: Kafka/RabbitMQ to decouple message ingestion from delivery.
- Database: NoSQL (Cassandra/HBase) or NewSQL (CockroachDB).
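As a rough illustration of the Presence Service's heartbeat tracking, here is a minimal in-memory sketch. The class name `PresenceTracker`, the 30-second timeout, and the injectable `clock` parameter are assumptions made for this example, not details of any real system; a production service would typically keep this state in a shared store rather than a local dictionary.

```python
import time


class PresenceTracker:
    """Illustrative heartbeat-based presence tracking (in-memory only)."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        # timeout_seconds is an assumed value: a user with no heartbeat
        # for longer than this is treated as offline.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_heartbeat = {}  # user_id -> time of the latest heartbeat

    def heartbeat(self, user_id):
        """Record a heartbeat sent by the user's client."""
        self.last_heartbeat[user_id] = self.clock()

    def is_online(self, user_id):
        """A user is online if a heartbeat arrived within the timeout."""
        last = self.last_heartbeat.get(user_id)
        if last is None:
            return False
        return (self.clock() - last) <= self.timeout_seconds

    def last_seen(self, user_id):
        """Return the time of the latest heartbeat, or None if never seen."""
        return self.last_heartbeat.get(user_id)
```

The same timestamp serves both the Online/Offline flag and the "Last Seen" value, which is why a single map is enough here.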
3. Communication Protocol
Why not raw WebSocket?
A raw WebSocket only carries opaque text or binary frames. It defines no message types, message IDs, or acknowledgements, so we need a sub-protocol for structure.
- Use XMPP?: Good for presence, but XML is heavy/verbose.
- Use MQTT?: Lightweight and binary, which suits mobile bandwidth and battery life. Facebook Messenger has publicly described using MQTT, and WhatsApp reportedly uses a customized, binary-encoded variant of XMPP.
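To make the idea of a compact binary sub-protocol concrete, the sketch below frames a chat message with a fixed header. The layout (1-byte type, 8-byte message ID, 2-byte payload length) and the names `encode_frame`, `decode_frame`, and `MSG_TEXT` are invented for illustration; this is not the MQTT wire format or any real messenger's protocol.

```python
import struct

# Hypothetical header: message type (1 byte), message ID (8 bytes),
# payload length (2 bytes), all in network byte order.
HEADER = struct.Struct("!BQH")
MSG_TEXT = 1  # assumed type code for a text message


def encode_frame(msg_type, msg_id, payload):
    """Serialize one message into header + payload bytes."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    return HEADER.pack(msg_type, msg_id, len(payload)) + payload


def decode_frame(frame):
    """Parse a frame produced by encode_frame."""
    msg_type, msg_id, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return msg_type, msg_id, payload
```

With this layout the header costs 11 bytes per message, which is far smaller than an equivalent XML stanza.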
4. Message Flow
Scenario 1: User A sends to User B (Both Online)
- User A sends message to Chat Server 1 (via persistent WebSocket).
- Chat Server 1 acknowledges receipt to A ("Sent" tick).
- Chat Server 1 asks Discovery Service: "Which server holds User B's connection?"
- Service replies: "User B is on Chat Server 5."
- Chat Server 1 forwards message to Chat Server 5 (via RPC or Redis Pub/Sub).
- Chat Server 5 pushes message to User B (via WebSocket).
- User B sends ACK.
- Chat Server 5 forwards ACK back to A ("Delivered" tick).
Scenario 2: User B is Offline
- Chat Server 1 fails to find an active connection for B.
- Chat Server 1 stores the message in the Unread Database (or Queue).
- Push Notification Service (APNS/FCM) is triggered to wake up B's phone.
- When B comes online next time, they sync with the Unread Database.
- After sync, messages are deleted from server (WhatsApp Architecture) or kept (Telegram Architecture).
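The offline path can be modelled as a per-user pending queue that is drained on reconnect. This is a sketch under assumptions: `OfflineInbox` and its `delete_after_sync` flag are invented names, and `push_log` merely records where a real system would call APNS or FCM.

```python
from collections import defaultdict, deque


class OfflineInbox:
    """Toy unread-message store for users with no active connection."""

    def __init__(self, delete_after_sync=True):
        # delete_after_sync=True models the delete-on-delivery approach;
        # False models keeping history on the server.
        self.delete_after_sync = delete_after_sync
        self.pending = defaultdict(deque)  # user_id -> queued messages
        self.push_log = []                 # stand-in for push notification calls

    def store(self, recipient, message):
        """Queue a message and note that a push notification is needed."""
        self.pending[recipient].append(message)
        self.push_log.append(recipient)

    def sync(self, recipient):
        """Return queued messages in arrival order when the user reconnects."""
        messages = list(self.pending[recipient])
        if self.delete_after_sync:
            self.pending[recipient].clear()
        return messages
```

The flag captures the design choice described above: deleting after sync keeps server storage small, while retaining messages allows multi-device history at the cost of storing more data.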
5. End-to-End Encryption (E2E)
WhatsApp servers cannot read your messages. How?
The Signal Protocol
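The Signal Protocol combines the X3DH key agreement (to set up a session) with the Double Ratchet algorithm (to rotate keys as messages flow).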
- Public Keys: When User A registers, they upload an "Identity Key" and a bundle of "Pre-Keys" to the server.
- Session Setup: When A wants to text B for the first time, A downloads B's pre-key bundle from the server and derives a shared secret Session Key locally on their device (X3DH Key Agreement).
- Message Encryption: A encrypts message M with a key derived from the Session Key.
- Forward Secrecy: The key changes for every single message, and old keys are discarded. If an attacker steals your current key tomorrow, they cannot decrypt messages from yesterday.
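The "key changes for every message" property comes from a one-way key derivation chain. The sketch below shows only that symmetric chain, using HMAC-SHA256 with two distinct constants to derive a message key and the next chain key. It deliberately omits the Diffie-Hellman ratchet, X3DH, and the actual encryption step, so it is not a usable implementation of the Signal Protocol; the function names are illustrative.

```python
import hashlib
import hmac


def ratchet_step(chain_key):
    """Advance the chain once: return (next_chain_key, message_key)."""
    # Distinct constants keep the two outputs independent of each other.
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key, count):
    """Derive `count` successive per-message keys from a starting chain key."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        chain_key, message_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys
```

Because HMAC is one-way, someone who obtains the current chain key cannot run the chain backwards to recover earlier message keys, provided the earlier keys were deleted after use.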
6. Group Chat Optimization
Consider sending a message to a group of 500 people.
Naive: Client-Side Fan-out
User A sends 500 individual messages.
- Bad: Kills User A's bandwidth/battery.
Improved: Server-Side Fan-out
User A sends 1 message to Server. Server looks up Group Members (500 IDs). Server loops and sends 500 messages.
- Issue: A single serial loop is slow, so members at the end of the list receive the message noticeably later.
Advanced: Hybrid / Multicast
Large groups are sharded. The server publishes the message once to a "Group Channel" in a Pub/Sub system (e.g. Kafka). Separate consumer workers each pick up a shard of the group's member list and push the message to those members in parallel, retrying failed deliveries.
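A small sketch of the sharded fan-out idea follows. A thread pool stands in for independent Pub/Sub consumer workers, and the `deliver` callback stands in for pushing over a user's connection. The round-robin sharding scheme and the function names are assumptions for illustration, and retries are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_members(member_ids, num_shards):
    """Split the member list into num_shards roughly equal shards."""
    shards = [[] for _ in range(num_shards)]
    for index, member_id in enumerate(member_ids):
        shards[index % num_shards].append(member_id)
    return shards


def fan_out(message, member_ids, num_shards, deliver):
    """Deliver one message to all members, one worker per shard.

    Returns the number of deliveries attempted.
    """
    def worker(shard):
        for member_id in shard:
            deliver(member_id, message)
        return len(shard)

    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return sum(pool.map(worker, shard_members(member_ids, num_shards)))
```

With 500 members and 8 shards, each worker handles at most 63 members, so the slowest path is roughly one eighth of the serial loop.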
Summary
| Feature | Technology |
|---|---|
| Protocol | MQTT / WebSocket |
| Database | Cassandra / HBase (Write Heavy) |
| Offline | Unread Message Store (or Queue) + Push Notifications |
| Encryption | Signal Protocol (Double Ratchet) |