System Design: Notification System

How to send millions of SMS, Email, and Push notifications reliably. Message Queues, Rate Limiting, and Retry policies.

Designing a Scalable Notification Service

Sending one email is easy. Sending 10 million push notifications in 2 minutes for "Breaking News" is an architectural challenge.

1. Requirements

Functional

  • Send: Support Email (SES), SMS (Twilio), Push (FCM/APNs).
  • Bulk: Support "Broadcast" to all users.
  • Preferences: Respect opt-outs. Don't email a user who has unsubscribed.

Non-Functional

  • Reliability: Never lose a notification once it has been accepted (at-least-once delivery).
  • Rate Limiting: Don't get banned by Apple/Google/Twilio for spamming.
  • Latency: Breaking news must be delivered fast.

2. High-Level Architecture

We need to decouple the Producer (the service that wants to send a message) from the Consumer (the worker that calls the external API).

Components

  1. Notification Service: The API gateway. Receives POST /send.
  2. Message Queue (Kafka): Buffers requests, so a burst of 1M requests becomes backlog instead of overloading the workers.
  3. Workers: Pull from Kafka and call external APIs (FCM/Twilio).
  4. Rate Limiter: Controls the speed of workers.
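The decoupling described above can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes an in-memory `queue.Queue` standing in for Kafka, and `handle_send_request`, `worker`, and `run_demo` are hypothetical names chosen for the example.

```python
import queue
import threading

# In-memory stand-in for the Kafka topic (assumption: a real system would use a Kafka client).
notification_queue = queue.Queue()
sent = []  # records what the fake provider "delivered"


def handle_send_request(user_id, channel, body):
    """Hypothetical handler for POST /send: enqueue the request and return at once."""
    notification_queue.put({"user_id": user_id, "channel": channel, "body": body})
    return {"status": "accepted"}


def worker():
    """Pull messages off the queue and hand them to a (fake) external provider."""
    while True:
        message = notification_queue.get()
        if message is None:  # sentinel value: stop the worker
            break
        sent.append(message)  # placeholder for the real FCM/Twilio/SES call


def run_demo():
    thread = threading.Thread(target=worker)
    thread.start()
    for i in range(3):
        handle_send_request(f"user-{i}", "push", "Breaking News")
    notification_queue.put(None)  # ask the worker to shut down after the backlog
    thread.join()
    return sent
```

The producer never waits on the provider: `handle_send_request` returns as soon as the message is queued, and the worker drains the queue at its own pace.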

3. The Message Queue (Buffer)

Why Kafka?

  • Buffering: If the limit configured for Apple is 10k/sec but we receive 100k requests/sec, Kafka holds the backlog.
  • Topics:
    • topic-high-priority (OTP codes, Login alerts) -> High number of consumers.
    • topic-low-priority (Marketing emails) -> Fewer consumers.
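As a rough sketch of the two ideas above, the snippet below routes a notification to a topic by type and estimates how large the backlog grows during a spike. The type names in `HIGH_PRIORITY_TYPES` and both function names are assumptions made for illustration; only the topic names come from the list above.

```python
# Hypothetical notification types that count as high priority.
HIGH_PRIORITY_TYPES = {"otp", "login_alert"}


def topic_for(notification_type):
    """Pick the topic a notification should be published to."""
    if notification_type in HIGH_PRIORITY_TYPES:
        return "topic-high-priority"
    return "topic-low-priority"


def backlog_after(seconds, incoming_per_sec, drain_per_sec):
    """Messages waiting in the queue after a spike of the given length."""
    return max(0, (incoming_per_sec - drain_per_sec) * seconds)
```

With the numbers used above (100k/sec arriving, 10k/sec leaving), a 10-second spike leaves `backlog_after(10, 100_000, 10_000)` = 900,000 messages queued, which takes another 90 seconds to drain at 10k/sec once the spike ends.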

4. Reliability & Retry Mechanisms

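External services (Twilio, FCM) fail routinely. What if a worker calls Twilio and gets a 500 Internal Server Error? Transient failures like this (5xx responses, timeouts) are worth retrying; a permanent error such as an invalid phone number is not.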

The Retry Queue

  1. Worker fails to send email.
  2. Push the message to a Retry Queue with a delay (Exponential Backoff).
  3. Wait 1s, then 2s, then 4s, then 8s, then 16s (the delay doubles on each retry).
  4. After 5 retries, move to Dead Letter Queue (DLQ) for human inspection.
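The retry flow above might look like the following sketch. It assumes plain Python lists in place of real retry and dead letter queues, and `Message`, `RetryRouter`, and `backoff_delay` are hypothetical names. The fifth delay of 16s is simply the doubling pattern continued one more step.

```python
from dataclasses import dataclass, field

MAX_RETRIES = 5           # after this many retries the message goes to the DLQ
BASE_DELAY_SECONDS = 1.0  # delay before the first retry


def backoff_delay(retry_number):
    """Delay before retry N (1-based), doubling each time: 1s, 2s, 4s, 8s, 16s."""
    return BASE_DELAY_SECONDS * 2 ** (retry_number - 1)


@dataclass
class Message:
    msg_id: str
    retries: int = 0


@dataclass
class RetryRouter:
    # Lists stand in for real queues (assumption for the sketch).
    retry_queue: list = field(default_factory=list)        # (delay_seconds, message) pairs
    dead_letter_queue: list = field(default_factory=list)  # messages awaiting human inspection

    def on_failure(self, message):
        """Call when a send attempt fails: schedule a retry or give up."""
        if message.retries >= MAX_RETRIES:
            self.dead_letter_queue.append(message)
            return
        message.retries += 1
        self.retry_queue.append((backoff_delay(message.retries), message))
```

A worker would call `on_failure` only for transient errors; the delay stored next to each message tells the retry consumer how long to wait before trying again.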

5. Deduplication

  • Problem: Retries might cause duplicate emails if the failure was a network timeout (the server actually sent it, but the ACK was lost).
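  • Fix: Record every msg_id in a NotificationLog table with a unique key on id before sending.
    • INSERT INTO logs (id) VALUES (msg_id)
    • If the insert fails (Duplicate Key), the message was already handled, so skip the send.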
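Here is a small sketch of that check using Python's built-in `sqlite3` module as a stand-in for the NotificationLog database. The `logs` table and `id` column follow the INSERT statement above; `open_log`, `claim`, and `send_once` are hypothetical helper names.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open the log database and make sure the table with a unique id exists."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS logs (id TEXT PRIMARY KEY)")
    return conn


def claim(conn, msg_id):
    """Return True only for the first caller to record this msg_id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO logs (id) VALUES (?)", (msg_id,))
        return True
    except sqlite3.IntegrityError:  # duplicate key: already handled
        return False


def send_once(conn, msg_id, send):
    """Send the message unless its id is already in the log."""
    if not claim(conn, msg_id):
        return False
    send(msg_id)  # placeholder for the real provider call
    return True
```

One trade-off to keep in mind: because the id is recorded before the send, a worker that crashes between the insert and the send will cause later retries to be skipped. Recording after the send has the opposite risk (a possible duplicate), so the order is a design choice.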

6. Rate Limiting (Token Bucket)

[Figure: Token Bucket Algorithm. Capacity: 5 tokens; refill: 1 token/s; incoming requests consume tokens.]

We must protect third-party quotas.

  • Twilio Limit: 100 SMS/sec.
  • Worker Logic:
    • Before sending, Worker asks Rate Limiter (Redis): "Can I take a token?"
    • If Redis says "Yes" (a token is available, i.e. fewer than 100 sends so far this second), proceed.
    • If Redis says "No", Worker sleeps or re-queues the message.
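The token bucket itself is only a few lines. The sketch below keeps the bucket in process memory for clarity, which is an assumption: to share one limit across many workers, the same state would live in Redis as described above. `TokenBucket` and `try_acquire` are hypothetical names.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_second` rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = capacity  # start with a full bucket
        self.last_refill = clock()

    def try_acquire(self):
        """Take one token if available; return False if the caller must wait."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For the Twilio limit above, a worker could create `TokenBucket(capacity=100, refill_per_second=100)` and call `try_acquire()` before each SMS, sleeping or re-queueing the message when it returns `False`.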

7. Preference Service

Before sending "Marketing Email" to User A:

  1. Worker calls Preference Service.
  2. Checks: "Did User A unsubscribe from Marketing?"
  3. If yes, drop message silently.
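The three steps above can be sketched as a guard in front of the send call. The in-memory `OPT_OUTS` dictionary is an assumption standing in for the Preference Service, and `is_opted_out` and `deliver` are hypothetical names.

```python
# Hypothetical preference store: user_id -> categories the user has opted out of.
OPT_OUTS = {"user-a": {"marketing"}}


def is_opted_out(user_id, category):
    """Stand-in for a call to the Preference Service."""
    return category in OPT_OUTS.get(user_id, set())


def deliver(user_id, category, body, send):
    """Send unless the user unsubscribed from this category; return whether it was sent."""
    if is_opted_out(user_id, category):
        return False  # dropped silently: no send and no error
    send(user_id, body)  # placeholder for the real provider call
    return True
```

Here `deliver("user-a", "marketing", ...)` returns `False` without calling `send`, while a login alert for the same user still goes out.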

Summary

  1. Decouple: Use Queues (Kafka/RabbitMQ) to absorb spikes.
  2. Sort: Prioritize OTPs over Marketing.
  3. Protect: Rate limit your workers to respect external API limits.
  4. Retry: Use exponential backoff for resilience.
