Finding the Top K (The "Trending" Problem)
In a stream of 1 Billion events, how do you find the 10 most frequent items using only 100MB of RAM? This is the core algorithm behind Twitter Trending, Google Search Suggestions, and DDoS Detection.
1. Requirements
- Input: A never-ending stream of strings (queries, IPs, hashtags).
- Output: The top 10 most frequent items in the last X minutes.
- Constraint: You cannot store every item. (10B items * 100 Bytes = 1TB RAM).
2. Approach 1: Batch Processing (MapReduce)
If we don't need real-time, we can use a distributed file system.
- Split: Divide logs into 100 files.
- Map: Emit
(Key, 1)for every occurrence. - Shuffle: Group by Key.
(Key, [1, 1, 1, 1]). - Reduce: Sum the lists.
(Key, Total). - Sort: Sort by Total and take Top 10.
- Con: High latency. Takes minutes/hours.
3. Approach 2: Streaming with Hash Map
Store HashMap<String, Integer>.
- Problem: In the "Long Tail" of the internet, there are billions of unique keys that appear once. Storing them fills up memory instantly.
4. Approach 3: Probabilistic Data Structures (Count-Min Sketch)
A Count-Min Sketch is a 2D array that trades accuracy for space. It is the "Bloom Filter" of counting.
Structure
- Matrix: A
Width x Deptharray of counters (e.g., 1000 x 5). Use 0 as initial value. - Hash Functions: 5 different hash functions ().
Algorithm
- Add(Item):
- Calculate row indices:
- Increment
Matrix[0][r1],Matrix[1][r2]...
- Estimate(Item):
- Get values at all 5 positions.
- Return the Minimum of these values.
- Why Min? Because collisions only add to the count. The true count cannot be higher than the minimum observed value.
Space Efficiency
With just a few kilobytes, you can count billions of items with 99.9% accuracy.
5. Sliding Windows
"Trending" means "Popular Right Now", not "Popular since 2010". We need to forget old data.
- Time Slicing:
- Create a CMS (Count-Min Sketch) for Minute 1, Minute 2, Minute 3.
- To get "Last 5 mins", merge the 5 sketches (Sum their cells).
- Exponential Decay:
- Multiply all counters by 0.9 every minute. Old values fade away.
6. Architecture (Design Twitter Trending)
- Ingestion: Tweets enter Kafka.
- Processor: Apache Flink / Spark Streaming workers read the stream.
- Aggregation:
- Workers maintain a local Count-Min Sketch in memory.
- Every 10 seconds, flush the Top 100 candidates to a central Redis.
- Storage: Redis Sorted Set (
ZSET).ZINCRBY trending "#systemdesign" 1
- API:
ZRANGE trending 0 9returns the global Top 10.
Summary
- Exact Count: Impossible with limited RAM.
- Count-Min Sketch: O(1) space and time, with controllable error rate.
- Lossy Counting: Another popular algorithm for this problem.
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Geohashing (Location Encoding)
A geocoding system that encodes latitude/longitude coordinates into short alphanumeric strings for efficient proximity searches and spatial indexing.
HyperLogLog (Cardinality Estimation)
A probabilistic algorithm for counting unique items in massive datasets using minimal memory, with less than 1% error using just kilobytes of space.
MapReduce
A programming model for processing massive datasets in parallel across distributed clusters. Understanding Map, Shuffle, Reduce with real-world use cases from Google, Hadoop, and Spark.