Designing a File Sync Service
Dropbox looks simple: a folder that appears everywhere. Under the hood, it is a complex distributed system that ensures an edit made on your laptop shows up on your phone almost immediately.
1. Requirements
Functional
- Sync: Add/Update/Delete files.
- History: Restore previous versions.
- Sharing: Share file with other users.
Non-Functional
- Consistency: Clients must view the same state of the file. No "conflict" files if possible.
- Bandwidth Efficiency: Don't upload the whole file if only one line changed.
- Reliability: 99.9999% durability. Do not lose user data.
2. High-Level Architecture
We decouple the "Data" from the "Metadata".
Components
- Block Server: Stores raw chunks of data (Blob Storage / S3). It doesn't know what a "file" is. It just knows Hash(Chunk) -> Bytes.
- Metadata Server: Knows the file system structure. "Folder A contains File B. File B is made of [Chunk1, Chunk2]".
- Synchronization Service: Handles the conversation between client and server to figure out which blocks need to be transferred.
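The split between data and metadata can be sketched in a few lines. This is a minimal in-memory illustration, not Dropbox's actual design: the class names `BlockStore` and `MetadataStore`, their methods, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Content-addressed store: maps Hash(chunk) -> bytes, with no notion of files."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        # Storing identical bytes twice keeps a single copy.
        self._blocks.setdefault(digest, chunk)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blocks[digest]

    def __len__(self) -> int:
        return len(self._blocks)


class MetadataStore:
    """Maps a file path to the ordered list of block hashes that make it up."""

    def __init__(self) -> None:
        self._files: dict[str, list[str]] = {}

    def set_blocks(self, path: str, block_hashes: list[str]) -> None:
        self._files[path] = list(block_hashes)

    def read_file(self, path: str, blocks: BlockStore) -> bytes:
        # Reassemble the file by fetching each block in order.
        return b"".join(blocks.get(h) for h in self._files[path])
```

The block store never sees a path and the metadata store never sees file bytes, so each side can be scaled and replicated on its own terms.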
3. The Magic: Chunking & Deduplication
How do we save bandwidth and storage?
Naive Approach
Upload the whole 100MB file every time it saves.
- Bad: Slow, destroys bandwidth.
Block-Level Deduplication
We split every file into fixed-size blocks (e.g., 4MB).
File A (10 MB) -> [Block 1, Block 2, Block 3]
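A minimal sketch of fixed-size chunking, assuming SHA-256 as the block hash (the article does not name one). The function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, the example block size used above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into consecutive blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash every block; this list is what the metadata layer would store."""
    return [hashlib.sha256(b).hexdigest() for b in split_into_blocks(data, block_size)]
```

For a 10 MB input this yields three blocks of 4 MB, 4 MB, and 2 MB, matching the File A example.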
Scenario 1: Small Edit
- User overwrites one character in the first paragraph (the file length stays the same).
- Only Block 1 changes. Block 2 and Block 3 remain identical.
- Client Uploads: Only the new Block 1'.
- Server Stores: Block 1' (New), Block 2 (Ref), Block 3 (Ref).
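The small-edit scenario can be sketched as a comparison of two block-hash lists. The helper names and the tiny 4-byte block size are assumptions chosen to keep the example readable.

```python
import hashlib


def hash_blocks(data: bytes, block_size: int) -> list[str]:
    """Hash consecutive fixed-size blocks of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_block_indexes(old: list[str], new: list[str]) -> list[int]:
    """Indexes in `new` whose hash differs from `old` at that position, or is new."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


old = hash_blocks(b"aaaabbbbcc", 4)
overwrite = hash_blocks(b"aXaabbbbcc", 4)   # one byte replaced in place
insertion = hash_blocks(b"Xaaaabbbbcc", 4)  # one byte inserted at the front
```

An in-place overwrite touches one block. An insertion shifts every later byte across the fixed boundaries, so every block hash changes; this is a known limitation of fixed-size chunking.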
Scenario 2: Cross-User Deduplication
- User A uploads movie.mkv.
- User B uploads movie.mkv.
movie.mkv. - Client B calculates hash of blocks. Sends hashes to server.
- Server says: "I already have these blocks from User A. No need to upload."
- Result: Near-instant upload, because only the hashes cross the network. Massive storage savings for Dropbox.
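The hash exchange in Scenario 2 might look like the following sketch. `DedupServer`, `missing`, `upload`, and `sync_file` are hypothetical names, and the server is an in-memory stand-in for a real block service.

```python
import hashlib


class DedupServer:
    """Hypothetical server holding one global block pool shared by all users."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> set[str]:
        """Return the subset of hashes the server does not have yet."""
        return {h for h in hashes if h not in self.blocks}

    def upload(self, chunk: bytes) -> None:
        self.blocks[hashlib.sha256(chunk).hexdigest()] = chunk


def sync_file(server: DedupServer, chunks: list[bytes]) -> int:
    """Upload only the blocks the server lacks; return the number of bytes sent."""
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = server.missing(hashes)
    sent = 0
    for h, chunk in zip(hashes, chunks):
        if h in needed:
            server.upload(chunk)
            needed.discard(h)  # a block repeated within the file is sent once
            sent += len(chunk)
    return sent
```

When User B syncs the same chunks after User A, `sync_file` returns 0: only the hash list was exchanged.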
4. Metadata Database (Namespace)
We need to store the file tree: directory structure, permissions, and version history.
- SQL (MySQL): Dropbox originally used MySQL. Why? Strong ACID consistency. If you move a folder, the database must reflect the new state immediately and atomically. No "Eventual Consistency" allowed here.
- Schema:
  - FileID: UUID
  - ParentID: UUID (Folder)
  - Version: Int
  - BlockList: List<Hash>
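The schema above can be sketched with SQLite from the Python standard library, used here only as a stand-in for MySQL. The `name` column, the JSON encoding of the block list, and the helper functions are assumptions added for illustration.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE files (
        file_id    TEXT PRIMARY KEY,   -- UUID
        parent_id  TEXT,               -- UUID of the containing folder
        name       TEXT NOT NULL,      -- assumed column, not in the schema above
        version    INTEGER NOT NULL,
        block_list TEXT NOT NULL       -- JSON-encoded list of block hashes
    )
    """
)


def add_file(parent_id: str, name: str, hashes: list[str]) -> str:
    file_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO files VALUES (?, ?, ?, 1, ?)",
            (file_id, parent_id, name, json.dumps(hashes)),
        )
    return file_id


def move_file(file_id: str, new_parent_id: str) -> None:
    # The parent change and version bump happen in one transaction,
    # so readers never observe a half-applied move.
    with conn:
        conn.execute(
            "UPDATE files SET parent_id = ?, version = version + 1 WHERE file_id = ?",
            (new_parent_id, file_id),
        )
```

Moving a file only rewrites its metadata row. No block data is touched.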
5. Synchronization Workflow
1. Client detects change
The Dropbox client uses inotify (Linux) or FSEvents (macOS) to watch the local file system.
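Binding to inotify or FSEvents requires platform-specific code, so the sketch below swaps in a portable polling approach: take two snapshots of a directory and diff them. It illustrates what "detecting a change" produces, not how the real client does it, and the function names are hypothetical.

```python
import os


def snapshot(root: str) -> dict[str, tuple[float, int]]:
    """Map each file path under root to (modification time, size)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_mtime, st.st_size)
    return state


def diff_snapshots(before: dict, after: dict) -> tuple[set, set, set]:
    """Return (added, modified, deleted) path sets between two snapshots."""
    added = set(after) - set(before)
    deleted = set(before) - set(after)
    modified = {p for p in set(before) & set(after) if before[p] != after[p]}
    return added, modified, deleted
```

An event-based watcher delivers the same three kinds of change without rescanning the tree, which matters for folders with many files.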
2. Client asks for instructions
Client talks to Sync Service:
"I have File A version 5. Server has version 6. What changed?"
3. Delta Sync (Rsync Algorithm)
Instead of re-downloading the whole block, sophisticated clients use a rolling hash (as in rsync) to download only the changed bytes within the block. In most cases, though, block-level replacement is sufficient.
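A rolling hash is useful because sliding the window by one byte costs O(1) instead of rehashing the whole window. The sketch below is an rsync-style weak checksum built from two running sums; the modulus and function names are assumptions, and a real delta-sync tool would pair this weak checksum with a stronger hash to confirm matches.

```python
MOD = 1 << 16


def weak_checksum(window: bytes) -> tuple[int, int]:
    """Compute the two running sums (a, b) over a window from scratch."""
    n = len(window)
    a = sum(window) % MOD
    # b weights earlier bytes more heavily: n for the first, 1 for the last.
    b = sum((n - i) * byte for i, byte in enumerate(window)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, window_len: int) -> tuple[int, int]:
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - window_len * out_byte + a) % MOD
    return a, b
```

Rolling the checksum across a buffer gives the same values as recomputing it at every offset, which is what lets a client search for matching regions at any byte position.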
4. Conflict Resolution
What if User A and User B edit the same file offline, then both go online?
- Strategy: "Last Write Wins" is dangerous, because the earlier write is silently discarded.
- Strategy: "Conflict File". Dropbox creates File A (User B's Conflicted Copy). We let the human resolve it.
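One way to detect the conflict is to have each client send the version its edit was based on. This is a sketch under that assumption; `ServerFile`, `commit`, and the `base_version` parameter are hypothetical names, not a documented Dropbox API.

```python
from dataclasses import dataclass


@dataclass
class ServerFile:
    name: str
    version: int


def commit(server: ServerFile, base_version: int, user: str) -> str:
    """Return the name the upload is stored under.

    base_version is the server version the client's edit started from.
    """
    if base_version == server.version:
        server.version += 1  # nobody else committed in between
        return server.name
    # Someone else committed first: keep both copies instead of overwriting.
    return f"{server.name} ({user}'s Conflicted Copy)"
```

If User A and User B both edit version 5 offline, the first commit moves the file to version 6 and the second is stored as a conflicted copy, so neither edit is lost.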
6. Cold Storage
Users rarely access old versions of files.
- Hot Storage (S3 Standard): Current file versions.
- Cold Storage (Glacier): Historical versions (v1, v2, v3). Cheaper, but slower to retrieve.
Summary
- Split data into blocks (4MB).
- Deduplicate blocks globally (hash-based).
- Store structure in a consistent SQL DB (Metadata).
- Sync only the delta blocks.