Document Databases
Document databases (like MongoDB, CouchDB, Cosmos DB) store data in documents (JSON, BSON, XML) rather than rows and columns.
Key Characteristics
- Flexible Schema: You don't define the structure upfront. One document can have email and phone, while the next one only has email and twitter_handle.
- Hierarchical Data: You can store arrays and nested objects directly. In SQL, a "User" with 5 "Addresses" would require 2 tables (Users, Addresses). In Document DBs, it's just one object.
{
  "_id": 1,
  "name": "Alice",
  "addresses": [
    { "street": "123 Main St", "type": "Home" },
    { "street": "456 Work Ave", "type": "Office" }
  ]
}
Embed vs Reference: The Core Decision
The biggest decision in Document DBs is Embed vs Reference.
1. Embedding (Denormalization)
Storing related data inside the parent document.
Example: E-Commerce Products
{
  "_id": "prod_123",
  "name": "Laptop",
  "price": 999,
  "reviews": [ // Embedded
    {
      "user": "Alice",
      "rating": 5,
      "text": "Great laptop!",
      "date": "2024-01-15"
    },
    {
      "user": "Bob",
      "rating": 4,
      "text": "Good value"
    }
  ]
}
- Pros: Fast reads. One query gets everything. No JOINs.
- Cons: Data duplication. If you embed "Product Info" in every "Order", a product name change requires updating every order that contains it (potentially 10,000 orders).
- Best For: One-to-Few relationships, data that is read together and rarely changes.
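The update cost of embedding can be illustrated with a small in-memory sketch in plain JavaScript. No database is involved: the `orders` array stands in for a collection, and `renameEmbeddedProduct` is a hypothetical helper written only for this illustration.

```javascript
// In-memory illustration only: an array stands in for the orders collection.
// Each order embeds its own copy of the product name (denormalized).
const orders = [
  { _id: "order_1", product: { id: "prod_123", name: "Laptop" }, qty: 1 },
  { _id: "order_2", product: { id: "prod_123", name: "Laptop" }, qty: 2 },
  { _id: "order_3", product: { id: "prod_999", name: "Mouse" }, qty: 1 },
];

// Renaming a product means rewriting every order that embeds it.
function renameEmbeddedProduct(orderDocs, productId, newName) {
  let touched = 0;
  for (const order of orderDocs) {
    if (order.product.id === productId) {
      order.product.name = newName;
      touched += 1;
    }
  }
  return touched; // number of documents rewritten
}

const rewritten = renameEmbeddedProduct(orders, "prod_123", "Laptop Pro");
console.log(`Orders rewritten: ${rewritten}`); // Orders rewritten: 2
```

The number of writes grows with the number of orders that embed the product. If orders stored only a product ID, the same rename would touch a single product document.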
2. Referencing (Normalization)
Storing just the ID of the related data (like a Foreign Key).
Example: Blog Platform
// Posts collection
{
  "_id": "post_001",
  "title": "System Design 101",
  "author_id": "user_123", // Reference
  "content": "...",
  "comment_ids": ["comment_001", "comment_002"] // References
}
// Comments collection
{
  "_id": "comment_001",
  "post_id": "post_001",
  "user_id": "user_456",
  "text": "Great post!"
}
- Pros: No duplication. Smaller documents. Easy updates.
- Cons: Slower reads. Requires application-side "joins" (multiple queries).
- Best For: One-to-Many (Unbounded) relationships (e.g., a Viral Post with 1M comments - don't embed 1M comments!).
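An application-side "join" can be sketched in plain JavaScript as follows. This is an in-memory illustration, not driver code: the Maps and arrays stand in for collections, and the user names, the extra comments, and the `loadPostWithComments` helper are assumptions added for the example.

```javascript
// In-memory stand-ins for the posts, users, and comments collections.
const posts = new Map([
  ["post_001", { _id: "post_001", title: "System Design 101", author_id: "user_123" }],
]);
const users = new Map([
  ["user_123", { _id: "user_123", name: "Alice" }],
  ["user_456", { _id: "user_456", name: "Bob" }],
]);
const comments = [
  { _id: "comment_001", post_id: "post_001", user_id: "user_456", text: "Great post!" },
  { _id: "comment_002", post_id: "post_001", user_id: "user_123", text: "Thanks!" },
  { _id: "comment_003", post_id: "post_002", user_id: "user_456", text: "Unrelated" },
];

// Application-side join: one lookup per collection, stitched together in code.
function loadPostWithComments(postId) {
  const post = posts.get(postId); // lookup 1: the post
  if (!post) return null;
  const author = users.get(post.author_id); // lookup 2: the author
  const postComments = comments.filter((c) => c.post_id === postId); // lookup 3: its comments
  return { ...post, author, comments: postComments };
}

const page = loadPostWithComments("post_001");
console.log(page.author.name, page.comments.length); // Alice 2
```

Against a real database each lookup is a separate round trip, which is where the read cost comes from. MongoDB's `$lookup` aggregation stage can perform a comparable join on the server side.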
Advanced Patterns
Hybrid: Embed First N, Reference the Rest
{
  "_id": "post_viral",
  "title": "Going Viral",
  "comments_preview": [ // Embed first 10
    { "user": "Alice", "text": "First!" },
    // ... 9 more
  ],
  "total_comments": 50000,
  "comments_overflow_ref": "comments_post_viral" // Reference to overflow collection
}
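The write path for this hybrid layout might look like the following in-memory sketch. It is plain JavaScript with no database: `overflowComments` stands in for the overflow collection, and `addComment` and `PREVIEW_SIZE` are hypothetical names chosen for the example.

```javascript
const PREVIEW_SIZE = 10; // the "N" in "embed first N"

// In-memory stand-ins: one post document and a separate overflow collection.
const post = { _id: "post_viral", title: "Going Viral", comments_preview: [], total_comments: 0 };
const overflowComments = [];

function addComment(postDoc, overflow, comment) {
  if (postDoc.comments_preview.length < PREVIEW_SIZE) {
    postDoc.comments_preview.push(comment); // embedded in the post
  } else {
    overflow.push({ post_id: postDoc._id, ...comment }); // referenced by post_id
  }
  postDoc.total_comments += 1; // counter stays on the post for cheap display
}

for (let i = 1; i <= 25; i++) {
  addComment(post, overflowComments, { user: `user_${i}`, text: `Comment ${i}` });
}
console.log(post.comments_preview.length, overflowComments.length, post.total_comments); // 10 15 25
```

The post document stays bounded in size no matter how many comments arrive, while the first page of comments is still available in a single read.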
Bucketing Pattern
For time-series or unbounded arrays:
// Instead of one giant document with 100k events
{
  "user_id": "user_123",
  "year": 2024,
  "month": 1,
  "events": [
    { "type": "login", "timestamp": "2024-01-01T10:00:00Z" },
    // ... ~1000 events per month
  ]
}
Why? MongoDB has a 16MB document size limit.
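A minimal in-memory sketch of how events could be routed into monthly buckets is shown below. It is plain JavaScript rather than driver code: the `buckets` Map stands in for a collection, and `bucketIdFor` and `recordEvent` are hypothetical helpers. Bucketing by UTC month is an assumption of this sketch.

```javascript
// In-memory stand-in for an events collection, keyed by bucket id.
const buckets = new Map();

// One bucket per user per calendar month (UTC).
function bucketIdFor(userId, timestamp) {
  const d = new Date(timestamp);
  return `${userId}:${d.getUTCFullYear()}-${d.getUTCMonth() + 1}`;
}

function recordEvent(userId, event) {
  const id = bucketIdFor(userId, event.timestamp);
  if (!buckets.has(id)) {
    const d = new Date(event.timestamp);
    // First event of the month creates the bucket document.
    buckets.set(id, { user_id: userId, year: d.getUTCFullYear(), month: d.getUTCMonth() + 1, events: [] });
  }
  buckets.get(id).events.push(event);
}

recordEvent("user_123", { type: "login", timestamp: "2024-01-01T10:00:00Z" });
recordEvent("user_123", { type: "logout", timestamp: "2024-01-31T23:59:59Z" });
recordEvent("user_123", { type: "login", timestamp: "2024-02-01T00:00:00Z" });
console.log(buckets.size); // 2
```

With MongoDB itself, a similar effect is commonly achieved with an upsert that uses `$push` to append the event to the bucket matching `{ user_id, year, month }`.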
Indexing for Performance
Single Field Index
db.users.createIndex({ email: 1 }) // 1 = ascending
// Query uses index
db.users.find({ email: "alice@example.com" }) // Fast!
Compound Index
db.products.createIndex({ category: 1, price: -1 })
// This query uses the index efficiently
db.products.find({ category: "Electronics" }).sort({ price: -1 })
Multikey Index (Arrays)
MongoDB automatically makes an index multikey when the indexed field holds an array, adding one index entry per array element:
db.posts.createIndex({ tags: 1 })
// Works on arrays!
db.posts.find({ tags: "system-design" })
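Conceptually, a multikey index holds one entry per array element. The sketch below models that idea with a plain JavaScript Map. It is a simplification for intuition only (real indexes are ordered tree structures), and `postDocs` and `buildTagIndex` are hypothetical names.

```javascript
// Simplified model of a multikey index: one index entry per array element,
// each pointing back at the documents that contain that element.
const postDocs = [
  { _id: "post_001", tags: ["system-design", "databases"] },
  { _id: "post_002", tags: ["databases"] },
  { _id: "post_003", tags: ["system-design", "caching"] },
];

function buildTagIndex(docs) {
  const index = new Map(); // tag -> array of _id values
  for (const doc of docs) {
    for (const tag of new Set(doc.tags)) { // ignore repeated tags within one document
      if (!index.has(tag)) index.set(tag, []);
      index.get(tag).push(doc._id);
    }
  }
  return index;
}

const tagIndex = buildTagIndex(postDocs);
console.log(tagIndex.get("system-design")); // [ 'post_001', 'post_003' ]
```

A lookup by tag goes straight to the matching documents instead of scanning every post. The trade-off is that a document with many array elements produces many index entries.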
Text Search
db.posts.createIndex({ title: "text", content: "text" })
db.posts.find({ $text: { $search: "distributed systems" } })
Aggregation Pipeline
MongoDB's killer feature for analytics:
db.orders.aggregate([
  // Stage 1: Keep orders dated on or after 2024-01-01
  { $match: { date: { $gte: new Date("2024-01-01") } } },
  // Stage 2: Group by customer
  {
    $group: {
      _id: "$customer_id",
      total_spent: { $sum: "$amount" },
      order_count: { $sum: 1 }
    }
  },
  // Stage 3: Sort by total spent
  { $sort: { total_spent: -1 } },
  // Stage 4: Top 10 customers
  { $limit: 10 }
])
Output:
[
  { "_id": "cust_456", "total_spent": 15000, "order_count": 25 },
  { "_id": "cust_789", "total_spent": 12000, "order_count": 18 }
]
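To make the stage-by-stage flow concrete, here is a rough plain-JavaScript equivalent of the same four stages over an in-memory array. The sample orders and the `topCustomers` helper are made up for illustration. The real pipeline runs inside the database and does not work this way internally.

```javascript
// Hypothetical sample orders; field names follow the pipeline above.
const sampleOrders = [
  { customer_id: "cust_456", amount: 100, date: new Date("2024-03-01") },
  { customer_id: "cust_789", amount: 250, date: new Date("2024-04-15") },
  { customer_id: "cust_456", amount: 300, date: new Date("2024-06-10") },
  { customer_id: "cust_456", amount: 999, date: new Date("2023-12-31") }, // excluded by the date filter
];

function topCustomers(orderDocs, since, limit) {
  const groups = new Map();
  for (const o of orderDocs) {
    if (o.date < since) continue; // $match
    const g = groups.get(o.customer_id) ?? { _id: o.customer_id, total_spent: 0, order_count: 0 };
    g.total_spent += o.amount; // $sum: "$amount"
    g.order_count += 1; // $sum: 1
    groups.set(o.customer_id, g); // $group by customer_id
  }
  return [...groups.values()]
    .sort((a, b) => b.total_spent - a.total_spent) // $sort
    .slice(0, limit); // $limit
}

const top = topCustomers(sampleOrders, new Date("2024-01-01"), 10);
console.log(top); // cust_456 first (total_spent 400, 2 orders), then cust_789 (250, 1 order)
```

Each stage consumes the output of the previous one, which is why placing `$match` first matters: it shrinks the data before the more expensive grouping and sorting.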
Transactions (Modern MongoDB)
MongoDB 4.0+ supports multi-document ACID transactions on replica sets (4.2+ extends them to sharded clusters):
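// Assumes `client` is a connected MongoClient (Node.js driver) and `db = client.db(...)`.
const inventory = db.collection("inventory");
const session = client.startSession();
session.startTransaction();
try {
  // Transfer one laptop from WH1 to WH2
  await inventory.updateOne(
    { warehouse: "WH1", product: "laptop" },
    { $inc: { quantity: -1 } },
    { session }
  );
  await inventory.updateOne(
    { warehouse: "WH2", product: "laptop" },
    { $inc: { quantity: 1 } },
    { session }
  );
  await session.commitTransaction();
} catch (error) {
  await session.abortTransaction();
  throw error; // surface the failure instead of silently swallowing it
} finally {
  await session.endSession();
}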
Note: Multi-document transactions add latency and coordination overhead compared with single-document writes, which are already atomic. Use them sparingly.
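An alternative to calling commit and abort by hand is the Node.js driver's `session.withTransaction` helper, which commits when the callback resolves, aborts when it throws, and retries on certain transient errors. The sketch below assumes `client` is a connected `MongoClient`. The database name `"shop"`, the function name `transferLaptop`, and the stock check are assumptions added for illustration.

```javascript
// Sketch only: `client` is assumed to be a connected MongoClient from the
// official `mongodb` Node.js driver; "shop" is a hypothetical database name.
async function transferLaptop(client, fromWarehouse, toWarehouse) {
  const inventory = client.db("shop").collection("inventory");
  const session = client.startSession();
  try {
    // withTransaction starts the transaction, commits when the callback
    // resolves, and aborts when it throws.
    await session.withTransaction(async () => {
      const taken = await inventory.updateOne(
        { warehouse: fromWarehouse, product: "laptop", quantity: { $gte: 1 } },
        { $inc: { quantity: -1 } },
        { session }
      );
      if (taken.modifiedCount !== 1) {
        throw new Error(`No laptop stock in ${fromWarehouse}`); // triggers abort
      }
      await inventory.updateOne(
        { warehouse: toWarehouse, product: "laptop" },
        { $inc: { quantity: 1 } },
        { session }
      );
    });
  } finally {
    await session.endSession();
  }
}
```

A call such as `await transferLaptop(client, "WH1", "WH2")` either moves one unit or leaves both warehouses unchanged. The `quantity: { $gte: 1 }` filter is one way to keep stock from going negative.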
Real-World Use Cases
1. Content Management Systems
Why Document DB:
- Articles have varying fields (some have videos, some have galleries)
- Easy schema evolution
{
  "title": "System Design Guide",
  "author": "Alice",
  "sections": [
    { "type": "text", "content": "..." },
    { "type": "code", "language": "python", "code": "..." },
    { "type": "image", "url": "...", "caption": "..." }
  ]
}
2. Product Catalogs
Why Document DB:
- Different product types have different attributes
- SQL would require many NULL columns or an EAV (entity-attribute-value) pattern
3. User Profiles (Polymorphic Data)
// Free user
{ "type": "free", "name": "Bob", "email": "..." }
// Premium user - different schema!
{
  "type": "premium",
  "name": "Alice",
  "email": "...",
  "subscription": {
    "plan": "pro",
    "billing_cycle": "annual",
    "features": ["unlimited_storage", "priority_support"]
  }
}
Performance Best Practices
1. Projection (Fetch Only What You Need)
// Bad: Fetch entire document
db.users.find({ email: "alice@example.com" })
// Good: Project only needed fields
db.users.find(
  { email: "alice@example.com" },
  { name: 1, email: 1, _id: 0 } // Only name and email
)
2. Avoid Large Arrays in Documents
// Bad: Unbounded array (will hit 16MB limit)
{
  "user_id": "user_123",
  "activity_log": [
    // ... 1 million entries (100MB!)
  ]
}
// Good: Use bucketing or separate collection
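A quick back-of-the-envelope check can show how soon an unbounded array approaches the limit. The sketch below uses JSON byte length as a rough proxy for BSON size (the two are not identical), and the sample entry, including its `ip` field, is made up for illustration.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's 16MB document size limit

// Rough proxy: JSON byte length is not the same as BSON size, but it is
// close enough to see how fast an unbounded array grows.
function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Hypothetical activity log entry.
const sampleEntry = { type: "login", timestamp: "2024-01-01T10:00:00Z", ip: "203.0.113.7" };
const bytesPerEntry = approxBytes(sampleEntry);
const entriesUntilLimit = Math.floor(MAX_DOC_BYTES / bytesPerEntry);

console.log(`~${bytesPerEntry} bytes per entry, limit reached near ${entriesUntilLimit} entries`);
```

Even small entries reach the limit well before a million items, and large documents become slow to read and rewrite long before they hit it.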
3. Covered Queries
If an index contains every field the query filters on and returns, MongoDB can answer the query from the index alone without reading the documents:
db.users.createIndex({ email: 1, name: 1 })
// Covered! Only uses index
db.users.find(
  { email: "alice@example.com" },
  { name: 1, _id: 0 }
)
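The coverage rule can be expressed as a small check. The sketch below is a simplified model in plain JavaScript, not how the query planner actually works: it handles only top-level fields and inclusion projections and ignores additional rules (for example around multikey indexes). `isCovered` is a hypothetical helper.

```javascript
// Simplified check: a query is treated as covered when every filtered field
// and every returned field is part of the index.
function isCovered(indexFields, filter, projection) {
  const inIndex = (field) => indexFields.includes(field);
  const filterFields = Object.keys(filter);
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  // _id is returned by default unless the projection sets _id: 0.
  if (projection._id !== 0 && !returned.includes("_id")) returned.push("_id");
  return filterFields.every(inIndex) && returned.every(inIndex);
}

const index = ["email", "name"]; // fields of createIndex({ email: 1, name: 1 })

console.log(isCovered(index, { email: "alice@example.com" }, { name: 1, _id: 0 })); // true
console.log(isCovered(index, { email: "alice@example.com" }, { name: 1 })); // false: _id is returned
console.log(isCovered(index, { email: "alice@example.com" }, { name: 1, age: 1, _id: 0 })); // false: age is not indexed
```

The second case is the common mistake: forgetting `_id: 0` forces a document fetch because `_id` is not in the index. On a real deployment, `explain("executionStats")` reporting `totalDocsExamined: 0` is a way to confirm a query was covered.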
Common Pitfalls
⚠️ Over-Embedding: Embedding a product in 1M orders makes updates expensive.
⚠️ Under-Referencing: Not using references for large related datasets causes document bloat.
⚠️ Missing Indexes: Queries on unindexed fields scan entire collection (slow!).
⚠️ Ignoring Document Size Limit: 16MB max per document.
Interview Tips 💡
- Explain schema flexibility: "Product catalog with variable attributes fits document model naturally"
- Embed vs Reference decision: "Embed addresses (1-to-few), reference orders (1-to-many)"
- Real example: "Product catalogs are a commonly cited document database use case, because attributes vary by product type"
- Acknowledge limitations: "Document DBs aren't ideal for complex multi-collection joins"
- Mention aggregation: "MongoDB aggregation pipeline is like SQL GROUP BY on steroids"
MongoDB vs SQL Decision Matrix
| Use SQL When | Use Document DB When |
|---|---|
| Complex multi-table joins | Hierarchical/nested data |
| Fixed schema, rare changes | Schema changes frequently |
| Strong referential integrity | Flexible polymorphic data |
| Frequent multi-row, multi-table transactions | Mostly single-document atomic writes |
Related Concepts
- SQL vs NoSQL — Choosing the right database type
- Database Sharding — Horizontal partitioning strategies
- Database Replication — High availability patterns
- Graph Databases — Relationship-heavy data
- Vector Databases — Similarity search and embeddings