Document Databases

Document databases (such as MongoDB, CouchDB, and Cosmos DB) store data as self-contained documents (typically JSON, BSON, or XML) rather than as rows and columns.

Key Characteristics

Flexible Schema: You don't have to define the structure upfront. One document can have email and phone, while the next one has only email and twitter_handle.
Hierarchical Data: You can store arrays and nested objects directly. In SQL, a "User" with 5 "Addresses" would typically require two tables (Users, Addresses). In a document database, it's just one object.

json

{
  "_id": 1,
  "name": "Alice",
  "addresses": [
    { "street": "123 Main St", "type": "Home" },
    { "street": "456 Work Ave", "type": "Office" }
  ]
}

Embed vs Reference: The Core Decision

The most consequential modeling decision in a document database is whether to embed related data or reference it.

1. Embedding (Denormalization)

Storing related data inside the parent document.

Example: E-Commerce Products

javascript

{
  "_id": "prod_123",
  "name": "Laptop",
  "price": 999,
  "reviews": [  // Embedded
    {
      "user": "Alice",
      "rating": 5,
      "text": "Great laptop!",
      "date": "2024-01-15"
    },
    {
      "user": "Bob",
      "rating": 4,
      "text": "Good value"
    }
  ]
}

Pros: Fast reads. One query gets everything. No JOINs.
Cons: Data duplication. If you embed "Product Info" in every "Order", a product name change requires updating every order that contains it (for example, 10,000 orders).
Best For: One-to-Few relationships, data that is read together and rarely changes.
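
To make the duplication cost concrete, here is a minimal in-memory sketch in plain JavaScript (no database involved). The `orders` array and the `renameEmbeddedProduct` helper are hypothetical; the only point is that every order carrying its own copy of the product has to be rewritten.

```javascript
// Hypothetical orders, each embedding its own copy of the product info.
const orders = [
  { _id: "order_1", product: { id: "prod_123", name: "Laptop" } },
  { _id: "order_2", product: { id: "prod_123", name: "Laptop" } },
  { _id: "order_3", product: { id: "prod_999", name: "Mouse" } },
];

// Renaming a product means touching every order that embeds it.
function renameEmbeddedProduct(orders, productId, newName) {
  let touched = 0;
  for (const order of orders) {
    if (order.product.id === productId) {
      order.product.name = newName;
      touched += 1;
    }
  }
  return touched; // number of documents rewritten
}

const touched = renameEmbeddedProduct(orders, "prod_123", "Laptop Pro");
console.log(touched); // 2
```

In MongoDB the same fan-out can be issued as a single `updateMany` call, but the server still rewrites every matching order.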

2. Referencing (Normalization)

Storing just the ID of the related data (like a Foreign Key).

Example: Blog Platform

javascript

// Posts collection
{
  "_id": "post_001",
  "title": "System Design 101",
  "author_id": "user_123",  // Reference
  "content": "...",
  "comment_ids": ["comment_001", "comment_002"]  // References
}

// Comments collection
{
  "_id": "comment_001",
  "post_id": "post_001",
  "user_id": "user_456",
  "text": "Great post!"
}

Pros: No duplication. Smaller documents. Easy updates.
Cons: Slower reads. Requires application-side "joins" (multiple queries).
Best For: Unbounded one-to-many relationships (e.g., a viral post with 1M comments; don't embed 1M comments!).
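
The application-side "join" mentioned above can be sketched with in-memory stand-ins for the two collections. Everything here is illustrative: `loadPostWithComments`, the `limit` option, and the extra sample comments are assumptions, not part of the original example.

```javascript
// Hypothetical in-memory stand-ins for the posts and comments collections.
const posts = new Map([
  ["post_001", { _id: "post_001", title: "System Design 101", author_id: "user_123" }],
]);
const comments = [
  { _id: "comment_001", post_id: "post_001", user_id: "user_456", text: "Great post!" },
  { _id: "comment_002", post_id: "post_001", user_id: "user_789", text: "Thanks!" },
  { _id: "comment_003", post_id: "post_002", user_id: "user_456", text: "Unrelated" },
];

// Application-side "join": one lookup for the post, a second for its comments.
function loadPostWithComments(postId, { limit = 20 } = {}) {
  const post = posts.get(postId);          // query 1: the parent document
  if (!post) return null;
  const postComments = comments
    .filter((c) => c.post_id === postId)   // query 2: children found via the reference
    .slice(0, limit);                      // page the unbounded side
  return { ...post, comments: postComments };
}

const page = loadPostWithComments("post_001", { limit: 1 });
console.log(page.comments.length); // 1
```

Paging the second lookup is what keeps a post with a very large number of comments cheap to read.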

Advanced Patterns

Hybrid: Embed First N, Reference the Rest

javascript

{
  "_id": "post_viral",
  "title": "Going Viral",
  "comments_preview": [  // Embed first 10
    { "user": "Alice", "text": "First!" },
    // ... 9 more
  ],
  "total_comments": 50000,
  "comments_overflow_ref": "comments_post_viral"  // Reference to overflow collection
}
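
One way the write path for this pattern could look, sketched in plain JavaScript with in-memory stand-ins. The `addComment` helper, the `PREVIEW_SIZE` constant, and the `overflow` array are hypothetical and not part of any driver.

```javascript
const PREVIEW_SIZE = 10; // mirrors the "embed first 10" idea

// Hypothetical stand-ins for the post document and the overflow collection.
const post = { _id: "post_viral", comments_preview: [], total_comments: 0 };
const overflow = [];

function addComment(post, overflow, comment) {
  if (post.comments_preview.length < PREVIEW_SIZE) {
    post.comments_preview.push(comment);              // embedded: read with the post
  } else {
    overflow.push({ post_id: post._id, ...comment }); // referenced: read on demand
  }
  post.total_comments += 1;
}

for (let i = 1; i <= 12; i++) {
  addComment(post, overflow, { user: `user_${i}`, text: `Comment ${i}` });
}
console.log(post.comments_preview.length, overflow.length, post.total_comments); // 10 2 12
```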

Bucketing Pattern

For time-series or unbounded arrays:

javascript

// Instead of one giant document with 100k events
{
  "user_id": "user_123",
  "year": 2024,
  "month": 1,
  "events": [
    { "type": "login", "timestamp": "2024-01-01T10:00:00Z" },
    // ... ~1000 events per month
  ]
}

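Why? MongoDB limits each document to 16MB, so an array that grows without bound will eventually make writes fail. Bucketing caps how large any single document can get.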
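
A minimal sketch of how events might be routed into monthly buckets, using an in-memory `Map` in place of a collection. The `bucketKey` and `recordEvent` helpers and the string key format are assumptions made for illustration.

```javascript
// Derive the bucket an event belongs to: one bucket per user per month (UTC).
function bucketKey(userId, timestamp) {
  const d = new Date(timestamp);
  return { user_id: userId, year: d.getUTCFullYear(), month: d.getUTCMonth() + 1 };
}

// Hypothetical in-memory bucket store keyed by "user:year:month".
const buckets = new Map();

function recordEvent(userId, event) {
  const key = bucketKey(userId, event.timestamp);
  const id = `${key.user_id}:${key.year}:${key.month}`;
  if (!buckets.has(id)) buckets.set(id, { ...key, events: [] }); // create bucket on first event
  buckets.get(id).events.push(event);
}

recordEvent("user_123", { type: "login", timestamp: "2024-01-01T10:00:00Z" });
recordEvent("user_123", { type: "logout", timestamp: "2024-01-31T23:59:59Z" });
recordEvent("user_123", { type: "login", timestamp: "2024-02-01T00:00:00Z" });
console.log(buckets.size); // 2
```

Against a real collection, the create-or-append step usually maps onto an `updateOne` with `$push` and `upsert: true`, filtered on the bucket key.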

Indexing for Performance

Single Field Index

javascript

db.users.createIndex({ email: 1 })  // 1 = ascending

// Query uses index
db.users.find({ email: "alice@example.com" })  // Fast!

Compound Index

javascript

db.products.createIndex({ category: 1, price: -1 })

// This query uses the index efficiently
db.products.find({ category: "Electronics" }).sort({ price: -1 })

Multikey Index (Arrays)

When an indexed field holds an array, MongoDB automatically makes the index multikey, with one index entry per array element:

javascript

db.posts.createIndex({ tags: 1 })

// Works on arrays!
db.posts.find({ tags: "system-design" })

Text Search

javascript

db.posts.createIndex({ title: "text", content: "text" })

db.posts.find({ $text: { $search: "distributed systems" } })

Aggregation Pipeline

MongoDB's killer feature for analytics:

javascript

db.orders.aggregate([
  // Stage 1: Filter orders dated 2024-01-01 or later
  { $match: { date: { $gte: new Date("2024-01-01") } } },
  
  // Stage 2: Group by customer
  {
    $group: {
      _id: "$customer_id",
      total_spent: { $sum: "$amount" },
      order_count: { $sum: 1 }
    }
  },
  
  // Stage 3: Sort by total spent
  { $sort: { total_spent: -1 } },
  
  // Stage 4: Top 10 customers
  { $limit: 10 }
])

Example output (illustrative values):

javascript

[
  { "_id": "cust_456", "total_spent": 15000, "order_count": 25 },
  { "_id": "cust_789", "total_spent": 12000, "order_count": 18 }
]
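
To see what each stage contributes, the same logic can be written as plain JavaScript over an in-memory array. This is only a mental model of the pipeline, not how MongoDB executes it; the sample `orders` and the `topCustomers` helper are hypothetical.

```javascript
// Hypothetical orders; field names mirror the pipeline.
const orders = [
  { customer_id: "cust_1", amount: 100, date: new Date("2024-03-01") },
  { customer_id: "cust_2", amount: 250, date: new Date("2024-04-10") },
  { customer_id: "cust_1", amount: 50, date: new Date("2024-05-05") },
  { customer_id: "cust_1", amount: 999, date: new Date("2023-12-31") }, // filtered out
];

function topCustomers(orders, since, limit) {
  const totals = new Map();
  for (const o of orders) {
    if (o.date < since) continue;                   // $match
    const t = totals.get(o.customer_id) ??
      { _id: o.customer_id, total_spent: 0, order_count: 0 };
    t.total_spent += o.amount;                      // $sum: "$amount"
    t.order_count += 1;                             // $sum: 1
    totals.set(o.customer_id, t);                   // $group by customer_id
  }
  return [...totals.values()]
    .sort((a, b) => b.total_spent - a.total_spent)  // $sort descending
    .slice(0, limit);                               // $limit
}

const result = topCustomers(orders, new Date("2024-01-01"), 10);
console.log(result.map((r) => r._id)); // [ 'cust_2', 'cust_1' ]
```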

Transactions (Modern MongoDB)

MongoDB 4.0+ supports multi-document ACID transactions on replica sets (4.2+ extends them to sharded clusters):

javascript

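// Assumes the Node.js driver: `client` is a connected MongoClient, `db = client.db(...)`
const inventory = db.collection("inventory");
const session = client.startSession();

try {
  session.startTransaction();

  // Transfer inventory: take one laptop out of WH1...
  await inventory.updateOne(
    { warehouse: "WH1", product: "laptop" },
    { $inc: { quantity: -1 } },
    { session }
  );

  // ...and add it to WH2
  await inventory.updateOne(
    { warehouse: "WH2", product: "laptop" },
    { $inc: { quantity: 1 } },
    { session }
  );

  await session.commitTransaction();
} catch (error) {
  await session.abortTransaction();
  throw error;  // surface the failure instead of swallowing it
} finally {
  await session.endSession();
}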

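Note: Multi-document transactions add overhead compared with single-document writes, which are always atomic. Use them sparingly, and prefer a schema that keeps fields that change together in one document.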
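
As a sketch of the single-document alternative, suppose (hypothetically) the schema kept one document per product with per-warehouse counts in a `stock` sub-document. The transfer then becomes one atomic update and needs no transaction. The `buildTransfer` helper and the `stock` field are assumptions, not part of the example above.

```javascript
// Assumed alternative schema (hypothetical): one document per product, e.g.
// { product: "laptop", stock: { WH1: 5, WH2: 2 } }

// Build the filter and update for moving `qty` units between warehouses.
function buildTransfer(product, from, to, qty) {
  return {
    // Match only if the source warehouse has enough stock.
    filter: { product, [`stock.${from}`]: { $gte: qty } },
    // Both counters live in one document, so this write is atomic on its own.
    update: { $inc: { [`stock.${from}`]: -qty, [`stock.${to}`]: qty } },
  };
}

const { filter, update } = buildTransfer("laptop", "WH1", "WH2", 1);
console.log(JSON.stringify(update)); // {"$inc":{"stock.WH1":-1,"stock.WH2":1}}
```

The pair can be passed straight to `updateOne(filter, update)`; a `matchedCount` of 0 in the result would mean the source warehouse did not have enough stock.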

Real-World Use Cases

1. Content Management Systems

Why Document DB:

Articles have varying fields (some have videos, some have galleries)
Easy schema evolution

javascript

{
  "title": "System Design Guide",
  "author": "Alice",
  "sections": [
    { "type": "text", "content": "..." },
    { "type": "code", "language": "python", "code": "..." },
    { "type": "image", "url": "...", "caption": "..." }
  ]
}

2. Product Catalogs

Why Document DB:

Different product types have different attributes
SQL would require many NULL columns or an EAV (Entity-Attribute-Value) pattern
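
A small illustration of what "different attributes per product type" can look like. The product documents, the `attributes` field, and the `attributesByType` helper are all hypothetical.

```javascript
// Hypothetical catalog: each product type carries its own attributes.
const products = [
  { _id: "prod_1", type: "laptop", name: "Laptop", price: 999,
    attributes: { ram_gb: 16, screen_in: 14 } },
  { _id: "prod_2", type: "shirt", name: "T-Shirt", price: 20,
    attributes: { size: "M", color: "blue" } },
];

// Collect the attribute names seen for each product type.
function attributesByType(products) {
  const byType = {};
  for (const p of products) {
    if (!byType[p.type]) byType[p.type] = new Set();
    for (const name of Object.keys(p.attributes)) byType[p.type].add(name);
  }
  // Convert each Set to a sorted array for stable output.
  return Object.fromEntries(
    Object.entries(byType).map(([type, names]) => [type, [...names].sort()])
  );
}

const summary = attributesByType(products);
console.log(summary);
```

In a relational table, the union of all these attribute names would become columns, most of them NULL for any given row.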

3. User Profiles (Polymorphic Data)

javascript

// Free user
{ "type": "free", "name": "Bob", "email": "..." }

// Premium user - different schema!
{
  "type": "premium",
  "name": "Alice",
  "email": "...",
  "subscription": {
    "plan": "pro",
    "billing_cycle": "annual",
    "features": ["unlimited_storage", "priority_support"]
  }
}
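
The flip side of polymorphic documents is that application code has to tolerate both shapes. A minimal sketch, where `featuresFor` is a hypothetical helper:

```javascript
// Hypothetical helper: read features from a user document of either shape.
function featuresFor(user) {
  // Optional chaining covers free users, who have no `subscription` field.
  return user.subscription?.features ?? [];
}

const free = { type: "free", name: "Bob", email: "..." };
const premium = {
  type: "premium", name: "Alice", email: "...",
  subscription: {
    plan: "pro",
    billing_cycle: "annual",
    features: ["unlimited_storage", "priority_support"],
  },
};

console.log(featuresFor(free).length, featuresFor(premium).length); // 0 2
```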

Performance Best Practices

1. Projection (Fetch Only What You Need)

javascript

// Bad: Fetch entire document
db.users.find({ email: "alice@example.com" })

// Good: Project only needed fields
db.users.find(
  { email: "alice@example.com" },
  { name: 1, email: 1, _id: 0 }  // Only name and email
)

2. Avoid Large Arrays in Documents

javascript

// Bad: Unbounded array (will hit 16MB limit)
{
  "user_id": "user_123",
  "activity_log": [
    // ... 1 million entries (100MB!)
  ]
}

// Good: Use bucketing or separate collection

3. Covered Queries

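If an index contains every field the query filters on and returns, MongoDB can answer from the index alone without reading the documents. Because _id is returned by default, it must be excluded from the projection unless it is part of the index: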

javascript

db.users.createIndex({ email: 1, name: 1 })

// Covered! Only uses index
db.users.find(
  { email: "alice@example.com" },
  { name: 1, _id: 0 }
)
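
The rule can be expressed as a rough check. This is a deliberately simplified sketch: the hypothetical `isCovered` helper handles only flat field names and inclusion projections, and ignores cases such as array (multikey) fields and logical operators like `$or`.

```javascript
// Simplified check: could this query be answered from the index alone?
function isCovered(indexFields, filter, projection) {
  const indexed = new Set(indexFields);
  // Every filtered field must be in the index.
  const filterOk = Object.keys(filter).every((f) => indexed.has(f));
  // Every returned field must be in the index.
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  const returnedOk = returned.length > 0 && returned.every((f) => indexed.has(f));
  // _id comes back by default, so it must be excluded or indexed.
  const idOk = projection._id === 0 || indexed.has("_id");
  return filterOk && returnedOk && idOk;
}

const index = ["email", "name"];
console.log(isCovered(index, { email: "alice@example.com" }, { name: 1, _id: 0 })); // true
console.log(isCovered(index, { email: "alice@example.com" }, { name: 1 }));         // false
```

On a real deployment, `explain("executionStats")` is the authoritative check: a covered query reports `totalDocsExamined: 0`.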

Common Pitfalls

⚠️ Over-Embedding: Embedding a product in 1M orders makes updates expensive.

⚠️ Under-Referencing: Not using references for large related datasets causes document bloat.

⚠️ Missing Indexes: Queries on unindexed fields scan the entire collection (slow!).

⚠️ Ignoring Document Size Limit: 16MB max per document.
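
A rough guard an application might run before writing a growing document. It measures JSON text size, which only approximates the stored BSON size, and the `approxDocBytes` and `isNearLimit` helpers and the 80% threshold are assumptions for illustration. It uses Node.js's global `Buffer`.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's 16MB per-document limit

// Rough estimate only: JSON text size, not the exact BSON size.
function approxDocBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Flag documents that have used up most of the allowed size.
function isNearLimit(doc, threshold = 0.8) {
  return approxDocBytes(doc) >= MAX_DOC_BYTES * threshold;
}

console.log(approxDocBytes({ a: 1 }));            // 7
console.log(isNearLimit({ user_id: "user_123" })); // false
```

For exact figures on stored documents, MongoDB 4.4+ provides the `$bsonSize` aggregation operator.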

Interview Tips 💡

Explain schema flexibility: "Product catalog with variable attributes fits document model naturally"
Embed vs Reference decision: "Embed addresses (1-to-few), reference orders (1-to-many)"
Real example: "An e-commerce product catalog, where each category has its own attributes, is a common fit for a document store"
Acknowledge limitations: "Document DBs aren't ideal for complex multi-collection joins"
Mention aggregation: "MongoDB aggregation pipeline is like SQL GROUP BY on steroids"

MongoDB vs SQL Decision Matrix

Use SQL When	Use Document DB When
Complex multi-table joins	Hierarchical/nested data
Fixed schema, rare changes	Schema changes frequently
Strong referential integrity	Flexible polymorphic data
Transactions that routinely span many tables	Writes that mostly touch one document, with occasional multi-document transactions

Related Concepts

SQL vs NoSQL — Choosing the right database type
Database Sharding — Horizontal partitioning strategies
Database Replication — High availability patterns
Graph Databases — Relationship-heavy data
Vector Databases — Similarity search and embeddings
