The Most Important Decision
Every system starts with: "How do we store data?" Wrong choice: - Slow queries - Expensive rewrites - Scalability bottlenecks - Engineering frustration Right choice: - Natural data model - Fast queries - Easy scaling - Happy team
The Two Worlds
1. Relational (SQL)
Examples: PostgreSQL, MySQL, Oracle, SQL Server
Philosophy: Structure first, flexibility later Data model: Tables with rows/columns Schema: Strict (defined upfront) Relationships: Foreign keys + JOINs Query language: SQL Scaling: Primarily vertical Consistency: ACID guarantees
2. Non-Relational (NoSQL)
Examples: MongoDB, Cassandra, Redis, Neo4j
Philosophy: Flexibility first, structure optional Data model: Varies (documents, key-value, graphs, columns) Schema: Dynamic or schema-less Relationships: Embedded or denormalized Query language: Varies by database Scaling: Primarily horizontal Consistency: Eventually consistent (BASE)
SQL Databases In-Depth
Strengths
✅ ACID Transactions
BEGIN TRANSACTION; UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2; COMMIT; -- Guaranteed: Both happen or neither happens
✅ Complex Queries
-- Find top customers who bought products in multiple categories
SELECT
c.name,
COUNT(DISTINCT p.category) AS categories_purchased,
SUM(o.total) AS total_spent
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
GROUP BY c.id
HAVING COUNT(DISTINCT p.category) >= 3
ORDER BY total_spent DESC
LIMIT 10;
✅ Data Integrity
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
age INTEGER CHECK (age >= 18),
created_at TIMESTAMP DEFAULT NOW()
);
-- Database enforces:
-- - Unique emails
-- - Non-null emails
-- - Age >= 18
-- - Auto-generated IDs
Weaknesses
❌ Rigid Schema
-- Adding a column requires migration ALTER TABLE users ADD COLUMN phone VARCHAR(20); -- Affects entire table (millions of rows!) -- Requires downtime or careful migration
❌ Vertical Scaling Limits
Single server limitations: - Max CPU: 128 cores - Max RAM: 4TB - Max storage: ~100TB - Cost: $$$$$ Beyond this: Must shard manually (complex!)
❌ JOIN Performance
-- 5-way JOIN on large tables = slow SELECT * FROM table1 JOIN table2 ON ... JOIN table3 ON ... JOIN table4 ON ... JOIN table5 ON ... WHERE ...; -- Time: can be seconds on large datasets
NoSQL Families
A. Document Stores (MongoDB, CouchDB)
Model: JSON-like documents
// User document
{
"_id": "user123",
"name": "Alice",
"email": "alice@example.com",
"address": {
"street": "123 Main St",
"city": "NYC"
},
"orders": [
{ "id": "order1", "total": 99.99, "date": "2024-01-01" },
{ "id": "order2", "total": 149.99, "date": "2024-01-15" }
]
}
// Query
db.users.find({ "address.city": "NYC", "orders.total": { $gt: 100 } })
When to use:
- Flexible schemas
- Nested/hierarchical data
- Rapid prototyping
- Content management
Example: Blog platform with varying article structures
B. Key-Value Stores (Redis, DynamoDB)
Model: Simple key → value mapping
# Redis
redis.set("user:123:name", "Alice")
redis.set("user:123:email", "alice@example.com")
# Retrieve
name = redis.get("user:123:name")
# Atomic operations
redis.incr("page:views")
redis.expire("session:abc", 3600) # TTL
When to use:
- Caching
- Session storage
- Real-time counters
- Simple lookups
- Sub-millisecond latency required
Example: Session store for web app
C. Wide-Column Stores (Cassandra, HBase)
Model: Rows with dynamic columns
Column Family: user_events Row Key: user_123 Column: event_2024-01-01-10:00:00 → "login" Column: event_2024-01-01-10:05:23 → "click_button" Column: event_2024-01-01-10:10:15 → "purchase" ... Column: event_2024-01-15-09:30:00 → "logout" // Can have billions of columns per row!
When to use:
- Time-series data
- Massive write throughput
- Log aggregation
- IoT sensor data
Example: Netflix viewing history (billions of events)
D. Graph Databases (Neo4j, Neptune)
Model: Nodes + relationships
// Social network (Alice:Person)-[:FRIEND]->(Bob:Person)-[:FRIEND]->(Carol:Person) (Alice)-[:BOUGHT]->(Product:Laptop) (Bob)-[:BOUGHT]->(Product) // Query: Friends who bought same product MATCH (me:Person)-[:FRIEND]-(friend)-[:BOUGHT]->(product) WHERE me.name = 'Alice' AND (me)-[:BOUGHT]-> (product) RETURN friend.name, product.name
When to use:
- Social networks
- Recommendation engines
- Fraud detection
- Knowledge graphs
Example: LinkedIn "People you may know"
ACID vs BASE
ACID (SQL)
Atomicity: All or nothing Consistency: Valid state always Isolation: Transactions don't interfere Durability: Committed = persisted Example: Bank transfer: $100 from A to B Either both happen or neither Database never shows inconsistent state
BASE (NoSQL)
Basically Available: System responds even if some nodes down Soft state: State may change over time Eventually consistent: Reads may be stale, but will converge Example: Twitter follower count Might show 1,000 on one server, 1,001 on another Eventually both show 1,001 Acceptable trade-off for speed
Decision Framework
Choose SQL when:
✅ Structured data with clear relationships
E-commerce: - Users, Orders, Products, Inventory - Clear foreign keys - Need JOINs (order → order_items → products)
✅ Data integrity is critical
Banking, Healthcare, Payroll: - Cannot lose transactions - Must maintain constraints - ACID guarantees required
✅ Complex queries required
Analytics, Reporting: - Multi-table aggregations - Ad-hoc queries - Business intelligence
✅ Moderate scale
< 10M active rows - Single datacenter - Vertical scaling acceptable
Choose NoSQL when:
✅ Flexible/evolving schema
Content platforms, catalogs: - Product attributes vary - User profiles with custom fields - Rapid feature development
✅ Massive scale required
Social media, IoT, logs: - Billions of records - Petabytes of data - Horizontal scaling needed
✅ High write throughput
Time-series, events, metrics: - 100k+ writes/second - Append-only workload - Eventual consistency OK
✅ Geographic distribution
Global apps: - Low-latency worldwide - Multi-region replication - Partition tolerance critical
✅ Specific access patterns
Key-value: Cache, sessions (Redis) Graph: Social networks (Neo4j) Document: CMS, catalogs (MongoDB) Wide-column: Time-series (Cassandra)
Real-World Examples
1. Instagram (PostgreSQL → Cassandra)
Phase 1 (Early): PostgreSQL - Simple schema (users, photos, likes) - ACID transactions - <1M users Phase 2 (Growth): Still PostgreSQL - Vertical scaling to largest machines - Read replicas - ~10M users Phase 3 (Scale): Cassandra - Billions of photos - Petabytes of data - Global distribution - Horizontal scaling - Eventual consistency acceptable
2. Uber (MySQL + Redis + Cassandra)
MySQL: - Driver/rider accounts - Trip transactions - Payment processing (ACID critical) Redis: - Driver location cache - Surge pricing calculations - Real-time matching (Speed critical) Cassandra: - Trip history - Location logs - Analytics events (Scale + writes critical)
3. Netflix (Multiple Databases)
MySQL: - User accounts - Billing - Subscription management Cassandra: - Viewing history (billions of events) - Recommendations pre-computations - A/B test results EVCache (Memcached): - Movie metadata - User preferences - API response caching Elasticsearch: - Search and discovery - Content recommendations
Polyglot Persistence
Modern approach: Use multiple databases for different needs.
graph TB
App[Application]
App -->|User Auth| PG[PostgreSQL<br/>ACID transactions]
App -->|Caching| Redis[Redis<br/>Sub-ms reads]
App -->|Events| Kafka[Kafka<br/>Stream processing]
App -->|Analytics| Cassandra[Cassandra<br/>Time-series]
App -->|Search| ES[Elasticsearch<br/>Full-text search]
App -->|Recommendations| Neo4j[Neo4j<br/>Graph queries]
Example: E-commerce platform
PostgreSQL: - Orders, payments, inventory - ACID for financial accuracy Redis: - Shopping cart (session data) - Product catalog cache - Real-time inventory count Elasticsearch: - Product search - Faceted navigation - Autocomplete Cassandra: - User click streams - Product view history - Recommendation training data Neo4j: - Product recommendations - "Customers also bought..." - Fraud detection rules
Migration Considerations
SQL → NoSQL
Reasons:
- Outgrown vertical scaling
- Need horizontal scaling
- Schema too rigid
- Geographic distribution required
Challenges:
1. No JOINs → Denormalize data 2. No transactions → Application-level consistency 3. Different query patterns → Rethink access patterns 4. Data migration → Dual-write period
Strategy:
1. Identify bounded contexts 2. Migrate one service/table at a time 3. Use dual-write during transition 4. Validate data consistency 5. Gradual traffic shift
NoSQL → SQL
Reasons:
- Need complex queries
- ACID transactions required
- Data model stabilized
- JOIN performance needed
Example: Early startup with MongoDB
Phase 1: MongoDB (rapid prototyping) Phase 2: PostgreSQL (after PMF, stable schema) Reason: Complex reporting needs emerged
Performance Comparison
Scenario: 1M user profiles SQL (PostgreSQL): - Read: ~1ms (indexed) - Write: ~2ms (ACID overhead) - JOIN (3 tables): ~10-50ms - Analytics query: ~100ms-1s - Max throughput: ~10k writes/sec (single node) NoSQL (MongoDB): - Read: ~0.5ms (embedded docs, no JOINs) - Write: ~0.8ms (eventual consistency) - JOIN equivalent: N/A (denormalized) - Analytics query: ~1-10s (MapReduce) - Max throughput: ~100k writes/sec (sharded) Key-Value (Redis): - Read: ~0.1ms (in-memory) - Write: ~0.1ms (in-memory) - JOIN equivalent: N/A - Analytics query: N/A (not designed for this) - Max throughput: ~1M ops/sec
Interview Tips 💡
When discussing SQL vs NoSQL in interviews:
-
It's not either/or: "Modern apps often use both - PostgreSQL for transactions, Redis for cache, Cassandra for events..."
-
Start with use case: "For banking? SQL (ACID critical). For social feed? NoSQL (scale critical)..."
-
Mention trade-offs: "SQL gives ACID but limited horizontal scaling. NoSQL scales but eventual consistency..."
-
Real examples: "Instagram started PostgreSQL, migrated to Cassandra at scale. Uber uses both..."
-
Schema evolution: "SQL: rigid schema, migrations. NoSQL: flexible, but must handle multiple versions in code..."
-
Query patterns: "SQL excels at ad-hoc queries. NoSQL requires known access patterns upfront..."
Related Concepts
- ACID vs BASE — Consistency models
- CAP Theorem — Distributed system trade-offs
- Database Sharding — Horizontal partitioning
- Replication — Data redundancy
- Indexing — Query optimization
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
ACID vs BASE: Consistency Models
The two philosophies of database transaction handling: Strict guarantees (ACID) versus flexible availability (BASE). Deep dive into isolation levels, transaction anomalies, and hybrid approaches.
Database Replication
The process of copying and maintaining database objects in multiple databases to improve reliability, fault-tolerance, and accessibility.
Database Sharding
How to split a massive database across multiple servers. Horizontal scaling strategies, challenges (Joins, ACID), and real-world algorithms used by Instagram, Vitess, and CockroachDB.