File System vs Object Storage
Why not just store files on a really big hard drive?
Block Storage (Hard Drive): Good for OS, databases. Low latency. Hard to share across servers.
Network File System (NFS): Hierarchical folder structure. Good for sharing. Doesn't scale to billions of files (inode limits).
Object Storage (S3): Flat namespace. HTTP API. Practically unlimited scale.
| Feature | File System (POSIX) | Object Storage |
|---|---|---|
| Interface | open(), seek(), read() | HTTP PUT, GET, DELETE |
| Structure | Hierarchy (Folders) | Flat (Buckets + Keys) |
| Metadata | Fixed (size, mtime) | Flexible (Custom Tags) |
| Updates | Edit parts of file | Immutable (Overwrite only) |
| Scale | TBs | Exabytes |
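The "Flat (Buckets + Keys)" row is easier to see in code. The sketch below is a minimal, in-memory illustration (the `list_keys` helper and the sample keys are hypothetical, not part of any SDK) of how "folders" can be emulated on top of a flat key space by grouping keys on a prefix and a delimiter, similar in spirit to the `Prefix`/`Delimiter` options of S3's list API.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Emulate a folder listing over a flat set of keys.

    Returns (objects, common_prefixes): keys directly "inside" the prefix,
    and the distinct next-level prefixes that act like subfolders.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter behaves like a "subfolder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# A bucket is just a flat mapping of key -> bytes; the slashes carry no meaning.
bucket = {
    "photos/2023/cat.jpg": b"...",
    "photos/2023/dog.jpg": b"...",
    "photos/2024/bird.jpg": b"...",
    "photos/readme.txt": b"...",
    "notes.txt": b"...",
}

objects, folders = list_keys(bucket, prefix="photos/")
print(objects)  # ['photos/readme.txt']
print(folders)  # ['photos/2023/', 'photos/2024/']
```

No directory entries exist anywhere in this model: "renaming a folder" would mean rewriting every key that shares the prefix.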
Architecture Internals
An Object Store typically consists of three layers:
graph TD
    Client[Client] --> LB[Load Balancer / API Gateway]
    subgraph Metadata_Layer [Metadata Service]
        M1[Meta Node 1]
        M2[Meta Node 2]
        DB[(Key-Value DB)]
        M1 --- DB
        M2 --- DB
    end
    subgraph Storage_Layer ["Storage Nodes (Blobs)"]
        S1[Node 1]
        S2[Node 2]
        S3[Node 3]
        S4[Node 4]
    end
    LB --> Metadata_Layer
    LB --> Storage_Layer
- Front-end (API): Authenticates requests and routes traffic.
- Metadata Service: "Where is my-photo.jpg?"
  - Maps Bucket + Key -> Volume ID + Offset.
  - Needs a highly scalable KV store (e.g., Cassandra, FoundationDB).
- Storage Nodes: Dumb servers filled with hard drives.
  - Stores "Blobs" (Binary Large Objects).
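The split between metadata and storage can be sketched as a toy in-memory store. Everything here is an illustrative assumption: the `ToyObjectStore` and `Location` names, the "emptiest volume" placement policy, and the use of a Python dict in place of a real KV database. The two-step read path (metadata lookup, then a read from a storage volume) is the only part taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    volume_id: int
    offset: int
    length: int


class ToyObjectStore:
    """In-memory sketch: a metadata map plus append-only storage volumes."""

    def __init__(self, num_volumes=4):
        self.metadata = {}  # (bucket, key) -> Location; stands in for the KV DB
        self.volumes = [bytearray() for _ in range(num_volumes)]  # storage nodes

    def put(self, bucket, key, data):
        # Placement policy is an assumption: pick the emptiest volume.
        volume_id = min(range(len(self.volumes)), key=lambda i: len(self.volumes[i]))
        volume = self.volumes[volume_id]
        offset = len(volume)
        volume.extend(data)  # blobs are appended, never edited in place
        self.metadata[(bucket, key)] = Location(volume_id, offset, len(data))

    def get(self, bucket, key):
        loc = self.metadata[(bucket, key)]    # step 1: metadata lookup
        volume = self.volumes[loc.volume_id]  # step 2: read from the storage node
        return bytes(volume[loc.offset:loc.offset + loc.length])


store = ToyObjectStore()
store.put("photos", "my-photo.jpg", b"JPEGDATA")
store.put("photos", "notes.txt", b"hello")
print(store.metadata[("photos", "my-photo.jpg")])
print(store.get("photos", "my-photo.jpg"))  # b'JPEGDATA'
```

Overwriting a key only repoints the metadata entry at a new blob. The old bytes stay on the volume until some garbage-collection step (not modelled here) reclaims them.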
Data Durability: Erasure Coding
Storing 3 copies of a 1GB file (Replication) uses 3GB. Expensive! Erasure Coding (EC) splits data into data chunks and parity chunks.
Example: Reed-Solomon (4, 2)
- Split the 1GB file into 4 data chunks (250MB each).
- Calculate 2 parity chunks (250MB each).
- Total storage: 6 × 250MB = 1.5GB (1.5x overhead).
- Durability: Can lose any 2 of the 6 chunks (for example, 2 drives) and still recover the data.
[!NOTE] S3 Standard uses EC across at least 3 Availability Zones. It's designed for 99.999999999% (11 9s) durability.
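The overhead arithmetic, and the basic idea of rebuilding a lost chunk from parity, can be checked with a short sketch. Note the simplification: real Reed-Solomon needs finite-field arithmetic to tolerate two or more losses, so the recovery demo below uses a single XOR parity chunk instead, which tolerates exactly one lost chunk. The helper names are hypothetical.

```python
from functools import reduce


def storage_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per byte of user data for a (k, m) erasure code."""
    return (data_chunks + parity_chunks) / data_chunks


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def split(data, k):
    """Split data into k equal chunks, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


print(storage_overhead(4, 2))  # 1.5 (the (4, 2) example above)
print(storage_overhead(1, 2))  # 3.0 (3-way replication: 1 original + 2 copies)

data = b"object storage!!"      # 16 bytes
chunks = split(data, 4)         # 4 data chunks of 4 bytes each
parity = xor_chunks(chunks)     # 1 parity chunk (single-parity, not Reed-Solomon)

# Simulate losing chunk 2, then rebuild it from the survivors plus parity.
survivors = [c for i, c in enumerate(chunks) if i != 2]
rebuilt = xor_chunks(survivors + [parity])
print(rebuilt == chunks[2])  # True
```

XOR-ing all surviving chunks with the parity cancels out everything except the missing chunk. Reed-Solomon generalizes this so that any m of the k + m chunks can be lost.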
Multipart Upload
How do you upload a 5TB file over HTTP?
- Initiate: Get an UploadID.
- Parallel Upload: Split the file into parts (e.g., 100MB each) and upload them in parallel. S3 allows at most 10,000 parts per upload, so a 5TB file needs parts of about 500MB or larger.
  PUT /bucket/key?partNumber=1&uploadId=...
  PUT /bucket/key?partNumber=2&uploadId=...
- Retry: If Part 5 fails, retry just Part 5.
- Complete: Send a request to stitch them together.
sequenceDiagram
    participant C as Client
    participant S3 as S3 API
    C->>S3: Initiate Multipart Upload
    S3-->>C: UploadID: xyz123
    par Parallel Upload
        C->>S3: PUT Part 1
        C->>S3: PUT Part 2
        C->>S3: PUT Part 3
    end
    Note right of C: Part 2 Fails? Retry Part 2.
    C->>S3: Complete Multipart Upload (List of Parts)
    S3-->>C: 200 OK (ETag)
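The client side of this flow can be sketched without any network calls. The code below is a simulation under stated assumptions: `plan_parts`, `upload_all`, and `flaky_send` are hypothetical helpers, and `flaky_send` stands in for the real HTTP `PUT` of one part. The two constants reflect S3's documented multipart limits (at most 10,000 parts, at least 5 MiB per part except the last).

```python
from concurrent.futures import ThreadPoolExecutor

MIN_PART_SIZE = 5 * 1024**2  # 5 MiB minimum part size (last part may be smaller)
MAX_PARTS = 10_000           # maximum number of parts in one multipart upload


def plan_parts(file_size, preferred_part_size=100 * 1024**2):
    """Return (part_number, offset, length) tuples covering the whole file."""
    min_for_limit = -(-file_size // MAX_PARTS)  # ceiling division
    part_size = max(preferred_part_size, MIN_PART_SIZE, min_for_limit)
    parts, offset, part_number = [], 0, 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


def upload_all(parts, send_part, max_attempts=3, workers=4):
    """Upload parts in parallel, retrying only the part that failed."""
    def upload_one(part):
        part_number, offset, length = part
        for attempt in range(1, max_attempts + 1):
            try:
                return part_number, send_part(part_number, offset, length)
            except OSError:
                if attempt == max_attempts:
                    raise
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(upload_one, parts))


failed_once = set()


def flaky_send(part_number, offset, length):
    # Hypothetical stand-in for an HTTP PUT; part 2 fails on its first attempt.
    if part_number == 2 and part_number not in failed_once:
        failed_once.add(part_number)
        raise OSError("simulated network error")
    return f"etag-{part_number}"


parts = plan_parts(250 * 1024**2)      # 250 MiB file -> 100 + 100 + 50 MiB
etags = upload_all(parts, flaky_send)  # part 2 is retried, the others are not
print(etags)  # {1: 'etag-1', 2: 'etag-2', 3: 'etag-3'}

print(len(plan_parts(5 * 1024**4)))  # 10000 (part size grows to fit the limit)
```

The returned part-number-to-ETag mapping is what a client would send in the final "Complete" request so the service can stitch the parts together in order.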
Consistency Models
S3 Strong Consistency (Since Dec 2020):
- Read-after-Write: If you PUT a new object and immediately GET it, you get the new version.
- List Consistency: If you PUT and immediately LIST the bucket, the new object appears.
How? Likely achieved by running a distributed consensus algorithm (Paxos/Raft) in the metadata layer and using it to drive cache invalidation.
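To make the cache-invalidation point concrete, here is a toy model of why a cached metadata layer needs some freshness check to give read-after-write behavior. It is purely illustrative and not a description of S3's internals: the `MetadataStore` and `CachingFrontEnd` classes are hypothetical, and a single in-process version counter stands in for whatever consensus-backed mechanism a real system would use.

```python
class MetadataStore:
    """Authoritative metadata: key -> (version, value)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def latest_version(self, key):
        return self._data.get(key, (0, None))[0]

    def get(self, key):
        return self._data.get(key, (0, None))


class CachingFrontEnd:
    """Front-end node with a local cache of metadata entries."""

    def __init__(self, store, validate=True):
        self.store = store
        self.validate = validate  # True: check the version before trusting the cache
        self.cache = {}

    def get(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            fresh = (not self.validate
                     or cached[0] == self.store.latest_version(key))
            if fresh:
                return cached[1]
        entry = self.store.get(key)  # cache miss or stale entry: refetch
        self.cache[key] = entry
        return entry[1]


store = MetadataStore()
strong = CachingFrontEnd(store, validate=True)
eventual = CachingFrontEnd(store, validate=False)

store.put("bucket/key", "v1")
strong.get("bucket/key")
eventual.get("bucket/key")     # both caches now hold v1
store.put("bucket/key", "v2")  # the write arrives via some other node

print(strong.get("bucket/key"))    # v2
print(eventual.get("bucket/key"))  # v1 (stale)
```

The hard part in a distributed system is making `latest_version` both cheap and trustworthy across many nodes, which is where a consensus protocol would come in.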
Code Example: Interaction
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# 1. Upload logic (simplified multipart)
def upload_large_file(bucket, key, file_path):
    # Files larger than the threshold are split and uploaded as multipart.
    config = TransferConfig(
        multipart_threshold=100 * 1024 * 1024,  # 100MB
        max_concurrency=10,
        use_threads=True
    )
    s3.upload_file(file_path, bucket, key, Config=config)

# 2. Presigned URLs (Secure sharing)
def generate_share_link(bucket, key, expiration=3600):
    # The URL grants time-limited GET access without sharing credentials.
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expiration  # seconds
    )
    return url

# 3. Querying object contents (S3 Select / Athena)
# S3 isn't a database, but you can query CSV/JSON contents.
# Note: AWS has closed S3 Select to new customers; check availability for your account.
def query_csv_content(bucket, key, query):
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,  # "SELECT * FROM s3object s WHERE s.age > 30"
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'CSV': {}}
    )
    # The response is an event stream; only 'Records' events carry row data.
    for event in resp['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode('utf-8'))
Interview Tips
- "How does S3 achieve 11 9s durability?" → Erasure Coding across multiple Availability Zones.
- "What contributes to latency in S3?" → First-byte latency (TTFB) is higher than local disk because each request involves a metadata lookup plus gathering chunks from storage nodes.
- "Immutable Objects" → You cannot append to an S3 object (unlike a file). You must re-upload the whole thing (or use multipart copy).
- "CDN Integration" → Always mention putting CloudFront/CDN in front of S3 to reduce latency and egress costs.
Related Concepts
- Consistent Hashing (Used to place chunks)
- Distributed File Systems (e.g. HDFS)
- CDN Architecture