File System vs Object Storage

Why not just store files on a really big hard drive?

Block Storage (Hard Drive): Good for OS disks and databases. Low latency, but hard to share across servers.

Network File System (NFS): Hierarchical folder structure. Good for sharing, but struggles at billions of files (inode limits).

Object Storage (S3): Flat namespace. HTTP API. Effectively unlimited scale.

| Feature | File System (POSIX) | Object Storage |
|---|---|---|
| Interface | `open()`, `seek()`, `read()` | HTTP `PUT`, `GET`, `DELETE` |
| Structure | Hierarchy (Folders) | Flat (Buckets + Keys) |
| Metadata | Fixed (size, mtime) | Flexible (Custom Tags) |
| Updates | Edit parts of file | Immutable (Overwrite only) |
| Scale | TBs | Exabytes |
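To make the "Flat (Buckets + Keys)" row concrete, here is a minimal sketch of how a folder-style listing can be emulated over a flat key space using a prefix and a delimiter. The `list_prefix` helper and the in-memory `bucket` dict are hypothetical stand-ins for illustration; S3's list API exposes a similar prefix/delimiter idea, but this is not its implementation.

```python
from typing import Dict, List, Tuple


def list_prefix(store: Dict[str, bytes], prefix: str = "",
                delimiter: str = "/") -> Tuple[List[str], List[str]]:
    """Emulate a folder listing over a flat key space.

    Returns (keys, common_prefixes): keys directly "inside" the prefix, and
    the distinct "subfolder" prefixes one level down.
    """
    keys: List[str] = []
    common: List[str] = []
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a subfolder.
            sub = prefix + rest.split(delimiter, 1)[0] + delimiter
            if sub not in common:
                common.append(sub)
        else:
            keys.append(key)
    return keys, common


# One bucket modelled as a flat dict: the "/" in a key is just a character.
bucket = {
    "photos/2024/cat.jpg": b"...",
    "photos/2024/dog.jpg": b"...",
    "photos/readme.txt": b"...",
    "logs/app.log": b"...",
}

print(list_prefix(bucket, "photos/"))
# (['photos/readme.txt'], ['photos/2024/'])
```

There are no directories to create, rename, or walk here: a "folder" exists only as a shared key prefix, which is what lets the namespace be partitioned across many machines.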

Architecture Internals

An Object Store typically consists of three layers:

mermaid

graph TD
    Client[Client] --> LB[Load Balancer / API Gateway]
    
    subgraph Metadata_Layer [Metadata Service]
        M1[Meta Node 1]
        M2[Meta Node 2]
        DB[(Key-Value DB)]
        M1 --- DB
        M2 --- DB
    end
    
    subgraph Storage_Layer ["Storage Nodes (Blobs)"]
        S1[Node 1]
        S2[Node 2]
        S3[Node 3]
        S4[Node 4]
    end
    
    LB --> Metadata_Layer
    LB --> Storage_Layer


Front-end (API): Authenticates requests and routes traffic.
Metadata Service: "Where is my-photo.jpg?"
- Maps Bucket + Key -> Volume ID + Offset.
- Needs a highly scalable KV store (e.g., Cassandra, FoundationDB).
Storage Nodes: Dumb servers filled with hard drives.
- Stores "Blobs" (Binary Large Objects).
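The three layers above can be sketched in a few lines. This is a toy, single-process model under stated assumptions: the class names (`Location`, `StorageNode`, `ObjectStore`) are hypothetical, a plain dict stands in for the scalable KV store, and there is one storage node with one append-only volume.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Location:
    volume_id: int
    offset: int
    length: int


class StorageNode:
    """A 'dumb' node: append-only volumes addressed by (volume_id, offset)."""

    def __init__(self) -> None:
        self.volumes: Dict[int, bytearray] = {0: bytearray()}

    def append(self, volume_id: int, blob: bytes) -> Location:
        volume = self.volumes[volume_id]
        offset = len(volume)
        volume.extend(blob)
        return Location(volume_id, offset, len(blob))

    def read(self, loc: Location) -> bytes:
        volume = self.volumes[loc.volume_id]
        return bytes(volume[loc.offset:loc.offset + loc.length])


class ObjectStore:
    """Front end: routes PUT/GET through the metadata map to a storage node."""

    def __init__(self) -> None:
        self.node = StorageNode()
        # Stand-in for the KV store: (bucket, key) -> Location.
        self.metadata: Dict[Tuple[str, str], Location] = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        loc = self.node.append(0, data)
        self.metadata[(bucket, key)] = loc  # an overwrite just repoints the key

    def get(self, bucket: str, key: str) -> bytes:
        return self.node.read(self.metadata[(bucket, key)])


store = ObjectStore()
store.put("photos", "my-photo.jpg", b"JPEGDATA")
store.put("photos", "other.jpg", b"MORE")
print(store.metadata[("photos", "other.jpg")])
# Location(volume_id=0, offset=8, length=4)
print(store.get("photos", "my-photo.jpg"))
# b'JPEGDATA'
```

Note that the storage node never sees bucket or key names. All naming lives in the metadata layer, which is why that layer is the part that needs a scalable, consistent database.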

Data Durability: Erasure Coding

Storing 3 copies of a 1GB file (Replication) uses 3GB. Expensive! Erasure Coding (EC) splits data into $N$ data chunks and $K$ parity chunks.
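The data-plus-parity idea can be shown with the simplest possible code: $N$ data chunks and a single XOR parity chunk ($K = 1$). This is deliberately not Reed-Solomon, which needs finite-field arithmetic to support $K > 1$; it is only a sketch of how a lost chunk is rebuilt from the survivors. The function names are hypothetical.

```python
from typing import List, Optional


def split_with_parity(data: bytes, n: int) -> List[bytes]:
    """Split data into n equal data chunks plus one XOR parity chunk."""
    chunk_len = -(-len(data) // n)  # ceiling division
    # Zero-pad to a multiple of n; a real system would also record len(data).
    padded = data.ljust(chunk_len * n, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]
    parity = bytearray(chunk_len)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return chunks + [bytes(parity)]


def recover(chunks: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing chunk (marked None) by XOR-ing the rest."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one lost chunk")
    if not missing:
        return list(chunks)
    present = [c for c in chunks if c is not None]
    rebuilt = bytearray(len(present[0]))
    for chunk in present:
        for i, byte in enumerate(chunk):
            rebuilt[i] ^= byte
    repaired = list(chunks)
    repaired[missing[0]] = bytes(rebuilt)
    return repaired


original = b"object storage!!"            # 16 bytes -> 4 data chunks of 4 bytes
stored = split_with_parity(original, 4)   # 5 chunks: 4 data + 1 parity
damaged = list(stored)
damaged[2] = None                         # simulate one failed drive
repaired = recover(damaged)
print(b"".join(repaired[:4]) == original)  # True
```

Here 16 bytes of data cost 20 bytes of storage (1.25x) and survive one loss. Tolerating two or more losses is where Reed-Solomon style codes come in.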

Example: Reed-Solomon (4, 2)

Split 1GB file into 4 chunks (250MB each).
Calculate 2 Parity chunks (250MB each).
Total storage: $1GB + 0.5GB = 1.5GB$ (1.5x overhead).
Durability: Any 2 of the 6 chunks (drives) can be lost and the data is still recoverable.
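The arithmetic in this example generalizes to any $(N, K)$ scheme: raw storage is $(N + K) / N$ times the logical size. A small sketch, with hypothetical helper names, comparing the Reed-Solomon (4, 2) layout to 3-way replication:

```python
def storage_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Raw bytes stored per logical byte for an (N, K) erasure code."""
    return (data_chunks + parity_chunks) / data_chunks


def stored_gb(logical_gb: float, data_chunks: int, parity_chunks: int) -> float:
    """Total raw storage needed for logical_gb of user data."""
    return logical_gb * storage_overhead(data_chunks, parity_chunks)


# Reed-Solomon (4, 2) from the example: 1 GB -> 1.5 GB, tolerates 2 lost chunks.
print(stored_gb(1.0, 4, 2))  # 1.5
# 3-way replication behaves like (1, 2): 1 GB -> 3 GB, also tolerates 2 losses.
print(stored_gb(1.0, 1, 2))  # 3.0
```

Both layouts survive two failures, but erasure coding does it with half the raw storage. The trade-off is extra CPU and network work to reconstruct data when chunks are missing.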

[!NOTE] S3 Standard uses EC across at least 3 Availability Zones. It's designed for 99.999999999% (11 9s) durability.

Multipart Upload

How to upload a 5TB file over HTTP?

Initiate: Get an UploadID.
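Parallel Upload: Split the file into parts (e.g. 100MB each) and upload them in parallel. S3 allows at most 10,000 parts per upload, so a 5TB file needs parts of roughly 525MB or larger.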
- PUT /bucket/key?partNumber=1&uploadId=...
- PUT /bucket/key?partNumber=2&uploadId=...
Retry: If Part 5 fails, retry just Part 5.
Complete: Send a request to stitch them together.
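The steps above can be sketched without any network calls. This is a minimal illustration, assuming S3's documented limit of 10,000 parts per upload; `plan_parts`, `upload_all`, and the `upload_part` callable are hypothetical names, and parts are sent sequentially here to keep the retry logic easy to read.

```python
from typing import Callable, Dict, List, Tuple

MAX_PARTS = 10_000  # S3's documented cap on parts per multipart upload


def plan_parts(file_size: int, part_size: int) -> List[Tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples; part numbers start at 1."""
    if file_size <= 0 or part_size <= 0:
        raise ValueError("file_size and part_size must be positive")
    count = -(-file_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError(f"{count} parts exceeds the {MAX_PARTS}-part limit")
    return [
        (n + 1, n * part_size, min(part_size, file_size - n * part_size))
        for n in range(count)
    ]


def upload_all(parts: List[Tuple[int, int, int]],
               upload_part: Callable[[int, int, int], str],
               max_attempts: int = 3) -> Dict[int, str]:
    """Upload each part, retrying only the part that failed.

    upload_part is a hypothetical callable that sends one part and returns
    its ETag, raising IOError on a transient failure.
    """
    etags: Dict[int, str] = {}
    for number, offset, length in parts:
        for attempt in range(1, max_attempts + 1):
            try:
                etags[number] = upload_part(number, offset, length)
                break
            except IOError:
                if attempt == max_attempts:
                    raise
    return etags


MB = 1024 * 1024
TB = 1024 ** 4

print(len(plan_parts(250 * MB, 100 * MB)))  # 3

try:
    plan_parts(5 * TB, 100 * MB)
except ValueError as err:
    print(err)  # 52429 parts exceeds the 10000-part limit
```

The second call shows why part size has to grow with file size: 100MB parts are fine for a few hundred gigabytes but not for a 5TB object.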

mermaid

sequenceDiagram
    participant C as Client
    participant S3 as S3 API
    
    C->>S3: Initiate Multipart Upload
    S3-->>C: UploadID: xyz123
    
    par Parallel Upload
        C->>S3: PUT Part 1
        C->>S3: PUT Part 2
        C->>S3: PUT Part 3
    end
    
    Note right of C: Part 2 Fails? Retry Part 2.
    
    C->>S3: Complete Multipart Upload (List of Parts)
    S3-->>C: 200 OK (ETag)

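The same flow can be written against the low-level boto3 S3 client calls (`create_multipart_upload`, `upload_part`, `complete_multipart_upload`, `abort_multipart_upload`). This is a sequential sketch, not a production uploader: the `multipart_upload` function name is hypothetical, the client is passed in as a parameter (for example `boto3.client("s3")`), and parallelism and per-part retries are left out.

```python
from typing import BinaryIO


def multipart_upload(client, bucket: str, key: str, stream: BinaryIO,
                     part_size: int = 100 * 1024 * 1024) -> str:
    """Upload a stream part by part and return the final object's ETag.

    `client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        part_number = 1
        while True:
            chunk = stream.read(part_size)
            if not chunk:
                break
            resp = client.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            # Complete needs every part's number and ETag.
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1
        done = client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
        return done["ETag"]
    except Exception:
        # Abort so the already-uploaded parts do not linger and accrue cost.
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

A call might look like `multipart_upload(boto3.client("s3"), "my-bucket", "big.bin", open("big.bin", "rb"))`, where the bucket and file names are placeholders. In practice the higher-level `upload_file` helper handles splitting, threading, and cleanup for you; the explicit version is mainly useful for seeing the protocol.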

Consistency Models

S3 Strong Consistency (Since Dec 2020):

Read-after-Write: If you PUT an object and immediately GET it, you get the new version. This also holds for overwrites and deletes.
List Consistency: If you PUT and immediately LIST the bucket, the new object appears.

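How? AWS has not published the full design. It is likely achieved with a coordination protocol (for example Paxos/Raft-style consensus) that keeps the metadata layer's caches coherent, so a read never sees a stale cached entry.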
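To see why cache coherence in the metadata layer matters, here is a toy model. It is not S3's mechanism, only an illustration: a front end that trusts its cache can serve a stale object after an overwrite, while one that checks the cached version against the authoritative store cannot. All class names are hypothetical.

```python
from typing import Dict, Optional, Tuple


class MetadataStore:
    """Authoritative key -> (version, value) map (stand-in for the metadata DB)."""

    def __init__(self) -> None:
        self.data: Dict[str, Tuple[int, bytes]] = {}

    def put(self, key: str, value: bytes) -> int:
        version = self.version_of(key) + 1
        self.data[key] = (version, value)
        return version

    def version_of(self, key: str) -> int:
        return self.data[key][0] if key in self.data else 0


class CachingFrontEnd:
    """A front-end node with a local cache in front of the metadata store."""

    def __init__(self, store: MetadataStore, validate: bool) -> None:
        self.store = store
        self.validate = validate
        self.cache: Dict[str, Tuple[int, bytes]] = {}

    def get(self, key: str) -> Optional[bytes]:
        cached = self.cache.get(key)
        if cached is not None:
            # Without validation the cache is trusted blindly (may be stale).
            if not self.validate or cached[0] == self.store.version_of(key):
                return cached[1]
        entry = self.store.data.get(key)
        if entry is None:
            return None
        self.cache[key] = entry
        return entry[1]


store = MetadataStore()
store.put("report.csv", b"v1")
stale = CachingFrontEnd(store, validate=False)
strong = CachingFrontEnd(store, validate=True)
stale.get("report.csv")
strong.get("report.csv")          # warm both caches with v1
store.put("report.csv", b"v2")    # overwrite arrives via some other node
print(stale.get("report.csv"))    # b'v1'  (stale read)
print(strong.get("report.csv"))   # b'v2'  (read-after-write holds)
```

The version check here is a cheap local call; in a real distributed system that check is the hard part, which is why some form of coordination between nodes is needed.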

Code Example: Interaction

python

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# 1. Upload logic: upload_file switches to multipart above the threshold
def upload_large_file(bucket, key, file_path):
    config = TransferConfig(
        multipart_threshold=100 * 1024 * 1024,  # use multipart above 100MB
        max_concurrency=10,                     # parts uploaded in parallel
        use_threads=True
    )
    s3.upload_file(file_path, bucket, key, Config=config)

# 2. Presigned URLs (secure, time-limited sharing)
def generate_share_link(bucket, key, expiration=3600):
    # The URL is signed locally and grants GET access for `expiration` seconds
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expiration
    )
    return url

# 3. Querying object contents (S3 Select / Athena)
# S3 isn't a database, but you can run SQL over CSV/JSON contents.
# Note: AWS has closed S3 Select to new customers, so this call may not
# be available on newer accounts; Athena is the usual alternative.
def query_csv_content(bucket, key, query):
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,  # "SELECT * FROM s3object s WHERE s.age > 30"
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'CSV': {}}
    )
    # The response is an event stream; only 'Records' events carry data
    for event in resp['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode('utf-8'), end='')


Interview Tips 💡

"How does S3 achieve 11 9s durability?": Erasure Coding across multiple Availability Zones.
"What contributes to latency in S3?": Time to first byte (TTFB) is higher than on a local disk, because each request pays for a metadata lookup plus gathering chunks from storage nodes.
"Immutable Objects": You cannot append to an S3 object (unlike a file). You must re-upload the whole thing (or use multipart copy to assemble a new object from existing data).
"CDN Integration": Always mention putting CloudFront/CDN in front of S3 to reduce latency and egress costs.

Related Concepts

Consistent Hashing (Used to place chunks)
Distributed File Systems (e.g. HDFS)
CDN Architecture

