Object Storage (S3 Design)

How systems like Amazon S3 store petabytes of data. Design internals, Erasure Coding vs Replication, Multipart Uploads, and Consistency Models.

File System vs Object Storage

Why not just store files on a really big hard drive?

Block Storage (Hard Drive): Good for an OS or a database. Low latency, but hard to share across servers.

Network File System (NFS): Hierarchical folder structure. Good for sharing, but struggles to scale to billions of files (inode limits).

Object Storage (S3): Flat namespace. HTTP API. Effectively unlimited scale.

| Feature | File System (POSIX) | Object Storage |
|---|---|---|
| Interface | open(), seek(), read() | HTTP PUT, GET, DELETE |
| Structure | Hierarchy (Folders) | Flat (Buckets + Keys) |
| Metadata | Fixed (size, mtime) | Flexible (Custom Tags) |
| Updates | Edit parts of file | Immutable (Overwrite only) |
| Scale | TBs | Exabytes |
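
To make the "Flat (Buckets + Keys)" row concrete, here is a minimal in-memory sketch of how folder-style listings can be emulated on top of flat keys using a prefix and a delimiter. The `list_keys` helper and the sample keys are hypothetical and only illustrate the idea; S3's own list operations expose a similar prefix/delimiter concept.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Emulate a folder listing over a flat set of keys.

    Returns (objects, common_prefixes): keys directly "inside" the prefix,
    and the pseudo-folders one level below it.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything up to the next delimiter acts like a sub-folder.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# Hypothetical bucket contents: there are no real directories, only keys.
bucket = {
    "photos/2024/cat.jpg": b"...",
    "photos/2024/dog.jpg": b"...",
    "photos/readme.txt": b"...",
    "logs/app.log": b"...",
}

objects, folders = list_keys(bucket, prefix="photos/")
print(objects)  # ['photos/readme.txt']
print(folders)  # ['photos/2024/']
```

The "folders" exist only as a listing convention; renaming one would mean rewriting every key that shares the prefix.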

Architecture Internals

An Object Store typically consists of three layers:

```mermaid
graph TD
    Client[Client] --> LB[Load Balancer / API Gateway]

    subgraph Metadata_Layer [Metadata Service]
        M1[Meta Node 1]
        M2[Meta Node 2]
        DB[(Key-Value DB)]
        M1 --- DB
        M2 --- DB
    end

    subgraph Storage_Layer ["Storage Nodes (Blobs)"]
        S1[Node 1]
        S2[Node 2]
        S3[Node 3]
        S4[Node 4]
    end

    LB --> Metadata_Layer
    LB --> Storage_Layer
```

  1. Front-end (API): Authenticates requests and routes traffic.
  2. Metadata Service: "Where is my-photo.jpg?"
    • Maps Bucket + Key -> Volume ID + Offset.
    • Needs a highly scalable KV store (e.g., Cassandra, FoundationDB).
  3. Storage Nodes: Dumb servers filled with hard drives.
    • Stores "Blobs" (Binary Large Objects).
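
The split between the metadata service and the storage nodes can be sketched as a toy, single-process model. Everything here (`Location`, `StorageNode`, `ObjectStore`, the append-only volume layout) is an assumption made for illustration, not a description of how S3 is actually implemented; a plain dict stands in for the key-value database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    volume_id: int
    offset: int
    length: int


class StorageNode:
    """A 'dumb' node: append-only volumes addressed by (volume_id, offset)."""

    def __init__(self):
        self.volumes = {0: bytearray()}

    def append(self, data: bytes, volume_id: int = 0) -> Location:
        volume = self.volumes.setdefault(volume_id, bytearray())
        offset = len(volume)
        volume.extend(data)
        return Location(volume_id, offset, len(data))

    def read(self, loc: Location) -> bytes:
        volume = self.volumes[loc.volume_id]
        return bytes(volume[loc.offset:loc.offset + loc.length])


class ObjectStore:
    """Front end that consults a metadata index, then a storage node."""

    def __init__(self):
        self.metadata = {}  # (bucket, key) -> Location; stands in for the KV DB
        self.node = StorageNode()

    def put(self, bucket, key, data):
        # Write the blob first, then publish its location in the metadata index.
        self.metadata[(bucket, key)] = self.node.append(data)

    def get(self, bucket, key):
        return self.node.read(self.metadata[(bucket, key)])


store = ObjectStore()
store.put("photos", "my-photo.jpg", b"JPEGDATA")
store.put("photos", "my-photo.jpg", b"NEWDATA")  # overwrite: new blob, pointer swapped
print(store.get("photos", "my-photo.jpg"))  # b'NEWDATA'
```

In this sketch an overwrite never edits bytes in place: it appends a new blob and repoints the metadata entry, leaving the old bytes to be garbage-collected later.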

Data Durability: Erasure Coding

Storing 3 copies of a 1GB file (Replication) uses 3GB. Expensive! Erasure Coding (EC) instead splits the data into N data chunks and computes K parity chunks; any N of the N + K chunks are enough to rebuild the original.

Example: Reed-Solomon (4, 2)

  • Split 1GB file into 4 chunks (250MB each).
  • Calculate 2 Parity chunks (250MB each).
  • Total storage: 1GB + 0.5GB = 1.5GB (1.5x the original size, i.e. 50% overhead).
  • Durability: Can lose any 2 of the 6 chunks (for example, 2 drives if each chunk lives on its own drive) and still recover the data.
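
The storage arithmetic and the recovery idea can be sketched in a few lines. Note the swap: this example uses a single XOR parity chunk (tolerates one lost chunk), not Reed-Solomon, because real Reed-Solomon needs finite-field arithmetic that would not fit a short example. The helper names (`storage_factor`, `encode`, `recover`) are hypothetical.

```python
from functools import reduce


def storage_factor(data_chunks, parity_chunks):
    """Raw bytes stored per byte of user data."""
    return (data_chunks + parity_chunks) / data_chunks


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def split(data, n):
    """Split data into n equal chunks, zero-padding the tail."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(n)]


def encode(data, n):
    """Return n data chunks plus one XOR parity chunk."""
    chunks = split(data, n)
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def recover(chunks):
    """Rebuild the single missing chunk (marked None) from the survivors."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None]
    repaired = list(chunks)
    # XOR of all surviving chunks equals the missing one.
    repaired[missing] = reduce(xor_bytes, survivors)
    return repaired


print(storage_factor(4, 2))  # 1.5  (4 data + 2 parity chunks)
print(storage_factor(1, 2))  # 3.0  (3x replication: 1 original + 2 copies)

stored = encode(b"object storage!!", 4)  # 4 data chunks + 1 parity chunk
stored[2] = None                          # simulate a failed drive
repaired = recover(stored)
print(b"".join(repaired[:4]))  # b'object storage!!'
```

Reed-Solomon generalizes this: with K parity chunks computed over a finite field, any K chunks can be lost instead of just one.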

[!NOTE] S3 Standard stores data redundantly across at least 3 Availability Zones (AWS engineers have described erasure coding as part of this), and it is designed for 99.999999999% (11 9s) durability.

Multipart Upload

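How do you upload a 5TB file over HTTP? A single PUT is capped at 5GB, and one dropped connection would restart the whole transfer, so large objects are uploaded in parts: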

  1. Initiate: Get an UploadID.
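  2. Parallel Upload: Split the file into parts (e.g. 100MB each) and upload them in parallel. S3 allows at most 10,000 parts per upload, so a 5TB file needs parts of roughly 500MB or more.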
    • PUT /bucket/key?partNumber=1&uploadId=...
    • PUT /bucket/key?partNumber=2&uploadId=...
  3. Retry: If Part 5 fails, retry just Part 5.
  4. Complete: Send a request listing each part number and its ETag so S3 can stitch the parts into one object.
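
The part-splitting step can be sketched as a small planner. The limits encoded below (5 MiB minimum part size, 5 GiB maximum part size, 10,000 parts) are S3's documented multipart limits; the `plan_parts` function itself is a hypothetical helper, not part of any SDK.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB          # S3 minimum part size (the last part may be smaller)
MAX_PART = 5 * 1024 * MIB   # S3 maximum part size (5 GiB)
MAX_PARTS = 10_000          # S3 maximum number of parts per upload


def plan_parts(file_size, preferred_part_size=100 * MIB):
    """Return a list of (part_number, offset, length) covering the file."""
    # Grow the part size if the preferred size would need more than MAX_PARTS.
    smallest_allowed = -(-file_size // MAX_PARTS)  # ceiling division
    part_size = max(preferred_part_size, MIN_PART, smallest_allowed)
    if part_size > MAX_PART:
        raise ValueError("file too large for a single multipart upload")
    parts = []
    offset = 0
    part_number = 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


parts = plan_parts(250 * MIB)
print(len(parts))            # 3
print(parts[-1][2] // MIB)   # 50  (the last part is the remainder)

big = plan_parts(5 * 1024**4)  # 5 TiB: 100 MiB parts would exceed 10,000 parts
print(len(big))              # 10000
```

Each tuple maps directly onto one `PUT ...?partNumber=N&uploadId=...` request reading `length` bytes from `offset`.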

```mermaid
sequenceDiagram
    participant C as Client
    participant S3 as S3 API

    C->>S3: Initiate Multipart Upload
    S3-->>C: UploadID: xyz123

    par Part 1
        C->>S3: PUT Part 1
    and Part 2
        C->>S3: PUT Part 2
    and Part 3
        C->>S3: PUT Part 3
    end

    Note right of C: Part 2 Fails? Retry Part 2.

    C->>S3: Complete Multipart Upload (List of Parts)
    S3-->>C: 200 OK (ETag)
```
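
The initiate / upload / retry / complete sequence can be sketched against a boto3-style client. The method names and parameters used on `s3` (`create_multipart_upload`, `upload_part`, `complete_multipart_upload`, `abort_multipart_upload`) are real boto3 S3 client calls; the `multipart_upload` wrapper, its retry policy, and its defaults are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def multipart_upload(s3, bucket, key, chunks, max_workers=4, retries=3):
    """Upload an iterable of byte chunks as one object via multipart upload.

    `s3` is expected to behave like a boto3 S3 client.
    """
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    def upload_part(numbered_chunk):
        part_number, body = numbered_chunk
        for attempt in range(1, retries + 1):
            try:
                resp = s3.upload_part(
                    Bucket=bucket, Key=key, UploadId=upload_id,
                    PartNumber=part_number, Body=body,
                )
                return {"PartNumber": part_number, "ETag": resp["ETag"]}
            except Exception:
                # Retry only this part; re-raise after the final attempt.
                if attempt == retries:
                    raise

    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() preserves input order, so parts stay sorted by number.
            parts = list(pool.map(upload_part, enumerate(chunks, start=1)))
        return s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so already-uploaded parts do not linger and accrue storage cost.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

With real credentials this would be called as `multipart_upload(boto3.client("s3"), "my-bucket", "big.bin", chunks)`, where every chunk except the last is at least 5 MiB. The sketch holds all chunks in memory at once; a production version would stream parts from disk and bound the number of in-flight uploads.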

Consistency Models

S3 Strong Consistency (Since Dec 2020):

  • Read-after-Write: If you PUT a new object and immediately GET it, you get the new version. The same holds for overwrites and deletes.
  • List Consistency: If you PUT and immediately LIST the bucket, the new object appears.

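How? AWS has not published the full design. It is likely achieved with a distributed consensus algorithm (Paxos/Raft) or a similar protocol that keeps the metadata layer's caches coherent, so that a read never returns a stale entry after a write has been acknowledged.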
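
The difference between a stale cached read and a validated one can be shown with a toy model. This is an illustration of the general idea only, not S3's actual mechanism: `MetadataStore`, `CachingFrontEnd`, and the per-key version check are all hypothetical, and the single in-process store stands in for what would be a replicated, consensus-backed service.

```python
class MetadataStore:
    """Authoritative store; bumps a per-key version on every write."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def latest_version(self, key):
        # Cheap freshness check (much smaller than fetching the value).
        return self.data.get(key, (0, None))[0]

    def get(self, key):
        return self.data.get(key, (0, None))


class CachingFrontEnd:
    def __init__(self, store, validate):
        self.store = store
        self.validate = validate
        self.cache = {}  # key -> (version, value)

    def get(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            if not self.validate or cached[0] == self.store.latest_version(key):
                return cached[1]  # cache hit (possibly stale if not validating)
        # Miss or stale entry: refetch from the authoritative store.
        self.cache[key] = self.store.get(key)
        return self.cache[key][1]


store = MetadataStore()
eventual = CachingFrontEnd(store, validate=False)
strong = CachingFrontEnd(store, validate=True)

store.put("report.csv", "v1")
eventual.get("report.csv")     # warm both caches
strong.get("report.csv")
store.put("report.csv", "v2")  # overwrite

print(eventual.get("report.csv"))  # v1  (stale read)
print(strong.get("report.csv"))    # v2  (read-after-write)
```

The validating front end pays one extra version lookup per read, which is the kind of cost a strongly consistent design has to keep cheap.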

Code Example: Interaction

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# 1. Upload logic (upload_file switches to multipart above the threshold)
def upload_large_file(bucket, key, file_path):
    config = TransferConfig(
        multipart_threshold=100 * 1024 * 1024,  # use multipart above 100MB
        multipart_chunksize=100 * 1024 * 1024,  # 100MB parts
        max_concurrency=10,
        use_threads=True
    )
    s3.upload_file(file_path, bucket, key, Config=config)

# 2. Presigned URLs (Secure sharing)
def generate_share_link(bucket, key, expiration=3600):
    # The URL grants time-limited GET access without sharing credentials.
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expiration  # seconds
    )
    return url

# 3. Querying object contents in place (S3 Select)
# S3 isn't a database, but you can run SQL over the contents of a CSV/JSON object.
# Note: AWS has restricted S3 Select for new accounts, so check availability;
# Athena is the usual alternative.
def query_csv_content(bucket, key, query):
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,  # "SELECT * FROM s3object s WHERE s.age > 30"
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'CSV': {}}
    )
    # The response is an event stream; only 'Records' events carry row data.
    for event in resp['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode('utf-8'))
```

Interview Tips 💡

  • "How does S3 achieve 11 9s durability?": Erasure Coding across multiple Availability Zones.
  • "What contributes to latency in S3?": Time to first byte (TTFB) is higher than on a local disk, because each request needs a metadata lookup and then has to gather chunks from several storage nodes.
  • "Immutable Objects": You cannot append to an S3 object (unlike a file). You must re-upload the whole thing, or build a new object with a multipart upload that copies the existing data as its first part.
  • "CDN Integration": Always mention putting CloudFront/CDN in front of S3 to reduce latency and egress costs.
