The Serving Challenge

Training a model is just the beginning. Serving it to millions of users is an infrastructure challenge.

Key Metrics:

Throughput: Requests per second (RPS).
Latency: Time to first token (TTFT) or total generation time.
Cost: GPU hours are expensive ($2-$4/hr for an A100).
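Throughput and cost are linked: the more requests one GPU sustains, the cheaper each request becomes. The following is a minimal sketch of that arithmetic. The function name and the example workload (a $3/hr GPU at 50 RPS) are illustrative assumptions, not measured figures.

```python
def cost_per_million_requests(gpu_hourly_usd: float, sustained_rps: float) -> float:
    """Return the GPU cost in USD of serving one million requests."""
    if sustained_rps <= 0:
        raise ValueError("sustained_rps must be positive")
    requests_per_hour = sustained_rps * 3600
    return gpu_hourly_usd / requests_per_hour * 1_000_000


# Hypothetical workload: one $3/hr GPU sustaining 50 requests per second.
example_cost = cost_per_million_requests(gpu_hourly_usd=3.0, sustained_rps=50.0)
print(f"${example_cost:.2f} per million requests")
```

Doubling the sustained RPS on the same GPU halves this figure, which is why the optimization techniques later in the article focus on throughput.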

1. Serving Patterns

Real-Time (Online) Inference

SLA: Low latency (< 200ms).
Architecture: REST/gRPC API.
Examples: ChatGPT, fraud detection during checkout.
Challenges: Auto-scaling GPUs based on traffic.

mermaid

graph LR
    User -->|HTTP Request| API[FastAPI / Triton]
    API -->|Tensor| GPU[GPU Worker]
    GPU -->|Prediction| API
    API -->|Response| User
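The auto-scaling challenge for real-time inference largely comes down to deciding how many GPU workers a given traffic level needs. Below is a rough sketch of that sizing rule, assuming each worker's sustainable RPS has already been measured. The function name and the 20% headroom default are hypothetical choices for illustration.

```python
import math


def replicas_needed(peak_rps: float, per_replica_rps: float, headroom: float = 0.2) -> int:
    """Return how many GPU workers are needed to absorb peak_rps with spare capacity."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    # Only plan to use part of each worker so traffic spikes do not breach the SLA.
    usable_rps = per_replica_rps * (1.0 - headroom)
    return max(1, math.ceil(peak_rps / usable_rps))


# Hypothetical numbers: 900 RPS at peak, each worker sustains 100 RPS.
print(replicas_needed(peak_rps=900, per_replica_rps=100))
```

An autoscaler would re-evaluate a rule like this periodically against observed traffic. Because GPU workers are slow to start, the headroom buys time while new replicas load.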


Batch Inference

SLA: Latency-tolerant; results can take hours or days.
Architecture: Cron Job / Airflow DAG.
Example: Nightly product recommendation generation.
Pros: High throughput, efficient GPU usage (packed batches).
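The "packed batches" idea can be sketched as a job that walks a dataset in fixed-size chunks and sends each chunk to the model in one call. This is a minimal illustration only: `fake_predict_batch` is a hypothetical stand-in for a real model, and the batch size is arbitrary.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def run_batch_job(
    items: Iterable[str],
    predict_batch: Callable[[List[str]], List[int]],
    batch_size: int = 256,
) -> List[int]:
    """Run the model over every item, one full batch per call."""
    results: List[int] = []
    for batch in chunked(items, batch_size):
        results.extend(predict_batch(batch))
    return results


def fake_predict_batch(batch: List[str]) -> List[int]:
    """Hypothetical model: returns the length of each input string."""
    return [len(text) for text in batch]


print(run_batch_job(["a", "bb", "ccc"], fake_predict_batch, batch_size=2))
```

In a nightly job, a function like `run_batch_job` would be the body of the cron task or Airflow operator, with results written to a table that the online service reads.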

Streaming Inference

SLA: Low time to first token; output is delivered incrementally as it is generated.
Architecture: Server-Sent Events (SSE) or WebSocket.
Example: LLM token generation (typing effect).
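With Server-Sent Events, each chunk is sent as a `data:` line followed by a blank line. The sketch below formats generated tokens that way. The JSON payload shape and the `[DONE]` end marker are conventions assumed here for illustration, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE frame: a 'data:' line followed by a blank line."""
    for token in tokens:
        # JSON-encode the payload so newlines inside a token cannot break the frame.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed convention: a final marker telling the client the stream is complete.
    yield "data: [DONE]\n\n"


for frame in sse_events(["Hel", "lo"]):
    print(frame, end="")
```

A web framework would return this generator as a streaming response with the `text/event-stream` content type, so the client can render tokens as they arrive.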

2. Optimization Techniques

How to make models faster and cheaper?

Dynamic Batching

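Instead of running each request alone, the server holds incoming requests for a short window and runs them through the GPU as one batch.

Without Batching: 1 req = 10ms compute + 10ms memory overhead. Total = 20ms.
With Batching (Size 8): 8 reqs = 15ms compute + 10ms memory overhead. Total = 25ms, or about 3ms of GPU time per request. Each request waits slightly longer (25ms vs 20ms), but throughput rises roughly 6x.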

Tools: NVIDIA Triton Inference Server, TorchServe, vLLM.
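To make the mechanism concrete, here is a minimal asyncio sketch of a dynamic batcher: it waits for a first request, then keeps collecting until the batch is full or a short deadline passes. It is an illustration under simplifying assumptions, not how Triton or vLLM implement it. `DynamicBatcher`, `fake_predict_batch`, and the timing values are hypothetical.

```python
import asyncio
from typing import Any, Callable, List

batch_sizes: List[int] = []  # records how many requests each model call received


def fake_predict_batch(inputs: List[int]) -> List[int]:
    """Hypothetical model: doubles every input in a single batched call."""
    batch_sizes.append(len(inputs))
    return [x * 2 for x in inputs]


class DynamicBatcher:
    """Group concurrent requests into batches bounded by size and wait time."""

    def __init__(self, predict_batch: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, max_wait_s: float = 0.005) -> None:
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self) -> None:
        """Worker loop: build a batch, call the model once, resolve each caller."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self._predict_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo() -> List[int]:
    batcher = DynamicBatcher(fake_predict_batch, max_batch_size=8, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return list(results)


demo_results = asyncio.run(demo())
print(demo_results, batch_sizes)
```

The two knobs mirror the tradeoff in the numbers above: a larger `max_batch_size` raises throughput, while `max_wait_s` caps how much latency a request can pay while waiting for others. A production server would also run the model call off the event loop and handle model errors.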

Quantization (Shrinking Models)

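FP32 (32-bit float) is the default training precision, but at 4 bytes per parameter it is expensive to serve.

FP16: Half precision. Up to 2x speedup, 50% of the memory. Minimal accuracy loss. Most common.
INT8: 8-bit integer. Up to 4x speedup, 25% of the memory. Needs calibration.
GPTQ / AWQ: 4-bit quantization for LLMs. Shrinks Llama-3-70B weights from about 140GB (FP16) to about 35GB, so the model fits on far fewer GPUs (for example, 2x A100s instead of 4x).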
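The memory savings follow directly from bytes per parameter, and the core of integer quantization is a scale factor that maps floats onto a small integer range. The sketch below shows both, using simple symmetric per-tensor INT8 quantization. This is a simplified stand-in, not the GPTQ or AWQ algorithm, and all function names are hypothetical.

```python
from typing import List, Tuple

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1e9 bytes); excludes activations and KV cache."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# A 70-billion-parameter model at different precisions.
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(dtype, weight_memory_gb(70e9, dtype), "GB")

weights = [0.5, -1.0, 0.25, 0.03]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
print(quantized, [round(r, 4) for r in restored])
```

The round trip is lossy: each restored weight can differ from the original by up to half a quantization step. Calibration exists to choose scales that keep that error small on realistic data.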

KV Caching (For LLMs)

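Without a cache, a transformer recomputes the Key/Value projections for the entire sequence every time it generates a token. The KV cache stores the Key/Value matrices of past tokens in GPU VRAM, so each step only computes them for the new token.

Tradeoff: Increases VRAM usage significantly; the cache grows linearly with sequence length and batch size.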
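Both sides of this tradeoff can be estimated with simple arithmetic. The sketch below counts how many per-token Key/Value projections are computed with and without a cache, and estimates the cache size. The model dimensions (32 layers, 32 KV heads, head size 128, FP16) are an illustrative configuration chosen for round numbers, not a specific model's published specification.

```python
def projections_computed(num_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V projections performed while generating num_tokens tokens."""
    if use_cache:
        return num_tokens  # each token is projected exactly once
    # Without a cache, step t reprocesses all t tokens seen so far: 1 + 2 + ... + n.
    return num_tokens * (num_tokens + 1) // 2


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # The leading 2 accounts for storing one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


print(projections_computed(100, use_cache=False), projections_computed(100, use_cache=True))

# Illustrative configuration (assumed): 32 layers, 32 KV heads, head_dim 128, FP16.
one_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
eight_sequences = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
print(one_sequence / 2**30, "GiB;", eight_sequences / 2**30, "GiB")
```

Under these assumed dimensions, a single 4096-token sequence takes 2 GiB of cache and a batch of eight takes 16 GiB, which is why KV cache memory often limits batch size before compute does.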

3. Deployment Infrastructure

Model Registry

Version control for weights (model.pt).

Tools: MLflow, HuggingFace Hub, AWS S3.
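Canary Deployment: Test model v2 on 1% of traffic before full rollout. (Blue/green deployment, by contrast, switches all traffic between two complete environments at once.)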
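Sending a small, fixed share of traffic to model v2 (often called a canary rollout) is commonly done by hashing a stable request key into buckets. Here is a minimal sketch of that idea. The function name, the `"v1"`/`"v2"` labels, and the choice of user ID as the key are assumptions for illustration.

```python
import hashlib


def route_model_version(request_key: str, canary_percent: float = 1.0) -> str:
    """Deterministically send roughly canary_percent of keys to 'v2', the rest to 'v1'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    # canary_percent=1.0 selects buckets 0..99, which is 1% of the key space.
    return "v2" if bucket < canary_percent * 100 else "v1"


# Hypothetical keys: the same user always lands on the same version.
sample = [route_model_version(f"user-{i}", canary_percent=10.0) for i in range(20_000)]
print(f"v2 share: {sample.count('v2') / len(sample):.1%}")
```

Hashing rather than random sampling keeps each user on one model version for the whole rollout, which makes metrics comparable and avoids inconsistent behavior between requests.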

Serverless Inference

AWS Lambda: Good for CPU models (XGBoost, small BERT). Poor fit for large models: Lambda offers no GPUs, and loading big weights pushes cold starts past 10s.
Specialized: Modal, RunPod, Replicate. Spin up containers only when requests arrive.
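A common way to limit cold-start cost in serverless handlers is to load the model once per container and reuse it across invocations. The sketch below illustrates the pattern. `load_model` is a hypothetical stand-in that sleeps instead of reading real weights, and the `handler(event)` shape is an assumed generic entry point, not any specific platform's API.

```python
import time
from typing import Any, Dict, Optional

_MODEL: Optional[Dict[str, Any]] = None  # survives between calls in a warm container
load_count = 0  # demo counter showing how often the slow load actually runs


def load_model() -> Dict[str, Any]:
    """Hypothetical stand-in for reading weights from disk into memory."""
    global load_count
    load_count += 1
    time.sleep(0.05)  # simulate slow weight loading
    return {"name": "demo-model"}


def get_model() -> Dict[str, Any]:
    """Load the model on first use, then return the cached instance."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def handler(event: Dict[str, Any]) -> Dict[str, Any]:
    """Assumed entry point: only the first call in a container pays the load cost."""
    model = get_model()
    return {"model": model["name"], "input": event["input"]}


handler({"input": "first request"})   # cold: triggers load_model()
handler({"input": "second request"})  # warm: reuses the cached model
print("model loads:", load_count)
```

This does not remove the cold start, it only ensures the cost is paid once per container. Keeping a minimum number of warm containers is the usual complement when the first-request latency matters.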

Code Example: Triton Client

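NVIDIA Triton handles request scheduling on the server and, when dynamic batching is enabled in the model configuration, batching as well. The client only sends tensors.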

python

import tritonclient.http as httpclient
import numpy as np

# 1. Connect to Inference Server
client = httpclient.InferenceServerClient(url="localhost:8000")

# 2. Prepare Input (Batch Size 1). Tensor name, shape, and dtype must match the deployed model's config.
input_data = np.array([[0.5, 0.2, 0.1]], dtype=np.float32)
inputs = [
    httpclient.InferInput("input_node", input_data.shape, "FP32")
]
inputs[0].set_data_from_numpy(input_data)

# 3. Request Prediction
results = client.infer(model_name="my_model", inputs=inputs)

# 4. Get Output
output_data = results.as_numpy("output_node")
print(f"Prediction: {output_data}")


Interview Tips 💡

"CPU vs GPU?": GPUs for deep learning, where matrix multiplication parallelizes well. CPUs for decision trees (XGBoost) or very small models.
"What is Cold Start?": Loading a 10GB model from disk into GPU memory takes 10-20 seconds, which severely impacts serverless latency.
"Multi-Model Serving": Loading 10 different models on 1 GPU. If VRAM fills up, swapping to CPU RAM kills performance. When the models are fine-tunes of one base model, use LoRA adapters to share the base weights.
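The reason LoRA adapters make multi-model serving cheap is parameter count: an adapter replaces a full weight update with two thin matrices of rank r. The sketch below works through that arithmetic. The layer size (4096 x 4096) and rank (16) are illustrative assumptions.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter: a (d_in x rank) matrix plus a (rank x d_out) matrix."""
    return rank * (d_in + d_out)


# Illustrative layer: 4096 x 4096 weights with a rank-16 adapter.
full = full_matrix_params(4096, 4096)
adapter = lora_adapter_params(4096, 4096, rank=16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

At under 1% of the layer's size per adapter in this example, many fine-tuned variants can sit in VRAM next to a single copy of the base weights instead of each needing a full model.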

