The Serving Challenge
Training a model is just the beginning. Serving it to millions of users is an infrastructure challenge.
Key Metrics:
- Throughput: Requests per second (RPS).
- Latency: Time to first token (TTFT) or total generation time.
- Cost: GPU hours are expensive (roughly $4/hr for an A100).
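Throughput and cost are linked: the more requests one GPU sustains, the cheaper each request becomes. A minimal sketch of that arithmetic, assuming a price of about $4 per GPU-hour and a sustained 50 RPS on a single, fully utilized GPU (both figures are illustrative):

```python
def cost_per_1k_requests(gpu_cost_per_hour: float, requests_per_second: float) -> float:
    """Dollar cost of serving 1,000 requests on one fully utilized GPU."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    requests_per_hour = requests_per_second * 3600
    return gpu_cost_per_hour / requests_per_hour * 1000


# Illustrative assumption: $4/hr GPU sustaining 50 requests per second.
cost = cost_per_1k_requests(gpu_cost_per_hour=4.0, requests_per_second=50)
print(f"${cost:.4f} per 1k requests")
```

Doubling sustained throughput halves this figure, which is why the optimization techniques in section 2 matter.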
1. Serving Patterns
Real-Time (Online) Inference
- SLA: Low latency (< 200ms).
- Architecture: REST/gRPC API.
- Example: ChatGPT, Fraud Detection during checkout.
- Challenges: Auto-scaling GPUs based on traffic.
graph LR
User -->|HTTP Request| API[FastAPI / Triton]
API -->|Tensor| GPU[GPU Worker]
GPU -->|Prediction| API
API -->|Response| User
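Auto-scaling usually comes down to deciding how many GPU workers the current traffic needs. A rough sketch of that decision, assuming each worker's sustainable RPS has been measured beforehand and that workers should run below full load to leave headroom (the function name and the 70% target are hypothetical choices, not part of any particular autoscaler):

```python
import math


def replicas_needed(current_rps: float, rps_per_replica: float,
                    target_utilization: float = 0.7, min_replicas: int = 1) -> int:
    """Number of GPU workers needed to serve current_rps with some headroom."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Each replica is only allowed to run at target_utilization of its capacity.
    effective_capacity = rps_per_replica * target_utilization
    return max(min_replicas, math.ceil(current_rps / effective_capacity))


# Illustrative assumption: 900 RPS of traffic, 50 RPS per worker.
print(replicas_needed(current_rps=900, rps_per_replica=50))
```

Because GPU workers can take a long time to start, real systems typically scale up earlier than this formula alone would suggest.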
Batch Inference
- SLA: Latency-tolerant (results are needed within hours or days, not milliseconds).
- Architecture: Cron Job / Airflow DAG.
- Example: Nightly product recommendation generation.
- Pros: High throughput, efficient GPU usage (packed batches).
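The "packed batches" idea can be sketched as a job that chunks its inputs into fixed-size batches and sends each one to the model in a single call. Here `predict_batch` is a hypothetical stand-in for the real model call, and `run_nightly_job` is an illustrative name:

```python
from typing import Callable, Dict, Iterable, Iterator, List


def packed_batches(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield full batches of batch_size items, plus a final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_nightly_job(user_ids: Iterable[int],
                    predict_batch: Callable[[List[int]], List[float]],
                    batch_size: int = 256) -> Dict[int, float]:
    """Run the (hypothetical) model once per batch and collect results by user."""
    results: Dict[int, float] = {}
    for batch in packed_batches(user_ids, batch_size):
        for user_id, prediction in zip(batch, predict_batch(batch)):
            results[user_id] = prediction
    return results


# Usage with a dummy model that scores each user id.
scores = run_nightly_job(range(5), lambda ids: [i * 0.1 for i in ids], batch_size=2)
print(scores)
```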
Streaming Inference
- SLA: Low time to first token; output is delivered incrementally as it is generated.
- Architecture: Server-Sent Events (SSE) or WebSocket.
- Example: LLM token generation (typing effect).
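With SSE, each generated token is written to the open HTTP response as a small text frame. A minimal sketch of the framing only (no web server), assuming tokens arrive from some generator; the `[DONE]` end marker is a convention some LLM APIs use, not part of the SSE format itself:

```python
from typing import Iterable, Iterator


def sse_frame(payload: str) -> str:
    """Format one SSE event: one 'data:' line per payload line, then a blank line."""
    lines = payload.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Turn a stream of tokens into a stream of SSE frames."""
    for token in tokens:
        yield sse_frame(token)
    # Assumed end-of-stream marker so the client knows generation has finished.
    yield sse_frame("[DONE]")


for frame in sse_events(["Hel", "lo", "!"]):
    print(repr(frame))
```

A web framework would send these frames with the `text/event-stream` content type and flush after each one, which produces the typing effect on the client.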
2. Optimization Techniques
How to make models faster and cheaper?
Dynamic Batching
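Group requests that arrive within a short time window into one batch and process them together on the GPU. The numbers below are illustrative.
- Without Batching: 1 req = 10ms compute + 10ms memory overhead. Total = 20ms.
- With Batching (Size 8): 8 reqs = 15ms compute + 10ms memory overhead. Total = 25ms, or about 3ms of GPU time per request. Each request still waits the full 25ms (plus queueing), so batching trades a little latency for much higher throughput.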
Tools: NVIDIA Triton Inference Server, TorchServe, vLLM.
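The core loop of a dynamic batcher can be sketched with a queue: wait for the first request, then keep collecting until the batch is full or a short deadline passes. This is a simplified illustration, not how Triton, TorchServe, or vLLM implement it; `collect_batch` and its defaults are hypothetical:

```python
import queue
import time
from typing import Any, List


def collect_batch(request_queue: "queue.Queue[Any]", max_batch_size: int = 8,
                  max_wait_s: float = 0.005) -> List[Any]:
    """Block for one request, then gather more until the batch is full or time is up."""
    batch = [request_queue.get()]  # always wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # deadline reached with a partial batch
    return batch


# Usage: ten queued requests become one full batch and one partial batch.
q: "queue.Queue[int]" = queue.Queue()
for request_id in range(10):
    q.put(request_id)
print(collect_batch(q))  # first 8 requests
print(collect_batch(q))  # remaining 2, returned after the wait expires
```

The `max_wait_s` deadline bounds how much latency batching can add: a larger value fills batches more often, a smaller one keeps tail latency low.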
Quantization (Shrinking Models)
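FP32 (32-bit float) is the default training precision, but its memory footprint is large.
- FP16: Half precision. Up to about 2x speedup on supporting hardware, 50% of the memory. Minimal accuracy loss. Most common.
- INT8: 8-bit integer. Up to about 4x speedup, 25% of the memory. Needs calibration.
- GPTQ / AWQ: 4-bit quantization for LLMs. Llama-3-70B weights shrink from roughly 140 GB in FP16 to roughly 35 GB, so it can run on 2x A100s instead of 4x.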
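The memory figures follow directly from bits per weight, and the "calibration" step amounts to choosing a scale that maps floats onto the integer range. The sketch below shows both with plain symmetric rounding; it is deliberately naive and is not the GPTQ or AWQ algorithm, which choose quantized values more carefully:

```python
from typing import List, Tuple


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT8 quantization: scale so the largest weight maps to +/-127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the INT8 values."""
    return [q * scale for q in quantized]


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB for a 70B-parameter model")

q, scale = quantize_int8([0.5, -1.0, 0.25, 0.003])
print(q, dequantize(q, scale))
```

The round trip loses at most half a quantization step per weight, which is the accuracy cost that calibration tries to keep small.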
KV Caching (For LLMs)
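Without a cache, a Transformer recomputes the attention keys and values for every previous token at each decoding step. A KV cache stores the Key/Value matrices of past tokens in GPU VRAM, so each step only computes them for the new token.
- Tradeoff: Increases VRAM usage significantly; the cache grows linearly with sequence length and batch size.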
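The VRAM cost can be estimated from the model shape: one key and one value vector per layer, per KV head, per token. A back-of-the-envelope sketch, where the example dimensions (32 layers, 32 KV heads, head size 128) are illustrative assumptions rather than the configuration of any specific model:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size; the leading 2 accounts for keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Illustrative shape in FP16 (2 bytes per value), 4096-token context, batch of 8.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.0f} GiB")
```

Halving the context length or the batch size halves the cache, which is why long contexts and large batches compete for the same VRAM.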
3. Deployment Infrastructure
Model Registry
Version control for weights (model.pt).
- Tools: MLflow, HuggingFace Hub, AWS S3.
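- Canary Deployment: Test model v2 on 1% of traffic before full rollout. (Blue/Green deployment, by contrast, keeps two full environments and switches all traffic from one to the other at once.)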
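Sending a small, fixed share of traffic to model v2 is often done by hashing a stable key such as the user ID, so the same user always hits the same version. A minimal sketch under that assumption; `route_version` and the version labels are hypothetical:

```python
import hashlib


def route_version(request_key: str, v2_percent: int = 1) -> str:
    """Deterministically send v2_percent% of keys to model v2, the rest to v1."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "v2" if bucket < v2_percent else "v1"


# The same key always maps to the same version.
print(route_version("user-42"), route_version("user-42"))

# Roughly 1% of a large population lands on v2.
share = sum(route_version(f"user-{i}") == "v2" for i in range(10_000)) / 10_000
print(f"v2 share: {share:.2%}")
```

Raising `v2_percent` step by step widens the rollout without moving users who were already on v2 back to v1.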
Serverless Inference
- AWS Lambda: Good for CPU models (XGBoost, small BERT). Not suited to large GPU models: Lambda does not offer GPUs, and loading a large model makes cold starts exceed 10s.
- Specialized: Modal, RunPod, Replicate. Spin up containers only when requests arrive.
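A common way to soften cold starts is to load the model once per container and reuse it across requests, so only the first request pays the loading cost. A minimal sketch of that pattern; the `handler` signature, the dummy model, and the sleep standing in for a slow load are all illustrative assumptions, not any platform's API:

```python
import functools
import time
from typing import Any, Dict


@functools.lru_cache(maxsize=1)
def get_model() -> Dict[str, str]:
    """Load the model on first use; later calls return the cached object."""
    time.sleep(0.05)  # stand-in for reading weights from disk into memory
    return {"name": "demo-model"}  # hypothetical model object


def handler(event: Any) -> Dict[str, Any]:
    """Per-request entry point: cheap after the first (cold) invocation."""
    model = get_model()
    return {"model": model["name"], "input": event}


start = time.perf_counter()
handler("first request")   # cold: pays the load cost
cold = time.perf_counter() - start

start = time.perf_counter()
handler("second request")  # warm: model is already in memory
warm = time.perf_counter() - start
print(f"cold={cold * 1000:.1f}ms warm={warm * 1000:.1f}ms")
```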
Code Example: Triton Client
NVIDIA Triton handles batching and scheduling automatically.
import tritonclient.http as httpclient
import numpy as np
# 1. Connect to Inference Server
client = httpclient.InferenceServerClient(url="localhost:8000")
# 2. Prepare Input (Batch Size 1)
input_data = np.array([[0.5, 0.2, 0.1]], dtype=np.float32)
inputs = [
    httpclient.InferInput("input_node", input_data.shape, "FP32")
]
inputs[0].set_data_from_numpy(input_data)
# 3. Request Prediction
results = client.infer(model_name="my_model", inputs=inputs)
# 4. Get Output
output_data = results.as_numpy("output_node")
print(f"Prediction: {output_data}")
Interview Tips
- "CPU vs GPU?" → GPU for deep learning (matrix multiplication parallelizes well). CPU for decision trees (XGBoost) or very small models.
- "What is Cold Start?" → Loading a 10GB model from disk to GPU memory can take 10-20 seconds. Severely impacts serverless latency.
- "Multi-Model Serving" → Loading 10 different models on 1 GPU. If VRAM fills up, swapping to CPU RAM kills performance. Use LoRA adapters to share base weights.
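To make the multi-model tradeoff concrete, here is a sketch of one possible policy: keep models in VRAM up to a budget and evict the least recently used one when a new model does not fit. LRU eviction is an assumption chosen for illustration (the text above only says that swapping is costly), and `ModelCache` is a hypothetical class that tracks sizes rather than loading real weights:

```python
from collections import OrderedDict
from typing import List


class ModelCache:
    """Tracks which models fit in a VRAM budget, evicting least recently used."""

    def __init__(self, vram_budget_gb: float) -> None:
        self.budget = vram_budget_gb
        self.used = 0.0
        self.models: "OrderedDict[str, float]" = OrderedDict()  # oldest first

    def load(self, name: str, size_gb: float) -> List[str]:
        """Make the model resident; return the names evicted to make room."""
        if name in self.models:
            self.models.move_to_end(name)  # mark as most recently used
            return []
        if size_gb > self.budget:
            raise ValueError(f"{name} does not fit in the VRAM budget")
        evicted = []
        while self.used + size_gb > self.budget:
            old_name, old_size = self.models.popitem(last=False)
            self.used -= old_size
            evicted.append(old_name)
        self.models[name] = size_gb
        self.used += size_gb
        return evicted


cache = ModelCache(vram_budget_gb=24)
cache.load("a", 10)
cache.load("b", 10)
cache.load("a", 10)          # touch "a" so "b" becomes the oldest
print(cache.load("c", 10))   # "b" is evicted to make room
```

Every eviction implies a later reload, so the fewer distinct full models share a GPU, the better; sharing one base model with small LoRA adapters avoids most of this churn.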