ML Model Serving

Taking models from Jupyter Notebooks to Production. Inference patterns (Real-time vs Batch), Batching strategies, and optimization techniques (Quantization, KV Caching).

The Serving Challenge

Training a model is just the beginning. Serving it to millions of users is an infrastructure challenge.

Key Metrics:

  • Throughput: Requests per second (RPS).
  • Latency: Time to first token (TTFT) or total generation time.
  • Cost: GPU hours are expensive (roughly $2-4/hr for an A100).
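
Throughput and cost are linked: the more requests a GPU sustains, the cheaper each one becomes. The sketch below is a back-of-the-envelope conversion from an hourly GPU price to cost per million requests. The function name and the example numbers ($3/hr, 50 RPS) are illustrative assumptions, not benchmarks.

```python
def cost_per_million_requests(gpu_price_per_hour: float, sustained_rps: float) -> float:
    """Dollar cost of serving one million requests on a single, fully utilized GPU."""
    if sustained_rps <= 0:
        raise ValueError("sustained_rps must be positive")
    requests_per_hour = sustained_rps * 3600
    return gpu_price_per_hour / requests_per_hour * 1_000_000


# Assumed numbers: a $3/hr GPU sustaining 50 requests per second.
print(f"${cost_per_million_requests(3.0, 50):.2f} per million requests")
```

Doubling sustained RPS on the same hardware halves this figure, which is why the batching and quantization techniques later in this article matter for cost as well as speed.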

1. Serving Patterns

Real-Time (Online) Inference

  • SLA: Low latency (< 200ms).
  • Architecture: REST/gRPC API.
  • Example: ChatGPT, Fraud Detection during checkout.
  • Challenges: Auto-scaling GPUs based on traffic.
mermaid
graph LR
    User -->|HTTP Request| API[FastAPI / Triton]
    API -->|Tensor| GPU[GPU Worker]
    GPU -->|Prediction| API
    API -->|Response| User
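
The auto-scaling challenge listed above often starts from a simple capacity calculation. Here is a minimal sketch, assuming each GPU worker sustains a known request rate and that we want to keep some headroom. `desired_replicas` and `target_utilization` are hypothetical names for illustration, not part of any autoscaler's API.

```python
import math


def desired_replicas(current_rps: float, rps_per_replica: float,
                     target_utilization: float = 0.7, min_replicas: int = 1) -> int:
    """Replicas needed so each GPU worker runs at or below target_utilization."""
    if rps_per_replica <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rps_per_replica must be > 0 and target_utilization in (0, 1]")
    needed = math.ceil(current_rps / (rps_per_replica * target_utilization))
    return max(min_replicas, needed)


# Assumed numbers: 100 RPS of traffic, 20 RPS per GPU worker, 70% target utilization.
print(desired_replicas(100, 20))
```

Real autoscalers also have to account for GPU provisioning delay and model load time, which is why a minimum replica count is usually kept warm.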

Batch Inference

  • SLA: Latency-tolerant (results are due within hours or days).
  • Architecture: Cron Job / Airflow DAG.
  • Example: Nightly product recommendation generation.
  • Pros: High throughput, efficient GPU usage (packed batches).
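
A batch job boils down to slicing the dataset into fixed-size chunks and calling the model once per chunk. The sketch below shows that loop. `run_batch_job` and `predict_batch` are hypothetical names, and the lambda stands in for a real model call.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batch_job(records: Sequence, predict_batch: Callable[[Sequence], List],
                  batch_size: int = 256) -> List:
    """Run the model over every record, one packed batch at a time."""
    predictions: List = []
    for batch in batched(records, batch_size):
        predictions.extend(predict_batch(batch))
    return predictions


# Stand-in model: doubles each input. A real job would call the GPU here.
scores = run_batch_job(list(range(10)), lambda batch: [x * 2 for x in batch], batch_size=4)
print(scores)
```

Because nobody is waiting on an individual result, the batch size can be chosen purely to maximize GPU utilization.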

Streaming Inference

  • SLA: Low time to first token; the response is delivered incrementally as it is generated.
  • Architecture: Server-Sent Events (SSE) or WebSocket.
  • Example: LLM token generation (typing effect).
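
With SSE, each token is sent as its own `data:` frame terminated by a blank line. The sketch below wraps a token iterator in that framing. The JSON payload shape and the `[DONE]` sentinel are conventions assumed here for illustration, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each generated token in Server-Sent Events framing."""
    for token in tokens:
        # JSON-encode the payload so tokens containing newlines cannot break the framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream marker


# Stand-in for a model that yields tokens one at a time.
for event in sse_events(["Hel", "lo", "!"]):
    print(event, end="")
```

A web framework would return this generator as a streaming response with the `text/event-stream` content type.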

2. Optimization Techniques

How to make models faster and cheaper?

Dynamic Batching

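Group requests that arrive close together into one batch so the GPU processes them in a single pass.

  • Without Batching: 1 req = 10ms compute + 10ms memory overhead. Total = 20ms.
  • With Batching (Size 8): 8 reqs = 15ms compute + 10ms memory overhead. Total = 25ms, or about 3ms per request amortized. Each individual request still waits the full 25ms, so batching trades a little latency for much higher throughput.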

Tools: NVIDIA Triton Inference Server, TorchServe, vLLM.
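
The core of a dynamic batcher is a queue plus two limits: a maximum batch size and a maximum wait time. Below is a minimal asyncio sketch of that idea. It is not how Triton, TorchServe, or vLLM implement it; the `DynamicBatcher` class and its parameters are hypothetical, and the lambda stands in for a GPU forward pass.

```python
import asyncio
from typing import Callable, List


class DynamicBatcher:
    """Collects requests until the batch is full or a short wait expires, then runs them together."""

    def __init__(self, predict_batch: Callable[[List[float]], List[float]],
                 max_batch_size: int = 8, max_wait_s: float = 0.005):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, x: float) -> float:
        """Called once per request; resolves when the batch containing x has run."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((x, future))
        return await future

    async def run(self) -> None:
        """Background loop: form a batch, run the model once, fan results back out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # One model call for the whole batch (synchronous here for simplicity).
            outputs = self.predict_batch([x for x, _ in batch])
            for (_, future), y in zip(batch, outputs):
                future.set_result(y)


async def demo() -> List[float]:
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(float(i)) for i in range(8)))
    worker.cancel()
    return results


print(asyncio.run(demo()))
```

The wait limit caps the latency a request can lose while waiting for batch-mates, and the size limit caps memory use per forward pass.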

Quantization (Shrinking Models)

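FP32 (32-bit float) is the default training precision, but it is large: 4 bytes per parameter.

  • FP16: Half precision. Up to 2x speedup, 50% of the memory. Minimal accuracy loss. Most common.
  • INT8: 8-bit integer. Up to 4x speedup, 25% of the memory. Needs calibration.
  • GPTQ / AWQ: 4-bit weight quantization for LLMs. Shrinks Llama-3-70B weights from about 140 GB (FP16) to about 35 GB, so it runs on 2x A100s instead of 4x.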
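
To see what quantization does to the numbers, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python, plus a helper for the weight-memory arithmetic. It covers weights only and is far simpler than GPTQ or AWQ; all function names are illustrative.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


weights = [0.5, -1.27, 0.02, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, restored)

# Weight memory for a 70-billion-parameter model at different precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

The rounding error per weight is at most half the scale, which is why a single outlier weight (a large `max_abs`) hurts accuracy for everything else in the tensor.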

KV Caching (For LLMs)

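Without a cache, an autoregressive transformer recomputes the keys and values for the entire sequence at every decoding step. A KV cache stores the Key/Value tensors of past tokens in GPU VRAM, so each step only computes the new token's projections and attends over the cached ones.

  • Tradeoff: VRAM usage grows linearly with sequence length and batch size.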
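
The toy below illustrates both halves of that tradeoff: a single-head attention step with and without a cache, and a size estimate for the cache itself. It is a sketch under strong simplifications (scalar "projections", the token used directly as the query), and the model dimensions in the size estimate are assumed values, not a specific model's published configuration.

```python
import math
from typing import List

Vector = List[float]


def project(x: Vector, scale: float) -> Vector:
    """Stand-in for a learned linear projection (W_k or W_v)."""
    return [scale * v for v in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over all keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * value[i] for w, value in zip(weights, values)) for i in range(d)]


def decode_without_cache(tokens: List[Vector]) -> Vector:
    """Recompute K and V for every token in the sequence at this step."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(tokens[-1], keys, values)


class KVCache:
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def decode_step(self, new_token: Vector) -> Vector:
        """Project only the new token, append it to the cache, attend over everything cached."""
        self.keys.append(project(new_token, 0.5))
        self.values.append(project(new_token, 2.0))
        return attend(new_token, self.keys, self.values)


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Cache size: 2 tensors (K and V) per layer, per head, per token, per sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
for step, token in enumerate(tokens, start=1):
    cached = cache.decode_step(token)
    assert cached == decode_without_cache(tokens[:step])  # same result, less work

# Assumed dimensions: 80 layers, 8 KV heads, head_dim 128, 8192 tokens, FP16.
print(kv_cache_bytes(80, 8, 128, 8192, 1) / 2**30, "GiB per sequence")
```

Per step, the cached path performs one key and one value projection instead of one per token in the sequence, at the price of keeping every past key and value resident in memory.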

3. Deployment Infrastructure

Model Registry

Version control for weights (model.pt).

  • Tools: MLflow, HuggingFace Hub, AWS S3.
  • Canary Deployment: Test model v2 on 1% of traffic before full rollout. (Blue/Green, by contrast, switches all traffic between two complete environments at once.)
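
Sending 1% of traffic to a new model version is usually done with a deterministic split, so the same user or request always hits the same version. Here is a minimal sketch using a hash of the request ID. The function name, the version labels, and the bucket count are assumptions for illustration.

```python
import hashlib


def route_model_version(request_id: str, canary_percent: float = 1.0) -> str:
    """Deterministically send about canary_percent of request IDs to v2."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return "v2" if bucket < canary_percent * 100 else "v1"


# Count how many of 10,000 synthetic request IDs land on the new version.
share = sum(route_model_version(f"req-{i}") == "v2" for i in range(10_000))
print(f"{share} of 10000 requests routed to v2")
```

Hashing instead of calling a random number generator keeps routing stable across retries, which makes it easier to compare metrics between the two versions.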

Serverless Inference

  • AWS Lambda: Good for CPU models (XGBoost, small BERT). Unsuitable for large GPU models: Lambda provides no GPUs, and loading big weights pushes cold starts past 10s.
  • Specialized: Modal, RunPod, Replicate. Spin up containers only when requests arrive.
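
A common way to limit cold-start cost on serverless platforms is to load the model once per container and reuse it across invocations. The sketch below shows the pattern with a cached loader. `get_model` and `handler` are hypothetical names, and the sleep and the summing lambda stand in for real weight loading and a real model.

```python
import functools
import time


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model once per container; later calls reuse the cached object."""
    time.sleep(0.05)  # stand-in for reading weights from disk (seconds in practice)
    return lambda features: sum(features)  # hypothetical stand-in model


def handler(event: dict) -> dict:
    """Entry point invoked per request by the serverless runtime."""
    model = get_model()  # slow on the first (cold) invocation, cached afterwards
    return {"prediction": model(event["features"])}


print(handler({"features": [1, 2, 3]}))  # cold: pays the load cost
print(handler({"features": [4, 5, 6]}))  # warm: reuses the loaded model
```

This only helps while the container stays alive. Once the platform scales to zero, the next request pays the full load time again.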

Code Example: Triton Client

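NVIDIA Triton handles scheduling on the server and batches requests automatically when dynamic batching is enabled in the model configuration. The client only sends tensors.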

python
import tritonclient.http as httpclient
import numpy as np

# 1. Connect to Inference Server
client = httpclient.InferenceServerClient(url="localhost:8000")

# 2. Prepare Input (Batch Size 1)
# "input_node" is a placeholder; it must match the input name in the model's config.
input_data = np.array([[0.5, 0.2, 0.1]], dtype=np.float32)
inputs = [
    httpclient.InferInput("input_node", list(input_data.shape), "FP32")
]
inputs[0].set_data_from_numpy(input_data)

# 3. Request Prediction
results = client.infer(model_name="my_model", inputs=inputs)

# 4. Get Output
output_data = results.as_numpy("output_node")
print(f"Prediction: {output_data}")

Interview Tips

  • "CPU vs GPU?": GPU for deep learning (matrix multiplication parallelizes well). CPU for decision trees (XGBoost) or very small models.
  • "What is Cold Start?": Loading a 10GB model from disk to GPU memory takes 10-20 seconds. This severely impacts serverless latency.
  • "Multi-Model Serving": Loading 10 different models on 1 GPU. If VRAM fills up, swapping to CPU RAM kills performance. When the models are fine-tunes of the same base model, use LoRA adapters to share the base weights.
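
The LoRA point is easiest to see with numbers. Each adapter stores two small matrices, B (d x r) and A (r x d), and the effective weight is the shared base plus their product. Below is a minimal pure-Python sketch for a single square weight matrix. The helper names and the dimensions (d = 4096, rank 8, 10 models) are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(base: Matrix, lora_b: Matrix, lora_a: Matrix, scaling: float = 1.0) -> Matrix:
    """Effective weight W + scaling * (B @ A); base is shared, only A and B are per-model."""
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


def params_full_copies(d: int, num_models: int) -> int:
    """Parameters held in memory when every model keeps its own d x d matrix."""
    return num_models * d * d


def params_shared_base(d: int, rank: int, num_models: int) -> int:
    """One shared d x d base plus a (d x rank) and (rank x d) adapter per model."""
    return d * d + num_models * 2 * d * rank


base = [[1.0, 0.0], [0.0, 1.0]]
adapter_b = [[1.0], [2.0]]   # d x r with r = 1
adapter_a = [[0.5, 0.5]]     # r x d
print(lora_weight(base, adapter_b, adapter_a))

print(params_full_copies(4096, 10), "vs", params_shared_base(4096, 8, 10))
```

Multi-adapter servers typically avoid materializing the merged weight and instead compute the base output plus the adapter's low-rank correction per request, so one copy of the base serves every fine-tune.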
