The Problem: Where Did 5 Seconds Go?
User: "Login took 5 seconds!"
You check each service:
Login Service: 10ms ✓
User Service: 5ms ✓
Auth Service: 8ms ✓
Database: 2ms ✓
Total: 25ms... but the user experienced 5 seconds!
Where did the other 4.975 seconds go?
Answer: Network latency, queues, retry loops, cascading failures.
Solution: Distributed tracing shows the COMPLETE journey.
What is Distributed Tracing?
Distributed tracing tracks a single request as it flows through multiple services, capturing timing, errors, and metadata at each step.
Key Concepts
1. Trace: The complete journey of a request
Trace ID: abc123
Total duration: 5.2s
Services involved: 8
Status: Error
2. Span: A single operation within a trace
Span: "Query Users DB"
Duration: 245ms
Parent: "Get User Profile"
Tags: {db: "postgres", query: "SELECT *..."}
3. Trace Context: Propagation metadata
traceparent: 00-abc123-def456-01
version: 00
trace-id: abc123
parent-span-id: def456
flags: 01 (sampled)
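The three concepts map onto a small data model. The sketch below is illustrative only: the `Span` dataclass, its field names, and the helper functions are assumptions made for this example, not OpenTelemetry SDK types.

```python
from dataclasses import dataclass, field
from typing import Optional
import secrets


@dataclass
class Span:
    """One timed operation (hypothetical model, not an SDK class)."""
    trace_id: str                 # shared by every span in the trace
    span_id: str                  # unique per operation
    parent_id: Optional[str]      # None marks the root span
    name: str
    duration_ms: float = 0.0
    tags: dict = field(default_factory=dict)


def new_root(name: str) -> Span:
    # 16 random bytes -> 32 hex chars (trace ID); 8 bytes -> 16 hex chars (span ID)
    return Span(secrets.token_hex(16), secrets.token_hex(8), None, name)


def new_child(parent: Span, name: str) -> Span:
    # A child inherits the trace ID and points back at its parent
    return Span(parent.trace_id, secrets.token_hex(8), parent.span_id, name)


def to_traceparent(span: Span, sampled: bool = True) -> str:
    # Header layout: version-traceid-parentid-flags
    return f"00-{span.trace_id}-{span.span_id}-{'01' if sampled else '00'}"


root = new_root("Get User Profile")
child = new_child(root, "Query Users DB")
child.tags.update({"db": "postgres"})
header = to_traceparent(child)
print(header)
```

The trace itself needs no class of its own here: it is simply the set of spans that share one `trace_id`.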
How It Works
Architecture
graph LR
Client[Client] -->|TraceID: abc123| API[API Gateway]
API -->|TraceID: abc123<br/>ParentSpan: span1| Auth[Auth Service]
API -->|TraceID: abc123<br/>ParentSpan: span1| User[User Service]
User -->|TraceID: abc123<br/>ParentSpan: span3| DB[(Database)]
API -.->|Spans| Collector[Trace Collector]
Auth -.->|Spans| Collector
User -.->|Spans| Collector
DB -.->|Spans| Collector
Collector --> Jaeger[Jaeger UI]
Flow
1. Request arrives at API Gateway
   - Generate Trace ID: "abc123"
   - Create root span
2. API Gateway calls Auth Service
   - Pass trace ID in headers
   - Create child span
3. Each Service:
   - Extracts trace ID from headers
   - Creates its own span
   - Adds metadata (duration, tags, errors)
   - Sends span to collector
4. Trace Collector:
   - Assembles all spans by Trace ID
   - Builds complete trace tree
5. UI Visualization:
   - Shows waterfall diagram
   - Highlights slow/failing spans
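Steps 4 and 5 can be sketched in a few lines. This is a simplified illustration, not how Jaeger or any real collector is implemented; the span dictionaries and their field names are assumptions, with the span IDs borrowed from the architecture diagram.

```python
from collections import defaultdict

# Hypothetical span records as a collector might receive them, in arbitrary order
spans = [
    {"trace_id": "abc123", "span_id": "span3", "parent_id": "span1", "name": "User Service"},
    {"trace_id": "abc123", "span_id": "span1", "parent_id": None, "name": "API Gateway"},
    {"trace_id": "abc123", "span_id": "span4", "parent_id": "span3", "name": "Database"},
    {"trace_id": "abc123", "span_id": "span2", "parent_id": "span1", "name": "Auth Service"},
    {"trace_id": "zzz999", "span_id": "s1", "parent_id": None, "name": "Other request"},
]


def group_by_trace(all_spans):
    """Bucket spans by trace ID."""
    traces = defaultdict(list)
    for s in all_spans:
        traces[s["trace_id"]].append(s)
    return traces


def build_tree(trace_spans):
    """Link each span to its parent; return (root span, children index)."""
    children = defaultdict(list)
    root = None
    for s in trace_spans:
        if s["parent_id"] is None:
            root = s
        else:
            children[s["parent_id"]].append(s)
    return root, children


def render(span, children, depth=0):
    """Indented outline of the tree, one line per span."""
    lines = ["  " * depth + span["name"]]
    for child in sorted(children[span["span_id"]], key=lambda c: c["span_id"]):
        lines.extend(render(child, children, depth + 1))
    return lines


traces = group_by_trace(spans)
root, children = build_tree(traces["abc123"])
outline = render(root, children)
print("\n".join(outline))
```

Because spans arrive out of order and from different services, the tree can only be built once the parent links are indexed, which is why the collector groups first and links second.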
Implementation
OpenTelemetry (Industry Standard)
// Node.js with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
});
// Configure exporter (Jaeger)
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});
const tracer = provider.getTracer('user-service');
// Manual instrumentation
const { trace, context, propagation, SpanStatusCode } = require('@opentelemetry/api');
const express = require('express');
const app = express();
// `db` is assumed to be an already-configured database client defined elsewhere
app.get('/user/:id', async (req, res) => {
  // Create custom span
  const span = tracer.startSpan('getUserProfile');
  span.setAttribute('user.id', req.params.id);
  // Child spans are parented through a context object, not a `parent` option
  const ctx = trace.setSpan(context.active(), span);
  try {
    // Simulate DB query
    const dbSpan = tracer.startSpan('queryDatabase', {
      attributes: {
        'db.system': 'postgresql',
        'db.statement': 'SELECT * FROM users WHERE id = $1'
      }
    }, ctx);
    let user;
    try {
      user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
    } finally {
      dbSpan.end(); // End the span even if the query throws
    }
    // Simulate external API call
    const apiSpan = tracer.startSpan('callAuthService', {}, ctx);
    try {
      // Propagate trace context: inject() writes the traceparent header for apiSpan
      const headers = {};
      propagation.inject(trace.setSpan(ctx, apiSpan), headers);
      const authResponse = await fetch(`http://auth-service/verify/${user.id}`, { headers });
    } finally {
      apiSpan.end();
    }
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(user);
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message
    });
    span.recordException(error);
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});
Python with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from flask import Flask, request
import requests
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
app = Flask(__name__)
# Auto-instrument Flask and Requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
@app.route('/user/<user_id>')
def get_user(user_id):
    # Get current span (auto-created by Flask instrumentation)
    current_span = trace.get_current_span()
    current_span.set_attribute("user.id", user_id)
    # Create custom child span
    with tracer.start_as_current_span("fetch_user_from_db") as span:
        span.set_attribute("db.system", "postgresql")
        # Record the parameterized statement, not the interpolated value
        span.set_attribute("db.statement", "SELECT * FROM users WHERE id = %s")
        # `db` is assumed to be a database client defined elsewhere; user_id is
        # passed as a bound parameter to avoid SQL injection
        user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    # Another custom span
    with tracer.start_as_current_span("verify_permissions") as span:
        span.set_attribute("auth.service", "auth-service")
        # Trace context auto-propagated in headers
        response = requests.get(f"http://auth-service/verify/{user_id}")
        if response.status_code != 200:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.add_event("permission_denied")
    return {"user": user}
Trace Context Propagation
W3C Trace Context Standard
HTTP Header: traceparent
Format: 00-<trace-id>-<parent-span-id>-<flags>
Example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
version: 00
trace-id: 0af7651916cd43dd8448eb211c80319c (128-bit)
parent-span-id: b7ad6b7169203331 (64-bit)
flags: 01 (sampled)
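On the receiving side, a service has to validate and split this header before it can create a child span. The following is a minimal sketch that handles only the version `00` layout shown above; `TraceParent` and `parse_traceparent` are hypothetical names for this example, and instrumented services would normally rely on the tracing library's propagator instead.

```python
import re
from typing import NamedTuple, Optional

# Lowercase hex only: 2-char version, 32-char trace ID, 16-char parent ID, 2-char flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are treated as invalid
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


tp = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(tp)
```

Note that the shortened IDs used in the Key Concepts example (`00-abc123-def456-01`) are for readability only; a strict parser like this one rejects them.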
Manual Propagation
from opentelemetry import trace

def propagate_trace_context(outgoing_request):
    """Manually propagate trace context"""
    # get_current_span() never returns None; with no active span it returns a
    # non-recording span whose context is invalid, so check is_valid instead
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        traceparent = f"00-{ctx.trace_id:032x}-{ctx.span_id:016x}-{ctx.trace_flags:02x}"
        outgoing_request.headers['traceparent'] = traceparent
    return outgoing_request
Sampling Strategies
Problem: Too many traces = storage/performance cost
Solution: Sample strategically
1. Probabilistic Sampling
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)
provider = TracerProvider(sampler=sampler)
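A useful property of ratio sampling is that the decision can be derived from the trace ID itself, so every service reaches the same verdict for the same trace without coordinating. The sketch below shows the general idea only and is not the SDK's exact code; `ratio_sampled` is a hypothetical helper.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # compare on the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


rng = random.Random(42)  # fixed seed so the run is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(ratio_sampled(t, 0.1) for t in trace_ids)
fraction = kept / len(trace_ids)
print(f"kept {fraction:.1%} of traces")
```

This only works because trace IDs are generated randomly; IDs with a predictable pattern would skew the sampled fraction.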
2. Rate Limiting Sampling
# Note: the OpenTelemetry Python SDK's sampling module does not ship a
# RateLimitingSampler; this assumes a custom Sampler subclass with that name
# Max 100 traces per second
sampler = RateLimitingSampler(100)
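One common way to cap the number of sampled traces per second is a token bucket. The class below is a hypothetical, standalone sketch of that technique (it is not an OpenTelemetry API); the injectable `clock` parameter is an assumption added so the behaviour can be checked deterministically.

```python
import time


class TokenBucketSampler:
    """Hypothetical limiter: long-run average of `rate` sampled traces per
    second, allowing bursts of up to `rate` when the bucket is full."""

    def __init__(self, rate: float, clock=time.monotonic):
        self.rate = rate
        self.tokens = rate          # start with a full bucket
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at one second's worth
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Drive it with a fake clock so the result does not depend on real time
fake_now = [0.0]
sampler = TokenBucketSampler(rate=100, clock=lambda: fake_now[0])
burst = sum(sampler.should_sample() for _ in range(500))    # 500 requests at t=0
fake_now[0] = 0.5                                           # half a second later
refill = sum(sampler.should_sample() for _ in range(500))
print(burst, refill)
```

Unlike probabilistic sampling, the sampled volume stays bounded during traffic spikes, at the price of sampling a smaller fraction exactly when traffic is highest.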
3. Parent-Based Sampling
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# If the parent span is sampled, sample children too; root spans are sampled at 10%
sampler = ParentBased(root=TraceIdRatioBased(0.1))
4. Custom Sampling (Head-based)
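from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class CustomSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # A head-based sampler only sees attributes supplied at span start.
        # Status code and duration are usually unknown at that point, so these
        # two checks fire only if the caller passes them in; sampling on the
        # outcome of a request needs tail-based sampling.
        if attributes.get('http.status_code', 0) >= 500:  # always sample errors
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if attributes.get('duration_ms', 0) > 1000:  # always sample slow requests
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Sample 1% of normal requests
        if trace_id % 100 == 0:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "CustomSampler"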
5. Tail-based Sampling
Problem: Head-based sampling doesn't know whether a request will be slow or fail
Solution: Decide AFTER the request completes
Implementation:
1. Buffer all spans for T seconds
2. After T, decide which traces to keep
3. Send sampled traces to storage
4. Discard others
Trade-off: Traces reach storage T seconds later, and buffering every span costs more memory
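The four implementation steps can be sketched as a small in-memory buffer. This is an illustration under simplifying assumptions (a single process, and a fixed wait measured from the first span of each trace); `TailSampler` and the span dictionary fields are hypothetical names, and production systems typically run this logic in a dedicated collector tier.

```python
from collections import defaultdict


class TailSampler:
    """Hypothetical in-memory tail sampler keyed by trace ID."""

    def __init__(self, wait_s: float, slow_ms: float):
        self.wait_s = wait_s
        self.slow_ms = slow_ms
        self.buffer = defaultdict(list)   # trace_id -> buffered spans
        self.first_seen = {}              # trace_id -> arrival time of first span

    def add(self, span: dict, now: float) -> None:
        # Step 1: buffer every span
        self.first_seen.setdefault(span["trace_id"], now)
        self.buffer[span["trace_id"]].append(span)

    def flush(self, now: float) -> list:
        """Step 2: decide on every trace that has waited at least wait_s seconds."""
        kept = []
        ready = [t for t, seen in self.first_seen.items() if now - seen >= self.wait_s]
        for trace_id in ready:
            spans = self.buffer.pop(trace_id)
            del self.first_seen[trace_id]
            has_error = any(s.get("error") for s in spans)
            is_slow = max(s["duration_ms"] for s in spans) > self.slow_ms
            if has_error or is_slow:
                kept.append((trace_id, spans))   # step 3: would go to storage
            # step 4: anything else is dropped with the buffer entry
        return kept


sampler = TailSampler(wait_s=10, slow_ms=1000)
sampler.add({"trace_id": "fast", "duration_ms": 25}, now=0)
sampler.add({"trace_id": "slow", "duration_ms": 5200}, now=1)
sampler.add({"trace_id": "failed", "duration_ms": 40, "error": True}, now=2)
early = sampler.flush(now=5)
kept_ids = sorted(t for t, _ in sampler.flush(now=12))
print(early, kept_ids)
```

The memory cost is visible in the design: every span of every trace sits in `buffer` for the full wait period, including the traces that end up discarded.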
Visualization & Analysis
Jaeger UI Example
Trace: User Login Flow (5.2s total)
├─ API Gateway (5.2s)
│  ├─ Authenticate User (2.1s) ⚠️ SLOW
│  │  ├─ Query User DB (150ms)
│  │  └─ Verify Password (1.95s) ⚠️ BOTTLENECK
│  │     └─ bcrypt.compare (1.95s) ← Found it!
│  ├─ Load User Profile (3ms)
│  └─ Generate JWT (2ms)
Found: bcrypt cost factor too high (14)
Fix: Reduce cost factor to 12
Result: Verify Password drops from 1.95s to roughly 0.5s (each cost step doubles the work, so 14 → 12 is about 4× faster)
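What the waterfall view makes visible is each span's self time: its own duration minus the time covered by its children. The sketch below computes that for the example trace above, assuming child spans run one after another without overlapping; the tuple layout and `self_times` helper are assumptions for illustration.

```python
# (name, parent, duration in ms), copied from the example trace above
SPANS = [
    ("API Gateway", None, 5200),
    ("Authenticate User", "API Gateway", 2100),
    ("Query User DB", "Authenticate User", 150),
    ("Verify Password", "Authenticate User", 1950),
    ("bcrypt.compare", "Verify Password", 1950),
    ("Load User Profile", "API Gateway", 3),
    ("Generate JWT", "API Gateway", 2),
]


def self_times(spans):
    """Self time = span duration minus the total duration of its direct children."""
    child_total = {}
    for _, parent, duration in spans:
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + duration
    return {name: duration - child_total.get(name, 0) for name, _, duration in spans}


times = self_times(SPANS)
ranked = sorted(times.items(), key=lambda kv: kv[1], reverse=True)
for name, ms in ranked:
    print(f"{name}: {ms}ms")
```

`bcrypt.compare` accounts for all 1950ms of the password check, which matches the finding above. On these numbers the gateway itself also shows about 3.1s that no child span explains; in a real trace that would usually point at uninstrumented work or time spent waiting.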
Key Metrics from Traces
// Extract from traces (units moved into the key names so the object is valid JavaScript)
const metrics = {
  p50_latency_ms: 120,
  p95_latency_ms: 450,
  p99_latency_ms: 1200,
  error_rate_pct: 0.5,
  // Service dependencies: share of requests (%) that cross each edge
  dependencies: {
    'api-gateway -> auth-service': 95,
    'api-gateway -> user-service': 100,
  },
  // Bottlenecks
  slowest_services: [
    { service: 'auth-service', avg_ms: 1200 },
    { service: 'payment-service', avg_ms: 800 },
  ],
};
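Latency percentiles and error rate like these can be derived directly from the root span of each trace. The sketch below uses the nearest-rank percentile method on made-up data; the `root_spans` records and the `percentile` helper are assumptions for illustration, and real backends often use approximate histograms instead.

```python
def percentile(sorted_values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of the
    data at or below it. Expects an ascending list and an integer pct."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


# Hypothetical root-span records: (duration in ms, whether the trace ended in error)
root_spans = [(100 + i, i % 200 == 0) for i in range(1000)]

durations = sorted(d for d, _ in root_spans)
summary = {
    "p50_latency_ms": percentile(durations, 50),
    "p95_latency_ms": percentile(durations, 95),
    "p99_latency_ms": percentile(durations, 99),
    "error_rate_pct": 100 * sum(err for _, err in root_spans) / len(root_spans),
}
print(summary)
```

If traces are sampled, these figures describe only the sampled population; a sampler biased toward errors and slow requests inflates both the percentiles and the error rate.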
Real-World Use Cases
1. Debugging Production Issues (Uber)
Problem: Some rides take 30s to match
Traditional debugging: Check logs of 50+ microservices ❌
Distributed tracing:
1. Search for traces with >30s duration
2. Visualize waterfall
3. Find: "Driver Availability Check" taking 28s
4. Root cause: DB index missing
5. Fix: Add index
6. Result: Matching now <1s
2. Performance Optimization (Netflix)
Goal: Optimize video playback start time
Trace analysis shows:
- 40% time: CDN redirect
- 30% time: DRM license fetch
- 20% time: Manifest parsing
- 10% time: Video buffering
Actions:
1. Cache DRM licenses (save 30%)
2. Preload manifest (save 15%)
Result: 45% faster playback start
3. Chaos Engineering (Google)
Inject faults and trace impact:
1. Inject 500ms latency in Auth Service
2. Traces show cascading timeouts:
   - API Gateway timeout (1s)
   - User Service retry storm
   - Database connection pool exhaustion
3. Fix: Add circuit breakers
4. Verify: Traces show graceful degradation
Tools Comparison
| Tool | Type | Best For | Cost |
|---|---|---|---|
| Jaeger | Open-source | Self-hosted | Free |
| Zipkin | Open-source | Simple setup | Free |
| Datadog APM | Commercial | Full-stack observability | $$$ |
| New Relic | Commercial | Enterprise | $$$ |
| Lightstep | Commercial | Large scale | $$$$ |
| AWS X-Ray | Cloud | AWS workloads | $-$$ |
| Google Cloud Trace | Cloud | GCP workloads | $-$$ |
Best Practices
✅ DO:
- Use semantic conventions (OpenTelemetry standards)
- Add business context (user_id, order_id)
- Sample intelligently (errors, slow requests)
- Instrument async operations
- Propagate context across boundaries
❌ DON'T:
- Log entire request/response bodies (PII risk)
- Sample 100% in production (cost!)
- Create too many custom spans (noise)
- Forget to end spans (memory leaks)
Interview Tips 💡
When discussing distributed tracing in interviews:
- Problem: "In microservices, hard to track request across 50 services..."
- Trace ID propagation: "Pass unique ID in headers, each service adds its span..."
- Sampling: "Can't trace everything - sample 1% or only errors/slow requests..."
- Tools: "OpenTelemetry for instrumentation, Jaeger for visualization..."
- Use case: "Found slow DB query causing 5s login - traced through 8 services..."
- Trade-offs: "Adds some per-request overhead (usually small when spans are exported asynchronously in batches), storage costs, implementation effort..."
Related Concepts
- Logging — Complementary observability
- Metrics — Time-series monitoring
- APM (Application Performance Monitoring) — Broader observability
- Service Mesh — Automatic tracing
- Microservices — Architecture pattern