The Problem: Where Did 5 Seconds Go?

User: "Login took 5 seconds!"
You check each service:
  Login Service:   10ms ✓
  User Service:     5ms ✓
  Auth Service:     8ms ✓
  Database:         2ms ✓
  Total: 25ms... but user experienced 5 seconds!

Where did the other 4.975 seconds go?

Answer: In the gaps between services: network latency, queueing, retry loops, and cascading failures that no single service's own timing captures.

Solution: Distributed tracing shows the COMPLETE journey.

What is Distributed Tracing?

Distributed tracing tracks a single request as it flows through multiple services, capturing timing, errors, and metadata at each step.

Key Concepts

1. Trace: The complete journey of a request

Trace ID: abc123
Total duration: 5.2s
Services involved: 8
Status: Error

2. Span: A single operation within a trace

Span: "Query Users DB"
Duration: 245ms
Parent: "Get User Profile"
Tags: {db: "postgres", query: "SELECT *..."}

3. Trace Context: Propagation metadata

traceparent: 00-abc123-def456-01   (IDs shortened for readability; real IDs are 32 and 16 hex characters)
  version: 00
  trace-id: abc123
  parent-span-id: def456
  flags: 01 (sampled)

How It Works

Architecture

mermaid

graph LR
    Client[Client] -->|TraceID: abc123| API[API Gateway]
    API -->|TraceID: abc123<br/>ParentSpan: span1| Auth[Auth Service]
    API -->|TraceID: abc123<br/>ParentSpan: span1| User[User Service]
    User -->|TraceID: abc123<br/>ParentSpan: span3| DB[(Database)]
    
    API -.->|Spans| Collector[Trace Collector]
    Auth -.->|Spans| Collector
    User -.->|Spans| Collector
    DB -.->|Spans| Collector
    
    Collector --> Jaeger[Jaeger UI]

Flow

1. Request arrives at API Gateway
   - Generate Trace ID: "abc123"
   - Create root span
   
2. API Gateway calls Auth Service
   - Pass trace ID in headers
   - Create child span
   
3. Each Service:
   - Extracts trace ID from headers
   - Creates its own span
   - Adds metadata (duration, tags, errors)
   - Sends span to collector
   
4. Trace Collector:
   - Assembles all spans by Trace ID
   - Builds complete trace tree
   
5. UI Visualization:
   - Shows waterfall diagram
   - Highlights slow/failing spans
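
Step 4 of the flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration of grouping spans by trace ID and linking them to their parents, not how Jaeger's collector is actually implemented. The `Span` fields, the `build_trace_tree` and `render` helpers, and the sample durations are all hypothetical names and values chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float
    children: list = field(default_factory=list)


def build_trace_tree(spans):
    """Group spans by trace ID and link each span to its parent.

    Returns a dict mapping trace_id -> root Span.
    """
    by_trace = defaultdict(dict)
    for span in spans:
        by_trace[span.trace_id][span.span_id] = span

    roots = {}
    for trace_id, by_id in by_trace.items():
        for span in by_id.values():
            parent = by_id.get(span.parent_id)
            if parent is not None:
                parent.children.append(span)
            elif span.parent_id is None:
                roots[trace_id] = span
    return roots


def render(span, depth=0):
    """Return waterfall-style lines, one per span, indented by depth."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:g}ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Spans may reach the collector in any order.
spans = [
    Span("abc123", "span3", "span1", "User Service", 5),
    Span("abc123", "span1", None, "API Gateway", 25),
    Span("abc123", "span2", "span1", "Auth Service", 8),
    Span("abc123", "span4", "span3", "Database", 2),
]
root = build_trace_tree(spans)["abc123"]
print("\n".join(render(root)))
```

A real collector also has to cope with spans that arrive late or whose parent never arrives; this sketch simply leaves such orphans out of the tree.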

Implementation

OpenTelemetry (Industry Standard)

javascript

// Node.js with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
});

// Configure exporter (Jaeger)
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

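// API helpers for context handling, header injection, and status codes
const { trace, context, propagation, SpanStatusCode } = require('@opentelemetry/api');

const tracer = provider.getTracer('user-service');

// Manual instrumentation
const express = require('express');
const app = express();

app.get('/user/:id', async (req, res) => {
  // Create custom span
  const span = tracer.startSpan('getUserProfile');
  span.setAttribute('user.id', req.params.id);

  // Context carrying `span`; spans started with it become its children.
  // (startSpan has no `parent` option; the parent comes from the context.)
  const ctx = trace.setSpan(context.active(), span);

  try {
    // DB query (`db` is assumed to be an already-configured client)
    const dbSpan = tracer.startSpan('queryDatabase', {
      attributes: {
        'db.system': 'postgresql',
        'db.statement': 'SELECT * FROM users WHERE id = $1'
      }
    }, ctx);

    let user;
    try {
      user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
    } finally {
      dbSpan.end();  // End the span even if the query throws
    }

    // External API call
    const apiSpan = tracer.startSpan('callAuthService', {}, ctx);

    try {
      // Propagate trace context: inject() writes the W3C traceparent header
      // for apiSpan, including the real sampling flag
      const headers = {};
      propagation.inject(trace.setSpan(ctx, apiSpan), headers);

      const authResponse = await fetch(`http://auth-service/verify/${user.id}`, { headers });
    } finally {
      apiSpan.end();
    }

    span.setStatus({ code: SpanStatusCode.OK });  // 0 would mean UNSET, not OK
    res.json(user);

  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message
    });
    span.recordException(error);
    res.status(500).json({ error: error.message });

  } finally {
    span.end();
  }
});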

Python with OpenTelemetry

python

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from flask import Flask, request
import requests

# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

app = Flask(__name__)

# Auto-instrument Flask and Requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route('/user/<user_id>')
def get_user(user_id):
    # Get current span (auto-created by Flask instrumentation)
    current_span = trace.get_current_span()
    current_span.set_attribute("user.id", user_id)
    
    # Create custom child span
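    with tracer.start_as_current_span("fetch_user_from_db") as span:
        span.set_attribute("db.system", "postgresql")
        # Record the parameterized statement, not the interpolated value
        span.set_attribute("db.statement", "SELECT * FROM users WHERE id = %s")

        # Parameterized query avoids SQL injection (`db` is an assumed DB client)
        user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))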
    
    # Another custom span
    with tracer.start_as_current_span("verify_permissions") as span:
        span.set_attribute("auth.service", "auth-service")
        
        # Trace context auto-propagated in headers
        response = requests.get(f"http://auth-service/verify/{user_id}")
        
        if response.status_code != 200:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.add_event("permission_denied")
    
    return {"user": user}

Trace Context Propagation

W3C Trace Context Standard

HTTP Header: traceparent
Format:
  00-<trace-id>-<parent-span-id>-<flags>
  
Example:
  traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
  
  version: 00
  trace-id: 0af7651916cd43dd8448eb211c80319c (128-bit)
  parent-span-id: b7ad6b7169203331 (64-bit)
  flags: 01 (sampled)
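
To make the header layout concrete, here is a small parser sketch in Python. It only handles the version `00` layout shown above and applies the basic validity rules (lowercase hex, fixed lengths, no all-zero IDs, version `ff` rejected). The names `TraceParent` and `parse_traceparent` are hypothetical, chosen for this example; in a real service the OpenTelemetry propagator does this parsing for you.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
print(parsed)
```

A service that receives a malformed header should ignore it and start a new trace rather than fail the request.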

Manual Propagation

python

from opentelemetry import trace

def propagate_trace_context(outgoing_request):
    """Manually propagate trace context via the W3C traceparent header."""
    span_context = trace.get_current_span().get_span_context()

    # get_current_span() never returns None; with no active span it returns
    # a placeholder whose context is invalid, so check validity instead.
    if span_context.is_valid:
        traceparent = (
            f"00-{span_context.trace_id:032x}"
            f"-{span_context.span_id:016x}"
            f"-{int(span_context.trace_flags):02x}"
        )
        outgoing_request.headers['traceparent'] = traceparent

    # In practice, opentelemetry.propagate.inject(headers) does this for you.
    return outgoing_request
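
The receiving side is the mirror image: keep the caller's trace ID, mint a new span ID, and fall back to a new trace when the header is missing or malformed. The sketch below uses only the standard library. The helper name `next_traceparent`, its length-only validation, and the choice to mark new traces as sampled are simplifying assumptions for illustration, not OpenTelemetry behavior.

```python
import secrets


def next_traceparent(incoming):
    """Build the traceparent for a span started while handling a request.

    Keeps the caller's trace ID and flags when `incoming` has the expected
    shape; otherwise starts a new trace. Every span gets a fresh span ID.
    """
    parts = incoming.split("-") if incoming else []
    valid = (
        len(parts) == 4
        and len(parts[1]) == 32
        and len(parts[2]) == 16
        and len(parts[3]) == 2
    )
    if valid:
        trace_id, flags = parts[1], parts[3]
    else:
        # Assumption for this sketch: new traces are always marked sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
print(next_traceparent(incoming))  # same trace ID, new span ID
print(next_traceparent(None))      # brand-new trace
```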

Sampling Strategies

Problem: Too many traces = storage/performance cost

Solution: Sample strategically

1. Probabilistic Sampling

python

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)

provider = TracerProvider(sampler=sampler)
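
The useful property of ratio-based sampling is that the decision is a pure function of the trace ID, so every service that sees the same trace reaches the same verdict without coordination. A sketch of the idea follows; the function name `ratio_sampled` is hypothetical, and the exact comparison inside any given SDK may differ in detail.

```python
def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID alone, so every service agrees."""
    # Treat the low 64 bits of the ID as a uniform random number and
    # keep the trace if it falls below ratio * 2**64.
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


trace_id = 0x0AF7651916CD43DD8448EB211C80319C
print(ratio_sampled(trace_id, 0.1))  # False
print(ratio_sampled(trace_id, 1.0))  # True
```

This relies on trace IDs being generated randomly; IDs with a predictable low half would skew the sampled fraction.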

2. Rate Limiting Sampling

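Cap the number of sampled traces per unit of time, for example at most 100 traces per second. The OpenTelemetry Python SDK does not include a rate-limiting sampler (its built-ins are the always-on, always-off, ratio-based, and parent-based samplers), so this strategy requires a custom sampler or a limit applied in the collector.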
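
The usual mechanism behind a rate limit is a token bucket. The sketch below shows the core logic in plain Python; the class name `RateLimiter` and its `clock` parameter are hypothetical and not part of OpenTelemetry. To use it in a real service it would need to be wrapped in a custom sampler.

```python
import time


class RateLimiter:
    """Token bucket: averages at most `max_per_second` sampled traces per
    second, allowing bursts of up to `max_per_second` at once."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.rate = float(max_per_second)
        self.clock = clock
        self.tokens = self.rate  # start with a full bucket
        self.last = clock()

    def should_sample(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = RateLimiter(100)
decisions = [limiter.should_sample() for _ in range(1000)]
print(sum(decisions), "of 1000 back-to-back requests sampled")
```

The injectable clock is there so the limiter can be tested deterministically without sleeping.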

3. Parent-Based Sampling

python

from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# If parent span is sampled, sample children too
sampler = ParentBased(root=TraceIdRatioBased(0.1))
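
The decision rule itself is short enough to spell out. The following sketch is a simplified model of it, not the SDK's code: `parent_based_decision` is a hypothetical function, and the root sampler is passed in as a plain callable.

```python
from typing import Callable, Optional


def parent_based_decision(
    parent_sampled: Optional[bool],
    trace_id: int,
    root_sampler: Callable[[int], bool],
) -> bool:
    """`parent_sampled` is None for a root span (no incoming trace context)."""
    if parent_sampled is None:
        return root_sampler(trace_id)  # new trace: the root sampler decides
    return parent_sampled              # otherwise follow the caller's decision


def one_percent(trace_id: int) -> bool:
    return trace_id % 100 == 0


print(parent_based_decision(True, 12345, one_percent))   # True: parent was sampled
print(parent_based_decision(False, 12300, one_percent))  # False: parent was dropped
print(parent_based_decision(None, 12300, one_percent))   # True: root, 12300 % 100 == 0
```

Following the parent in both directions is what keeps traces whole: a trace is either recorded by every service or by none.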

4. Custom Sampling (Head-based)

python

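from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class CustomSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Head-based decisions run when the span STARTS, so the status code
        # and duration are not known yet. Rules for errors and slow requests
        # belong in tail-based sampling instead.

        # Always sample a critical route (hypothetical example path)
        if str(attributes.get('http.target', '')).startswith('/login'):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Sample 1% of normal requests
        if trace_id % 100 == 0:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return 'CustomSampler'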

5. Tail-based Sampling

Problem: Head-based sampling decides before it knows whether a request will be slow or fail
Solution: Decide AFTER the request completes

Implementation:
1. Buffer all spans for T seconds
2. After T, decide which traces to keep
3. Send sampled traces to storage
4. Discard the others

Trade-off: Traces reach storage up to T seconds late, and every in-flight trace must be buffered in memory
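
The four steps above can be sketched as a small buffer keyed by trace ID. This is a toy, single-process illustration under stated assumptions: spans are plain dicts with `trace_id`, `duration_ms`, and an optional `error` key, and the caller decides when a trace is finished. The class name `TailSampler` and the 1000 ms threshold are hypothetical.

```python
from collections import defaultdict


class TailSampler:
    """Buffers spans per trace and decides once the trace is complete."""

    def __init__(self, slow_ms=1000, export=print):
        self.slow_ms = slow_ms
        self.export = export             # called with the list of kept spans
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far

    def on_span(self, span):
        self.buffer[span["trace_id"]].append(span)

    def flush(self, trace_id):
        """Call when the trace is considered finished (e.g. after T seconds)."""
        spans = self.buffer.pop(trace_id, [])
        has_error = any(s.get("error") for s in spans)
        is_slow = any(s["duration_ms"] > self.slow_ms for s in spans)
        if has_error or is_slow:
            self.export(spans)  # keep the whole trace
            return True
        return False            # discard


kept = []
sampler = TailSampler(slow_ms=1000, export=kept.append)
sampler.on_span({"trace_id": "t1", "name": "login", "duration_ms": 5200})
sampler.on_span({"trace_id": "t2", "name": "login", "duration_ms": 25})
sampler.on_span({"trace_id": "t3", "name": "login", "duration_ms": 40, "error": True})
for trace_id in ("t1", "t2", "t3"):
    sampler.flush(trace_id)
print([spans[0]["trace_id"] for spans in kept])  # ['t1', 't3']
```

In a distributed deployment, all spans of one trace have to reach the same buffer for this decision to work, which is part of why tail-based sampling is harder to operate than head-based sampling.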

Visualization & Analysis

Jaeger UI Example

Trace: User Login Flow (5.2s total)
├─ API Gateway (5.2s)
│  ├─ Authenticate User (2.1s) ⚠️ SLOW
│  │  ├─ Query User DB (150ms)
│  │  └─ Verify Password (1.95s) ⚠️ BOTTLENECK
│  │     └─ bcrypt.compare (1.95s)  ← Found it!
│  ├─ Load User Profile (3ms)
│  └─ Generate JWT (2ms)

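Found: bcrypt cost factor too high (14, i.e. 2^14 iterations)
Fix: Reduce the cost factor to 12 (each step down halves the work, so about 4x faster)
Result: Password verification drops from 1.95s to about 500ms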
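
The bcrypt cost factor is logarithmic: a cost of 14 means 2^14 iterations, so each step down halves the hashing time. A quick estimate of the effect of going from 14 to 12, using the 1.95s measured in the trace (the helper name `bcrypt_time` is hypothetical, and the model ignores fixed overhead):

```python
def bcrypt_time(measured_s, measured_cost, new_cost):
    """Estimate hashing time at a new cost factor.

    bcrypt work doubles with each cost step, so time scales by
    2 ** (new_cost - measured_cost).
    """
    return measured_s * 2 ** (new_cost - measured_cost)


print(bcrypt_time(1.95, 14, 12))  # 0.4875
```

Lowering the cost factor also makes stolen hashes cheaper to brute-force, so the target should be chosen from a latency budget rather than lowered as far as possible.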

Key Metrics from Traces

javascript

// Extract from traces
const metrics = {
  p50_latency_ms: 120,
  p95_latency_ms: 450,
  p99_latency_ms: 1200,
  error_rate_pct: 0.5,

  // Service dependencies (share of requests that cross each edge, in percent)
  dependencies: {
    'api-gateway -> auth-service': 95,
    'api-gateway -> user-service': 100,
  },

  // Bottlenecks (average span duration in milliseconds)
  slowest_services: [
    { service: 'auth-service', avg_ms: 1200 },
    { service: 'payment-service', avg_ms: 800 },
  ]
};
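
Latency percentiles and error rate like these can be derived directly from root spans. The sketch below uses the nearest-rank percentile definition and assumes each root span is a dict with `duration_ms` and `error` keys; the helper names and the synthetic data are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(root_spans):
    """Compute latency percentiles and error rate from root spans."""
    durations = [s["duration_ms"] for s in root_spans]
    errors = sum(1 for s in root_spans if s["error"])
    return {
        "p50_latency_ms": percentile(durations, 50),
        "p95_latency_ms": percentile(durations, 95),
        "p99_latency_ms": percentile(durations, 99),
        "error_rate_pct": 100 * errors / len(root_spans),
    }


# Synthetic data: 100 requests taking 1..100 ms, two of them failed.
root_spans = [{"duration_ms": d, "error": d % 50 == 0} for d in range(1, 101)]
summary = summarize(root_spans)
print(summary)
```

If traces are sampled, these numbers describe the sampled population only; a sampler that favors slow or failed requests will inflate both the percentiles and the error rate.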

Real-World Use Cases

1. Debugging Production Issues (Uber)

Problem: Some rides take 30s to match
Traditional debugging: Check logs of 50+ microservices ❌
Distributed tracing:
  1. Search for traces with >30s duration
  2. Visualize waterfall
  3. Find: "Driver Availability Check" taking 28s
  4. Root cause: DB index missing
  5. Fix: Add index
  6. Result: Matching now <1s
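
Steps 2 and 3 amount to finding the span with the largest self time, meaning its own duration minus the time spent in its children. A minimal sketch, assuming a trace is a nested dict and that children run one after another (parallel children would need interval arithmetic instead). The span names other than "Driver Availability Check" and all durations are made up for illustration.

```python
def self_time(span):
    """Time spent in the span itself, excluding its (sequential) children."""
    return span["duration_ms"] - sum(
        child["duration_ms"] for child in span.get("children", [])
    )


def slowest_span(span):
    """Return the span in the tree with the largest self time."""
    best = span
    for child in span.get("children", []):
        candidate = slowest_span(child)
        if self_time(candidate) > self_time(best):
            best = candidate
    return best


trace = {
    "name": "Match Ride",
    "duration_ms": 30000,
    "children": [
        {"name": "Driver Availability Check", "duration_ms": 28000},
        {"name": "Price Estimate", "duration_ms": 500},
    ],
}
culprit = slowest_span(trace)
print(culprit["name"], self_time(culprit))
```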

2. Performance Optimization (Netflix)

Goal: Optimize video playback start time
Trace analysis shows:
  - 40% time: CDN redirect
  - 30% time: DRM license fetch
  - 20% time: Manifest parsing
  - 10% time: Video buffering

Actions:
1. Cache DRM licenses (save 30%)
2. Preload manifest (save 15%)
Result: 45% faster playback start

3. Chaos Engineering (Google)

Inject faults and trace impact:
  
1. Inject 500ms latency in Auth Service
2. Traces show cascading timeouts:
   - API Gateway timeout (1s)
   - User Service retry storm
   - Database connection pool exhaustion
   
3. Fix: Add circuit breakers
4. Verify: Traces show graceful degradation

Tools Comparison

Tool	Type	Best For	Cost
Jaeger	Open-source	Self-hosted	Free
Zipkin	Open-source	Simple setup	Free
Datadog APM	Commercial	Full-stack observability	$$$
New Relic	Commercial	Enterprise	$$$
Lightstep	Commercial	Large scale	$$$$
AWS X-Ray	Cloud	AWS workloads	$-$$
Google Cloud Trace	Cloud	GCP workloads	$-$$

Best Practices

✅ DO:

Use semantic conventions (OpenTelemetry standards)
Add business context (user_id, order_id)
Sample intelligently (errors, slow requests)
Instrument async operations
Propagate context across boundaries

❌ DON'T:

Log entire request/response bodies (PII risk)
Sample 100% in production (cost!)
Create too many custom spans (noise)
Forget to end spans (memory leaks)

Interview Tips 💡

When discussing distributed tracing in interviews:

Problem: "In microservices, hard to track request across 50 services..."
Trace ID propagation: "Pass unique ID in headers, each service adds its span..."
Sampling: "Can't trace everything - sample 1% or only errors/slow requests..."
Tools: "OpenTelemetry for instrumentation, Jaeger for visualization..."
Use case: "Found slow password hashing causing a 5s login - traced through 8 services..."
Trade-offs: "Adds some per-request overhead (usually small, but measure it), storage costs, implementation effort..."

Related Concepts

Logging — Complementary observability
Metrics — Time-series monitoring
APM (Application Performance Monitoring) — Broader observability
Service Mesh — Automatic tracing
Microservices — Architecture pattern

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
