
Distributed Tracing

Complete guide to tracking requests across microservices using distributed tracing, covering trace context propagation, span instrumentation, OpenTelemetry implementation, and production debugging with Jaeger, Zipkin, and Datadog APM.

The Problem: Where Did 5 Seconds Go?

User: "Login took 5 seconds!"
You check each service:
  Login Service:   10ms ✓
  User Service:     5ms ✓
  Auth Service:     8ms ✓
  Database:         2ms ✓
  Total: 25ms... but user experienced 5 seconds!

Where did the other 4.975 seconds go?

Answer: The time is lost between the services, in network latency, queues, retry loops, and cascading failures that no single service's own timer records.

Solution: Distributed tracing shows the COMPLETE journey.


What is Distributed Tracing?

Distributed tracing tracks a single request as it flows through multiple services, capturing timing, errors, and metadata at each step.

Key Concepts

1. Trace: The complete journey of a request

Trace ID: abc123
Total duration: 5.2s
Services involved: 8
Status: Error

2. Span: A single operation within a trace

Span: "Query Users DB"
Duration: 245ms
Parent: "Get User Profile"
Tags: {db: "postgres", query: "SELECT *..."}

3. Trace Context: Propagation metadata

traceparent: 00-abc123-def456-01   (IDs shortened for readability)
  version: 00
  trace-id: abc123
  parent-span-id: def456
  flags: 01 (sampled)

How It Works

Architecture

mermaid
graph LR
    Client[Client] -->|TraceID: abc123| API[API Gateway]
    API -->|TraceID: abc123<br/>ParentSpan: span1| Auth[Auth Service]
    API -->|TraceID: abc123<br/>ParentSpan: span1| User[User Service]
    User -->|TraceID: abc123<br/>ParentSpan: span3| DB[(Database)]
    
    API -.->|Spans| Collector[Trace Collector]
    Auth -.->|Spans| Collector
    User -.->|Spans| Collector
    DB -.->|Spans| Collector
    
    Collector --> Jaeger[Jaeger UI]

Flow

1. Request arrives at API Gateway
   - Generate Trace ID: "abc123"
   - Create root span
   
2. API Gateway calls Auth Service
   - Pass trace ID in headers
   - Create child span
   
3. Each Service:
   - Extracts trace ID from headers
   - Creates its own span
   - Adds metadata (duration, tags, errors)
   - Sends span to collector
   
4. Trace Collector:
   - Assembles all spans by Trace ID
   - Builds complete trace tree
   
5. UI Visualization:
   - Shows waterfall diagram
   - Highlights slow/failing spans
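Steps 3 to 5 of the flow can be sketched in plain Python. This is a minimal illustration of how a collector might group spans by trace ID and rebuild the tree, not how Jaeger or any real collector is implemented; the `Span` dataclass, `build_trace_trees`, and `render` are hypothetical names, and the span IDs and durations are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float
    children: list = field(default_factory=list)


def build_trace_trees(spans):
    """Group spans by trace ID and link each span to its parent.

    Returns {trace_id: [root spans]}. A span whose parent never arrived is
    treated as a root so that partial traces still render.
    """
    by_trace = defaultdict(dict)
    for span in spans:
        by_trace[span.trace_id][span.span_id] = span

    trees = {}
    for trace_id, by_id in by_trace.items():
        roots = []
        for span in by_id.values():
            parent = by_id.get(span.parent_id)
            if parent is None:
                roots.append(span)
            else:
                parent.children.append(span)
        trees[trace_id] = roots
    return trees


def render(span, depth=0):
    """Return an indented, waterfall-style listing of a span subtree."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:g}ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Spans arrive at the collector out of order, from different services.
received = [
    Span("abc123", "span3", "span1", "User Service", 5),
    Span("abc123", "span1", None, "API Gateway", 25),
    Span("abc123", "span4", "span3", "Database", 2),
    Span("abc123", "span2", "span1", "Auth Service", 8),
]
trees = build_trace_trees(received)
print("\n".join(render(trees["abc123"][0])))
```

Because each span carries only its own ID and its parent's ID, arrival order does not matter: the tree can be rebuilt once all spans for a trace ID are present.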

Implementation

OpenTelemetry (Industry Standard)

javascript
// Node.js with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
});

// Configure exporter (Jaeger)
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const { context, propagation, SpanStatusCode } = require('@opentelemetry/api');
const tracer = provider.getTracer('user-service');

// Manual instrumentation
const express = require('express');
const app = express();

// `db` is assumed to be an already-configured database client (not shown).
app.get('/user/:id', async (req, res) => {
  // Create a custom span and make it the active span for this handler
  await tracer.startActiveSpan('getUserProfile', async (span) => {
    span.setAttribute('user.id', req.params.id);

    try {
      // Child span for the DB query; its parent is the active span
      const user = await tracer.startActiveSpan('queryDatabase', {
        attributes: {
          'db.system': 'postgresql',
          'db.statement': 'SELECT * FROM users WHERE id = $1'
        }
      }, async (dbSpan) => {
        try {
          return await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
        } finally {
          dbSpan.end();  // End the span even if the query throws
        }
      });

      // Child span for the external API call
      const authResponse = await tracer.startActiveSpan('callAuthService', async (apiSpan) => {
        try {
          // Propagate trace context: inject() writes the traceparent header
          // for the active span (apiSpan), so the callee becomes its child
          const headers = {};
          propagation.inject(context.active(), headers);
          return await fetch(`http://auth-service/verify/${user.id}`, { headers });
        } finally {
          apiSpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });  // OK is 1; 0 means UNSET
      res.json(user);

    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      res.status(500).json({ error: error.message });

    } finally {
      span.end();
    }
  });
});

Python with OpenTelemetry

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from flask import Flask, request
import requests

# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

app = Flask(__name__)

# Auto-instrument Flask and Requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route('/user/<user_id>')
def get_user(user_id):
    # Get current span (auto-created by Flask instrumentation)
    current_span = trace.get_current_span()
    current_span.set_attribute("user.id", user_id)
    
    # Create custom child span
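    with tracer.start_as_current_span("fetch_user_from_db") as span:
        span.set_attribute("db.system", "postgresql")
        # Record the parameterized statement, not the interpolated value
        span.set_attribute("db.statement", "SELECT * FROM users WHERE id = %s")
        
        # `db` is assumed to be a configured database client (not shown);
        # pass user_id as a bound parameter to avoid SQL injection
        user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))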
    
    # Another custom span
    with tracer.start_as_current_span("verify_permissions") as span:
        span.set_attribute("auth.service", "auth-service")
        
        # Trace context auto-propagated in headers
        response = requests.get(f"http://auth-service/verify/{user_id}")
        
        if response.status_code != 200:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.add_event("permission_denied")
    
    return {"user": user}

Trace Context Propagation

W3C Trace Context Standard

HTTP Header: traceparent
Format:
  00-<trace-id>-<parent-span-id>-<flags>
  
Example:
  traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
  
  version: 00
  trace-id: 0af7651916cd43dd8448eb211c80319c (128-bit)
  parent-span-id: b7ad6b7169203331 (64-bit)
  flags: 01 (sampled)
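To make the header layout concrete, here is a small standard-library parser. It is a sketch that handles only the version 00 layout shown above; `TraceParent` and `parse_traceparent` are hypothetical names, not part of any tracing library.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex, fixed widths.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


tp = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(tp)
```

Returning `None` for a malformed header lets the receiving service fall back to starting a fresh trace instead of failing the request.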

Manual Propagation

python
from opentelemetry import trace

def propagate_trace_context(outgoing_request):
    """Manually propagate trace context via the W3C traceparent header."""
    # get_current_span() never returns None; with no active span it returns
    # a non-recording span whose context is invalid, so check is_valid.
    span_context = trace.get_current_span().get_span_context()
    
    if span_context.is_valid:
        traceparent = (
            f"00-{span_context.trace_id:032x}"
            f"-{span_context.span_id:016x}"
            f"-{span_context.trace_flags:02x}"
        )
        
        outgoing_request.headers['traceparent'] = traceparent
    
    return outgoing_request
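The receiving side of propagation can be sketched without any tracing library. The example below assumes incoming headers arrive as a dict with lowercase keys; `new_span_id` and `continue_trace` are hypothetical helpers. It keeps the caller's trace ID and flags, and puts this service's new span ID in the parent position for the next hop.

```python
import secrets


def new_span_id() -> str:
    """Random 64-bit span ID as 16 hex characters (never all zeros)."""
    while True:
        span_id = secrets.token_hex(8)
        if span_id != "0" * 16:
            return span_id


def continue_trace(incoming_headers: dict) -> dict:
    """Return headers for a downstream call that continue the caller's trace.

    The trace ID and flags are copied from the incoming traceparent; the
    parent field is replaced with a span ID generated for this service.
    """
    parts = incoming_headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, flags = parts[1], parts[3]
    else:
        # No usable context: this service starts a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    return {"traceparent": f"00-{trace_id}-{new_span_id()}-{flags}"}


incoming = {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
}
outgoing = continue_trace(incoming)
print(outgoing["traceparent"])
```

The trace ID stays constant across every hop while the parent field changes at each one, which is what lets the collector rebuild the span tree.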

Sampling Strategies

Problem: Too many traces = storage/performance cost

Solution: Sample strategically

1. Probabilistic Sampling

python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)

provider = TracerProvider(sampler=sampler)
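The idea behind ratio-based sampling is that the decision is a pure function of the trace ID, so every service reaches the same verdict for the same trace without coordination. The sketch below illustrates that idea in plain Python; it is similar in spirit to a trace-ID-ratio sampler but is not the SDK's code, and `should_sample` is a hypothetical name.

```python
import secrets


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: keep the trace if its ID falls below the bound."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    # Compare only the low 64 bits of the 128-bit trace ID.
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


# Random 128-bit trace IDs, as a tracer would generate them.
trace_ids = [secrets.randbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Since trace IDs are random, roughly 10% of them fall below the bound, and repeating the check for the same ID always gives the same answer.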

2. Rate Limiting Sampling

python
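# Note: opentelemetry.sdk.trace.sampling has no built-in RateLimitingSampler.
# The name below stands for a custom Sampler subclass (hypothetical) that
# admits at most N new traces per second.

# Max 100 traces per second
sampler = RateLimitingSampler(100)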
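One common way to cap sampled traces per second is a token bucket. The sketch below is a standalone illustration in plain Python, not an OpenTelemetry API; `TokenBucketSampler` and its `clock` parameter are hypothetical names, and the clock is injectable only so the behaviour can be shown without sleeping.

```python
import time


class TokenBucketSampler:
    """Admit at most `rate` new traces per second, with bursts up to `rate`."""

    def __init__(self, rate: float, clock=time.monotonic):
        self.rate = rate
        self.clock = clock            # injectable so tests can control time
        self.tokens = rate            # start with a full bucket
        self.last_refill = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self.last_refill
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Drive the sampler with a fake clock: 250 requests arrive at t=0,
# then another 250 arrive half a second later.
fake_now = [0.0]
sampler = TokenBucketSampler(100, clock=lambda: fake_now[0])
first_burst = sum(sampler.should_sample() for _ in range(250))
fake_now[0] += 0.5
after_half_second = sum(sampler.should_sample() for _ in range(250))
print(first_burst, after_half_second)
```

Unlike probabilistic sampling, the number of kept traces here stays bounded during traffic spikes, at the cost of keeping a smaller share of requests when load is high.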

3. Parent-Based Sampling

python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Children follow the parent's sampling decision;
# root spans (no parent) are sampled at 10%
sampler = ParentBased(root=TraceIdRatioBased(0.1))
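The default parent-based rule reduces to a few lines. The sketch below is a plain-Python illustration of that rule only (real samplers can also be configured separately for remote and local parents); `parent_based_decision` and `ten_percent` are hypothetical names.

```python
from typing import Callable, Optional


def parent_based_decision(
    parent_sampled: Optional[bool],
    trace_id: int,
    root_sampler: Callable[[int], bool],
) -> bool:
    """Inherit the parent's decision; only root spans consult `root_sampler`.

    parent_sampled is None when there is no parent (the span starts a trace).
    """
    if parent_sampled is not None:
        return parent_sampled
    return root_sampler(trace_id)


def ten_percent(trace_id: int) -> bool:
    """Hypothetical root sampler: keep about 1 in 10 trace IDs."""
    return trace_id % 10 == 0


print(parent_based_decision(True, 7, ten_percent))    # parent sampled
print(parent_based_decision(False, 10, ten_percent))  # parent not sampled
print(parent_based_decision(None, 10, ten_percent))   # root span
```

Inheriting the decision keeps traces whole: either every span of a request is recorded or none is, so the waterfall view never shows a trace with missing branches caused by sampling.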

4. Custom Sampling (Head-based)

python
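from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class CustomSampler(Sampler):
    # Head-based: runs at span start, so status code and duration are not
    # known yet. Sampling errors and slow requests needs tail-based sampling.
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        
        # Always sample a critical route (hypothetical start-time attribute)
        if attributes.get('http.route') == '/checkout':
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        
        # Sample 1% of normal requests
        if trace_id % 100 == 0:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)
    
    def get_description(self):
        return "CustomSampler"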

5. Tail-based Sampling

Problem: Head-based sampling doesn't know if request will be slow/error
Solution: Sample AFTER request completes

Implementation:
1. Buffer all spans for T seconds
2. After T, decide which traces to keep
3. Send sampled traces to storage
4. Discard others

Trade-off: Traces reach storage T seconds later, and the collector needs memory to buffer every span
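The four implementation steps can be sketched as a small in-memory buffer. This is an illustration under simplifying assumptions, not a production collector: spans are plain dicts, time is passed in explicitly, a trace counts as complete once its wait window has elapsed, and `TailSampler`, `wait_seconds`, and `slow_ms` are hypothetical names.

```python
from collections import defaultdict


class TailSampler:
    """Buffer spans per trace and decide once the wait window has elapsed."""

    def __init__(self, wait_seconds=10.0, slow_ms=1000.0):
        self.wait_seconds = wait_seconds
        self.slow_ms = slow_ms
        self.buffers = defaultdict(list)   # trace_id -> list of span dicts
        self.first_seen = {}               # trace_id -> arrival time of first span

    def add_span(self, span, now):
        """Step 1: buffer every span, keyed by its trace ID."""
        self.first_seen.setdefault(span["trace_id"], now)
        self.buffers[span["trace_id"]].append(span)

    def flush(self, now):
        """Steps 2-4: return spans of kept traces whose window has elapsed."""
        kept = []
        expired = [t for t, seen in self.first_seen.items()
                   if now - seen >= self.wait_seconds]
        for trace_id in expired:
            spans = self.buffers.pop(trace_id)
            del self.first_seen[trace_id]
            has_error = any(s.get("error", False) for s in spans)
            is_slow = max(s["duration_ms"] for s in spans) > self.slow_ms
            if has_error or is_slow:
                kept.extend(spans)         # send to storage
            # otherwise the buffered spans are simply discarded
        return kept


sampler = TailSampler(wait_seconds=10, slow_ms=1000)
sampler.add_span({"trace_id": "fast", "name": "GET /health", "duration_ms": 12}, now=0)
sampler.add_span({"trace_id": "slow", "name": "POST /login", "duration_ms": 5200}, now=1)
sampler.add_span({"trace_id": "failed", "name": "GET /user", "duration_ms": 40,
                  "error": True}, now=2)
early = sampler.flush(now=5)    # nothing has waited long enough yet
kept = sampler.flush(now=12)    # all three windows have elapsed
print([s["trace_id"] for s in kept])
```

The buffer holds every span until its trace is decided, which is where the extra memory and the delay before traces become visible come from.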

Visualization & Analysis

Jaeger UI Example

Trace: User Login Flow (5.2s total)
├─ API Gateway (5.2s)
│  ├─ Authenticate User (2.1s) ⚠️ SLOW
│  │  ├─ Query User DB (150ms)
│  │  └─ Verify Password (1.95s) ⚠️ BOTTLENECK
│  │     └─ bcrypt.compare (1.95s)  ← Found it!
│  ├─ Load User Profile (3ms)
│  └─ Generate JWT (2ms)

Found: bcrypt cost factor too high (14, i.e. 2^14 key-expansion rounds)
Fix: Reduce cost factor to 12 (each step down halves the work)
Result: Login now 500ms

Key Metrics from Traces

javascript
// Extract from traces (illustrative values)
const metrics = {
  p50_latency_ms: 120,
  p95_latency_ms: 450,
  p99_latency_ms: 1200,
  error_rate_pct: 0.5,
  
  // Service dependencies (% of requests that traverse each edge)
  dependencies: {
    'api-gateway -> auth-service': 95,
    'api-gateway -> user-service': 100,
  },
  
  // Bottlenecks (average span duration in ms)
  slowest_services: [
    { service: 'auth-service', avg_ms: 1200 },
    { service: 'payment-service', avg_ms: 800 },
  ]
};
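Latency percentiles and error rate like those above can be derived from per-trace durations. The sketch below uses the nearest-rank percentile method on a made-up sample; `percentile` and `summarize` are hypothetical helpers, and a tracing backend may use a different percentile definition.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no values")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(traces):
    """traces: iterable of (duration_ms, had_error) pairs, one per trace."""
    traces = list(traces)
    durations = sorted(d for d, _ in traces)
    errors = sum(1 for _, had_error in traces if had_error)
    return {
        "p50_latency_ms": percentile(durations, 50),
        "p95_latency_ms": percentile(durations, 95),
        "p99_latency_ms": percentile(durations, 99),
        "error_rate_pct": 100 * errors / len(traces),
    }


# Hypothetical sample: 200 traces lasting 10ms, 20ms, ..., 2000ms,
# where every 100th trace ended in an error.
sample = [(10 * i, i % 100 == 0) for i in range(1, 201)]
summary = summarize(sample)
print(summary)
```

If traces are sampled, these figures describe the sampled population only, so percentiles computed from a 1% sample biased toward errors and slow requests will overstate both.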

Real-World Use Cases

1. Debugging Production Issues (Uber)

Problem: Some rides take 30s to match
Traditional debugging: Check logs of 50+ microservices ❌
Distributed tracing:
  1. Search for traces with >30s duration
  2. Visualize waterfall
  3. Find: "Driver Availability Check" taking 28s
  4. Root cause: DB index missing
  5. Fix: Add index
  6. Result: Matching now <1s

2. Performance Optimization (Netflix)

Goal: Optimize video playback start time
Trace analysis shows:
  - 40% time: CDN redirect
  - 30% time: DRM license fetch
  - 20% time: Manifest parsing
  - 10% time: Video buffering

Actions:
1. Cache DRM licenses (save 30%)
2. Preload manifest (save 15%)
Result: 45% faster playback start

3. Chaos Engineering (Google)

Inject faults and trace impact:
  
1. Inject 500ms latency in Auth Service
2. Traces show cascading timeouts:
   - API Gateway timeout (1s)
   - User Service retry storm
   - Database connection pool exhaustion
   
3. Fix: Add circuit breakers
4. Verify: Traces show graceful degradation

Tools Comparison

Tool                Type         Best For                    Cost
Jaeger              Open-source  Self-hosted                 Free
Zipkin              Open-source  Simple setup                Free
Datadog APM         Commercial   Full-stack observability    $$$
New Relic           Commercial   Enterprise                  $$$
Lightstep           Commercial   Large scale                 $$$$
AWS X-Ray           Cloud        AWS workloads               $-$$
Google Cloud Trace  Cloud        GCP workloads               $-$$

Best Practices

DO:

  • Use semantic conventions (OpenTelemetry standards)
  • Add business context (user_id, order_id)
  • Sample intelligently (errors, slow requests)
  • Instrument async operations
  • Propagate context across boundaries

DON'T:

  • Log entire request/response bodies (PII risk)
  • Sample 100% in production (cost!)
  • Create too many custom spans (noise)
  • Forget to end spans (memory leaks)
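A context manager is one way to make sure a span is always ended and that errors are recorded on it. The sketch below uses plain dicts and a list as a stand-in for a real tracer and exporter; `traced` and `finished_spans` are hypothetical names, and `order_id` shows the kind of business context the DO list recommends.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter queue


@contextmanager
def traced(name, **attributes):
    """Record a span dict that is always ended, even if the body raises."""
    span = {"name": name, "attributes": dict(attributes), "status": "OK"}
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span["status"] = "ERROR"
        span["attributes"]["exception.message"] = str(exc)
        raise  # the caller still sees the original error
    finally:
        # Runs on success and on failure, so no span is left open.
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        finished_spans.append(span)


with traced("load_order", order_id="o-42"):
    pass  # normal work would go here

try:
    with traced("charge_card", order_id="o-42"):
        raise RuntimeError("card declined")
except RuntimeError:
    pass

print([(s["name"], s["status"]) for s in finished_spans])
```

OpenTelemetry's own context-manager and callback APIs follow the same pattern, which is why they are usually preferable to calling start and end by hand.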

Interview Tips 💡

When discussing distributed tracing in interviews:

  1. Problem: "In microservices, hard to track request across 50 services..."
  2. Trace ID propagation: "Pass unique ID in headers, each service adds its span..."
  3. Sampling: "Can't trace everything - sample 1% or only errors/slow requests..."
  4. Tools: "OpenTelemetry for instrumentation, Jaeger for visualization..."
  5. Use case: "Found slow DB query causing 5s login - traced through 8 services..."
  6. Trade-offs: "Adds per-request overhead (small when spans are batched and exported asynchronously), storage costs, implementation effort..."


About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.


Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
