Monitoring vs Observability

Monitoring: Tells you when something is wrong. ("CPU usage is 99%")
Observability: Tells you why it is wrong. ("Service A is retrying due to DB lock contention")

To achieve observability, we need three distinct types of data.

mermaid

graph TD
    Obs[Observability] --> Logs[Logs: Events]
    Obs --> Metrics[Metrics: Aggregates]
    Obs --> Traces[Traces: Request Flow]
    
    Logs --> ELK[Elastic/Loki]
    Metrics --> Prom[Prometheus/Grafana]
    Traces --> Jaeger[Jaeger/Zipkin]


1. Logs (The "What")

Discrete events. "Something happened at time T".

Structure:

Unstructured: 2023-10-01 Error: DB failed (Hard to query).
Structured (JSON): {"retry": 3, "service": "payment", "error": "timeout"} (Easy to aggregate).
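
As a rough illustration of the structured form, here is a minimal sketch that makes Python's standard `logging` module emit one JSON object per line. The `JsonFormatter` and `build_logger` names, and the choice of fields, are assumptions made for this example rather than part of any particular logging library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", None),
            "retry": getattr(record, "retry", None),
            "error": getattr(record, "error", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "payment") -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `build_logger().error("DB call failed", extra={"service": "payment", "retry": 3, "error": "timeout"})` would then produce a line whose fields a backend can filter and aggregate on, instead of a free-text string.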

Centralized Logging Architecture

Don't SSH into servers to tail -f. Ship logs to a central backend.

Application: Writes to stdout/stderr.
Collector (Fluentd/Vector): Reads streams, parses JSON, enriches (adds Kubernetes pod name).
Storage (Elasticsearch/ClickHouse): Indexes fields.
UI (Kibana/Grafana): Search "error" AND service="payment".
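
The collector and query steps above can be sketched in a few lines. This is not how Fluentd or Vector are implemented; `enrich_line`, `matches`, and the `pod` field name are hypothetical and only show the idea of parsing, enriching, and then filtering on fields.

```python
import json


def enrich_line(raw_line: str, pod_name: str) -> dict:
    """Parse one log line and attach the pod it came from."""
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        # Keep non-JSON lines searchable instead of dropping them.
        event = {"message": raw_line.rstrip("\n")}
    if not isinstance(event, dict):
        event = {"message": event}
    event["pod"] = pod_name
    return event


def matches(event: dict, service: str, text: str) -> bool:
    """Rough stand-in for a UI query of the form: text AND service=..."""
    return event.get("service") == service and text in json.dumps(event)
```

Because the application only writes to stdout, the enrichment (pod name, node, namespace) happens outside the application code, which keeps services unaware of where their logs end up.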

2. Metrics (The "Health")

Aggregated numerical data. Cheap to store. Good for alerts.

Key Types:

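Counter: Only ever increases, apart from resetting to zero when the process restarts (Total Requests, Errors). In Prometheus, rate() turns it into a per-second rate such as requests/sec.
Gauge: Goes up and down (Memory Usage, Queue Size).
Histogram: Counts observations in buckets so the distribution can be estimated (Request Latency: p50, p90, p99).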
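
To make the three types concrete, here is a toy in-memory sketch. These classes are illustrative only and are not the Prometheus or OpenTelemetry client APIs; `per_second_rate` is a simplified stand-in for what a rate function computes between two samples of a counter.

```python
import bisect
from typing import Sequence


class Counter:
    """Running total that is only allowed to increase."""

    def __init__(self) -> None:
        self.value = 0.0

    def add(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current value that can move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per upper bound, plus one overflow bucket."""

    def __init__(self, bounds: Sequence[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # Index of the first bound that is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Smallest bucket bound covering at least fraction q of observations."""
        total = sum(self.counts)
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            if running >= q * total:
                return bound
        return float("inf")


def per_second_rate(start_value: float, end_value: float, seconds: float) -> float:
    """Average increase per second between two counter samples."""
    return (end_value - start_value) / seconds
```

Note that the histogram only stores bucket counts, so a percentile read from it is an estimate bounded by the bucket edges, not an exact value.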

[!TIP] Cardinality Explosion: Avoid putting unique IDs (UserID, IP) in metric labels. It creates millions of time series and kills Prometheus.
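
A quick back-of-the-envelope calculation shows why. In the worst case, one metric name produces a time series for every combination of label values. The label names and counts below are made-up numbers for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series for a single metric name."""
    return prod(label_cardinalities.values())


# Bounded labels: 5 * 6 * 20 = 600 series.
bounded = series_count({"method": 5, "status": 6, "service": 20})

# Swap "service" for a per-user label: 5 * 6 * 1,000,000 = 30,000,000 series.
unbounded = series_count({"method": 5, "status": 6, "user_id": 1_000_000})
```

High-cardinality identifiers such as a user ID usually belong in logs or span attributes rather than metric labels.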

Scraping vs Pushing

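Prometheus (Pull): Scrapes each service's /metrics endpoint on a configured interval (15s is a common setting). The service doesn't need to know where the monitoring server is.
StatsD (Push): App sends UDP packets to a collector. Good for short-lived jobs (Lambdas), which may exit before a scrape would reach them.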
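
The difference shows up in what the application code has to do. The sketch below formats a StatsD counter datagram for the push model and renders one sample in the Prometheus text format for the pull model. The function names are hypothetical, the collector address is an assumption (8125 is the conventional StatsD port), and label values are not escaped here.

```python
import socket


def statsd_counter_line(name: str, value: int = 1) -> bytes:
    """StatsD counter datagram in the form <name>:<value>|c."""
    return f"{name}:{value}|c".encode("ascii")


def push_counter(name: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Push model: fire-and-forget UDP send, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter_line(name), (host, port))


def render_exposition(name: str, labels: dict, value: float) -> str:
    """Pull model: one sample as it would appear in a /metrics response."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"
```

With push, the app must know the collector's address; with pull, it only exposes text and the scraper decides when to read it.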

3. Distributed Tracing (The "Where")

Follows a request across microservices.

Structure:

Trace Context: A unique TraceID passed between services in HTTP headers (the W3C Trace Context standard carries it in the traceparent header).
Span: A unit of work (DB Query, HTTP Call). Has SpanID, ParentID, Start/End time.
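
To show how the TraceID and ParentID travel between services, here is a simplified sketch that parses a W3C `traceparent` header and builds the header for an outgoing call. It handles only the version `00` layout (`version-traceid-parentid-flags`) and ignores `tracestate`; the `SpanContext` type and function names are assumptions for this example, not a library API.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Every service keeps the trace ID unchanged and replaces only the span ID, which is what lets the backend stitch the spans into a single tree.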

Visualization: Waterfalls showing gaps (latency) and errors.

mermaid

gantt
    title Trace ID: a1b2c3d4
    dateFormat x
    axisFormat %L

    section Frontend
    GET /checkout       :a1, 0, 500ms

    section Payment Service
    Authorize Card      :a2, 100, 400ms

    section Database
    UPDATE Balance      :a3, 200, 150ms


Integration: Code Example (OpenTelemetry)

OpenTelemetry (OTel) is the vendor-neutral standard for generating all three signals.

python

import logging

from opentelemetry import trace, metrics

logger = logging.getLogger(__name__)

# 1. Tracing Setup
# Without a configured SDK TracerProvider/MeterProvider, these return no-op objects.
tracer = trace.get_tracer(__name__)

# 2. Metrics Setup
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("requests_total")

def handle_request():
    # Start a span
    with tracer.start_as_current_span("checkout_flow") as span:
        span.set_attribute("user.id", "123")

        # Increment metric
        request_counter.add(1)

        try:
            process_payment()
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            # Logged while the span is active, so a logging integration
            # can attach the TraceID to this record
            logger.error("Payment failed", exc_info=e)

def process_payment():
    with tracer.start_as_current_span("db_update"):
        # Database logic...
        pass


The "Golden Signals" (SRE)

What should you alert on? (Google SRE Book)

Latency: Time to service a request.
Traffic: Demand on system (Req/sec).
Errors: Rate of failed requests (HTTP 5xx).
Saturation: How "full" is the service? (CPU/Memory/Queue depth).
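
As a rough sketch of how the four signals relate to raw request data, the function below computes them for one time window. The `Request` record, the nearest-rank p99, and the use of queue depth as the saturation measure are all assumptions for illustration; a real system would derive these from its metrics backend.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    queue_depth: int,
    queue_capacity: int,
) -> dict:
    """Summarize one window of traffic as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the value at position ceil(0.99 * n).
    p99_index = math.ceil(0.99 * len(durations)) - 1
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": durations[p99_index],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": queue_depth / queue_capacity,
    }
```

Alert thresholds would then be set on these values, for example on the error rate or the p99 latency, rather than on individual requests.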

Interview Tips 💡

"How do you debug a slow request in microservices?": Distributed Tracing. Look for the "long bar" in the waterfall.
"Push vs Pull Metrics?": Prometheus (Pull) is standard for infrastructure. Push (Datadog) is better for serverless.
"Log Levels": Explain DEBUG vs INFO vs WARN vs ERROR. Don't log DEBUG in prod (cost).
"Sampling": Tracing 100% of requests is usually too expensive at scale. Tail-based sampling decides after a trace has completed and keeps only the interesting (slow/error) traces.
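
The tail-based sampling decision can be sketched as a function that runs once all spans of a trace have arrived. The `FinishedSpan` type, the 1000 ms threshold, and the 1% baseline rate are hypothetical values chosen for the example, not defaults of any tracing backend.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: Sequence[FinishedSpan],
    slow_ms: float = 1000.0,
    baseline_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to store a trace after it has fully completed."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True  # always keep failures
    if max(span.duration_ms for span in spans) >= slow_ms:
        return True  # always keep slow requests
    # Keep a small random share of normal traces as a baseline.
    return rng() < baseline_rate
```

The trade-off is that the collector has to buffer every span until the trace finishes, which is why head-based sampling (deciding at the first span) is cheaper but cannot know in advance which traces will turn out to be slow or failed.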

