Monitoring vs Observability
- Monitoring: Tells you when something is wrong. ("CPU usage is 99%")
- Observability: Tells you why it is wrong. ("Service A is retrying due to DB lock contention")
To achieve observability, we need three distinct types of data.
graph TD
Obs[Observability] --> Logs[Logs: Events]
Obs --> Metrics[Metrics: Aggregates]
Obs --> Traces[Traces: Request Flow]
Logs --> ELK[Elastic/Loki]
Metrics --> Prom[Prometheus/Grafana]
Traces --> Jaeger[Jaeger/Zipkin]
1. Logs (The "What")
Discrete events. "Something happened at time T".
Structure:
- Unstructured: 2023-10-01 Error: DB failed (Hard to query).
- Structured (JSON): {"retry": 3, "service": "payment", "error": "timeout"} (Easy to aggregate).
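As a rough illustration of the structured form, the sketch below emits one JSON object per log line using only Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper, and the `fields` key passed through `extra` are hypothetical names chosen for this example, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `make_logger("payment").error("DB call failed", extra={"fields": {"service": "payment", "retry": 3, "error": "timeout"}})` would produce a line whose `service`, `retry`, and `error` keys a log backend can index directly.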
Centralized Logging Architecture
Don't SSH into servers to tail -f. Ship logs to a central backend.
- Application: Writes to stdout/stderr.
- Collector (Fluentd/Vector): Reads streams, parses JSON, enriches (adds Kubernetes pod name).
- Storage (Elasticsearch/ClickHouse): Indexes fields.
- UI (Kibana/Grafana): Search "error" AND service="payment".
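The collector and search stages above can be sketched in a few lines. This is a toy model under stated assumptions, not how Fluentd or Vector are implemented: `collect`, `search`, the `kubernetes.pod_name` field name, and the pod name in the usage note are all made up for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List


def collect(lines: Iterable[str], pod_name: str) -> Iterator[Dict]:
    """Parse JSON log lines and enrich each event with its source pod."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            event = None
        if not isinstance(event, dict):
            # Keep unparseable lines as plain messages instead of dropping them.
            event = {"message": line, "parse_error": True}
        event["kubernetes.pod_name"] = pod_name
        yield event


def search(events: Iterable[Dict], **filters) -> List[Dict]:
    """Return events whose fields equal every filter (an AND query)."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]
```

For example, `search(collect(lines, pod_name="payment-7d9f"), service="payment", error="timeout")` mirrors the field-based query a UI would run against indexed storage.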
2. Metrics (The "Health")
Aggregated numerical data. Cheap to store. Good for alerts.
Key Types:
- Counter: Only ever increases, apart from resetting to zero on restart (Total Requests, Errors). rate() gives requests/sec.
- Gauge: Goes up and down (Memory Usage, Queue Size).
- Histogram: Distribution of values (Request Latency: p50, p90, p99).
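To make the counter and histogram ideas concrete, here is a simplified sketch. `rate` computes a per-second increase from raw counter samples and treats a drop in value as a reset; `percentile` uses the nearest-rank method over raw latencies. Both functions are illustrative: Prometheus itself extrapolates rates over a window and estimates quantiles from bucket counts, which this sketch does not reproduce.

```python
import math
from typing import List, Tuple


def rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (process restart),
    so the value after the reset counts as fresh increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With samples taken every 15 seconds, `rate([(0, 100), (15, 250), (30, 400)])` is 300 requests over 30 seconds, or 10 requests/sec.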
[!TIP] Cardinality Explosion: Avoid putting unique IDs (UserID, IP) in metric labels. It creates millions of time series and kills Prometheus.
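The arithmetic behind the warning is a simple product: in the worst case, one metric produces a time series for every combination of label values. The helper below is a hypothetical back-of-the-envelope calculator, and the label sizes in the usage note are assumed numbers.

```python
from typing import Dict


def series_count(distinct_values_per_label: Dict[str, int]) -> int:
    """Worst-case number of time series for one metric.

    Every combination of label values can become its own series,
    so the total is the product of the distinct values per label.
    """
    total = 1
    for distinct in distinct_values_per_label.values():
        total *= distinct
    return total
```

With `{"method": 4, "status": 5}` the metric has at most 20 series. Adding `"user_id": 1_000_000` raises the ceiling to 20,000,000.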
Scraping vs Pushing
- Prometheus (Pull): Scrapes the /metrics endpoint on a fixed interval (e.g. every 15s). The service doesn't need to know about the monitoring server.
- StatsD (Push): App sends UDP packets to a collector. Good for short-lived jobs (Lambdas).
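The two models differ mainly in who initiates the transfer. The sketch below shows both sides using only the standard library: a StatsD-style datagram (`name:value|type`) that the app pushes over UDP, and the plain-text body a service would return from `/metrics` for a scraper to pull. The function names are hypothetical, and the collector address (`127.0.0.1:8125`, the conventional StatsD port) is an assumption.

```python
import socket
from typing import Dict


def statsd_packet(name: str, value: int, kind: str = "c") -> bytes:
    """Encode one StatsD datagram (c = counter, g = gauge, ms = timer)."""
    return f"{name}:{value}|{kind}".encode("ascii")


def push(packet: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send: the app never waits for an acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


def render_metrics(counters: Dict[str, int]) -> str:
    """Render counters in the Prometheus text format served at /metrics."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

A short-lived job can call `push(statsd_packet("requests_total", 1))` and exit immediately, whereas the pull model needs the process to stay alive long enough to be scraped.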
3. Distributed Tracing (The "Where")
Follows a request across microservices.
Structure:
- Trace Context: A unique TraceID passed in HTTP headers (W3C Trace Context).
- Span: A unit of work (DB Query, HTTP Call). Has SpanID, ParentID, Start/End time.
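In W3C Trace Context, the IDs travel in a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses that header and builds the one for an outgoing call, keeping the TraceID and minting a new SpanID. It handles only version `00`, and `SpanContext`, `parse_traceparent`, and `child_traceparent` are illustrative names; in practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for an outgoing call: same TraceID, freshly generated SpanID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Each service repeats the same two steps: parse the incoming header, then send `child_traceparent(...)` downstream, so every span in the request shares one TraceID.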
Visualization: Waterfalls showing gaps (latency) and errors.
gantt
title Trace ID: a1b2c3d4
dateFormat s
axisFormat %S
section Frontend
GET /checkout :a1, 0, 500ms
section Payment Service
Authorize Card :a2, 100ms, 400ms
after a1
section Database
UPDATE Balance :a3, 200ms, 150ms
after a2
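A waterfall is useful because it separates a span's total duration from its self time, the part not spent waiting on children. The sketch below computes self time and renders a text waterfall. The span values loosely follow the chart above, reading each entry as a start offset and a duration and treating each service as the child of the one before it; that reading is an assumption. The `Span` class and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    duration_ms: int


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Span duration minus the duration of its direct children.

    Assumes children of one parent run sequentially (no overlap).
    """
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result


def waterfall(spans: List[Span], width: int = 50) -> str:
    """Render one text bar per span, scaled to the trace's total duration."""
    total = max(s.start_ms + s.duration_ms for s in spans)
    rows = []
    for s in sorted(spans, key=lambda span: span.start_ms):
        offset = s.start_ms * width // total
        length = max(1, s.duration_ms * width // total)
        rows.append(f"{s.name:<16}|{' ' * offset}{'#' * length}")
    return "\n".join(rows)


SPANS = [
    Span("a1", None, "GET /checkout", 0, 500),
    Span("a2", "a1", "Authorize Card", 100, 400),
    Span("a3", "a2", "UPDATE Balance", 200, 150),
]
```

Here `self_times(SPANS)` attributes 250 ms to the payment span itself, more than the database (150 ms) or the frontend (100 ms), which is the kind of answer the "long bar" is meant to surface.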
Integration: Code Example (OpenTelemetry)
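OpenTelemetry (OTel) is the vendor-neutral standard for generating all three signals. The example uses its Python API and assumes an SDK and exporters are configured elsewhere; without them, these calls are no-ops.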
import logging

from opentelemetry import trace, metrics

logger = logging.getLogger(__name__)

# 1. Tracing Setup
tracer = trace.get_tracer(__name__)

# 2. Metrics Setup
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("requests_total")

def handle_request():
    # Start a span
    with tracer.start_as_current_span("checkout_flow") as span:
        span.set_attribute("user.id", "123")
        # Increment metric
        request_counter.add(1)
        try:
            process_payment()
        except Exception as e:
            # The exception is handled here, so record it on the span explicitly
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            # Correlated with the TraceID only if OTel logging instrumentation is enabled
            logger.error("Payment failed", exc_info=e)

def process_payment():
    # Child span: nested under "checkout_flow" via the current context
    with tracer.start_as_current_span("db_update"):
        # Database logic...
        pass
The "Golden Signals" (SRE)
What should you alert on? (Google SRE Book)
- Latency: Time to service a request.
- Traffic: Demand on system (Req/sec).
- Errors: Rate of failed requests (HTTP 500s).
- Saturation: How "full" is the service? (CPU/Memory/Queue depth).
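All four signals can be derived from a window of request records plus one resource reading. The sketch below is a simplified calculation: the `Request` record, the choice of p99 for latency, and queue depth as the saturation measure are assumptions for this example, not definitions taken from the SRE book.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    status: int


def golden_signals(
    requests: List[Request],
    window_s: float,
    queue_depth: int,
    queue_capacity: int,
) -> Dict[str, float]:
    """Summarise one observation window as the four golden signals."""
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p99 over the raw latencies in this window.
    p99 = latencies[max(1, math.ceil(99 * n / 100)) - 1] if n else 0.0
    return {
        "latency_p99_ms": p99,
        "traffic_rps": n / window_s,
        "error_rate": errors / n if n else 0.0,
        "saturation": queue_depth / queue_capacity,
    }
```

An alert rule would then compare each value against a threshold, for example paging when `error_rate` stays above an agreed budget for several consecutive windows.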
Interview Tips
- "How do you debug a slow request in microservices?" → Distributed Tracing. Look for the "long bar" in the waterfall.
- "Push vs Pull Metrics?" → Prometheus (Pull) is standard for infrastructure. Push (Datadog) is better for serverless.
- "Log Levels" → Explain DEBUG vs INFO vs WARN vs ERROR. Don't log DEBUG in prod (cost).
- "Sampling" → Tracing 100% of requests is usually too expensive at scale. Tail-based sampling decides after a trace completes and keeps only the interesting (slow/error) traces.
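The tail-based sampling decision mentioned in the last tip can be sketched as a function that runs once all spans of a trace have arrived. The span dictionary shape, the 500 ms threshold, and the 1% baseline rate are assumptions chosen for illustration; production collectors expose their own policy configuration.

```python
import random
from typing import Callable, Dict, List


def keep_trace(
    spans: List[Dict],
    latency_threshold_ms: float = 500.0,
    baseline_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide, after the trace has completed, whether to store it."""
    has_error = any(s.get("error") for s in spans)
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if has_error or (end - start) >= latency_threshold_ms:
        return True
    # Keep a small random share of normal traces as a baseline for comparison.
    return rng() < baseline_rate
```

The cost of this approach is buffering: every span must be held until its trace finishes, which is why the decision usually lives in a collector rather than in the application.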