Observability as a System Property: Designing for Understanding

Contributor
Feb 12

Updated: Jun 22

Monitoring asks: "Is this thing working?" Observability asks: "Why is this thing behaving the way it is?"

The difference matters when your system is complex enough that failures aren't obvious. In a monolith, a stack trace tells you where the error happened. In a distributed system with 15 services, an event bus, three databases, and a cache layer, the stack trace tells you where the symptom appeared — not where the cause lives. The cause might be three services upstream, triggered by a data change that happened 20 minutes ago.

Observability isn't a tool you bolt on. It's a system property you design in. A system is observable when you can understand its internal state by examining its external outputs — logs, metrics, and traces — without deploying new code or adding new instrumentation.

The Three Pillars, Connected

You've heard about the three pillars of observability: logs, metrics, and traces. What's usually missing from that conversation is how they connect.

Logs: What Happened

Structured log entries record discrete events. Not printf debugging — structured events with consistent fields that can be queried, aggregated, and correlated.

{
  "timestamp": "2026-03-20T14:32:01Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user_789",
  "event": "payment_failed",
  "reason": "insufficient_funds",
  "amount": 49.99,
  "currency": "USD",
  "provider_response_ms": 230
}

Every field is queryable. "Show me all payment failures for user_789 in the last 24 hours" is a query, not a grep through text files. The trace_id connects this log to the distributed trace. The span_id connects it to the specific operation within the trace.
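To make "a query, not a grep" concrete, here is a minimal sketch that filters JSON log lines by field in plain Python. The sample entries and the `query` helper are hypothetical, made up for illustration; a real log backend would run the equivalent query for you.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical sample entries, one JSON object per line, using the fields above.
RAW_LOGS = """\
{"timestamp": "2026-03-20T14:32:01Z", "user_id": "user_789", "event": "payment_failed", "reason": "insufficient_funds"}
{"timestamp": "2026-03-20T09:10:44Z", "user_id": "user_123", "event": "payment_failed", "reason": "card_expired"}
{"timestamp": "2026-03-18T08:00:00Z", "user_id": "user_789", "event": "payment_failed", "reason": "provider_timeout"}
"""


def parse_ts(value):
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def query(lines, now, window, **fields):
    """Return entries inside the time window whose fields all match."""
    cutoff = now - window
    results = []
    for line in lines:
        entry = json.loads(line)
        in_window = cutoff <= parse_ts(entry["timestamp"]) <= now
        if in_window and all(entry.get(k) == v for k, v in fields.items()):
            results.append(entry)
    return results


# "All payment failures for user_789 in the last 24 hours"
now = datetime(2026, 3, 20, 15, 0, tzinfo=timezone.utc)
failures = query(RAW_LOGS.splitlines(), now, timedelta(hours=24),
                 user_id="user_789", event="payment_failed")
for entry in failures:
    print(entry["timestamp"], entry["reason"])
```

Only the first entry matches: the second belongs to a different user and the third falls outside the 24-hour window.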

Metrics: What's the Pattern

Metrics are aggregated measurements over time. Request rate, error rate, latency percentiles, queue depth, memory usage. They tell you the shape of system behavior — not individual events, but trends, anomalies, and patterns.

The four golden signals (from Google's SRE book) are the minimum:

Latency — How long requests take
Traffic — How many requests you're serving
Errors — How many requests fail
Saturation — How close to capacity you are

Metrics are the first thing you check when something seems wrong. They answer "is something unusual happening?" Logs and traces answer "what specifically is happening and why?"
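As a rough sketch of how the four signals fall out of raw request data, the snippet below summarizes one time window. The `Request` record, the `golden_signals` helper, and the use of in-flight requests as a saturation proxy are all assumptions for illustration; a metrics system would normally compute these from counters and histograms.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, in_flight, max_in_flight):
    """Summarize one time window of requests into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile: the value at 1-indexed position ceil(0.99 * n).
    p99 = durations[math.ceil(0.99 * len(durations)) - 1]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": p99,                         # Latency
        "traffic_rps": len(requests) / window_seconds,  # Traffic
        "error_rate": errors / len(requests),           # Errors
        "saturation": in_flight / max_in_flight,        # Saturation (assumed proxy)
    }


# Hypothetical 10-second window: 95 fast successes, 5 slow server errors.
window = [Request(100, 200)] * 95 + [Request(900, 500)] * 5
signals = golden_signals(window, window_seconds=10, in_flight=40, max_in_flight=50)
print(signals)
```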

Traces: What's the Path

A distributed trace follows a single request across every service it touches. Each service adds a span — a record of the work it did, how long it took, and what it called next.

[API Gateway: 450ms]
  └── [Auth Service: 15ms]
  └── [Order Service: 400ms]
       └── [Inventory Service: 50ms]
       └── [Payment Service: 320ms]
            └── [External Payment API: 280ms]

This trace shows immediately that the payment service's external API call is responsible for most of the latency. Without the trace, you'd see a 450ms API response and start guessing which service was slow.
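The same reading can be done mechanically. This sketch computes each span's self time (its duration minus the time spent in its children) for the trace above. The `Span` class and `self_times` helper are hypothetical, and the subtraction assumes children run one after another; concurrent children would need interval arithmetic instead.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)


def self_times(span):
    """Yield (name, self_ms): time in the span not covered by its children."""
    # Assumes sequential children, as in the trace shown above.
    child_total = sum(child.duration_ms for child in span.children)
    yield span.name, span.duration_ms - child_total
    for child in span.children:
        yield from self_times(child)


trace = Span("API Gateway", 450, [
    Span("Auth Service", 15),
    Span("Order Service", 400, [
        Span("Inventory Service", 50),
        Span("Payment Service", 320, [
            Span("External Payment API", 280),
        ]),
    ]),
])

slowest = max(self_times(trace), key=lambda pair: pair[1])
print(slowest)  # ('External Payment API', 280)
```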

The Connection

The three pillars are only powerful when connected. A metric alert fires (error rate increased). You drill into the metric to find which endpoint is affected. You pull traces for that endpoint to see where the errors occur. You find the relevant log entries using the trace ID to understand the specific failure.

This drill-down path — metric → trace → log — is the observability workflow. If your pillars are disconnected (different tools, no shared correlation IDs), you're doing the correlation manually, which means slowly, which means 3 AM incidents take longer to resolve.
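Here is a toy version of that drill-down over in-memory data, just to show why the shared trace ID matters. Every record and the `drill_down` helper are hypothetical; in practice each step is a query against a different backend.

```python
# Hypothetical data from the three pillars, linked only by endpoint and trace_id.
error_counts = {"/checkout": 42, "/search": 1}  # metric: errors per endpoint
traces = [
    {"trace_id": "abc123", "endpoint": "/checkout", "error": True},
    {"trace_id": "zzz999", "endpoint": "/checkout", "error": False},
    {"trace_id": "qrs456", "endpoint": "/search", "error": True},
]
logs = [
    {"trace_id": "abc123", "service": "payment-service", "event": "payment_failed"},
    {"trace_id": "zzz999", "service": "order-service", "event": "order_created"},
]


def drill_down(error_counts, traces, logs):
    """Metric -> trace -> log: follow the worst endpoint down to its log entries."""
    # 1. Metric: which endpoint has the most errors?
    endpoint = max(error_counts, key=error_counts.get)
    # 2. Traces: which requests to that endpoint failed?
    failing_ids = {t["trace_id"] for t in traces
                   if t["endpoint"] == endpoint and t["error"]}
    # 3. Logs: what do those requests say happened?
    return endpoint, [entry for entry in logs if entry["trace_id"] in failing_ids]


endpoint, related_logs = drill_down(error_counts, traces, logs)
print(endpoint, related_logs)
```

Without the `trace_id` field on the log entries, step 3 has nothing to join on, and you are back to matching timestamps by hand.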

Designing for Observability

Correlation IDs Everywhere

Every request entering your system gets a unique ID. That ID propagates through every service call, every log entry, every trace span, every queue message. When you investigate an issue, the correlation ID is the thread you pull.

OpenTelemetry's trace context propagation handles this for HTTP and gRPC calls once the relevant instrumentation libraries are in place. For async communication (message queues, event buses), you usually need to propagate the trace context explicitly in message headers.
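For the async case, a minimal sketch of carrying context in message headers might look like this. It hand-builds a W3C Trace Context `traceparent` value using only the standard library so the format is visible; the `inject` and `extract` helpers and the message shape are hypothetical, and with the OpenTelemetry SDK you would normally let its propagators read and write the header for you.

```python
import re
import secrets

# W3C traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: write the current trace context into message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Consumer side: read the context back, or None if missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest flag bit means "sampled"
    }


# Producer publishes a (hypothetical) queue message with the context attached.
trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
message = {"body": {"order_id": "order_42"}, "headers": {}}
inject(message["headers"], trace_id, new_span_id())

# Consumer continues the same trace instead of starting a new one.
context = extract(message["headers"])
print(context["trace_id"] == trace_id)
```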

Structured Events, Not Log Lines

logger.info("Processing order") tells you nothing useful.

logger.info("order_processing_started", order_id=order.id, items=len(order.items), total=order.total, user_id=order.user_id) tells you everything you need.

Every log entry should answer: who, what, when, and relevant context. The cost of adding context to log entries is near zero. The cost of not having context during an incident is hours of investigation.
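The keyword-argument call above matches the style of libraries such as structlog. As a sketch of the same idea with only the standard library, the snippet below routes context through `logging`'s `extra` parameter and renders it as JSON. `JsonFormatter` and `log_event` are hypothetical helpers, not part of any library.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent fields."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # "context" is attached by log_event() below via the extra= mechanism.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def log_event(logger, event, **context):
    """Log an event name plus structured key/value context."""
    logger.info(event, extra={"context": context})


logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log_event(logger, "order_processing_started",
          order_id="order_42", items=3, total=49.99, user_id="user_789")
```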

Cardinality Awareness

Metrics with high cardinality (many unique label values) are expensive to store and slow to query. A metric labeled with user_id creates a time series per user — millions of series at scale. A metric labeled with endpoint and status_code creates a manageable number of series.

Design your metrics for the queries you'll actually run. "Error rate per endpoint" is useful and has bounded cardinality. "Error rate per user per endpoint per minute" is theoretically useful and practically unsustainable.
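A quick back-of-the-envelope check makes the difference visible. The worst-case series count for a metric is the product of the distinct values of each label; the label counts below are made-up numbers for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for one metric.
bounded = series_count({"endpoint": 50, "status_code": 10})
unbounded = series_count({"endpoint": 50, "status_code": 10, "user_id": 1_000_000})

print(bounded)    # 500
print(unbounded)  # 500000000
```

Adding a single unbounded label turns a few hundred series into hundreds of millions, which is why per-user questions are usually better answered from logs or traces than from metrics.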

Service Level Objectives (SLOs)

SLOs define what "good" means for your system in measurable terms. "99.9% of requests complete in under 500ms" is an SLO. It transforms the question from "is the system healthy?" (vague) to "are we meeting our commitments?" (measurable).

SLOs create clarity about when to act. If you're at 99.95%, everything's fine. If you're at 99.92% and trending down, that's a signal to investigate before you breach 99.9%. This is the error budget approach: the gap between your target and 100% quantifies how much unreliability you can tolerate, and how fast you are spending it tells you when to prioritize reliability over features.
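The arithmetic behind an error budget is small enough to sketch. The `error_budget_remaining` helper and the request counts below are hypothetical; the calculation assumes a simple request-based SLO measured over one window.

```python
def error_budget_remaining(slo_target, total_requests, bad_requests):
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    # A 99.9% target over 1,000,000 requests allows about 1,000 bad requests.
    allowed_bad = (1 - slo_target) * total_requests
    return 1 - bad_requests / allowed_bad


# 99.95% of requests good: half the budget is left.
healthy = error_budget_remaining(0.999, 1_000_000, 500)
# 99.91% of requests good: only a tenth of the budget is left.
at_risk = error_budget_remaining(0.999, 1_000_000, 900)

print(f"{healthy:.0%} of budget left")
print(f"{at_risk:.0%} of budget left")
```

Tracking the remaining fraction, rather than the raw success percentage, makes "trending down" easy to see: the number moves from 100% toward zero as the budget is spent.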

Runbooks Attached to Alerts

Every alert should link to a runbook that explains: what this alert means, what to check first, common causes, and how to remediate. An alert without a runbook is a request for panic. An alert with a runbook is a request for a specific action.
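One way to keep this from slipping is a small check that runs wherever your alert definitions live. The sketch below assumes alerts are available as dictionaries with a `runbook_url` field; that field name, the alert names, and the URL are hypothetical and will differ by alerting tool.

```python
# Hypothetical alert definitions, e.g. loaded from your alerting config.
alerts = [
    {"name": "HighErrorRate",
     "runbook_url": "https://runbooks.example.com/high-error-rate"},
    {"name": "QueueBacklog"},                         # no runbook at all
    {"name": "DiskAlmostFull", "runbook_url": ""},    # empty link
]


def alerts_missing_runbooks(alerts):
    """Return the names of alerts with no runbook link, or an empty one."""
    return [alert["name"] for alert in alerts if not alert.get("runbook_url")]


missing = alerts_missing_runbooks(alerts)
if missing:
    print("Alerts without runbooks:", ", ".join(missing))
```

Failing a CI job when the list is non-empty is one option for enforcing the rule before an alert ever reaches production.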

Write runbooks when things are calm, not during the incident. Update them after every incident that reveals gaps.

The Observability Tax

Observability has costs. Log storage, metric storage, trace storage, the compute for collection and aggregation, the network overhead of exporting telemetry. These costs are real and can grow surprisingly fast.

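Sampling: You don't need to trace every request. Sampling 10% of requests in a high-traffic service is usually enough to see latency distributions and common failure paths at a fraction of the cost. Keep 100% of errors, since those are the traces you always want in full detail. Keeping every error means making the decision after the request completes (tail-based sampling), because a decision made at the start of a request can't know whether it will fail.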
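A minimal sketch of that policy, assuming the keep-or-drop decision is made once the request's outcome is known: errors are always kept, and everything else is sampled by hashing the trace ID so that every service reaches the same verdict for the same trace. The `keep_trace` helper is hypothetical and not any particular collector's API.

```python
import hashlib


def keep_trace(trace_id, had_error, sample_rate=0.10):
    """Decide, after the request finishes, whether to keep its trace."""
    if had_error:
        return True  # always keep full detail for errors
    # Hash the trace ID into a stable bucket in [0, 1). The same trace ID
    # always lands in the same bucket, so the decision is deterministic.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical traffic: 10,000 successful requests and one failure.
kept = sum(keep_trace(f"trace-{i}", had_error=False) for i in range(10_000))
print(f"kept {kept} of 10000 successful traces")
print(keep_trace("trace-failed", had_error=True))
```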

Retention tiers: Keep detailed data (individual logs and traces) for days or weeks. Keep aggregated data (metrics) for months or years. The recent detail is for debugging. The long-term aggregates are for trends.

Right-sized logging: Debug-level logging in production generates enormous volume. Use it during incidents (temporarily enable debug output for specific services), not by default.
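As a sketch of "temporarily", here is a context manager built on Python's standard `logging` module that turns on debug output for one logger and restores the previous level afterwards. The `temporary_log_level` helper and the logger name are hypothetical; in a running service you would more likely flip the level through configuration or an admin endpoint.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_log_level(logger_name, level):
    """Set a logger's level for the duration of a block, then restore it."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore even if the block raises, so debug logging can't be left on.
        logger.setLevel(previous)


payment_logger = logging.getLogger("payment-service")
payment_logger.setLevel(logging.WARNING)  # quiet by default

with temporary_log_level("payment-service", logging.DEBUG) as log:
    print(log.isEnabledFor(logging.DEBUG))           # True inside the block

print(payment_logger.isEnabledFor(logging.DEBUG))    # False again afterwards
```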

Key Takeaway

Observability is a system property, not a tool purchase. Design it in through correlation IDs, structured events, connected pillars (metrics → traces → logs), cardinality-aware metrics, and SLOs that define what "good" means. The system should explain its own behavior through its external outputs — and when it can't, that's a gap in the design, not a missing tool.

This completes the Systems Thinking learning path. You've covered the gap between diagrams and reality, data modeling in distributed systems, zero trust architecture, and observability as a design property. The throughline: systems thinking is about understanding the behavior that emerges from the interaction of components — and designing so that behavior is understandable.

ShiftQuality