
Observability by Design, Not by Afterthought

  • ShiftQuality Contributor
  • Jul 10, 2025
  • 5 min read

The previous post in this path covered designing systems for testability. This post covers the companion discipline: designing systems that are observable — systems that tell you what they are doing, why they are doing it, and where they are struggling, without requiring you to attach a debugger to production.

Most teams add monitoring after the system is built. They instrument the edges — request counts, error rates, response times — and call it done. When something goes wrong, they have enough data to know that something is wrong but not enough to know why. The investigation begins: reading logs, correlating timestamps, guessing at causation.

Observability by design means the system is built to explain itself from the start. The instrumentation is not an afterthought bolted onto the surface. It is woven into the architecture so that diagnosing any problem — from a slow endpoint to a cascading failure — takes minutes, not hours.

The Three Pillars, Revisited

Observability has three pillars: metrics, logs, and traces. You have heard this before. The nuance is in how they work together.

Metrics are aggregated measurements over time. Request rate. Error rate. Latency distribution. CPU utilization. Queue depth. Metrics tell you what is happening at a system level. They are the dashboard that shows health at a glance.

Metrics are best for detection. They answer "is something wrong?" quickly and cheaply. A spike in error rate, a drop in throughput, a climb in latency — metrics surface these signals in real time.

Logs are discrete events with context. "User 12345 requested /api/orders at 14:32:07, received 200 in 142ms, query took 98ms." Logs tell you what happened in a specific instance. They are the narrative.

Logs are best for investigation. Once metrics tell you something is wrong, logs tell you what went wrong in specific cases. The key is structured logging — JSON-formatted log entries with consistent fields — rather than unstructured text strings that require regex to parse.

Traces are the causal chain of a single request across services. Request enters the API gateway, routes to the order service, calls the inventory service, queries the database, returns to the user. Traces tell you where time was spent and where failures originated.

Traces are best for diagnosis in distributed systems. When a request is slow, the trace shows which service in the chain is the bottleneck. When a request fails, the trace shows where the failure started and how it propagated.
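To make the idea of a causal chain concrete, here is a minimal sketch of a span recorder. It is not a real tracing library: the `Tracer` and `Span` names and their fields are invented for illustration, and a production system would normally use an established tracing SDK. The sketch only shows how nested spans record their parent and their duration, which is the information a trace view uses to point at the bottleneck.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed step of a request (hypothetical shape, for illustration)."""
    trace_id: str
    name: str
    span_id: str
    parent_id: Optional[str]
    duration_ms: float = 0.0


class Tracer:
    """Records nested spans for a single request, in process."""

    def __init__(self, trace_id: str) -> None:
        self.trace_id = trace_id
        self.finished: list[Span] = []   # spans in the order they completed
        self._stack: list[Span] = []     # spans currently open

    @contextmanager
    def span(self, name: str):
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, name, uuid.uuid4().hex[:16], parent_id)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the duration even if the wrapped code raised.
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(current)
```

Wrapping each hop in `with tracer.span("order-service"):` yields a list of spans whose `parent_id` links reconstruct the request path, and whose `duration_ms` values show where the time went.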

Each pillar alone is insufficient. Metrics without logs detect problems without explaining them. Logs without metrics explain individual events without showing patterns. Traces without either show the path without the context. The design goal is correlation: the ability to start with a metric anomaly, drill into the relevant logs, and follow the trace to the root cause.

Designing for Correlation

Correlation requires a shared identifier that connects metrics, logs, and traces for the same request. This is the correlation ID — a unique identifier assigned to each request at the entry point and propagated through every service, every log entry, and every trace span.

When the metrics dashboard shows a latency spike at 14:30, you filter logs by the time window and find the slow requests. Each log entry contains a correlation ID. You use that ID to pull the distributed trace, which shows the complete path of the request and where the time was consumed.

This flow, from metric to log to trace, should take under five minutes. If it takes longer, the observability design has gaps: the correlation ID is not propagated consistently, the logs are not structured for filtering, or the traces are not capturing the right spans.
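The drill-down described above can be sketched as two small queries over in-memory records. This is an illustration under assumptions, not a real log or trace backend: the field names (`timestamp`, `durationMs`, `correlationId`, `service`, `selfTimeMs`) and both function names are hypothetical, and `selfTimeMs` is assumed to mean the time a service spent on its own work, excluding downstream calls.

```python
from datetime import datetime


def slow_correlation_ids(entries, start, end, min_duration_ms):
    """Step 1: filter structured logs by time window and duration.

    `entries` is a list of dicts with ISO 8601 `timestamp` strings.
    Returns the correlation IDs of the slow requests in [start, end].
    """
    ids = []
    for entry in entries:
        ts = datetime.fromisoformat(entry["timestamp"])
        if start <= ts <= end and entry["durationMs"] >= min_duration_ms:
            ids.append(entry["correlationId"])
    return ids


def slowest_service(spans, correlation_id):
    """Step 2: pull the spans for one request and find the bottleneck.

    Returns the service name with the largest self time, or None if no
    span carries this correlation ID.
    """
    matching = [s for s in spans if s["correlationId"] == correlation_id]
    if not matching:
        return None
    return max(matching, key=lambda s: s["selfTimeMs"])["service"]
```

If either step cannot be written as a simple filter like this against your real tooling, that is usually where the gap in the design sits.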

Designing for correlation means making the correlation ID a first-class citizen in every service. It is set at the entry point, passed through every internal call (via HTTP headers, message metadata, or context propagation), and included in every log entry and trace span. This is a cross-cutting concern that belongs in the framework layer, not in individual service code.
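One way to keep the correlation ID out of individual service code is to hold it in a context variable that shared middleware sets once per request. The sketch below uses only the Python standard library. The header name `X-Correlation-ID` and the function names are assumptions made for illustration; the post does not prescribe them, and your framework may already provide an equivalent.

```python
import contextvars
import logging
import uuid

# Hypothetical header name; pick whatever your services agree on.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def start_request(headers: dict[str, str]) -> str:
    """Call at the entry point: reuse the caller's ID or mint a new one.

    Note: this plain dict lookup is case-sensitive; real HTTP frameworks
    usually expose case-insensitive header access.
    """
    cid = headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict[str, str]:
    """Headers to attach to every internal call so the ID propagates."""
    return {CORRELATION_HEADER: _correlation_id.get()}


class CorrelationFilter(logging.Filter):
    """Stamps every log record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True  # never drops a record, only annotates it
```

Attaching `CorrelationFilter` to the logging handlers and calling `start_request` in request middleware means handler code never touches the ID directly, which is what keeps propagation consistent.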

Structured Logging That Works

Unstructured logs — free-text strings — are easy to write and expensive to query. Finding all requests from user 12345 that resulted in errors requires full-text search across millions of log lines. The query is slow, the results are imprecise, and the experience is painful.

Structured logs — JSON objects with consistent fields — are marginally harder to write and dramatically easier to query. Every log entry has a userId field, a statusCode field, a correlationId field, and a durationMs field. Finding all error requests from user 12345 is a filtered query that returns in seconds.

The design principle: log entries are data, not messages. Design them as you would design a database record — with defined fields, consistent types, and enough context to be useful without requiring the reader to look at surrounding log lines.

At minimum, every log entry should include: timestamp, service name, correlation ID, log level, and a structured context object with request-specific data. At the request boundary, add: HTTP method, path, status code, duration, and user identifier.
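The minimum field set above can be enforced in one place with a custom formatter. This is a minimal sketch using the standard `logging` and `json` modules; `JsonFormatter`, `build_logger`, and the JSON key names are illustrative choices, not a standard. It assumes callers pass `correlation_id` and `context` through the `extra` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object with a fixed set of fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "service": self.service,
            "level": record.levelname,
            # Supplied via `extra=`; absent values stay visible as null / {}.
            "correlationId": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Returns a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]  # replace any handlers from earlier setup
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request-boundary entry would then look like `logger.info("request completed", extra={"correlation_id": cid, "context": {"method": "GET", "path": "/api/orders", "statusCode": 200, "durationMs": 142, "userId": "12345"}})`, and every field is directly filterable.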

Health Checks and Readiness Probes

Observable systems expose their own health. A health check endpoint returns the current status of the service and its critical dependencies — database connectivity, external API availability, message queue access.

This is not just for Kubernetes probes (though it serves that purpose). It is for human operators. When the dashboard shows a service is degraded, the health check endpoint tells you which dependency is the cause. "Database: healthy. Redis: healthy. Payment API: unhealthy — connection timeout." Investigation starts at the payment API, not with a guess.

Design health checks to be specific. A bare "Healthy" or "Unhealthy" is not useful when the service has five dependencies and one of them is down. "Degraded: database OK, cache OK, email service degraded (timeout at 14:32)" is actionable.
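A specific health report can be built by running one small check per dependency and recording why each failure happened. The sketch below is framework-agnostic and assumes each check is a plain callable that raises on failure; `run_health_checks`, `DependencyStatus`, and the response shape are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DependencyStatus:
    name: str
    healthy: bool
    detail: str = "OK"


def run_health_checks(checks: dict[str, Callable[[], None]]) -> dict:
    """Runs every check and reports per-dependency status.

    A check signals failure by raising; the exception type and message
    become the detail, so the operator sees the cause directly.
    """
    results = []
    for name, check in checks.items():
        try:
            check()
            results.append(DependencyStatus(name, True))
        except Exception as exc:
            detail = f"{type(exc).__name__}: {exc}"
            results.append(DependencyStatus(name, False, detail))

    overall = "healthy" if all(r.healthy for r in results) else "degraded"
    return {
        "status": overall,
        "dependencies": {
            r.name: {"healthy": r.healthy, "detail": r.detail} for r in results
        },
    }
```

Serving this dictionary as JSON from a health endpoint gives orchestrator probes the overall status and human operators the failing dependency in the same response. In a real service each check should also carry its own timeout so a hung dependency cannot hang the health endpoint.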

Alerting on Symptoms, Not Causes

A common observability mistake is alerting on causes: CPU is high, disk is 80% full, memory usage is climbing. These are useful signals but they are not user-facing symptoms. CPU can be high because the system is busy — not because anything is wrong.

Alert on symptoms — the conditions that users actually experience. Error rate exceeds threshold. Latency exceeds SLA. Request throughput drops below baseline. These are the signals that indicate something is affecting users, which is the only definition of "something is wrong" that matters.

Cause-based metrics (CPU, memory, disk) should be available for investigation but should not generate alerts unless they cross critical thresholds. The investigation flow: a symptom alert fires → you check the dashboard → the dashboard shows which component is stressed → the cause-based metrics explain why.
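The split between symptom alerts and cause metrics can be expressed as a small rule evaluator. The thresholds and names below are placeholders invented for illustration, not recommendations; real values come from your own SLOs and baselines. The point of the sketch is the shape: symptoms alert at their normal thresholds, while a cause metric such as CPU alerts only at a critical level.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    p95_latency_ms: float
    cpu_percent: float


# Placeholder thresholds for the example only.
MAX_ERROR_RATE = 0.02
MAX_P95_LATENCY_MS = 500.0
CRITICAL_CPU_PERCENT = 95.0


def evaluate_alerts(stats: WindowStats) -> list[str]:
    """Returns the alerts that should fire for this window."""
    alerts = []

    # Symptoms: what users actually experience.
    error_rate = stats.errors / stats.requests if stats.requests else 0.0
    if error_rate > MAX_ERROR_RATE:
        alerts.append(f"symptom: error rate {error_rate:.1%} above threshold")
    if stats.p95_latency_ms > MAX_P95_LATENCY_MS:
        alerts.append(f"symptom: p95 latency {stats.p95_latency_ms:.0f}ms above threshold")

    # Causes: stay silent unless they cross a critical threshold.
    if stats.cpu_percent >= CRITICAL_CPU_PERCENT:
        alerts.append(f"cause: CPU at {stats.cpu_percent:.0f}% (critical)")

    return alerts
```

With these placeholder values, a window at 90% CPU but with normal error rate and latency produces no alert, which matches the case of a system that is merely busy.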

The Takeaway

Observability is not monitoring added after the fact. It is a design property — built into the architecture through structured logging, distributed tracing, correlated metrics, and health endpoints that explain the system's state.

A system designed for observability explains itself when things go wrong. The investigation path — from alert to root cause — is measured in minutes. The alternative — a system where observability was an afterthought — produces investigations measured in hours and root causes that are guesses rather than evidence.

Design the explanation into the system. Future-you will be grateful at 3 AM.

Next in the "Quality Architecture" learning path: We'll cover resilient error handling — designing systems that degrade gracefully, communicate failures clearly, and recover automatically where possible.
