Distributed Systems: The Lies Your Diagrams Tell

  • ShiftQuality Contributor
  • Oct 1, 2025
  • 5 min read

Your architecture diagram looks clean. Service A calls Service B. Service B reads from Database C. An event flows to Queue D. Arrows connect boxes. Everything looks orderly, reliable, and deterministic.

It is none of those things.

The diagram shows the happy path at rest. It does not show what happens when the network between A and B drops packets for three seconds. It does not show what happens when B's response takes 50ms on average but 8 seconds at p99. It does not show what happens when A sends a request, B processes it, but the response never arrives — so A retries, and B processes it again.

Distributed systems do not behave like the diagrams that describe them. The gap between the diagram and reality is where every production incident lives. This post is about understanding that gap — the failure modes, trade-offs, and design constraints that define distributed systems engineering.

The Eight Fallacies, Condensed

In 1994, L Peter Deutsch, then at Sun Microsystems, drafted a list of assumptions developers make when building distributed systems, all of which turn out to be false. The eighth was added later and is usually credited to James Gosling. The list remains the most useful framework for thinking about distributed system failures.

The network is reliable. It is not. Packets are dropped, connections are reset, and DNS resolution fails. Every network call in your system can fail, and many failures are transient — they resolve on their own if you retry. Designing for this means every remote call needs timeout handling, retry logic, and fallback behavior.
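A minimal sketch of retry logic with a fallback, in Python. The names `TransientError` and `call_with_retry` are hypothetical, and the sketch assumes the wrapped `operation` enforces its own timeout and raises `TransientError` when a failure is worth retrying.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


def call_with_retry(operation, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(); retry transient failures with exponential backoff.

    Returns fallback() if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                break  # out of attempts; fall through to the fallback
            # Exponential backoff with full jitter, so that many callers
            # retrying at once do not hit the recovering service in lockstep.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            sleep(delay)
    return fallback()
```

The `sleep` parameter is injectable only so the delay can be skipped in tests. Retrying is safe only when the remote operation tolerates being executed more than once.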

Latency is zero. It is not. A function call within a process takes nanoseconds. A network call between services takes hundreds of microseconds at best and usually milliseconds, and under load it takes more. A workflow that makes ten sequential network calls therefore has a floor latency equal to the sum of those ten round trips, regardless of how fast each service processes its request. Tail latency compounds as well: the more calls a request makes, the more likely it is that at least one of them lands in the slow tail.
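The arithmetic can be made concrete with a small Python sketch. The 5 ms per-call figure is an assumed number for illustration, and the tail calculation assumes the calls are independent, which real systems only approximate.

```python
def sequential_floor_ms(latencies_ms):
    """Lower bound on end-to-end latency when calls run one after another."""
    return sum(latencies_ms)


def chance_of_slow_call(calls, tail_probability=0.01):
    """Probability that at least one of `calls` independent calls lands in the tail."""
    return 1 - (1 - tail_probability) ** calls


hops = [5.0] * 10  # assumed: ten sequential calls at 5 ms each
print(sequential_floor_ms(hops))          # 50.0
print(round(chance_of_slow_call(10), 3))  # 0.096
```

Under these assumptions, a request that makes ten calls has roughly a one in ten chance of hitting at least one p99-slow response, even though each individual call is slow only 1% of the time.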

The network is secure. It is not. Traffic between services can be intercepted, modified, or spoofed. Mutual TLS, authentication, and authorization between services are not optional in production — they are baseline requirements.

Topology doesn't change. It does. Services move between hosts. Load balancers shift traffic. DNS entries update. Any system that caches a resolved address indefinitely will eventually call a dead endpoint.
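One way to avoid caching an address forever is to attach an expiry to every cached entry. The sketch below assumes a hypothetical `resolve` callable that maps a service name to an address; `TtlResolver` and the 30-second default are illustrative choices, not a recommendation.

```python
import time


class TtlResolver:
    """Caches resolved addresses, but only for ttl_seconds."""

    def __init__(self, resolve, ttl_seconds=30.0, clock=time.monotonic):
        self._resolve = resolve      # hypothetical: name -> address
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}             # name -> (address, expires_at)

    def lookup(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]          # still fresh: reuse the cached address
        # Missing or expired: resolve again so topology changes are picked up.
        address = self._resolve(name)
        self._cache[name] = (address, now + self._ttl)
        return address
```

A shorter TTL notices moves sooner at the cost of more lookups. The clock is injectable so the expiry logic can be tested without waiting.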

The remaining fallacies — bandwidth is infinite, there is one administrator, transport cost is zero, the network is homogeneous — round out the picture. Together, they describe an environment that is unreliable, slow, insecure, and constantly changing. Every design decision in a distributed system must account for this reality.

Partial Failure: The Core Challenge

In a monolithic application, failure is binary. The process is running or it is not. In a distributed system, failure is partial. Service A is running. Service B is down. Service C is running but returning errors for 30% of requests. The database is accepting reads but rejecting writes.

Partial failure means the system is simultaneously working and broken. The user experience depends on which services their request touches and whether those services happen to be healthy at that moment. Two users making the same request at the same time can get different outcomes.

Handling partial failure requires designing every service interaction for the possibility that the other service is unavailable, slow, or returning garbage. Circuit breakers prevent a failing downstream service from dragging the caller down with it. Bulkheads isolate failures so that a problem in one subsystem does not consume resources needed by others. Timeouts prevent slow calls from blocking threads indefinitely.

These patterns are not optimizations. They are survival mechanisms. A distributed system without circuit breakers, bulkheads, and timeouts is a system where a single failing service can cascade into a complete outage.
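To make the circuit breaker idea concrete, here is a minimal single-threaded sketch in Python. The class name, the threshold, and the cool-down value are all assumptions for illustration; production libraries add thread safety, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the operation."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast instead of waiting on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit and restarts the cool-down.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than tying up a thread until a timeout fires, which is what keeps one failing dependency from exhausting the caller's resources.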

Consistency vs. Availability

The CAP theorem, stated simply: in the presence of a network partition, a distributed system must choose between consistency (every read returns the most recent write) and availability (every request to a reachable node receives a non-error response, even if that response is stale).

In practice, network partitions happen. They are rare in a well-maintained data center but inevitable over a long enough timeline. And in cloud environments with multiple availability zones, they are not even that rare.

The practical question is not "CAP or not CAP." It is "which consistency model does each part of your system actually need?"

A shopping cart can tolerate eventual consistency — if two requests to add items happen concurrently and one is briefly lost, the user adds it again. An inventory count for a limited-edition product cannot — overselling because of a consistency lag is a business failure. A user profile update can be eventually consistent — seeing the old name for a few seconds is harmless. A financial transaction must be strongly consistent — a balance that doesn't reflect a recent debit is a compliance issue.

Different data has different consistency requirements. The mistake is applying one model to everything — either paying the performance cost of strong consistency where it isn't needed, or accepting eventual consistency where it isn't safe.

Idempotency: Not Optional

In a distributed system, every operation that can be retried or redelivered must be safe to execute more than once. This property is idempotency, and it is non-negotiable.

Consider: Service A sends a payment request to Service B. Service B processes the payment and sends a response. The response is lost — the network drops it. Service A doesn't receive confirmation, so it retries. Service B receives the same payment request a second time.

If Service B is not idempotent, the customer is charged twice. If it is idempotent — if it recognizes the duplicate request and returns the original result without processing again — the customer is charged once, and the retry is transparent.

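Idempotency is implemented through unique request identifiers. Each operation carries an ID, generated by the caller and reused unchanged on every retry. Before processing, the service checks whether it has already processed that ID. If yes, it returns the stored result. If no, it processes the request and stores the result with the ID. The check and the store must be atomic, or two concurrent copies of the same request can both pass the check.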

This adds storage and lookup overhead. It also prevents an entire category of bugs that are nearly impossible to detect through testing — because the conditions that trigger them (network failures during response delivery) are difficult to simulate reliably.
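The check-then-store flow for request IDs can be sketched in a few lines of Python. `PaymentService` and its fields are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store with an atomic check-and-insert in a real service.

```python
class PaymentService:
    """Illustrative idempotent handler keyed by a caller-supplied request ID."""

    def __init__(self):
        self._results = {}   # request_id -> stored result
        self.charges = []    # side effects actually performed

    def charge(self, request_id, customer, amount_cents):
        # Duplicate request: return the original result, do not charge again.
        if request_id in self._results:
            return self._results[request_id]
        # First time this ID is seen: perform the side effect once.
        self.charges.append((customer, amount_cents))
        result = {"status": "charged", "customer": customer, "amount_cents": amount_cents}
        self._results[request_id] = result
        return result


service = PaymentService()
first = service.charge("req-1", "alice", 500)
retry = service.charge("req-1", "alice", 500)   # same ID: response was lost, caller retried
print(first == retry, len(service.charges))     # True 1
```

The retry receives the same result as the original call, and only one charge is recorded.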

Observability in Distributed Systems

Debugging a monolith means reading one log. Debugging a distributed system means correlating events across multiple services, each with its own logs, metrics, and timelines.

Distributed tracing solves this by assigning a unique trace ID to each request at the entry point and propagating it through every service the request touches. When something fails, you pull the trace ID and see the complete journey: which services were called, in what order, how long each took, and where the failure occurred.
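A toy sketch of trace ID propagation, in Python. The header name `X-Trace-Id` and the functions `service_a` and `service_b` are hypothetical, and plain function calls stand in for network calls; real deployments typically rely on a tracing library and a standard header format rather than hand-rolled code like this.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, for illustration only


def ensure_trace_id(headers):
    """Reuse the incoming trace ID, or mint one at the entry point."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def log_event(log, service, trace_id, message):
    # Every log entry carries the trace ID so entries can be joined later.
    log.append({"trace_id": trace_id, "service": service, "message": message})


def service_b(headers, log):
    trace_id = ensure_trace_id(headers)
    log_event(log, "B", trace_id, "handled request")
    return "done"


def service_a(headers, log):
    trace_id = ensure_trace_id(headers)
    log_event(log, "A", trace_id, "received request")
    # Propagate the same ID on the outgoing call.
    result = service_b({TRACE_HEADER: trace_id}, log)
    log_event(log, "A", trace_id, "B returned " + result)
    return trace_id
```

Filtering the combined log by the returned trace ID yields the full path of one request across both services, in order.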

Without distributed tracing, debugging a production issue means guessing which service failed, reading its logs, finding the relevant entries, correlating timestamps with other services' logs, and hoping the clocks are synchronized closely enough for the correlation to be meaningful. This process can take hours. With tracing, it often takes minutes.

Tracing is not the only observability tool: metrics (request rates, error rates, latency distributions) and structured logging are equally essential. But tracing is the tool most specific to distributed systems, and the one most often missing.

The Takeaway

Distributed systems are not just systems with a network between them. They are systems operating in an environment where every interaction can fail, every call adds latency, every state can be inconsistent, and every operation might execute more than once.

The diagrams lie because they show structure, not behavior. The behavior — partial failures, consistency trade-offs, retry storms, cascading timeouts — is where the actual engineering happens.

Design for failure. Make operations idempotent. Choose consistency models per use case, not per system. Instrument everything. And always, always be suspicious of the clean diagram that makes the system look simple. It isn't.

Next in the "Systems Thinking" learning path: We'll cover data modeling for distributed systems — how to partition data across services, handle cross-service queries, and avoid the distributed monolith trap where services share a database and are coupled more tightly than the monolith they replaced.
