Microservices Communication Patterns That Scale

  • ShiftQuality Contributor
  • May 23, 2025
  • 6 min read

The previous posts in this path covered event-driven architecture and service boundaries. This post tackles the problem that makes or breaks a microservices architecture: how services talk to each other — and what happens when that conversation goes wrong.

In a monolith, a function call is a function call. It takes microseconds, it either succeeds or throws an exception, and you can step through it in a debugger. In a microservices architecture, every service-to-service call is a network call. It takes milliseconds to seconds. It can fail in ways a function call cannot — timeouts, partial failures, network partitions, version mismatches. The communication patterns you choose determine whether your distributed system is resilient or fragile.

Synchronous vs. Asynchronous: The Fundamental Choice

Synchronous communication — Service A calls Service B and waits for a response — is the familiar pattern. It maps naturally to HTTP request/response and feels like local function calls. This familiarity is both its strength and its trap.

The trap: synchronous calls create temporal coupling. Service A cannot continue until Service B responds. If Service B is slow, Service A is slow. If Service B is down, Service A is broken. Chain three synchronous calls (A calls B, B calls C, C calls D) and, assuming failures are independent, the availability of the chain is the product of each service's availability. If each of the three downstream services is 99.9% available, A's dependency chain is about 99.7% available (0.999 × 0.999 × 0.999), before counting A's own downtime. Add more services and the availability drops further.
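The arithmetic is easy to check. The sketch below assumes failures are independent (real outages are often correlated, so treat the numbers as a rough guide); the function name `chain_availability` is made up for this example.

```python
import math

def chain_availability(availabilities):
    """Availability of a synchronous call chain, assuming independent failures."""
    return math.prod(availabilities)

# Three downstream hops (B, C, D), each 99.9% available.
three_hops = chain_availability([0.999] * 3)

# Ten hops at the same per-service availability.
ten_hops = chain_availability([0.999] * 10)

print(f"3 hops:  {three_hops:.2%}")
print(f"10 hops: {ten_hops:.2%}")
```

Ten hops at 99.9% each already bring the chain down to roughly 99.0%, which is why deep synchronous call chains are worth avoiding.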

Asynchronous communication — Service A publishes a message and continues without waiting for a response — breaks this coupling. Service A does not know or care whether Service B is currently running. The message sits in a queue or event stream until Service B is ready to process it. If Service B is down for 30 minutes, the messages accumulate and are processed when it recovers.
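To make the decoupling concrete, here is a minimal in-memory sketch. `InMemoryQueue` is a hypothetical stand-in for a durable broker; a real queue or event stream would persist messages and deliver them over the network. The point is only that the publisher never waits for the consumer.

```python
from collections import deque

class InMemoryQueue:
    """Toy stand-in for a message broker (no persistence, single process)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        # The publisher returns immediately; nothing here depends on a consumer.
        self._messages.append(message)

    def drain(self, handler):
        """Deliver pending messages in order; returns how many were handled."""
        handled_count = 0
        while self._messages:
            # Remove the message only after the handler succeeds, so a crash
            # in the handler leaves it queued (at-least-once delivery).
            handler(self._messages[0])
            self._messages.popleft()
            handled_count += 1
        return handled_count

queue = InMemoryQueue()

# Service A publishes while Service B is down.
for order_id in (1, 2, 3):
    queue.publish({"order_id": order_id})

# Service B recovers later and works through the backlog.
handled = []
count = queue.drain(handled.append)
print(count, [m["order_id"] for m in handled])
```

Because a message stays queued until its handler succeeds, a handler may see the same message more than once, so consumers should tolerate duplicates.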

The trade-off: asynchronous communication is harder to reason about, harder to debug, and introduces eventual consistency — the state of the system is not immediately consistent after a change. The user updates their profile in Service A, and Service B reflects the change seconds later, not immediately. For many use cases, this delay is acceptable. For some — financial transactions, real-time inventory — it is not.

The guidance: use synchronous communication when you need an immediate response (user-facing API calls, authentication, real-time queries). Use asynchronous communication when you need reliability and decoupling (order processing, notifications, data synchronization, analytics events).

API Gateway: The Front Door

An API gateway sits between external clients and internal services, providing a single entry point that handles cross-cutting concerns: authentication, rate limiting, request routing, response transformation, and protocol translation.

Without a gateway, every service must implement its own authentication, rate limiting, and SSL termination. Clients must know the addresses of multiple services and handle communication with each one directly. A change to the authentication scheme requires updating every service.

The gateway centralizes these concerns. Clients talk to one endpoint. The gateway authenticates the request, routes it to the appropriate service (or services), aggregates the responses if needed, and returns a unified response. Internal services focus on business logic, trusting that authenticated, validated requests arrive from the gateway.
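As a rough illustration of "authenticate, then route," the sketch below resolves an incoming path to an upstream service. The route table, the token set, and the `*.internal` hostnames are all hypothetical, and a real gateway would also proxy the request, apply rate limits, and terminate TLS.

```python
# Hypothetical route table: path prefix -> internal service base URL.
ROUTES = {
    "/orders": "http://orders.internal",
    "/users": "http://users.internal",
}

# Hypothetical credential store; a real gateway would verify signed tokens.
VALID_TOKENS = {"secret-token"}

def resolve(path, token):
    """Return (status, upstream_url) for a request, without doing any I/O."""
    if token not in VALID_TOKENS:
        return 401, None
    for prefix, upstream in ROUTES.items():
        # Match the prefix exactly or as a path segment, so /ordersX is rejected.
        if path == prefix or path.startswith(prefix + "/"):
            return 200, upstream + path
    return 404, None

print(resolve("/orders/42", "secret-token"))
print(resolve("/orders/42", "wrong-token"))
print(resolve("/reports", "secret-token"))
```

Note that nothing in `resolve` knows what an order or a user is. It authenticates and routes, and that is all.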

The anti-pattern to avoid: turning the gateway into a business logic layer. The gateway should route and transform, not compute. Business logic in the gateway creates a bottleneck and reintroduces the monolithic coupling that microservices were supposed to eliminate. If the gateway needs a code deployment to change business behavior, it has absorbed too much responsibility.

Circuit Breakers: Failing Gracefully

When Service B is struggling — responding slowly or returning errors — Service A should not keep hammering it with requests. Each failed request consumes Service A's resources (threads, connections, memory) while providing no value. Without protection, Service B's problems cascade to Service A, which cascades to everything that depends on Service A.

The circuit breaker pattern prevents this cascade. It monitors calls to a downstream service and tracks failure rates. When failures exceed a threshold, the circuit "opens": subsequent calls immediately return an error or a fallback response without attempting the network call. After a timeout period, the circuit enters a "half-open" state, allowing a few test calls through. If those succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again and the timeout restarts.

The fallback strategy is what makes circuit breakers useful rather than just a different way to fail. When the circuit to the recommendation service is open, return a default set of popular products. When the circuit to the pricing service is open, return cached prices. When the circuit to a non-essential service is open, omit that feature from the response entirely. The user gets a degraded but functional experience instead of a complete failure.
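A minimal single-threaded sketch of the pattern with a fallback follows. It makes two simplifying assumptions that production libraries usually do not: it counts consecutive failures rather than a failure rate, and in the half-open state it lets calls through one at a time until one succeeds or fails. The class and parameter names are made up for this example.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests need not wait
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, fallback):
        state = self.state
        if state == "open":
            return fallback()       # fail fast: no network call attempted
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None       # success closes the circuit
        return result

# Usage with a fake clock and a hypothetical recommendation service.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
attempts = []

def recommendations_down():
    attempts.append("call")
    raise ConnectionError("recommendation service unavailable")

def popular_products():
    return ["popular-1", "popular-2"]

breaker.call(recommendations_down, popular_products)
breaker.call(recommendations_down, popular_products)
state_after_failures = breaker.state
served_while_open = breaker.call(recommendations_down, popular_products)

now[0] = 11.0                       # the reset timeout has elapsed
state_after_timeout = breaker.state
recovered = breaker.call(lambda: ["personalized-1"], popular_products)
print(state_after_failures, state_after_timeout, breaker.state)
```

While the circuit is open, the third call returns the popular-products fallback without touching the failing service, which is the behavior that stops the cascade.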

Retry Patterns: The Right Way to Try Again

Network calls fail transiently. A service that was unreachable for 200ms might be fine 500ms later. Retrying is the correct response to transient failures — but naive retrying is dangerous.

The naive approach — immediately retry on failure, up to 3 times — creates a retry storm. If 100 clients each retry 3 times against a struggling service, the service receives 400 requests instead of 100. The retries make the overload worse, extending the outage.

Exponential backoff with jitter solves this. The first retry waits 100ms plus a small random jitter, the second 200ms plus jitter, the third 400ms plus jitter. The randomization prevents all clients from retrying simultaneously. The increasing delays give the downstream service time to recover.
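One way to sketch this is shown below. The `TransientError` type, the function name, and the default values are assumptions for illustration, and the jitter here is a simple additive random delay of up to one base interval. Other jitter schemes exist and may suit a given workload better.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def retry_with_backoff(operation, retries=3, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random):
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt == retries:
                raise               # out of retries: surface the failure
            # 100ms, 200ms, 400ms, ... capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads clients out so they do not retry in lockstep.
            sleep(backoff + rng.uniform(0, base_delay))
            attempt += 1

# Usage: an operation that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporarily unreachable")
    return "ok"

delays = []
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, [round(d, 3) for d in delays])
```

Only `TransientError` is retried. Anything else, such as a validation error, propagates immediately, since retrying it would not help.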

Retry budgets add another safeguard. Instead of "retry each request up to 3 times," enforce "retries may add no more than 10% to total request volume." If 90% of requests are succeeding, the budget covers a retry for each of the 10% that fail. If 50% of requests are failing, the budget covers only a fraction of the failures, and the rest fail fast instead of piling more load onto a service that is already struggling. The budget limits the amplification factor regardless of the failure rate.
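A retry budget can be sketched as a pair of counters. This version counts over the lifetime of the process, which is an assumption made for brevity; real implementations typically use a sliding time window. The class name and the `max_retry_percent` parameter are made up for this example.

```python
class RetryBudget:
    """Allow retries only while they stay within a percentage of all requests."""

    def __init__(self, max_retry_percent=10):
        self.max_retry_percent = max_retry_percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self):
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.requests * self.max_retry_percent:
            self.retries += 1
            return True
        return False

# Usage: 100 requests, every second one fails and asks for a retry.
budget = RetryBudget(max_retry_percent=10)
allowed = 0
denied = 0
for i in range(100):
    budget.record_request()
    failed = i % 2 == 1
    if failed:
        if budget.try_acquire_retry():
            allowed += 1
        else:
            denied += 1
print(allowed, denied)
```

With half of all requests failing, only 10 of the 50 failures are retried, so the extra load on the downstream service stays at 10% no matter how bad the outage gets.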

Idempotency is the prerequisite for safe retries. If the first call succeeded but the response was lost (network partition between request and response), the retry must not create a duplicate effect. A payment service that charges the customer twice because of a retry has a serious bug. Every operation that will be retried must be idempotent — producing the same result regardless of how many times it is called.
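A common way to get idempotency is for the client to attach a unique key to each logical operation and for the server to remember the outcome per key. The sketch below uses an in-memory dictionary; the `PaymentService` class and its fields are hypothetical, and a real service would store keys durably and handle concurrent duplicates.

```python
class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency key -> receipt already returned
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A retry with a known key returns the stored receipt, with no new charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer_id, amount_cents))
        receipt = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt

payments = PaymentService()

# First attempt succeeds, but suppose the response is lost in transit.
first = payments.charge("order-1001-payment", "customer-7", 2500)

# The client retries with the same key and gets the same receipt back.
second = payments.charge("order-1001-payment", "customer-7", 2500)
print(first == second, len(payments.charges))
```

The key must identify the logical operation, not the HTTP attempt: generate it once per payment and reuse it on every retry.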

Service Mesh: Infrastructure-Level Communication

A service mesh moves communication concerns — retries, circuit breaking, load balancing, mutual TLS, observability — from application code into infrastructure. Instead of each service implementing its own retry logic and circuit breakers, a sidecar proxy (like Envoy, running alongside each service) handles these concerns transparently.

The value proposition: developers write business logic, and the mesh handles communication reliability. Retry policies are configured declaratively, not coded. Circuit breaker thresholds are adjusted without code changes. Mutual TLS encryption between all services is automatic.

The cost: operational complexity. A service mesh adds a proxy to every service, which adds latency (typically 1-3ms per hop), resource consumption, and a new system to configure, monitor, and debug. The mesh control plane becomes critical infrastructure: if it fails, proxies can typically keep running on their last known configuration, but policy changes and newly started workloads stall until it recovers.

The guidance: a service mesh is justified when you have many services (dozens to hundreds), when communication patterns are complex, and when you have the platform engineering capacity to operate the mesh. For 5-10 services, implementing circuit breakers and retries in application code is simpler and more transparent.

Saga Pattern: Distributed Transactions

When a business operation spans multiple services — creating an order involves the order service, the inventory service, and the payment service — you need a way to ensure consistency. Either all three steps succeed, or the system is left in a consistent state despite partial failure.

Distributed transactions (two-phase commit) are theoretically correct but practically brittle. They require all participating services to be available simultaneously and introduce significant latency.

The saga pattern replaces a single distributed transaction with a sequence of local transactions, each with a compensating action. The order saga: create the order (compensation: cancel the order), reserve inventory (compensation: release inventory), charge payment (compensation: refund payment). If payment fails, the saga executes compensations in reverse: release inventory, cancel order.
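The order saga above can be sketched as a small in-process coordinator. Everything here is illustrative: the `run_saga` helper and the step functions are made up, the "services" are local functions, and a real saga would persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate in reverse order of completion.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log

state = {"order": None, "reserved": 0}

def create_order():
    state["order"] = "created"

def cancel_order():
    state["order"] = "cancelled"

def reserve_inventory():
    state["reserved"] += 1

def release_inventory():
    state["reserved"] -= 1

def charge_payment():
    raise RuntimeError("card declined")

def refund_payment():
    pass  # never reached in this run: the charge itself failed

ok, log = run_saga([
    ("create_order", create_order, cancel_order),
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, refund_payment),
])
print(ok, log, state)
```

The failed step is not compensated, only the steps that completed before it. Compensations can themselves be retried after a failure, so they should be idempotent too.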

Orchestrated sagas use a central coordinator that manages the sequence. Choreographed sagas use events — each service listens for events and triggers the next step or compensation. Orchestration is easier to understand and debug. Choreography is more loosely coupled but harder to trace.

The Takeaway

Microservices communication is the architecture's nervous system. The patterns you choose — synchronous vs. asynchronous, how you handle failures, how you manage transactions — determine whether your distributed system is resilient or fragile.

Use synchronous calls for real-time user-facing requests. Use asynchronous messaging for reliability and decoupling. Implement circuit breakers to prevent cascade failures. Design retries with backoff, jitter, and budgets. Make operations idempotent. And use the saga pattern for distributed transactions that must be consistent across services.

The communication patterns are not incidental to the architecture. They are the architecture.

Next in the "Architecture for Real Systems" learning path: We'll cover API versioning strategies — how to evolve service interfaces without breaking the clients that depend on them.
