Debugging in Production Without Losing Your Mind

  • ShiftQuality Contributor
  • Aug 5, 2025
  • 8 min read

The previous post covered testing strategies that catch real bugs before they ship. This one is about what happens when a bug gets through anyway — because bugs always get through.

Your tests were good. Your code review was thorough. Your staging environment passed every check. And at 2:47 PM on a Tuesday, a customer reports that the checkout flow is returning a blank page for some users. Not all users. Some users. Sometimes.

You have no dedicated SRE team. You have no on-call rotation with a runbook. You have a small engineering team, a production system that is misbehaving, and a Slack channel filling up with question marks.

This post is about building the systems and practices that turn that moment from a panic into a process. Not enterprise-grade observability platforms with six-figure contracts. Practical observability for teams where the person debugging the incident is also the person who wrote the code.

Why Production Bugs Are Different

Debugging in development is a controlled activity. You can set breakpoints, inspect state, step through code, and reproduce the problem at will. The environment is yours. The data is yours. The timeline is yours.

Production debugging is none of these things. You cannot attach a debugger to a live system serving real users. You cannot reproduce the problem because you do not know exactly what input triggered it. You cannot take your time because every minute the system is broken is a minute customers are affected.

The information asymmetry is the core problem. The bug is happening in an environment you cannot directly inspect, with data you cannot directly see, under conditions you cannot directly reproduce. Everything you know about the failure comes through indirect channels: logs, metrics, error reports, customer descriptions.

This means the quality of your debugging is determined by the quality of your indirect channels. If your logs say "an error occurred," you are blind. If your logs say "order processing failed for user 4821, payment gateway returned 422, request payload contained an expired promotion code that was valid at cart creation but expired during checkout," you have a lead.

The investment in observability is not operational overhead. It is the difference between finding the bug in twenty minutes and finding it in twenty hours.

Structured Logging: Your Future Self Will Thank You

Most application logging looks like this:

[INFO] Processing order
[INFO] Order processed successfully
[ERROR] Something went wrong

This is logging for humans who are watching the console in real time. It is useless for debugging a production incident after the fact, which is the only time you actually need your logs.

Structured logging looks like this:

{
  "timestamp": "2026-03-15T14:47:23.418Z",
  "level": "error",
  "message": "Order processing failed",
  "orderId": "ord_8f2a9c",
  "userId": "usr_4821",
  "step": "payment_authorization",
  "gatewayResponse": 422,
  "errorCode": "PROMO_EXPIRED",
  "promotionId": "promo_spring25",
  "promoExpiry": "2026-03-15T14:00:00Z",
  "requestDurationMs": 1247,
  "correlationId": "req_7d3e1f"
}

Every field is queryable. You can search for all errors with errorCode: PROMO_EXPIRED across the last 24 hours. You can filter by user, by order, by gateway response code. You can follow a single request across multiple services using the correlation ID. You can build a timeline of what happened, in what order, and how long each step took.
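As a rough illustration of what "queryable" means, the sketch below filters log entries by field and rebuilds the timeline for one request. It assumes logs are stored as one JSON object per line and reuses the field names from the example above; `load_entries`, `find`, and `timeline` are hypothetical helper names, and the sample lines are made up. In practice a log aggregation tool would run these queries for you.

```python
import json
from typing import Iterable, Iterator


def load_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Parse newline-delimited JSON, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # one bad line should not abort an investigation
        if isinstance(entry, dict):
            yield entry


def find(entries: Iterable[dict], **criteria) -> list:
    """Return entries whose fields equal every given criterion."""
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


def timeline(entries: Iterable[dict], correlation_id: str) -> list:
    """All entries for one request, oldest first.

    Sorting the raw strings works here only because every timestamp
    uses the same ISO 8601 UTC format.
    """
    matching = find(entries, correlationId=correlation_id)
    return sorted(matching, key=lambda e: e.get("timestamp", ""))


# Made-up sample data, deliberately out of order.
SAMPLE_LOG = """
{"timestamp": "2026-03-15T14:47:23.418Z", "level": "error", "message": "Order processing failed", "errorCode": "PROMO_EXPIRED", "correlationId": "req_7d3e1f"}
{"timestamp": "2026-03-15T14:47:22.171Z", "level": "info", "message": "Order processing started", "correlationId": "req_7d3e1f"}
{"timestamp": "2026-03-15T14:47:22.900Z", "level": "info", "message": "Order processing started", "correlationId": "req_aaaaaa"}
"""

entries = list(load_entries(SAMPLE_LOG.splitlines()))
promo_errors = find(entries, level="error", errorCode="PROMO_EXPIRED")
request_story = [e["message"] for e in timeline(entries, "req_7d3e1f")]
```

Here `promo_errors` holds the single failing entry and `request_story` lists the messages for `req_7d3e1f` in the order they happened. Neither query is possible against a line that only says "Something went wrong".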

The implementation cost is low. Most ecosystems have a logging library with structured output built in: Serilog in .NET, structlog in Python, Winston or Pino in Node.js. The hard part is not the library. It is the discipline of including context with every log statement.

The rule is simple: every log entry should contain enough information that someone who has never seen the code can understand what happened. Not just "payment failed." Which payment, for which user, at which step, with what response, and how long did it take?
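A minimal sketch of that discipline, using only Python's standard `logging` module so it stays self-contained (a library such as structlog would need less code). The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under a `context` key are assumptions made for this example, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} end up as record.context.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, default=str)


def build_logger(stream) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger(stream)
log.error(
    "Order processing failed",
    extra={"context": {
        "orderId": "ord_8f2a9c",
        "userId": "usr_4821",
        "step": "payment_authorization",
        "gatewayResponse": 422,
        "errorCode": "PROMO_EXPIRED",
    }},
)
entry = json.loads(stream.getvalue())
```

The call site is where the discipline lives: the message stays short and constant, and everything that varies per request goes into named fields.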

The Three Pillars, Practically

The observability community talks about three pillars: logs, metrics, and traces. In practice, a small team needs to get the first two right before worrying about the third.

Logs tell you what happened. Structured logs with context, as described above. The minimum investment that every production system needs.

Metrics tell you whether things are healthy in aggregate. Not what happened to a specific request, but how the system is behaving overall. Request rate. Error rate. Response time percentiles. Queue depth. Database connection pool utilization.

You do not need a custom metrics platform. Most cloud providers include basic metrics for their managed services. For application-level metrics, a simple time-series database or a managed service like Datadog, Grafana Cloud, or even CloudWatch custom metrics will work. The important thing is that you have a dashboard — one dashboard, not twenty — that shows the vital signs of your system. When something goes wrong, you look at the dashboard and the anomaly tells you where to focus.
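To show what those vital signs are computed from, here is a small sketch that derives an error rate and response time percentiles from a window of request samples. The `RequestSample` type, the choice to count only 5xx responses as errors, and the nearest-rank percentile method are assumptions for illustration; a metrics service would normally do this aggregation for you.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def error_rate(samples: list) -> float:
    """Fraction of requests that ended in a server error (5xx)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status >= 500)
    return errors / len(samples)


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up window of ten requests: nine fast successes and one slow failure.
window = [RequestSample(200, d) for d in (80, 85, 90, 95, 100, 105, 110, 120, 130)]
window.append(RequestSample(502, 2400))

durations = [s.duration_ms for s in window]
vital_signs = {
    "request_count": len(window),
    "error_rate": error_rate(window),
    "p50_ms": percentile(durations, 50),
    "p95_ms": percentile(durations, 95),
}
```

In this window the median is 100 ms while the P95 is 2400 ms, which is why percentiles belong on the dashboard: an average would blur the one request that a real user actually suffered through.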

Traces tell you how a single request moved through the system. For a monolith, traces are less critical because everything happens in one process and your logs can tell the story. For a system with multiple services, traces become important because they connect the dots between logs from different services. If you are running a monolith — which you probably should be, per the earlier posts in this series — defer tracing until you need it.
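For a monolith, one inexpensive way to let the logs tell the story is to stamp every log line with the ID of the request that produced it. The sketch below does this with the standard library's `contextvars` and a `logging.Filter`; `handle_request` and `authorize_payment` are hypothetical stand-ins for application code, and the plain-text format is used only to keep the example short.

```python
import contextvars
import io
import logging
import uuid
from typing import Optional

# Holds the ID of the request currently being handled; "-" outside a request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(levelname)s %(message)s"))

log = logging.getLogger("checkout")
log.handlers.clear()
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False


def authorize_payment() -> None:
    # No ID is passed in; the filter reads it from the context.
    log.info("authorizing payment")


def handle_request(request_id: Optional[str] = None) -> str:
    """Hypothetical entry point: set the ID once, log anywhere below it."""
    rid = request_id or f"req_{uuid.uuid4().hex[:6]}"
    token = correlation_id.set(rid)
    try:
        log.info("checkout started")
        authorize_payment()
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
    return rid


handle_request("req_7d3e1f")
lines = stream.getvalue().splitlines()
```

Every line emitted while the request is in flight now carries `req_7d3e1f`, without threading the ID through each function signature. If the system later grows into multiple services, the same ID can be forwarded in a request header, which is roughly the point where a real tracing tool starts to earn its keep.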

Incident Response for Small Teams

When something breaks in production, the biggest risk is not the bug itself. It is the chaos of the response. Three people investigating the same thing. Nobody writing down what they have tried. Communication happening in side conversations instead of a shared channel. Decisions being made without context.

You do not need a formal incident response framework with severity levels, war rooms, and postmortem templates written by a process consultant. You need a few habits that keep the response organized.

Designate one person as the investigator. Even on a three-person team, one person drives the investigation while the others handle communication or stand by. The worst incident response is three people independently grepping through logs with no coordination.

Use a single channel for everything. One Slack thread, one document, one place where every finding, hypothesis, and action gets recorded. Not DMs. Not verbal conversations. A persistent written record that anyone can read to get up to speed quickly. This record is also your postmortem draft: because you write it as events unfold, you do not have to reconstruct the timeline from memory later.

Communicate early and honestly. The moment you know something is wrong, tell your users (or your customer support team, or your stakeholders). "We are aware of an issue affecting checkout for some users. We are investigating." That is enough. People tolerate outages far better when they know you know. They lose trust when the system is broken and nobody acknowledges it.

Fix it first, understand it second. The priority during an incident is restoration, not root cause analysis. If you can roll back the last deployment to restore service, do that. If you can disable a feature flag to bypass the broken code path, do that. If you can redirect traffic away from the affected component, do that. Understanding why it broke is important, but it can happen after the bleeding stops.
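The feature flag option can be as simple as the sketch below, which reads a flag from an environment variable and skips the suspect code path when it is off. The `FEATURE_PROMO_CODES` variable, the `flag_enabled` helper, and the checkout functions are all hypothetical. Environment variables usually require a restart to change, so real flag systems tend to read from a config store instead; the shape of the check is the same.

```python
import os


def flag_enabled(name: str, default: bool = True) -> bool:
    """Read a flag from an environment variable such as FEATURE_PROMO_CODES=off."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}


def apply_promotion(total_cents: int, discount_cents: int) -> int:
    """Stand-in for the code path that is misbehaving in production."""
    return max(0, total_cents - discount_cents)


def checkout_total(total_cents: int, discount_cents: int) -> int:
    # With the flag off, checkout skips promotions entirely: degraded, but working.
    if flag_enabled("promo_codes"):
        return apply_promotion(total_cents, discount_cents)
    return total_cents


os.environ["FEATURE_PROMO_CODES"] = "on"
normal = checkout_total(5000, 500)

os.environ["FEATURE_PROMO_CODES"] = "off"  # the incident "fix": flip the switch
degraded = checkout_total(5000, 500)
```

With the flag off, customers temporarily lose their discount but can complete checkout, which is usually a better failure than a blank page. The flag only helps if it was wrapped around the risky path before the incident.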

The Postmortem That Actually Prevents Recurrence

Every incident should produce a postmortem. Not as punishment. Not as a blame exercise. As the most valuable learning artifact your team can produce.

A useful postmortem answers four questions. What happened? How did we detect it? How did we fix it? What will we change so this specific failure mode cannot recur?

The fourth question is the one that matters most, and it is the one most postmortems skip. "We will be more careful" is not a preventive action. "We will add a validation check that rejects expired promotion codes at the payment step, and we will add an alert that fires when the promotion-expired error rate exceeds two per hour" is a preventive action.
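As a sketch of what that preventive action might look like in code, the example below pairs a validation check at the payment step with a sliding-window alert. `PromotionExpired`, `validate_promotion`, and `RateAlert` are hypothetical names, and the in-memory counter is an assumption made to keep the example self-contained; a real alert would normally live in your metrics or error tracking service.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class PromotionExpired(Exception):
    """Raised when a promotion code is no longer valid at payment time."""


def validate_promotion(expires_at: datetime, now: datetime) -> None:
    # Re-check at the payment step: the code may have been valid at cart creation.
    if now >= expires_at:
        raise PromotionExpired(f"promotion expired at {expires_at.isoformat()}")


class RateAlert:
    """Fires when more than `threshold` events fall within the sliding `window`."""

    def __init__(self, threshold: int, window: timedelta) -> None:
        self.threshold = threshold
        self.window = window
        self._events = deque()

    def record(self, at: datetime) -> bool:
        """Record one event and report whether the alert is now firing."""
        self._events.append(at)
        # Drop events that have fallen out of the window.
        while self._events and at - self._events[0] > self.window:
            self._events.popleft()
        return len(self._events) > self.threshold


expiry = datetime(2026, 3, 15, 14, 0, tzinfo=timezone.utc)
alert = RateAlert(threshold=2, window=timedelta(hours=1))

fired = []
for minutes_after_expiry in (5, 20, 47):
    now = expiry + timedelta(minutes=minutes_after_expiry)
    try:
        validate_promotion(expiry, now)
    except PromotionExpired:
        fired.append(alert.record(now))
```

The first two rejections stay below the threshold and the third one, inside the same hour, trips the alert. The validation turns a blank page into a handled error, and the alert tells you when that handled error is happening often enough to deserve attention.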

Postmortems work when they produce concrete changes: a new test, a new alert, a new validation, a code change, a process change. They fail when they produce vague commitments to vigilance. Vigilance is not a system. Systems prevent recurrence. Vigilance prevents it until someone is tired.

Keep postmortems blameless. The question is never "who made the mistake." The question is "what about our systems and processes allowed this mistake to reach production?" The person who deployed the bug is not the problem. The missing test, the inadequate validation, the silent failure mode — those are the problems, and they are systemic, not personal.

Building Observability Incrementally

You do not need to instrument everything on day one. You need to start with the code paths that matter most and expand from there.

Week one: structured logging. Switch your logging library to structured output. Add context fields to your existing log statements. This is the highest-value, lowest-cost change you can make.

Week two: error tracking. Use an error tracking service; Sentry, Bugsnag, and Honeybadger are all reasonable choices. These catch unhandled exceptions, group similar occurrences together (typically by exception type and stack trace), and alert you when new error types appear. This turns "users are reporting a bug" into "we were alerted as soon as the first error occurred," often before any user gets around to reporting it.
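To make "grouping" concrete, here is a simplified sketch of the idea: reduce each exception to a fingerprint built from its type and the functions in its traceback, then count occurrences per fingerprint. This is an illustration only and not how Sentry, Bugsnag, or Honeybadger actually compute their groups; `fingerprint`, `charge`, and `lookup` are hypothetical names.

```python
import hashlib
import traceback
from collections import Counter


def fingerprint(exc: BaseException) -> str:
    """Hash the exception type plus the function names in its traceback.

    The message is deliberately left out because it often contains
    per-request data (user IDs, order IDs) that would split one bug
    into many groups.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__] + [frame.name for frame in frames]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def charge(user_id: str) -> None:
    raise ValueError(f"gateway rejected payment for {user_id}")


def lookup(order_id: str) -> None:
    raise KeyError(order_id)


groups = Counter()
for action, arg in ((charge, "usr_4821"), (charge, "usr_1002"), (lookup, "ord_8f2a9c")):
    try:
        action(arg)
    except Exception as exc:
        groups[fingerprint(exc)] += 1
```

Three failures collapse into two groups: the two payment errors share a fingerprint even though their messages differ, and the lookup error gets its own. That collapse is what lets a tracker say "one new problem, seen twice" instead of sending three unrelated notifications.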

Week three: a health dashboard. One page with two sections: request rate, error rate, and response time for your application, and database connection count, query latency, and disk usage for your infrastructure. Look at it once a day. You will start noticing patterns: the slow query that runs every morning at nine, the memory usage that creeps up between deployments, the error rate that spikes when a third-party service does maintenance.

Week four onward: alert on what matters. Not on every metric. On the metrics that indicate user-facing impact. Error rate above a threshold. Response time P95 above an acceptable limit. Queue depth growing instead of draining. Background job failure rate increasing. Each alert should have a clear meaning ("users are experiencing slow responses") and a clear first step ("check the database query latency dashboard").
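One way to enforce "a clear meaning and a clear first step" is to make both fields mandatory in the alert definition itself. The sketch below assumes a hypothetical `AlertRule` type and made-up thresholds and metric names; most alerting tools let you attach the same information as a description or annotation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    breached: Callable[[dict], bool]  # takes current metrics, returns True when firing
    meaning: str                      # what users are experiencing
    first_step: str                   # where the responder should look first


# Thresholds are illustrative, not recommendations.
RULES = [
    AlertRule(
        name="high_error_rate",
        breached=lambda m: m["error_rate"] > 0.02,
        meaning="users are seeing failed requests",
        first_step="check the error tracker for new or spiking error groups",
    ),
    AlertRule(
        name="slow_responses",
        breached=lambda m: m["p95_ms"] > 1500,
        meaning="users are experiencing slow responses",
        first_step="check the database query latency dashboard",
    ),
    AlertRule(
        name="queue_backlog",
        breached=lambda m: m["queue_depth"] > m["queue_depth_prev"],
        meaning="background work is arriving faster than it is processed",
        first_step="check worker health and job failure logs",
    ),
]


def evaluate(metrics: dict) -> list:
    """Return one human-readable line per firing alert."""
    return [
        f"[{rule.name}] {rule.meaning}. First step: {rule.first_step}"
        for rule in RULES
        if rule.breached(metrics)
    ]


firing = evaluate({"error_rate": 0.004, "p95_ms": 2100, "queue_depth": 12, "queue_depth_prev": 40})
```

With these sample metrics only `slow_responses` fires, and the message already tells whoever is paged what it means and where to start. An alert you cannot write a `meaning` and `first_step` for is a good candidate for deletion.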

The goal is not comprehensive observability. The goal is enough observability that when something breaks, you have a starting point. You know what is abnormal. You can follow the trail from the symptom to the cause without guessing.

The Information Problem

Production debugging is an information problem from start to finish. The bug exists. The evidence exists. Your ability to find the bug depends entirely on your ability to find and interpret the evidence.

Every observability investment — structured logging, metrics, error tracking, alerting — is an investment in the quality of the information you will have during the worst moments. The moments when the system is broken, customers are affected, and someone is asking how long until it is fixed.

The teams that handle these moments well are not the teams with the best engineers. They are the teams with the best information. They know what happened because their logs told them. They know the scope because their metrics showed them. They know what to fix because their error tracking grouped the symptoms into a cause.

Good information does not prevent incidents. It makes incidents shorter, less stressful, and less likely to recur.

The Takeaway

You do not need an SRE team to debug production effectively. You need structured logs that tell you what happened. Metrics that tell you whether the system is healthy. Error tracking that tells you about problems before users do. And a simple incident response process that keeps the response organized.

Build these incrementally. Start with structured logging — it is the single highest-leverage investment you can make in production reliability. Add error tracking. Add a dashboard. Add alerts. Each layer gives you more information, and more information means faster resolution, calmer incidents, and fewer repeat failures.

The goal is not zero incidents. It is the ability to detect, understand, and resolve incidents quickly — and to learn from each one so the system gets more resilient over time.

Next in the "Testing That Matters" learning path: We'll tackle testing the hard parts — async operations, external dependencies, and the stateful workflows that resist clean testing patterns.
