
Logging That Actually Helps You Debug

  • ShiftQuality Contributor
  • Apr 25
  • 5 min read

The previous posts in this path covered testing that catches real bugs and testing async/stateful systems. This post covers the practice that fills the gap between "the test passed" and "what happened in production": logging — the diagnostic trail that tells you what your application was doing when things went wrong.

Most application logs are useless. They are either too verbose — thousands of "processing request" messages that drown the signal in noise — or too sparse — nothing between "application started" and "unhandled exception." Useful logging sits between these extremes: enough information to reconstruct what happened, structured enough to be searchable, and contextual enough to be meaningful.

The Purpose of Logging

Logging serves three distinct purposes, and confusing them produces logs that serve none well.

Debugging. When something goes wrong, logs tell you what happened leading up to the failure. The request that triggered the error. The state of the data at each processing step. The external calls that were made and their responses. This is the forensic record that lets you reconstruct the incident.

Monitoring. Aggregated log data reveals patterns — error rates trending upward, specific endpoints becoming slow, certain users hitting errors repeatedly. This is not about individual messages but about patterns across messages over time.

Auditing. Certain events must be recorded for compliance — who accessed what data, what changes were made, what decisions were automated. Audit logs have different requirements: they must be tamper-resistant, retained for specific periods, and include identity information.

A single log statement can serve multiple purposes, but the logging strategy should be designed with all three in mind. Debug logs that include user IDs serve auditing. Structured error logs that include endpoint and status code serve monitoring.

Structured Logging: The Foundation

Unstructured logs (human-readable text lines) are easy to write and painful to query at scale.

[2024-03-15 14:23:45] ERROR: Failed to process order 12345 for user john@example.com - payment service returned 503

This line is readable. But to answer "how many payment service errors occurred in the last hour?" you need regex parsing across millions of lines. To answer "which users were affected?" you need a different regex. Every new question requires a new parser.

Structured logging emits log entries as key-value pairs (typically JSON). The same event becomes:

{"timestamp": "2024-03-15T14:23:45Z", "level": "error", "message": "Payment processing failed", "orderId": "12345", "userId": "U-789", "service": "payment", "statusCode": 503, "duration_ms": 2340}

Now "how many payment errors in the last hour?" is a query: filter by service=payment, level=error, last hour. "Which users were affected?" is a query: filter by service=payment, level=error, group by userId. No regex. No custom parsing. The structure makes the data queryable.
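To make the contrast concrete, here is a minimal Python sketch of those two queries run directly over JSON log lines. The sample entries, their field values, and the function names (`payment_errors`, `affected_users`) are illustrative assumptions; in practice a log aggregation tool would run the equivalent filter and group-by for you.

```python
import json
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample of JSON log lines, modeled on the structured entry above.
LOG_LINES = [
    '{"timestamp": "2024-03-15T14:23:45Z", "level": "error", "service": "payment", "userId": "U-789", "statusCode": 503}',
    '{"timestamp": "2024-03-15T14:24:10Z", "level": "info", "service": "payment", "userId": "U-123", "statusCode": 200}',
    '{"timestamp": "2024-03-15T14:25:02Z", "level": "error", "service": "payment", "userId": "U-789", "statusCode": 503}',
    '{"timestamp": "2024-03-15T14:26:30Z", "level": "error", "service": "inventory", "userId": "U-456", "statusCode": 500}',
]


def parse_timestamp(value):
    """Parse the UTC timestamp format used in the sample entries."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def payment_errors(lines, since):
    """Yield entries with service=payment and level=error at or after `since`."""
    for line in lines:
        entry = json.loads(line)
        if (
            entry.get("service") == "payment"
            and entry.get("level") == "error"
            and parse_timestamp(entry["timestamp"]) >= since
        ):
            yield entry


def affected_users(lines, since):
    """Group the matching payment errors by userId and count them."""
    return Counter(entry["userId"] for entry in payment_errors(lines, since))


# "Last hour" relative to an assumed current time of 15:00 UTC.
now = datetime(2024, 3, 15, 15, 0, 0)
since = now - timedelta(hours=1)

print(len(list(payment_errors(LOG_LINES, since))))
print(affected_users(LOG_LINES, since))
```

Both questions are answered by the same few lines of field comparisons; no pattern matching against message text is involved.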

Most mainstream logging libraries support structured logging: Serilog in .NET, structlog in Python, Winston in Node.js, and Logback with a JSON encoder in Java. The switch from unstructured to structured logging is the single highest-value improvement most teams can make to their logging practice.
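As a rough illustration of what such a library does for you, the sketch below emits JSON entries using only Python's standard `logging` module. It is not how structlog or any of the libraries above are implemented; the `JsonFormatter` class, the `build_logger` helper, and the logger name are assumptions made for the example.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Attribute names every LogRecord carries by default. Anything else on a
# record was supplied by the caller through `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Copy caller-supplied context fields into the entry.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        return json.dumps(entry, default=str)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # a real service would pass sys.stdout instead
log = build_logger("orders", buffer)
log.error(
    "Payment processing failed",
    extra={"orderId": "12345", "service": "payment", "statusCode": 503},
)
print(buffer.getvalue())
```

The message stays a fixed string and the variable parts travel as separate fields, which is what keeps the entries queryable.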

Log Levels: Using Them Correctly

Log levels signal severity and control what gets recorded in different environments. Most frameworks provide some version of the five levels below (names vary, and some add a FATAL or CRITICAL level above ERROR), and most teams use them inconsistently.

ERROR. Something failed and the operation could not be completed. A request returned a 500. A database query threw an exception. A required external service is unreachable. Errors require investigation — if your error log is noisy with non-actionable messages, you have misclassified them.

WARN. Something unexpected happened but the operation continued. A retry succeeded after a transient failure. A deprecated API was called. A configuration value fell back to a default. Warnings indicate potential problems that do not require immediate action but should be monitored for patterns.

INFO. Significant business events at a high level. Application started. Request received and completed. Order processed. User logged in. Info logs should tell the story of what the application did, readable without drowning in detail.

DEBUG. Detailed diagnostic information for troubleshooting. Variable values, intermediate computation results, branch decisions. Debug logs are typically disabled in production and enabled temporarily when investigating an issue.

TRACE. Extremely detailed execution flow — entry/exit of methods, individual loop iterations. Almost never used in production. Useful in development for understanding unfamiliar code paths.

The practical guideline: production should run at INFO level by default, with the ability to dynamically switch to DEBUG for specific components when investigating an issue. If your INFO level produces more than a few hundred messages per minute per service, you are logging too much at INFO.
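One way to get that per-component switch, sketched with Python's standard `logging` module: loggers form a hierarchy, so a baseline level can be set on a parent and overridden on a single child at runtime. The `app.*` logger names and the two helper functions are hypothetical, and how the switch gets triggered (admin endpoint, config reload, signal) is left open. Note that Python's standard library has no TRACE level, so it is omitted here.

```python
import logging

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
}

# Baseline: everything under the hypothetical "app" namespace runs at INFO.
logging.getLogger("app").setLevel(logging.INFO)

payments = logging.getLogger("app.payments")
orders = logging.getLogger("app.orders")


def set_component_level(component, level_name):
    """Change one component's level at runtime without touching the others."""
    logging.getLogger(component).setLevel(LEVELS[level_name.upper()])


def reset_component_level(component):
    """Return the component to the level it inherits from its parent."""
    logging.getLogger(component).setLevel(logging.NOTSET)


# During an investigation: turn on DEBUG for payments only.
set_component_level("app.payments", "debug")
print(payments.isEnabledFor(logging.DEBUG), orders.isEnabledFor(logging.DEBUG))

# Afterwards: fall back to the INFO baseline.
reset_component_level("app.payments")
print(payments.isEnabledFor(logging.DEBUG))
```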

Context: The Missing Ingredient

A log message without context is a puzzle piece without the puzzle. "Payment failed" tells you almost nothing. "Payment failed for order 12345, user U-789, amount $49.99, payment method VISA ending 4242, error: insufficient funds, attempt 2 of 3" tells you everything you need.

Context fields should include: the request or transaction ID (to correlate all logs from a single operation), the user or account ID (to understand who is affected), the relevant entity IDs (order, product, session), timing information (how long the operation took), and any data that helps diagnose the specific failure.
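A small sketch of what attaching those fields can look like with Python's standard `logging` module. The `charge_order` function, its parameters, and the field names are assumptions for illustration; `charge` stands in for whatever call actually talks to the payment provider.

```python
import logging
import time

log = logging.getLogger("payments")


def charge_order(order_id, user_id, attempt, max_attempts, charge):
    """Call `charge()` and log the outcome with the context needed to diagnose it."""
    context = {
        "orderId": order_id,
        "userId": user_id,
        "attempt": attempt,
        "maxAttempts": max_attempts,
    }
    started = time.perf_counter()
    try:
        charge()
    except Exception as exc:
        # Record how long the failed attempt took and why it failed.
        context["durationMs"] = round((time.perf_counter() - started) * 1000)
        context["error"] = str(exc)
        log.error("Payment failed", extra=context)
        raise
    context["durationMs"] = round((time.perf_counter() - started) * 1000)
    log.info("Payment succeeded", extra=context)
```

Every field lands on the log record as its own attribute, so a structured formatter can emit it as a separate key instead of burying it in the message text.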

Correlation IDs are the most important context field. A single user action — clicking "place order" — might generate logs across five services. Without a correlation ID that flows through all five services, connecting those logs requires guessing based on timestamps. With a correlation ID, every log message from that operation is a single query away.

The implementation: generate a unique ID at the request boundary (the API gateway or the first service to handle the request), propagate it through HTTP headers to downstream services, and include it in every log message. Most structured logging frameworks support "log context" or "scoped properties" that automatically attach the correlation ID to every log message within a request scope.
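The three steps above can be sketched with Python's standard library, using `contextvars` as the request scope and a `logging.Filter` to stamp every record. The header name `X-Correlation-ID`, the helper names, and the plain-text format string are assumptions; a web framework's middleware would normally call `begin_request` for you.

```python
import contextvars
import io
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; use whatever your services agree on

_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Stamp the current correlation ID onto every record that passes through."""

    def filter(self, record):
        record.correlation_id = _correlation_id.get()
        return True


def begin_request(incoming_headers):
    """At the request boundary: reuse the upstream ID or generate a new one."""
    cid = incoming_headers.get(HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to calls made to downstream services."""
    return {HEADER: _correlation_id.get()}


stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s correlation_id=%(correlation_id)s %(message)s")
)
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

request_id = begin_request({})  # no upstream ID, so one is generated here
log.info("order placed")
print(stream.getvalue())
```

Call sites never mention the ID; the filter attaches it, which is the "scoped properties" behavior described above in miniature.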

What Not to Log

Not everything belongs in the log. Sensitive data — passwords, API keys, credit card numbers, personally identifiable information — must never appear in logs. A log message that includes "password": "hunter2" is a security incident waiting to happen. Log aggregation systems are not designed for sensitive data — they are often accessible to broad engineering teams and may be retained for years.

The practice: sanitize or redact sensitive fields before logging. Log the user ID, not the user's email. Log the payment status, not the card number. Log that authentication failed, not the password that was tried.
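A minimal sketch of redacting by field name before an event reaches the logger. The key list and the `redact` function are assumptions and far from exhaustive; name-based redaction only catches fields you anticipated, so it complements the habit of not passing sensitive values to the logger in the first place.

```python
# Field names treated as sensitive, compared case-insensitively (assumed list).
SENSITIVE_KEYS = {
    "password",
    "apikey",
    "api_key",
    "authorization",
    "cardnumber",
    "card_number",
    "email",
}

REDACTED = "[REDACTED]"


def redact(fields):
    """Return a copy of `fields` with sensitive values replaced, recursing into dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "userId": "U-789",
    "password": "hunter2",
    "payment": {"status": "declined", "cardNumber": "4111111111111111"},
}
print(redact(event))
```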

Health check noise is another common problem. If a load balancer pings your service every 5 seconds and each ping generates a log message, that is 17,280 useless messages per day per service. Exclude health check endpoints from access logging or log them at TRACE level.
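One way to exclude health checks, sketched with a `logging.Filter` from Python's standard library. It assumes the access logger attaches the request path to each record as a `path` field; the endpoint names and the `DropHealthChecks` class are hypothetical.

```python
import io
import logging

HEALTH_PATHS = {"/health", "/healthz", "/ready"}  # assumed endpoint names


class DropHealthChecks(logging.Filter):
    """Discard access-log records for health check endpoints."""

    def filter(self, record):
        # Returning False drops the record before it is written.
        return getattr(record, "path", None) not in HEALTH_PATHS


stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.addFilter(DropHealthChecks())
handler.setFormatter(logging.Formatter("%(message)s path=%(path)s"))
access_log = logging.getLogger("access")
access_log.addHandler(handler)
access_log.setLevel(logging.INFO)
access_log.propagate = False

access_log.info("request completed", extra={"path": "/health"})
access_log.info("request completed", extra={"path": "/orders"})
print(stream.getvalue())

# The arithmetic behind the daily volume: one ping every 5 seconds.
pings_per_day = 24 * 60 * 60 // 5
print(pings_per_day)
```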

High-volume loops should not log individual iterations. A batch process that logs "processing item 1 of 10000" generates 10,000 messages that provide no diagnostic value. Log the batch start, the batch completion, and any individual items that failed.
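That pattern, sketched in Python: one message when the batch starts, one per failed item, and one summary at the end. The `process_batch` function, the `handle` callback, and the field names are assumptions for illustration.

```python
import logging

log = logging.getLogger("batch")


def process_batch(items, handle):
    """Run `handle` on each item, logging only the start, failures, and summary."""
    log.info("batch started", extra={"itemCount": len(items)})
    failed = []
    for index, item in enumerate(items):
        try:
            handle(item)
        except Exception:
            failed.append(index)
            # log.exception records the traceback along with the message.
            log.exception("item failed", extra={"itemIndex": index})
    log.info(
        "batch completed",
        extra={"itemCount": len(items), "failedCount": len(failed)},
    )
    return failed
```

For a 10,000-item batch with two bad items, this produces four log messages instead of ten thousand, and the two that matter most carry the failing index and the traceback.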

The Takeaway

Effective logging is structured (JSON, queryable), leveled (appropriate severity, production runs at INFO), contextual (correlation IDs, entity IDs, timing), and disciplined (no sensitive data, no noise). The investment in good logging practices pays for itself the first time you need to diagnose a production issue — the difference between spending five minutes querying structured logs and spending five hours grepping through unstructured text.

Your logs are the story your application tells about what it did. Make that story readable, searchable, and complete enough to answer the questions you will have when something goes wrong.

Next in the "Testing That Matters" learning path: We'll cover performance testing fundamentals — how to measure whether your application is fast enough before your users tell you it is not.
