Quality Metrics That Actually Matter

Contributor
Feb 28

Walk into most engineering orgs and you'll find a quality dashboard. Test count: up and to the right. Code coverage: 84%. Bugs closed this sprint: 47. Everyone nods. And none of it answers the only question that matters: is the software actually getting better, or worse?

The problem isn't that teams don't measure. It's that they measure activity — things that are easy to count — and mistake it for quality. Then, because what gets measured gets managed, people optimize the proxy and the real thing quietly degrades. You end up with a team that's very busy producing metrics and a product that's no more reliable than it was a year ago.

This post is about the small set of metrics that resist gaming and actually track quality, and how to use them without turning your team into number-fakers.

The trap: measuring activity instead of outcomes

Here's the tell. Ask whether a metric goes up when people work harder or when the software gets better. Activity metrics go up with effort:

Tests written — a team can write a thousand tests that assert nothing.
Code coverage — measures which lines ran, not whether the test would catch a regression.
Bugs closed — rewards finding-and-fixing, which means it quietly rewards having more bugs.
Story points delivered — velocity is capacity planning, not quality, no matter how often it's misused as both.
Sprint commitment hit rate — rewards committing to less than you can do; a team that always hits 100% has learned to under-commit.
Code review approval rate — reviewers rubber-stamp or pick fights to move the number. The value of a review is in the discussion, not the verdict.

Each of these is gameable precisely because it measures the proxy, not the outcome. This is Goodhart's law in one sentence: when a measure becomes a target, it ceases to be a good measure. The fix isn't a better activity metric. It's measuring the outcome directly.

The metrics that actually track quality

There's a short list of metrics that are hard to game, because gaming them requires actually improving the software. Four carry most of the weight.

1. Escaped defect rate

Of all the defects you found, what fraction reached production before you caught them? This is the single most honest quality metric most teams aren't tracking. It directly measures the thing quality practices exist to prevent: bugs getting to users.

You can't fake it by working harder — the only way to drive escaped defects down is to genuinely catch more before release. Track it as a ratio (defects found in production ÷ total defects) over time, and watch the trend, not the absolute number. A rising trend means your safety net is developing holes, regardless of how many tests you wrote.
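To make the ratio concrete, here is a minimal sketch of how it could be computed per month. The record shape and the "production" / "pre-release" labels are assumptions for illustration; your bug tracker will have its own fields.

```python
from collections import defaultdict

# Hypothetical defect records: (month the defect was found, where it was found).
defects = [
    ("2024-01", "pre-release"), ("2024-01", "pre-release"),
    ("2024-01", "pre-release"), ("2024-01", "production"),
    ("2024-02", "pre-release"), ("2024-02", "production"),
    ("2024-02", "production"), ("2024-02", "pre-release"),
]

def escaped_defect_rate_by_month(records):
    """Return {month: defects found in production / total defects}."""
    total = defaultdict(int)
    escaped = defaultdict(int)
    for month, found_in in records:
        total[month] += 1
        if found_in == "production":
            escaped[month] += 1
    return {month: escaped[month] / total[month] for month in sorted(total)}

rates = escaped_defect_rate_by_month(defects)
for month, rate in rates.items():
    print(f"{month}: {rate:.0%}")
# 2024-01: 25%
# 2024-02: 50%
```

In this made-up data the total number of defects is the same in both months, but the share that escaped doubled. That's the movement worth asking about.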

2. Change failure rate

What percentage of your changes cause a failure (an incident, a rollback, a hotfix)? This is one of the four DORA metrics, and it's a direct read on whether your release process produces stable software. A team shipping ten times a day with a 5% change failure rate is in far better shape than a team shipping monthly with a 30% failure rate, even though the second team "moves more carefully." The first team's changes are small, so each failure is easier to isolate and undo.

Change failure rate is hard to game because the failures are real. You can't pretend a rollback didn't happen.
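A rough sketch of the calculation, assuming you can tag each deploy with whether it led to an incident, a rollback, or a hotfix. The `Deploy` record and its field names are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class Deploy:
    # Hypothetical deploy record; map these flags onto whatever your pipeline logs.
    id: str
    caused_incident: bool = False
    rolled_back: bool = False
    needed_hotfix: bool = False

    @property
    def failed(self) -> bool:
        # A deploy counts as failed once, even if several flags are set.
        return self.caused_incident or self.rolled_back or self.needed_hotfix

def change_failure_rate(deploys):
    """Fraction of deploys that caused an incident, rollback, or hotfix."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)

deploys = [
    Deploy("d1"),
    Deploy("d2", rolled_back=True),
    Deploy("d3"),
    Deploy("d4", caused_incident=True, needed_hotfix=True),
    Deploy("d5"),
]
print(f"{change_failure_rate(deploys):.0%}")  # 40%
```

Counting failed deploys rather than failure events is a deliberate choice here: a single bad change that triggers both an incident and a hotfix is still one failed change.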

3. Mean time to recovery (MTTR)

When something does break — and it will — how long until users are okay again? MTTR matters more than most teams admit, because perfect prevention is impossible and recovery speed is often the difference between a blip and a crisis. A team that recovers in eight minutes can tolerate a higher failure rate than one that takes four hours, because the impact of each failure is smaller.

Measuring MTTR also has a useful side effect: it pushes you toward observability, good alerting, and tested rollbacks — quality practices that prevention-only metrics never reward.
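As a sketch, MTTR is just the average gap between "users affected" and "users okay again." The incident timestamps and the monthly failure counts below are invented; the eight-minute and four-hour recovery times are the ones from the comparison above.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (impact started, service restored).
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 8)),      # 8 minutes
    (datetime(2024, 3, 9, 14, 30), datetime(2024, 3, 9, 15, 0)),   # 30 minutes
    (datetime(2024, 3, 20, 2, 15), datetime(2024, 3, 20, 2, 37)),  # 22 minutes
]

def mean_time_to_recovery(incidents):
    """Average time from start of impact to restoration."""
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

def monthly_impact_minutes(failures_per_month, recovery_minutes):
    """Total user-facing downtime: failure count times recovery time."""
    return failures_per_month * recovery_minutes

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:20:00

# Assumed failure counts, to show why fast recovery buys tolerance for failures.
fast_team = monthly_impact_minutes(10, 8)    # 10 failures, 8 minutes each
slow_team = monthly_impact_minutes(2, 240)   # 2 failures, 4 hours each
print(fast_team, slow_team)  # 80 480
```

With these assumed numbers, the team that fails five times as often still causes a sixth of the downtime.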

4. Change lead time (a maintainability signal)

How long does a change take to go from "started" to "in production"? On its own this is a flow metric, but as a trend it's a strong proxy for maintainability. When lead time creeps up, it usually means the codebase is getting harder to change safely — more coordination, more fragile tests, more fear. Rising lead time is often the first quantitative sign of accumulating technical debt, long before anyone says the words.
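One way to watch that trend, sketched under the assumption that you can export a start date and a deploy date for each change. The median is used rather than the mean so one stuck change doesn't distort the month; that's a choice on my part, not a rule.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical changes: (work started, deployed to production).
changes = [
    (date(2024, 1, 2), date(2024, 1, 4)),
    (date(2024, 1, 8), date(2024, 1, 11)),
    (date(2024, 1, 15), date(2024, 1, 16)),
    (date(2024, 2, 1), date(2024, 2, 6)),
    (date(2024, 2, 5), date(2024, 2, 9)),
    (date(2024, 2, 12), date(2024, 2, 20)),
]

def median_lead_time_days_by_month(changes):
    """Median days from start to production, bucketed by deploy month."""
    buckets = defaultdict(list)
    for started, deployed in changes:
        buckets[deployed.strftime("%Y-%m")].append((deployed - started).days)
    return {month: median(buckets[month]) for month in sorted(buckets)}

lead_times = median_lead_time_days_by_month(changes)
print(lead_times)  # {'2024-01': 2, '2024-02': 5}
```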

Together, escaped defects, change failure rate, MTTR, and lead time give you a balanced picture: are bugs reaching users, are releases stable, do we recover fast, and is the system staying changeable? (The DORA Metrics Tracker template in the Engineering Metrics path gives you a ready-made place to instrument and trend three of these four.)

A few more worth tracking

Once the core four are in place, a handful of supporting metrics can sharpen the picture. Add them only if each maps to a real question you have — not to fill a dashboard.

Incident rate by severity. Count incidents per month, split Sev-1/Sev-2/Sev-3, and trend it against how much you're shipping. If deploy volume goes up while the incident count stays flat, you're genuinely getting better. If both rise together, you've just gotten more tolerant of incidents.
Time to detection. How long between a failure starting and someone noticing. If customers notice before you do, that's a quality failure no matter how cleanly you respond afterward.
Bug recurrence rate. Of the bugs filed this half, how many are variants of bugs filed last half? High recurrence means you're fixing symptoms, not causes. Computing it requires someone to actually read the bug tracker, which is exactly why almost nobody tracks it, and why it's worth the effort.
Onboarding time. How long until a new engineer ships their first real change. It's a quality metric in disguise: teams with healthy practices, docs, and tooling onboard fast; teams running on folklore onboard slowly and can't say why.
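For the first of these, "trend it against how much you're shipping" can be as simple as normalizing incident counts by deploy volume. A minimal sketch, with invented data and a per-100-deploys unit that is my assumption rather than a standard:

```python
from collections import Counter

# Hypothetical incident log: (month, severity).
incidents = [
    ("2024-01", "sev2"), ("2024-01", "sev3"),
    ("2024-02", "sev1"), ("2024-02", "sev3"), ("2024-02", "sev3"),
]
# Hypothetical deploy counts per month.
deploys_per_month = {"2024-01": 40, "2024-02": 100}

def incidents_per_100_deploys(incidents, deploys_per_month):
    """Return {month: {severity: incidents per 100 deploys}}."""
    counts = Counter(incidents)
    result = {}
    for (month, severity), n in sorted(counts.items()):
        result.setdefault(month, {})[severity] = 100 * n / deploys_per_month[month]
    return result

normalized = incidents_per_100_deploys(incidents, deploys_per_month)
print(normalized)
# {'2024-01': {'sev2': 2.5, 'sev3': 2.5}, '2024-02': {'sev1': 1.0, 'sev3': 2.0}}
```

In this example the raw count went from two incidents to three, but per deploy the rate fell. The severity split still matters: February's Sev-1 deserves its own conversation.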

How to use them without breaking your team

The metrics are the easy part. Using them without triggering the exact gaming they're meant to resist takes discipline.

Measure the team, never the individual. The instant a quality metric is attached to one person's performance review, it stops measuring quality and starts measuring how good that person is at managing the number. Escaped defects are a property of a system — the design, the tests, the review process, the deployment pipeline — not of the last engineer who touched the code. Keep them at the team level where they describe the system, not a person.

Watch trends, not absolutes. A change failure rate of 12% means nothing in isolation. Is it rising or falling? That's the signal. Absolute targets ("get coverage to 90%") invite gaming; trend conversations ("why did escaped defects tick up the last two months?") invite curiosity.
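If you want a number to anchor the trend conversation, a least-squares slope over the last few periods is one option. This is a sketch with made-up monthly values, and the slope is a prompt for discussion, not a verdict.

```python
def trend_slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical change failure rate, in percent, for five consecutive months.
change_failure_pct = [12.0, 11.0, 13.0, 15.0, 16.0]
slope = trend_slope(change_failure_pct)
print(f"{slope:+.1f} points per month")  # +1.2 points per month
```

A positive slope on a failure metric means it's getting worse, so the question for the team becomes what changed over those months.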

Use them as conversation starters, not scoreboards. The right use of a quality metric is a recurring team discussion: here's the trend, what's driving it, what should we do? The wrong use is a leaderboard. The first builds a quality culture; the second builds a culture of looking good.

Keep the list short. Three to five metrics, total. A dashboard with twenty quality metrics is a dashboard no one acts on. Pick your two or three outcome metrics, add a leading indicator for your single biggest risk, and stop. You can always swap one out when your risks change.

Leading vs. lagging — and why you need both

Escaped defects and change failure rate are lagging indicators: by the time they move, the quality event already happened. They tell you the truth, but slowly. To act earlier, pair them with one or two leading indicators tied to your biggest risk — something that moves before quality degrades.

If your top risk is a fragile checkout path, a leading indicator might be "percentage of checkout changes covered by an end-to-end test." If it's an overloaded on-call rotation, it might be "alert noise ratio." The leading indicator is your early warning; the lagging metric is your reality check. Use the leading one to steer and the lagging one to confirm you steered correctly. Trust the leading indicator to act, but never let it override what the lagging metric is telling you actually happened.
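Both example indicators are simple shares of a list. A minimal sketch follows; the record fields are hypothetical, and "alert noise ratio" is assumed here to mean the fraction of alerts that needed no action, which is only one possible definition.

```python
def share(items, predicate):
    """Fraction of items for which predicate is true (0.0 for an empty list)."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)

# Hypothetical changes touching the checkout path this month.
checkout_changes = [
    {"id": "c1", "has_e2e_test": True},
    {"id": "c2", "has_e2e_test": True},
    {"id": "c3", "has_e2e_test": False},
    {"id": "c4", "has_e2e_test": True},
]
# Hypothetical on-call alerts, flagged by whether anyone had to act on them.
alerts = [
    {"id": "a1", "actionable": False},
    {"id": "a2", "actionable": True},
    {"id": "a3", "actionable": False},
    {"id": "a4", "actionable": False},
    {"id": "a5", "actionable": True},
]

checkout_e2e_share = share(checkout_changes, lambda c: c["has_e2e_test"])
alert_noise_ratio = share(alerts, lambda a: not a["actionable"])
print(f"{checkout_e2e_share:.0%} of checkout changes have an end-to-end test")  # 75%
print(f"{alert_noise_ratio:.0%} of alerts were noise")  # 60%
```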

Getting leadership off the wrong metrics

Picking better metrics for your team is the easy half. The harder half is the dashboard leadership has been reading for years. The metrics executives see become the metrics that get optimized, so changing what they see is its own piece of work — and you don't do it with a single email.

What actually works:

Run the new metrics alongside the old ones for a couple of quarters. Don't yank the coverage report. Add change failure rate and MTTR next to it, each with a sentence on what it means and why it matters.
Tie every metric to an outcome leadership already cares about. Change failure rate → customer trust. Lead time → time to market. MTTR → revenue protected during an incident. They don't care about the proxy; make the thing behind it explicit.
Show one concrete example of the old metric misleading them. "Coverage went up four points last quarter, and two incidents shipped through the new tests anyway — here's why." Specific beats abstract every time.
Be patient. Changing what an organization measures takes four to eight quarters of steady pressure. Push harder and you get backlash; push slower and nothing changes.

Start with one

If your team measures none of this today, don't build the dashboard. Pick one metric — escaped defect rate is the highest-leverage place to start — and track it for a quarter. Just making it visible changes behavior: people start asking why a bug escaped, which is the entire point.

Quality metrics aren't about precision. They're about pointing attention at outcomes instead of activity, and starting honest conversations about whether the software is actually getting better. Measure the few things that resist gaming, watch the trends, keep it at the team level, and the metrics will do their real job — not decorating a dashboard, but telling you the truth.

ShiftQuality