The Cost of Quality, Honestly

Contributor
Mar 2

The pillar essay on the economics of shift quality. Most quality arguments fail because they are made in language a CFO does not respect. This piece argues that quality has costs, that the costs are sometimes worth paying and sometimes not, and that the honest framing is the only one that wins budget.

The Problem With Quality Arguments

The standard argument for quality investment goes like this: "Quality matters. The cost of fixing bugs later is higher than the cost of preventing them. Therefore we should invest."

A CFO has heard this argument before. They have heard it from the security team, from the platform team, from the data team, and from the engineering manager who wants to refactor the legacy module. They have heard it phrased as "we'll save money in the long run" and "this is technical debt and the interest is compounding" and "the cost of not doing this is much higher than the cost of doing it."

Some of those arguments were right. Some of them were wrong. The CFO cannot tell the difference, because none of them came with a defensible number. So the CFO does what CFOs do: they fund the work that has a clear ROI story and defer the work that has a virtuous-sounding story without specifics. That defer-pile is where most quality investment goes to die.

The honest economic case for quality is more complicated than the standard pitch suggests, and it is more defensible.

The Two Failure Modes

Most discussions of quality investment frame the question as "are we investing enough." This is half the question. The full question is "are we investing the right amount in the right things," and both directions have failure modes.

Underinvestment is the well-known one. Tests are skipped. Observability is deferred. Refactoring is never funded. Engineers compensate with heroics, then leave. Incidents accelerate. Customer trust erodes. Eventually the organization spends three times the cost it would have spent on prevention, paying for it in incident response, rewrites, and re-hiring.

Overinvestment is the less-discussed one. Tests proliferate beyond their value, slowing the build and making changes brittle. Observability stacks become more complex than the systems they observe, requiring their own dedicated team. Refactoring projects start and never end, because their definition of done is "code I am happy with" rather than "code that solves the business problem." Quality work consumes a third of engineering capacity and produces less than its cost in customer-visible value. This is the failure mode of well-meaning engineering organizations that have made quality a virtue rather than a discipline.

A quality-disciplined leader recognizes both failure modes and pushes back on whichever one the team is currently in. Most teams are in underinvestment. Some teams are in overinvestment. A few are in both at once — overinvesting in the wrong quality work while underinvesting in the right quality work.

When Underinvestment Is Rational

There is one scenario where underinvestment in quality is the correct strategic choice: when you do not yet know whether anyone wants what you are building.

In the earliest stages of a product, before you have evidence that the thing you are making is valuable to anyone, heavy quality investment is premature. The cost of building it carefully is paid against the high probability that the entire artifact will be thrown away. The right strategy is to ship rough, learn fast, and accept that the code is disposable.

This is the startup playbook. It works.

The mistake is staying in startup-playbook mode past the point where it was the right call. Most companies do this. The pattern: the company validates that customers want the product, the product starts generating revenue, the codebase that was built to be disposable is now load-bearing, and no one declares the validation phase over. The team continues to ship rough, accumulate debt, and treat quality work as premature optimization. Six quarters later, the team is spending most of its time in incidents and rewrites. The company is no longer in the validation phase. The team's habits have not caught up.

The single highest-leverage quality decision a leader can make at this stage is to name the transition. "We are no longer pre-PMF. Starting today, the bar for new code is different. Here is the new bar. Here is the budget to bring the existing code up to it." The named transition gives the team permission to slow down, and the budget makes the slowdown possible. Without both, the team will continue at validation-phase pace until it breaks.

When Overinvestment Is Happening

The signs of overinvestment:

The test suite takes longer to run than the build takes to produce a binary.
Engineers spend more time maintaining the testing infrastructure than they spend testing.
Refactoring projects are running concurrently in multiple modules with no projected end date.
The platform team has more headcount than the product team and the product team's velocity is dropping.
Code review feedback is dominated by style, naming, and "best practice" comments that have only a weak connection to defect rates.
The team's response to any production incident is to add tests, regardless of whether tests would have caught the incident.

Overinvestment usually happens in organizations that experienced underinvestment first. The team got burned. They overcorrected. The overcorrection became culture. Years later, the team is shipping carefully tested, well-instrumented software at a third of the pace of competitors who are shipping carefully tested, well-instrumented software at a sensible pace.

The fix for overinvestment is uncomfortable because it requires telling engineers that some of the work they are proud of is not worth doing. That conversation is harder than the conversation about underinvestment, because no one is on fire. The argument is more abstract. Leaders who can have the conversation anyway — naming specific work that should stop, redirecting the capacity into customer-facing outcomes — are the leaders who break the overinvestment pattern.

The Math

For a budget conversation, the numbers a leader needs to produce:

Cost of an incident. Engineer hours during the incident, engineer hours during the post-mortem and remediation, customer-facing cost (churn, refunds, support load), reputational cost (estimate, but estimate it). For a production incident in a mid-stage SaaS company, the all-in cost is typically $50K–$500K depending on severity and customer impact. The numbers vary; the scale, tens to hundreds of thousands of dollars per incident, does not.
Cost of attrition. Loss of one senior engineer: roughly twelve months of fully loaded salary in direct cost, largely the ramp time of the replacement, plus an indirect cost from the productivity drag on the rest of the team during the transition. For a $300K fully loaded senior engineer, that is $300K in direct cost and another $100K–$200K in indirect cost, or $400K–$500K all-in. Engineers leave for many reasons; when quality friction comes up in exit interviews, it is consistently among the top three reasons cited.
Cost of the proposed investment. Headcount, time, opportunity cost. Be honest about opportunity cost — the work you are not doing while you are doing this.
Expected reduction in incident rate and attrition. This is the hardest number. Quality work pays back invisibly: the incident that does not happen is not on any ledger. The honest defense is to produce a range based on the team's current incident rate and a reasonable assumption about how much of it the proposed work addresses. Then run the math.
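The first two inputs in the list above can be written down as a small calculation. The following is a minimal sketch, not a standard costing model: the class and function names are hypothetical, and every incident figure (hours, hourly rate, customer and reputational cost) is a placeholder to be replaced with a team's own data. Only the attrition figures ($300K loaded salary, $100K–$200K indirect) come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IncidentCost:
    """Components of one incident's all-in cost, in dollars (hypothetical model)."""
    response_hours: float      # engineer hours during the incident
    remediation_hours: float   # post-mortem and remediation hours
    hourly_rate: float         # assumed fully loaded engineer cost per hour
    customer_cost: float       # churn, refunds, support load
    reputational_cost: float   # an estimate, but an explicit one

    def total(self) -> float:
        labor = (self.response_hours + self.remediation_hours) * self.hourly_rate
        return labor + self.customer_cost + self.reputational_cost


def attrition_cost(loaded_salary: float,
                   indirect_low: float,
                   indirect_high: float) -> tuple[float, float]:
    """Cost range for losing one senior engineer: about twelve months of
    loaded salary as direct cost, plus an indirect range for team drag."""
    return (loaded_salary + indirect_low, loaded_salary + indirect_high)


# Placeholder incident: none of these figures come from real data.
incident = IncidentCost(
    response_hours=60,
    remediation_hours=140,
    hourly_rate=150.0,
    customer_cost=50_000,
    reputational_cost=20_000,
)
print(f"Incident all-in cost: ${incident.total():,.0f}")

# The attrition example from the text: $300K loaded, $100K-$200K indirect.
low, high = attrition_cost(300_000, 100_000, 200_000)
print(f"Attrition cost range: ${low:,.0f} to ${high:,.0f}")
```

Keeping each component as a named field makes the assumptions easy to challenge one at a time, which is the point of bringing numbers to the conversation.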

A typical quality investment, defended honestly, looks like: $400K of engineering time invested to prevent an estimated 2–4 production incidents per year (at $100K–$300K each, so $200K–$1.2M) and reduce engineer attrition by one senior departure per year ($400K all-in). Total expected return: $600K–$1.6M against $400K invested. The defensibility is in the range, the assumptions behind it, and the willingness to revise the estimate based on what actually happens.
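The range arithmetic in this example is simple enough to check in a few lines. A minimal sketch, assuming the low end pairs the fewest incidents with the cheapest incident and the high end pairs the most with the most expensive; the function name `expected_return` is hypothetical. With the inputs from the paragraph, the low end works out to 2 × $100K + $400K = $600K and the high end to 4 × $300K + $400K = $1.6M.

```python
def expected_return(incidents_prevented: tuple[int, int],
                    cost_per_incident: tuple[float, float],
                    attrition_savings: float) -> tuple[float, float]:
    """Return the (low, high) expected annual return in dollars.

    Low end: fewest incidents prevented at the lowest cost each.
    High end: most incidents prevented at the highest cost each.
    """
    low = incidents_prevented[0] * cost_per_incident[0] + attrition_savings
    high = incidents_prevented[1] * cost_per_incident[1] + attrition_savings
    return low, high


investment = 400_000
low, high = expected_return(
    incidents_prevented=(2, 4),
    cost_per_incident=(100_000, 300_000),
    attrition_savings=400_000,
)

print(f"Expected return: ${low:,.0f} to ${high:,.0f}")
print(f"Return multiple: {low / investment:.1f}x to {high / investment:.1f}x")
```

Presenting the result as a multiple of the investment, alongside the raw range, lets the reader see at a glance that even the pessimistic case clears the cost under these assumptions.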

This is more rigor than most quality budget conversations get. It is also more rigor than most product feature budget conversations get, and the discrepancy is part of the reason quality work loses budget battles it should win.

How to Make the Argument

A few patterns that work:

Lead with the cost of inaction, not the virtue of action. "If we do not fund this, our incident rate trajectory suggests two additional Sev-1 incidents over the next six months at an estimated cost of $200K each." This is the same argument as "quality is important" but in a language the CFO speaks.
Tie the investment to a customer outcome the CFO cares about. Quality work that prevents incidents is quality work that protects customer retention. Frame it that way. Retention is on the CFO's dashboard. Internal engineering hygiene is not.
Propose a tranche, not a forever budget. Six months of investment with specific deliverables and a check-in is much easier to fund than open-ended quality work. The check-in gives the CFO an off-ramp if the work is not producing results, which is exactly the safety they need to greenlight the initial investment.
Show the work, not the conclusion. If you walk in with a three-line ask, you get a three-line answer. If you walk in with a one-page analysis showing your assumptions, your numbers, and your range of outcomes, you get a real conversation. Most leaders do not bring the analysis because the analysis is hard. The leaders who bring it are the ones who get funded.
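The cost-of-inaction and tranche patterns above also reduce to arithmetic that fits on the one-page analysis. A rough sketch follows, with hypothetical function names. The projection (two Sev-1 incidents at $200K each over six months) is the example from the text; the tranche cost of $150K and the single observed incident are placeholders. Attributing the whole difference between projected and observed incidents to the investment is itself an assumption worth stating at the check-in.

```python
def cost_of_inaction(projected_incidents: int, cost_per_incident: float) -> float:
    """Projected cost, in dollars, of not funding the work."""
    return projected_incidents * cost_per_incident


def tranche_check_in(projected_incidents: int,
                     observed_incidents: int,
                     cost_per_incident: float,
                     tranche_cost: float) -> dict[str, float]:
    """Compare what was projected without the work to what actually happened."""
    avoided = max(projected_incidents - observed_incidents, 0)
    avoided_cost = avoided * cost_per_incident
    return {
        "incidents_avoided": avoided,
        "avoided_cost": avoided_cost,
        "net": avoided_cost - tranche_cost,
    }


# The ask: the example projection from the text.
inaction = cost_of_inaction(projected_incidents=2, cost_per_incident=200_000)
print(f"Cost of inaction over six months: ${inaction:,.0f}")

# The check-in: placeholder figures for a six-month tranche.
check = tranche_check_in(
    projected_incidents=2,
    observed_incidents=1,
    cost_per_incident=200_000,
    tranche_cost=150_000,
)
print(f"Avoided cost: ${check['avoided_cost']:,.0f}, net: ${check['net']:,.0f}")
```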

The Invisible Payback

The deepest economic problem with quality work is that its payback is structurally invisible.

A feature that ships and generates revenue is visible. A migration that completes and reduces cloud spend is visible. An incident that does not happen because the test suite caught the bug last Tuesday is not visible to anyone except the engineer who saw the test fail. The CFO does not see it. The board does not see it. The customer does not see it. The savings are real, and they do not show up on any report.

This is why quality work is chronically underfunded across the industry. It is not because executives do not care about quality. It is because the alternative — funding visible work — has a much easier defense. Until the quality-disciplined leader makes the invisible payback visible, with numbers and ranges and named outcomes, the budget will continue to flow to the things that are easier to point at.

That work — making the invisible visible — is the leader's contribution to the economics of quality. No one else in the organization is positioned to do it. If the engineering leader does not do it, it does not get done.

The CFO is not the obstacle. The CFO is making rational decisions on the information they have. The job is to give them better information.

ShiftQuality