Rolling Back a Change Safely

Contributor
May 15

The most dangerous moment in any deploy is the one where you realize you have to roll back. Customer-impacting traffic is hitting the broken version. Someone has to decide quickly, the team is split between fixing forward and going back, and the rollback plan written six weeks ago hasn't been touched since. Now you find out whether it actually works.

A real rollback plan answers questions you don't have time to think about during an incident. This guide is the practical version.

What Rollback Actually Means

Rolling back is more than "deploy the previous version." It's restoring the system to a known-good state, including:

Code reverted
Configuration reverted
Database in a state the previous code can read and write
Caches consistent with the previous state
External integrations consistent with what the previous version expected

For a stateless service, this is straightforward. For anything involving data, migrations, or external state, it's complicated. Rollback complexity tracks the complexity of the change.

The Categories of Reversibility

Changes fall on a spectrum of reversibility.

Trivially reversible. Stateless service, simple deploy, no schema or data change. Rollback is redeploying the previous version, done in minutes.

Reversible with care. Database schema change that's backward-compatible (e.g., added a column). Old code still works against new schema. Rollback is the code revert; the schema change stays.

Reversible with data work. Schema change that the previous version can't handle (e.g., renamed a column). Rollback requires either renaming the column back or running the old code against a pre-change snapshot.

Hard to reverse. Data migration that transformed values. The original data may be gone. Rollback requires restoring from backup or re-running the transformation in reverse.

One-way door. A change that fundamentally can't be reverted: sending an email to customers, processing a payment, deleting a record permanently. The only "rollback" is a compensating action.

Knowing where on this spectrum your change sits determines the rollback design.

Design for Rollback Before You Ship

The best rollback strategy is to design the change to be rollback-friendly in the first place.

Backward-compatible schema changes. Don't rename columns; add a new one and deprecate the old. Don't drop columns; mark them unused. The change becomes reversible because the old code never lost access to data it needed.
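
As a rough illustration of the add-instead-of-rename approach, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table and the `fullname` and `display_name` columns are hypothetical names chosen for the example, not taken from any real system.

```python
import sqlite3

# Hypothetical schema: the old code reads users.fullname.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Instead of renaming fullname, add the new column alongside it...
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
# ...and backfill it from the old column. The old column stays in place.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def old_code_read(conn):
    # The previous version's query still works after the schema change.
    return conn.execute("SELECT fullname FROM users WHERE id = 1").fetchone()[0]


def new_code_read(conn):
    # The new version reads the added column.
    return conn.execute("SELECT display_name FROM users WHERE id = 1").fetchone()[0]
```

Because `fullname` is never renamed or dropped, reverting the code needs no schema work at all.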

Two-phase migrations. Phase 1: deploy code that writes to both old and new state but still reads from old. Phase 2: deploy code that reads from new. Rollback during phase 1 is trivial because the old state never stopped being written or read.
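
A small in-memory sketch of the two phases might look like the following. `ProfileStore`, its two dictionaries, and the `read_from_new` switch are all hypothetical stand-ins for a real datastore and a real deploy.

```python
class ProfileStore:
    """Writes go to both representations; reads follow a single switch."""

    def __init__(self, read_from_new=False):
        self.old = {}  # old state: full name as one string
        self.new = {}  # new state: name split into fields
        self.read_from_new = read_from_new

    def save(self, user_id, name):
        # Phase 1 behavior: always write both old and new state.
        self.old[user_id] = name
        first, _, last = name.partition(" ")
        self.new[user_id] = {"first": first, "last": last}

    def load(self, user_id):
        if self.read_from_new:
            # Phase 2 behavior: read from the new state.
            record = self.new[user_id]
            return f"{record['first']} {record['last']}".strip()
        # Phase 1 behavior: keep reading the old state.
        return self.old[user_id]
```

In this sketch, rolling back phase 2 is only a matter of setting `read_from_new` back to `False`, since the old state has been kept current the whole time.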

Feature flags. New behavior behind a flag, default off. "Rollback" becomes "flip the flag back." This is the most powerful single rollback technique available to modern teams.
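
To make the idea concrete, here is a minimal sketch of a default-off flag. The flag name, the `FLAG_` environment-variable convention, and the two tax functions are assumptions made up for the example; real teams usually read flags from a dedicated flag service or config store.

```python
import os


def flag_enabled(name, env=None):
    # Hypothetical convention: FLAG_<NAME>=on enables a flag; anything else is off.
    env = os.environ if env is None else env
    return env.get("FLAG_" + name.upper(), "off").lower() == "on"


def legacy_total(cents):
    # Existing behavior: 8% tax, truncated to whole cents.
    return cents + cents * 8 // 100


def new_total(cents):
    # New behavior behind the flag: 8% tax, rounded to the nearest cent.
    return cents + (cents * 8 + 50) // 100


def checkout_total(cents, env=None):
    # The flag defaults to off, so the legacy path runs unless it is enabled.
    if flag_enabled("new_tax_rounding", env):
        return new_total(cents)
    return legacy_total(cents)
```

With this shape, turning the flag off restores the legacy path without a deploy.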

Idempotent operations. Operations that produce the same result whether run once or many times. Rollback can re-run an operation safely.
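
One common way to get idempotency is to record an operation ID and skip anything already applied. The sketch below assumes a hypothetical in-memory ledger; a real system would keep the applied IDs in durable storage.

```python
def apply_credit(ledger, operation_id, account, amount_cents):
    """Apply a credit at most once per operation_id.

    Returns True if the credit was applied, False if it was a repeat.
    """
    if operation_id in ledger["applied"]:
        return False  # already done; re-running changes nothing
    balances = ledger["balances"]
    balances[account] = balances.get(account, 0) + amount_cents
    ledger["applied"].add(operation_id)
    return True
```

Running the same operation twice leaves the balance unchanged the second time, which is what makes a re-run during rollback or retry safe.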

Blue-green or canary deploys. Old and new versions running side by side. "Rollback" is shifting traffic back.
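
Traffic shifting is normally handled by a load balancer or service mesh, but the core idea can be sketched in a few lines. The function below is illustrative only: it hashes a hypothetical request ID into one of 100 buckets and sends the lowest buckets to the new version.

```python
import hashlib


def pick_version(request_id, canary_percent):
    """Route roughly canary_percent of requests to the new version.

    Hashing the request ID keeps the choice stable for a given request.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "new" if bucket < canary_percent else "old"
```

Setting `canary_percent` to 0 sends every request back to the old version, which is the whole rollback in this model.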

Design for these and rollback becomes a routine operation instead of an emergency.

Write the Rollback Plan with the Change

A rollback plan written after a deploy is fiction. Write it as part of the change request, before the change ships.

The plan should include:

Trigger criteria. What specific symptoms cause us to roll back? "Customer-facing errors above X%." "Latency above Y ms for Z minutes." Not "if things look bad." Specific, observable, falsifiable.
Authority. Who decides? On-call lead? Incident commander? The owner during a planned window?
Steps. Exact commands or actions, in order. Including how to verify each one.
Time estimate. How long does rollback take? If the answer is "more than 30 minutes," that's a red flag worth surfacing.
Recovered state. What state the rollback restores, and what it doesn't.
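
One way to keep these elements from being skipped is to capture the plan as structured data and check it during review. The sketch below is an assumption about how that could look; the `RollbackPlan` and `RollbackStep` names and fields are hypothetical, and the 30-minute limit simply mirrors the red flag mentioned above.

```python
from dataclasses import dataclass


@dataclass
class RollbackStep:
    action: str  # exact command or action to perform
    verify: str  # how to confirm the step worked


@dataclass
class RollbackPlan:
    trigger: str            # specific, observable trigger criteria
    authority: str          # the named person or role who decides
    steps: list             # ordered list of RollbackStep
    estimate_minutes: int   # expected time to complete the rollback
    recovered_state: str    # what the rollback restores
    unrecovered_state: str  # what it does not restore

    def problems(self):
        """Return a list of gaps a reviewer should raise before the change ships."""
        issues = []
        if not self.trigger.strip():
            issues.append("no trigger criteria")
        if not self.authority.strip():
            issues.append("no named authority")
        if not self.steps:
            issues.append("no steps")
        for number, step in enumerate(self.steps, start=1):
            if not step.verify.strip():
                issues.append(f"step {number} has no verification")
        if self.estimate_minutes > 30:
            issues.append("estimate exceeds 30 minutes")
        return issues
```

A change request could then be blocked until `problems()` returns an empty list, or until someone has explicitly accepted each remaining item.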

The plan should be readable by someone who didn't write it. If only the original author can execute it, you have a single point of failure.

Test the Rollback

Untested rollback is wishful thinking.

For major changes, test rollback in staging before production. Deploy the change, then execute the rollback. Verify the system is in the expected state.

Things that look obvious and aren't:

Connections cached at the application layer don't reset when you redeploy
Background jobs in queues still hold the new format
Data written during the failed deploy may be in a state the old code can't read
Configuration changes that took effect in external services aren't reversed by code rollback

Each of these has burned someone. Testing catches them before the real incident.

The Trigger Decision

Calling rollback under pressure is hard. The team wants to fix forward. The author of the change is invested in their work. Time pressure makes thinking expensive.

Strategies that help:

Pre-decide. The trigger criteria in the rollback plan are decisions made in advance, when calm. "If 5xx rate exceeds 1% for 3 minutes, we roll back." That decision was made by sober people. The on-call's job at 3am is to execute it, not to redecide it.
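
A pre-decided trigger like this is simple enough to encode, so nobody has to interpret it at 3am. The sketch below assumes a hypothetical list of per-minute 5xx rates, oldest first; the function name and parameters are made up for the example.

```python
def should_roll_back(error_rates, threshold=0.01, sustained_minutes=3):
    """Return True if the 5xx rate exceeded threshold for the last N minutes.

    error_rates holds one 5xx rate per minute (0.02 means 2%), oldest first.
    """
    if len(error_rates) < sustained_minutes:
        return False  # not enough data to say the breach was sustained
    recent = error_rates[-sustained_minutes:]
    return all(rate > threshold for rate in recent)
```

A single bad minute does not fire the trigger; three consecutive minutes above 1% does.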

Time-box. "We'll try one fix forward. If it's not resolved in 10 minutes, we roll back." Time-boxing forces the decision instead of letting the team chase a fix indefinitely.

Single authority. One named person has rollback authority. Not a committee. Not "we'll decide together." Committees in incidents don't decide.

Bias toward rollback. When in doubt, roll back. Fixing forward feels productive but often takes longer than rollback, and during the time you're trying to fix forward, customers are seeing broken behavior. Rollback gets you back to known-good state immediately; you can investigate at leisure.

Data Rollback Is Special

The hardest rollbacks involve data. Once you've migrated a billion rows, you can't easily un-migrate them.

Strategies:

Backup before migrating. Specifically, a snapshot taken at the moment of migration, one you have tested and know you can restore. Storage is cheap; recovery from no backup is expensive.
Forward-only migrations with compensating changes. Instead of un-migrating, fix forward with a follow-up migration that restores correctness.
Migrate in small batches. A failed migration of 100 rows is easier to recover from than a failed migration of 100 million.
Dual-write periods. During the migration, write to both old and new schemas. Rolling back means switching reads back to old. The data is still there.
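
The batching strategy above can be sketched as follows. This is a simplified in-memory version with hypothetical names; it keeps the original values of the batch in flight so that a failure undoes only that batch and stops, leaving earlier batches migrated.

```python
def migrate_in_batches(rows, transform, batch_size=100):
    """Transform rows in place, one batch at a time.

    If a batch fails, its original values are restored and the migration
    stops. Returns the number of rows successfully migrated.
    """
    migrated = 0
    for start in range(0, len(rows), batch_size):
        indexes = range(start, min(start + batch_size, len(rows)))
        originals = {i: rows[i] for i in indexes}  # undo record for this batch
        try:
            for i in indexes:
                rows[i] = transform(rows[i])
        except Exception:
            for i, value in originals.items():
                rows[i] = value  # put the failed batch back as it was
            return migrated
        migrated += len(indexes)
    return migrated
```

For example, converting `["1", "2", "3", "x"]` with `int` in batches of two migrates the first batch, restores the second when `"x"` fails, and reports two rows migrated.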

Plan the data rollback strategy before the migration runs. Discovering you need to invent one during the incident is not a fun moment.

The Communication Side of Rollback

A rollback affects more than the technical system. Stakeholders need updates.

Customers should see an acknowledgment on the status page if the issue was customer-visible, followed by an update confirming resolution once the rollback is done.
Internal teams that depended on the change should know the rollback happened and what's next.
Leadership should be told for changes with broad blast radius.

Pre-drafted messages for both "rollback in progress" and "rollback complete" save time when you need them.

After the Rollback

The change failed; the system is back. Now what?

Investigate without time pressure. Now you can analyze the failure without customers waiting.
Capture timeline and decisions. What happened, when, who decided what. The basis for the retrospective.
Decide on next steps. Is the change worth retrying with fixes? Is there an underlying issue to address first? Should we kill the change entirely?
Update the rollback plan. What did we learn? Refine the trigger criteria, the steps, the authority.

The retrospective for a rollback is one of the most valuable retrospectives the team can run. The incident is contained, the data is fresh, and the lessons compound.

Common Rollback Failures

Plan exists, no one's read it. Engineers find the plan in a wiki at 2am and realize it references a tool that's been deprecated.

Plan assumes things that have changed. Authentication, IAM, deployment pipeline — all may have evolved since the plan was written.

Rollback takes longer than the original deploy. The team has practiced shipping forward; they haven't practiced rollback. The procedure is unfamiliar.

Rollback fails partway. Now you're in an unknown state, neither old nor new. This is the worst case. Mitigated by testing.

Rollback succeeds but the data is wrong. The code is reverted; the data written during the failed deploy is still there in a state the old code mishandles.

Each of these is preventable with practice and discipline.

The Game Day

For high-stakes systems, periodically run "game days" — deliberate rollback drills. Deploy something to staging, intentionally cause it to fail, execute rollback. Time it. Find the broken parts.

This is the closest thing to insurance for rollback capability. It costs a day; it can save hours of downtime during a real incident.
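
Timing a drill is easier when the harness does it for you. The sketch below is one possible shape, not a standard tool: `run_drill` and its step format are hypothetical, and each step is assumed to be a callable that returns True once its verification passes.

```python
import time


def run_drill(steps, estimate_seconds, clock=time.monotonic):
    """Run rollback steps in order, timing each one.

    steps is a list of (name, callable) pairs; the drill stops at the
    first step whose callable returns a falsy value.
    """
    timings = []
    started = clock()
    for name, step in steps:
        step_started = clock()
        ok = bool(step())
        timings.append((name, clock() - step_started, ok))
        if not ok:
            break  # a failed step is exactly what the drill is meant to find
    total = clock() - started
    completed = len(timings) == len(steps) and all(ok for _, _, ok in timings)
    return {
        "timings": timings,
        "total": total,
        "completed": completed,
        "within_estimate": total <= estimate_seconds,
    }
```

Comparing `total` against the time estimate in the rollback plan shows whether that estimate was realistic.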

Key Takeaway

A rollback plan is only as good as your willingness to test it. Design changes to be rollback-friendly: backward-compatible schemas, feature flags, two-phase migrations. Write the plan as part of the change request, with specific trigger criteria and a single decision authority. Test rollback in staging for major changes. Bias toward rolling back in production when in doubt — fixing forward feels productive but often takes longer. After rollback, retrospect thoroughly and update the plan. Untested rollback is hope, not strategy.

ShiftQuality