Disaster Recovery That Actually Works

Contributor
Dec 17, 2025

A backup you've never restored is not a backup. It's a hope.

Disaster recovery is the set of practices that determine what happens when things go seriously wrong — database corruption, region outage, ransomware, catastrophic misconfiguration, or the nightmare scenario where multiple things fail simultaneously. Every organization says they have a DR plan. Most have a document written two years ago that nobody's tested, referencing infrastructure that's since been restructured.

The gap between "we have a plan" and "we've practiced the plan and it works" is where disasters become catastrophes.

The Numbers That Matter

Recovery Time Objective (RTO)

How long can you be down before the business impact becomes unacceptable? For an e-commerce site during Black Friday, the answer might be minutes. For an internal reporting tool, it might be hours. For a documentation site, it might be a day.

RTO determines how fast your recovery process needs to be, which in turn determines the infrastructure investment. A 15-minute RTO requires a standby environment that is already running and can take over quickly. A 4-hour RTO can often be met with cold recovery from backups, provided the restore itself fits inside that window. The cost difference is significant.

Recovery Point Objective (RPO)

How much data can you afford to lose? RPO is measured in time — if your RPO is 1 hour, you can lose up to 1 hour of data. If it's 0 (zero data loss), you need synchronous replication.

RPO determines your backup frequency and replication strategy. Daily backups mean up to 24 hours of data loss. Continuous replication means near-zero data loss. Again, the cost difference is significant.
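The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, not a monitoring tool: the function names are hypothetical, and it assumes backups run on a fixed interval, so a failure just before the next run loses one full interval of data.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: a failure just before the next scheduled backup
    # loses (almost) one full interval of writes.
    return backup_interval


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # A schedule is acceptable only if its worst case fits inside the RPO.
    return worst_case_data_loss(backup_interval) <= rpo


def latest_backup_within_rpo(last_backup_at: datetime, now: datetime,
                             rpo: timedelta) -> bool:
    # Freshness check: if the newest backup is older than the RPO,
    # a failure right now would already violate the objective.
    return now - last_backup_at <= rpo


print(schedule_meets_rpo(timedelta(hours=24), timedelta(hours=1)))    # False
print(schedule_meets_rpo(timedelta(minutes=30), timedelta(hours=1)))  # True
```

The freshness check is the one worth alerting on: a schedule that looks fine on paper still fails the RPO if last night's job silently did not run.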

The conversation: RTO and RPO should be set by the business, not by engineering. Engineering determines what's technically possible and what it costs. The business decides what level of risk is acceptable given the cost. A 5-minute RTO and 0 RPO for every service is technically achievable and financially ruinous. Prioritize: which services need the tightest guarantees?

Backup Strategy

The 3-2-1 Rule

Three copies of your data. Two different storage types (e.g., local disk and cloud object storage). One copy offsite (different region or provider).

This isn't paranoia. A single backup on the same infrastructure as the primary database fails in exactly the scenarios where you need it most — data center failure, cloud provider outage, ransomware that encrypts everything it can reach.
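One way to keep the rule honest is to check an inventory of copies against it. The sketch below is illustrative only: the `BackupCopy` fields and labels are hypothetical, and it assumes the inventory lists every copy of the data, including the primary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    storage_type: str  # e.g. "local-disk", "object-storage" (labels are made up)
    location: str      # e.g. a region or site label


def satisfies_3_2_1(copies: list[BackupCopy], primary_location: str) -> bool:
    # 3: at least three copies of the data (primary included, by assumption).
    enough_copies = len(copies) >= 3
    # 2: at least two different storage types.
    enough_types = len({c.storage_type for c in copies}) >= 2
    # 1: at least one copy somewhere other than the primary location.
    has_offsite = any(c.location != primary_location for c in copies)
    return enough_copies and enough_types and has_offsite


inventory = [
    BackupCopy("primary-db", "local-disk", "region-a"),
    BackupCopy("nightly-snapshot", "local-disk", "region-a"),
    BackupCopy("offsite-archive", "object-storage", "region-b"),
]
print(satisfies_3_2_1(inventory, primary_location="region-a"))  # True
```

Drop the offsite archive from the inventory and the check fails on all three counts, which is exactly the single-backup-next-to-the-primary setup described above.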

What to Back Up

Databases: The obvious one. Automated snapshots on a schedule that meets your RPO. Test restores regularly.

Configuration: Infrastructure as code, application configuration, secrets (encrypted). If your infrastructure burns down, can you rebuild it from code? If not, you're missing something.

State that isn't in databases: File uploads, generated artifacts, cache warming data, search indexes. Identify everything that's stateful and ensure it's backed up or reproducible.

The recovery procedure itself: If your DR runbook is stored on the server that's down, it's not useful. Store it somewhere independent: a printed copy, a separate cloud account, or a shared drive that's not tied to the primary infrastructure.

Testing Backups

A backup test isn't "the backup job completed successfully." It's "I restored from this backup to a clean environment and the application works correctly with the restored data."

Schedule restore tests monthly. Automate them if possible — a pipeline that restores last night's backup to a test environment and runs smoke tests. If the restore fails or the tests fail, you know immediately, not during an actual disaster.
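The shape of such a pipeline might look like the sketch below. Everything here is a placeholder: `run_restore_test`, the `restore` callable, and the smoke test names are hypothetical, and a real pipeline would call your actual restore tooling and application checks instead of the stubs shown.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RestoreTestResult:
    restored: bool
    failed_checks: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A test passes only if the restore worked AND every smoke test passed.
        return self.restored and not self.failed_checks


def run_restore_test(restore: Callable[[], None],
                     smoke_tests: dict[str, Callable[[], bool]]) -> RestoreTestResult:
    try:
        restore()  # restore last night's backup into a clean environment
    except Exception:
        # The restore itself failed, so there is nothing to smoke-test.
        return RestoreTestResult(restored=False)
    failed = [name for name, check in smoke_tests.items() if not check()]
    return RestoreTestResult(restored=True, failed_checks=failed)


# Stubs standing in for real restore tooling and application checks.
result = run_restore_test(
    restore=lambda: None,
    smoke_tests={"login works": lambda: True, "orders table non-empty": lambda: False},
)
print(result.passed)         # False
print(result.failed_checks)  # ['orders table non-empty']
```

Note that "the backup job completed successfully" appears nowhere in the pass condition. Only a working restore plus passing application checks counts.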

Recovery Strategies by Tier

Tier 1: Backup and Restore (Cold Recovery)

Restore from backups to new infrastructure. Cheapest. Slowest. Appropriate for services with RTOs measured in hours.

Process: Provision new infrastructure (from IaC), restore databases from backups, deploy application code, update DNS or load balancers, verify functionality.

RTO: Hours, depending on data volume and infrastructure provisioning time.

Tier 2: Warm Standby

A scaled-down copy of your production environment runs continuously in a secondary region. Data is replicated asynchronously. During disaster, you scale up the standby and redirect traffic.

Process: Scale standby infrastructure to production capacity, verify data replication is current, redirect traffic, verify functionality.

RTO: 15-60 minutes.

Tier 3: Hot Standby / Active-Active

Full production environments in multiple regions, handling traffic simultaneously. If one region fails, the other absorbs all traffic automatically, which means each region needs enough capacity to carry the full load on its own.

Process: Automatic failover via health checks and load balancing. Manual verification that the failover was clean.

RTO: Minutes or less.

Cost: Roughly 2x your infrastructure cost, plus the engineering complexity of multi-region data consistency. Worth it for services where downtime is measured in lost revenue per minute.
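As a rough starting point for the prioritization conversation, the three tiers can be mapped to RTO ranges. The cutoffs in this sketch are an assumption drawn from the ranges quoted above (under 15 minutes, 15 to 60 minutes, longer), and `suggest_tier` is a hypothetical helper, not a standard.

```python
from datetime import timedelta


def suggest_tier(rto: timedelta) -> str:
    # Cutoffs are illustrative; adjust them to your own measured recovery times.
    if rto < timedelta(minutes=15):
        return "Tier 3: hot standby / active-active"
    if rto <= timedelta(hours=1):
        return "Tier 2: warm standby"
    return "Tier 1: backup and restore"


print(suggest_tier(timedelta(minutes=5)))   # Tier 3: hot standby / active-active
print(suggest_tier(timedelta(minutes=30)))  # Tier 2: warm standby
print(suggest_tier(timedelta(hours=4)))     # Tier 1: backup and restore
```

The output is a suggestion to price out, not a decision. The business still has to agree that the cost of the tier is worth the RTO it buys.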

The Disaster Recovery Drill

A DR plan that isn't practiced is a theory. Theories fail under pressure in ways you can't predict.

Tabletop Exercise

Walk through a disaster scenario as a team. "It's Tuesday at 2 PM. The primary database is corrupted. Walk me through what happens." Each person describes their role, what they'd do first, what tools they'd use, who they'd notify.

This surfaces gaps immediately: "I'd restore from the backup" → "Where is the backup?" → "I think it's in S3" → "Which bucket?" → silence.

Tabletop exercises are cheap (1-2 hours, no infrastructure cost) and revealing. Run them quarterly.

Simulated Recovery

Actually perform a recovery — restore from backup, provision new infrastructure, redirect traffic — in a non-production environment. Time the process. Identify steps that are slower than expected, require access that isn't readily available, or depend on knowledge that only one person has.

Run simulated recoveries semi-annually.
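Timing the drill is easier if each step is wrapped in a small harness. The following is a sketch under obvious assumptions: the step names are made up, and the `time.sleep` calls stand in for real restore, provisioning, and cutover work.

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], None]]


def timed_drill(steps: list[Step]) -> list[tuple[str, float]]:
    # Run each recovery step in order and record how long it took, in seconds.
    timings = []
    for name, step in steps:
        start = time.perf_counter()
        step()
        timings.append((name, time.perf_counter() - start))
    return timings


def slowest_step(timings: list[tuple[str, float]]) -> str:
    return max(timings, key=lambda item: item[1])[0]


def exceeds_rto(timings: list[tuple[str, float]], rto_seconds: float) -> bool:
    # Compare total measured recovery time against the objective.
    return sum(seconds for _, seconds in timings) > rto_seconds


# Placeholder steps; real ones would call restore and provisioning tooling.
drill = timed_drill([
    ("restore database", lambda: time.sleep(0.05)),
    ("redirect traffic", lambda: None),
])
print(slowest_step(drill))  # restore database
```

Comparing the measured total against the RTO turns the objective from a number in a document into something the drill can pass or fail.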

Chaos Testing

Intentionally inject failures into production (in a controlled way) to verify that automated recovery works. Kill a database replica and verify failover. Disable a region and verify traffic redirects. Introduce network latency and verify timeouts and retries behave correctly.

This is the most advanced and most valuable form of DR testing. It verifies not just that recovery is possible, but that it's automatic.
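The core loop of a chaos experiment is small: inject a failure, then assert that the system routed around it. The toy below simulates that loop entirely in memory. The `Replica` class and the routing rule are hypothetical stand-ins, not a real chaos tool or load balancer.

```python
from typing import Optional


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


def pick_target(replicas: list[Replica]) -> Optional[Replica]:
    # Stand-in for health-check-based routing: first healthy replica wins.
    for replica in replicas:
        if replica.healthy:
            return replica
    return None


def kill_and_verify(replicas: list[Replica], victim_name: str) -> bool:
    # Inject the failure...
    for replica in replicas:
        if replica.name == victim_name:
            replica.healthy = False
    # ...then verify traffic lands somewhere else without manual intervention.
    target = pick_target(replicas)
    return target is not None and target.name != victim_name


pool = [Replica("db-1"), Replica("db-2")]
print(kill_and_verify(pool, "db-1"))  # True: traffic fails over to db-2
print(kill_and_verify(pool, "db-2"))  # False: nothing healthy is left
```

The second result is the interesting one. A chaos test that can only ever pass tells you nothing, so the verification step has to be able to report that recovery did not happen.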

The Runbook

Every service with an RTO should have a disaster recovery runbook that covers:

Detection: How do you know a disaster has occurred? What alerts fire?
Assessment: How do you determine the scope and severity?
Communication: Who needs to be notified? Customers? Stakeholders? The on-call team?
Recovery steps: Step-by-step instructions, specific enough that someone who's never done this before can follow them under pressure.
Verification: How do you confirm recovery is complete and the service is healthy?
Post-incident: What happens after recovery? Data reconciliation? Customer communication? Incident analysis?
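A checklist like this is easy to enforce mechanically. The sketch below assumes runbooks are stored as simple section-to-text mappings. The key names mirror the list above but are otherwise a hypothetical convention.

```python
REQUIRED_SECTIONS = (
    "detection",
    "assessment",
    "communication",
    "recovery_steps",
    "verification",
    "post_incident",
)


def missing_sections(runbook: dict[str, str]) -> list[str]:
    # A section counts as missing if it is absent or contains only whitespace.
    return [s for s in REQUIRED_SECTIONS if not runbook.get(s, "").strip()]


draft = {
    "detection": "Pager alert 'db-primary-down' fires.",
    "recovery_steps": "1. Promote the replica. 2. Repoint the application.",
    "verification": "   ",
}
print(missing_sections(draft))
# ['assessment', 'communication', 'verification', 'post_incident']
```

A check like this can run in CI against every service's runbook. It only proves the sections exist, though. Whether the steps are clear enough to follow under pressure is something only a drill will tell you.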

Write the runbook for the person who will read it at 3 AM, stressed, with incomplete information. Clear, specific, and no assumptions about prior knowledge.

Key Takeaway

Disaster recovery that works is defined by measurable objectives (RTO/RPO), follows the 3-2-1 backup rule, is tested regularly through tabletop exercises, simulated recoveries, and chaos testing, and is documented in runbooks written for stressed humans at 3 AM. The investment in DR should scale with the cost of downtime: prioritize your most critical services and ensure the recovery process works before you need it.

This completes the Production-Ready Infrastructure learning path. You've covered infrastructure as code, Terraform state management, CI/CD pipelines, and disaster recovery. The throughline: production-ready infrastructure is infrastructure you can rebuild, recover, and trust — because you've tested it before it mattered.

ShiftQuality