Incident Command for Software Teams

Contributor
Jun 23, 2025

It is 2 AM. Your monitoring dashboard lights up. Customer complaints are flooding in. Your payment processing is down. Three engineers are in a Slack channel, all investigating different theories. Nobody knows who is making decisions. Somebody restarts a service without telling anyone, which breaks something else. The CEO is asking for updates. Nobody has updates.

This is what incidents look like without structure. It does not matter how talented your engineers are — under pressure, without a framework, smart people make uncoordinated decisions that make things worse.

Incident command is a structured approach to managing incidents that originated in emergency services — firefighting, specifically — and has been adapted for software teams. It gives every person a clear role, establishes communication channels, and ensures that someone is always making decisions.

The Incident Command System Adapted for Software

The traditional Incident Command System (ICS) was developed in the 1970s, after a series of wildfire management failures in California. The core insight was that disasters are chaotic not because the problems are unsolvable, but because the people responding are uncoordinated. Structure solves this.

For software teams, the system simplifies to three core roles:

Incident Commander (IC)

The IC is the decision-maker. They do not debug. They do not write code. They do not investigate. They coordinate.

The IC's job is to understand the current state of the incident, assign tasks to responders, make decisions when the team is stuck, communicate status to stakeholders, and decide when the incident is resolved.

This is the hardest role to fill because it requires someone who can resist the urge to dive into the technical details. Good ICs are often senior engineers or engineering managers who trust their team to do the technical work while they manage the process.

Communications Lead

The communications lead handles all communication outside the response team. Status page updates, stakeholder notifications, customer-facing messages, and executive updates all flow through this role.

Separating communication from investigation is critical. When the person debugging the database is also writing the status page update, both tasks suffer. The communications lead takes that burden off the technical responders.

Subject Matter Experts (SMEs)

These are the engineers doing the actual investigation and remediation. They report to the IC, work on assigned tasks, and communicate their findings through the established channels.

SMEs should be selected based on expertise relevant to the incident — the database engineer for database issues, the networking specialist for network issues. The IC assigns and reassigns SMEs as the incident evolves.

Severity Levels That Mean Something

Most organizations have severity levels. Most severity level definitions are vague enough to be useless. "Major impact on customers" could describe anything from a slow page load to complete data loss.

Effective severity levels are specific and actionable:

SEV-1 (Critical): Complete loss of a core service affecting all or most customers. Revenue impact is occurring. Data loss is possible or confirmed. Response: all-hands, wake people up, executive notification immediate.

SEV-2 (High): Significant degradation of a core service or complete loss of a non-core service. Workarounds may exist but are inadequate. Response: on-call team plus relevant SMEs, stakeholder notification within 30 minutes.

SEV-3 (Medium): Partial degradation that affects a subset of customers or a single feature. Workarounds are available. Response: on-call team during business hours, stakeholder notification within 2 hours.

SEV-4 (Low): Minor issue with minimal customer impact. No urgency. Response: normal ticket workflow.

The key is that severity determines response, not just urgency. A SEV-1 has a different staffing model, communication cadence, and escalation path than a SEV-3. When you declare a severity, everyone should know exactly what happens next.
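One way to make "severity determines response" concrete is to store the definitions as data that tooling can look up. The sketch below is a minimal illustration, not a prescribed implementation: the `ResponsePolicy` class, its field names, and the `policy_for` helper are hypothetical, and the values are simply transcribed from the four levels above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePolicy:
    """What declaring a severity triggers. Field names are illustrative."""
    responders: str
    # Minutes allowed before the first notification goes out.
    # 0 means immediate; None means no proactive notification.
    notify_within_minutes: Optional[int]


# Values transcribed from the SEV-1 to SEV-4 definitions in the text.
POLICIES = {
    "SEV-1": ResponsePolicy("all hands, wake people up", 0),
    "SEV-2": ResponsePolicy("on-call team plus relevant SMEs", 30),
    "SEV-3": ResponsePolicy("on-call team during business hours", 120),
    "SEV-4": ResponsePolicy("normal ticket workflow", None),
}


def policy_for(severity: str) -> ResponsePolicy:
    """Return the response policy for a declared severity.

    Unknown levels raise instead of silently falling back, so a typo
    during an incident is caught immediately.
    """
    try:
        return POLICIES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Keeping the table in one place means the paging tool, the status page, and the runbook can all read the same definitions instead of drifting apart.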

The First 15 Minutes

The first 15 minutes of an incident set the tone for everything that follows. Here is a playbook:

Minutes 0-2: Acknowledge and assess. The person who detects the issue acknowledges it in the incident channel. Initial assessment: what is broken, how many customers are affected, is it getting worse?

Minutes 2-5: Declare severity and assign IC. Based on the initial assessment, declare a severity level and assign an incident commander. If the severity is unclear, default to the higher level; you can always downgrade. The IC takes over coordination from this point.

Minutes 5-10: Assemble the team. The IC identifies which SMEs are needed and pages them. The communications lead begins drafting the initial stakeholder notification.

Minutes 10-15: Establish the investigation plan. The IC works with SMEs to identify the most likely causes and assigns investigation tracks. Each SME reports back with findings on a regular cadence (every 10-15 minutes for SEV-1).

This playbook should be documented, rehearsed, and accessible to everyone who might respond to an incident. When you are woken up at 2 AM, you should not have to remember the process — you should be able to follow a checklist.
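A checklist is easier to follow at 2 AM if a bot or script can tell you which step you should be on. The following is a rough sketch under that assumption; the `Phase` class and `current_phase` function are hypothetical names, and the phase boundaries are copied from the playbook above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    start_minute: int   # inclusive
    end_minute: int     # exclusive
    title: str


# Phase boundaries and titles taken from the first-15-minutes playbook.
PLAYBOOK = [
    Phase(0, 2, "Acknowledge and assess"),
    Phase(2, 5, "Declare severity and assign IC"),
    Phase(5, 10, "Assemble the team"),
    Phase(10, 15, "Establish the investigation plan"),
]


def current_phase(minutes_elapsed: float) -> Optional[Phase]:
    """Return the playbook phase for the elapsed time, or None once
    the first 15 minutes are over."""
    for phase in PLAYBOOK:
        if phase.start_minute <= minutes_elapsed < phase.end_minute:
            return phase
    return None
```

For example, `current_phase(3).title` gives "Declare severity and assign IC", which a chat bot could post as a reminder if no severity has been declared yet.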

Communication Templates

Under stress, writing clear communication is hard. Templates help.

Internal Status Update

Incident: [brief description]
Severity: [SEV level]
Status: [investigating / identified / mitigating / resolved]
Impact: [who is affected and how]
Current theory: [what we think is wrong]
Next steps: [what we are doing about it]
Next update: [when]

Customer-Facing Update

We are aware of an issue affecting [service/feature].
[X]% of customers may experience [specific symptom].
Our team is actively investigating.
We will provide an update by [time].

Executive Update

What is happening: [one sentence]
Customer impact: [scope and severity]
Business impact: [revenue, reputation, contractual]
Current status: [what is being done]
Estimated resolution: [honest assessment or "investigating"]

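Keep updates factual. Do not present speculation as fact: the "Current theory" field in the internal update is a working hypothesis and should read like one. Do not promise timelines you cannot keep. "We are investigating and will update in 30 minutes" is better than "We expect this to be resolved in an hour" when you do not actually know.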
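Templates can also be filled in by code, which lets you reject an update with a missing field before it is posted. Here is a minimal sketch for the internal status update using Python's standard `string.Template`; the function name `render_internal_update` and the keyword names are assumptions made for illustration.

```python
from string import Template

# Mirrors the internal status update template from the text.
INTERNAL_UPDATE = Template(
    "Incident: $description\n"
    "Severity: $severity\n"
    "Status: $status\n"
    "Impact: $impact\n"
    "Current theory: $theory\n"
    "Next steps: $next_steps\n"
    "Next update: $next_update"
)

# The four status values listed in the template.
ALLOWED_STATUSES = {"investigating", "identified", "mitigating", "resolved"}


def render_internal_update(**fields: str) -> str:
    """Fill in the internal update template.

    Raises ValueError for an unrecognized status and KeyError if any
    template field is missing, so incomplete updates are never sent.
    """
    if fields.get("status") not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    return INTERNAL_UPDATE.substitute(fields)
```

The same pattern would work for the customer-facing and executive templates; only the field list changes.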

Escalation Paths

Every incident response plan needs clear escalation paths. When does the on-call engineer wake up the team lead? When does the team lead involve the VP of Engineering? When does someone notify the CEO?

Escalation should be based on severity and duration, not on the responder's comfort level. Define triggers:

SEV-1 not mitigated within 30 minutes: escalate to engineering leadership
SEV-1 with confirmed data loss: immediate executive notification
Any incident lasting more than 2 hours: leadership briefing
Any incident with external regulatory implications: legal notification

Escalation is not failure. It is the system working correctly. Create a culture where escalating early is rewarded, not penalized.
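Triggers like the ones above are simple enough to encode, which takes the responder's comfort level out of the decision entirely. The sketch below assumes a hypothetical `IncidentState` record and an `escalations_due` function; the thresholds are the four triggers from the list, and "within 30 minutes" is read as "30 minutes have passed without mitigation".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    """Snapshot of an open incident. Field names are illustrative."""
    severity: str                      # "SEV-1" through "SEV-4"
    minutes_open: int
    mitigated: bool = False
    data_loss_confirmed: bool = False
    regulatory_implications: bool = False


def escalations_due(state: IncidentState) -> List[str]:
    """Return every escalation the listed triggers call for right now."""
    due = []
    is_sev1 = state.severity == "SEV-1"
    if is_sev1 and not state.mitigated and state.minutes_open >= 30:
        due.append("escalate to engineering leadership")
    if is_sev1 and state.data_loss_confirmed:
        due.append("immediate executive notification")
    if state.minutes_open > 120:
        due.append("leadership briefing")
    if state.regulatory_implications:
        due.append("legal notification")
    return due
```

Running this check on a timer during an incident turns escalation into something the system prompts for, rather than something an individual has to decide to do.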

Post-Incident Process

The incident is resolved. Customers are back to normal. Now what?

The Post-Incident Review

Within 48 hours of resolution, conduct a post-incident review (sometimes called a postmortem, though many teams are moving away from that term). The review should cover:

What happened: timeline of events
Why it happened: root causes and contributing factors
What we did well: response actions that helped
What we could improve: response actions that hindered
Action items: specific changes to prevent recurrence

The review must be blameless. The goal is understanding the system failures that made the incident possible, not identifying the person who made a mistake. If someone deployed bad code, the question is not "who approved it" but "why did our deployment process allow it."

Action Items That Actually Get Done

The post-incident review is only valuable if the action items are completed. Assign each action item to a specific person with a specific deadline. Track completion. Report on post-incident action item completion rates — if they are low, your incident process is generating lessons that are not being learned.
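Tracking completion does not require special tooling. As a rough sketch, assuming action items are kept as simple records with an owner and a deadline, the completion rate and the overdue list can be computed as follows. The `ActionItem` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: str                           # one specific person, not a team
    due: date                            # one specific deadline
    completed_on: Optional[date] = None  # None while still open


def completion_rate(items: List[ActionItem]) -> float:
    """Fraction of action items that have been completed (0.0 to 1.0)."""
    if not items:
        raise ValueError("no action items to report on")
    done = sum(1 for item in items if item.completed_on is not None)
    return done / len(items)


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose deadline has already passed."""
    return [
        item for item in items
        if item.completed_on is None and item.due < today
    ]
```

Reporting both numbers together is useful: the rate shows the trend, and the overdue list shows who to follow up with.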

Practice Through Game Days

You do not want the first time your team uses incident command to be during an actual incident. Game days — planned exercises that simulate incidents — build muscle memory.

A game day does not need to be elaborate. Inject a realistic failure into a non-production environment. Declare an incident. Run the full process: assign roles, investigate, communicate, resolve. Then debrief on what worked and what did not.

Run game days quarterly. Rotate who plays which role — everyone should experience being IC at least once. Include scenarios that require escalation, cross-team coordination, and difficult communication decisions.
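Role rotation is easy to get wrong by memory, so it can help to generate it. The sketch below is one simple way to do it, assuming a fixed roster and a quarter counter: shifting the roster by one position each quarter guarantees everyone is IC once per full cycle. The `rotate_roles` function and the role keys are hypothetical.

```python
from typing import Dict, List, Union


def rotate_roles(
    people: List[str], quarter_index: int
) -> Dict[str, Union[str, List[str]]]:
    """Assign game day roles by shifting the roster one position per quarter.

    The first person after the shift is IC, the second is communications
    lead, and everyone else is an SME.
    """
    if len(people) < 3:
        raise ValueError("need at least three people to fill all roles")
    shift = quarter_index % len(people)
    order = people[shift:] + people[:shift]
    return {
        "incident_commander": order[0],
        "communications_lead": order[1],
        "smes": order[2:],
    }
```

With a roster of four, four consecutive quarters produce four different incident commanders before the cycle repeats.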

The teams that handle real incidents best are the teams that practice. There is no substitute for experience, but simulated experience is far better than none.

Building the Culture

Incident command is a framework, but it only works if the culture supports it. That means respecting the IC's authority during incidents. It means making post-incident reviews genuinely blameless. It means treating incidents as learning opportunities rather than failures. It means investing time in game days even when there is pressure to ship features.

The organizations that handle incidents well are not the ones with the best engineers — they are the ones with the best processes. And the best processes are the ones that are practiced, documented, and continuously improved.

ShiftQuality