AI Incident Response: When Your Model Does Something Unexpected

Contributor
Jan 29
6 min read

Updated: Jun 22

The previous post in this path covered AI governance for engineering teams — the policies, processes, and oversight structures that govern AI development. This post covers the operational reality that governance must prepare for: AI incidents — the moments when a model produces output that causes harm, violates policy, or behaves in ways that were not anticipated during development.

Traditional software has well-established incident response practices. The server goes down, the on-call engineer is paged, the runbook is followed, the service is restored. AI incidents are different. The model did not crash. It returned a response. The response was technically valid — correct format, no errors — but it was wrong, harmful, biased, or misleading in a way that the system's monitoring did not detect.

AI incident response requires its own playbooks, its own detection mechanisms, and its own organizational structure. Bolting AI incidents onto existing incident response is like bolting airplane safety onto automobile regulations — the failure modes are different enough that the response must be too.

What Constitutes an AI Incident

Not every bad output is an incident. Models produce imperfect outputs regularly — that is the nature of probabilistic systems. An AI incident is an event where model behavior causes or risks causing meaningful harm: to users, to the organization, or to third parties.

Categories of AI incidents include: discriminatory outputs (the model treats demographic groups differently in ways that cause material harm), hallucinated facts (the model presents fabricated information as truth, leading to decisions based on false data), privacy violations (the model reveals personal information from training data or context), safety violations (the model produces content that could cause physical harm), and compliance violations (the model produces output that violates regulatory requirements).

The boundary between "bad output" and "incident" should be defined before an incident occurs, not during one. The governance framework should specify thresholds: what level of harm, what scope of impact, and what categories of failure constitute an incident requiring formal response versus a quality issue handled through normal channels.

Detection: Finding Incidents Before Users Report Them

The most dangerous AI incidents are the ones nobody notices. A recommendation system that gradually shifts toward biased recommendations over months causes harm long before anyone connects the pattern to the model. A chatbot that occasionally leaks PII might do so hundreds of times before a user reports it.

Detection requires multiple layers. Automated classifiers scan model outputs for known categories of problematic content — toxicity, PII, hallucination indicators, bias signals. Statistical monitoring tracks output distributions over time and alerts when patterns shift — if the model's response characteristics change significantly, something has changed that warrants investigation.

User feedback is a detection mechanism, not the only one. Users report the incidents they notice and find objectionable enough to complain about. They do not report the incidents they do not notice — the subtle biases, the plausible-sounding hallucinations, the recommendations that seem reasonable but are based on flawed logic.

Red team exercises — deliberate attempts to elicit problematic behavior — should run regularly against production systems. The adversarial prompts that worked six months ago may have been fixed, but new failure modes emerge as the model, the prompts, and the data evolve.

The Response Framework

When an AI incident is detected, the response follows a structured framework that balances speed with thoroughness.

Triage. Assess the severity and scope. How many users were affected? Is the incident ongoing? What is the potential for harm if the model continues operating? Severity determines the response level — a single bad output to a single user requires a different response than a systematic bias affecting thousands of users.

Containment. Stop the harm. This might mean rolling back to a previous model version, activating a more restrictive content filter, routing affected queries to human review, or in extreme cases, taking the model offline. The containment action should be proportional to the severity — taking a revenue-critical model offline has business consequences, and those consequences should be weighed against the ongoing harm.

Investigation. Understand why the incident occurred. What input triggered the problematic output? Is this a systematic failure or a rare edge case? Was the failure foreseeable given the model's known limitations? Did the existing monitoring miss it, and if so, why?

The investigation must go beyond "the model did X." It must explain the causal chain: what properties of the training data, the prompt, the retrieval context, or the model architecture contributed to the failure. This causal understanding is what prevents recurrence.

Remediation. Fix the root cause. This might involve retraining with corrected data, modifying prompts, adding output filters, adjusting retrieval logic, or implementing new monitoring for the failure mode. The remediation should be tested — ideally against evaluation cases that reproduce the incident — before it reaches production.

Communication. Inform affected stakeholders. This might include users who received harmful output, internal teams that depend on the model, regulatory bodies if compliance was violated, or the public if the incident was visible. The communication should be transparent about what happened, what harm occurred, and what steps were taken.

Building the Incident Playbook

AI incident playbooks differ from traditional incident playbooks in several ways.

Rollback is not always straightforward. Rolling back a model version means returning to a previous version's behaviors — which might include different bugs or limitations. The "last known good" state for an AI system is harder to define than for traditional software.

Root cause analysis is probabilistic. Traditional software bugs have deterministic causes: a specific code path produces a specific error. AI incidents often have probabilistic causes: a combination of input features, model weights, and context that produces problematic output with some probability. The root cause might be "the training data contains biased patterns that the model learned" — which is not fixable with a code change.

Testing the fix is uncertain. After remediating a traditional bug, you can write a test that definitively proves the bug is fixed. After remediating an AI incident, you can test that the specific triggering input no longer produces the problematic output — but you cannot guarantee that similar inputs will not trigger similar failures.

The playbook should account for these differences. It should include escalation criteria (when to involve legal, PR, or executive leadership), communication templates (pre-approved language for different incident severities), and a decision matrix for containment actions (what level of degradation is acceptable to prevent what level of harm).

Post-Incident Learning

Every AI incident is a learning opportunity — about the model's limitations, about the monitoring system's blind spots, and about the organization's preparedness.

The post-incident review should produce three outputs. A timeline of what happened, from first occurrence to detection to resolution. An analysis of what monitoring or evaluation gaps allowed the incident to occur or persist. And a set of action items — specific, assigned, and tracked — that reduce the probability or impact of similar incidents.

Post-incident reviews should be blameless. The engineer who deployed the model change that triggered the incident was operating within the process the organization provided. If the process allowed a harmful model to reach production, the process is the root cause — not the engineer.

The action items from post-incident reviews should feed into the governance framework. If an incident reveals that the pre-deployment evaluation did not test for a category of harm, the evaluation suite should be expanded. If monitoring missed a pattern, new monitoring rules should be added. The governance framework evolves through incident learning.

The Takeaway

AI incidents are inevitable in production AI systems. The model will produce unexpected, harmful, or incorrect output. The question is not whether this will happen but whether the organization is prepared to detect it quickly, contain it effectively, investigate it thoroughly, and learn from it systematically.

An AI incident response capability is not optional for organizations operating AI in production. It is the operational expression of responsible AI — the mechanism that turns principles into action when principles are tested by reality.

Next in the "AI Governance at Scale" learning path: We'll cover continuous AI auditing — building the ongoing evaluation and monitoring practices that catch governance failures before they become incidents.

ShiftQuality