About this article
As the twelfth installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains incident response.
Incidents always happen. Designs that merely pray they won't will crumble when they do. This article covers the sequence of detection, alerting, notification, response, recovery, and review, along with the on-call regime, severity definitions, and postmortem culture (including the 2017 GitLab textbook case): the systematization of noticing fast and recovering quickly.
What incident response is
Incident response is the activity of responding to system failures and abnormal events and recovering quickly. Beyond merely "fixing," it designs the whole sequence from detection through review. On the premise that failures will always occur, the goal is to systematize how fast you notice and how quickly you recover.
Good incident response doesn't depend on individual heroics. Processes, tools, and drills let whoever is on duty operate at consistent quality, turning each incident into a learning cycle so the same one never happens twice.
Incidents happen. "How many minutes to notice, how many minutes to recover" is a matter of operational skill.
Why incident response is needed
Response delay expands damage
In a SaaS product where one minute of downtime affects tens of thousands of users, every minute of response delay translates directly into millions in losses. Fast detection and response defend company value.
Key-person dependency destroys reproducibility
"The veteran fixed it overnight"-style response collapses the moment that person is unavailable. Mechanisms that let anyone respond at the same quality are needed.
Recurrence prevention is the biggest value
The true purpose of incident response is recurrence prevention. Whether you can feed the lessons of one incident back into system improvement is the dividing line.
Incident-response phases
Design incident response in six phases. With clear owners and procedures for each phase, key-person dependency decreases.
```mermaid
flowchart LR
    D[Detection<br/>monitoring/alerts] --> T[Triage<br/>severity judgment]
    T --> R[Response<br/>mitigation/recovery]
    R --> C[Communication<br/>stakeholders]
    C --> RES[Resolution<br/>root-cause removal]
    RES --> PM[Post-mortem<br/>review/recurrence prevention]
    PM -.->|reflect learning| D
    classDef detect fill:#fee2e2,stroke:#dc2626;
    classDef respond fill:#fef3c7,stroke:#d97706;
    classDef comm fill:#dbeafe,stroke:#2563eb;
    classDef recover fill:#dcfce7,stroke:#16a34a;
    classDef learn fill:#fae8ff,stroke:#a21caf;
    class D,T detect;
    class R respond;
    class C comm;
    class RES recover;
    class PM learn;
```
| Phase | Content |
|---|---|
| Detection | Notice via monitoring / alerts |
| Triage | Judge severity / impact range |
| Response | Mitigation / recovery response |
| Communication | Stakeholder situation transmission |
| Resolution | Remove root cause |
| Post-mortem | Review and recurrence prevention |
Severity levels
Grade incident severity in stages and vary the response regime accordingly. Responding to every incident at the same intensity exhausts the team, so per-severity escalation rules are required.
| Level | Content | Response |
|---|---|---|
| SEV 1 | Total stop / major data loss | All hands, 24-hour response |
| SEV 2 | Major-feature failure | On-call + SRE immediate response |
| SEV 3 | Partial-feature failure | Respond within business hours |
| SEV 4 | Minor defect | Normal backlog |
Without severity criteria documented in advance, judgment varies from one responder to the next and chaos ensues.
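One way to make the criteria unambiguous is to encode them. A minimal Python sketch, where the inputs and the 10% user-impact threshold are illustrative assumptions rather than industry standards:

```python
from enum import IntEnum

class Severity(IntEnum):
    """SEV levels from the table above; lower number = more severe."""
    SEV1 = 1  # total stop / major data loss
    SEV2 = 2  # major-feature failure
    SEV3 = 3  # partial-feature failure
    SEV4 = 4  # minor defect

def classify(total_outage: bool, data_loss: bool,
             major_feature_down: bool, users_affected_pct: float) -> Severity:
    """Toy triage rule; thresholds are illustrative, not industry standards."""
    if total_outage or data_loss:
        return Severity.SEV1
    if major_feature_down or users_affected_pct >= 10.0:
        return Severity.SEV2
    if users_affected_pct > 0.0:
        return Severity.SEV3
    return Severity.SEV4

print(classify(False, False, True, 25.0))  # -> Severity.SEV2
```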
On-call regime
On-call works like a fire-station rotation: define a schedule that covers 24/7, distribute the load through rotation, and avoid concentrating it on one person. It is the standard operating model for SRE organizations and gets a central chapter in Google's SRE book.
| Design item | Content |
|---|---|
| Rotation | Weekly / bi-weekly is standard |
| Primary / Secondary | 2-stage backup |
| SLA | Time limit for first response (minutes) |
| Allowance | Compensation for nights / holidays |
| Handoff | Handover at shift change |
On-call is a heavy burden, so a continuous SRE effort is lowering paging frequency through alert reduction and auto-recovery. Two or more late-night pages per month is a sign of overload.
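As a sketch of how simple the scheduling mechanics can be, here is a weekly primary/secondary rotation generator; pairing each primary with the next person in the list is one common scheme, assumed here:

```python
from datetime import date, timedelta

def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) for a weekly rotation.
    The next person in the list backs up the current primary --
    one common pairing scheme, assumed here."""
    for i in range(weeks):
        yield (start + timedelta(weeks=i),
               engineers[i % len(engineers)],          # primary
               engineers[(i + 1) % len(engineers)])    # secondary

for week, primary, secondary in weekly_rotation(
        ["akira", "blair", "chen", "dana"], date(2026, 4, 6), 4):
    print(f"{week}: primary={primary}, secondary={secondary}")
```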
Promotion from alert to incident
Not every alert is an incident. Many auto-recover or can wait for business hours. You need an explicit design for discerning which alerts truly deserve to become incidents.
| Situation | Response |
|---|---|
| Auto-recovery | Just record the alert |
| In-business-hours response possible | Create ticket |
| Immediate response needed | Declare incident |
| Severe damage | Major incident, all hands |
Incident declaration is an explicit act: once declared, the switch flips to "stop normal work and concentrate on the incident."
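The promotion table above can be made executable. A minimal sketch; the boolean inputs are illustrative, since real triage would inspect alert metadata:

```python
def handle_alert(auto_recovered: bool,
                 can_wait_for_business_hours: bool,
                 severe_damage: bool) -> str:
    """Map the promotion table to one explicit decision per alert."""
    if auto_recovered:
        return "record only"
    if severe_damage:
        return "declare MAJOR incident (all hands)"
    if not can_wait_for_business_hours:
        return "declare incident"
    return "create ticket"

print(handle_alert(auto_recovered=False,
                   can_wait_for_business_hours=False,
                   severe_damage=False))  # -> "declare incident"
```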
Command center (war room)
For major incidents (SEV 1-2), stand up an Incident Command Center (ICC) where everyone gathers. Splitting roles enables parallel work and makes "what should I be doing" clear to every participant.
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Overall command / decision |
| Operations Lead | Technical response |
| Communications Lead | Customer / internal communication |
| Scribe | Timeline recording |
It is important that the IC stays out of hands-on technical work, dedicating themselves to grasping the overall situation and making decisions. Delegating technical investigation to the Operations Lead keeps the division of labor efficient.
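A small sketch of enforcing that separation at declaration time; the function and names are hypothetical:

```python
# ICC role assignment at declaration time; function and names are hypothetical.
ICC_ROLES = ("incident_commander", "operations_lead",
             "communications_lead", "scribe")

def assign_roles(responders: list[str]) -> dict[str, str]:
    """Give each ICC role to a distinct person, so the IC never
    doubles as the hands-on technical responder."""
    if len(responders) < len(ICC_ROLES):
        raise ValueError("a SEV 1-2 command center needs at least 4 people")
    return dict(zip(ICC_ROLES, responders))

print(assign_roles(["akira", "blair", "chen", "dana"]))
```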
Notification and communication
Communication during an incident is as important as the response itself. Without deciding in advance who communicates what, to whom, and when, customers, management, and responders end up with inconsistent information.
| Recipient | Content | Channel |
|---|---|---|
| Response team | Tech details / progress | Slack channel |
| Management | Impact scope / ETA (estimated recovery time) | Email / Slack |
| Customers | Situation / recovery target | Status page |
| Support | FAQ / response template | Internal wiki |
Atlassian Statuspage (formerly Statuspage.io) and similar SaaS products provide customer-facing status announcements and are important tools for maintaining trust during incidents.
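A minimal sketch of fan-out routing per the table above; the channels are placeholders, and real senders would wrap the Slack, email, and status-page APIs:

```python
# Fan-out routing per the communication table. Channels are placeholders;
# real senders would wrap the Slack, email, and status-page APIs.
ROUTES = {
    "response team": ("slack #incident channel", "tech details / progress"),
    "management":    ("email + slack",           "impact scope / recovery ETA"),
    "customers":     ("status page",             "situation / recovery target"),
    "support":       ("internal wiki",           "FAQ / response template"),
}

def broadcast(update: str) -> None:
    """Send one update to every audience, tailored to its channel and scope."""
    for audience, (channel, scope) in ROUTES.items():
        print(f"[{channel}] -> {audience} ({scope}): {update}")

broadcast("DB failover complete, monitoring error rates")
```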
Postmortems
Once an incident is resolved, always conduct a review; this is an iron rule of Google SRE. Document what happened, why it happened, and how to improve, turning the incident into an organizational asset.
| Item | Content |
|---|---|
| Overview | What, when, how much impact |
| Timeline | Time-of-day events |
| Root cause | Why it happened |
| Mitigation | Contents of first aid |
| Recurrence prevention | Permanent countermeasure |
| Improvement actions | Who by when |
The cardinal principle: don't blame people (the blameless postmortem). Treat failures as flaws in systems and processes, not as individual fault.
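A postmortem template kept in code makes the review a fill-in exercise rather than a blank page. A minimal sketch mirroring the items above; the layout is one possible format, not a prescribed standard:

```python
# Skeleton generator mirroring the postmortem items above.
POSTMORTEM_TEMPLATE = """\
# Postmortem: {title}

## Overview               <!-- what, when, how much impact -->
## Timeline               <!-- time-of-day events, from the scribe's record -->
## Root cause             <!-- why it happened -->
## Mitigation             <!-- contents of the first aid -->
## Recurrence prevention  <!-- permanent countermeasure -->
## Improvement actions    <!-- who, by when -->
"""

def new_postmortem(title: str) -> str:
    return POSTMORTEM_TEMPLATE.format(title=title)

print(new_postmortem("2026-04-01 DB connection exhaustion"))
```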
Root cause analysis (RCA)
RCA (Root Cause Analysis) is the technique of digging past surface causes to the true cause. The "5 Whys" method of asking "why" five times is famous, though in complex systems the reality is usually a combination of factors rather than a single cause.
```
Symptom: DB went down
└─ Why? Connections overflowed
   └─ Why? Connection leak in a new feature
      └─ Why? Not caught in code review
         └─ Why? Review checklist has no "connection management" item
            └─ Why? The review guide stayed outdated
```
The conclusion: the true cause is a flaw in the review process, not a "code bug," and that flaw becomes the improvement target.
Runbooks
Runbooks are documented procedures for typical incident responses. Writing down "if this alert fires, respond this way" lets anyone respond at the same quality.
| Runbook item | Content |
|---|---|
| Firing condition | Which alert |
| Initial check | What to verify |
| Diagnosis steps | Triage flow |
| Recovery steps | Commands to execute |
| Escalation criteria | When to whom |
Standard practice is to keep them in Notion, Confluence, or a Git repository, linked directly from the alerts that trigger them.
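"Runbook as Code" can start as small as a structured record. A minimal Python sketch; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class Runbook:
    """Runbook-as-code sketch; field names are illustrative, not a standard."""
    trigger: str                 # which alert fires this runbook
    initial_checks: list[str]    # what to verify first
    diagnosis: list[str]         # triage flow
    recovery_steps: list[str]    # commands to execute
    escalate_when: str           # when to page whom

db_pool_runbook = Runbook(
    trigger="db_connection_pool_exhausted",
    initial_checks=["check active connection count", "check recent deploys"],
    diagnosis=["identify which service is leaking connections"],
    recovery_steps=["kubectl rollout undo deployment/api"],
    escalate_when="no recovery within 15 min -> page the secondary on-call",
)
```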
Decision criterion 1: org scale
The appropriate weight of an incident-response process varies with organization size. Small organizations need simplicity; large ones need a division-of-labor regime.
| Scale | Recommended |
|---|---|
| Startup | Slack + PagerDuty + Google Docs |
| Mid-size enterprise | PagerDuty + Statuspage + Notion |
| Large enterprise | ServiceNow / Jira Service Management |
| Global | Follow-the-Sun (relay daytime hours across regions) on-call regime |
Decision criterion 2: SLA strictness
Stricter customer SLA means stricter incident-response SLA.
| Customer SLA | Response regime |
|---|---|
| 99% | Business-hours response is enough |
| 99.9% | Night on-call needed |
| 99.95%+ | 24/7 dedicated SRE team |
| 99.99%+ | Per-region on-call + multi-layer SRE |
How to choose by case
Personal dev / small in-house tools
UptimeRobot + email/Slack notifications are enough. No on-call regime is needed; business-hours response is fine. Keep a simple Runbook in Notion to capture personal knowledge and prevent key-person dependency.
Startup / growth-stage SaaS
PagerDuty + Statuspage.io + a dedicated Slack channel. Run a weekly on-call rotation of 2-3 engineers and blameless postmortems in Google Docs. Narrow the SEV criteria to three levels (SEV 1/2/3) for simplicity.
Mid-size enterprise / microservices ops
PagerDuty / Opsgenie + Runbooks as code + game-day drills. Separate on-call per service, manage Runbooks in Git with PR review, and run quarterly incident drills. A dedicated SRE team of 2-5 people coordinates the whole.
Finance / medical / global enterprise
ServiceNow / Jira Service Management + Follow-the-Sun + AIOps (AI for IT Operations). SEV 1 incidents auto-notify all the way up to management, region-based on-call teams (Tokyo, Europe, North America) hand off across time zones, and a regulator-reporting process is built in. Automate routine response with tools such as Resolve AI.
Common misconceptions
Veteran fixing is faster
The work style where the person most familiar with the system fixes everything overnight looks fastest in the short term. But field stories repeat the same pattern: an incident with identical symptoms strikes the week after that person resigns, nobody can dig in, and the first customer response takes three hours. Key-person dependency is the work style with the highest cost when it breaks; the lesson is that mechanisms letting anyone follow the same procedures are stronger in the long run.
Hunt for blame in postmortems
The worst antipattern. Blameless is the rule. Once blame-hunting starts, no one speaks honestly anymore.
Do all recurrence-prevention measures
Resources never suffice. Set priorities, starting from high impact and high recurrence rate.
SEV 1 rarely happens
Without periodic drills, you won't be able to move when it does. Practice through game days.
Incident-response numerical gates / SLA
Note: industry baseline values as of April 2026. They will become outdated as technology and the talent market shift, so update them periodically.
Incident response doesn't function without numerically defining "what to do within how many minutes." Below are industry-standard SLAs.
| Metric | SEV 1 | SEV 2 | SEV 3 | SEV 4 |
|---|---|---|---|---|
| First response (MTTA) | Within 5 min | Within 15 min | Within 1 hour | Next business day |
| Recovery (MTTR) target | Within 1 hour | Within 4 hours | Within 1 day | Within 1 week |
| Notification channel | PagerDuty + phone | PagerDuty | Slack | Jira |
| Escalation | Immediate + management | IC + OpsLead | On-call | In business hours |
| Postmortem | Required (within 1 week) | Required (within 2 weeks) | Optional | Unneeded |
| Status page | Immediate update | Update | As needed | Unneeded |
| Recurrence prevention | Company-wide rollout | Team rollout | Within team | - |
For on-call health metrics: two or more late-night pages per month is a sign of overload; a false-positive rate above 50% of fired alerts calls for an alert review; a recurrence rate over 10% means revisiting postmortem quality. With the four golden signals from Google's SRE book (Latency / Traffic / Errors / Saturation) as a basis, pre-define which alerts fire at which SEV.
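To check these gates against reality, MTTA and MTTR can be computed straight from incident records. A minimal sketch with illustrative data:

```python
from datetime import datetime
from statistics import mean

# Illustrative incident records: (detected, acknowledged, resolved).
incidents = [
    (datetime(2026, 4, 1, 3, 0),  datetime(2026, 4, 1, 3, 4),
     datetime(2026, 4, 1, 3, 50)),
    (datetime(2026, 4, 9, 14, 0), datetime(2026, 4, 9, 14, 10),
     datetime(2026, 4, 9, 16, 0)),
]

mtta = mean((ack - det).total_seconds() / 60 for det, ack, _ in incidents)
mttr = mean((res - det).total_seconds() / 60 for det, _, res in incidents)
print(f"MTTA: {mtta:.0f} min, MTTR: {mttr:.0f} min")  # compare against SEV gates
```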
First response within 5 minutes is reliability's lifeline. Systematize it via Runbooks and the on-call regime.
Incident-response pitfalls and forbidden moves
Typical failure patterns in incident response. Every one of them ends up raising the recurrence rate or exhausting the organization.
| Forbidden move | Why itâs bad |
|---|---|
| Hunt for blame in postmortems | Hotbed of info-hiding, next incidents become invisible. Blameless is the rule |
| Leave everything to one veteran | Collapse on resignation, the 2019 US-major-bank insider-fraud pattern |
| Operate without on-call SLA | First response becomes hours, customer churn |
| Manage Runbooks in Word/PDF | Key-person dependent, irreproducible, and AI can't execute them |
| Leave SEV criteria to field judgment | Severity varies, response regime confused |
| Vague incident-declaration criteria | Either all-hands on every alert or missing major incidents - both extremes |
| Donât update status page | Customer-inquiry flood, trust collapse |
| No game-day drills | Can't move when SEV 1 occurs; the team learns during its first real response |
| Postmortem written and done | Action items unimplemented, same incident recurs 3 months later |
| IC also does technical investigation | Can't grasp the overall picture; response is delayed |
The January 31, 2017 GitLab database-deletion incident (an engineer ran rm -rf on the production database while intending to target a secondary host, and all five backup mechanisms turned out to be failing) is a case praised for its stance of learning without hiding: recovery work was livestreamed and a detailed postmortem was published. In contrast, Uber's 2016 data breach (initially concealed, with $100k paid to the attackers for silence, followed by a $148M settlement) showed the cost of concealment.
Incidents are a given. Preparation to absorb them through mechanism and culture is everything.
AI-era perspective
With AI-driven development (vibe coding) and AI usage as the premise, incident response is evolving into an area where AI agents handle the first response. Datadog Bits AI, PagerDuty AIOps, Resolve AI, and similar tools automate initial triage and surface root-cause candidates.
| AI-era change | Content |
|---|---|
| Initial-diagnosis automation | AI analyzes logs / metrics |
| Auto-execution of Runbooks | AI auto-conducts common responses |
| Postmortem drafts | AI generates timelines |
| Anomaly prediction | Sign-detection before failure |
A division of labor is beginning: routine first response goes to AI, while humans focus on critical judgment and recurrence prevention. Turn Runbooks into structured code, and a future where AI runs them automatically is close.
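What "AI-readable and runnable" can mean in practice: a hypothetical machine-readable runbook, with an invented schema (not a vendor format) and an explicit human-approval gate on the risky step:

```python
import json

# Hypothetical machine-readable runbook an AI agent could execute step by step;
# the schema is invented for illustration, not a vendor format.
runbook = {
    "trigger": "high_error_rate",
    "steps": [
        {"action": "query_logs", "filter": "status:5xx", "window": "15m"},
        {"action": "run_command",
         "cmd": "kubectl rollout undo deployment/api",
         "requires_human_approval": True},  # keep critical judgment with humans
    ],
}
print(json.dumps(runbook, indent=2))
```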
In the AI era, divide labor: routine response to AI, judgment and learning to humans.
What to decide - what is your projectâs answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are vague invites later questions like "why did we decide this again?"
- Severity criteria (SEV 1-4 definitions)
- On-call regime (rotation, SLA)
- Notification tools (PagerDuty / Opsgenie)
- Command-center rules (declaration criteria, roles)
- Status page (customer info dissemination)
- Postmortem rules (Blameless, scope of disclosure)
- Runbook management (location, update rules)
How to make the final call
The core of incident response is resolving through mechanisms, not individual heroes. The modern design keeps Runbooks, an on-call rotation, and a command center so that whoever is on duty operates at consistent quality, and switches the response regime by severity. Veteran-dependent response is fast in the short term but breeds key-person dependency, knowledge loss, and burnout in the long term. The largest value lies in drawing maximum learning from each incident with blameless postmortems and feeding that learning back into recurrence prevention.
The other decisive axis is the division of labor where AI agents handle first response and humans handle judgment and learning. In an era when Datadog Bits AI, PagerDuty AIOps, and Resolve AI handle log analysis, root-cause-candidate presentation, and Runbook auto-execution, human added value concentrates on critical judgment and recurrence-prevention design. The prerequisite is turning Runbooks into code, structured so AI can read and run them.
Selection priorities
- Switch the response regime by severity: pre-document SEV 1-4 and don't respond to every incident at the same intensity
- Lower on-call load via alert reduction: two or more late-night pages per month means overload; prevent it with auto-recovery
- Learn via blameless postmortems: no blame-hunting; treat failures as system and process flaws
- Turn Runbooks into code and entrust them to AI: routine response goes to AI so humans can concentrate on judgment and recurrence prevention
"Resolve via mechanism, not heroes." Routine work to AI; judgment and learning to humans.
Summary
This article covered incident response, including phases, SEV levels, on-call, command center, Blameless postmortems, Runbooks, and AIOps.
Switch the regime by severity, lower on-call load via alert reduction, learn through blameless postmortems, and turn Runbooks into code entrusted to AI. That is the practical answer for incident response in 2026.
Next time we'll cover SRE practice (toil reduction and chaos engineering).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (65/89)