
[DevOps Architecture] Incident Response - Resolve via Mechanism, Not Heroes

About this article

As the twelfth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains incident response.

Incidents always happen. A design that merely prays they won't crumbles the moment they do. This article covers the sequence of detection, alerting, notification, response, recovery, and review, along with the on-call regime, severity definitions, and postmortem culture (including the GitLab 2017 textbook case) - in short, systematizing how fast you notice and how quickly you recover.

Other articles in this series:

  • DevOps Architecture Overview - One Pipeline for Build, Ship, and Run (en.senkohome.com/arch-intro-devops-overview/)
  • [DevOps Architecture] DevOps and SRE Overview - Speed and Stability Coexist (en.senkohome.com/arch-intro-devops-sre/)
  • [DevOps Architecture] Version Control - Git + Monorepo + GitHub Flow Is the Standard (en.senkohome.com/arch-intro-devops-vcs/)
  • [DevOps Architecture] Dev Environment and Local Execution - Half a Day to First Commit (en.senkohome.com/arch-intro-devops-devenv/)
  • [DevOps Architecture] Code Review - PR 300 Lines + 1 Approver + CODEOWNERS (en.senkohome.com/arch-intro-devops-review/)
  • [DevOps Architecture] Test Design - Pyramid + Testcontainers + Branch Coverage (en.senkohome.com/arch-intro-devops-test/)
  • [DevOps Architecture] CI/CD - GitHub Actions + OIDC + Feature Flag Is the Standard (en.senkohome.com/arch-intro-devops-cicd/)
  • [DevOps Architecture] Deploy Strategy - Raise Frequency, Lower Risk (en.senkohome.com/arch-intro-devops-deploy/)
  • [DevOps Architecture] Monitoring and Observability - Three Pillars + OpenTelemetry + SLO Alerts (en.senkohome.com/arch-intro-devops-observability/)
  • [DevOps Architecture] Log Design - Structured JSON + No PII + Phased Cold-Tiering (en.senkohome.com/arch-intro-devops-logging/)
  • [DevOps Architecture] SLO and SLI - Don't Pursue 100%, Buy Speed With Error Budget (en.senkohome.com/arch-intro-devops-slo/)
  • [DevOps Architecture] SRE Practices - Toil Reduction and Chaos Drills (en.senkohome.com/arch-intro-devops-sre-practice/)
  • [DevOps Architecture] Documentation - Lean README + ADR + OpenAPI Toward Git (en.senkohome.com/arch-intro-devops-docs/)
  • [DevOps Architecture] Ticket and Project Management - Epic/Story/Task + 1-Day Granularity (en.senkohome.com/arch-intro-devops-ticket/)

What incident response is

Incident response is the activity of responding to system failures and abnormal events and recovering quickly. Beyond merely “fixing,” it designs the sequence from detection to review. On the premise that failures will always occur, the goal is systematizing how fast you notice and how quickly you recover.

Good incident response doesn't depend on individual heroic effort. It has processes, tools, and drills that let whoever is on duty operate at consistent quality, and it runs a learning cycle so the same incident is never caused twice.

Incidents happen. “How many minutes to notice, how many minutes to recover” is operational skill.

Why incident response is needed

Response delay expands damage

In a SaaS product where 1 minute of downtime affects tens of thousands of users, every minute of delayed response translates directly into losses. Fast detection and response defends company value.

Person-locking loses reproducibility

Response that amounts to "the veteran fixed it overnight" collapses the next time that person is unavailable. You need mechanisms that let anyone respond at the same quality.

Recurrence prevention is the biggest value

The true purpose of incident response is recurrence prevention. Whether you can reflect lessons from one incident into system improvement is the dividing line.

Incident-response phases

Design incident response in 6 phases. With clear owners and procedures for each phase, person-locking decreases.

```mermaid
flowchart LR
    D[Detection<br/>monitoring/alerts] --> T[Triage<br/>severity judgment]
    T --> R[Response<br/>mitigation/recovery]
    R --> C[Communication<br/>stakeholders]
    C --> RES[Resolution<br/>root-cause removal]
    RES --> PM[Post-mortem<br/>review/recurrence prevention]
    PM -.->|reflect learning| D
    classDef detect fill:#fee2e2,stroke:#dc2626;
    classDef respond fill:#fef3c7,stroke:#d97706;
    classDef comm fill:#dbeafe,stroke:#2563eb;
    classDef recover fill:#dcfce7,stroke:#16a34a;
    classDef learn fill:#fae8ff,stroke:#a21caf;
    class D,T detect;
    class R respond;
    class C comm;
    class RES recover;
    class PM learn;
```

| Phase | Content |
| --- | --- |
| Detection | Notice via monitoring / alerts |
| Triage | Judge severity / impact range |
| Response | Mitigation / recovery response |
| Communication | Stakeholder situation transmission |
| Resolution | Remove root cause |
| Post-mortem | Review and recurrence prevention |

Severity levels

Grade incident severity and vary the response regime accordingly. Responding to every incident at the same intensity exhausts personnel, so per-severity escalation rules are required.

| Level | Content | Response |
| --- | --- | --- |
| SEV 1 | Total stop / major data loss | All hands, 24-hour response |
| SEV 2 | Major-feature failure | On-call + SRE immediate response |
| SEV 3 | Partial-feature failure | Respond within business hours |
| SEV 4 | Minor defect | Normal backlog |

Without pre-documented severity-judgment criteria, judgment varies from one responder to the next and chaos ensues.
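The criteria only work if they are written down somewhere both people and tools can read them. Below is a minimal, hypothetical Python sketch of pinning SEV definitions and a classification rule down as data rather than tribal knowledge; the field names, thresholds, and escalation targets are illustrative, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityLevel:
    """One severity grade with its response expectations (illustrative values)."""
    name: str
    description: str
    first_response_minutes: int  # MTTA target
    escalation: str              # who gets paged

# Hypothetical definitions mirroring the SEV 1-4 table above.
SEVERITIES = {
    1: SeverityLevel("SEV 1", "Total stop / major data loss", 5, "all hands + management"),
    2: SeverityLevel("SEV 2", "Major-feature failure", 15, "on-call + SRE"),
    3: SeverityLevel("SEV 3", "Partial-feature failure", 60, "on-call, business hours"),
    4: SeverityLevel("SEV 4", "Minor defect", 24 * 60, "normal backlog"),
}

def classify(total_outage: bool, data_loss: bool, major_feature_down: bool) -> SeverityLevel:
    """Map observed impact to a severity grade using the pre-agreed criteria."""
    if total_outage or data_loss:
        return SEVERITIES[1]
    if major_feature_down:
        return SEVERITIES[2]
    return SEVERITIES[3]
```

Once the criteria live in a shared, versioned place like this, "which SEV is it?" becomes a lookup rather than an argument in the middle of an outage.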

On-call regime

On-call uses the same thinking as a fire-station rotation: decide an on-call schedule that covers 24/7, distribute the load via rotation, and avoid concentrating it on a single person. It is the standard operational form for SRE organizations and is treated as a central chapter in Google's SRE book.

| Design item | Content |
| --- | --- |
| Rotation | Weekly / bi-weekly is standard |
| Primary / Secondary | 2-stage backup |
| SLA | Within how many minutes for first response |
| Allowance | Compensation for nights / holidays |
| Handoff | Handover at shift change |

On-call is a heavy load, so SRE's continuous effort goes into lowering on-call frequency through alert reduction and auto-recovery. Being paged late at night 2+ times a month is a sign of overload.
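The mechanics of a weekly primary/secondary rotation are simple enough to express directly. The sketch below is illustrative only; in practice the schedule is owned by a tool such as PagerDuty or Opsgenie, and the roster and start date here are made up.

```python
from datetime import date

# Hypothetical roster; in reality this lives in the paging tool, not in code.
ROSTER = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2026, 1, 5)  # the Monday the rotation began

def on_call_pair(today: date, roster=ROSTER, start=ROTATION_START):
    """Return (primary, secondary) for a weekly rotation with a 2-stage backup."""
    week_index = (today - start).days // 7
    primary = roster[week_index % len(roster)]
    secondary = roster[(week_index + 1) % len(roster)]
    return primary, secondary

if __name__ == "__main__":
    print(on_call_pair(date.today()))
```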

Promotion from alert to incident

Not all alerts are incidents. Many auto-recover or can be handled within business hours. You need a design that discerns which alerts should truly become incidents.

| Situation | Response |
| --- | --- |
| Auto-recovery | Just record the alert |
| In-business-hours response possible | Create ticket |
| Immediate response needed | Declare incident |
| Severe damage | Major incident, all hands |

Incident declaration is an explicit act: once declared, the switch flips to "stop normal operations and concentrate on the incident."
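Expressed as a small decision function, the promotion logic might look like the sketch below. It simply mirrors the table above; the flag names are assumptions for illustration.

```python
def handle_alert(auto_recovered: bool, needs_immediate_action: bool, severe_damage: bool) -> str:
    """Decide how far to promote an alert (illustrative, mirroring the table above)."""
    if auto_recovered:
        return "record-only"             # just record the alert
    if severe_damage:
        return "declare-major-incident"  # all hands
    if needs_immediate_action:
        return "declare-incident"        # explicit declaration: stop normal work
    return "create-ticket"               # handle within business hours
```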

Command center (war regime)

For major incidents (SEV 1-2), launch an Incident Command Center (ICC) where everyone involved gathers. Split roles so work proceeds in parallel and everyone knows what to do.

| Role | Responsibility |
| --- | --- |
| Incident Commander (IC) | Overall command / decision |
| Operations Lead | Technical response |
| Communications Lead | Customer / internal communication |
| Scribe | Timeline recording |

It is important that the IC does not dive into technical work but dedicates themselves to grasping the overall situation and making decisions. The division of labor that leaves technical investigation to the Operations Lead is what makes the response efficient.

Notification and communication

Information transmission during incidents is as important as the response itself. Without pre-deciding who communicates what, to whom, and when, customers, management, and the field each end up with different information.

| Recipient | Content | Channel |
| --- | --- | --- |
| Response team | Tech details / progress | Slack channel |
| Management | Impact range / ETA (estimated recovery time) | Email / Slack |
| Customers | Situation / recovery target | Status page |
| Support | FAQ / response template | Internal wiki |

Atlassian Statuspage (statuspage.io) and similar SaaS for customer-facing status announcements are important tools for maintaining trust during incidents.

Postmortems

Once an incident is resolved, always conduct a review - an iron rule of Google SRE. Document "what happened," "why it happened," and "how to improve," turning the incident into an organizational asset.

| Item | Content |
| --- | --- |
| Overview | What, when, how much impact |
| Timeline | Time-of-day events |
| Root cause | Why it happened |
| Mitigation | Contents of first aid |
| Recurrence prevention | Permanent countermeasure |
| Improvement actions | Who by when |

The major principle is: don't blame people (Blameless Postmortem). Treat incidents as flaws in systems and processes, not as individual fault.

Root cause analysis (RCA)

The technique of digging past surface causes to find the true cause is RCA (Root Cause Analysis). The "5 Whys" of asking "why" five times is famous, but in complex systems the reality is usually a combination of multiple factors rather than a single cause.

Symptom: DB went down
|- why? Connections overflowed
|- why? Connection leak in new feature
|- why? Not noticed in code review
|- why? Review viewpoints have no "connection management"
|- why? Review guide stayed old

The conclusion is that the true cause is a flaw in the review culture, not a "code bug" - and that is what becomes the improvement target.

Runbooks

Documented procedures for typical incident response are Runbooks. Write "if this alert fires, respond this way," so that anyone can respond at the same quality.

Runbook contents:

| Item | Content |
| --- | --- |
| Firing condition | Which alert |
| Initial check | What to verify |
| Diagnosis steps | Triage flow |
| Recovery steps | Commands to execute |
| Escalation criteria | When to whom |

The standard operation is to place Runbooks in Notion, Confluence, or a Git repository, linked directly from the alerts that fire.
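Keeping Runbooks as structured steps rather than free-form documents is what makes the "Runbook as Code" approach (mentioned later in this article) possible. The sketch below is a hypothetical illustration of that idea; the alert, the steps, and the stubbed actions are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RunbookStep:
    """One runbook step: a human-readable description plus an executable action.

    The action returns True if the incident is considered resolved after the step.
    """
    description: str
    action: Callable[[], bool]

def check_connection_count() -> bool:
    # Diagnostic step (stub): in reality, query the DB or the metrics API.
    # Checking alone does not resolve the incident, so return False.
    return False

def restart_suspected_service() -> bool:
    # Recovery step (stub): in reality, ask the orchestrator to restart the service.
    return True

# Hypothetical runbook for a "DB connection pool exhausted" alert.
RUNBOOK: List[RunbookStep] = [
    RunbookStep("Check current connection count against the pool limit", check_connection_count),
    RunbookStep("Restart the service suspected of leaking connections", restart_suspected_service),
]

def run(runbook: List[RunbookStep]) -> None:
    """Execute steps in order; stop when a step resolves the issue, else escalate."""
    for step in runbook:
        resolved = step.action()
        print(f"{'resolved' if resolved else 'not yet'}: {step.description}")
        if resolved:
            return
    print("No step resolved the issue: escalate per the runbook's escalation criteria")

if __name__ == "__main__":
    run(RUNBOOK)
```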

Decision criterion 1: org scale

The heaviness of incident response varies with scale. Small organizations need simplicity; large ones need a division-of-labor regime.

| Scale | Recommended |
| --- | --- |
| Startup | Slack + PagerDuty + Google Docs |
| Mid-size enterprise | PagerDuty + Statuspage + Notion |
| Large enterprise | ServiceNow / Jira Service Management |
| Global | Follow-the-Sun (relay daytime hours across regions) on-call regime |

Decision criterion 2: SLA strictness

Stricter customer SLA means stricter incident-response SLA.

| Customer SLA | Response regime |
| --- | --- |
| 99% | Business-hours response is enough |
| 99.9% | Night on-call needed |
| 99.95%+ | 24/7 dedicated SRE team |
| 99.99%+ | Per-region on-call + multi-layer SRE |
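Why the regime tightens so sharply becomes clear when the SLA percentages are converted into a monthly downtime budget. A quick back-of-the-envelope calculation (assuming a 30-day month):

```python
# Rough downtime budget per 30-day month for each availability target.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

for sla in (0.99, 0.999, 0.9995, 0.9999):
    budget = MONTH_MINUTES * (1 - sla)
    print(f"{sla:.2%} -> about {budget:.0f} minutes of downtime per month")

# 99.00% -> about 432 minutes (~7 hours): business hours can absorb it
# 99.90% -> about 43 minutes: nights and weekends must be covered
# 99.95% -> about 22 minutes: a dedicated 24/7 team is realistic
# 99.99% -> about 4 minutes: only per-region on-call and automation can react in time
```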

How to choose by case

Personal dev / small in-house tools

UptimeRobot plus email/Slack notifications is enough. No on-call regime is needed; business-hours response is fine. Place a simple Runbook in Notion to document personal knowledge and at least prevent person-locking.

Startup / growth-stage SaaS

PagerDuty + Statuspage.io + a dedicated Slack channel. Run a weekly on-call rotation among 2-3 people and operate Blameless postmortems on Google Docs. Narrow the SEV criteria to 3 levels (SEV 1/2/3) for simplicity.

Mid-size enterprise / microservices ops

PagerDuty / Opsgenie + Runbook as Code + game-day drills. Separate on-call per service, manage Runbooks in Git with PR review, and run quarterly incident drills. A dedicated SRE team of 2-5 people coordinates the whole.

Finance / medical / global enterprise

ServiceNow / Jira Service Management + Follow-the-Sun + AIOps (AI for IT Operations). SEV 1 auto-notifies up to management, region-based on-call (Tokyo, Europe, North America) hands off around the clock, and a regulator-reporting process is also built in. Automate routine response with tools such as Resolve AI.

Common misconceptions

Veteran fixing is faster

The work style where "the one most familiar with the system fixes everything overnight" looks fastest in the short term. But the field repeatedly tells of cases where an incident with the same symptoms occurred the week after that person resigned, no one could step in, and the first response to customers took 3 hours. Person-locking is the work style with the highest cost when it is lost; the lesson is that mechanisms letting anyone follow the same procedures are stronger in the long run.

Hunt for blame in postmortems

The worst antipattern. Blameless is the rule. Once blame-hunting starts, no one speaks honestly anymore.

Do all recurrence-prevention measures

Resources never suffice. Set priorities, starting from high impact and high recurrence rate.

SEV 1 rarely happens

Without periodic drills, you can’t move when needed. Drill via game day.

Incident-response numerical gates / SLA

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

Incident response doesn’t function without numerically defining “what to do in how many minutes.” Below are industry-standard SLAs.

| Metric | SEV 1 | SEV 2 | SEV 3 | SEV 4 |
| --- | --- | --- | --- | --- |
| First response (MTTA) | Within 5 min | Within 15 min | Within 1 hour | Next business day |
| Recovery (MTTR) target | Within 1 hour | Within 4 hours | Within 1 day | Within 1 week |
| Notification channel | PagerDuty + phone | PagerDuty | Slack | Jira |
| Escalation | Immediate + management | IC + Ops Lead | On-call | In business hours |
| Postmortem | Required (within 1 week) | Required (within 2 weeks) | Optional | Unneeded |
| Status page | Immediate update | Update | As needed | Unneeded |
| Recurrence prevention | Company-wide rollout | Team rollout | Within team | - |
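These targets only mean something if MTTA and MTTR are actually measured per incident. The following is an illustrative sketch of that measurement; the timestamps and the SEV 2 example are hypothetical.

```python
from datetime import datetime

def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two incident timestamps, in minutes."""
    return (end - start).total_seconds() / 60

# Hypothetical SEV 2 incident: alerted 14:02, acknowledged 14:09, recovered 16:40.
alerted = datetime(2026, 4, 3, 14, 2)
acknowledged = datetime(2026, 4, 3, 14, 9)
recovered = datetime(2026, 4, 3, 16, 40)

SEV2_TARGETS = {"mtta_min": 15, "mttr_min": 240}  # from the table above

mtta = minutes_between(alerted, acknowledged)  # 7.0 minutes
mttr = minutes_between(alerted, recovered)     # 158.0 minutes

print(f"MTTA {mtta:.0f} min (target {SEV2_TARGETS['mtta_min']}): "
      f"{'OK' if mtta <= SEV2_TARGETS['mtta_min'] else 'MISS'}")
print(f"MTTR {mttr:.0f} min (target {SEV2_TARGETS['mttr_min']}): "
      f"{'OK' if mttr <= SEV2_TARGETS['mttr_min'] else 'MISS'}")
```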

For on-call health metrics: being paged late at night 2+ times a month is a sign of overload; if more than 50% of fired alerts are false positives, the alerts need review; a recurrence rate over 10% means revisiting postmortem quality. With the four golden signals from Google's SRE book (Latency / Traffic / Errors / Saturation) as a basis, pre-define which SEV each alert fires.

First response within 5 minutes is reliability’s lifeline. Systematize via Runbooks and on-call regime.

Incident-response pitfalls and forbidden moves

Typical accident patterns in incident response. All result in raising recurrence rate or exhausting the org.

| Forbidden move | Why it's bad |
| --- | --- |
| Hunt for blame in postmortems | Hotbed of info-hiding; the next incidents become invisible. Blameless is the rule |
| Leave everything to one veteran | Collapse on resignation, the 2019 US-major-bank insider-fraud pattern |
| Operate without an on-call SLA | First response takes hours, customer churn |
| Manage Runbooks in Word/PDF | Person-locked, irreproducible, AI can't run them |
| Leave SEV criteria to field judgment | Severity varies, response regime confused |
| Vague incident-declaration criteria | Either all-hands on every alert or missing major incidents - both extremes |
| Don't update the status page | Customer-inquiry flood, trust collapse |
| No game-day drills | Can't move when SEV 1 occurs, learning during first response |
| Postmortem written and done | Action items unimplemented, same incident recurs 3 months later |
| IC also does technical investigation | Can't grasp the whole, response delayed |

The January 31, 2017 GitLab database-deletion incident (an engineer ran rm -rf on the production database while believing it was the secondary, and all 5 types of backup failed) is a case where the stance of livestreaming the recovery work and publishing a detailed postmortem - learning without hiding - was praised. In contrast, the Uber 2016 data breach (initial concealment, paying the attackers $100k for silence, then a $148M settlement in subsequent litigation) showed the cost of concealment.

Incidents are a given. Being prepared to receive them through mechanism and culture is everything.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, incident response is evolving into an area where AI agents handle the first response. Datadog Bits AI, PagerDuty AIOps, Resolve AI, and others automate initial triage and the presentation of root-cause candidates.

| AI-era change | Content |
| --- | --- |
| Initial-diagnosis automation | AI analyzes logs / metrics |
| Auto-execution of Runbooks | AI auto-conducts common responses |
| Postmortem drafts | AI generates timelines |
| Anomaly prediction | Sign-detection before failure |

A division of labor is emerging where humans focus on critical judgments and recurrence prevention, while routine first response is left to AI. Write Runbooks as code and structure them, and a future where AI runs them automatically is near.

In the AI era, divide labor: routine response to AI, judgment and learning to humans.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"

  • Severity criteria (SEV 1-4 definitions)
  • On-call regime (rotation, SLA)
  • Notification tools (PagerDuty / Opsgenie)
  • Command-center rules (declaration criteria, roles)
  • Status page (customer info dissemination)
  • Postmortem rules (Blameless, scope of disclosure)
  • Runbook management (location, update rules)

How to make the final call

The core of incident response is the idea of resolving via mechanism, not individual heroes. The modern design has Runbooks, on-call, and a command center so that whoever is on duty operates at consistent quality, switching the response regime by severity. Veteran-dependent response is fast in the short term but breeds person-locking, knowledge outflow, and burnout in the long term. Drawing maximum learning from each incident with Blameless postmortems and turning that learning into recurrence prevention is where the largest value lies.

Another decisive axis is the division of labor where AI agents handle first response and humans handle judgment and learning. In an era when Datadog Bits AI, PagerDuty AIOps, and Resolve AI handle log analysis, root-cause-candidate presentation, and Runbook auto-execution, human added value concentrates on critical judgment and recurrence-prevention design. The premise is writing Runbooks as code and structuring them into a form AI can read and run.

Selection priorities

  1. Switch response regime by severity - pre-document SEV 1-4, don’t respond to all incidents at same intensity
  2. Lower on-call load via alert reduction - 2+ late-night calls monthly is overload, prevent via auto-recovery
  3. Learn via Blameless postmortems - no blame-hunting, treat as system / process flaws
  4. Code-ize Runbooks and entrust to AI - routine response to AI, concentrate humans on judgment / recurrence prevention

“Resolve via mechanism, not heroes.” Routine to AI, judgment and learning to humans.

Summary

This article covered incident response, including phases, SEV levels, on-call, command center, Blameless postmortems, Runbooks, and AIOps.

Switch regime by severity, lower on-call load via alert reduction, learn via Blameless postmortems, code-ize Runbooks and entrust to AI. That is the practical answer for incident response in 2026.

Next time we’ll cover SRE practice (toil reduction, chaos engineering).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.