
[DevOps Architecture] Incident Response - Resolve via Mechanism, Not Heroes

About this article

As the twelfth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains incident response.

Incidents always happen. A design that merely prays they won't crumbles the moment they do. This article covers the sequence of detection, alerting, notification, response, recovery, and review, along with the on-call regime, severity definitions, and postmortem culture (including the GitLab 2017 textbook case) - in short, systematizing how fast you notice and how quickly you recover.

Other articles in this series:

  • DevOps Architecture Overview - One Pipeline for Build, Ship, and Run (en.senkohome.com/arch-intro-devops-overview/)
  • [DevOps Architecture] DevOps and SRE Overview - Speed and Stability Coexist (en.senkohome.com/arch-intro-devops-sre/)
  • [DevOps Architecture] Version Control - Git + Monorepo + GitHub Flow Is the Standard (en.senkohome.com/arch-intro-devops-vcs/)
  • [DevOps Architecture] Dev Environment and Local Execution - Half a Day to First Commit (en.senkohome.com/arch-intro-devops-devenv/)
  • [DevOps Architecture] Code Review - PR 300 Lines + 1 Approver + CODEOWNERS (en.senkohome.com/arch-intro-devops-review/)
  • [DevOps Architecture] Test Design - Pyramid + Testcontainers + Branch Coverage (en.senkohome.com/arch-intro-devops-test/)
  • [DevOps Architecture] CI/CD - GitHub Actions + OIDC + Feature Flag Is the Standard (en.senkohome.com/arch-intro-devops-cicd/)
  • [DevOps Architecture] Deploy Strategy - Raise Frequency, Lower Risk (en.senkohome.com/arch-intro-devops-deploy/)
  • [DevOps Architecture] Monitoring and Observability - Three Pillars + OpenTelemetry + SLO Alerts (en.senkohome.com/arch-intro-devops-observability/)
  • [DevOps Architecture] Log Design - Structured JSON + No PII + Phased Cold-Tiering (en.senkohome.com/arch-intro-devops-logging/)
  • [DevOps Architecture] SLO and SLI - Don't Pursue 100%, Buy Speed With Error Budget (en.senkohome.com/arch-intro-devops-slo/)
  • [DevOps Architecture] SRE Practices - Toil Reduction and Chaos Drills (en.senkohome.com/arch-intro-devops-sre-practice/)
  • [DevOps Architecture] Documentation - Lean README + ADR + OpenAPI Toward Git (en.senkohome.com/arch-intro-devops-docs/)
  • [DevOps Architecture] Ticket and Project Management - Epic/Story/Task + 1-Day Granularity (en.senkohome.com/arch-intro-devops-ticket/)

What incident response is

Incident response is the activity of responding to system failures and abnormal events and recovering quickly. Beyond merely “fixing,” it designs the sequence from detection to review. On the premise that failures will always occur, the goal is systematizing how fast you notice and how quickly you recover.

Good incident response doesn't depend on individual heroic effort. It has processes, tools, and drills that let whoever is on duty operate at consistent quality, and it runs a learning cycle so the same incident is never caused twice.

Incidents happen. “How many minutes to notice, how many minutes to recover” is operational skill.

Why incident response is needed

Response delay expands damage

In a SaaS product where 1 minute of downtime affects tens of thousands of users, every minute of delayed response translates directly into losses. Fast detection and response defends company value.

Person-locking loses reproducibility

Response that amounts to "the veteran fixed it overnight" collapses the next time that person is unavailable. You need mechanisms that let anyone respond at the same quality.

Recurrence prevention is the biggest value

The true purpose of incident response is recurrence prevention. Whether you can reflect lessons from one incident into system improvement is the dividing line.

Incident-response phases

Design incident response in 6 phases. With clear owners and procedures for each phase, person-locking decreases.

```mermaid
flowchart LR
    D[Detection<br/>monitoring/alerts] --> T[Triage<br/>severity judgment]
    T --> R[Response<br/>mitigation/recovery]
    R --> C[Communication<br/>stakeholders]
    C --> RES[Resolution<br/>root-cause removal]
    RES --> PM[Post-mortem<br/>review/recurrence prevention]
    PM -.->|reflect learning| D
    classDef detect fill:#fee2e2,stroke:#dc2626;
    classDef respond fill:#fef3c7,stroke:#d97706;
    classDef comm fill:#dbeafe,stroke:#2563eb;
    classDef recover fill:#dcfce7,stroke:#16a34a;
    classDef learn fill:#fae8ff,stroke:#a21caf;
    class D,T detect;
    class R respond;
    class C comm;
    class RES recover;
    class PM learn;
```

| Phase | Content |
| --- | --- |
| Detection | Notice via monitoring / alerts |
| Triage | Judge severity / impact range |
| Response | Mitigation / recovery response |
| Communication | Stakeholder situation transmission |
| Resolution | Remove root cause |
| Post-mortem | Review and recurrence prevention |

Severity levels

Grade incident severity and vary the response regime accordingly. Responding to every incident at the same intensity exhausts personnel, so per-severity escalation rules are required.

| Level | Content | Response |
| --- | --- | --- |
| SEV 1 | Total stop / major data loss | All hands, 24-hour response |
| SEV 2 | Major-feature failure | On-call + SRE immediate response |
| SEV 3 | Partial-feature failure | Respond within business hours |
| SEV 4 | Minor defect | Normal backlog |

Without pre-documented severity-judgment criteria, judgment varies from one responder to the next and chaos ensues.
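The criteria only work if they are written down somewhere both people and tools can read them. Below is a minimal, hypothetical Python sketch of pinning SEV definitions and a classification rule down as data rather than tribal knowledge; the field names, thresholds, and escalation targets are illustrative, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityLevel:
    """One severity grade with its response expectations (illustrative values)."""
    name: str
    description: str
    first_response_minutes: int  # MTTA target
    escalation: str              # who gets paged

# Hypothetical definitions mirroring the SEV 1-4 table above.
SEVERITIES = {
    1: SeverityLevel("SEV 1", "Total stop / major data loss", 5, "all hands + management"),
    2: SeverityLevel("SEV 2", "Major-feature failure", 15, "on-call + SRE"),
    3: SeverityLevel("SEV 3", "Partial-feature failure", 60, "on-call, business hours"),
    4: SeverityLevel("SEV 4", "Minor defect", 24 * 60, "normal backlog"),
}

def classify(total_outage: bool, data_loss: bool, major_feature_down: bool) -> SeverityLevel:
    """Map observed impact to a severity grade using the pre-agreed criteria."""
    if total_outage or data_loss:
        return SEVERITIES[1]
    if major_feature_down:
        return SEVERITIES[2]
    return SEVERITIES[3]
```

Once the criteria live in a shared, versioned place like this, "which SEV is it?" becomes a lookup rather than an argument in the middle of an outage.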

On-call regime

On-call uses the same thinking as a fire-station rotation: decide an on-call schedule that covers 24/7, distribute the load via rotation, and avoid concentrating it on a single person. It is the standard operational form for SRE organizations and is treated as a central chapter in Google's SRE book.

| Design item | Content |
| --- | --- |
| Rotation | Weekly / bi-weekly is standard |
| Primary / Secondary | 2-stage backup |
| SLA | Within how many minutes for first response |
| Allowance | Compensation for nights / holidays |
| Handoff | Handover at shift change |

On-call is a heavy load, so SRE's continuous effort goes into lowering on-call frequency through alert reduction and auto-recovery. Being paged late at night 2+ times a month is a sign of overload.
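The mechanics of a weekly primary/secondary rotation are simple enough to express directly. The sketch below is illustrative only; in practice the schedule is owned by a tool such as PagerDuty or Opsgenie, and the roster and start date here are made up.

```python
from datetime import date

# Hypothetical roster; in reality this lives in the paging tool, not in code.
ROSTER = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2026, 1, 5)  # the Monday the rotation began

def on_call_pair(today: date, roster=ROSTER, start=ROTATION_START):
    """Return (primary, secondary) for a weekly rotation with a 2-stage backup."""
    week_index = (today - start).days // 7
    primary = roster[week_index % len(roster)]
    secondary = roster[(week_index + 1) % len(roster)]
    return primary, secondary

if __name__ == "__main__":
    print(on_call_pair(date.today()))
```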

Promotion from alert to incident

Not all alerts are incidents. Many auto-recover or can be handled within business hours. You need a design that discerns which alerts should truly become incidents.

| Situation | Response |
| --- | --- |
| Auto-recovery | Just record the alert |
| In-business-hours response possible | Create ticket |
| Immediate response needed | Declare incident |
| Severe damage | Major incident, all hands |

Incident declaration is an explicit act: once declared, the switch flips to "stop normal operations and concentrate on the incident."
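Expressed as a small decision function, the promotion logic might look like the sketch below. It simply mirrors the table above; the flag names are assumptions for illustration.

```python
def handle_alert(auto_recovered: bool, needs_immediate_action: bool, severe_damage: bool) -> str:
    """Decide how far to promote an alert (illustrative, mirroring the table above)."""
    if auto_recovered:
        return "record-only"             # just record the alert
    if severe_damage:
        return "declare-major-incident"  # all hands
    if needs_immediate_action:
        return "declare-incident"        # explicit declaration: stop normal work
    return "create-ticket"               # handle within business hours
```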

Command center (war regime)

For major incidents (SEV 1-2), launch an Incident Command Center (ICC) where everyone involved gathers. Split roles so work proceeds in parallel and everyone knows what to do.

| Role | Responsibility |
| --- | --- |
| Incident Commander (IC) | Overall command / decision |
| Operations Lead | Technical response |
| Communications Lead | Customer / internal communication |
| Scribe | Timeline recording |

It is important that the IC does not dive into technical work but dedicates themselves to grasping the overall situation and making decisions. The division of labor that leaves technical investigation to the Operations Lead is what makes the response efficient.

Notification and communication

Information transmission during incidents is as important as the response itself. Without pre-deciding who communicates what, to whom, and when, customers, management, and the field each end up with different information.

| Recipient | Content | Channel |
| --- | --- | --- |
| Response team | Tech details / progress | Slack channel |
| Management | Impact range / ETA (estimated recovery time) | Email / Slack |
| Customers | Situation / recovery target | Status page |
| Support | FAQ / response template | Internal wiki |

Atlassian Statuspage (statuspage.io) and similar SaaS for customer-facing status announcements are important tools for maintaining trust during incidents.

Postmortems

Once an incident is resolved, always conduct a review - an iron rule of Google SRE. Document "what happened," "why it happened," and "how to improve," turning the incident into an organizational asset.

| Item | Content |
| --- | --- |
| Overview | What, when, how much impact |
| Timeline | Time-of-day events |
| Root cause | Why it happened |
| Mitigation | Contents of first aid |
| Recurrence prevention | Permanent countermeasure |
| Improvement actions | Who by when |

The major principle is: don't blame people (Blameless Postmortem). Treat incidents as flaws in systems and processes, not as individual fault.

Root cause analysis (RCA)

The technique of digging past surface causes to find the true cause is RCA (Root Cause Analysis). The "5 Whys" of asking "why" five times is famous, but in complex systems the reality is usually a combination of multiple factors rather than a single cause.

Symptom: DB went down
|- why? Connections overflowed
|- why? Connection leak in new feature
|- why? Not noticed in code review
|- why? Review viewpoints have no "connection management"
|- why? Review guide stayed old

The conclusion is that the true cause is a flaw in the review culture, not a "code bug" - and that is what becomes the improvement target.

Runbooks

Documented procedures for typical incident response are Runbooks. Write "if this alert fires, respond this way," so that anyone can respond at the same quality.

Runbook contents:

| Item | Content |
| --- | --- |
| Firing condition | Which alert |
| Initial check | What to verify |
| Diagnosis steps | Triage flow |
| Recovery steps | Commands to execute |
| Escalation criteria | When to whom |

The standard operation is to place Runbooks in Notion, Confluence, or a Git repository, linked directly from the alerts that fire.
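Keeping Runbooks as structured steps rather than free-form documents is what makes the "Runbook as Code" approach (mentioned later in this article) possible. The sketch below is a hypothetical illustration of that idea; the alert, the steps, and the stubbed actions are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RunbookStep:
    """One runbook step: a human-readable description plus an executable action.

    The action returns True if the incident is considered resolved after the step.
    """
    description: str
    action: Callable[[], bool]

def check_connection_count() -> bool:
    # Diagnostic step (stub): in reality, query the DB or the metrics API.
    # Checking alone does not resolve the incident, so return False.
    return False

def restart_suspected_service() -> bool:
    # Recovery step (stub): in reality, ask the orchestrator to restart the service.
    return True

# Hypothetical runbook for a "DB connection pool exhausted" alert.
RUNBOOK: List[RunbookStep] = [
    RunbookStep("Check current connection count against the pool limit", check_connection_count),
    RunbookStep("Restart the service suspected of leaking connections", restart_suspected_service),
]

def run(runbook: List[RunbookStep]) -> None:
    """Execute steps in order; stop when a step resolves the issue, else escalate."""
    for step in runbook:
        resolved = step.action()
        print(f"{'resolved' if resolved else 'not yet'}: {step.description}")
        if resolved:
            return
    print("No step resolved the issue: escalate per the runbook's escalation criteria")

if __name__ == "__main__":
    run(RUNBOOK)
```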

Decision criterion 1: org scale

The heaviness of incident response varies with scale. Small organizations need simplicity; large ones need a division-of-labor regime.

| Scale | Recommended |
| --- | --- |
| Startup | Slack + PagerDuty + Google Docs |
| Mid-size enterprise | PagerDuty + Statuspage + Notion |
| Large enterprise | ServiceNow / Jira Service Management |
| Global | Follow-the-Sun (relay daytime hours across regions) on-call regime |

Decision criterion 2: SLA strictness

Stricter customer SLA means stricter incident-response SLA.

| Customer SLA | Response regime |
| --- | --- |
| 99% | Business-hours response is enough |
| 99.9% | Night on-call needed |
| 99.95%+ | 24/7 dedicated SRE team |
| 99.99%+ | Per-region on-call + multi-layer SRE |
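Why the regime tightens so sharply becomes clear when the SLA percentages are converted into a monthly downtime budget. A quick back-of-the-envelope calculation (assuming a 30-day month):

```python
# Rough downtime budget per 30-day month for each availability target.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

for sla in (0.99, 0.999, 0.9995, 0.9999):
    budget = MONTH_MINUTES * (1 - sla)
    print(f"{sla:.2%} -> about {budget:.0f} minutes of downtime per month")

# 99.00% -> about 432 minutes (~7 hours): business hours can absorb it
# 99.90% -> about 43 minutes: nights and weekends must be covered
# 99.95% -> about 22 minutes: a dedicated 24/7 team is realistic
# 99.99% -> about 4 minutes: only per-region on-call and automation can react in time
```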

How to choose by case

Personal dev / small in-house tools

UptimeRobot plus email/Slack notifications is enough. No on-call regime is needed; business-hours response is fine. Place a simple Runbook in Notion to document personal knowledge and at least prevent person-locking.

Startup / growth-stage SaaS

PagerDuty + Statuspage.io + a dedicated Slack channel. Run a weekly on-call rotation among 2-3 people and operate Blameless postmortems on Google Docs. Narrow the SEV criteria to 3 levels (SEV 1/2/3) for simplicity.

Mid-size enterprise / microservices ops

PagerDuty / Opsgenie + Runbook as Code + game-day drills. Separate on-call per service, manage Runbooks in Git with PR review, and run quarterly incident drills. A dedicated SRE team of 2-5 people coordinates the whole.

Finance / medical / global enterprise

ServiceNow / Jira Service Management + Follow-the-Sun + AIOps (AI for IT Operations). SEV 1 auto-notifies up to management, region-based on-call (Tokyo, Europe, North America) hands off around the clock, and a regulator-reporting process is also built in. Automate routine response with tools such as Resolve AI.

Common misconceptions

Veteran fixing is faster

The work style where "the one most familiar with the system fixes everything overnight" looks fastest in the short term. But the field repeatedly tells of cases where an incident with the same symptoms occurred the week after that person resigned, no one could step in, and the first response to customers took 3 hours. Person-locking is the work style with the highest cost when it is lost; the lesson is that mechanisms letting anyone follow the same procedures are stronger in the long run.

Hunt for blame in postmortems

The worst antipattern. Blameless is the rule. Once blame-hunting starts, no one speaks honestly anymore.

Do all recurrence-prevention measures

Resources never suffice. Set priorities, starting from high impact and high recurrence rate.

SEV 1 rarely happens

Without periodic drills, you can’t move when needed. Drill via game day.

Incident-response numerical gates / SLA

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

Incident response doesn’t function without numerically defining “what to do in how many minutes.” Below are industry-standard SLAs.

| Metric | SEV 1 | SEV 2 | SEV 3 | SEV 4 |
| --- | --- | --- | --- | --- |
| First response (MTTA) | Within 5 min | Within 15 min | Within 1 hour | Next business day |
| Recovery (MTTR) target | Within 1 hour | Within 4 hours | Within 1 day | Within 1 week |
| Notification channel | PagerDuty + phone | PagerDuty | Slack | Jira |
| Escalation | Immediate + management | IC + Ops Lead | On-call | In business hours |
| Postmortem | Required (within 1 week) | Required (within 2 weeks) | Optional | Unneeded |
| Status page | Immediate update | Update | As needed | Unneeded |
| Recurrence prevention | Company-wide rollout | Team rollout | Within team | - |
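These targets only mean something if MTTA and MTTR are actually measured per incident. The following is an illustrative sketch of that measurement; the timestamps and the SEV 2 example are hypothetical.

```python
from datetime import datetime

def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two incident timestamps, in minutes."""
    return (end - start).total_seconds() / 60

# Hypothetical SEV 2 incident: alerted 14:02, acknowledged 14:09, recovered 16:40.
alerted = datetime(2026, 4, 3, 14, 2)
acknowledged = datetime(2026, 4, 3, 14, 9)
recovered = datetime(2026, 4, 3, 16, 40)

SEV2_TARGETS = {"mtta_min": 15, "mttr_min": 240}  # from the table above

mtta = minutes_between(alerted, acknowledged)  # 7.0 minutes
mttr = minutes_between(alerted, recovered)     # 158.0 minutes

print(f"MTTA {mtta:.0f} min (target {SEV2_TARGETS['mtta_min']}): "
      f"{'OK' if mtta <= SEV2_TARGETS['mtta_min'] else 'MISS'}")
print(f"MTTR {mttr:.0f} min (target {SEV2_TARGETS['mttr_min']}): "
      f"{'OK' if mttr <= SEV2_TARGETS['mttr_min'] else 'MISS'}")
```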

For on-call health metrics: being paged late at night 2+ times a month is a sign of overload; if more than 50% of fired alerts are false positives, the alerts need review; a recurrence rate over 10% means revisiting postmortem quality. With the four golden signals from Google's SRE book (Latency / Traffic / Errors / Saturation) as a basis, pre-define which SEV each alert fires.

First response within 5 minutes is reliability’s lifeline. Systematize via Runbooks and on-call regime.

Incident-response pitfalls and forbidden moves

Typical accident patterns in incident response. All result in raising recurrence rate or exhausting the org.

| Forbidden move | Why it's bad |
| --- | --- |
| Hunt for blame in postmortems | Hotbed of info-hiding; the next incidents become invisible. Blameless is the rule |
| Leave everything to one veteran | Collapse on resignation, the 2019 US-major-bank insider-fraud pattern |
| Operate without an on-call SLA | First response takes hours, customer churn |
| Manage Runbooks in Word/PDF | Person-locked, irreproducible, AI can't run them |
| Leave SEV criteria to field judgment | Severity varies, response regime confused |
| Vague incident-declaration criteria | Either all-hands on every alert or missing major incidents - both extremes |
| Don't update the status page | Customer-inquiry flood, trust collapse |
| No game-day drills | Can't move when SEV 1 occurs, learning during first response |
| Postmortem written and done | Action items unimplemented, same incident recurs 3 months later |
| IC also does technical investigation | Can't grasp the whole, response delayed |

The January 31, 2017 GitLab database-deletion incident (an engineer ran rm -rf on the production database while believing it was the secondary, and all 5 types of backup failed) is a case where the stance of livestreaming the recovery work and publishing a detailed postmortem - learning without hiding - was praised. In contrast, the Uber 2016 data breach (initial concealment, paying the attackers $100k for silence, then a $148M settlement in subsequent litigation) showed the cost of concealment.

Incidents are a given. Being prepared to receive them through mechanism and culture is everything.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, incident response is evolving into an area where AI agents handle the first response. Datadog Bits AI, PagerDuty AIOps, Resolve AI, and others automate initial triage and the presentation of root-cause candidates.

| AI-era change | Content |
| --- | --- |
| Initial-diagnosis automation | AI analyzes logs / metrics |
| Auto-execution of Runbooks | AI auto-conducts common responses |
| Postmortem drafts | AI generates timelines |
| Anomaly prediction | Sign-detection before failure |

A division of labor is emerging where humans focus on critical judgments and recurrence prevention, while routine first response is left to AI. Write Runbooks as code and structure them, and a future where AI runs them automatically is near.

In the AI era, divide labor: routine response to AI, judgment and learning to humans.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"

  • Severity criteria (SEV 1-4 definitions)
  • On-call regime (rotation, SLA)
  • Notification tools (PagerDuty / Opsgenie)
  • Command-center rules (declaration criteria, roles)
  • Status page (customer info dissemination)
  • Postmortem rules (Blameless, scope of disclosure)
  • Runbook management (location, update rules)

How to make the final call

The core of incident response is the idea of resolving via mechanism, not individual heroes. The modern design has Runbooks, on-call, and a command center so that whoever is on duty operates at consistent quality, switching the response regime by severity. Veteran-dependent response is fast in the short term but breeds person-locking, knowledge outflow, and burnout in the long term. Drawing maximum learning from each incident with Blameless postmortems and turning that learning into recurrence prevention is where the largest value lies.

Another decisive axis is the division of labor where AI agents handle first response and humans handle judgment and learning. In an era when Datadog Bits AI, PagerDuty AIOps, and Resolve AI handle log analysis, root-cause-candidate presentation, and Runbook auto-execution, human added value concentrates on critical judgment and recurrence-prevention design. The premise is writing Runbooks as code and structuring them into a form AI can read and run.

Selection priorities

  1. Switch response regime by severity - pre-document SEV 1-4, don’t respond to all incidents at same intensity
  2. Lower on-call load via alert reduction - 2+ late-night calls monthly is overload, prevent via auto-recovery
  3. Learn via Blameless postmortems - no blame-hunting, treat as system / process flaws
  4. Code-ize Runbooks and entrust to AI - routine response to AI, concentrate humans on judgment / recurrence prevention

“Resolve via mechanism, not heroes.” Routine to AI, judgment and learning to humans.

Summary

This article covered incident response, including phases, SEV levels, on-call, command center, Blameless postmortems, Runbooks, and AIOps.

Switch regime by severity, lower on-call load via alert reduction, learn via Blameless postmortems, code-ize Runbooks and entrust to AI. That is the practical answer for incident response in 2026.

Next time we’ll cover SRE practice (toil reduction, chaos engineering).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.