[DevOps Architecture] Incident Response

About this article

As the twelfth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains incident response.

Incidents will always happen. Designs that simply hope they won’t crumble once they do. This article covers the sequence of detection, alerting, notification, response, recovery, and review, along with the on-call regime, severity definitions, and postmortem culture (with the 2017 GitLab incident as the textbook case): in short, how to systematize noticing fast and recovering quickly.

What is incident response, anyway?

[Figure: Six Phases of Incident Response]

Imagine a fire department’s deployment system. You never know when a fire will break out, but the station has 24/7 duty rotations, deployment procedures, area maps, and past fire records ready to go. It’s not individual bravery that makes a fire department effective — it’s the system that reliably extinguishes fires.

Incident response is the fire department of system operations: the practice of systematizing the entire process, from detection and notification through initial response and recovery to review, for when failures or abnormal events occur. It maintains the processes, tools, and training that let whoever is on duty respond at consistent quality, and it runs a learning cycle that keeps the same incident from happening twice.

Without an incident response system, every failure becomes a matter of a veteran pulling an all-nighter to heroically fix things. If that person is on vacation, recovery is delayed by hours, and the same failures repeat over and over.

Why incident response is needed

Response delay expands damage

In a SaaS where one minute of downtime affects tens of thousands of users, every minute of response delay translates directly into losses that can reach the millions. Fast detection and response protect company value.

Key-person dependency destroys reproducibility

A “the veteran fixed it overnight” style of response collapses the next time that person is unavailable. You need mechanisms that let anyone respond at the same quality.

Recurrence prevention is the biggest value

The true purpose of incident response is recurrence prevention. Whether you can turn the lessons of one incident into system improvements is the dividing line.

Incident-response phases

Design incident response as six phases. Clear owners and procedures for each phase reduce key-person dependency.

Phase | Content
Detection | Notice via monitoring / alerts
Triage | Judge severity / impact range
Response | Mitigation / recovery response
Communication | Stakeholder situation transmission
Resolution | Remove root cause
Post-mortem | Review and recurrence prevention

Severity levels

Grade incidents by severity and vary the response regime accordingly. Responding to every incident at the same intensity exhausts people, so escalation rules per severity level are required.

Level | Content | Response
SEV 1 | Total stop / major data loss | All hands, 24-hour response
SEV 2 | Major-feature failure | On-call + SRE immediate response
SEV 3 | Partial-feature failure | Respond within business hours
SEV 4 | Minor defect | Normal backlog

Unless the criteria for judging severity are documented in advance, judgments vary from team to team and confusion follows.
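One way to make the criteria hard to misread is to write them down as code. The following is a minimal sketch, assuming a hypothetical triage form with four yes/no questions; the `Impact` fields and the `classify` function are illustrative names, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # total stop / major data loss
    SEV2 = 2  # major-feature failure
    SEV3 = 3  # partial-feature failure
    SEV4 = 4  # minor defect


@dataclass(frozen=True)
class Impact:
    # Hypothetical yes/no answers collected during triage.
    service_down: bool = False
    data_loss: bool = False
    major_feature_broken: bool = False
    partial_feature_broken: bool = False


# Response regime per level, copied from the table above.
RESPONSE = {
    Severity.SEV1: "All hands, 24-hour response",
    Severity.SEV2: "On-call + SRE immediate response",
    Severity.SEV3: "Respond within business hours",
    Severity.SEV4: "Normal backlog",
}


def classify(impact: Impact) -> Severity:
    """Map observed impact to a SEV level, checking the worst case first."""
    if impact.service_down or impact.data_loss:
        return Severity.SEV1
    if impact.major_feature_broken:
        return Severity.SEV2
    if impact.partial_feature_broken:
        return Severity.SEV3
    return Severity.SEV4
```

Checking the worst case first means that an incident matching several rows always lands on the most severe one, which removes one common source of disagreement during triage.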

On-call regime

On-call follows the same thinking as a fire-station duty roster. Set an on-call schedule that covers 24/7, spread the load through rotation, and avoid concentrating it on a single person. This is the standard operating model for SRE organizations, and Google’s SRE book devotes a chapter to it.

Design item | Content
Rotation | Weekly / bi-weekly is standard
Primary / Secondary | 2-stage backup
SLA | First response within how many minutes
Allowance | Compensation for nights / holidays
Handoff | Handover at shift change

On-call is a heavy burden, so an ongoing SRE task is to reduce how often people get paged, through alert reduction and auto-recovery. Two or more late-night pages a month is a sign of overload.
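The rotation and primary/secondary items can be sketched as a simple round-robin. This is an illustration only, assuming a weekly handoff and a hypothetical `weekly_rotation` helper; real tools such as PagerDuty manage schedules themselves.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Return (week_start, primary, secondary) tuples in round-robin order.

    The secondary for one week becomes the primary for the next, so every
    handoff goes to someone who was already following the situation.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary/secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Example with placeholder names: a three-person weekly rotation.
for week_start, primary, secondary in weekly_rotation(
    ["ana", "ben", "chi"], date(2026, 1, 5), 4
):
    print(week_start, primary, secondary)
```

With three people, each person is primary one week in three; adding people to the list lowers that frequency without changing the code.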

Promotion from alert to incident

Not every alert is an incident. Many recover on their own or can be handled within business hours. You need a design that distinguishes what truly deserves to be an incident.

Situation | Response
Auto-recovery | Just record the alert
In-business-hours response possible | Create ticket
Immediate response needed | Declare incident
Severe damage | Major incident, all hands

Declaring an incident is an explicit act: it flips the switch to “stop normal work and focus on this.”
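The promotion table can be expressed as a small decision function. This is a sketch under the assumption that three yes/no facts about an alert are already known; the `Action` enum and `promote` function are hypothetical names chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    RECORD_ONLY = "just record the alert"
    CREATE_TICKET = "create ticket"
    DECLARE_INCIDENT = "declare incident"
    MAJOR_INCIDENT = "major incident, all hands"


def promote(auto_recovered: bool, needs_immediate_response: bool,
            severe_damage: bool) -> Action:
    """Decide what an alert becomes, evaluating the most serious case first."""
    if severe_damage:
        return Action.MAJOR_INCIDENT
    if needs_immediate_response:
        return Action.DECLARE_INCIDENT
    if auto_recovered:
        return Action.RECORD_ONLY
    # Still open, but it can wait for business hours.
    return Action.CREATE_TICKET
```

Keeping the declaration as an explicit return value, rather than a side effect buried in alert handling, mirrors the point that declaring an incident is a deliberate act.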

Command center (war room)

For major incidents (SEV 1-2), stand up an Incident Command Center (ICC) where all responders gather. Split roles so that work proceeds in parallel and everyone knows exactly what they are responsible for.

Role | Responsibility
Incident Commander (IC) | Overall command / decision
Operations Lead | Technical response
Communications Lead | Customer / internal communication
Scribe | Timeline recording

It is important that the IC stays out of hands-on technical investigation and concentrates on tracking the overall situation and making decisions. Leaving the technical investigation to the Operations Lead is the efficient division of labor.

Notification and communication

Communication during an incident is as important as the response itself. Unless you decide in advance who is told what and when, customers, management, and the front line end up with inconsistent information.

Recipient | Content | Channel
Response team | Tech details / progress | Slack channel
Management | Impact range / ETA (Estimated Time of Arrival, here the target recovery time) | Email / Slack
Customers | Situation / recovery target | Status page
Support | FAQ / response template | Internal wiki

Atlassian Statuspage (formerly Statuspage.io) is a SaaS for customer-facing status announcements, an important tool for maintaining trust during incidents.

Postmortems

Once an incident is resolved, always hold a review; this is an iron rule in Google’s SRE practice. Document what happened, why it happened, and how to improve, and the incident becomes an organizational asset.

Item | Content
Overview | What, when, how much impact
Timeline | Events in chronological order with timestamps
Root cause | Why it happened
Mitigation | Stopgap measures taken
Recurrence prevention | Permanent countermeasure
Improvement actions | Who does what by when

The overriding principle is not to blame people (the blameless postmortem). Treat the incident as a flaw in systems and processes, not as an individual’s fault.
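The items above map naturally onto a small data structure. The sketch below is one possible shape, not a standard schema; the class and field names are assumptions made for illustration. It deliberately has no field for naming a responsible person, in keeping with the blameless principle.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str  # who
    due: date   # by when


@dataclass
class Postmortem:
    overview: str
    timeline: list[tuple[str, str]]  # (timestamp, event)
    root_cause: str
    mitigation: str
    prevention: str
    actions: list[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "overview": self.overview,
            "timeline": self.timeline,
            "root_cause": self.root_cause,
            "mitigation": self.mitigation,
            "prevention": self.prevention,
            "actions": self.actions,
        }
        return [name for name, value in sections.items() if not value]
```

A check like `missing_sections()` can be run before a postmortem is marked complete, so that a document without improvement actions is not quietly closed.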

Root cause analysis (RCA)

Root cause analysis (RCA) is the technique of digging past surface causes to the true cause. The “5 Whys,” which repeats the question “why?” five times, is well known, but in complex systems the reality is usually a combination of factors rather than a single cause.

Symptom: DB went down
|- why? Connections overflowed
|- why? Connection leak in new feature
|- why? Not noticed in code review
|- why? The review checklist has no "connection management" item
|- why? The review guide was never updated

The conclusion is that the true cause is a flaw in the review culture, not a “code bug,” and that is what becomes the improvement target.

Runbooks

Runbooks are documented procedures for typical incidents. They spell out “if this alert fires, respond this way” so that anyone can respond at the same quality.

Runbook contents
Firing condition | Which alert
Initial check | What to verify
Diagnosis steps | Triage flow
Recovery steps | Commands to execute
Escalation criteria | When to whom

The standard practice is to keep them in Notion, Confluence, or a Git repository and link to them directly from alerts.
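When Runbooks live in a Git repository, they can be stored as structured data and looked up by alert name. The following is a minimal sketch; the alert name, the steps, and the `runbook_for` helper are all made up for illustration and would be replaced by your own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Runbook:
    alert: str                      # firing condition: which alert
    initial_checks: tuple[str, ...]
    diagnosis: tuple[str, ...]
    recovery: tuple[str, ...]
    escalate_to: str                # escalation criteria: to whom
    escalate_after_min: int         # escalation criteria: when


# Hypothetical example entry; real steps belong to your own systems.
RUNBOOKS = {
    rb.alert: rb
    for rb in [
        Runbook(
            alert="db_connections_exhausted",
            initial_checks=("Confirm the database is reachable",),
            diagnosis=("List which services hold the most connections",),
            recovery=("Restart the service that is leaking connections",),
            escalate_to="secondary on-call",
            escalate_after_min=15,
        )
    ]
}


def runbook_for(alert_name: str) -> Optional[Runbook]:
    """Return the Runbook linked to an alert, or None if none exists yet."""
    return RUNBOOKS.get(alert_name)
```

An alert with no matching entry returns `None`, which is itself useful: it shows which alerts still lack a documented response.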

Decision criterion 1: org scale

How heavyweight incident response should be depends on scale. Small organizations need simplicity; large ones need a regime built on division of labor.

Scale | Recommended
Startup | Slack + PagerDuty + Google Docs
Mid-size enterprise | PagerDuty + Statuspage + Notion
Large enterprise | ServiceNow / Jira Service Management
Global | Follow-the-Sun (relay daytime hours across regions) on-call regime

Decision criterion 2: SLA strictness

The stricter the customer-facing SLA, the stricter the incident-response SLA must be.

Customer SLA | Response regime
99% | Business-hours response is enough
99.9% | Night on-call needed
99.95%+ | 24/7 dedicated SRE team
99.99%+ | Per-region on-call + multi-layer SRE
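The reason each step up demands a heavier regime becomes clear once the percentages are converted into allowed downtime. The short calculation below assumes a 30-day month; the function name is an illustrative choice.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given availability over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for sla in (99.0, 99.9, 99.95, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min per 30 days")
```

At 99% the budget is 432 minutes (about 7 hours) per month, so waiting until morning is survivable. At 99.9% it shrinks to about 43 minutes, which a single unattended overnight failure would exhaust, and at 99.99% it is just over 4 minutes.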

How to choose by case

Personal dev / small in-house tools

UptimeRobot plus email or Slack notifications is enough. No on-call regime is needed; responding during business hours is fine. Keep a simple Runbook in Notion to write down what only you know, which is enough to prevent key-person dependency.

Startup / growth-stage SaaS

PagerDuty + Statuspage + a Slack channel. Run a weekly rotation among 2-3 on-call engineers and write blameless postmortems in Google Docs. Narrow the SEV criteria to three levels (SEV 1/2/3) for simplicity.

Mid-size enterprise / microservices ops

PagerDuty / Opsgenie + Runbook as Code + game-day drills. Separate on-call by service, manage Runbooks in Git with PR review, and hold incident drills quarterly. A dedicated SRE team of 2-5 people coordinates the whole.

Finance / medical / global enterprise

ServiceNow / Jira Service Management + Follow-the-Sun + AIOps (AI for IT Operations). SEV 1 automatically notifies all the way up to management, on-call is handed off between regions (Tokyo, Europe, North America), and a regulator-reporting process is built in. Automate routine response with tools such as Resolve AI.

Incident-response numerical gates / SLA

Note: These are industry baseline values as of April 2026. They will become outdated as technology and the talent market shift, so update them periodically.

Incident response does not work unless “what must happen within how many minutes” is defined numerically. Typical industry targets are listed below.

Metric | SEV 1 | SEV 2 | SEV 3 | SEV 4
First response (MTTA) | Within 5 min | Within 15 min | Within 1 hour | Next business day
Recovery (MTTR) target | Within 1 hour | Within 4 hours | Within 1 day | Within 1 week
Notification channel | PagerDuty + phone | PagerDuty | Slack | Jira
Escalation | Immediate + management | IC + OpsLead | On-call | In business hours
Postmortem | Required (within 1 week) | Required (within 2 weeks) | Optional | Unneeded
Status page | Immediate update | Update | As needed | Unneeded
Recurrence prevention | Company-wide rollout | Team rollout | Within team | -

For on-call health metrics: two or more late-night pages a month signals overload, a false-positive rate of 50% or more among fired alerts calls for an alert review, and a recurrence rate above 10% calls for revisiting postmortem quality. Using the four golden signals from Google’s SRE book (Latency / Traffic / Errors / Saturation) as the basis, define in advance which conditions trigger which SEV.

A first response within 5 minutes is the lifeline of reliability. Make it systematic through Runbooks and the on-call regime.
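These targets are easiest to enforce when they are checked mechanically after each incident. The sketch below encodes the SEV 1-3 first-response and recovery targets and the three on-call health thresholds from this section; SEV 4 is left out because “next business day” needs a business calendar. The function names are illustrative assumptions.

```python
from datetime import datetime, timedelta

# (first-response target, recovery target) per severity, from the table above.
TARGETS = {
    "SEV1": (timedelta(minutes=5), timedelta(hours=1)),
    "SEV2": (timedelta(minutes=15), timedelta(hours=4)),
    "SEV3": (timedelta(hours=1), timedelta(days=1)),
}


def sla_breaches(sev: str, detected: datetime, acknowledged: datetime,
                 recovered: datetime) -> list[str]:
    """Return which targets an incident missed, measured from detection."""
    ack_target, recovery_target = TARGETS[sev]
    breaches = []
    if acknowledged - detected > ack_target:
        breaches.append("first response")
    if recovered - detected > recovery_target:
        breaches.append("recovery")
    return breaches


def on_call_health_flags(night_pages_per_month: int,
                         false_positive_rate: float,
                         recurrence_rate: float) -> list[str]:
    """Apply the three health thresholds stated in this section."""
    flags = []
    if night_pages_per_month >= 2:
        flags.append("on-call overload")
    if false_positive_rate >= 0.5:
        flags.append("review alerts")
    if recurrence_rate > 0.10:
        flags.append("revisit postmortem quality")
    return flags
```

For example, a SEV 1 acknowledged after 7 minutes and recovered after 50 minutes misses only the first-response target, which points the follow-up at paging and on-call rather than at the recovery procedure.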

Incident-response pitfalls and forbidden moves

These are the typical failure patterns in incident response. All of them end up raising the recurrence rate or exhausting the organization.

Forbidden move | Why it’s bad
Hunting for blame in postmortems | Breeds information hiding, so the next incidents become invisible. Blameless is the rule
Leaving everything to one veteran | Collapses when that person resigns; the pattern of the 2019 insider fraud at a major US bank
Operating without an on-call SLA | First response takes hours, and customers churn
Managing Runbooks in Word/PDF | Tied to individuals, not reproducible, and AI can’t run them
Leaving SEV criteria to field judgment | Severity varies, and the response regime gets confused
Vague incident-declaration criteria | Either all hands on every alert or major incidents missed; both extremes
Not updating the status page | A flood of customer inquiries and a collapse of trust
No game-day drills | Nobody can move when a SEV 1 occurs; the learning happens during the first real response
Writing the postmortem and stopping there | Action items go unimplemented, and the same incident recurs 3 months later
IC also doing technical investigation | Loses the overall picture, and the response is delayed
Relying on individuals because “the veteran fixing it is faster” | Collapses on resignation or vacation, and org-wide response capability never grows
“Do all the recurrence-prevention measures” | Resources scatter and everything is half-done; narrow to the 1-2 highest-impact measures

The GitLab database-deletion incident of January 31, 2017 (an engineer ran rm -rf on the primary database server, mistaking it for the secondary, and none of the five backup and replication mechanisms worked as expected) is a case where the company’s stance of learning in the open, livestreaming the recovery work and publishing a detailed postmortem, was praised. By contrast, the 2016 Uber data breach (initially concealed, with $100k paid to the attackers for their silence, followed by litigation and a $148M settlement) showed the cost of concealment.

Assume incidents will happen. Being prepared to absorb them through mechanisms and culture is everything.

AI decision axes

AI-favored | AI-disfavored
AI-driven initial-diagnosis automation | Manual log analysis only
Codified Runbooks with auto-execution | Word / PDF Runbooks
AI-drafted postmortems | Manual timeline construction
Anomaly prediction and early-sign detection | Post-hoc response only

  1. Switch the response regime by severity: document SEV 1-4 in advance, and don’t respond to every incident at the same intensity
  2. Lower on-call load via alert reduction: 2+ late-night pages a month is overload, so prevent them with auto-recovery
  3. Learn via blameless postmortems: no blame-hunting, treat incidents as system / process flaws
  4. Codify Runbooks and entrust them to AI: leave routine response to AI and concentrate humans on judgment and recurrence prevention

AI automation of incident initial response

The initial response to an incident (identifying the impact scope, collecting related logs, listing recent deploy changes) is an area AI can automate. Setups in which a PagerDuty or Opsgenie alert triggers AI to run the following automatically are becoming widespread:

  • Aggregate error logs from the last hour and generate a summary
  • Retrieve recent deploy history and summarize changes
  • Identify affected SLIs and notify via Slack
  • Suggest Runbook execution if a matching one exists

Having AI cover the minutes it takes the human on-call engineer to wake up and grasp the situation shortens MTTR.
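The steps listed above can be sketched as a single handler. This is a minimal illustration, not an integration with any real product: the `LogLine` and `Deploy` types, the `runbooks` mapping, and the `summarize` callable (standing in for whatever AI service is used) are all assumptions, and the 24-hour deploy window is an arbitrary choice. Identifying SLIs and posting to Slack are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass(frozen=True)
class LogLine:
    at: datetime
    level: str
    message: str


@dataclass(frozen=True)
class Deploy:
    at: datetime
    service: str
    change: str


def first_response(alert: str, fired_at: datetime, logs: list[LogLine],
                   deploys: list[Deploy], runbooks: dict[str, str],
                   summarize: Callable[[str], str]) -> dict:
    """Gather the context a human would otherwise collect by hand."""
    log_since = fired_at - timedelta(hours=1)       # error logs: last hour
    deploy_since = fired_at - timedelta(hours=24)   # deploys: assumed window
    errors = [l for l in logs
              if l.level == "ERROR" and log_since <= l.at <= fired_at]
    recent = [d for d in deploys if deploy_since <= d.at <= fired_at]
    prompt = "\n".join(
        [f"Alert: {alert}"]
        + [f"ERROR {l.at:%H:%M} {l.message}" for l in errors]
        + [f"DEPLOY {d.at:%H:%M} {d.service}: {d.change}" for d in recent]
    )
    return {
        "summary": summarize(prompt),               # AI call is injected
        "error_count": len(errors),
        "recent_deploys": [d.service for d in recent],
        "suggested_runbook": runbooks.get(alert),   # None if no match
    }
```

Injecting `summarize` as a parameter keeps the data gathering testable without a model, and makes it easy to swap the AI backend later.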

AI generates postmortem drafts

Writing a postmortem after an incident is time-consuming work that involves building the timeline, organizing the impact scope, and documenting the root cause. Passing the Slack logs from the response, the alert history, and the deploy logs to AI and having it generate a postmortem draft (timeline, impact scope, direct cause, root cause, action-item proposals) significantly reduces the documentation burden.

Humans review the AI-generated draft and focus on accuracy verification and action-item prioritization.
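Before anything is handed to AI, the three sources can be merged into one chronological timeline, which also gives the human reviewer something concrete to verify the draft against. This is a sketch under the assumption that each source has already been reduced to (timestamp, source, text) tuples; the helper names are illustrative.

```python
from datetime import datetime

# (timestamp, source label, text); an assumed common shape for all sources.
Event = tuple[datetime, str, str]


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into chronological order."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


def render_timeline(timeline: list[Event]) -> str:
    """Format the timeline as plain text for a prompt or a document."""
    return "\n".join(
        f"{at:%H:%M} [{source}] {text}" for at, source, text in timeline
    )
```

The rendered text can be pasted into the postmortem as the timeline section or included in the prompt that asks AI for the rest of the draft.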

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague invites later questions like “why did we decide this again?”

  • Severity criteria (SEV 1-4 definitions)
  • On-call regime (rotation, SLA)
  • Notification tools (PagerDuty / Opsgenie)
  • Command-center rules (declaration criteria, roles)
  • Status page (customer info dissemination)
  • Postmortem rules (Blameless, scope of disclosure)
  • Runbook management (location, update rules)

https://en.senkohome.com/arch-intro-devops-devenv/
https://en.senkohome.com/arch-intro-devops-review/
https://en.senkohome.com/arch-intro-devops-slo/

Summary

This article covered incident response, including phases, SEV levels, on-call, the command center, blameless postmortems, Runbooks, and AIOps.

Switch the regime by severity, lower on-call load via alert reduction, learn via blameless postmortems, and codify Runbooks so they can be entrusted to AI. That is the practical answer for incident response in 2026.

Next time we’ll cover SRE practice (toil reduction, chaos engineering).

I hope you’ll read the next article as well.

📚 Series: Architecture Crash Course for the Generative-AI Era (65/89)