About this article
This article is the ninth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering the system-architecture-level overall map of monitoring and operations.
Production without monitoring is flying without instruments; recovery time becomes a matter of luck. This article focuses on system-architecture-stage monitoring requirements (3 pillars of observability, 4 golden signals, platform selection, phased rollout); operational implementation (OpenTelemetry, log design, SLO operation, on-call) lives in the “DevOps Architecture” category.
What is monitoring design in the first place
Monitoring design is, roughly speaking, “setting up a system to continuously check your system’s health.”
Imagine a car’s dashboard. The speedometer (response time), fuel gauge (resource usage), and engine warning light (error alert) let you catch anomalies while driving. Without any instruments, you might not notice until the engine seizes. System monitoring works the same way — metrics, logs, and traces serve as three “instruments” for continuous checks and early anomaly detection.
Why monitoring design matters
What happens if you run production without monitoring? It’s flying without instruments — recovery time becomes a matter of luck. Systems with poor monitoring foundations collapse team morale after one major incident. Without knowing the cause, the team stumbles through response, and the anxiety of “it could happen again anytime” drags on development.
Conversely, teams with solid monitoring digest incidents as learning opportunities, building robustness over time. Monitoring is table-stakes quality — set it up before launch.
Scope of this article
Monitoring and operations cuts across the system, so the series splits design-stage and operational-stage articles. This article focuses on “how to embed monitoring requirements at the system-architecture stage.”
| Article | Scope |
|---|---|
| This article (overall design) | 3 pillars of observability, 4 golden signals, platform selection, phased rollout |
| Monitoring & observability (other category) | Metrics / logs / traces implementation, OpenTelemetry, dashboards |
| Logging design (other category) | Log levels, structured logs, retention, PII |
| SLO and SLI (other category) | SLO setting, error budget, burn-rate alerts |
| Incident response (other category) | On-call, runbooks, postmortems |
This article doesn’t go into “monthly SLO operation,” “PagerDuty escalation,” or “how to write a postmortem.” It draws the overall map at design time; details go to the other category.
The question for this article is “what do you decide to monitor at system-architecture time?” Operational implementation is the other category’s territory.
Flying without instruments
Monitoring is the always-on system that checks via numbers, logs, and traces whether the system is healthy. Operations is the daily activity built on top: incident response, preventive maintenance, performance improvement, cost optimization. The pair is inseparable: “notice via monitoring, fix via operations” is one set. Setting up monitoring before launch is an investment as important as feature development, often more.
The three pillars of observability
Observability is the degree to which a system's internal state can be inferred from its external outputs. It is broader than mere "monitoring": by combining metrics, logs, and traces, you can trace "why is it slow, where did it fail" all the way to the root cause.
flowchart LR
M[Metrics<br/>Time-series numbers<br/>CPU/latency/error rate]
L[Logs<br/>Individual events<br/>Access/error/audit]
T[Traces<br/>Per-request paths<br/>Distributed tracing]
Q1{Notice}
Q2{Locate}
Q3{Root cause}
M --> Q1
L --> Q2
T --> Q3
Q1 --> Q2 --> Q3 --> FIX[Fix]
classDef metric fill:#dbeafe,stroke:#2563eb;
classDef log fill:#fef3c7,stroke:#d97706;
classDef trace fill:#fae8ff,stroke:#a21caf;
classDef step fill:#f0f9ff,stroke:#0369a1;
classDef fix fill:#dcfce7,stroke:#16a34a;
class M metric;
class L log;
class T trace;
class Q1,Q2,Q3 step;
class FIX fix;
| Pillar | Purpose | Examples |
|---|---|---|
| Metrics | Time-series numbers | CPU, latency, error rate |
| Logs | Individual event detail | Access logs, error logs, audit logs |
| Traces | Path of a single request | Distributed tracing |
Only with all three in place can you notice "the API is slow" in metrics, locate "which request was slow" in logs, and find "which service caused it" in traces. Missing any one leaves you unable to tell where to start fixing.
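To make the logs-to-traces handoff concrete, here is a minimal sketch in plain Python. The field names (`trace_id`, `latency_ms`, etc.) are illustrative, not any vendor's schema: the point is that every structured log line carries the same trace ID the tracing system propagates, so a slow request found in logs can be followed into its distributed trace.

```python
import json
import time
import uuid

def make_log(level, message, trace_id, **fields):
    """Build one structured log record as a JSON line.

    Carrying the trace ID in every log line is what lets you jump
    from "this request was slow" (logs) to "this service caused it"
    (traces).
    """
    record = {
        "ts": time.time(),
        "level": level,
        "message": message,
        "trace_id": trace_id,  # the same ID the tracing system propagates
        **fields,
    }
    return json.dumps(record)

# Hypothetical slow checkout request
trace_id = uuid.uuid4().hex
line = make_log("ERROR", "checkout failed", trace_id,
                latency_ms=1840, status=500)
print(line)
```

In a real system this ID comes from the tracing SDK's current span context rather than being generated by hand; the linkage principle is the same.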
The four golden signals
These are the four basic indicators Google's SRE (Site Reliability Engineering) team proposed as "what to look at first." Trying to monitor everything breaks down, so anchor on these four.
| Signal | Meaning | Examples |
|---|---|---|
| Latency | Request processing time | API response time, p99 |
| Traffic | Request count, bandwidth | RPS (Requests Per Second), concurrent connections |
| Errors | Failed-request rate | 5xx error rate, exception count |
| Saturation | Resource pressure | CPU/memory/disk usage |
Read latency at P95 / P99, not at the average or P50. Averages hide the "fast for the majority but unbearable for 1%" pattern, and outages usually announce themselves first in that slow 1% tail.
Start with the four signals. “Monitor everything” is impossible.
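The "averages hide the tail" point is easy to verify. The sketch below uses synthetic latencies and a simple nearest-rank percentile (an illustration, not a production metrics library): with 2% of requests taking 5 seconds, the mean looks almost healthy while P99 exposes the problem.

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    index = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[index]

# Synthetic workload: 98 requests at 50 ms, 2 pathological requests at 5 s.
latencies_ms = [50] * 98 + [5000] * 2

print(statistics.mean(latencies_ms))  # 149.0 -> "a bit slow, probably fine"
print(percentile(latencies_ms, 95))   # 50   -> still looks fine
print(percentile(latencies_ms, 99))   # 5000 -> the tail is on fire
```

An on-call engineer watching only the mean would see a modest bump; the P99 panel shows two users waiting five seconds.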
Choosing a monitoring platform
Pick from three families: cloud-native, SaaS, or OSS. The choice is driven by ops headcount, cost tolerance, and your existing stack.
| Product | Strong at | Fits |
|---|---|---|
| CloudWatch / Azure Monitor / GCP Operations | Cloud-native, free tiers | Single cloud, startup |
| Datadog | Integrated, excellent UI | Budget exists, want one pane |
| New Relic | APM-focused | App-performance focus |
| Grafana + Prometheus + Loki | OSS integrated | Cost focus, can self-operate |
| Splunk | Veteran log analytics | Enterprise, large scale |
Datadog is overwhelmingly feature-rich but tends to hit thousands of dollars monthly. The Grafana Stack (OSS) is cheap but needs ops staff. In single-cloud environments, CloudWatch offers the best cost-effectiveness.
Reliability targets to set at design time
Monthly SLO operation belongs to the other category, but at design time set the rough framing of “how much downtime is acceptable for this service?” This determines redundancy level, monitoring investment, and on-call need.
| Service type | Target availability | Monthly downtime budget |
|---|---|---|
| Personal blog, experimental | 99% (hard to miss) | ~7 hours |
| General B2C web | 99.9% | ~43 minutes |
| B2B SaaS (business hours required) | 99.95% | ~22 minutes |
| Finance / payment / healthcare | 99.99%+ | ~4 minutes |
Setting SLA (contract) first locks the design too tight, so the practical order is “set rough SLO (target) first, finalize SLA from sales requirements later.” Detailed SLI selection and burn-rate design go to the other category.
At design time, set only the framing of “how many minutes of downtime is OK.” Refinement happens in another article.
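The downtime budgets in the table are straightforward arithmetic. A quick sketch (assuming a 30-day month) reproduces them:

```python
def monthly_downtime_budget_minutes(availability_pct, days=30):
    """Allowed downtime per month for a given availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.95, 99.99):
    minutes = monthly_downtime_budget_minutes(target)
    print(f"{target}% -> ~{minutes:.0f} min/month")
# 99.0%  -> ~432 min (~7 hours)
# 99.9%  -> ~43 min
# 99.95% -> ~22 min
# 99.99% -> ~4 min
```

Each extra "nine" divides the budget by ten, which is why it roughly multiplies the redundancy and monitoring investment rather than adding to it.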
Phased monitoring-rollout roadmap
"Everything at once" is impossible; phase the buildout. The table below is ordered by how much you would regret not having each layer when an incident hits.
| Phase | Minimum | Target / signal | Annual monitoring cost |
|---|---|---|---|
| 1 MVP | Health check + error notification (Sentry, etc.) | Daily Slack notification is fine | $0 |
| 2 Early ops | + 4 golden signals + structured logs | P95 < 500ms, error rate < 1% | $300+ |
| 3 Scale ops | + Distributed tracing (OTel), SLO / error budget | SLO 99.9%, burn-rate monitoring | $3k+ |
| 4 Multi-service | + AIOps (AI for IT Operations) tools, anomaly detection | False-positive rate <= 5%, MTTR <= 30 min | $30k+ |
| 5 On-call ops | PagerDuty/Opsgenie, runbooks, postmortems | MTTA (acknowledgement time) <= 5 min | + Headcount |
Default alert thresholds: P99 latency above 2x normal -> WARN, above 3x -> PAGE; error rate above 0.5% -> WARN, above 2% -> PAGE. Static thresholds always go stale, so the rule is to switch to SLO-based alerting within 6 months of operation.
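As a sketch only, using the default multipliers quoted above (where `baseline_p99_ms` stands for whatever "normal" means for your service), the static threshold logic is small enough to express directly:

```python
def classify_latency(p99_ms, baseline_p99_ms):
    """Default static rule from the text: >2x normal -> WARN, >3x -> PAGE."""
    if p99_ms > 3 * baseline_p99_ms:
        return "PAGE"
    if p99_ms > 2 * baseline_p99_ms:
        return "WARN"
    return "OK"

def classify_error_rate(rate_pct):
    """Default static rule from the text: >0.5% -> WARN, >2% -> PAGE."""
    if rate_pct > 2.0:
        return "PAGE"
    if rate_pct > 0.5:
        return "WARN"
    return "OK"

print(classify_latency(700, 200))  # 3.5x the normal P99 -> PAGE
print(classify_error_rate(1.0))    # between 0.5% and 2% -> WARN
```

The smallness is the point: these rules are easy to write and just as easy to leave rotting, which is why the text recommends replacing them with SLO-based alerting.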
Stage monitoring rollout with the phase. Datadog at $3k/month for an MVP is obvious over-investment.
Monitoring / alert design traps
What breaks monitoring is more about operational design than product selection. Get this wrong and no product saves you.
| Forbidden move | Why |
|---|---|
| Leaving DEBUG logs in production | CloudWatch Logs / Datadog bills often exceed $10k/month. Filter at INFO and above as the rule |
| Continuing static CPU/memory thresholds (alert at >80%) | Load variation accumulates false positives; nobody looks. Switch to SLO-violation basis |
| Sending all alerts to one Slack channel | Real incident notifications get buried in daily noise. Split notification destinations by severity |
| Runbooks managed in Word / PDF / oral tradition | Tribal, not reproducible. Markdown + Git is the modern standard |
| Postmortem culture that finds the culprit | Information gets hidden, the next incident becomes invisible. Blameless is the rule |
| Not monitoring monitoring-platform cost | Datadog bills explode 5x — recurring stories. Monitoring needs monitoring |
| Microservices without traces | "Which service is slow" can't be identified; MTTR balloons to hours or days |
| Letting alerts pile up unwatched | Same as alerts not firing. Monthly alert review to cull staleness |
| Endless log retention | S3 storage + Athena search costs accumulate. Tiered storage + auto-deletion required |
| Judging latency by the average or P50 | Averages hide the slow 1% of users. Early outage signs show up in P95/P99; missing tail latency means missing problems |
| Thinking more monitoring items = safer system | More items means more alert fatigue and everything gets muted. Value comes from whether action follows, not quantity |
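The "split notification destinations by severity" rule from the table can start as small as a routing map. A minimal sketch (the channel names are hypothetical): the one non-negotiable property is that PAGE-level alerts never share a channel with daily noise.

```python
# Hypothetical Slack channel names; the design rule being illustrated is
# that the page channel contains ONLY things a human must act on now.
ROUTES = {
    "PAGE": "#oncall-pages",     # pager integration, must be acknowledged
    "WARN": "#alerts-warnings",  # reviewed daily, no pager
    "INFO": "#alerts-info",      # dashboards and digests only
}

def route_alert(severity):
    """Pick a destination by severity; unknown severities go to the quiet channel."""
    return ROUTES.get(severity, "#alerts-info")

print(route_alert("PAGE"))
```

Defaulting unknown severities to the quiet channel is deliberate: a misconfigured alert should add noise to the noise channel, not wake someone up.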
The February 28, 2017 AWS S3 outage in us-east-1 was triggered by a classic typo and took down hundreds of dependent services for about 4 hours. Even teams with solid monitoring suffered delayed response because they had no routing for "parent-cloud outage" notifications.
An alert's value is decided by how many people acted on it, not by how many times it fired.
AI decision axes
Monitoring-platform selection lands on two axes: “can it be instrumented with standard protocols like OpenTelemetry?” and “can AI analyze the logs and metrics?”
| AI-era favorable | AI-era unfavorable |
|---|---|
| OpenTelemetry (standard, vendor-neutral) | Vendor-specific instrumentation SDKs |
| Structured logs (JSON, AI-parseable) | Unstructured text logs |
| Runbook / alert definitions in code | Procedures in Word, PDF, oral |
| AIOps-supporting Datadog / New Relic | Closed-UI-only monitoring platforms |
- Instrument the 4 golden signals first (full monitoring is impossible).
- Define SLO / error budget, alert on SLO violations.
- Instrument with OpenTelemetry (avoid lock-in, AI-era ready).
- Systematize on-call and runbooks, no tribal knowledge.
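Burn-rate detail belongs to the other category, but the core idea fits in a few lines. A sketch follows; the 14.4 threshold is Google SRE's commonly cited fast-burn example (consuming 2% of a 30-day budget in a single hour), not something specific to this article's stack.

```python
def burn_rate(observed_error_rate, slo):
    """How fast the error budget is being consumed.

    burn_rate == 1 means the budget lasts exactly the SLO window;
    burn_rate == 14.4 on a 30-day window means roughly 2% of the
    budget disappears every hour.
    """
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget

# 1.44% errors against a 99.9% SLO burns the budget at 14.4x plan.
print(round(burn_rate(0.0144, 0.999), 1))
```

Alerting on burn rate instead of raw error rate is what makes "alert on SLO violations" practical: the same rule works whether the budget is being nibbled or devoured.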
"Mute it or you can't get any work done" alert channel (industry case)
A new engineer’s first day, a senior tells them with a straight face: “Mute this Slack channel or you can’t get any work done.” That channel was a noise pile of dozens of CPU-threshold alerts daily; the entire team had stopped looking.
Months later, a real incident alert sat in the same channel, detection delayed by hours.
Similar scenes recur in many teams. "Nobody looks at this channel" becomes implicit shared knowledge, and only newcomers notice it each time. The lesson: an alert's value is decided by how many people could act on it, not by how many times it fired.
A channel nobody watches is the same as no channel. Cull noise, alert only on SLO violations — the courage to commit to that design saves the team in the long run.
“Alerts you’re scared to delete” are already stale.
What to decide (design stage)
- Monitoring-platform direction (CloudWatch / Datadog / Grafana family)
- Log aggregation and retention (30 days / 90 days / annual)
- Distributed tracing (OpenTelemetry adoption)
- Reliability target (rough SLO, allowable downtime)
- Monitoring cost ceiling
Alert design, SLO operation, on-call, runbooks, and postmortems are decided in the “DevOps Architecture” category articles.
Summary
This article covered the monitoring and operations overall map at the system-architecture level.
The default order: 4 golden signals -> 3 observability pillars -> SLO-based alerts. “Monitor everything” is impossible — phase the rollout and have the courage to delete noise alerts.
The next article covers BCP (business continuity planning, RPO/RTO, DR strategy).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (14/89)