Monitoring & Ops Overall Design — The 3 Pillars of Observability and the 4 Golden Signals

About this article

This article is the ninth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering the system-architecture-level overall map of monitoring and operations.

Production without monitoring is flying without instruments; recovery time becomes a matter of luck. This article focuses on system-architecture-stage monitoring requirements (3 pillars of observability, 4 golden signals, platform selection, phased rollout); operational implementation (OpenTelemetry, log design, SLO operation, on-call) lives in the “DevOps Architecture” category.

What is monitoring design in the first place

Monitoring design is, roughly speaking, “setting up a system to continuously check your system’s health.”

Imagine a car’s dashboard. The speedometer (response time), fuel gauge (resource usage), and engine warning light (error alert) let you catch anomalies while driving. Without any instruments, you might not notice until the engine seizes. System monitoring works the same way — metrics, logs, and traces serve as three “instruments” for continuous checks and early anomaly detection.

Why monitoring design matters

What happens if you run production without monitoring? It’s flying without instruments; recovery time becomes a matter of luck. Systems with a poor monitoring foundation see team morale collapse after one major incident: without knowing the cause, the team stumbles through the response, and the anxiety that it could happen again at any time drags on development.

Conversely, teams with solid monitoring digest incidents as learning opportunities, building robustness over time. Monitoring is table-stakes quality — set it up before launch.

Scope of this article

Monitoring and operations cut across the whole system, so the series splits design-stage and operational-stage articles. This article focuses on “how to embed monitoring requirements at the system-architecture stage.”

| Article | Scope |
| --- | --- |
| This article (overall design) | 3 pillars of observability, 4 golden signals, platform selection, phased rollout |
| Monitoring & observability (other category) | Metrics / logs / traces implementation, OpenTelemetry, dashboards |
| Logging design (other category) | Log levels, structured logs, retention, PII |
| SLO and SLI (other category) | SLO setting, error budget, burn-rate alerts |
| Incident response (other category) | On-call, runbooks, postmortems |

This article doesn’t go into “monthly SLO operation,” “PagerDuty escalation,” or “how to write a postmortem.” It draws the overall map at design time; details go to the other category.

The question for this article is “what do you decide to monitor at system-architecture time?” Operational implementation is the other category’s territory.

Flying without instruments

Monitoring is the always-on system that checks via numbers, logs, and traces whether the system is healthy. Operations is the daily activity built on top: incident response, preventive maintenance, performance improvement, cost optimization. The pair is inseparable: “notice via monitoring, fix via operations” is one set. Setting up monitoring before launch is an investment as important as feature development, often more.

A team with weak monitoring loses morale after one major incident. Responding without knowing the cause leaves the “when will it happen again?” anxiety hanging over development. Conversely, teams with good monitoring digest incidents as learning opportunities, building robustness over time.

Monitoring is table-stakes quality. Build it before launch — that’s the rule.

The three pillars of observability

Observability is the degree to which a system’s internal state can be inferred from the outside. It is broader than mere “monitoring”: by combining metrics, logs, and traces you can get to the root of “why is it slow, and where did it fail.”

flowchart LR
    M[Metrics<br/>Time-series numbers<br/>CPU/latency/error rate]
    L[Logs<br/>Individual events<br/>Access/error/audit]
    T[Traces<br/>Per-request paths<br/>Distributed tracing]
    Q1{Notice}
    Q2{Locate}
    Q3{Root cause}
    M --> Q1
    L --> Q2
    T --> Q3
    Q1 --> Q2 --> Q3 --> FIX[Fix]
    classDef metric fill:#dbeafe,stroke:#2563eb;
    classDef log fill:#fef3c7,stroke:#d97706;
    classDef trace fill:#fae8ff,stroke:#a21caf;
    classDef step fill:#f0f9ff,stroke:#0369a1;
    classDef fix fill:#dcfce7,stroke:#16a34a;
    class M metric;
    class L log;
    class T trace;
    class Q1,Q2,Q3 step;
    class FIX fix;

| Pillar | Purpose | Examples |
| --- | --- | --- |
| Metrics | Time-series numbers | CPU, latency, error rate |
| Logs | Individual event detail | Access logs, error logs, audit logs |
| Traces | Path of a single request | Distributed tracing |

Only with all three can you notice “the API is slow” in the metrics, locate “which request was slow” in the logs, and find “which service caused it” in the traces. Missing any one of them leaves you not knowing where to start fixing.
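
To make the division of labor concrete, here is a minimal Python sketch (standard library only; the field names such as trace_id and route and the record_metric helper are illustrative, not any particular vendor’s schema) in which one request emits all three pillars keyed by a shared trace ID.

```python
import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("api")

def record_metric(name: str, value: float, tags: dict) -> None:
    # Placeholder sink: a real system would push to Prometheus, CloudWatch, Datadog, etc.
    print(f"METRIC {name}={value:.1f} tags={tags}")

def handle_request(route: str) -> None:
    trace_id = uuid.uuid4().hex        # one shared key joins metric, log, and trace
    start = time.perf_counter()
    status = 500
    try:
        time.sleep(0.05)               # stand-in for the real work
        status = 200
    finally:
        latency_ms = (time.perf_counter() - start) * 1000

        # Metrics: a time-series point -> lets you notice "the API is slow"
        record_metric("http_request_latency_ms", latency_ms,
                      {"route": route, "status": status})

        # Logs: one structured event carrying the trace_id -> lets you locate the request
        log.info(json.dumps({"trace_id": trace_id, "route": route,
                             "status": status, "latency_ms": round(latency_ms, 1)}))

        # Traces: in a real setup the same trace_id propagates to downstream services,
        # so child spans reveal which service caused the slowness

handle_request("/orders")
```

The shared trace ID is what lets you pivot from a metric spike to the exact log line and from there to the cross-service trace.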

The four golden signals

These are the four basic indicators that Google’s SRE (Site Reliability Engineering) team proposed as “what to look at first.” Trying to monitor everything breaks down; anchor on these four.

| Signal | Meaning | Examples |
| --- | --- | --- |
| Latency | Request processing time | API response time, p99 |
| Traffic | Request count, bandwidth | RPS (Requests Per Second), concurrent connections |
| Errors | Failed-request rate | 5xx error rate, exception count |
| Saturation | Resource pressure | CPU/memory/disk usage |

Read latency at P95 / P99, not P50. Averages hide the case that is “fast for the majority but unbearable for 1%,” and outages usually announce themselves first in that slow 1% tail.
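
To see the effect, here is a small sketch with made-up numbers: a sample where most requests are fast and a small tail is terrible. The mean and P50 look healthy; only the upper percentiles expose the tail.

```python
import statistics

# Hypothetical sample: 980 requests at ~100 ms, 20 requests at ~5000 ms
latencies_ms = [100] * 980 + [5000] * 20

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, enough for illustration."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[index]

print("mean:", round(statistics.mean(latencies_ms)))  # 198 ms: looks fine
print("P50 :", percentile(latencies_ms, 50))          # 100 ms: looks fine
print("P95 :", percentile(latencies_ms, 95))          # 100 ms: still looks fine
print("P99 :", percentile(latencies_ms, 99))          # 5000 ms: the tail your users feel
```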

Start with the four signals. “Monitor everything” is impossible.

Choosing a monitoring platform

Pick from three families: cloud-native, SaaS, or OSS. The choice is driven by ops headcount, cost tolerance, and the existing stack.

| Product | Strong at | Fits |
| --- | --- | --- |
| CloudWatch / Azure Monitor / GCP Operations | Cloud-native, free tiers | Single cloud, startup |
| Datadog | Integrated, excellent UI | Budget exists, want one pane |
| New Relic | APM-focused | App-performance focus |
| Grafana + Prometheus + Loki | OSS, integrated | Cost focus, can self-operate |
| Splunk | Veteran log analytics | Enterprise, large scale |

Datadog has overwhelming feature richness but tends to hit thousands of dollars monthly. The Grafana stack (OSS) is cheap but needs ops staff. CloudWatch offers the best cost-effectiveness in single-cloud environments.

Reliability targets to set at design time

Monthly SLO operation belongs to the other category, but at design time set the rough framing of “how much downtime is acceptable for this service?” This determines redundancy level, monitoring investment, and on-call need.

| Service type | Target availability | Monthly downtime budget |
| --- | --- | --- |
| Personal blog, experimental | 99% (hard not to clear) | ~7 hours |
| General B2C web | 99.9% | ~43 minutes |
| B2B SaaS (business hours required) | 99.95% | ~22 minutes |
| Finance / payment / healthcare | 99.99%+ | ~4 minutes |
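
The downtime budgets in the table are simple arithmetic on the availability target. A tiny sketch (assuming a 30-day month) reproduces them:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assuming a 30-day month

def monthly_downtime_budget_min(availability_pct: float) -> float:
    """Allowed downtime per month, in minutes, for a given availability target."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.95, 99.99):
    minutes = monthly_downtime_budget_min(target)
    print(f"{target}% -> {minutes:.0f} min/month (~{minutes / 60:.1f} h)")
# 99.0% -> 432 min (~7.2 h), 99.9% -> 43 min, 99.95% -> 22 min, 99.99% -> 4 min
```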

Setting SLA (contract) first locks the design too tight, so the practical order is “set rough SLO (target) first, finalize SLA from sales requirements later.” Detailed SLI selection and burn-rate design go to the other category.

At design time, set only the framing of “how many minutes of downtime is OK.” Refinement happens in another article.

Phased monitoring-rollout roadmap

“Everything at once” is impossible; phase the buildout. The phases below are ordered by how much you would regret not having them when an incident hits.

| Phase | Minimum | Target / signal | Annual monitoring cost |
| --- | --- | --- | --- |
| 1 MVP | Health check + error notification (Sentry, etc.) | Daily Slack notification is fine | $0 |
| 2 Early ops | + 4 golden signals + structured logs | P95 < 500ms, error rate < 1% | $300+ |
| 3 Scale ops | + Distributed tracing (OTel), SLO / error budget | SLO 99.9%, burn-rate monitoring | $3k+ |
| 4 Multi-service | + AIOps (AI for IT Operations) tools, anomaly detection | False-positive rate <= 5%, MTTR <= 30 min | $30k+ |
| 5 On-call ops | PagerDuty / Opsgenie, runbooks, postmortems | MTTA (acknowledgement time) <= 5 min | + Headcount |

Default alert thresholds: P99 latency > 2x normal -> WARN, > 3x -> PAGE; error rate > 0.5% -> WARN, > 2% -> PAGE. Static thresholds always go stale, so switching to SLO-based within 6 months of operation is the rule.
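
Those defaults are easy to encode as a starting point. A minimal sketch; the baseline value, thresholds, and Slack channel names are placeholders, and the whole thing should be replaced by SLO-based rules once they exist:

```python
from typing import Optional

def classify_latency(p99_ms: float, baseline_p99_ms: float) -> Optional[str]:
    """Default rule of thumb: P99 > 2x normal -> WARN, > 3x normal -> PAGE."""
    if p99_ms > 3 * baseline_p99_ms:
        return "PAGE"
    if p99_ms > 2 * baseline_p99_ms:
        return "WARN"
    return None

def classify_error_rate(error_rate_pct: float) -> Optional[str]:
    """Default rule of thumb: error rate > 0.5% -> WARN, > 2% -> PAGE."""
    if error_rate_pct > 2.0:
        return "PAGE"
    if error_rate_pct > 0.5:
        return "WARN"
    return None

# Route by severity so real pages don't drown in warning noise (channel names are illustrative)
CHANNELS = {"WARN": "#alerts-warning", "PAGE": "#alerts-page"}

for severity in (classify_latency(p99_ms=900, baseline_p99_ms=250),
                 classify_error_rate(error_rate_pct=1.2)):
    if severity:
        print(f"{severity} -> notify {CHANNELS[severity]}")
```

Routing WARN and PAGE to different destinations also covers the “one Slack channel for everything” trap in the next section.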

Stage the monitoring rollout by phase. Datadog at $3k/month for an MVP is obvious over-investment.

Monitoring / alert design traps

What breaks monitoring is more about operational design than product selection. Get this wrong and no product saves you.

| Forbidden move | Why |
| --- | --- |
| Leaving DEBUG logs on in production | CloudWatch Logs / Datadog bills often exceed $10k/month. Filter at INFO and above as the rule |
| Keeping static CPU/memory thresholds (alert at >80%) | Load variation piles up false positives and nobody looks anymore. Switch to an SLO-violation basis |
| Sending all alerts to one Slack channel | Real incident notifications get buried in the daily noise. Split notification destinations by severity |
| Managing runbooks in Word / PDF / oral tradition | Tribal and not reproducible. Markdown + Git is the modern standard |
| A postmortem culture that hunts for a culprit | Information gets hidden and the next incident becomes invisible. Blameless is the rule |
| Not monitoring the monitoring platform’s own cost | Datadog bills exploding 5x is a recurring story. Monitoring needs monitoring |
| Microservices without traces | “Which service is slow” can’t be identified; MTTR balloons to hours or days |
| Letting alerts pile up unwatched | Same as alerts never firing. Hold a monthly alert review to cull stale ones |
| Endless log retention | S3 storage + Athena search costs keep accumulating. Tiered storage + auto-deletion is required |
| Judging latency by P50 or the average | Averages hide the slowest 1% of users. Early outage signs show up in P95/P99; missing tail latency means missing the problem |
| Thinking more monitoring items = a safer system | More items mean more alert fatigue and everything gets muted. Value comes from whether action follows, not from quantity |
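
Two of the traps above, DEBUG logs left on and unstructured text logs, share a cheap countermeasure: emit structured JSON at INFO and above from day one. A minimal standard-library sketch; the field names (order_id, latency_ms, etc.) are illustrative and would follow your own log schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line (machine- and AI-parseable)."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "extra_fields", {}))  # optional structured context
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])  # INFO and above: DEBUG is filtered out

log = logging.getLogger("checkout")
log.debug("never reaches production logs")  # dropped by the INFO threshold
log.info("payment accepted", extra={"extra_fields": {"order_id": "o-123", "latency_ms": 84}})
```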

The February 28, 2017 AWS S3 outage in us-east-1 started with a classic typo and took hundreds of services down for roughly 4 hours. Even teams with monitoring in place responded late if they had no routing for “the underlying cloud itself is down” notifications.

An alert’s value is decided by how many people it made move, not by how many times it fired.

AI decision axes

Monitoring-platform selection lands on two axes: “can it be instrumented with standard protocols like OpenTelemetry?” and “can AI analyze the logs and metrics?”

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| OpenTelemetry (standard, vendor-neutral) | Vendor-specific instrumentation SDKs |
| Structured logs (JSON, AI-parseable) | Unstructured text logs |
| Runbook / alert definitions in code | Procedures in Word, PDF, or oral tradition |
| Datadog / New Relic with AIOps support | Monitoring platforms with only a closed UI |

The practical order:

  1. Instrument the 4 golden signals first (full monitoring is impossible).
  2. Define the SLO / error budget and alert on SLO violations.
  3. Instrument with OpenTelemetry to avoid lock-in and stay AI-era ready (a minimal sketch follows this list).
  4. Systematize on-call and runbooks; no tribal knowledge.
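
For item 3, here is a minimal sketch of vendor-neutral instrumentation with the OpenTelemetry Python SDK (the opentelemetry-api / opentelemetry-sdk packages). It exports spans to the console; a real setup would swap in an OTLP exporter pointed at your collector, and the service and span names here are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Vendor-neutral setup: only the exporter changes when the backend changes
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative service name

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", "o-123")
    with tracer.start_as_current_span("charge-card"):
        pass  # a downstream call; its latency becomes a child span in the same trace
```

Because only the exporter is vendor-specific, switching backends (or letting an AI agent consume the same telemetry stream) does not require re-instrumenting the application.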

“Mute it or you can’t get any work done” alert channel (industry case)

On a new engineer’s first day, a senior tells them with a straight face: “Mute this Slack channel or you can’t get any work done.” The channel was a pile of noise, with dozens of CPU-threshold alerts a day, and the entire team had stopped looking at it.

Months later, a real incident alert sat in the same channel, detection delayed by hours.

Similar scenes recur in many teams. “Nobody looks at this channel” becomes implicit shared knowledge, and only newcomers notice it each time. The lesson: an alert’s value is decided by how many people it made move, not by how many times it fired.

A channel nobody watches is the same as no channel. Cull noise, alert only on SLO violations — the courage to commit to that design saves the team in the long run.

“Alerts you’re scared to delete” are already stale.

What to decide (design stage)

  • Monitoring-platform direction (CloudWatch / Datadog / Grafana family)
  • Log aggregation and retention (30 days / 90 days / annual)
  • Distributed tracing (OpenTelemetry adoption)
  • Reliability target (rough SLO, allowable downtime)
  • Monitoring cost ceiling

Alert design, SLO operation, on-call, runbooks, and postmortems are decided in the “DevOps Architecture” category articles.

Summary

This article covered the monitoring and operations overall map at the system-architecture level.

The default order: 4 golden signals -> 3 observability pillars -> SLO-based alerts. “Monitor everything” is impossible — phase the rollout and have the courage to delete noise alerts.

The next article covers BCP (business continuity planning, RPO/RTO, DR strategy).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.