About this article
This article is the ninth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering the system-architecture-level overall map of monitoring and operations.
Production without monitoring is flying without instruments; recovery time becomes a matter of luck. This article focuses on system-architecture-stage monitoring requirements (3 pillars of observability, 4 golden signals, platform selection, phased rollout); operational implementation (OpenTelemetry, log design, SLO operation, on-call) lives in the “DevOps Architecture” category.
What is monitoring design in the first place
Monitoring design is, roughly speaking, “setting up a system to continuously check your system’s health.”
Imagine a car’s dashboard. The speedometer (response time), fuel gauge (resource usage), and engine warning light (error alert) let you catch anomalies while driving. Without any instruments, you might not notice until the engine seizes. System monitoring works the same way — metrics, logs, and traces serve as three “instruments” for continuous checks and early anomaly detection.
Why monitoring design matters
What happens if you run production without monitoring? It’s flying without instruments — recovery time becomes a matter of luck. Systems with poor monitoring foundations collapse team morale after one major incident. Without knowing the cause, the team stumbles through response, and the anxiety of “it could happen again anytime” drags on development.
Conversely, teams with solid monitoring digest incidents as learning opportunities, building robustness over time. Monitoring is table-stakes quality — set it up before launch.
Scope of this article
Monitoring and operations cuts across the system, so the series splits design-stage and operational-stage articles. This article focuses on “how to embed monitoring requirements at the system-architecture stage.”
| Article | Scope |
|---|---|
| This article (overall design) | 3 pillars of observability, 4 golden signals, platform selection, phased rollout |
| Monitoring & observability (other category) | Metrics / logs / traces implementation, OpenTelemetry, dashboards |
| Logging design (other category) | Log levels, structured logs, retention, PII |
| SLO and SLI (other category) | SLO setting, error budget, burn-rate alerts |
| Incident response (other category) | On-call, runbooks, postmortems |
This article doesn’t go into “monthly SLO operation,” “PagerDuty escalation,” or “how to write a postmortem.” It draws the overall map at design time; details go to the other category.
The question for this article is “what do you decide to monitor at system-architecture time?” Operational implementation is the other category’s territory.
Flying without instruments
Monitoring is the always-on system that checks via numbers, logs, and traces whether the system is healthy. Operations is the daily activity built on top: incident response, preventive maintenance, performance improvement, cost optimization. The pair is inseparable: “notice via monitoring, fix via operations” is one set. Setting up monitoring before launch is an investment as important as feature development, often more.
The three pillars of observability
Observability is the degree to which a system's internal state can be inferred from its external outputs. It is broader than mere "monitoring": by combining metrics, logs, and traces, you can trace "why is it slow, where did it fail" all the way to the root cause.
flowchart LR
M[Metrics<br/>Time-series numbers<br/>CPU/latency/error rate]
L[Logs<br/>Individual events<br/>Access/error/audit]
T[Traces<br/>Per-request paths<br/>Distributed tracing]
Q1{Notice}
Q2{Locate}
Q3{Root cause}
M --> Q1
L --> Q2
T --> Q3
Q1 --> Q2 --> Q3 --> FIX[Fix]
classDef metric fill:#dbeafe,stroke:#2563eb;
classDef log fill:#fef3c7,stroke:#d97706;
classDef trace fill:#fae8ff,stroke:#a21caf;
classDef step fill:#f0f9ff,stroke:#0369a1;
classDef fix fill:#dcfce7,stroke:#16a34a;
class M metric;
class L log;
class T trace;
class Q1,Q2,Q3 step;
class FIX fix;
| Pillar | Purpose | Examples |
|---|---|---|
| Metrics | Time-series numbers | CPU, latency, error rate |
| Logs | Individual event detail | Access logs, error logs, audit logs |
| Traces | Path of a single request | Distributed tracing |
Only with all three in place can you notice "the API is slow" in metrics, locate "which request was slow" in logs, and find "which service caused it" in traces. Missing any one leaves you unable to tell where to start fixing.
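To make the logs-to-traces handoff concrete, here is a minimal sketch in plain Python. The field names (`trace_id`, `latency_ms`, etc.) are illustrative, not any vendor's schema: the point is that every structured log line carries the same trace ID the tracing system propagates, so a slow request found in logs can be followed into its distributed trace.

```python
import json
import time
import uuid

def make_log(level, message, trace_id, **fields):
    """Build one structured log record as a JSON line.

    Carrying the trace ID in every log line is what lets you jump
    from "this request was slow" (logs) to "this service caused it"
    (traces).
    """
    record = {
        "ts": time.time(),
        "level": level,
        "message": message,
        "trace_id": trace_id,  # the same ID the tracing system propagates
        **fields,
    }
    return json.dumps(record)

# Hypothetical slow checkout request
trace_id = uuid.uuid4().hex
line = make_log("ERROR", "checkout failed", trace_id,
                latency_ms=1840, status=500)
print(line)
```

In a real system this ID comes from the tracing SDK's current span context rather than being generated by hand; the linkage principle is the same.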
The four golden signals
These are the four basic indicators Google's SRE (Site Reliability Engineering) team proposed as "what to look at first." Trying to monitor everything breaks down, so anchor on these four.
| Signal | Meaning | Examples |
|---|---|---|
| Latency | Request processing time | API response time, p99 |
| Traffic | Request count, bandwidth | RPS (Requests Per Second), concurrent connections |
| Errors | Failed-request rate | 5xx error rate, exception count |
| Saturation | Resource pressure | CPU/memory/disk usage |
Read latency at P95 / P99, not at the average or P50. Averages hide the "fast for the majority but unbearable for 1%" pattern, and outages usually announce themselves first in that slow 1% tail.
Start with the four signals. “Monitor everything” is impossible.
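The "averages hide the tail" point is easy to verify. The sketch below uses synthetic latencies and a simple nearest-rank percentile (an illustration, not a production metrics library): with 2% of requests taking 5 seconds, the mean looks almost healthy while P99 exposes the problem.

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    index = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[index]

# Synthetic workload: 98 requests at 50 ms, 2 pathological requests at 5 s.
latencies_ms = [50] * 98 + [5000] * 2

print(statistics.mean(latencies_ms))  # 149.0 -> "a bit slow, probably fine"
print(percentile(latencies_ms, 95))   # 50   -> still looks fine
print(percentile(latencies_ms, 99))   # 5000 -> the tail is on fire
```

An on-call engineer watching only the mean would see a modest bump; the P99 panel shows two users waiting five seconds.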
Choosing a monitoring platform
Pick from three families: cloud-native, SaaS, or OSS. The choice is driven by ops headcount, cost tolerance, and your existing stack.
| Product | Strong at | Fits |
|---|---|---|
| CloudWatch / Azure Monitor / GCP Operations | Cloud-native, free tiers | Single cloud, startup |
| Datadog | Integrated, excellent UI | Budget exists, want one pane |
| New Relic | APM-focused | App-performance focus |
| Grafana + Prometheus + Loki | OSS integrated | Cost focus, can self-operate |
| Splunk | Veteran log analytics | Enterprise, large scale |
Datadog is overwhelmingly feature-rich but tends to hit thousands of dollars monthly. The Grafana Stack (OSS) is cheap but needs ops staff. In single-cloud environments, CloudWatch offers the best cost-effectiveness.
Reliability targets to set at design time
Monthly SLO operation belongs to the other category, but at design time set the rough framing of “how much downtime is acceptable for this service?” This determines redundancy level, monitoring investment, and on-call need.
| Service type | Target availability | Monthly downtime budget |
|---|---|---|
| Personal blog, experimental | 99% (hard to miss) | ~7 hours |
| General B2C web | 99.9% | ~43 minutes |
| B2B SaaS (business hours required) | 99.95% | ~22 minutes |
| Finance / payment / healthcare | 99.99%+ | ~4 minutes |
Setting SLA (contract) first locks the design too tight, so the practical order is “set rough SLO (target) first, finalize SLA from sales requirements later.” Detailed SLI selection and burn-rate design go to the other category.
At design time, set only the framing of “how many minutes of downtime is OK.” Refinement happens in another article.
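The downtime budgets in the table are straightforward arithmetic. A quick sketch (assuming a 30-day month) reproduces them:

```python
def monthly_downtime_budget_minutes(availability_pct, days=30):
    """Allowed downtime per month for a given availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.95, 99.99):
    minutes = monthly_downtime_budget_minutes(target)
    print(f"{target}% -> ~{minutes:.0f} min/month")
# 99.0%  -> ~432 min (~7 hours)
# 99.9%  -> ~43 min
# 99.95% -> ~22 min
# 99.99% -> ~4 min
```

Each extra "nine" divides the budget by ten, which is why it roughly multiplies the redundancy and monitoring investment rather than adding to it.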
Phased monitoring-rollout roadmap
"Everything at once" is impossible; phase the buildout. The table below is ordered by how much you would regret not having each layer when an incident hits.
| Phase | Minimum | Target / signal | Annual monitoring cost |
|---|---|---|---|
| 1 MVP | Health check + error notification (Sentry, etc.) | Daily Slack notification is fine | $0 |
| 2 Early ops | + 4 golden signals + structured logs | P95 < 500ms, error rate < 1% | $300+ |
| 3 Scale ops | + Distributed tracing (OTel), SLO / error budget | SLO 99.9%, burn-rate monitoring | $3k+ |
| 4 Multi-service | + AIOps (AI for IT Operations) tools, anomaly detection | False-positive rate <= 5%, MTTR <= 30 min | $30k+ |
| 5 On-call ops | PagerDuty/Opsgenie, runbooks, postmortems | MTTA (acknowledgement time) <= 5 min | + Headcount |
Default alert thresholds: P99 latency above 2x normal -> WARN, above 3x -> PAGE; error rate above 0.5% -> WARN, above 2% -> PAGE. Static thresholds always go stale, so the rule is to switch to SLO-based alerting within 6 months of operation.
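As a sketch only, using the default multipliers quoted above (where `baseline_p99_ms` stands for whatever "normal" means for your service), the static threshold logic is small enough to express directly:

```python
def classify_latency(p99_ms, baseline_p99_ms):
    """Default static rule from the text: >2x normal -> WARN, >3x -> PAGE."""
    if p99_ms > 3 * baseline_p99_ms:
        return "PAGE"
    if p99_ms > 2 * baseline_p99_ms:
        return "WARN"
    return "OK"

def classify_error_rate(rate_pct):
    """Default static rule from the text: >0.5% -> WARN, >2% -> PAGE."""
    if rate_pct > 2.0:
        return "PAGE"
    if rate_pct > 0.5:
        return "WARN"
    return "OK"

print(classify_latency(700, 200))  # 3.5x the normal P99 -> PAGE
print(classify_error_rate(1.0))    # between 0.5% and 2% -> WARN
```

The smallness is the point: these rules are easy to write and just as easy to leave rotting, which is why the text recommends replacing them with SLO-based alerting.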
Stage monitoring rollout with the phase. Datadog at $3k/month for an MVP is obvious over-investment.
Monitoring / alert design traps
What breaks monitoring is more about operational design than product selection. Get this wrong and no product saves you.
| Forbidden move | Why |
|---|---|
| Leaving DEBUG logs in production | CloudWatch Logs / Datadog bills often exceed $10k/month. Filter at INFO and above as the rule |
| Continuing static CPU/memory thresholds (alert at >80%) | Load variation accumulates false positives; nobody looks. Switch to SLO-violation basis |
| Sending all alerts to one Slack channel | Real incident notifications get buried in daily noise. Split notification destinations by severity |
| Runbooks managed in Word / PDF / oral tradition | Tribal, not reproducible. Markdown + Git is the modern standard |
| Postmortem culture that finds the culprit | Information gets hidden, the next incident becomes invisible. Blameless is the rule |
| Not monitoring monitoring-platform cost | Datadog bills explode 5x — recurring stories. Monitoring needs monitoring |
| Microservices without traces | "Which service is slow" can't be identified; MTTR balloons to hours or days |
| Letting alerts pile up unwatched | Same as alerts not firing. Monthly alert review to cull staleness |
| Endless log retention | S3 storage + Athena search costs accumulate. Tiered storage + auto-deletion required |
| Judging latency by the average or P50 | Averages hide the slow 1% of users. Early outage signs show up in P95/P99; missing tail latency means missing problems |
| Thinking more monitoring items = safer system | More items means more alert fatigue and everything gets muted. Value comes from whether action follows, not quantity |
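The "split notification destinations by severity" rule from the table can start as small as a routing map. A minimal sketch (the channel names are hypothetical): the one non-negotiable property is that PAGE-level alerts never share a channel with daily noise.

```python
# Hypothetical Slack channel names; the design rule being illustrated is
# that the page channel contains ONLY things a human must act on now.
ROUTES = {
    "PAGE": "#oncall-pages",     # pager integration, must be acknowledged
    "WARN": "#alerts-warnings",  # reviewed daily, no pager
    "INFO": "#alerts-info",      # dashboards and digests only
}

def route_alert(severity):
    """Pick a destination by severity; unknown severities go to the quiet channel."""
    return ROUTES.get(severity, "#alerts-info")

print(route_alert("PAGE"))
```

Defaulting unknown severities to the quiet channel is deliberate: a misconfigured alert should add noise to the noise channel, not wake someone up.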
The February 28, 2017 AWS S3 outage in us-east-1 was triggered by a classic typo and took down hundreds of dependent services for about 4 hours. Even teams with solid monitoring suffered delayed response because they had no routing for "parent-cloud outage" notifications.
An alert's value is decided by how many people acted on it, not by how many times it fired.
AI decision axes
Monitoring-platform selection lands on two axes: “can it be instrumented with standard protocols like OpenTelemetry?” and “can AI analyze the logs and metrics?”
| AI-era favorable | AI-era unfavorable |
|---|---|
| OpenTelemetry (standard, vendor-neutral) | Vendor-specific instrumentation SDKs |
| Structured logs (JSON, AI-parseable) | Unstructured text logs |
| Runbook / alert definitions in code | Procedures in Word, PDF, oral |
| AIOps-supporting Datadog / New Relic | Closed-UI-only monitoring platforms |
- Instrument the 4 golden signals first (full monitoring is impossible).
- Define SLO / error budget, alert on SLO violations.
- Instrument with OpenTelemetry (avoid lock-in, AI-era ready).
- Systematize on-call and runbooks, no tribal knowledge.
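Burn-rate detail belongs to the other category, but the core idea fits in a few lines. A sketch follows; the 14.4 threshold is Google SRE's commonly cited fast-burn example (consuming 2% of a 30-day budget in a single hour), not something specific to this article's stack.

```python
def burn_rate(observed_error_rate, slo):
    """How fast the error budget is being consumed.

    burn_rate == 1 means the budget lasts exactly the SLO window;
    burn_rate == 14.4 on a 30-day window means roughly 2% of the
    budget disappears every hour.
    """
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget

# 1.44% errors against a 99.9% SLO burns the budget at 14.4x plan.
print(round(burn_rate(0.0144, 0.999), 1))
```

Alerting on burn rate instead of raw error rate is what makes "alert on SLO violations" practical: the same rule works whether the budget is being nibbled or devoured.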
"Mute it or you can't get any work done" alert channel (industry case)
A new engineer’s first day, a senior tells them with a straight face: “Mute this Slack channel or you can’t get any work done.” That channel was a noise pile of dozens of CPU-threshold alerts daily; the entire team had stopped looking.
Months later, a real incident alert sat in the same channel, detection delayed by hours.
Similar scenes recur in many teams. "Nobody looks at this channel" becomes implicit shared knowledge, and only newcomers notice it each time. The lesson: an alert's value is decided by how many people could act on it, not by how many times it fired.
A channel nobody watches is the same as no channel. Cull noise, alert only on SLO violations — the courage to commit to that design saves the team in the long run.
“Alerts you’re scared to delete” are already stale.
What to decide (design stage)
- Monitoring-platform direction (CloudWatch / Datadog / Grafana family)
- Log aggregation and retention (30 days / 90 days / annual)
- Distributed tracing (OpenTelemetry adoption)
- Reliability target (rough SLO, allowable downtime)
- Monitoring cost ceiling
Alert design, SLO operation, on-call, runbooks, and postmortems are decided in the “DevOps Architecture” category articles.
Summary
This article covered the monitoring and operations overall map at the system-architecture level.
The default order: 4 golden signals -> 3 observability pillars -> SLO-based alerts. “Monitor everything” is impossible — phase the rollout and have the courage to delete noise alerts.
The next article covers BCP (business continuity planning, RPO/RTO, DR strategy).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (14/89)