[DevOps Architecture] Monitoring and Observability - Three Pillars + OpenTelemetry + SLO Alerts

About this article

As the ninth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains monitoring and observability.

Monitoring answers “is it running?”; observability answers “why did it break?” This article covers the three pillars of observability (Metrics/Logs/Traces), the four golden signals (Latency/Traffic/Errors/Saturation), OpenTelemetry, AIOps, and operational implementation. Monitoring requirements at the system-architecture stage are covered in the separate “System Architecture” article.

Why it’s needed

Damage expands when you don’t notice incidents

Many organizations still first learn of incidents from user complaints. Without monitoring, by the time you notice, the incident has already grown into a major one.

Cause-tracking is hard in microservices

In a system where 100 services interact, pinpointing where slowness originates requires an observability foundation such as distributed tracing.

Numerical SLO/SLA

To promise “99.9% uptime,” you need a mechanism that measures actual uptime. The principle: what can’t be measured can’t be defended.
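To make the “99.9%” figure concrete, the following is a minimal sketch that converts an SLO target into allowed downtime. The function name and the 30-day window are assumptions chosen for illustration, not something the article prescribes.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


if __name__ == "__main__":
    # Assumed targets, for comparison only.
    for target in (0.99, 0.999, 0.9999):
        minutes = downtime_budget_minutes(target)
        print(f"{target:.2%} -> {minutes:.1f} min per 30 days")
```

At 99.9% over 30 days this yields about 43.2 minutes, which is the budget that has to be measured before it can be defended.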

Three Pillars

The three fundamental data types of observability. No single one is sufficient; the whole picture appears only when all three are combined. The “three pillars” framing is now the classic one, and current practice increasingly adds Events and Profiles as further signals.

flowchart TB
    APP([App execution])
    APP --> METRIC[Metrics<br/>numeric time series<br/>CPU/error rate/latency]
    APP --> LOG[Logs<br/>event records<br/>access/error details]
    APP --> TRACE[Traces<br/>request paths<br/>distributed tracing]
    METRIC --> Q1[What's happening?]
    LOG --> Q2[What was recorded?]
    TRACE --> Q3[How did it move?]
    Q1 --> ANSWER([Whole picture<br/>= Observability])
    Q2 --> ANSWER
    Q3 --> ANSWER
    classDef app fill:#fef3c7,stroke:#d97706;
    classDef m fill:#dbeafe,stroke:#2563eb;
    classDef l fill:#fae8ff,stroke:#a21caf;
    classDef t fill:#dcfce7,stroke:#16a34a;
    classDef goal fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    class APP app;
    class METRIC,Q1 m;
    class LOG,Q2 l;
    class TRACE,Q3 t;
    class ANSWER goal;

Type	Content	Representatives
Metrics	Numeric time-series data	Prometheus, Datadog
Logs	String event records	Loki, Elasticsearch
Traces	Request paths	Jaeger, Tempo

Metrics show “what’s happening,” logs show “what was recorded,” traces show “how it moved.”

Metrics

Numeric time-series data such as CPU usage, request count, error rate, and latency. Metrics are storage-efficient and well suited to aggregation and alerting, which makes them the basis of monitoring.

Typical metrics	Content
System	CPU, memory, disk I/O, network
App	Request count, error rate, latency
Business	Order count, signup count, revenue
USE	Utilization, Saturation, Errors
RED	Rate, Errors, Duration

USE and RED are well-known metric-design frameworks that serve as guides for “what to measure.” USE is aimed at resources, RED at request-driven services.
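As an illustration of RED, here is a minimal sketch that derives Rate, Errors, and Duration from raw request records for one observation window. The `RequestRecord` type, the field names, and the choice of median and max for Duration are assumptions made for this example; a real setup would normally take these values from a metrics backend.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(records: list[RequestRecord], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, Duration for the requests seen in one window."""
    if not records:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "median_ms": 0.0, "max_ms": 0.0}
    durations = [r.duration_ms for r in records]
    # Only 5xx responses are counted as errors in this sketch.
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_per_s": len(records) / window_seconds,
        "error_ratio": errors / len(records),
        "median_ms": statistics.median(durations),
        "max_ms": max(durations),
    }


if __name__ == "__main__":
    sample = [
        RequestRecord(10, 200),
        RequestRecord(20, 200),
        RequestRecord(30, 500),
        RequestRecord(400, 200),
    ]
    print(red_summary(sample, window_seconds=2.0))
```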

Logs

Time-stamped text events recording detailed information emitted by apps. Structured logs (JSON format) are the modern standard and can be correlated with metrics and traces. Logs are covered in detail in the next article.
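A structured log line might be produced roughly as follows. This is a sketch using only Python’s standard `logging` module; the field names (`timestamp`, `level`, `trace_id`, and so on) and the logger name `order-api` are assumptions for illustration, not a required schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via `extra=`; it is None when the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "order-api") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order accepted", extra={"trace_id": "order-abc123"})
```

Carrying the trace ID in every line is what allows a log entry to be joined with the corresponding trace later.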

Log type	Use case
App logs	Business-processing records
Access logs	HTTP requests
Audit logs	Permission ops, important changes
System logs	OS / middleware events

Logs are produced in massive volume, so storage cost easily becomes a problem; retention period, compression, and sampling strategy are ongoing operational concerns.
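One possible sampling strategy is sketched below: keep every warning-or-worse line and keep a stable fraction of the rest, decided by hashing the trace ID so that all lines of a sampled request stay together. The function name, the level set, and the 10% default are assumptions for illustration.

```python
import hashlib

ALWAYS_KEEP = {"WARNING", "ERROR", "CRITICAL"}


def should_keep(level: str, trace_id: str, sample_rate: float = 0.1) -> bool:
    """Decide whether a log line is stored.

    Warning-or-worse lines are always kept. Other lines are kept for a
    deterministic fraction of trace IDs, so the decision is the same for
    every line that shares a trace ID.
    """
    if level in ALWAYS_KEEP:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < sample_rate


if __name__ == "__main__":
    kept = sum(should_keep("INFO", f"trace-{i}") for i in range(10_000))
    print(f"kept {kept} of 10000 INFO lines")
```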

Traces

Data that tracks a single request as it passes through multiple services. In microservices, with chains like “the order API calls the inventory API and the payment API,” you can visualize where it’s slow and where it failed.

A trace consists of Spans (individual units of work) linked by a shared Trace ID, expressing the whole flow as a DAG (Directed Acyclic Graph). The representative technology is OpenTelemetry (the industry standard), which sends data to backends such as Jaeger, Tempo, and Datadog APM (Application Performance Monitoring).

Trace: order-abc123
 |- Span: POST /order (100ms)
 |   |- Span: check stock (30ms)
 |   |- Span: charge card (50ms)
 |       |- Span: Stripe API (45ms)
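The trace above can be modeled as a small tree to show how a tracing backend answers “where is it slow?” The sketch below is not the OpenTelemetry data model; the `Span` class and the self-time calculation are simplified assumptions (child spans are assumed to run sequentially).

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time spent in this span itself, excluding its direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root: Span) -> Span:
    return max(walk(root), key=Span.self_time_ms)


# Same shape as the order-abc123 trace shown above.
trace = Span("POST /order", 100, [
    Span("check stock", 30),
    Span("charge card", 50, [Span("Stripe API", 45)]),
])

if __name__ == "__main__":
    for s in walk(trace):
        print(f"{s.name}: total={s.duration_ms}ms self={s.self_time_ms()}ms")
    print("slowest:", slowest_by_self_time(trace).name)
```

In this example most of the time sits in the external Stripe call rather than in the order API itself, which is the kind of conclusion that metrics alone cannot give.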

Main tools (cloud-native OSS)

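For OSS-based observability, Prometheus + Grafana + Loki + Tempo + OpenTelemetry is the de facto standard combination. Prometheus and OpenTelemetry are CNCF (Cloud Native Computing Foundation) projects, while Grafana, Loki, and Tempo are open source from Grafana Labs; all are actively developed, so the stack has good long-term prospects.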

Tool	Role
Prometheus	Metric collection, storage, alerting
Grafana	Visualization dashboards
Loki	Log aggregation (Grafana Labs)
Tempo	Distributed tracing (Grafana Labs)
OpenTelemetry	Standard for measurement-data collection
Alertmanager	Alert notifications

Loki, Grafana, Tempo, and Mimir are bundled as the LGTM stack, which has been spreading rapidly in recent years.

Main tools (SaaS)

If you don’t want to build the foundation in-house, observability SaaS is the quick choice. Pricing is high, but you get a feature-rich monitoring foundation with near-zero operational burden.

Service	Characteristics
Datadog	Strongest features, expensive
New Relic	Veteran, all-feature integrated
Splunk	King of log analytics
Dynatrace	AI-driven auto-analysis
Honeycomb	Strong on high cardinality
Grafana Cloud	Managed LGTM stack

Datadog has the strongest feature set, but bills often reach hundreds of thousands to millions of yen per month; at small scale, Grafana Cloud or the New Relic free tier is the realistic choice.

OpenTelemetry (measurement standard)

The industry standard that unifies collection of metrics, logs, and traces. Born in 2019 from the merger of OpenTracing and OpenCensus, it is a CNCF project (accepted at the Incubating level in 2021) and has become the common standard for telemetry.

With OpenTelemetry, instrumentation code is written vendor-neutrally. Switching from Datadog to Grafana requires no rewrite of app-side code; you only change the export destination on the Collector side, so vendor lock-in is avoided at the implementation level rather than only on paper.

Furthermore, a unified instrumentation API makes it easy to use multiple tools together. For example: production metrics to Datadog, long-term log archive to S3 + Loki, dev-env traces to Jaeger, all distributed by use case from the same instrumentation code. Vendor-specific SDKs require redoing instrumentation per tool, so this difference becomes non-negligible at scale.
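The separation between instrumentation and destination can be sketched conceptually as below. This is deliberately not the OpenTelemetry API or the Collector configuration format; `emit`, `handle_order`, the routing dictionaries, and the in-memory sinks are all hypothetical names used only to show that the application code stays unchanged when the routing changes.

```python
from collections import defaultdict

# Hypothetical routing tables: signal type -> backend names.
ROUTES_BEFORE = {"metrics": ["datadog"], "traces": ["datadog"]}
ROUTES_AFTER = {"metrics": ["grafana"], "traces": ["tempo"]}


def emit(signal: str, payload: dict, routes: dict, sinks: dict) -> None:
    """Called by application code; it has no knowledge of specific backends."""
    for backend in routes.get(signal, []):
        sinks[backend].append(payload)


def handle_order(routes: dict, sinks: dict) -> None:
    # The instrumentation calls are identical whichever routing table is active.
    emit("metrics", {"name": "orders_total", "value": 1}, routes, sinks)
    emit("traces", {"span": "POST /order", "duration_ms": 100}, routes, sinks)


if __name__ == "__main__":
    for label, routes in (("before", ROUTES_BEFORE), ("after", ROUTES_AFTER)):
        sinks = defaultdict(list)
        handle_order(routes, sinks)
        print(label, {name: len(items) for name, items in sinks.items()})
```

Only the routing table differs between the two runs, which mirrors the idea of swapping the export destination on the Collector side.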

Feature	Content
Vendor-neutral	Send to any backend
Language support	Almost all major languages
Auto-instrumentation	Supports major frameworks
Unified measurement	Metrics + logs + traces integrated

For new builds, OpenTelemetry is the top candidate because it avoids vendor lock-in.

Dashboards and alerts

Collecting monitoring data alone is meaningless; it becomes operationally useful only with visualization and alerting. Standard practice is to get an overview of system state in Grafana or a SaaS dashboard and to notify via Slack or PagerDuty when thresholds are exceeded.

Dashboard type	Content
Service health	Per-API state, error rate
Infrastructure	CPU, memory, network
Business	Revenue, user count, conversion
SLO	SLI actual vs target

Alerts face a dilemma: too few and you miss incidents, too many and the team is paralyzed. Routing by importance (Slack for warnings, PagerDuty for critical) therefore matters.

Alert design

A good alert satisfies three conditions: (1) a response is definitely needed, (2) a response is possible, and (3) action is needed now. Alerts that miss any of these produce alert fatigue and cause real critical events to be missed.

Good alerts	Bad alerts
SLO violation	CPU over 80% (auto-recovers)
Sudden error rate spike	One-shot error
Clear user impact	“Somehow slow”
Has response procedures	Unclear who does what

Alerts should fire on user impact, not on internal resource signals such as CPU. High CPU by itself does not necessarily affect users.

“Is the machine suffering?” and “are humans suffering?” are different questions, which is the standard lesson of monitoring design. A commonly heard case: a team sets a CPU-usage-80% alert to feel safe, no dashboard turns red, and meanwhile Twitter fills with “can’t log in” reports. The typical cause is DB connection pool exhaustion: CPU is actually idle while every request waits inside the app for a connection.

SLO-based alert implementation example

The modern approach to SLO-based alerting is to fire on the speed at which the error budget is being consumed (burn rate). Rather than reacting to a simple threshold breach, it detects “at this pace, the budget will be exhausted,” and the Google SRE Workbook recommends it as the standard method.

Severity	Condition	Destination
Critical (immediate response)	2% budget consumed in 1 hour (burn rate > 14.4x)	PagerDuty
High (within hours)	5% budget consumed in 6 hours (burn rate > 6x)	PagerDuty
Warning (within business hours)	10% budget consumed in 3 days (burn rate > 1x)	Slack

# Prometheus alerting rule (availability SLO 99.9%, 30-day error budget 43.2 min)
# Page only when both the 1h and 5m windows exceed a 14.4x burn rate.
- alert: ErrorBudgetBurnRateCritical
  expr: |
    (1 - availability_slo:ratio_rate1h) > (14.4 * 0.001)
    and
    (1 - availability_slo:ratio_rate5m) > (14.4 * 0.001)
  for: 2m
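The burn-rate thresholds in the table follow from simple arithmetic, sketched here in Python. A 30-day SLO period is assumed, and the function names are illustrative. `should_page` mirrors the two-window condition of the rule above: both the long and the short window must exceed the threshold.

```python
def burn_rate(budget_fraction: float, window_hours: float, slo_period_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the error budget is used up in `window_hours`."""
    return budget_fraction * (slo_period_days * 24) / window_hours


def should_page(error_ratio_long: float, error_ratio_short: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Multi-window check: fire only if both windows burn faster than the threshold."""
    limit = threshold * (1.0 - slo)  # e.g. 14.4 * 0.001 = 1.44% errors
    return error_ratio_long > limit and error_ratio_short > limit


if __name__ == "__main__":
    for fraction, hours in ((0.02, 1), (0.05, 6), (0.10, 72)):
        rate = burn_rate(fraction, hours)
        print(f"{fraction:.0%} of budget in {hours}h -> burn rate {rate:.1f}x")
    # 2% errors over the last hour and the last 5 minutes: page.
    print(should_page(0.02, 0.02))
    # The hour looks bad but the last 5 minutes have recovered: do not page.
    print(should_page(0.02, 0.001))
```

Requiring the short window as well is what stops the alert from continuing to fire after the problem has already ended.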

“Fire when CPU exceeds 80%” is outdated. Firing on burn rate is the modern standard.

Decision criterion 1: system scale and complexity

Observability heaviness is decided by system complexity. A monolithic single server is fine with lightweight monitoring; microservices need a full set.

System scale	Recommended
Single server / monolith	CloudWatch / Datadog lightweight plan
Few services	Prometheus + Grafana
Microservices (10-50)	LGTM + OpenTelemetry
Large (100+)	Datadog / Dynatrace enterprise

Decision criterion 2: operations team skills

OSS stacks have heavy operations and need a dedicated team. With SaaS, operational burden decreases but cost spikes.

Team regime	Recommended
No dedicated SRE	SaaS (Datadog / New Relic)
Few SREs	Grafana Cloud
Rich SREs / cost-cutting-oriented	In-house LGTM + OpenTelemetry

How to choose by case

Personal dev / small SaaS

Cloud standards (CloudWatch / Cloud Monitoring) + UptimeRobot. Runs from thousands of yen monthly. Add Grafana Cloud free tier if you want custom dashboards.

Startup / 0-1 SREs

Free tier of Datadog or New Relic + OpenTelemetry SDK. To minimize operational load, SaaS is the only realistic choice. If you instrument with OTel, you can switch backends in the future. Billing spikes at scale, so revisit the choice once you pass 100 hosts.

Mid-size SaaS / microservices-ization

Grafana Cloud (LGTM) + OpenTelemetry. There’s LGTM-stack learning cost, but considerably cheaper than Datadog. Operable with 2-3 SREs.

Large enterprise / can self-operate

Self-built Prometheus + Grafana + Loki + Tempo + OpenTelemetry. This composition avoids vendor lock-in and keeps confidential data in-house. Operations need 5+ dedicated SREs.

Common misconceptions

Take all logs and you’re fine

You enter log hell. Sampling, structuring, and per-importance design are needed - log volume is money.

Monitoring CPU spikes is enough

CPU spikes sometimes don’t impact users. User-impact-based (SLO) is the correct approach.

Email alerts to everyone

No one looks anymore. Should reach responders via appropriate channels - on-call design matters.

OpenTelemetry is new, wait and see

It is already the industry standard. For new adoption, OTel is the default choice; vendor-specific SDKs are becoming outdated.

Monitoring-cost / alert-operation numerical gates

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For monitoring, the key to operations is not “install and feel safe” but tracking cost and signal quality numerically.

Metric	Recommended	What to do if exceeded
Monthly monitoring foundation cost	5-10% of infrastructure cost	Sampling, retention shortening
Production DEBUG-log output	Forbidden	Narrow to INFO+
Alerts fired / week	10 or fewer	Noise reduction, move to SLO-based
Alert response rate	90%+	Delete alerts that fire but nobody acts on
MTTA (Mean Time To Acknowledge)	Within 5 min	Review on-call regime
MTTR (Mean Time To Resolve)	Within 30 min	Maintain Runbooks
Lighthouse Performance score (front-end)	90 or more	Review front-end performance metrics
Structured-log rate	100%	Don’t accept non-JSON
OpenTelemetry adoption rate	100% for new	Avoid vendor-specific SDKs

More than 10 alerts per week is a sign that alerts are turning into noise. A Datadog bill over $3k/month suggests over-investment at startup scale. Leaving DEBUG logs on in production can easily push CloudWatch Logs past $10k/month, which is a typical accident.

Treat 10% of infrastructure cost as the upper bound for monitoring cost. If it is exceeded, cut via sampling and shorter retention.
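Some of these gates can be checked mechanically, for example in a monthly review script. The sketch below encodes four of the thresholds from the table; the `MonitoringStats` fields and the function name are assumptions, and the input numbers would have to come from your own billing and on-call data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringStats:
    monitoring_cost: float          # monthly, same currency as infrastructure_cost
    infrastructure_cost: float      # monthly
    alerts_per_week: int
    alerts_acted_on_per_week: int
    mtta_minutes: float


def gate_violations(s: MonitoringStats) -> list[str]:
    """Return a description of every gate from the table that is exceeded."""
    problems = []
    if s.monitoring_cost > 0.10 * s.infrastructure_cost:
        problems.append("monitoring cost above 10% of infrastructure cost")
    if s.alerts_per_week > 10:
        problems.append("more than 10 alerts per week")
    if s.alerts_per_week and s.alerts_acted_on_per_week / s.alerts_per_week < 0.90:
        problems.append("alert response rate below 90%")
    if s.mtta_minutes > 5:
        problems.append("MTTA above 5 minutes")
    return problems


if __name__ == "__main__":
    noisy = MonitoringStats(1500, 10000, 25, 10, 12)
    for problem in gate_violations(noisy):
        print("-", problem)
```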

Monitoring-design pitfalls and forbidden moves

Typical accident patterns in monitoring. All share the same structure: configured but not actually operating.

Forbidden move	Why it’s bad
Alert by CPU/memory static threshold (over 80% etc.)	Load fluctuations pile up false positives until the alert fires but nobody looks
Continue outputting DEBUG-level logs in production	CloudWatch Logs / Datadog billing exceeds $10k/month
Send all alerts to one Slack channel	Real incident notifications buried in daily noise
Track performance only by Average	The slow 1% of users invisible. Track with P95/P99
Manage Runbooks in Word/PDF/verbally	Person-locked, irreproducible, AI can’t run them
Hunt for blame in postmortems	Hotbed of info-hiding. Blameless is the rule
Don’t monitor monitoring foundation cost	Bills such as Datadog jumping 5x are common; the monitoring itself needs monitoring
Operate microservices without traces	Can’t identify which service is slow, MTTR becomes hours
Instrument with custom SDK	Vendor lock-in, full rewrite on future migration
Don’t conduct alert reviews	Alerts that no longer mean anything pile up; a monthly review is what removes them
Operate monitoring and production on same network	Same structure as the October 2021 Facebook 6-hour outage (a BGP configuration error also took the monitoring tools offline)

The October 2021 Facebook/Instagram outage lasted about 6 hours. A BGP configuration error made servers unreachable from outside, and recovery was delayed because internal monitoring tools and building access systems all depended on the same network. With an estimated loss of $60M+, it exposed the structural problem of having no monitoring of the monitoring itself.
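The “track performance only by Average” row in the table can be illustrated with a few lines of arithmetic. The latency values below are made up for the example, and `percentile` uses the simple nearest-rank method rather than any particular backend’s algorithm.

```python
import math
import statistics


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Assumed sample: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [3000] * 2

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p99: ", percentile(latencies_ms, 99))
```

The mean stays at roughly 109 ms and the median at 50 ms, while P99 shows the 3-second experience that the slowest users actually had.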

Monitoring should fire on “human pain”: SLO violations, not CPU at 80%.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, observability is evolving into an area where AI automatically detects and diagnoses anomalies. AI-driven monitoring such as Datadog Watchdog and Dynatrace Davis goes beyond legacy threshold-based alerting and discovers anomalies automatically.

Favored in the AI era	Disfavored in the AI era
High cardinality support (Honeycomb)	Only fixed metrics
OpenTelemetry standard	Custom SDK
Structured logs AI can read	Non-structured string logs
SRE Agent / LLM (Large Language Model) integration	Manual cause-tracking

For AI to investigate incidents, traces and structured logs are required; AI cannot do much with raw string logs alone. Emitting telemetry in a form AI can read is the new standard.

The future of observability is AI agents operating it. Data structure is everything.

Author’s note - cases of “all-green dashboards” with fires beneath

“We were monitoring but still didn’t notice” cases have become standard talking points among operations teams.

At one SaaS, the CPU, memory, and network dashboards all stayed green while hundreds of “can’t log in” reports flowed on Twitter. The cause was DB connection pool exhaustion: CPU was actually idle while every request waited inside the app. The trap was feeling safe from infrastructure metrics alone; the root cause was not measuring SLOs (user impact).

Another is the October 2021 Facebook/Instagram 6-hour outage. A BGP configuration error made servers unreachable from outside, and because internal monitoring tools and building access systems all depended on the same network, engineers could not get into the data center and recovery was delayed. It is often told as a case of “no monitoring of monitoring.” The estimated loss was $60M+ in ad revenue alone.

In both cases the fatal blow was a design gap, in “what to measure” and in “monitoring system independence” respectively. They drive home that firing on user impact and separating the monitoring system from what it monitors are both required.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work with these left vague invites later questions like “why did we decide this again?”

Measurement SDK (OpenTelemetry recommended / vendor-specific)
Backend (OSS LGTM / Datadog / New Relic)
Metric design (USE, RED, SLI)
Log strategy (collection scope, retention)
Distributed tracing (all requests / sampling)
Alert design (SLO-based, channel separation)
Dashboards (service health, business)

How to make the final call

The core of monitoring and observability is understanding the difference in roles: monitoring answers “is it running?” and observability answers “why did it break?” A monolith can get by with legacy monitoring, but once you move to microservices, the three pillars of Metrics/Logs/Traces and unified measurement via OpenTelemetry become required. Base alert design on user impact such as SLO violations rather than on internal resource signals such as CPU spikes; this is the key to sustainable operations.

Another decisive axis is whether the data is structured so that AI agents can operate on it. A foundation with OpenTelemetry, structured logs, and high-cardinality support lets AI auto-diagnose incidents and lets you adopt new operational models such as Datadog Watchdog and SRE Agent. Custom SDKs, unstructured string logs, and fixed-metric-only setups become outdated in the AI era.

Selection priorities

Unify measurement with OpenTelemetry - vendor-neutral, prepare for future backend switching
Decide SaaS vs OSS by ops regime - no dedicated SRE → Datadog, few → Grafana Cloud, rich → in-house LGTM
Alerts on user-impact basis - fire on SLO violations, don’t fire on CPU spike alone
Structured data AI can read - JSON / Traces / high cardinality, prepare for AI diagnosis

“Unify three pillars with OpenTelemetry.” Align measurement in form AI can read, fire on user impact.

Summary

This article covered monitoring and observability, including the 3 pillars, OpenTelemetry, main tools, SLO burn-rate alerts, and structured data for AI diagnosis.

Unify measurement with OpenTelemetry, decide SaaS vs OSS by ops regime, alerts on user-impact basis, organize structured data AI can read. That is the practical answer for monitoring and observability in 2026.

Next time we’ll cover log design (structured logs, PII protection, retention).

I hope you’ll read the next article as well.