DevOps Architecture

[DevOps Architecture] Monitoring and Observability - Three Pillars + OpenTelemetry + SLO Alerts

About this article

As the ninth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains monitoring and observability.

Monitoring answers "is it running?"; observability answers "why did it break?" This article covers the three pillars of observability (Metrics/Logs/Traces), the four golden signals (Latency/Traffic/Errors/Saturation), OpenTelemetry, AIOps, and operational implementation (monitoring requirements at the system-architecture stage live in the separate "System Architecture" article).

  • DevOps Architecture Overview - One Pipeline for Build, Ship, and Run (en.senkohome.com/arch-intro-devops-overview/)
  • [DevOps Architecture] DevOps and SRE Overview - Speed and Stability Coexist (en.senkohome.com/arch-intro-devops-sre/)
  • [DevOps Architecture] Version Control - Git + Monorepo + GitHub Flow Is the Standard (en.senkohome.com/arch-intro-devops-vcs/)
  • [DevOps Architecture] Dev Environment and Local Execution - Half a Day to First Commit (en.senkohome.com/arch-intro-devops-devenv/)
  • [DevOps Architecture] Code Review - PR 300 Lines + 1 Approver + CODEOWNERS (en.senkohome.com/arch-intro-devops-review/)
  • [DevOps Architecture] Test Design - Pyramid + Testcontainers + Branch Coverage (en.senkohome.com/arch-intro-devops-test/)
  • [DevOps Architecture] CI/CD - GitHub Actions + OIDC + Feature Flag Is the Standard (en.senkohome.com/arch-intro-devops-cicd/)
  • [DevOps Architecture] Deploy Strategy - Raise Frequency, Lower Risk (en.senkohome.com/arch-intro-devops-deploy/)
  • [DevOps Architecture] Log Design - Structured JSON + No PII + Phased Cold-Tiering (en.senkohome.com/arch-intro-devops-logging/)
  • [DevOps Architecture] SLO and SLI - Don't Pursue 100%, Buy Speed With Error Budget (en.senkohome.com/arch-intro-devops-slo/)
  • [DevOps Architecture] Incident Response - Resolve via Mechanism, Not Heroes (en.senkohome.com/arch-intro-devops-incident/)
  • [DevOps Architecture] SRE Practices - Toil Reduction and Chaos Drills (en.senkohome.com/arch-intro-devops-sre-practice/)
  • [DevOps Architecture] Documentation - Lean README + ADR + OpenAPI Toward Git (en.senkohome.com/arch-intro-devops-docs/)
  • [DevOps Architecture] Ticket and Project Management - Epic/Story/Task + 1-Day Granularity (en.senkohome.com/arch-intro-devops-ticket/)

Why it’s needed

Damage expands when you don’t notice incidents

Many organizations still learn of incidents first from user complaints. Without monitoring, by the time you notice, it is already a major incident.

Cause-tracking is hard in microservices

In systems where 100 services interact, identifying where the slowness originates requires an observability foundation like distributed tracing.

Numerical SLO/SLA

To promise "99.9% uptime," you need a mechanism that measures actual performance. The principle: what can't be measured can't be defended.

Three Pillars

The three fundamental data types of observability. Any one alone is insufficient; the whole picture appears only when all three are combined. The "three pillars" framing is now a classic, and the modern mainstream adds Events and Profiles for a more multi-faceted approach.

flowchart TB
    APP([App execution])
    APP --> METRIC[Metrics<br/>numeric time series<br/>CPU/error rate/latency]
    APP --> LOG[Logs<br/>event records<br/>access/error details]
    APP --> TRACE[Traces<br/>request paths<br/>distributed tracing]
    METRIC --> Q1[What's happening?]
    LOG --> Q2[What was recorded?]
    TRACE --> Q3[How did it move?]
    Q1 --> ANSWER([Whole picture<br/>= Observability])
    Q2 --> ANSWER
    Q3 --> ANSWER
    classDef app fill:#fef3c7,stroke:#d97706;
    classDef m fill:#dbeafe,stroke:#2563eb;
    classDef l fill:#fae8ff,stroke:#a21caf;
    classDef t fill:#dcfce7,stroke:#16a34a;
    classDef goal fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    class APP app;
    class METRIC,Q1 m;
    class LOG,Q2 l;
    class TRACE,Q3 t;
    class ANSWER goal;

| Type | Content | Representative tools |
| --- | --- | --- |
| Metrics | Numeric time-series data | Prometheus, Datadog |
| Logs | String event records | Loki, Elasticsearch |
| Traces | Request paths | Jaeger, Tempo |

Metrics show “what’s happening,” logs show “what was recorded,” traces show “how it moved.”

Metrics

Numeric time-series data recording CPU usage, request count, error rate, latency, and so on. Storage-efficient and well suited to aggregation and alerting - the foundation of monitoring.

| Typical metrics | Content |
| --- | --- |
| System | CPU, memory, disk I/O, network |
| App | Request count, error rate, latency |
| Business | Order count, signup count, revenue |
| USE | Utilization, Saturation, Errors |
| RED | Rate, Errors, Duration |

USE/RED are famous metric-design frameworks used as guides for “what to measure.”
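
To make RED concrete, here is a minimal sketch that exposes Rate, Errors, and Duration from a Python service with the prometheus_client library. The metric names, label set, and port are illustrative assumptions, not fixed conventions.

```python
# Minimal sketch: exposing RED-style metrics (Rate, Errors, Duration)
# with the Python prometheus_client library. Names and port are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

def handle_order() -> None:
    start = time.time()
    try:
        ...  # business logic goes here
        REQUESTS.labels(path="/order", status="200").inc()   # Rate
    except Exception:
        REQUESTS.labels(path="/order", status="500").inc()   # Errors
        raise
    finally:
        LATENCY.labels(path="/order").observe(time.time() - start)  # Duration

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    handle_order()
```

Prometheus scrapes /metrics from this port, and the RED dashboards and alerts are built on top of these two series.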

Logs

Time-stamped text events recording detailed info output by apps. Structured logs (JSON format) are the modern standard, linkable with metrics and traces. Logs are covered in detail in the next article.

| Log type | Use case |
| --- | --- |
| App logs | Business-processing records |
| Access logs | HTTP requests |
| Audit logs | Permission ops, important changes |
| System logs | OS / middleware events |

Logs are generated in massive volumes, so storage cost quickly becomes a problem - retention periods, compression, and sampling strategies are all operational topics.
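
Log design is the next article's topic, but as a minimal sketch of what "structured JSON logs" means in practice, the following uses only the Python standard library; the field names (severity, order_id, etc.) are illustrative, not a prescribed schema.

```python
# Minimal sketch: one JSON object per log event, standard library only.
# Field names are illustrative, not a prescribed schema.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # structured extras, if any
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # no DEBUG output in production

logger.info("order accepted", extra={"fields": {"order_id": "abc123", "amount": 4200}})
```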

Traces

Data that tracks a single request as it passes through multiple services. In microservices, with chains like "the order API calls the inventory API and the payment API," traces let you see where it was slow and where it failed.

A trace consists of spans (individual processing steps) connected by a trace ID, expressing the whole flow as a DAG (Directed Acyclic Graph). The representative technology is OpenTelemetry (the industry standard), with data sent to Jaeger, Tempo, Datadog APM (Application Performance Monitoring), and so on.

Trace: order-abc123
 |- Span: POST /order (100ms)
 |   |- Span: check stock (30ms)
 |   |- Span: charge card (50ms)
 |       |- Span: Stripe API (45ms)
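
As a sketch of how such a trace is produced in code, the following uses the OpenTelemetry Python SDK to create the same parent and child spans; the console exporter stands in for a real backend such as Jaeger or Tempo.

```python
# Sketch: producing the trace above with the OpenTelemetry Python SDK.
# The console exporter stands in for a real backend (Jaeger, Tempo, etc.).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("POST /order"):        # parent span
    with tracer.start_as_current_span("check stock"):
        pass  # call the inventory API
    with tracer.start_as_current_span("charge card"):
        with tracer.start_as_current_span("Stripe API"):
            pass  # the external call becomes a child span
```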

Main tools (cloud-native OSS)

For OSS-based observability, Prometheus + Grafana + Loki + Tempo + OpenTelemetry is the de facto standard combination. The ecosystem develops around the CNCF (Cloud Native Computing Foundation), and its future prospects are strong.

| Tool | Role |
| --- | --- |
| Prometheus | Metric collection, storage, alerting |
| Grafana | Visualization dashboards |
| Loki | Log aggregation (Grafana Labs) |
| Tempo | Distributed tracing (Grafana Labs) |
| OpenTelemetry | Standard for measurement-data collection |
| Alertmanager | Alert notifications |

Packaged together as the LGTM stack (Loki, Grafana, Tempo, Mimir), this combination has been spreading rapidly in recent years.

Main tools (SaaS)

If you don't want to build the foundation in-house, observability SaaS is the quick option. Pricing is high, but you get a full-featured monitoring foundation with near-zero operational burden.

| Service | Characteristics |
| --- | --- |
| Datadog | Strongest feature set, expensive |
| New Relic | Veteran, all features integrated |
| Splunk | King of log analytics |
| Dynatrace | AI-driven auto-analysis |
| Honeycomb | Strong on high cardinality |
| Grafana Cloud | Managed LGTM stack |

Datadog has the strongest feature set, but the bill often reaches hundreds of thousands to millions of yen per month - at small scale the realistic choices are Grafana Cloud or the New Relic free tier.

OpenTelemetry (measurement standard)

The industry standard unifying collection of metrics, logs, and traces. Born in 2019 from the merger of OpenTracing and OpenCensus, advancing from CNCF Incubating to Graduated - the modern common standard.

With OpenTelemetry, instrumentation code can be written vendor-neutrally. Even when switching from Datadog to Grafana, there is no need to rewrite application-side code - you only swap the send destination on the Collector side, so vendor-lock-in avoidance works at the implementation level, not as a theoretical argument.

Furthermore, a unified instrumentation API makes it easy to use multiple tools together. For example: production metrics to Datadog, long-term log archives to S3 + Loki, dev-environment traces to Jaeger - all fed from the same instrumentation code and distributed by use case. Vendor-specific SDKs require redoing instrumentation per tool, so this difference becomes non-negligible at scale.

| Feature | Content |
| --- | --- |
| Vendor-neutral | Send to any backend |
| Language support | Almost all major languages |
| Auto-instrumentation | Supports major frameworks |
| Unified measurement | Metrics + logs + traces integrated |

For new builds, OpenTelemetry is the top candidate. Avoids vendor lock-in.
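
The "instrument once, swap backends on the Collector side" setup described above looks roughly like this in the Python SDK. This is a sketch that assumes a local OpenTelemetry Collector listening on the default OTLP/gRPC port; the service name and endpoint are placeholders.

```python
# Sketch: the app always exports OTLP to a local Collector; which backend
# (Datadog, Grafana, Jaeger, ...) receives the data is decided in the
# Collector config, not here. Endpoint and service name are placeholders.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

resource = Resource.create({"service.name": "order-api"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Application code only ever touches this vendor-neutral API.
tracer = trace.get_tracer(__name__)
```

Switching from Datadog to Grafana (or running both) then only changes the Collector's exporter configuration, never this code.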

Dashboards and alerts

Just collecting monitoring data is meaningless - it becomes operationally useful only with visualization and alerting. The standard operation is to get an overview of system state on Grafana or each SaaS dashboard, and to notify via Slack or PagerDuty when thresholds are exceeded.

| Dashboard type | Content |
| --- | --- |
| Service health | Per-API state, error rate |
| Infrastructure | CPU, memory, network |
| Business | Revenue, user count, conversion |
| SLO | SLI actual vs target |

Alerts face a dichotomy - too few and you miss incidents, too many and the team is paralyzed - so per-severity routing (Slack for warnings, PagerDuty for critical) matters.
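
As a sketch of that per-severity routing, the following assumes a Slack incoming webhook and the PagerDuty Events API v2; in practice Alertmanager or the SaaS does this routing, and the webhook URL and routing key here are placeholders.

```python
# Sketch: route warnings to Slack and critical alerts to PagerDuty.
# Assumes a Slack incoming webhook and the PagerDuty Events API v2;
# the webhook URL and routing key are placeholders.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
PAGERDUTY_ROUTING_KEY = "<events-api-v2-routing-key>"           # placeholder

def notify(severity: str, summary: str) -> None:
    if severity == "critical":
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": PAGERDUTY_ROUTING_KEY,
                "event_action": "trigger",
                "payload": {"summary": summary, "severity": "critical", "source": "alerting"},
            },
            timeout=5,
        )
    else:
        requests.post(SLACK_WEBHOOK, json={"text": f"[{severity}] {summary}"}, timeout=5)

notify("warning", "Error budget: 10% consumed over the last 3 days")
```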

Alert design

Conditions for a good alert: (1) a response is definitely needed, (2) a response is possible, (3) it must be acted on now - all three satisfied. Alerts that miss these conditions produce alert fatigue and cause real critical events to be missed.

| Good alerts | Bad alerts |
| --- | --- |
| SLO violation | CPU over 80% (auto-recovers) |
| Sudden error rate spike | One-shot error |
| Clear user impact | "Somehow slow" |
| Has response procedures | Unclear who does what |

Alerts should fire on user impact, not “symptoms.” High CPU itself doesn’t impact users.

"Whether the machine is suffering" and "whether humans are suffering" are different things - the standard lesson of monitoring design. A commonly heard case: a team sets CPU-over-80% alerts to feel safe, no one's dashboard turns red, yet Twitter is flooded with "can't log in" reports. The cause is DB connection pool exhaustion - the CPU is actually idle while only the application keeps everyone waiting - the typical pattern.

SLO-based alert implementation example

The modern way to implement "SLO-violation-based" alerts is to fire on the speed at which the error budget is being consumed (the burn rate). Rather than a mere threshold being crossed, this is a mechanism that detects "at this pace, the budget will be exhausted" - recommended as the standard by the Google SRE Workbook.

| Severity | Condition | Destination |
| --- | --- | --- |
| Critical (immediate response) | 2% of budget consumed in 1 hour (burn rate > 14.4x) | PagerDuty |
| High (within hours) | 5% of budget consumed in 6 hours (burn rate > 6x) | PagerDuty |
| Warning (within business hours) | 10% of budget consumed in 3 days (burn rate > 1x) | Slack |

# Prometheus example (availability SLO 99.9%, monthly error budget 43.2 min)
- alert: ErrorBudgetBurnRateCritical
  expr: |
    (1 - availability_slo:ratio_rate5m) > (14.4 * 0.001)
    and
    (1 - availability_slo:ratio_rate1h) > (14.4 * 0.001)
  for: 2m

“Fire when CPU exceeds 80%” is outdated. Fire by burn rate - the modern standard.
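
Where numbers like 14.4x come from is simple arithmetic. A small sketch, assuming the 30-day SLO period and the windows from the table above:

```python
# Sketch: the arithmetic behind the burn-rate thresholds in the table above.
# burn rate = fraction of error budget consumed / fraction of the SLO period elapsed
SLO_PERIOD_HOURS = 30 * 24  # 30-day rolling window

def burn_rate(budget_fraction_consumed: float, window_hours: float) -> float:
    return budget_fraction_consumed / (window_hours / SLO_PERIOD_HOURS)

print(burn_rate(0.02, 1))       # 14.4 -> Critical
print(burn_rate(0.05, 6))       # 6.0  -> High
print(burn_rate(0.10, 3 * 24))  # 1.0  -> Warning

# With a 99.9% SLO the error budget is 0.1%, so burn rate 14.4 corresponds to
# an observed error rate of 14.4 * 0.001 = 1.44% -- the threshold in the rule above.
```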

Decision criterion 1: system scale and complexity

How heavyweight your observability needs to be is determined by system complexity. A monolith on a single server is fine with lightweight monitoring; microservices need the full set.

| System scale | Recommended |
| --- | --- |
| Single server / monolith | CloudWatch / Datadog lightweight plan |
| Few services | Prometheus + Grafana |
| Microservices (10-50) | LGTM + OpenTelemetry |
| Large (100+) | Datadog / Dynatrace enterprise |

Decision criterion 2: operations team skills

OSS stacks are operationally heavy and need a dedicated team. With SaaS, the operational burden drops but cost spikes.

| Team regime | Recommended |
| --- | --- |
| No dedicated SRE | SaaS (Datadog / New Relic) |
| A few SREs | Grafana Cloud |
| Plenty of SREs / cost-conscious | In-house LGTM + OpenTelemetry |

How to choose by case

Personal dev / small SaaS

Cloud-standard monitoring (CloudWatch / Cloud Monitoring) + UptimeRobot. Runs from a few thousand yen per month. Add the Grafana Cloud free tier if you want custom dashboards.

Startup / 0-1 SREs

The free tier of Datadog or New Relic + the OpenTelemetry SDK. To minimize operational load, SaaS is effectively the only choice. Instrument with OTel and you can switch backends later. Billing spikes at scale, so review the choice once you pass around 100 hosts.

Mid-size SaaS / microservices-ization

Grafana Cloud (LGTM) + OpenTelemetry. There is a learning cost for the LGTM stack, but it is considerably cheaper than Datadog. Operable with 2-3 SREs.

Large enterprise / can self-operate

A self-built Prometheus + Grafana + Loki + Tempo + OpenTelemetry stack. A composition that avoids vendor lock-in and keeps confidential data in-house. Operations require 5+ dedicated SREs.

Common misconceptions

Take all logs and you’re fine

You enter log hell. Sampling, structuring, and per-importance design are needed - log volume is money.

Monitoring CPU spikes is enough

CPU spikes sometimes don’t impact users. User-impact-based (SLO) is the correct approach.

Email alerts to everyone

No one looks at them anymore. Alerts should reach responders via appropriate channels - on-call design matters.

OpenTelemetry is new, wait and see

It is already the industry standard. For new adoption today, OTel is effectively the only choice - it is vendor-specific SDKs that are becoming outdated.

Monitoring-cost / alert-operation numerical gates

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For monitoring, don't just "install it and feel safe" - the key to operating it is tracking cost and signal quality numerically.

| Metric | Recommended | What to do if violated |
| --- | --- | --- |
| Monthly monitoring foundation cost | 5-10% of infrastructure cost | Sampling, shorter retention |
| DEBUG-log output in production | Forbidden | Narrow to INFO and above |
| Alerts fired per week | 10 or fewer | Reduce noise, move to SLO-based alerts |
| Alert response rate | 90%+ | Delete alerts that fire but no one looks at |
| MTTA (Mean Time To Acknowledge) | Within 5 min | Review the on-call regime |
| MTTR (Mean Time To Resolve) | Within 30 min | Maintain runbooks |
| Lighthouse Observability score | 90 or more | Review metric design |
| Structured-log rate | 100% | Don't accept non-JSON |
| OpenTelemetry adoption rate | 100% for new builds | Avoid vendor-specific SDKs |

Alerts firing 10+ times per week is a sign of noise-ization. A Datadog bill over $3k/month is a guideline for over-investment at startup scale. Continuing to output DEBUG logs in production easily reaches $10k+/month in CloudWatch Logs - the typical accident.

Monitoring cost has 10% of infrastructure cost as its upper bound. Cut via sampling and shorter retention if exceeded.

Monitoring-design pitfalls and forbidden moves

Typical accident patterns in monitoring. All share the same structure: monitoring is configured but not actually operated.

| Forbidden move | Why it's bad |
| --- | --- |
| Alerting on static CPU/memory thresholds (over 80%, etc.) | Load fluctuations pile up false positives until alerts fire but no one looks |
| Continuing to output DEBUG-level logs in production | CloudWatch Logs / Datadog bills exceed $10k/month |
| Sending all alerts to one Slack channel | Real incident notifications get buried in daily noise |
| Tracking performance only by average | The slowest 1% of users are invisible; track P95/P99 |
| Managing runbooks in Word/PDF/verbally | Person-dependent, irreproducible, and AI can't run them |
| Hunting for blame in postmortems | A hotbed of information-hiding; blameless is the rule |
| Not monitoring the monitoring foundation's cost | "Datadog bill jumped 5x" accidents are frequent; monitoring of monitoring is needed |
| Operating microservices without traces | Can't identify which service is slow; MTTR stretches to hours |
| Instrumenting with a custom SDK | Vendor lock-in; full rewrite on any future migration |
| Not conducting alert reviews | Without monthly reviews, alerts that have become mere formality never get deleted |
| Running monitoring and production on the same network | Same structure as the October 2021 Facebook 6-hour outage (a BGP config error also blinded the monitoring tools) |

The October 2021 Facebook/Instagram 6-hour outage (a BGP configuration error made the servers unreachable from outside, and recovery was delayed because internal monitoring tools and building-access systems all depended on the same network; estimated losses of $60M+) exposed the structural problem of having no monitoring of the monitoring.

Monitoring fires on “human pain.” Fire on SLO violations, not CPU 80%.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, observability is evolving into an area where AI auto-detects and diagnoses anomalies. AI-driven monitoring such as Datadog Watchdog and Dynatrace Davis goes beyond legacy threshold-based alerting to discover anomalies automatically.

| Favored in the AI era | Disfavored in the AI era |
| --- | --- |
| High-cardinality support (Honeycomb) | Only fixed metrics |
| OpenTelemetry standard | Custom SDK |
| Structured logs AI can read | Unstructured string logs |
| SRE Agent / LLM (Large Language Model) integration | Manual cause-tracking |

For AI to investigate incidents, traces and structured logs are required. Even AI cannot analyze raw string logs on their own. "Output measurement data in a form AI can read" is the new standard.

The future of observability is AI agents operating it. Data structure is everything.
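
One concrete form of "data AI can read" is putting the active trace ID into every structured log line, so that an agent (or a human) can pivot from a log entry to the matching trace. A minimal sketch, assuming the OpenTelemetry SDK is already configured as in the earlier examples:

```python
# Sketch: attach the active trace ID to each structured log entry so an AI
# agent (or a human) can jump from a log line to the corresponding trace.
# Assumes the OpenTelemetry SDK is already configured as in the earlier examples.
import json
from opentelemetry import trace

def log_event(message: str, **fields) -> None:
    ctx = trace.get_current_span().get_span_context()
    entry = {"message": message, **fields}
    if ctx.is_valid:
        entry["trace_id"] = format(ctx.trace_id, "032x")  # hex, as backends display it
        entry["span_id"] = format(ctx.span_id, "016x")
    print(json.dumps(entry))

tracer = trace.get_tracer("order-service")
with tracer.start_as_current_span("POST /order"):
    log_event("payment failed", order_id="abc123", provider="stripe")
```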

Author’s note - cases of “all-green dashboards” with fires beneath

Cases of "we were monitoring but couldn't notice" have become standard talking points among operations teams.

At one SaaS, the CPU, memory, and network dashboards all stayed green while hundreds of "can't log in" reports flowed across Twitter. The cause was DB connection pool exhaustion - the CPU was actually idle while only the application kept everyone waiting. The trap was "feeling safe looking at infrastructure metrics"; not measuring the SLO (user impact) was the root cause.

The other is the October 2021 Facebook/Instagram 6-hour outage: a BGP configuration error made the servers unreachable from outside, but internal monitoring tools and building-access systems all depended on the same network, so engineers couldn't even get into the data center and recovery was delayed - told as a case of "no monitoring of the monitoring." Losses were estimated at $60M+ in ad revenue alone.

In both, the lethal blow was a design gap - in "what to measure" and in "independence of the monitoring system" - driving home that firing on user impact and separating monitoring from its target are both required.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are still vague always invites later questions like "why did we decide this again?"

  • Measurement SDK (OpenTelemetry recommended / vendor-specific)
  • Backend (OSS LGTM / Datadog / New Relic)
  • Metric design (USE, RED, SLI)
  • Log strategy (collection scope, retention)
  • Distributed tracing (all requests / sampling)
  • Alert design (SLO-based, channel separation)
  • Dashboards (service health, business)

How to make the final call

The core of monitoring and observability is understanding the difference in roles: monitoring answers "is it running?" and observability answers "why did it break?" A monolith can get by with traditional monitoring, but the moment you move toward microservices, the three pillars of Metrics/Logs/Traces and unified instrumentation via OpenTelemetry become required. Leaning alert design toward "user impact" such as SLO violations, rather than "symptoms" such as CPU spikes, is the key to operational sustainability.

The other decisive axis is structuring data so AI agents can operate on it. A foundation with OpenTelemetry, structured logs, and high-cardinality support lets AI auto-diagnose incidents and lets you ride new operational models such as Datadog Watchdog and SRE Agents. Custom SDKs, unstructured string logs, and fixed metrics alone become outdated in the AI era.

Selection priorities

  1. Unify measurement with OpenTelemetry - vendor-neutral, prepare for future backend switching
  2. Decide SaaS vs OSS by ops regime - no dedicated SRE → Datadog, few → Grafana Cloud, rich → in-house LGTM
  3. Alerts on user-impact basis - fire on SLO violations, don’t fire on CPU spike alone
  4. Structured data AI can read - JSON / Traces / high cardinality, prepare for AI diagnosis

“Unify three pillars with OpenTelemetry.” Align measurement in form AI can read, fire on user impact.

Summary

This article covered monitoring and observability, including the 3 pillars, OpenTelemetry, main tools, SLO burn-rate alerts, and structured data for AI diagnosis.

Unify measurement with OpenTelemetry, decide SaaS vs OSS by ops regime, alerts on user-impact basis, organize structured data AI can read. That is the practical answer for monitoring and observability in 2026.

Next time we’ll cover log design (structured logs, PII protection, retention).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.