[DevOps Architecture] Monitoring and Observability

About this article

As the ninth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains monitoring and observability.

Four Golden Signals of Monitoring

Monitoring answers “is it running?”; observability answers “why did it break?” This article covers the three pillars of observability (Metrics/Logs/Traces), the four golden signals (Latency/Traffic/Errors/Saturation), OpenTelemetry, AIOps, and operational implementation. Monitoring requirements at the system-architecture stage are covered in the separate “System Architecture” article.

What are monitoring and observability, anyway?

Imagine driving a car. Speedometer, fuel gauge, engine warning light — without dashboard instruments, you can’t tell your speed, remaining fuel, or engine status. “Noticing only after the engine seizes” is too late.

Monitoring is your system’s dashboard: a mechanism that continuously measures values such as CPU usage, error rate, and response time, and fires alerts when something is off. Observability goes one step further: it is the ability to investigate “why it broke” after the fact.

Without monitoring, you first learn about incidents from user complaints. Without observability, even when you notice an incident, you can’t identify the cause, and recovery takes hours.

Why it’s needed

Damage expands when you don’t notice incidents

Many organizations still first learn of incidents from user complaints. Without monitoring, by the time you notice, the incident has already become a major one.

Cause-tracking is hard in microservices

In systems where 100 services interact, identifying where the slowness originates requires an observability foundation like distributed tracing.

Numerical SLO/SLA

To promise “99.9% uptime,” you need a mechanism that measures actual uptime. The principle: what can’t be measured can’t be defended.

Three Pillars

These are the three fundamental data types of observability. Any one alone is insufficient; the whole picture appears only when all three are combined. The “three pillars” framing is now the classic one, and the modern mainstream adds Events and Profiles for a more multi-faceted approach.

[Figure: Three Pillars of Observability (Metrics, Logs, Traces), with OpenTelemetry as the unified collection layer for all three]

Type | Content | Representatives
Metrics | Numeric time-series data | Prometheus, Datadog
Logs | String event records | Loki, Elasticsearch
Traces | Request paths | Jaeger, Tempo

Metrics show “what’s happening,” logs show “what was recorded,” traces show “how it moved.”

Metrics

Numeric time-series data recording CPU usage, request count, error rate, latency, and so on. Metrics are storage-efficient and well suited to aggregation and alerting, which makes them the basis of monitoring.

Typical metrics | Content
System | CPU, memory, disk I/O, network
App | Request count, error rate, latency
Business | Order count, signup count, revenue
USE | Utilization, Saturation, Errors
RED | Rate, Errors, Duration

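USE and RED are well-known metric-design frameworks that serve as guides for “what to measure.” USE is aimed at resources (CPU, disk, network), while RED is aimed at request-driven services.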
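
As a rough illustration of the RED row in the table above, the sketch below summarizes a window of requests as Rate, Errors, and Duration. The `Request` type, its field names, and the choice to count only 5xx responses as errors are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One handled request; field names are illustrative."""
    duration_ms: float
    status: int  # HTTP status code


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize a window of requests as Rate, Errors, Duration."""
    if not requests or window_seconds <= 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_ms": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    total_ms = sum(r.duration_ms for r in requests)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "mean_duration_ms": total_ms / len(requests),
    }


# Four made-up requests observed over a 2-second window.
sample = [Request(120, 200), Request(80, 200), Request(400, 500), Request(200, 200)]
summary = red_metrics(sample, window_seconds=2)
```

The mean is used for Duration only to keep the sketch short; in practice a percentile is usually the more useful summary.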

Logs

Time-stamped text events recording detailed information emitted by apps. Structured logs (JSON format) are the modern standard and can be linked with metrics and traces. Logs are covered in detail in the next article.

Log type | Use case
App logs | Business-processing records
Access logs | HTTP requests
Audit logs | Permission operations, important changes
System logs | OS / middleware events

Logs are produced in massive volumes, so storage cost easily becomes a problem; retention period, compression, and sampling strategies are the operational concerns.
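
To make “structured logs (JSON format)” concrete, here is a minimal sketch using only Python’s standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the `trace_id` field passed through `extra=` are names chosen for this example; real projects often use a dedicated library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is a hypothetical field supplied via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr at INFO and above."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG output stays off
    logger.propagate = False
    return logger


logger = build_logger("order-api")
logger.info("order accepted", extra={"trace_id": "order-abc123"})
```

Carrying a trace ID in every line is what later allows a log entry to be joined with the trace it belongs to.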

Traces

Data that tracks one request as it passes through multiple services. In microservices, with chains like “the order API calls the inventory API and the payment API,” you can visualize where it’s slow and where it failed.

A trace consists of Spans (individual units of work) linked by a Trace ID, expressing the whole flow as a tree (more generally, a DAG). The representative instrumentation technology is OpenTelemetry (the industry standard), with data sent to backends such as Jaeger, Tempo, or Datadog APM.

Trace: order-abc123
 |- Span: POST /order (100ms)
 |   |- Span: check stock (30ms)
 |   |- Span: charge card (50ms)
 |       |- Span: Stripe API (45ms)
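
The trace above can be modeled as a flat list of spans that point at their parents. The sketch below computes each span’s “self time” (its duration minus its direct children) to show where the time actually goes. The `Span` fields and span IDs are illustrative, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: int


def self_times(spans: list[Span]) -> dict[str, int]:
    """Return each span's own time: duration minus its direct children."""
    child_total: dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# Same shape as the example trace; span IDs "a".."d" are made up.
trace = [
    Span("order-abc123", "a", None, "POST /order", 100),
    Span("order-abc123", "b", "a", "check stock", 30),
    Span("order-abc123", "c", "a", "charge card", 50),
    Span("order-abc123", "d", "c", "Stripe API", 45),
]
times = self_times(trace)
slowest = max(times, key=times.get)
```

Here the root span accounts for only 20 ms of its own; the largest share (45 ms) sits in the external Stripe API call, which is the kind of conclusion a trace view makes visible.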

Main tools (cloud-native OSS)

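For OSS-based observability, Prometheus + Grafana + Loki + Tempo + OpenTelemetry is the de facto standard combination. Prometheus and OpenTelemetry are CNCF projects, while Grafana, Loki, and Tempo are developed by Grafana Labs; all are under active development with good future prospects.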

Tool | Role
Prometheus | Metric collection, storage, alerting
Grafana | Visualization dashboards
Loki | Log aggregation (Grafana Labs)
Tempo | Distributed tracing (Grafana Labs)
OpenTelemetry | Standard for measurement-data collection
Alertmanager | Alert notifications

These are integrated as the LGTM stack (Loki, Grafana, Tempo, Mimir), which has been spreading rapidly.

Main tools (SaaS)

If you don’t want to build the foundation in-house, observability SaaS is the quick choice. Pricing is high, but you get a feature-rich monitoring foundation with near-zero operational burden.

Service | Characteristics
Datadog | Strongest features, expensive
New Relic | Veteran, all features integrated
Splunk | King of log analytics
Dynatrace | AI-driven auto-analysis
Honeycomb | Strong on high cardinality
Grafana Cloud | Managed LGTM stack

Datadog has the strongest features, but bills often reach hundreds of thousands to millions of yen per month; at small scale, Grafana Cloud or the New Relic free tier is the realistic choice.

OpenTelemetry (measurement standard)

The industry standard that unifies collection of metrics, logs, and traces. It was born in 2019 from the merger of OpenTracing and OpenCensus and has been advancing within CNCF from Incubating toward Graduated status, making it the modern common standard.

With OpenTelemetry, instrumentation code can be written vendor-neutrally. Even when switching from Datadog to Grafana, the app-side code does not need rewriting; you just swap the export destination on the Collector side. Vendor lock-in avoidance therefore works at the implementation level, not just in theory.

Furthermore, with a unified instrumentation API, using multiple tools together becomes easy. For example, production metrics go to Datadog, long-term log archives to S3 + Loki, and dev-environment traces to Jaeger, all distributed by use case from the same instrumentation code. Vendor-specific SDKs require redoing instrumentation per tool, so the difference becomes non-negligible at scale.

Feature | Content
Vendor-neutral | Send to any backend
Language support | Almost all major languages
Auto-instrumentation | Supports major frameworks
Unified measurement | Metrics + logs + traces integrated

For new builds, OpenTelemetry is the top candidate because it avoids vendor lock-in.
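
The idea that “only the export destination changes” can be shown without any real SDK. The sketch below is deliberately not the OpenTelemetry API: `Exporter`, `RecordingExporter`, `Pipeline`, and `handle_order` are hypothetical names that only mimic the separation between instrumentation code and backend routing.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a signal; stands in for a backend."""
    def export(self, kind: str, payload: dict) -> None: ...


@dataclass
class RecordingExporter:
    """Stand-in backend that just remembers what it received."""
    name: str
    received: list = field(default_factory=list)

    def export(self, kind: str, payload: dict) -> None:
        self.received.append((kind, payload))


@dataclass
class Pipeline:
    """Routes each signal kind to the exporters configured for it."""
    routes: dict  # signal kind -> list of exporters

    def emit(self, kind: str, payload: dict) -> None:
        for exporter in self.routes.get(kind, []):
            exporter.export(kind, payload)


def handle_order(pipeline: Pipeline) -> None:
    # Application code only knows the pipeline, never a vendor.
    pipeline.emit("metrics", {"name": "orders_total", "value": 1})
    pipeline.emit("traces", {"span": "POST /order", "duration_ms": 100})


backend_a = RecordingExporter("backend-a")
backend_b = RecordingExporter("backend-b")

handle_order(Pipeline({"metrics": [backend_a], "traces": [backend_a]}))
# Switching backends changes only the routing table, not handle_order.
handle_order(Pipeline({"metrics": [backend_b], "traces": [backend_b]}))
```

In a real deployment the routing table corresponds roughly to Collector configuration, and `handle_order` to application code instrumented once.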

Dashboards and alerts

Merely collecting monitoring data is meaningless; it becomes operationally useful only with visualization and alerting. The standard practice is to get an overview of system state with Grafana or a SaaS dashboard, and to notify via Slack or PagerDuty when thresholds are exceeded.

Dashboard type | Content
Service health | Per-API state, error rate
Infrastructure | CPU, memory, network
Business | Revenue, user count, conversion
SLO | SLI actual vs target

Alerts face a dilemma: too few and you miss incidents, too many and the team is paralyzed. Routing by importance (Slack for warnings, PagerDuty for critical) therefore matters.

Alert design

A good alert satisfies three conditions: (1) a response is definitely needed, (2) a response is possible, and (3) the response is needed now. Alerts that fail these produce alert fatigue and cause real critical events to be missed.

Good alerts | Bad alerts
SLO violation | CPU over 80% (auto-recovers)
Sudden error rate spike | One-shot error
Clear user impact | “Somehow slow”
Has response procedures | Unclear who does what

Alerts should fire on user impact, not on machine-level signals. High CPU by itself doesn’t impact users.

“Whether the machine is suffering” and “whether humans are suffering” are different things; this is the standard lesson of monitoring design. A commonly heard case: a team sets a CPU-usage 80% alert to feel safe, no dashboard turns red, and yet Twitter fills with “can’t log in” reports. The typical cause is DB connection pool exhaustion, where the CPU is actually idle and the app simply keeps everyone waiting.
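
The three conditions for a good alert and the per-severity routing described earlier can be sketched as a small gate. The `Alert` fields and the channel names returned here are placeholders for this example, not real integrations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    name: str
    needs_response: bool  # (1) a response is definitely required
    actionable: bool      # (2) the responder can actually do something
    urgent: bool          # (3) it has to happen now
    critical: bool = False


def destination(alert: Alert) -> Optional[str]:
    """Return a channel name, or None when the alert should not fire."""
    if not (alert.needs_response and alert.actionable and alert.urgent):
        return None  # firing this would only add to alert fatigue
    return "pagerduty" if alert.critical else "slack"


slo_violation = Alert("slo-violation", True, True, True, critical=True)
error_spike = Alert("error-rate-spike", True, True, True)
cpu_blip = Alert("cpu-over-80-auto-recovers", False, False, False)
```

With these inputs, the SLO violation goes to the paging channel, the error spike to chat, and the self-recovering CPU blip is dropped.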

SLO-based alert implementation example

The modern way to implement “SLO-violation-based” alerts is to fire on the speed at which the error budget is being consumed (the burn rate). Rather than a simple threshold breach, this detects “at this pace, the budget will be exhausted,” and it is the approach recommended by the Google SRE Workbook.

Severity | Condition | Destination
Critical (immediate response) | 2% of budget consumed in 1 hour (burn rate > 14.4x) | PagerDuty
High (within hours) | 5% of budget consumed in 6 hours (burn rate > 6x) | PagerDuty
Warning (within business hours) | 10% of budget consumed in 3 days (burn rate > 1x) | Slack

# Prometheus rule file (availability SLO 99.9%, monthly budget 43.2 min); group name is illustrative
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetBurnRateCritical
        # Both the 5m and 1h windows must burn faster than 14.4x the budget rate
        expr: |
          (1 - availability_slo:ratio_rate5m) > (14.4 * 0.001)
          and (1 - availability_slo:ratio_rate1h) > (14.4 * 0.001)
        for: 2m

“Fire when CPU exceeds 80%” is outdated. Firing on burn rate is the modern standard.
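
The burn-rate thresholds in the table (14.4x, 6x, 1x) follow from simple arithmetic: the fraction of budget consumed multiplied by the ratio of the SLO period to the alert window. The sketch below reproduces that calculation; the function names are made up for this example, and a 30-day SLO period is assumed.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumes a 30-day SLO window


def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def error_budget_minutes(slo: float) -> float:
    """Allowed downtime per SLO period, in minutes."""
    return (1 - slo) * SLO_PERIOD_HOURS * 60


def should_page(error_ratio_short: float, error_ratio_long: float,
                slo: float, threshold: float) -> bool:
    """Multi-window check: both windows must exceed threshold x the budget rate."""
    limit = threshold * (1 - slo)
    return error_ratio_short > limit and error_ratio_long > limit


critical = burn_rate(0.02, 1)      # 2% of budget in 1 hour
high = burn_rate(0.05, 6)          # 5% of budget in 6 hours
warning = burn_rate(0.10, 3 * 24)  # 10% of budget in 3 days
budget = error_budget_minutes(0.999)
```

Requiring both a short and a long window to exceed the limit, as the Prometheus rule does, keeps a brief blip from paging anyone while still reacting quickly to a sustained burn.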

Decision criterion 1: system scale and complexity

How heavy your observability setup needs to be is decided by system complexity. A monolith on a single server is fine with lightweight monitoring; microservices need the full set.

System scale | Recommended
Single server / monolith | CloudWatch / Datadog lightweight plan
Few services | Prometheus + Grafana
Microservices (10-50) | LGTM + OpenTelemetry
Large (100+) | Datadog / Dynatrace enterprise

Decision criterion 2: operations team skills

OSS stacks are heavy to operate and need a dedicated team. SaaS reduces operational burden, but costs add up.

Team structure | Recommended
No dedicated SRE | SaaS (Datadog / New Relic)
Few SREs | Grafana Cloud
Many SREs / cost-cutting-oriented | In-house LGTM + OpenTelemetry

How to choose by case

Personal dev / small SaaS

Cloud-provider defaults (CloudWatch / Cloud Monitoring) + UptimeRobot. This runs from a few thousand yen per month. Add the Grafana Cloud free tier if you want custom dashboards.

Startup / 0-1 SREs

Free tier of Datadog or New Relic + OpenTelemetry SDK. To minimize operational load, SaaS is the only choice. Instrument with OTel so that you can switch backends in the future. Billing spikes at scale, so review the setup once you pass 100 hosts.

Mid-size SaaS / moving to microservices

Grafana Cloud (LGTM) + OpenTelemetry. The LGTM stack has a learning cost, but it is considerably cheaper than Datadog. Operable with 2-3 SREs.

Large enterprise / can self-operate

Self-built Prometheus + Grafana + Loki + Tempo + OpenTelemetry. This composition avoids vendor lock-in and keeps confidential data in-house. Operations need 5+ dedicated SREs.

Monitoring-cost / alert-operation numerical gates

Note: Industry baseline values as of April 2026. These will become outdated as technology and the talent market shift, so they require periodic updates.

With monitoring, “install and feel safe” is not enough; the key to operations is tracking cost and signal quality numerically.

Metric | Recommended | What to do if exceeded
Monthly monitoring foundation cost | 5-10% of infrastructure cost | Sampling, shorter retention
Production DEBUG-log output | Forbidden | Narrow to INFO+
Alerts fired / week | 10 or fewer | Noise reduction, move to SLO-based
Alert response rate | 90%+ | Delete alerts that fire but no one looks at
MTTA | Within 5 min | Review on-call structure
MTTR | Within 30 min | Maintain Runbooks
Lighthouse Observability score | 90 or more | Review metric design
Structured-log rate | 100% | Don’t accept non-JSON
OpenTelemetry adoption rate | 100% for new builds | Avoid vendor-specific SDKs

More than 10 alerts per week is a sign that alerts are turning into noise. A Datadog bill above $3k/month is a rough sign of over-investment at startup scale. Leaving DEBUG logs on in production can easily push CloudWatch Logs past $10k/month, a typical accident.

Monitoring cost should be capped at 10% of infrastructure cost. If it exceeds that, cut it via sampling and shorter retention.
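
Several of the gates in the table above are plain numeric comparisons, so they can be checked mechanically. The sketch below is one way to do that; `OpsSnapshot`, its field names, and the sample figures are invented for illustration, and the thresholds are the ones listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpsSnapshot:
    """One month of operational numbers; field names are illustrative."""
    infra_cost: float
    monitoring_cost: float
    alerts_per_week: int
    alert_response_ratio: float
    mtta_minutes: float
    mttr_minutes: float


def gate_violations(s: OpsSnapshot) -> list[str]:
    """Return the names of the gates this snapshot fails."""
    failed = []
    if s.monitoring_cost > 0.10 * s.infra_cost:
        failed.append("monitoring_cost")
    if s.alerts_per_week > 10:
        failed.append("alerts_per_week")
    if s.alert_response_ratio < 0.90:
        failed.append("alert_response_rate")
    if s.mtta_minutes > 5:
        failed.append("mtta")
    if s.mttr_minutes > 30:
        failed.append("mttr")
    return failed


# Made-up figures: monitoring is 15% of infra cost, alerts are noisy, recovery is slow.
snapshot = OpsSnapshot(infra_cost=20_000, monitoring_cost=3_000,
                       alerts_per_week=25, alert_response_ratio=0.95,
                       mtta_minutes=4, mttr_minutes=45)
violations = gate_violations(snapshot)
```

Running a check like this on a schedule is a lightweight way to keep an eye on the cost and quality of the monitoring itself.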

Monitoring-design pitfalls and forbidden moves

These are the typical accident patterns in monitoring. All share the structure of “configured but not operating.”

Forbidden move | Why it’s bad
Alert on static CPU/memory thresholds (over 80%, etc.) | Load fluctuations pile up false positives until alerts fire but no one looks
Keep emitting DEBUG-level logs in production | CloudWatch Logs / Datadog billing exceeds $10k/month
Send all alerts to one Slack channel | Real incident notifications get buried in daily noise
Track performance only by average | The slow 1% of users is invisible; track P95/P99
Manage Runbooks in Word/PDF/verbally | Knowledge stays tied to individuals, is not reproducible, and AI can’t run them
Hunt for blame in postmortems | Breeds information hiding; blameless is the rule
Don’t monitor the cost of the monitoring foundation | Accidents where the Datadog bill jumps 5x are frequent; the monitoring itself needs monitoring
Operate microservices without traces | Can’t identify which service is slow; MTTR stretches to hours
Instrument with a vendor-specific SDK | Vendor lock-in, full rewrite on future migration
Skip alert reviews | Alerts that have become a formality pile up; monthly reviews are what delete them
Run monitoring on the same network as production | Same structure as the October 2021 Facebook 6-hour outage (a BGP configuration error also took the monitoring tools offline)
“Monitoring CPU spikes is enough” | CPU spikes don’t always impact users; alerting based on SLOs (user impact) is the correct approach
“Email alerts to everyone” | No one looks; on-call design that delivers to the right responders via the right channels is needed

The October 2021 Facebook/Instagram 6-hour outage exposed the structural problem of having no “monitoring of monitoring”: a BGP configuration error made servers unreachable from outside, and recovery was delayed because internal monitoring tools and building entry/exit systems all depended on the same network. The loss is estimated at $60M+.

Monitoring should fire on “human pain”: alert on SLO violations, not on CPU at 80%.
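
The “track performance only by average” pitfall from the table above is easy to demonstrate with a few lines of arithmetic. The sketch below uses a nearest-rank percentile, which is one of several common definitions, and made-up latencies.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones (made-up latencies in ms).
latencies = [100.0] * 98 + [5000.0] * 2
mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
```

For this data the mean is 198 ms and P95 is 100 ms, both of which look healthy, while P99 is 5000 ms: the users stuck waiting five seconds only show up in the tail.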

AI decision axes

AI-favored | AI-disfavored
High cardinality support (Honeycomb) | Only fixed metrics
OpenTelemetry standard | Custom SDK
Structured logs AI can read | Non-structured string logs
SRE Agent / LLM integration | Manual cause-tracking

  1. Unify instrumentation with OpenTelemetry: vendor-neutral, and ready for future backend switches
  2. Decide SaaS vs OSS by ops structure: no dedicated SRE, Datadog; a few SREs, Grafana Cloud; many SREs, in-house LGTM
  3. Alert on user impact: fire on SLO violations, not on a CPU spike alone
  4. Keep data structured so AI can read it: JSON logs, traces, and high cardinality, ready for AI diagnosis

AI-driven root-cause analysis (RCA) has become practical

Datadog’s “Watchdog RCA” and New Relic’s “AI Insights” provide features that cross-analyze multiple signals (logs, metrics, traces) to estimate incident root causes. The prerequisite for these to work is that the 3 pillars (logs, metrics, traces) are correlated via trace_id.

With unified instrumentation via OpenTelemetry, the log → trace → metrics correlation is built automatically on the tool side, which gives AI RCA the best chance of being accurate.
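
What “correlated via trace_id” means in practice can be sketched as a simple join: group every log line and span that carries the same ID. The record shapes below are invented for illustration; observability backends do this internally in their own formats.

```python
from collections import defaultdict


def correlate(logs: list[dict], spans: list[dict]) -> dict[str, dict]:
    """Group log lines and spans that share a trace_id."""
    by_trace: dict[str, dict] = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        if entry.get("trace_id"):
            by_trace[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        if span.get("trace_id"):
            by_trace[span["trace_id"]]["spans"].append(span)
    return dict(by_trace)


logs = [
    {"trace_id": "order-abc123", "level": "ERROR", "message": "card declined"},
    # A line without a trace_id cannot be attached to any request.
    {"trace_id": None, "level": "INFO", "message": "unstructured line, no trace"},
]
spans = [
    {"trace_id": "order-abc123", "name": "charge card", "duration_ms": 50},
]
joined = correlate(logs, spans)
```

The second log line falls out of the result entirely, which is the practical cost of logs that lack a trace ID: neither a human nor an AI tool can tie them to the failing request.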

Observability backend selection and AI feature gaps

As of 2026, AI feature maturity varies by observability tool. Datadog covers anomaly detection, RCA, and Runbook recommendation with AI; Honeycomb widens the room for AI-assisted analysis with BubbleUp (high-cardinality analysis); Grafana Cloud has implemented LLM-based log-query generation. “How far AI-powered diagnostic support goes” is now an evaluation axis in tool selection.

Author’s note - cases of “all-green dashboards” with fires beneath

Cases of “we were monitoring but didn’t notice” have become standard talking points among operations teams.

At a certain SaaS, the CPU, memory, and network dashboards all stayed green while hundreds of “can’t log in” reports were flowing on Twitter. The cause was DB connection pool exhaustion: the CPU was actually idle and the app simply kept everyone waiting. The trap was “feeling safe by looking at infrastructure metrics,” and the root cause was not measuring SLOs (user impact).

Another is the October 2021 Facebook/Instagram 6-hour outage. A BGP configuration error made servers unreachable from outside, and because internal monitoring tools and building entry/exit systems all depended on the same network, engineers couldn’t enter the data center and recovery was delayed. It is told as a case of “no monitoring of monitoring,” with an estimated $60M+ in ad-revenue loss alone.

In both cases, design gaps in “what to measure” and “monitoring system independence” were the fatal blow. They drive home that firing on user impact and separating monitoring from its target are both required.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague invites later questions like “why did we decide this again?”

  • Measurement SDK (OpenTelemetry recommended / vendor-specific)
  • Backend (OSS LGTM / Datadog / New Relic)
  • Metric design (USE, RED, SLI)
  • Log strategy (collection scope, retention)
  • Distributed tracing (all requests / sampling)
  • Alert design (SLO-based, channel separation)
  • Dashboards (service health, business)

https://en.senkohome.com/arch-intro-devops-overview/
https://en.senkohome.com/arch-intro-devops-logging/
https://en.senkohome.com/arch-intro-devops-slo/

Summary

This article covered monitoring and observability, including the 3 pillars, OpenTelemetry, main tools, SLO burn-rate alerts, and structured data for AI diagnosis.

Unify instrumentation with OpenTelemetry, decide SaaS vs OSS by ops structure, alert on user impact, and keep data structured so AI can read it. That is the practical answer for monitoring and observability in 2026.

Next time we’ll cover log design (structured logs, PII protection, retention).

I hope you’ll read the next article as well.

📚 Series: Architecture Crash Course for the Generative-AI Era (62/89)