About this article
As the ninth installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains monitoring and observability.
Monitoring answers "is it running?"; observability answers "why did it break?" This article covers the three pillars of observability (metrics/logs/traces), the four golden signals (latency/traffic/errors/saturation), OpenTelemetry, AIOps, and how to put them into operation (monitoring requirements at the system-architecture stage are covered in the separate "System Architecture" article).
Other articles in this category
Why it's needed
Damage spreads when you don't notice incidents
There are still many organizations that first learn of incidents from user complaints. Without monitoring, by the time you notice, it is already a major incident.
Cause-tracking is hard in microservices
In systems where 100 services interact, identifying where the slowness originates requires an observability foundation like distributed tracing.
Numerical SLO/SLA
To promise "99.9% uptime," you need a mechanism that measures actuals. The principle: what cannot be measured cannot be defended.
Three Pillars
The three fundamental data types of observability. Any one of them alone is insufficient; the whole picture appears only when all three are combined. The "three pillars" framing is now a classic, and the modern mainstream is a multi-signal approach that adds events and profiles.
```mermaid
flowchart TB
APP([App execution])
APP --> METRIC[Metrics<br/>numeric time series<br/>CPU/error rate/latency]
APP --> LOG[Logs<br/>event records<br/>access/error details]
APP --> TRACE[Traces<br/>request paths<br/>distributed tracing]
METRIC --> Q1[What's happening?]
LOG --> Q2[What was recorded?]
TRACE --> Q3[How did it move?]
Q1 --> ANSWER([Whole picture<br/>= Observability])
Q2 --> ANSWER
Q3 --> ANSWER
classDef app fill:#fef3c7,stroke:#d97706;
classDef m fill:#dbeafe,stroke:#2563eb;
classDef l fill:#fae8ff,stroke:#a21caf;
classDef t fill:#dcfce7,stroke:#16a34a;
classDef goal fill:#fef3c7,stroke:#d97706,stroke-width:2px;
class APP app;
class METRIC,Q1 m;
class LOG,Q2 l;
class TRACE,Q3 t;
class ANSWER goal;
```
| Type | What it is | Representative tools |
|---|---|---|
| Metrics | Numeric time-series data | Prometheus, Datadog |
| Logs | String event records | Loki, Elasticsearch |
| Traces | Request paths | Jaeger, Tempo |
Metrics show "what is happening," logs show "what was recorded," and traces show "how it moved."
Metrics
Numerically quantified time-series data recording CPU usage, request counts, error rates, latency, and so on. Storage-efficient and well suited to aggregation and alerting; metrics are the basis of monitoring.
| Category | Typical metrics |
|---|---|
| System | CPU, memory, disk I/O, network |
| App | Request count, error rate, latency |
| Business | Order count, signup count, revenue |
| USE | Utilization, Saturation, Errors |
| RED | Rate, Errors, Duration |
USE and RED are well-known metric-design frameworks, used as guides for deciding "what to measure."
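As a concrete illustration of RED (Rate, Errors, Duration), the following is a minimal sketch using the Python prometheus_client library; the metric names and the handle_order() handler are illustrative assumptions, not taken from any particular system.

```python
# Minimal RED-metrics sketch with prometheus_client (pip install prometheus-client).
# Metric and label names here are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Rate: total HTTP requests",
                   ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Duration: request latency",
                    ["path"])

def handle_order() -> str:
    """Pretend request handler instrumented for Rate / Errors / Duration."""
    start = time.perf_counter()
    status = "500" if random.random() < 0.01 else "200"                 # Errors
    LATENCY.labels(path="/order").observe(time.perf_counter() - start)  # Duration
    REQUESTS.labels(path="/order", status=status).inc()                 # Rate
    return status

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_order()
        time.sleep(0.1)
```

Error rate and P95/P99 latency can then be derived on the Prometheus side (for example with rate() and histogram_quantile()) rather than being computed in the application.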
Logs
Time-stamped text events recording detailed information output by applications. Structured logs (JSON format) are the modern standard and can be correlated with metrics and traces. Logs are covered in detail in the next article.
| Log type | Use case |
|---|---|
| App logs | Business-processing records |
| Access logs | HTTP requests |
| Audit logs | Permission ops, important changes |
| System logs | OS / middleware events |
Logs are generated in massive volumes, so storage cost easily becomes a problem; retention periods, compression, and sampling strategies are all operational topics.
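The following is a minimal sketch of structured (JSON) logging using only the Python standard library; field names such as trace_id and order_id are illustrative assumptions, included to show how a log line can be correlated with a trace.

```python
# Minimal structured-logging sketch (Python standard library only).
# Field names (trace_id, order_id, ...) are illustrative assumptions.
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so machines (and AI) can parse it."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Fields passed via `extra=` (e.g. trace_id) are attached as-is.
        for key in ("trace_id", "order_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("order-service")
log.info("order accepted", extra={"trace_id": "abc123", "order_id": "o-42"})
# {"ts": "...", "level": "INFO", "logger": "order-service",
#  "msg": "order accepted", "trace_id": "abc123", "order_id": "o-42"}
```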
Traces
Data that tracks a single request as it transits multiple services. In microservices, with call chains like "the order API calls the inventory API and the payment API," traces let you visualize where it is slow and where it failed.
A trace consists of spans (individual units of work) linked by a trace ID, expressing the whole flow as a DAG (directed acyclic graph). The representative technology is OpenTelemetry (the industry standard), with trace data sent to Jaeger, Tempo, Datadog APM (Application Performance Monitoring), and so on.
```
Trace: order-abc123
|- Span: POST /order (100ms)
| |- Span: check stock (30ms)
| |- Span: charge card (50ms)
| | |- Span: Stripe API (45ms)
```
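A minimal sketch of producing a span tree like the one above with the OpenTelemetry Python SDK follows; the console exporter is used so the snippet runs standalone, and the span names simply mirror the example trace.

```python
# Minimal OpenTelemetry tracing sketch (pip install opentelemetry-sdk).
# Span names mirror the example trace above; the console exporter keeps
# the snippet self-contained (no backend required).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("POST /order") as root:
    root.set_attribute("order.id", "abc123")
    with tracer.start_as_current_span("check stock"):
        pass  # call the inventory API here
    with tracer.start_as_current_span("charge card"):
        with tracer.start_as_current_span("Stripe API"):
            pass  # external payment call
# All spans share one trace ID, so a backend (Jaeger, Tempo, ...) can
# reassemble them into the tree shown above.
```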
Main tools (cloud-native OSS)
For OSS-based observability, Prometheus + Grafana + Loki + Tempo + OpenTelemetry is the de facto standard combination. With its core components developed under the CNCF (Cloud Native Computing Foundation), it has strong future prospects.
| Tool | Role |
|---|---|
| Prometheus | Metric collection, storage, alerting |
| Grafana | Visualization dashboards |
| Loki | Log aggregation (Grafana Labs) |
| Tempo | Distributed tracing (Grafana Labs) |
| OpenTelemetry | Standard for collecting telemetry (instrumentation data) |
| Alertmanager | Alert notifications |
Integrated as the LGTM stack (Loki, Grafana, Tempo, Mimir), it has been spreading rapidly in recent years.
Main tools (SaaS)
If you do not want to build the foundation in-house, observability SaaS is the quick choice. Pricing is high, but you get a full-featured monitoring foundation with near-zero operational burden.
| Service | Characteristics |
|---|---|
| Datadog | Strongest features, expensive |
| New Relic | Veteran, all-feature integrated |
| Splunk | King of log analytics |
| Dynatrace | AI-driven auto-analysis |
| Honeycomb | Strong on high cardinality |
| Grafana Cloud | Managed LGTM stack |
Datadog has the strongest feature set, but bills often reach hundreds of thousands to millions of yen per month; at small scale, Grafana Cloud or the New Relic free tier is the realistic choice.
OpenTelemetry (instrumentation standard)
The industry standard that unifies the collection of metrics, logs, and traces. Born in 2019 from the merger of OpenTracing and OpenCensus, it has advanced from CNCF Incubating to Graduated and is the common standard today.
With OpenTelemetry, instrumentation code can be written in a vendor-neutral way. Even when switching from Datadog to Grafana, there is no need to rewrite application-side code; you only swap the export destination on the Collector side, so vendor lock-in avoidance works at the implementation level, not just on paper.
Furthermore, because the instrumentation API is unified, combining multiple tools becomes easy: for example, production metrics to Datadog, long-term log archives to S3 + Loki, and dev-environment traces to Jaeger, all routed by use case from the same instrumentation code (see the sketch below). Vendor-specific SDKs require redoing instrumentation per tool, so this difference becomes non-negligible at scale.
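To make the "swap the destination on the Collector side" point concrete, here is a small sketch, assuming an OpenTelemetry Collector is reachable over OTLP; the endpoint comes from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, so the application code contains nothing vendor-specific and backends are switched purely in Collector configuration.

```python
# Vendor-neutral export sketch: the app only knows "send OTLP to the Collector".
# Requires opentelemetry-sdk and opentelemetry-exporter-otlp; the endpoint
# default and service name are assumptions for illustration.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Standard OTel environment variable; defaults to a local Collector.
endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True))
)
trace.set_tracer_provider(provider)

# Whether these spans end up in Datadog, Tempo, or Jaeger is decided by the
# Collector's exporter configuration, not by this application code.
tracer = trace.get_tracer("order-service")
with tracer.start_as_current_span("POST /order"):
    pass
```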
| Feature | Content |
|---|---|
| Vendor-neutral | Send to any backend |
| Language support | Almost all major languages |
| Auto-instrumentation | Supported for major frameworks |
| Unified telemetry | Metrics, logs, and traces handled together |
For new builds, OpenTelemetry is the top candidate. Avoids vendor lock-in.
Dashboards and alerts
Just collecting monitoring data is meaningless; it becomes operationally useful only with visualization and alerting. The standard operation is to get an overview of system state with Grafana or each SaaS dashboard, and to notify via Slack or PagerDuty when thresholds are exceeded.
| Dashboard type | Content |
|---|---|
| Service health | Per-API state, error rate |
| Infrastructure | CPU, memory, network |
| Business | Revenue, user count, conversion |
| SLO | SLI actual vs target |
Alerts face a dichotomy: too few and things get missed, too many and people become numb to them. Per-severity routing (Slack for warnings, PagerDuty for critical) therefore matters.
Alert design
A good alert satisfies three conditions: (1) a response is definitely needed, (2) a response is possible, and (3) it must be acted on now. Alerts that miss these produce alert fatigue and cause real critical events to be overlooked.
| Good alerts | Bad alerts |
|---|---|
| SLO violation | CPU over 80% (auto-recovers) |
| Sudden error rate spike | One-shot error |
| Clear user impact | "Somehow slow" |
| Has response procedures | Unclear who does what |
Alerts should fire on user impact, not on "symptoms." High CPU by itself does not impact users.
"Whether the machine is suffering" and "whether humans are suffering" are different things; this is the standard lesson of monitoring design. A commonly heard story: a team sets an 80% CPU alert to feel safe, no one's dashboard turns red, yet Twitter is flooded with "can't log in" reports. The cause is DB connection pool exhaustion: CPU is actually idle while only the application keeps everyone waiting. It is the typical pattern.
SLO-based alert implementation example
The modern way to implement "SLO-violation-based" alerts is to fire on the speed at which the error budget is being consumed (burn rate). Rather than a mere threshold being exceeded, it is a mechanism that detects "at this pace, the budget will be exhausted," and it is recommended as the standard by the Google SRE Workbook.
| Severity | Condition | Destination |
|---|---|---|
| Critical (immediate response) | 2% budget consumed in 1 hour (burn rate > 14.4x) | PagerDuty |
| High (within hours) | 5% budget consumed in 6 hours (burn rate > 6x) | PagerDuty |
| Warning (within business hours) | 10% budget consumed in 3 days (burn rate > 1x) | Slack |
```yaml
# Prometheus alerting rule example (availability SLO 99.9%, monthly budget 43.2 min)
- alert: ErrorBudgetBurnRateCritical
  expr: |
    (1 - availability_slo:ratio_rate5m) > (14.4 * 0.001)
    and
    (1 - availability_slo:ratio_rate1h) > (14.4 * 0.001)
  for: 2m
```
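To show where the 14.4 multiplier comes from, the following is a small sketch of the arithmetic, assuming a 30-day SLO window; the helper name is made up for illustration. Multiplying the resulting burn rate by the 0.1% error budget of a 99.9% SLO gives the 14.4 * 0.001 threshold used in the rule above.

```python
# Burn rate = (fraction of error budget consumed) / (window length / SLO period).
# A 30-day SLO window is assumed; the helper name is illustrative.
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn-rate multiplier at which `budget_fraction` of the error budget
    is spent within `window_hours` of a `slo_window_days` SLO period."""
    return budget_fraction * (slo_window_days * 24) / window_hours

print(burn_rate_threshold(0.02, 1))    # 14.4 -> Critical (page immediately)
print(burn_rate_threshold(0.05, 6))    # 6.0  -> High
print(burn_rate_threshold(0.10, 72))   # 1.0  -> Warning (3 days)
```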
"Fire when CPU exceeds 80%" is outdated. Firing on burn rate is the modern standard.
Decision criterion 1: system scale and complexity
How heavyweight your observability needs to be is determined by system complexity. A monolithic single server is fine with lightweight monitoring; microservices need the full set.
| System scale | Recommended |
|---|---|
| Single server / monolith | CloudWatch / Datadog lightweight plan |
| Few services | Prometheus + Grafana |
| Microservices (10-50) | LGTM + OpenTelemetry |
| Large (100+) | Datadog / Dynatrace enterprise |
Decision criterion 2: operations team skills
OSS stacks are operationally heavy and need dedicated staff. With SaaS, the operational burden drops but cost rises sharply.
| Team structure | Recommended |
|---|---|
| No dedicated SRE | SaaS (Datadog / New Relic) |
| A few SREs | Grafana Cloud |
| Plenty of SREs / cost-conscious | In-house LGTM + OpenTelemetry |
How to choose by case
Personal dev / small SaaS
Cloud-provider defaults (CloudWatch / Cloud Monitoring) + UptimeRobot. Runs from a few thousand yen per month. Add the Grafana Cloud free tier if you want custom dashboards.
Startup / 0-1 SREs
The free tier of Datadog or New Relic + the OpenTelemetry SDK. To minimize operational load, SaaS is effectively the only choice. Instrument with OTel and you can switch backends later. Billing spikes as you scale, so review the choice once you pass roughly 100 hosts.
Mid-size SaaS / moving to microservices
Grafana Cloud (LGTM) + OpenTelemetry. There is a learning cost for the LGTM stack, but it is considerably cheaper than Datadog. Operable with 2-3 SREs.
Large enterprise / can self-operate
Self-built Prometheus + Grafana + Loki + Tempo + OpenTelemetry. A composition that avoids vendor lock-in and keeps confidential data in-house. Operations require 5+ dedicated SREs.
Common misconceptions
Take all the logs and you're fine
You end up in log hell. Sampling, structuring, and per-importance design are needed; log volume is money.
Monitoring CPU spikes is enough
CPU spikes sometimes have no user impact. A user-impact (SLO) basis is the correct approach.
Email alerts to everyone
No one reads them anymore. Alerts should reach the responders through appropriate channels; on-call design matters.
OpenTelemetry is new, wait and see
It is already the industry standard. For adoption today, OTel is effectively the only choice, and vendor-specific SDKs are becoming outdated.
Numerical gates for monitoring cost and alert operations
Note: industry baseline values as of April 2026. They will become outdated as technology and the talent market shift, so periodic updates are required.
The key to operating monitoring is not "install it and feel safe" but tracking cost and signal quality numerically.
| Metric | Recommended | What to do if exceeded |
|---|---|---|
| Monthly monitoring foundation cost | 5-10% of infrastructure cost | Sampling, retention shortening |
| Production DEBUG-log output | Forbidden | Narrow to INFO+ |
| Alerts fired / week | 10 or fewer | Noise reduction, move to SLO-based |
| Alert response rate | 90%+ | Delete if firing but no one looks |
| MTTA (Mean Time To Acknowledge) | Within 5 min | Review on-call regime |
| MTTR (Mean Time To Resolve) | Within 30 min | Maintain Runbooks |
| Lighthouse Observability score | 90 or more | Review metric design |
| Structured-log rate | 100% | Donât accept non-JSON |
| OpenTelemetry adoption rate | 100% for new | Avoid vendor-specific SDKs |
Alerts firing 10+ times per week is a sign of noise. Datadog spend above $3k/month is a guideline for over-investment at startup scale. Leaving DEBUG logs on in production can easily push CloudWatch Logs past $10k/month; it is the typical accident.
Cap monitoring cost at roughly 10% of infrastructure cost. If exceeded, cut it with sampling and shorter retention.
Monitoring-design pitfalls and forbidden moves
Typical accident patterns in monitoring. All share the structure of "configured but not operated."
| Forbidden move | Why it's bad |
|---|---|
| Alert on static CPU/memory thresholds (over 80%, etc.) | Load fluctuations pile up false positives until alerts fire but no one looks |
| Keep outputting DEBUG-level logs in production | CloudWatch Logs / Datadog billing exceeds $10k/month |
| Send all alerts to one Slack channel | Real incident notifications get buried in daily noise |
| Track performance only by average | The slowest 1% of users become invisible; track P95/P99 |
| Manage runbooks in Word/PDF or verbally | Person-dependent, irreproducible, and AI cannot execute them |
| Hunt for blame in postmortems | A hotbed of information hiding; blameless is the rule |
| Leave monitoring-foundation cost unmonitored | Accidents where the Datadog bill grows 5x are frequent; the monitoring system itself needs monitoring |
| Operate microservices without traces | You cannot identify which service is slow, and MTTR stretches to hours |
| Instrument with vendor-specific SDKs | Vendor lock-in; a full rewrite on any future migration |
| Skip alert reviews | Alerts that have become mere formality pile up; monthly reviews should prune them |
| Run monitoring and production on the same network | The same structure as the October 2021 Facebook outage of roughly 6 hours (a BGP config error also took out the monitoring tools) |
The October 2021 Facebook/Instagram outage of roughly 6 hours (a BGP configuration error made the servers unreachable from outside, and recovery was delayed because internal monitoring tools and badge-entry systems all depended on the same network; estimated losses exceeded $60M) showed the structural problem of having no monitoring of the monitoring itself.
Monitoring should fire on "human pain." Fire on SLO violations, not on CPU at 80%.
AI-era perspective
When AI-driven development (vibe coding) and AI usage are the premise, observability is evolving into an area where AI auto-detects and diagnoses anomalies. AI-driven monitoring such as Datadog Watchdog and Dynatrace Davis goes beyond legacy threshold-based alerting to discover anomalies automatically.
| Favored in the AI era | Disfavored in the AI era |
|---|---|
| High cardinality support (Honeycomb) | Only fixed metrics |
| OpenTelemetry standard | Custom SDK |
| Structured logs AI can read | Non-structured string logs |
| SRE Agent / LLM (Large Language Model) integration | Manual cause-tracking |
For AI to investigate incidents, traces and structured logs are required; AI cannot analyze raw string logs alone either. "Output telemetry in a form AI can read" is the new standard.
The future of observability is AI agents operating it. Data structure is everything.
Author's note - cases where the dashboard was all green while a fire burned underneath
Cases of "we were monitoring but could not notice" have become standard talking points in operations circles.
At one SaaS company, the CPU, memory, and network dashboards all stayed green while hundreds of "can't log in" reports flowed past on Twitter; a near-miss. The cause was DB connection pool exhaustion: CPU was actually idle and only the application kept everyone waiting. The trap was feeling safe just from looking at infrastructure metrics; the root cause was not measuring the SLO (user impact).
The other is the October 2021 Facebook/Instagram outage of roughly 6 hours: a BGP configuration error made the servers unreachable from outside, but internal monitoring tools and badge-entry systems also depended on the same network, so engineers could not even enter the data center and recovery was delayed; it is told as a case of "no monitoring of the monitoring." Estimated losses exceeded $60M in ad revenue alone.
In both, the lethal blow was a design gap in "what to measure" and in the independence of the monitoring system, driving home that firing on user impact and separating the monitoring system from its target are both required.
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"
- Measurement SDK (OpenTelemetry recommended / vendor-specific)
- Backend (OSS LGTM / Datadog / New Relic)
- Metric design (USE, RED, SLI)
- Log strategy (collection scope, retention)
- Distributed tracing (all requests / sampling)
- Alert design (SLO-based, channel separation)
- Dashboards (service health, business)
How to make the final call
The core of monitoring and observability is understanding the difference in role: monitoring answers "is it running?" and observability answers "why did it break?" A monolith can get by with legacy monitoring, but the moment you move to microservices, the three pillars of metrics/logs/traces and unified instrumentation via OpenTelemetry become required. Leaning alert design toward "user impact" such as SLO violations, rather than "symptoms" such as CPU spikes, is the key to operational sustainability.
The other decisive axis is structuring data so AI agents can operate it. Foundations with OpenTelemetry, structured logs, and high-cardinality support let AI auto-diagnose incidents and ride new operational models such as Datadog Watchdog and SRE agents. Vendor-specific SDKs, unstructured string logs, and fixed-metrics-only setups become outdated in the AI era.
Selection priorities
- Unify measurement with OpenTelemetry - vendor-neutral, prepare for future backend switching
- Decide SaaS vs OSS by your ops structure - no dedicated SRE → Datadog, a few SREs → Grafana Cloud, plenty of SREs → in-house LGTM
- Alert on a user-impact basis - fire on SLO violations, not on CPU spikes alone
- Structured data AI can read - JSON / Traces / high cardinality, prepare for AI diagnosis
"Unify the three pillars with OpenTelemetry." Align instrumentation in a form AI can read, and fire on user impact.
Summary
This article covered monitoring and observability, including the 3 pillars, OpenTelemetry, main tools, SLO burn-rate alerts, and structured data for AI diagnosis.
Unify instrumentation with OpenTelemetry, decide SaaS vs OSS by your ops structure, alert on a user-impact basis, and organize structured data AI can read. That is the practical answer for monitoring and observability in 2026.
Next time we'll cover log design (structured logs, PII protection, retention).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (62/89)