[DevOps Architecture] Log Design

About this article

As the tenth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains log design.

Logs are a letter to your future self: their real value is tested six months later, not at the moment of writing. This article covers structured logs (JSON), log levels, correlation IDs, PII masking, retention, and log-aggregation platforms (CloudWatch / Loki / Datadog). The goal is a design that records necessary and sufficient information in machine-readable form and retains it for an appropriate period.

What is log design, anyway?

Log Design Fundamentals

Picture an airplane’s flight recorder (black box). When an accident happens, the only reason investigators can determine what occurred is that all in-flight data was automatically recorded. “Starting the camera after the crash” is too late.

Log design means deciding in advance what your system records about its own operation, in what format, where the records are stored, and for how long. Incident investigation, security audits, and business analysis all depend on logs; none of them can begin without them.

Without log design, when an incident occurs there are zero clues to identify the cause. Nobody can answer “what happened last night,” and the same failures keep repeating.

Why log design is needed

Primary information for incident investigation

When an incident occurs, the cause cannot be determined without logs. Logs are the only asset for reconstructing past system state.

Audit / compliance response

Audits always ask for records of “who did what, when.” SOX, PCI DSS, and the Personal Information Protection Act all call for retaining audit logs.

Business analysis / improvement

User behavior, error trends, performance: analyzing logs yields hints for product improvement. Logs carry more detail than metrics.

Log levels

The standard practice is to grade logs by severity. You can then filter by level at runtime, for example INFO and above in production and DEBUG and above in development. The logging libraries of most languages support this out of the box.

Level | Use case
TRACE | Most detailed, usually disabled
DEBUG | Debug info during development
INFO | Milestones of normal processing
WARN | Notable situations (auto-recovery etc.)
ERROR | Errors occurred, needs investigation
FATAL | Critical incident, immediate response

Leaving DEBUG output enabled in production is where the trouble starts: log volume explodes and storage cost grows several-fold.
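As a minimal sketch of level filtering per environment, the snippet below uses Python's standard `logging` module. The environment variable name `APP_ENV` and the logger naming are assumptions made for illustration. Note that Python's `logging` has no TRACE level, and its CRITICAL level plays the role of FATAL.

```python
import logging
import os


def configure_logging(env: str) -> logging.Logger:
    """Return a logger whose threshold depends on the environment name."""
    # Assumption: "production" keeps INFO and above; anything else keeps DEBUG and above.
    level = logging.INFO if env == "production" else logging.DEBUG
    logger = logging.getLogger(f"app.{env}")
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    # APP_ENV is a hypothetical variable name chosen for this sketch.
    logging.basicConfig()
    logger = configure_logging(os.environ.get("APP_ENV", "development"))
    logger.debug("cache lookup details")  # dropped when APP_ENV=production
    logger.info("order accepted")         # kept in both environments
```

Because the threshold is set in one place, switching an environment between INFO and DEBUG does not require touching any call site.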

Structured Logging

Structured logging means emitting logs in a machine-readable structure such as JSON. Legacy free-text logs are easy for humans to read but hard to search, aggregate, and correlate. With structured logs, all three become easy.

{
  "timestamp": "2026-04-18T10:23:45Z",
  "level": "ERROR",
  "service": "order-api",
  "trace_id": "abc123",
  "user_id": "u42",
  "message": "Payment failed",
  "error_code": "CARD_DECLINED",
  "latency_ms": 1234
}

Free-text output is now an antipattern. All new projects should start with structured logs.

“A log that says only ‘an error occurred’ is like a will that says only ‘someone died’” is a standard maxim of log design. In late-night incident response, it is common to face rows of logs that say only [ERROR] Payment failed and to have to reconstruct which user, what amount, and why the payment failed from nothing but timestamps and cross-references against the Stripe dashboard. Stories of teams struggling for three hours to reach the cause, then adding user_id, amount, and error_code to the log that very night, are not rare.
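One way to produce a log like the JSON example above is a custom formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the class name `JsonFormatter`, the helper `build_logger`, and the convention of passing extra fields under a `context` key are all hypothetical choices, not a standard API.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become attributes on the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("order-api")
    log.error(
        "Payment failed",
        extra={"context": {"trace_id": "abc123", "user_id": "u42",
                           "error_code": "CARD_DECLINED", "latency_ms": 1234}},
    )
```

The call site stays a single `log.error(...)`, while who, what, and why travel as fields that can be searched later.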

What to write in logs

Standardize the fields that every log includes and apply them across all services. If standard fields are not agreed on, each service drifts into its own format and cross-service search becomes hard.

Required fields | Content
timestamp | ISO 8601, UTC recommended
level | ERROR / INFO etc.
service | Service name
trace_id | Link with distributed tracing
user_id / request_id | Subject / request identification
message | Content humans read
context | Structured additional info

The iron rule: never write personal information, passwords, or tokens to logs. Once written, they cannot be taken back.
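A field standard is easier to keep when it is checked mechanically. The sketch below validates one JSON log line against a subset of the required fields; which fields count as mandatory (`REQUIRED_FIELDS`) and the function name `validate_log_line` are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta

# Assumption: this subset of the table above is treated as mandatory.
REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id", "message")


def validate_log_line(line: str) -> list:
    """Return a list of problems found in one JSON log line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    ts = record.get("timestamp")
    if isinstance(ts, str):
        try:
            # Older Python versions do not parse a trailing "Z", so normalize it first.
            parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
        else:
            if parsed.utcoffset() != timedelta(0):
                problems.append("timestamp is not UTC")
    return problems
```

A check like this could run in CI against sample output from each service, so format drift is caught before it reaches the shared log store.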

Log-collection architecture

Emitting logs from apps is not enough; you also need a platform to aggregate, store, and search them. The modern mainstream follows the 12-Factor App methodology (the cloud-era app-architecture guideline proposed by Heroku): apps write to stdout, a separate process collects the output, and apps never need to know where their logs end up.

[Figure: Log collection infrastructure (12-Factor App method). Apps A, B, and C write structured JSON logs to stdout without knowing where the logs go. A collector (Fluent Bit / Vector) handles filtering, parsing and transformation, PII masking, and routing to three tiers: hot search (7-30 days; Loki / Elasticsearch / Datadog), warm (3-6 months; Grafana / Splunk), and cold (1-7 years; S3 / GCS archive).]

Collection tool | Characteristics
Fluent Bit | Lightweight, CNCF graduated, K8s standard; runs on a few MB of memory
Vector | Built in Rust, high-performance, flexible; owned by Datadog
Fluentd | Veteran, feature-rich; Ruby plugin ecosystem
Promtail | Loki-dedicated, lightweight; standard choice when paired with Grafana Loki
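To make the collector's role concrete, here is a conceptual sketch of its stages (parse, filter, mask, route) in plain Python. Real collectors such as Fluent Bit or Vector express this in their own configuration; nothing below is their API. The key names in `SENSITIVE_KEYS`, the `audit` flag, and the destination names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple

SENSITIVE_KEYS = {"password", "token", "authorization"}  # hypothetical deny-list


def parse(line: str) -> dict:
    """Parse one JSON line; wrap anything unparseable so it is not silently lost."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None
    if not isinstance(record, dict):
        record = {"level": "INFO", "message": line.rstrip("\n"), "parse_error": True}
    return record


def process_stream(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (destination, record) pairs for every record worth keeping."""
    for line in lines:
        record = parse(line)
        if record.get("level") in ("TRACE", "DEBUG"):
            continue  # filtering: drop verbose levels before they reach storage
        for key in record.keys() & SENSITIVE_KEYS:
            record[key] = "***"  # masking: second line of defense behind the app
        destination = "audit" if record.get("audit") else "hot"  # routing
        yield destination, record
```

The application only writes lines to stdout; everything in this function can change without redeploying it.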

Storage destinations (log backend)

Choose log storage based on search needs, cost, and data volume. Fast search over large volumes of logs is unexpectedly expensive, and “put every log into Elasticsearch for now” is an approach that tends to break down.

Backend | Characteristics | Cost
Elasticsearch | Strongest search, heavy ops | High
Loki | Label-based, cheap | Low
Datadog Logs | SaaS-integrated | High
Splunk | Veteran, high-feature | Highest
Cloud Logging (GCP) | Managed | Mid
CloudWatch Logs | AWS-integrated | Mid
S3 / GCS | Archive | Lowest

Loki, from Grafana Labs, is highly cost-efficient and has spread rapidly in recent years. It is drawing attention as an alternative to Elasticsearch.

Retention and cost management

Log cost grows in proportion to the retention period, so keeping everything forever is not realistic. The usual approach is to move logs to progressively colder tiers while still meeting legal and investigative requirements.

Tier | Period | Use case
Hot (high-speed search) | 7-30 days | Incident investigation
Warm (slightly slow) | 3-6 months | Analysis, audit
Cold (archive) | 1-7 years | Audit requirements, legal retention
Deletion | Beyond | Erase unneeded info

Many regulations require audit logs to be kept for years, 7 in some cases, while 30 days is often enough for application logs. The realistic approach is to classify logs and handle each class separately.
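The tiering rule can be written down as a small function. This is a sketch, assuming the upper bound of each range in the table (30 days, about 6 months, 7 years) and assuming that application logs are deleted after 30 days while audit logs move through all tiers; the names `TIERS` and `tier_for` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the maximum of each range in the tier table is used as the cutoff.
TIERS = [
    ("hot", timedelta(days=30)),
    ("warm", timedelta(days=180)),
    ("cold", timedelta(days=365 * 7)),
]


def tier_for(written_at: datetime, now: datetime, is_audit: bool) -> str:
    """Return the storage tier for a log entry, or "delete" once it has expired."""
    age = now - written_at
    if not is_audit and age > timedelta(days=30):
        return "delete"  # application logs: 30 days is often enough
    for name, limit in TIERS:
        if age <= limit:
            return name
    return "delete"


if __name__ == "__main__":
    now = datetime(2026, 4, 18, tzinfo=timezone.utc)
    print(tier_for(now - timedelta(days=45), now, is_audit=True))
```

In practice this decision usually lives in storage lifecycle rules rather than application code, but writing it out makes the policy reviewable.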

Log sampling

In large systems, storing every log makes cost explode, so volume is reduced by sampling. The principle, however, is to store 100% of error logs and sample only successful requests.

Strategy | Content
Fixed rate | Store only 1/100
Tail sampling | Store all on errors
Importance-based | Vary by amount, user type
Adaptive sampling | Auto-adjust by traffic volume

OpenTelemetry’s tail sampling is the modern answer: the decision to store is made after the whole trace has been seen.
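The principle of keeping every error while sampling successes can be sketched as follows. Note that this is a simple per-record decision keyed on `trace_id`, not OpenTelemetry's tail sampling, which decides after a trace completes. The function name `should_store` and the default rate are assumptions.

```python
import hashlib


def should_store(level: str, trace_id: str, success_rate: float = 0.01) -> bool:
    """Keep every ERROR/FATAL log; keep a fixed share of everything else."""
    if level in ("ERROR", "FATAL"):
        return True
    # Hash the trace_id so every log line of one request gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < success_rate
```

Hashing instead of calling a random generator keeps sampled requests complete: either all lines of a successful request are stored or none are.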

Audit logs

Audit logs are special logs that record “who did what, when” and require tamper resistance and long-term retention. Manage them through a separate route from general logs, ideally on WORM (Write Once Read Many) storage.

Required record fields | Content
Who | User ID, IP
What | Operation contents
When | Timestamp
Where | System, resource
Result | Success / failure

AWS CloudTrail and GCP Audit Logs provide cloud-level audit logs as standard. For app-level audit logs, it is safer to separate them into their own tables or log streams.

PII (personal info) handling

The principle is to keep personal information out of logs. It is strictly regulated by GDPR and the Personal Information Protection Act, and “it leaked from the logs” is not an excuse.

Treatment | Content
Masking | Mask like user***@gmail.com
Hashing | One-way conversion for analysis
Exclusion | Don’t output in the first place
PII-detection tools | Auto-detect and block

Credit card numbers, My Number (the Japanese individual number), passwords, API keys: use frameworks and logging libraries to make sure these never reach the logs.
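The three treatments in the table (masking, hashing, exclusion) can be combined in one scrubbing step applied before a record is written. This is a minimal sketch: the key lists, the email pattern, and the function names are assumptions, and the regex covers only common email shapes rather than every valid address.

```python
import hashlib
import hmac
import re

SECRET_KEYS = {"password", "token", "api_key", "card_number"}  # hypothetical deny-list
HASHED_KEYS = {"email"}  # hypothetical: fields kept only as a one-way hash

# Keep up to four leading characters of the local part, as in user***@gmail.com.
EMAIL_RE = re.compile(
    r"\b([A-Za-z0-9._%+-]{1,4})[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def mask_email(text: str) -> str:
    """Masking: hide most of the local part of any email address in free text."""
    return EMAIL_RE.sub(r"\1***@\2", text)


def pseudonymize(value: str, key: bytes) -> str:
    """Hashing: keyed one-way hash, so the same value can still be counted."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def scrub(record: dict, key: bytes) -> dict:
    """Return a copy of the record that is safe to write to logs."""
    clean = {}
    for name, value in record.items():
        if name in SECRET_KEYS:
            continue  # exclusion: the field is never emitted
        if name in HASHED_KEYS:
            clean[name] = pseudonymize(str(value), key)
        elif isinstance(value, str):
            clean[name] = mask_email(value)
        else:
            clean[name] = value
    return clean
```

Putting `scrub` inside the logging library's formatter, rather than at each call site, is what turns the rule into something enforced by the implementation.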

Decision criterion 1: data volume

The right backend depends on log volume. At a few GB per month, CloudWatch Logs is enough; at up to a few TB, Loki or Elasticsearch; beyond that, a full-scale platform such as Splunk is needed.

Monthly log volume | Recommended
~10GB | CloudWatch Logs / Datadog free tier
10GB-1TB | Loki / Grafana Cloud
1TB-10TB | Self-built Elasticsearch / Datadog
10TB+ | Splunk / Elastic Cloud Enterprise

Decision criterion 2: org regulatory requirements

Industries with strict audit/compliance requirements need log tamper-proofing and long-term retention.

Industry / regulation | Required
Finance / PCI DSS | 1+ year, tamper prevention
Medical / HIPAA | 6-year retention, encryption
J-SOX | Accounting-related 7 years
General companies | 30-90 days + audit logs long-term

How to choose by case

Personal dev / small web service

CloudWatch Logs or Cloud Logging plus structured-log output. Start without any additional platform. 30-day retention is enough; just transfer audit logs to S3 for long-term retention.

Startup / SaaS (tens of GB monthly)

Grafana Cloud (Loki) + OpenTelemetry Logs. It is cost-efficient, and the same platform handles metrics and traces. Handle PII with a multi-layer defense of pre-commit hooks plus a Fluent Bit filter.

Mid-size enterprise / microservices

Self-built Loki + Vector or Fluent Bit + S3 archive. Loki serves hot search, S3 serves the long-term archive, and audit logs from CloudTrail are kept on a separate route. Operable with 2-3 SREs.

Finance / medical / regulated industries

Splunk or Elastic Cloud + WORM storage + tamper prevention. Keep audit logs in a separate system with separate permissions and long-term retention (7 years), and sign and chain all logs (hash chain) so that tampering can be detected.
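The hash-chain idea can be illustrated in a few lines: each entry stores the hash of the previous one, so changing any past entry breaks every hash after it. This sketch covers chaining only, not signing, and it assumes the latest hash is also recorded somewhere the log writer cannot modify. The names `append_entry`, `verify_chain`, and `GENESIS` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous hash together with a canonical form of the event."""
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an audit event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry makes this False."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Verification is cheap enough to run on a schedule, which turns silent tampering into a detectable alert.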

Log-volume / retention numerical gates

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For logs, “take everything” explodes the bill, so per-use-case retention strategies are required.

Item | Recommended | Reason
Production log level | INFO+ | DEBUG forbidden (10x volume)
Hot retention period | 7-30 days | Immediate response in investigation
Warm retention period | 3-6 months | Analysis, audit
Cold retention period (audit) | 1-7 years | Regulation (finance: 7 years, PCI DSS: 1 year)
Log volume per request | Under 1KB | Prevent bloat
Error-log storage | 100% (no sampling) | Keep all
Success-log storage | 1-10% sampling | Cost reduction
Log format | Structured JSON | AI / machine-readable
Required fields | timestamp / level / service / trace_id | Cross-search
PII output | Absolutely forbidden | Masking required

Monthly log-volume guidelines: up to ~10GB on CloudWatch costs from nothing to a few thousand yen; up to ~1TB on Loki, tens of thousands of yen; up to ~10TB on self-built Elasticsearch, hundreds of thousands of yen; beyond that, Splunk Enterprise at millions of yen or more. Passing 1TB per month is the point at which to consider migrating to Loki / Grafana Cloud.
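To see which volume band a service falls into, the per-request size and sampling gates can be turned into a rough estimate. This is back-of-the-envelope arithmetic under assumptions: a 30-day month, 1 GB = 10^9 bytes, and the example traffic figures, which are invented for illustration.

```python
def monthly_log_gb(requests_per_day: int, bytes_per_request: int,
                   error_ratio: float, success_sample_rate: float) -> float:
    """Estimate stored log volume per 30-day month in GB (1 GB = 10**9 bytes)."""
    # All error logs are kept; only a sampled share of success logs is kept.
    stored_share = error_ratio + (1 - error_ratio) * success_sample_rate
    return requests_per_day * 30 * bytes_per_request * stored_share / 1e9


if __name__ == "__main__":
    # Hypothetical service: 10M requests/day, 1,000 bytes of log per request, 1% errors.
    print(monthly_log_gb(10_000_000, 1000, 0.01, 1.0))   # no sampling
    print(monthly_log_gb(10_000_000, 1000, 0.01, 0.10))  # 10% of successes kept
```

For this hypothetical service, storing everything comes to about 300 GB per month, while keeping all errors and 10% of successes comes to about 32.7 GB.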

With logs, the goal is less to “capture everything” than to keep them in a form from which events can be reconstructed. Control cost through phased cold-tiering and sampling.

Log-operation pitfalls and forbidden moves

Typical accident patterns with logs. Every one of them leads to an information leak, a cost explosion, or an investigation that cannot be carried out.

Forbidden move | Why it’s bad
Output PII to logs | The 2018 GitHub plaintext-password incident, reported to have triggered 4.7M forced resets
Free-text logs | Search, aggregation, and AI analysis are difficult. Structured JSON required
Operate microservices without trace_id | Cross-search is impossible; incident investigation goes from hours to days
Leave DEBUG output on in production | The typical $10k/month bill case
Custom format for just one service’s logs | Cross-search clogs. Unify standards within the team
Save logs in the same storage as the app | Logs are erased on breach. Use a different account + WORM
Aim for permanent retention | Tens of millions of yen in yearly storage billing. Use phased cold-tiering
Sample error logs | The rule for errors is 100% storage. Keep all failures
Output whole HTTP body to logs | The 2021 Twitter internal-log API-key-inclusion pattern
Leave print statements without pre-commit hooks | A typical route for personal info to slip in. Defend with Gitleaks / detect-secrets
Mix audit logs and general logs | Audit response becomes impossible. Separate into different log streams
“Output everything at INFO level for safety” | Verbose output inflates volume; production DEBUG logs are 10x the volume, the typical pattern behind $10k/month billing
“Want to store logs forever” | Tens of millions of yen in yearly storage billing; phased cold-tiering is required

In the 2018 GitHub plaintext-password log incident, an implementation error in the password-reset feature recorded some users’ plaintext passwords in internal logs, leading to a forced reset of reportedly about 4.7M accounts. In the 2021 Twitter internal-log API-key incident, logging whole HTTP bodies reportedly left auth tokens in 6 months of audit logs. Both are cases where a design gap in what gets written to logs led directly to the exposure of sensitive information.

Enforce the ban on PII, token, and password output at the implementation level. Operational rules alone will not protect you.

AI decision axes

AI-favored | AI-disfavored
Structured logs (JSON) | Free text
Standard schema (OTel Logs) | Custom format
Correlation via trace_id | Isolated logs
Meaningful messages | Just “an error occurred”

  1. Default to structured logs - JSON / standard items / meaningful messages
  2. Absolutely forbid PII output - masking / detection / pre-commit; once output, irrecoverable
  3. Phased cold-tiering - hot 7-30 days / warm 3-6 months / cold 1-7 years
  4. Correlate via trace_id - link with metrics / traces, foundation for AI diagnosis
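Correlation via trace_id works only if every log line of a request carries the same ID without each call site passing it by hand. One way to sketch this in Python is with the standard `contextvars` module; the function names `start_request` and `log_line` are hypothetical, and a real service would normally take the ID from its tracing library or an incoming header.

```python
import contextvars
import json
import uuid
from typing import Optional

# Holds the correlation ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


def start_request(incoming_trace_id: Optional[str] = None) -> str:
    """Reuse the caller's trace_id if one arrived, otherwise create a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def log_line(service: str, level: str, message: str) -> str:
    """Build a JSON log line that always includes the current trace_id."""
    return json.dumps({"service": service, "level": level,
                       "trace_id": trace_id_var.get(), "message": message})


if __name__ == "__main__":
    start_request("abc123")
    print(log_line("order-api", "INFO", "order received"))
    print(log_line("order-api", "ERROR", "Payment failed"))
```

Both lines carry the same `trace_id`, so a single search on that value returns the whole story of the request.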

Structured logs enable AI fault diagnosis

When JSON-format logs contain trace_id, service, level, and timestamp, AI can instantly analyze “which service has concentrated errors in this time window” and “where did processing for this trace_id fail.” With unstructured text logs, parsing with regex is needed first, and analysis accuracy drops.

As of 2026, the major observability tools - Datadog, New Relic, and Honeycomb - all ship “ask AI about logs” features, with structured logs as a prerequisite.
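The question “which service has concentrated errors in this time window” shows why structure matters: with JSON lines it is a short loop and no regex at all. The sketch below assumes every record has `timestamp`, `level`, and `service` fields with UTC timestamps in one consistent ISO 8601 format; the function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def errors_by_service(lines: Iterable[str], start: str, end: str) -> Counter:
    """Count ERROR records per service with start <= timestamp < end."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        # UTC timestamps in one fixed ISO 8601 format sort chronologically as strings.
        if record["level"] == "ERROR" and start <= record["timestamp"] < end:
            counts[record["service"]] += 1
    return counts
```

Calling `errors_by_service(lines, "2026-04-18T10:00:00Z", "2026-04-18T11:00:00Z").most_common(1)` then names the service with the most errors in that hour.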

Log-level design and AI

In AI-generated code, log-level design tends to be vague. AI overuses console.log or logger.info, outputting at INFO level even where ERROR or WARN would be appropriate.

Documenting log-level criteria in the project (ERROR = immediate action, WARN = investigation needed, INFO = normal flow confirmation, DEBUG = dev only) and verifying appropriate levels via CI lint is effective.
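A CI lint for log levels does not have to be elaborate. As a rough sketch using Python's standard `ast` module, the check below flags two patterns: bare `print(...)` calls, and `.info(...)` or `.debug(...)` calls inside an `except` block, where ERROR or WARN is more likely appropriate. The two rules and the function name `lint_logging` are assumptions, and the second rule is a heuristic that will produce some false positives.

```python
import ast


def lint_logging(source: str) -> list:
    """Return sorted (line, reason) findings for suspicious logging in Python source."""
    findings = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Rule 1: print() bypasses levels, structure, and masking.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            findings.add((node.lineno, "print() instead of logger"))
        # Rule 2: a failure path logged at INFO or DEBUG is easy to miss.
        if isinstance(node, ast.ExceptHandler):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call) and isinstance(inner.func, ast.Attribute)
                        and inner.func.attr in ("info", "debug")):
                    findings.add((inner.lineno, "low level inside except block"))
    return sorted(findings)
```

Run against changed files in CI, a non-empty result can fail the build or simply be posted as a review comment.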

Author’s note - cases where “writing everything to logs” turned into info leaks

Cases where “just write everything to the logs for now” led to an incident are a perennial industry lesson.

In the 2018 GitHub plaintext-password log incident, an implementation error in the password-reset feature recorded some users’ plaintext passwords in internal logs. Fortunately there was no external leak, but reportedly about 4.7M forced password resets were performed as part of the internal response. It is told as a case where the premise that “logs are a safe place” collapsed.

The other is the 2021 report of API keys in Twitter’s internal logs. Developers had been writing whole HTTP bodies to logs, so auth tokens ended up in 6 months of audit logs, and every employee with log-viewing permission could effectively read those tokens. The case highlights a structural problem: whoever can read the logs can read production data.

In both cases the fatal flaw was lax design of what to log and what not to log. The discipline of keeping PII, tokens, and passwords out of logs must be enforced at the implementation level, not by operational rules.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in one or two sentences. Starting work while these are still vague always invites the later question, “why did we decide this again?”

  • Log format (JSON-structured recommended)
  • Log-level strategy (production INFO / dev DEBUG)
  • Required-field standardization (timestamp, trace_id, service)
  • Collection foundation (Fluent Bit / Vector / each cloud)
  • Storage destination (Loki / Elasticsearch / Datadog)
  • Retention period (hot / warm / cold)
  • PII countermeasure (masking, detection)

Summary

This article covered log design, including log levels, structured logs, required fields, collection and storage, retention, sampling, PII protection, and how logs serve as input for AI.

Default to structured logs, absolutely forbid PII output, tier storage from hot to cold in phases, and correlate via trace_id. That is the practical answer for log design in 2026.

Next time we’ll cover SLO and SLI (reliability targets and error budgets).

I hope you’ll read the next article as well.

📚 Series: Architecture Crash Course for the Generative-AI Era (63/89)