DevOps Architecture

[DevOps Architecture] Log Design - Structured JSON + No PII + Phased Cold-Tiering

About this article

As the tenth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains log design.

Logs are a letter to your future self: their true value is tested six months later, not at the time of writing. This article covers structured logs (JSON), log levels, correlation IDs, PII masking, retention, and log-aggregation platforms (CloudWatch / Loki / Datadog) - in short, how to design logging so that the necessary and sufficient information is preserved in machine-readable form, for an appropriate length of time.

  • DevOps Architecture Overview - One Pipeline for Build, Ship, and Run (en.senkohome.com/arch-intro-devops-overview/)
  • [DevOps Architecture] DevOps and SRE Overview - Speed and Stability Coexist (en.senkohome.com/arch-intro-devops-sre/)
  • [DevOps Architecture] Version Control - Git + Monorepo + GitHub Flow Is the Standard (en.senkohome.com/arch-intro-devops-vcs/)
  • [DevOps Architecture] Dev Environment and Local Execution - Half a Day to First Commit (en.senkohome.com/arch-intro-devops-devenv/)
  • [DevOps Architecture] Code Review - PR 300 Lines + 1 Approver + CODEOWNERS (en.senkohome.com/arch-intro-devops-review/)
  • [DevOps Architecture] Test Design - Pyramid + Testcontainers + Branch Coverage (en.senkohome.com/arch-intro-devops-test/)
  • [DevOps Architecture] CI/CD - GitHub Actions + OIDC + Feature Flag Is the Standard (en.senkohome.com/arch-intro-devops-cicd/)
  • [DevOps Architecture] Deploy Strategy - Raise Frequency, Lower Risk (en.senkohome.com/arch-intro-devops-deploy/)
  • [DevOps Architecture] Monitoring and Observability - Three Pillars + OpenTelemetry + SLO Alerts (en.senkohome.com/arch-intro-devops-observability/)
  • [DevOps Architecture] SLO and SLI - Don't Pursue 100%, Buy Speed With Error Budget (en.senkohome.com/arch-intro-devops-slo/)
  • [DevOps Architecture] Incident Response - Resolve via Mechanism, Not Heroes (en.senkohome.com/arch-intro-devops-incident/)
  • [DevOps Architecture] SRE Practices - Toil Reduction and Chaos Drills (en.senkohome.com/arch-intro-devops-sre-practice/)
  • [DevOps Architecture] Documentation - Lean README + ADR + OpenAPI Toward Git (en.senkohome.com/arch-intro-devops-docs/)
  • [DevOps Architecture] Ticket and Project Management - Epic/Story/Task + 1-Day Granularity (en.senkohome.com/arch-intro-devops-ticket/)

Why log design is needed

Primary information for incident investigation

When an incident occurs, you cannot determine the cause without logs. Logs are the only asset that lets you reconstruct the past state of the system.

Audit / compliance response

Audits always demand records of who did what, and when. SOX, PCI DSS, and the Personal Information Protection Act all require audit-log retention.

Business analysis / improvement

User behavior, error trends, performance - analyzing logs yields hints for product improvement, with more detail than metrics can provide.

Log levels

Standard practice is to grade logs by severity. At runtime you can then filter by level - production at INFO and above, development at DEBUG and above. The standard logging libraries in every major language support this.

| Level | Use case |
| --- | --- |
| TRACE | Most detailed; usually disabled |
| DEBUG | Debug info during development |
| INFO | Milestones of normal processing |
| WARN | Notable situations (auto-recovered, etc.) |
| ERROR | An error occurred; needs investigation |
| FATAL | Critical failure; immediate response |

Outputting DEBUG in production is the start of hell: volume explodes and storage costs balloon severalfold.
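
As a minimal sketch (assuming Python's standard logging module and a LOG_LEVEL environment variable - both illustrative conventions, not prescribed by any standard), level filtering at startup can look like this:

import logging
import os

# Illustrative: INFO in production, DEBUG in development, switched by env var.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logger = logging.getLogger("order-api")
logger.debug("cache lookup detail")  # suppressed when LOG_LEVEL=INFO
logger.info("order created")         # emitted at INFO and above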

Structured Logging

Structured logging means outputting logs as machine-readable structures such as JSON. Legacy free-text logs are human-readable, but search, aggregation, and correlation analysis were difficult; with structured logs, all of these become easy.

{
  "timestamp": "2026-04-18T10:23:45Z",
  "level": "ERROR",
  "service": "order-api",
  "trace_id": "abc123",
  "user_id": "u42",
  "message": "Payment failed",
  "error_code": "CARD_DECLINED",
  "latency_ms": 1234
}

Free-text output is now an antipattern. All new projects should start with structured logs.
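
As a minimal sketch of emitting such a record (standard library only; the service name and field set mirror the example above and are illustrative assumptions, not a fixed schema):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-api",
            "message": record.getMessage(),
        }
        # Merge structured context passed via the `extra` keyword.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()  # stream output, per 12-Factor App
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.error("Payment failed", extra={"context": {
    "trace_id": "abc123", "user_id": "u42", "error_code": "CARD_DECLINED"}})

Production setups would more likely use a library such as structlog or python-json-logger, but the output shape is the same.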

“A log that says only ‘an error occurred’ is like a will that says only ‘someone died’” - the standard maxim of log design. In late-night incident response, facing logs that are nothing but [ERROR] Payment failed and having to reverse-engineer which user, what amount, and why the payment failed from timestamps and cross-referencing the Stripe admin console alone is an everyday occurrence in the field. Stories of struggling for three hours to reach the cause and adding user_id, amount, and error_code that very night are not rare.

What to write in logs

Standardize the information included in logs and unify it across all services. Without agreed standard fields, each service drifts in its own direction and cross-service search becomes hard.

| Required field | Content |
| --- | --- |
| timestamp | ISO 8601, UTC recommended |
| level | ERROR / INFO etc. |
| service | Service name |
| trace_id | Link to distributed tracing |
| user_id / request_id | Identifies the subject / request |
| message | Human-readable content |
| context | Structured additional info |

Never write personal information, passwords, or tokens to logs - this is the iron rule. Once output, they cannot be taken back.
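
One way to make the standard fields hard to forget is to inject them centrally instead of at each call site. A hedged sketch (the contextvars-based trace-id propagation is an illustrative assumption):

import contextvars
import logging

# Set by middleware at the start of each request.
trace_id_var = contextvars.ContextVar("trace_id", default="unknown")

class StandardFieldsFilter(logging.Filter):
    """Attach the standard fields to every record automatically."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.service = "order-api"
        record.trace_id = trace_id_var.get()
        return True

logging.getLogger("order-api").addFilter(StandardFieldsFilter())

A JSON formatter like the one above can then include record.service and record.trace_id on every line, so no individual developer can forget them.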

Log-collection architecture

Beyond outputting logs from apps, you need a foundation to aggregate, store, and search them. The modern mainstream composition is: apps write to stdout, and a separate process collects (per 12-Factor App, the cloud-era application-architecture guideline proposed by Heroku), so apps don't need to care where logs end up.

flowchart LR
    APP1[App A] -->|stdout| COL[Collector<br/>Fluent Bit / Vector]
    APP2[App B] -->|stdout| COL
    APP3[App C] -->|stdout| COL
    COL --> BUF[(Buffer<br/>Kafka etc.)]
    BUF --> BACKEND[(Storage<br/>Loki / Elasticsearch / Datadog)]
    BACKEND --> UI[Search/visualization UI<br/>Grafana / Kibana]
    classDef app fill:#fef3c7,stroke:#d97706;
    classDef col fill:#dbeafe,stroke:#2563eb;
    classDef buf fill:#fae8ff,stroke:#a21caf;
    classDef store fill:#dcfce7,stroke:#16a34a;
    classDef ui fill:#f0f9ff,stroke:#0369a1;
    class APP1,APP2,APP3 app;
    class COL col;
    class BUF buf;
    class BACKEND store;
    class UI ui;

| Collection tool | Characteristics |
| --- | --- |
| Fluent Bit | Lightweight, CNCF graduated |
| Vector | Built in Rust, high performance |
| Fluentd | Veteran, feature-rich |
| Promtail | Loki-dedicated, lightweight |

Storage destinations (log backend)

Choose log storage based on search demands, cost, and data volume. Fast search over massive log volumes is surprisingly expensive, and "throw everything into Elasticsearch for now" is an approach prone to breakdown.

| Backend | Characteristics | Cost |
| --- | --- | --- |
| Elasticsearch | Strongest search, heavy ops | High |
| Loki | Label-based, cheap | Low |
| Datadog Logs | SaaS-integrated | High |
| Splunk | Veteran, feature-rich | Highest |
| Cloud Logging (GCP) | Managed | Mid |
| CloudWatch Logs | AWS-integrated | Mid |
| S3 / GCS | Archive | Lowest |

Loki by Grafana Labs has extremely good cost efficiency and has spread rapidly in recent years, drawing attention as an alternative to Elasticsearch.

Retention and cost management

Log cost grows in proportion to the retention period, so permanent storage isn't realistic. Taking legal and investigative requirements into account, phased cold-tiering is the general approach.

| Tier | Period | Use case |
| --- | --- | --- |
| Hot (fast search) | 7-30 days | Incident investigation |
| Warm (slightly slower) | 3-6 months | Analysis, audit |
| Cold (archive) | 1-7 years | Audit requirements, legal retention |
| Deletion | Beyond that | Erase information no longer needed |

Many regulations require audit logs to be kept for 7 years, but application logs are often fine at 30 days. Categorizing logs and handling each class differently is the realistic approach.
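
If the archive lives in S3, the tiering can be automated with a lifecycle rule. A sketch using boto3 (bucket name, prefix, and day thresholds are illustrative):

import boto3

s3 = boto3.client("s3")

# Illustrative tiering: Standard -> Standard-IA at 30 days,
# -> Glacier at 180 days, delete after ~7 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",
    LifecycleConfiguration={"Rules": [{
        "ID": "log-cold-tiering",
        "Status": "Enabled",
        "Filter": {"Prefix": "app-logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 180, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 2555},
    }]},
)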

Log sampling

In large systems, storing every log explodes cost, so reduce volume via sampling. The principle, however, is to store 100% of error logs and sample only successful requests.

| Strategy | Content |
| --- | --- |
| Fixed rate | Store only 1 in 100 |
| Tail sampling | Store everything when an error occurs |
| Importance-based | Vary by amount, user type |
| Adaptive sampling | Auto-adjust to traffic volume |

OpenTelemetry’s tail sampling is the modern answer: it decides what to store only after seeing the whole trace.
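
As a minimal sketch of the "errors 100%, successes sampled" principle (the 1% rate is illustrative, and real systems usually sample at the collector or via OpenTelemetry rather than in-process):

import logging
import random

class ErrorKeepingSampler(logging.Filter):
    """Keep every WARNING-and-above record; sample the rest at a fixed rate."""
    def __init__(self, success_rate: float = 0.01):
        super().__init__()
        self.success_rate = success_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                              # errors: 100% kept
        return random.random() < self.success_rate   # successes: 1% sample

logging.getLogger("order-api").addFilter(ErrorKeepingSampler())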

Audit logs

Audit logs are special logs recording who did what and when; they require tamper-proofing and long-term retention. Manage them through a route separate from general logs, ideally on WORM (Write Once Read Many) storage.

| Required record field | Content |
| --- | --- |
| Who | User ID, IP |
| What | Operation contents |
| When | Timestamp |
| Where | System, resource |
| Result | Success / failure |

AWS CloudTrail and GCP Audit Logs provide cloud-level audit logs out of the box. Application-level audit logs are safer kept in a separate table or a separate log stream.
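
A hedged sketch of the separate-route idea (the file path and field names are illustrative assumptions): route audit events through a dedicated logger that never mixes with application logs.

import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("audit")
audit_logger.setLevel(logging.INFO)
audit_logger.propagate = False  # never leak audit events into general app logs
audit_logger.addHandler(logging.FileHandler("/var/log/audit/audit.jsonl"))

def audit(who: str, what: str, where: str, result: str) -> None:
    """Record one who / what / when / where / result event."""
    audit_logger.info(json.dumps({
        "who": who,
        "what": what,
        "when": datetime.now(timezone.utc).isoformat(),
        "where": where,
        "result": result,
    }))

audit("u42", "UPDATE order-123", "order-api", "success")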

PII (personal info) handling

The principle: don't output personal information to logs. GDPR and the Personal Information Protection Act regulate this strictly - "it leaked from the logs" is no excuse.

| Treatment | Content |
| --- | --- |
| Masking | Mask like user***@gmail.com |
| Hashing | One-way conversion for analysis |
| Exclusion | Don't output in the first place |
| PII-detection tools | Auto-detect and block |

Credit-card numbers, My Number, passwords, API keys - use your frameworks and logging libraries to guarantee these absolutely never reach the logs.
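
As an illustrative last line of defense (the regexes are deliberately naive; dedicated PII-detection tools do this properly), a masking filter might look like this:

import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

class PiiMaskingFilter(logging.Filter):
    """Mask obvious email addresses and card numbers before output."""
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = EMAIL.sub("***@***", msg)
        msg = CARD.sub("****-MASKED", msg)
        record.msg, record.args = msg, None
        return True

logging.getLogger("order-api").addFilter(PiiMaskingFilter())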

Decision criterion 1: data volume

Log volume drives the backend choice. A few GB per month: CloudWatch Logs. A few TB: Loki / Elasticsearch. Beyond that, something full-fledged like Splunk is needed.

| Monthly log volume | Recommended |
| --- | --- |
| ~10GB | CloudWatch Logs / Datadog free tier |
| 10GB-1TB | Loki / Grafana Cloud |
| 1TB-10TB | Self-hosted Elasticsearch / Datadog |
| 10TB+ | Splunk / Elastic Cloud Enterprise |

Decision criterion 2: org regulatory requirements

Industries with strict audit/compliance requirements need log tamper-proofing and long-term retention.

| Industry / regulation | Required |
| --- | --- |
| Finance / PCI DSS | 1+ year retention, tamper prevention |
| Medical / HIPAA | 6-year retention, encryption |
| J-SOX | Accounting-related logs for 7 years |
| General companies | 30-90 days + long-term audit logs |

How to choose by case

Personal dev / small web service

CloudWatch Logs or Cloud Logging plus structured-log output. Start without any additional infrastructure. 30-day retention is enough; just ship audit logs to S3 for long-term retention.

Startup / SaaS (tens of GB monthly)

Grafana Cloud (Loki) + OpenTelemetry Logs. Cost-efficient, and the same foundation handles metrics and traces. For PII, use multi-layer defense: pre-commit hooks plus Fluent Bit filters.

Mid-size enterprise / microservices

Self-hosted Loki + Vector or Fluent Bit + S3 archive. Loki for hot search, S3 for the long-term archive, and audit logs routed separately via CloudTrail. Operable with 2-3 SREs.

Finance / medical / regulated industries

Splunk or Elastic Cloud + WORM storage + tamper prevention. Keep audit logs in a separate system with separate permissions and long-term retention (7 years), and sign and chain all logs (hash chain) so tampering is detectable.
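
As a minimal sketch of the hash-chain idea (illustrative, not a compliance-grade implementation): each entry's hash covers the previous hash, so editing any past entry breaks every later hash.

import hashlib

def chain_hash(prev_hash: str, entry: str) -> str:
    """Hash this entry together with the previous hash, forming a tamper-evident chain."""
    return hashlib.sha256((prev_hash + entry).encode()).hexdigest()

prev = "0" * 64  # genesis value
for entry in ['{"who":"u1","what":"LOGIN"}', '{"who":"u2","what":"EXPORT"}']:
    prev = chain_hash(prev, entry)
    print(prev)  # stored alongside the entry; verify by recomputing the chain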

Common misconceptions

Output everything at INFO level for safety

Volume explodes, and both cost and search break down. Level design comes first.

OK to leave dev-time print statements in production

This is the classic route to personal-info inclusion. Pre-commit print-detection is the safe defense.

Want to store logs forever

Unrealistic. Cost calculations sometimes come out to tens of millions of yen per year - phased cold-tiering is required.

Just stdout is enough

Correct, at the application level. But a collection foundation is still needed separately, per 12-Factor App.

Log-volume / retention numerical gates

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For logs, “take everything” explodes the bill, so per-use-case retention strategies are required.

| Item | Recommended | Reason |
| --- | --- | --- |
| Production log level | INFO and above | DEBUG forbidden (10x volume) |
| Hot retention period | 7-30 days | Immediate response during investigation |
| Warm retention period | 3-6 months | Analysis, audit |
| Cold retention period (audit) | 1-7 years | Regulation (finance: 7 years, PCI DSS: 1 year) |
| Log volume per request | Under 1KB | Prevent bloat |
| Error-log storage | 100% (no sampling) | Keep every failure |
| Success-log storage | 1-10% sampling | Cost reduction |
| Log format | Structured JSON | AI- / machine-readable |
| Required fields | timestamp / level / service / trace_id | Cross-search |
| PII (personal info) output | Absolutely forbidden | Masking required |

Monthly log-volume cost guidelines: around 10GB on CloudWatch runs a few thousand yen; around 1TB on Loki, tens of thousands; around 10TB on self-hosted Elasticsearch, hundreds of thousands; beyond that, Splunk Enterprise runs into the millions. Past 1TB per month is the line at which to consider migrating to Loki / Grafana Cloud.
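
A back-of-the-envelope way to place yourself in that table (the traffic numbers are illustrative):

# Monthly log volume = requests/day x bytes/request x 30 days
requests_per_day = 5_000_000
bytes_per_request = 1_000  # the "under 1KB per request" gate above

monthly_gb = requests_per_day * bytes_per_request * 30 / 1e9
print(f"~{monthly_gb:.0f} GB/month")  # ~150 GB/month -> Loki / Grafana Cloud range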

With logs, the point is not just to "capture them" but to keep them in a restorable form. Control cost via phased cold-tiering and sampling.

Log-operation pitfalls and forbidden moves

Typical accident patterns with logs. Each one leads to an info leak, a cost explosion, or an investigation that can't be done.

| Forbidden move | Why it's bad |
| --- | --- |
| Output PII (personal info / passwords / API keys) to logs | The 2018 GitHub plaintext-password incident that triggered 4.7M forced resets |
| Free-text logs | Search / aggregation / AI analysis become difficult. Structured JSON required |
| Operate microservices without trace_id | Cross-search impossible; incident investigation goes from hours to days |
| Output everything at INFO+ | Leaving DEBUG on in production is the typical $10k/month bill case |
| Custom format for one service's logs | Cross-search breaks down. Unify standards within the team |
| Save logs in the same storage as the app | Logs get erased on breach. Use a different account + WORM |
| Aim for permanent retention | Tens of millions of yen per year in storage billing. Phased cold-tiering |
| Sample error logs | The rule for errors is 100% storage. Keep every failure |
| Output whole HTTP bodies to logs | The 2021 Twitter internal-log API-key-inclusion pattern |
| Leave print statements in without pre-commit hooks | The classic route to personal-info inclusion. Defend with GitLeaks / detect-secrets |
| Mix audit logs and general logs | Audit response becomes impossible. Separate into different log streams |

The 2018 GitHub plaintext-password log incident (a password-reset implementation error recorded some users' plaintext passwords in internal logs, forcing about 4.7M password resets) and the 2021 Twitter internal-log API-key-inclusion incident (whole-HTTP-body log output left auth tokens in six months of audit logs) are both cases where gaps in the design of what to output to logs led directly to information exposure.

Enforce the ban on outputting PII / tokens / passwords at the implementation level. Operational rules alone will not protect you.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, logs are redefined as material that AI reads in order to diagnose. AI assistants like Datadog Bits AI and Grafana LLM analyze logs and explain incident causes in natural language.

| Favored in the AI era | Disfavored in the AI era |
| --- | --- |
| Structured logs (JSON) | Free text |
| Standard schema (OTel Logs) | Custom format |
| Correlation via trace_id | Isolated logs |
| Meaningful messages | Just "an error occurred" |

The era when AI answers "the cause of this error is X; fix Y" is beginning. The premise is output in AI-understandable structures - the new standard is messages that state clearly what failed and why.

Logs are a message to AI. Leave them in a form machines can read and diagnose.

Author’s note - cases where “writing everything to logs” turned into info leaks

Cases where "just output everything to the logs for now" led to incidents are perennial industry lessons.

In the 2018 GitHub plaintext-password log incident, a password-reset implementation error caused some users' plaintext passwords to be recorded in internal logs. Fortunately nothing leaked externally, but the internal investigation forced about 4.7M password resets - a case often told as the moment the premise "logs are a safe place" collapsed.

In another case, in 2021 the inclusion of API keys in Twitter's internal logs was reported. Developers had been logging whole HTTP bodies, so auth tokens ended up recorded in six months of audit logs - effectively, every employee with log-viewing permission knew those tokens. The case highlights the structural problem that readable logs mean readable production data.

In both cases the fatal flaw was lax design of what to output and what not to output. The discipline of never writing PII / tokens / passwords to logs must be enforced at the implementation level, not by operational rules.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are vague always invites the later question "why did we decide this again?"

  • Log format (JSON-structured recommended)
  • Log-level strategy (production INFO / dev DEBUG)
  • Required-field standardization (timestamp, trace_id, service)
  • Collection foundation (Fluent Bit / Vector / each cloud)
  • Storage destination (Loki / Elasticsearch / Datadog)
  • Retention period (hot / warm / cold)
  • PII countermeasure (masking, detection)

How to make the final call

The core of log design: a log is a letter to your future self six months out, and its true value is tested at a future incident investigation, not right after writing. The rational decision is to make structured logs (JSON) the standard and unify standard fields like trace_id / service / user_id, so everything is cross-searchable. Free-text logs are now an antipattern; the discipline of never outputting PII (personal information), plus cost control via phased cold-tiering, is required. Storing 100% of errors and sampling successes is the standard that doesn't break at scale.

Another decisive axis: logs are messages to AI. In an era when AI assistants like Datadog Bits AI and Grafana LLM read logs and explain incident causes in natural language, having structure, a standard schema, and trace_id correlation lets you build cause hypotheses in seconds. Conversely, free text and custom formats become liabilities unreadable to both AI and humans.

Selection priorities

  1. Default to structured logs - JSON / standard items / meaningful messages
  2. Absolutely forbid PII output - masking / detection / pre-commit; once output, irrecoverable
  3. Phased cold-tiering - hot 30 days / warm 3-6 months / cold 1-7 years
  4. Correlate via trace_id - link with metrics / traces, foundation for AI diagnosis

"Logs are messages to AI." Keep structure, standards, and correlation so that their true value emerges six months later.

Summary

This article covered log design, including log levels, structured logs, required fields, collection and storage, retention, sampling, PII protection, and the message-to-AI viewpoint.

Default to structured logs, absolutely forbid PII output, phased cold-tiering, correlate via trace_id. That is the practical answer for log design in 2026.

Next time we’ll cover SLO and SLI (reliability targets and error budgets).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.