[DevOps Architecture] Log Design - Structured JSON + No PII + Phased Cold-Tiering

About this article

As the tenth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains log design.

Logs are a letter to your future self: their real value is tested half a year later, not when they are written. This article covers structured logs (JSON), log levels, correlation IDs, PII masking, retention, and log-aggregation platforms (CloudWatch / Loki / Datadog). The goal is a design that leaves necessary and sufficient information in machine-readable form and retains it for an appropriate period.

Why log design is needed

Primary information for incident investigation

When an incident occurs, the cause cannot be determined without logs. Logs are the only asset from which past system state can be reconstructed.

Audit / compliance response

Records of “who did what, and when” are always demanded in audits. SOX, PCI DSS, and the Personal Information Protection Act all call for audit logs to be retained.

Business analysis / improvement

User behavior, error trends, performance: analyzing logs yields hints for product improvement, at a finer level of detail than metrics provide.

Log levels

Standard practice is to grade logs by severity. That makes it possible to filter by level at runtime: INFO and above in production, DEBUG and above in development. Logging libraries in every major language support this as standard.

Level	Use case
TRACE	Most detailed, usually disabled
DEBUG	Debug info during development
INFO	Milestones of normal processing
WARN	Notable situations (auto-recovery etc.)
ERROR	Errors occurred, needs investigation
FATAL	Critical incident, immediate response

Enabling DEBUG in production is where the trouble starts: volume explodes and storage cost balloons several-fold.
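As a minimal sketch of level filtering by environment, the following uses Python's standard `logging` module. The environment names, the `APP_ENV` variable, and the `order-api` logger name are assumptions made for illustration. Python's standard library has no TRACE level, and its FATAL is an alias of CRITICAL, so the table above maps only approximately.

```python
import logging
import os

# Hypothetical mapping from deployment environment to minimum log level.
LEVEL_BY_ENV = {
    "production": logging.INFO,
    "staging": logging.INFO,
    "development": logging.DEBUG,
}


def configure_logging(env: str) -> logging.Logger:
    """Return a logger whose threshold matches the environment."""
    logger = logging.getLogger("order-api")
    # Unknown environments fall back to INFO so DEBUG is never enabled by accident.
    logger.setLevel(LEVEL_BY_ENV.get(env, logging.INFO))
    return logger


# APP_ENV is an assumed variable name; default to the safer production setting.
logger = configure_logging(os.environ.get("APP_ENV", "production"))
logger.debug("cart contents recalculated")  # dropped when the threshold is INFO
logger.info("order accepted")               # passes the level check at INFO
```

Falling back to INFO for unknown environments is a deliberate choice in this sketch: a misspelled environment name then costs some detail rather than a storage bill.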

Structured Logging

Structured logging means emitting logs in a machine-readable structure such as JSON. Legacy free-text logs are readable to humans, but searching, aggregating, and correlating them is difficult. Structured logs make all three easy.

{
  "timestamp": "2026-04-18T10:23:45Z",
  "level": "ERROR",
  "service": "order-api",
  "trace_id": "abc123",
  "user_id": "u42",
  "message": "Payment failed",
  "error_code": "CARD_DECLINED",
  "latency_ms": 1234
}

Free-text output is now an antipattern. All new projects should start with structured logs.
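One way to produce a record shaped like the JSON example above is a custom formatter on top of Python's standard `logging` module. This is a sketch under assumptions, not a reference implementation: the `JsonFormatter` class and the `fields` key passed through `extra=` are conventions invented here.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            # ISO 8601 in UTC, with the offset written as "Z".
            "timestamp": ts.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical convention: callers pass it via `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler(sys.stdout)  # apps write to stdout
handler.setFormatter(JsonFormatter(service="order-api"))
log = logging.getLogger("order-api.payment")
log.addHandler(handler)
log.propagate = False  # avoid a second copy through the root logger

log.error(
    "Payment failed",
    extra={"fields": {"trace_id": "abc123", "user_id": "u42",
                      "error_code": "CARD_DECLINED", "latency_ms": 1234}},
)
```

In practice a maintained library may be preferable to a hand-written formatter; the point here is only that every field becomes a searchable key.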

“A log that says only ‘an error occurred’ is like a will that says only ‘someone died’” is a standard maxim of log design. In late-night incident response, it is common to face rows of [ERROR] Payment failed and have to reconstruct which user, what amount, and why the payment failed from nothing but timestamps and cross-references against the Stripe dashboard. Stories of spending three hours to reach the cause, then adding user_id, amount, and error_code that very night, are not rare.

What to write in logs

Standardize the fields every log record carries and apply them across all services. Without an agreed set of standard fields, each service drifts into its own format and cross-service search becomes hard.

Required fields	Content
timestamp	ISO 8601, UTC recommended
level	ERROR / INFO etc.
service	Service name
trace_id	Link with distributed tracing
user_id / request_id	Subject / request identification
message	Content humans read
context	Structured additional info

Never write personal information, passwords, or tokens to logs; this is the iron rule. Once written, they cannot be taken back.
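To attach the same trace_id to every line of a request without passing it through each function, one option in Python is a `contextvars` variable combined with a `logging.Filter`. The sketch below assumes a hypothetical `handle_request` entry point and a `"-"` placeholder for lines logged outside any request.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the correlation ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace_id onto every record so a formatter can emit it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drops a record; it only enriches it


def handle_request(incoming_trace_id: Optional[str], logger: logging.Logger) -> str:
    """Hypothetical request handler: reuse the caller's ID or mint a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request started")
        return trace_id
    finally:
        # Restore the previous value so IDs never bleed between requests.
        trace_id_var.reset(token)


logger = logging.getLogger("order-api.requests")
logger.addFilter(TraceIdFilter())
handle_request("abc123", logger)
```

A context variable is used rather than a global because it stays isolated per thread and per asyncio task, which is what a correlation ID needs.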

Log-collection architecture

Emitting logs from the app is not enough; you also need a platform to aggregate, store, and search them. The modern mainstream is for apps to write to stdout and for a separate process to collect the output (the approach of the 12-Factor App, the cloud-era application-architecture guideline proposed by Heroku), so that apps never need to know where their logs end up.

flowchart LR
    APP1[App A] -->|stdout| COL[Collector<br/>Fluent Bit / Vector]
    APP2[App B] -->|stdout| COL
    APP3[App C] -->|stdout| COL
    COL --> BUF[(Buffer<br/>Kafka etc.)]
    BUF --> BACKEND[(Storage<br/>Loki / Elasticsearch / Datadog)]
    BACKEND --> UI[Search/visualization UI<br/>Grafana / Kibana]
    classDef app fill:#fef3c7,stroke:#d97706;
    classDef col fill:#dbeafe,stroke:#2563eb;
    classDef buf fill:#fae8ff,stroke:#a21caf;
    classDef store fill:#dcfce7,stroke:#16a34a;
    classDef ui fill:#f0f9ff,stroke:#0369a1;
    class APP1,APP2,APP3 app;
    class COL col;
    class BUF buf;
    class BACKEND store;
    class UI ui;

Collection tool	Characteristics
Fluent Bit	Lightweight, CNCF graduated
Vector	Rust-built, high-perf
Fluentd	Veteran, feature-rich
Promtail	Loki-dedicated, lightweight

Storage destinations (log backend)

Choose log storage by search needs, cost, and data volume. Fast search over massive log volumes is unexpectedly expensive, and “put all logs into Elasticsearch for now” is an approach that tends to break down.

Backend	Characteristics	Cost
Elasticsearch	Strongest search, heavy ops	High
Loki	Label-based, cheap	Low
Datadog Logs	SaaS-integrated	High
Splunk	Veteran, high-feature	Highest
Cloud Logging (GCP)	Managed	Mid
CloudWatch Logs	AWS-integrated	Mid
S3 / GCS	Archive	Lowest

Loki, from Grafana Labs, is highly cost-efficient and has spread rapidly in recent years. It is drawing attention as an alternative to Elasticsearch.

Retention and cost management

Log cost grows in proportion to the retention period, so keeping everything forever is not realistic. The usual approach is phased cold-tiering that takes legal and investigative requirements into account.

Tier	Period	Use case
Hot (high-speed search)	7-30 days	Incident investigation
Warm (slightly slow)	3-6 months	Analysis, audit
Cold (archive)	1-7 years	Audit requirements, legal retention
Deletion	Beyond	Erase unneeded info

Many regulations require audit logs to be kept for 7 years, whereas 30 days is often enough for application logs. The realistic approach is to classify logs and handle each class separately.
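The tiering rule can be written as a small function of a log's age. The sketch below takes the upper bound of each range in the table (30 days, 6 months, 7 years), approximates a month as 30 days and a year as 365 days, and assumes that application logs are deleted once they leave the hot tier while audit logs move on to warm and cold. All of these are simplifications chosen for illustration.

```python
from datetime import date, timedelta

# Upper bounds taken from the tier table; month and year lengths are approximated.
HOT_MAX = timedelta(days=30)
WARM_MAX = timedelta(days=180)            # about 6 months
COLD_MAX_AUDIT = timedelta(days=7 * 365)  # about 7 years


def retention_tier(written_on: date, today: date, is_audit: bool) -> str:
    """Return 'hot', 'warm', 'cold', or 'delete' for a log written on a given day."""
    age = today - written_on
    if age <= HOT_MAX:
        return "hot"
    if not is_audit:
        # Assumption of this sketch: application logs skip warm and cold.
        return "delete"
    if age <= WARM_MAX:
        return "warm"
    if age <= COLD_MAX_AUDIT:
        return "cold"
    return "delete"
```

In a real deployment this decision usually lives in the storage layer's lifecycle configuration rather than in application code; the function only makes the rule explicit.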

Log sampling

In large systems, storing every log makes cost explode, so reduce volume by sampling. The principle, however, is to store 100% of error logs and sample only successful requests.

Strategy	Content
Fixed rate	Store only 1/100
Tail sampling	Store all on errors
Importance-based	Vary by amount, user type
Adaptive sampling	Auto-adjust by traffic volume

OpenTelemetry’s tail sampling is the modern answer: the decision to keep or drop is made after the whole trace has been seen.
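The "keep every error, sample successes" principle can be sketched as a per-record decision function. Note that this is a simplified, deterministic fixed-rate sampler keyed on trace_id, not OpenTelemetry's tail sampling, which decides after the whole trace is available. The function name and the 1% default are assumptions.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.01  # keep roughly 1 in 100 successful requests


def should_store(level: str, trace_id: str, rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Keep every ERROR/FATAL record; sample everything else by trace_id."""
    if level in {"ERROR", "FATAL"}:
        return True
    # Hashing the trace_id gives the same decision for every line of a request,
    # so a sampled request keeps all of its lines rather than a random subset.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate
```

Deciding by hash instead of `random.random()` also lets independent services reach the same decision for the same trace without coordinating.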

Audit logs

Audit logs are special logs that record “who did what, and when,” and they require tamper resistance and long-term retention. Manage them through a route separate from general logs, ideally storing them on WORM (Write Once Read Many) storage.

Required record fields	Content
Who	User ID, IP
What	Operation contents
When	Timestamp
Where	System, resource
Result	Success / failure

AWS CloudTrail and GCP Audit Logs provide cloud-level audit logs as standard. App-level audit logs are safer when separated into their own tables or log streams.

PII (personal info) handling

The principle is not to output personal information to logs at all. GDPR and the Personal Information Protection Act regulate it strictly, and “it leaked from the logs” is not an excuse.

Treatment	Content
Masking	Mask like `user***@gmail.com`
Hashing	One-way conversion for analysis
Exclusion	Don’t output in the first place
PII-detection tools	Auto-detect and block

Credit card numbers, My Number, passwords, API keys: use frameworks and logging libraries to make sure these never reach the logs.
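The first three treatments in the table (masking, hashing, exclusion) can be sketched with the Python standard library. The deny-list of field names, the e-mail pattern, and the function names below are illustrative assumptions; a production system would need a broader pattern set and a properly managed secret key.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
# Hypothetical deny-list of field names that must never reach a log line.
FORBIDDEN_KEYS = {"password", "api_key", "token", "card_number"}


def mask_email(text: str) -> str:
    """Masking: keep at most four characters of the local part."""
    return EMAIL_RE.sub(lambda m: f"{m.group(1)[:4]}***@{m.group(2)}", text)


def pseudonymize(value: str, secret: bytes) -> str:
    """Hashing: keyed one-way conversion so analysis can group by user."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def scrub(fields: dict) -> dict:
    """Exclusion plus masking, applied to a record's fields before output."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in FORBIDDEN_KEYS:
            continue  # exclusion: do not output in the first place
        clean[key] = mask_email(value) if isinstance(value, str) else value
    return clean
```

A keyed HMAC is used instead of a bare hash because low-entropy identifiers such as e-mail addresses can otherwise be recovered by hashing guesses.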

Decision criterion 1: data volume

Log volume drives the backend choice: a few GB per month suits CloudWatch Logs, a few TB suits Loki or Elasticsearch, and anything beyond that calls for a full-scale platform such as Splunk.

Monthly log volume	Recommended
~10GB	CloudWatch Logs / Datadog free tier
10GB-1TB	Loki / Grafana Cloud
1TB-10TB	Self-built Elasticsearch / Datadog
10TB+	Splunk / Elastic Cloud Enterprise

Decision criterion 2: org regulatory requirements

Industries with strict audit and compliance requirements need tamper-resistant logs and long-term retention.

Industry / regulation	Required
Finance / PCI DSS	1+ year, tamper prevention
Medical / HIPAA	6-year retention, encryption
J-SOX	Accounting-related 7 years
General companies	30-90 days + audit logs long-term

How to choose by case

Personal dev / small web service

CloudWatch Logs or Cloud Logging plus structured-log output. You can start without any additional infrastructure. 30-day retention is enough; just ship audit logs to S3 for long-term retention.

Startup / SaaS (tens of GB monthly)

Grafana Cloud (Loki) + OpenTelemetry Logs. It is cost-efficient, and the same platform handles metrics and traces. Handle PII with layered defense: a pre-commit hook plus a Fluent Bit filter.

Mid-size enterprise / microservices

Self-hosted Loki + Vector or Fluent Bit + S3 archive. Loki serves hot search, S3 serves long-term archive, and audit logs from CloudTrail are kept on a separate route. This is operable with 2-3 SREs.

Finance / medical / regulated industries

Splunk or Elastic Cloud + WORM storage + tamper prevention. Keep audit logs in a separate system with separate permissions and long-term retention (7 years), and sign and chain all logs (hash chain) so that tampering can be detected.
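The hash-chain idea mentioned above can be illustrated in a few lines: each audit entry's hash covers both its own content and the previous entry's hash, so changing an earlier entry invalidates everything after it. This sketch omits signing, and the record layout and function names are assumptions rather than any product's format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, entry: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    body = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, entry: dict) -> dict:
    """Append an audit entry linked to the hash of the entry before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"entry": entry, "prev_hash": prev_hash,
              "hash": _digest(prev_hash, entry)}
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash; an edited, reordered, or mid-chain-deleted entry fails.

    Truncating the tail is NOT detected here; that needs an external anchor,
    such as periodically storing the latest hash on WORM storage.
    """
    prev_hash = GENESIS
    for record in chain:
        if (record["prev_hash"] != prev_hash
                or record["hash"] != _digest(prev_hash, record["entry"])):
            return False
        prev_hash = record["hash"]
    return True
```

Anyone who can rewrite the whole chain can also recompute every hash, which is why the article pairs chaining with signing, separate permissions, and WORM storage.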

Common misconceptions

Output everything at INFO level for safety

Volume explodes, and both cost and search break down. Level design comes first.

OK to leave dev-time print statements in production

This is the typical route by which personal information ends up in logs. Detecting print statements at pre-commit is the safe approach.
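For a Python codebase, pre-commit print detection can be sketched with the standard `ast` module, which avoids false positives on the word "print" inside strings or comments. The function names are hypothetical, and wiring `main` into an actual hook (for example by passing it the staged file paths) is left as an assumption.

```python
import ast
import sys


def find_print_calls(source: str, filename: str = "<string>") -> list:
    """Return (line, column) pairs for every call to the built-in print()."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)


def main(paths: list) -> int:
    """Return 1 if any file contains print(), so a hook can fail the commit."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line, col in find_print_calls(handle.read(), path):
                sys.stderr.write(f"{path}:{line}:{col}: print() call found\n")
                failed = True
    return 1 if failed else 0
```

Parsing the syntax tree rather than grepping means method calls such as `obj.print()` and string literals containing "print(" are not flagged.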

Want to store logs forever

Unrealistic. Cost estimates sometimes come to tens of millions of yen per year, so phased cold-tiering is required.

Just stdout is enough

That is the correct decision for the app itself. A collection platform is still needed separately, in line with the 12-Factor App.

Log-volume / retention numerical gates

Note: these are industry baseline values as of April 2026. They will become outdated as technology and the talent market shift, so update them periodically.

With logs, “collect everything” makes the bill explode, so a retention strategy per use case is required.

Item	Recommended	Reason
Production log level	INFO+	DEBUG forbidden (10x volume)
Hot retention period	7-30 days	Immediate response in investigation
Warm retention period	3-6 months	Analysis, audit
Cold retention period (audit)	1-7 years	Regulation (finance: 7 years, PCI DSS: 1 year)
Log volume per request	Under 1KB	Prevent bloat
Error-log storage	100% (no sampling)	Keep all
Success-log storage	1-10% sampling	Cost reduction
Log format	Structured JSON	AI / machine-readable
Required fields	timestamp / level / service / trace_id	Cross-search
PII (personal info) output	Absolutely forbidden	Masking required
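Several of these gates can be checked mechanically, for example in a CI test over sample log output. The sketch below checks a single serialized line for valid JSON, the four required fields, and size. Treating one line as the unit for the "under 1KB" gate and reading 1KB as 1024 bytes are both assumptions, since the table states the limit per request.

```python
import json

REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id")
MAX_LINE_BYTES = 1024  # reading "under 1KB" as 1024 bytes is an assumption


def check_log_line(line: str) -> list:
    """Return a list of gate violations for one serialized log line."""
    problems = []
    if len(line.encode("utf-8")) >= MAX_LINE_BYTES:
        problems.append("line is 1KB or larger")
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return problems + ["not valid JSON"]
    if not isinstance(record, dict):
        return problems + ["top-level value is not an object"]
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing required field: {field}")
    return problems
```

An empty list means the line passes; returning every violation rather than stopping at the first makes the output more useful in a test report.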

Monthly log volume guidelines: up to about 10GB, CloudWatch costs from nothing to a few thousand yen; up to about 1TB, Loki costs tens of thousands of yen; up to about 10TB, self-hosted Elasticsearch costs hundreds of thousands of yen; beyond that, Splunk Enterprise costs millions of yen or more. Exceeding 1TB per month is the point at which to consider migrating to Loki / Grafana Cloud.

For logs, leaving them in a form that lets you reconstruct events matters more than simply collecting them. Control cost with phased cold-tiering and sampling.

Log-operation pitfalls and forbidden moves

These are the typical accident patterns with logs. Each leads to an information leak, a cost explosion, or an investigation that cannot be completed.

Forbidden move	Why it’s bad
Output PII (personal info / passwords / API keys) to logs	The 2018 GitHub plaintext-password incident triggering 4.7M forced resets
Free-text logs	Search / aggregation / AI analysis difficult. Structured JSON required
Operate microservices without trace_id	Cross-search impossible, incident investigation goes from hours to days
Output everything at INFO, or leave DEBUG on in production	Volume explodes. Leaving DEBUG on in production is the typical $10k/month bill case
Custom format for just one service’s logs	Cross-search clogs. Unify standards within team
Save logs in same storage as app	Logs erased on breach. Different account + WORM
Aim for permanent retention	Tens of millions of yen yearly storage billing. Phased cold-tiering
Sample error logs	The rule for errors is 100% storage. Keep all failures
Output whole HTTP body to logs	The 2021 Twitter internal-log API-key-inclusion pattern
Leave print statements without pre-commit hooks	Typical of personal-info inclusion. Defend with GitLeaks / detect-secrets
Mix audit logs and general logs	Audit response impossible. Separate to different log streams

Two cases show how design gaps in what gets written to logs lead directly to information leaks: the 2018 GitHub plaintext-password log incident (an implementation error in the password-reset feature recorded some users’ plaintext passwords in internal logs, followed by about 4.7M forced resets) and the 2021 Twitter internal-log API-key incident (logging whole HTTP bodies left auth tokens in 6 months of audit logs).

Enforce the ban on outputting PII, tokens, and passwords at the implementation level. Operational rules alone do not protect against it.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, logs are redefined as material that AI reads in order to diagnose. AI assistants such as Datadog Bits AI and Grafana LLM analyze logs and explain incident causes in natural language.

Favored in the AI era	Disfavored in the AI era
Structured logs (JSON)	Free text
Standard schema (OTel Logs)	Custom format
Correlation via trace_id	Isolated logs
Meaningful messages	Just “an error occurred”

The era in which AI answers “the cause of this error is X; fix it with Y” is beginning. Its premise is output in a structure AI can understand, and the new standard is for messages to state clearly what failed and why.

Logs are a message to AI. Leave them in a form machines can read and diagnose.

Author’s note - cases where “writing everything to logs” turned into info leaks

Cases where “log everything for now” led to incidents are perennial industry lessons.

In the 2018 GitHub plaintext-password log incident, an implementation error in the password-reset feature caused some users’ plaintext passwords to be recorded in internal logs. Fortunately nothing leaked externally, but about 4.7M forced password resets were carried out as part of the internal response, and the case is retold as one where the premise that “logs are a safe place” collapsed.

In another case, API keys in Twitter’s internal logs were reported in 2021. Developers had been logging whole HTTP bodies, so auth tokens were recorded in 6 months of audit logs, and every employee with log-viewing permission could effectively read those tokens. The case highlights a structural problem: readable logs mean readable production data.

In both cases, lax design of what to output and what not to output was the fatal flaw. The discipline of keeping PII, tokens, and passwords out of logs must be enforced at the implementation level, not by operational rules.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague always invites the later question “why did we decide this again?”

Log format (JSON-structured recommended)
Log-level strategy (production INFO / dev DEBUG)
Required-field standardization (timestamp, trace_id, service)
Collection foundation (Fluent Bit / Vector / each cloud)
Storage destination (Loki / Elasticsearch / Datadog)
Retention period (hot / warm / cold)
PII countermeasure (masking, detection)

How to make the final call

The core of log design is that logs are a letter to your future self half a year from now: their true value is tested in a future incident investigation, not right after they are written. The rational decision is to make structured logs (JSON) the standard, unify standard fields such as trace_id / service / user_id, and build a state in which cross-service search works. Free-text logs are now an antipattern; the discipline of never outputting PII (personal info) and the practice of holding down cost through phased cold-tiering are both required. Storing 100% of errors and sampling successes is the standard that does not break at scale.

The other decisive axis is that logs are messages to AI. In an era when AI assistants such as Datadog Bits AI and Grafana LLM read logs and explain incident causes in natural language, structure, a standard schema, and trace_id correlation let you form a hypothesis about the cause in seconds. Conversely, free text and custom formats become liabilities that neither AI nor humans can read.

Selection priorities

Default to structured logs - JSON / standard items / meaningful messages
Absolutely forbid PII output - masking / detection / pre-commit; once output, irrecoverable
Phased cold-tiering - hot 7-30 days / warm 3-6 months / cold 1-7 years
Correlate via trace_id - link with metrics / traces, foundation for AI diagnosis

“Logs are messages to AI.” Leave them in a form whose true value emerges half a year later, through structure, standards, and correlation.

Summary

This article covered log design, including log levels, structured logs, required fields, collection and storage, retention, sampling, PII protection, and the message-to-AI viewpoint.

Default to structured logs, absolutely forbid PII output, tier storage in phases, and correlate via trace_id. That is the practical answer for log design in 2026.

Next time we’ll cover SLO and SLI (reliability targets and error budgets).

I hope you’ll read the next article as well.