About this article
As the tenth installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains log design.
Logs are a letter to your future self: their true value is tested half a year later, not at the time of writing. This article covers structured logs (JSON), log levels, correlation IDs, PII masking, retention, and log-aggregation foundations (CloudWatch / Loki / Datadog), showing how to design logging that keeps necessary and sufficient information in machine-readable form with appropriate retention.
Why log design is needed
Primary information for incident investigation
When an incident occurs, without logs the cause cannot be determined. Logs are the only asset that lets you reconstruct past system state.
Audit / compliance response
Records of "who did what, when" are always demanded in audits. SOX, PCI DSS, the Personal Information Protection Act: all of them require audit-log retention.
Business analysis / improvement
User behavior, error trends, performance: analyzing logs yields hints for product improvement, and in more detail than metrics can offer.
Log levels
Standard practice is to grade logs by severity. At runtime you can filter by level: production at INFO and above, development at DEBUG and above. The standard log libraries in every major language support this.
| Level | Use case |
|---|---|
| TRACE | Most detailed, usually disabled |
| DEBUG | Debug info during development |
| INFO | Milestones of normal processing |
| WARN | Notable situations (auto-recovery etc.) |
| ERROR | Errors occurred, needs investigation |
| FATAL | Critical incident, immediate response |
Outputting DEBUG in production is the start of hell: volume explodes and storage costs balloon several-fold.
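As a minimal sketch with Python's standard logging module, the runtime level can be driven by an environment variable, so the same code runs at INFO in production and DEBUG in development. The variable name LOG_LEVEL is this example's convention, not a stdlib standard:

```python
import logging
import os

# Drive the level from the environment: production runs at INFO,
# developers export LOG_LEVEL=DEBUG locally. LOG_LEVEL is this
# example's convention, not a stdlib standard.
level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logging.debug("cart contents: %s", {"sku": "A-1"})  # suppressed at INFO
logging.info("order accepted")                      # emitted
```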
Structured Logging
Structured logging outputs logs in a machine-readable structure such as JSON. Legacy free-text logs are human-readable, but searching, aggregating, and correlating them is difficult. With structured logs, all of that becomes easy.
```json
{
  "timestamp": "2026-04-18T10:23:45Z",
  "level": "ERROR",
  "service": "order-api",
  "trace_id": "abc123",
  "user_id": "u42",
  "message": "Payment failed",
  "error_code": "CARD_DECLINED",
  "latency_ms": 1234
}
```
Free-text output is now an antipattern. All new projects should start with structured logs.
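A minimal sketch of producing output like the JSON above with Python's standard logging module. The JsonFormatter class and the order-api service name are illustrative; in practice a library like structlog or python-json-logger typically plays this role:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (NDJSON)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Merge structured context passed via the `extra=` argument
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, ensure_ascii=False)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.error("Payment failed", extra={"context": {
    "trace_id": "abc123", "user_id": "u42",
    "error_code": "CARD_DECLINED", "latency_ms": 1234}})
```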
"A log that says only 'an error occurred' is like a will that says only 'someone died'" - a standard maxim of log design. Late-night incident response that faces logs lined with nothing but [ERROR] Payment failed, forced to reverse-engineer which user, what amount, and why the payment failed from timestamps and cross-references against the Stripe admin screen, is a common story in the field. Tales of struggling for three hours to finally reach the cause, then adding user_id, amount, and error_code that very night, are not rare.
What to write in logs
Standardize the fields included in logs and unify them across all services. Without agreed standard fields, each service drifts into its own format and cross-service search becomes hard.
| Required fields | Content |
|---|---|
| timestamp | ISO 8601, UTC recommended |
| level | ERROR / INFO etc. |
| service | Service name |
| trace_id | Link with distributed tracing |
| user_id / request_id | Subject / request identification |
| message | Content humans read |
| context | Structured additional info |
Personal information, passwords, and tokens must never be written to logs - that is the iron rule. Once output, they cannot be taken back.
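One common way to guarantee the trace_id field on every line is to bind it once per request and inject it through a logging filter. A sketch using Python's contextvars; the middleware hook where the ID is set is assumed:

```python
import contextvars
import logging

# Hold the current request's trace_id in a contextvar so every log
# line emitted while handling that request carries it automatically.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop records, only annotate them

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])

# In request middleware (assumed), bind the incoming or generated ID:
trace_id_var.set("abc123")
logging.info("Payment failed")  # ... trace_id=abc123 Payment failed
```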
Log-collection architecture
Beyond outputting logs from apps, a foundation to aggregate, store, and search them is needed. The modern mainstream is the composition where apps write to stdout and a separate process collects (per the 12-Factor App, the cloud-era application-architecture guideline proposed by Heroku), so apps never worry about log destinations.
```mermaid
flowchart LR
    APP1[App A] -->|stdout| COL[Collector<br/>Fluent Bit / Vector]
    APP2[App B] -->|stdout| COL
    APP3[App C] -->|stdout| COL
    COL --> BUF[(Buffer<br/>Kafka etc.)]
    BUF --> BACKEND[(Storage<br/>Loki / Elasticsearch / Datadog)]
    BACKEND --> UI[Search/visualization UI<br/>Grafana / Kibana]
    classDef app fill:#fef3c7,stroke:#d97706;
    classDef col fill:#dbeafe,stroke:#2563eb;
    classDef buf fill:#fae8ff,stroke:#a21caf;
    classDef store fill:#dcfce7,stroke:#16a34a;
    classDef ui fill:#f0f9ff,stroke:#0369a1;
    class APP1,APP2,APP3 app;
    class COL col;
    class BUF buf;
    class BACKEND store;
    class UI ui;
```
| Collection tool | Characteristics |
|---|---|
| Fluent Bit | Lightweight, CNCF graduated |
| Vector | Rust-built, high-perf |
| Fluentd | Veteran, feature-rich |
| Promtail | Loki-dedicated, lightweight |
Storage destinations (log backend)
Choose log storage based on search requirements, cost, and data volume. Searching massive log volumes at high speed is surprisingly costly, and "throw all logs into Elasticsearch for now" is an approach prone to breakdown.
| Backend | Characteristics | Cost |
|---|---|---|
| Elasticsearch | Strongest search, heavy ops | High |
| Loki | Label-based, cheap | Low |
| Datadog Logs | SaaS-integrated | High |
| Splunk | Veteran, feature-rich | Highest |
| Cloud Logging (GCP) | Managed | Mid |
| CloudWatch Logs | AWS-integrated | Mid |
| S3 / GCS | Archive | Lowest |
Loki, from Grafana Labs, offers excellent cost efficiency and has spread rapidly in recent years, drawing attention as an alternative to Elasticsearch.
Retention and cost management
Log cost grows in proportion to retention, so permanent storage is not realistic. Taking legal and investigative requirements into account, phased cold-tiering is the norm.
| Tier | Period | Use case |
|---|---|---|
| Hot (high-speed search) | 7-30 days | Incident investigation |
| Warm (slightly slow) | 3-6 months | Analysis, audit |
| Cold (archive) | 1-7 years | Audit requirements, legal retention |
| Deletion | Beyond that | Erase information no longer needed |
Seven-year retention of audit logs is required by many regulations, but 30 days is often enough for application logs. Categorizing them and handling each category differently is the realistic approach.
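On AWS, this kind of phased tiering can be expressed as an S3 lifecycle policy. A hedged sketch with boto3; the bucket name, prefix, and day counts are illustrative and should follow your own retention table:

```python
import boto3  # AWS SDK for Python (third-party)

s3 = boto3.client("s3")
# Bucket name, prefix, and day counts below are illustrative.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-app-logs",
    LifecycleConfiguration={"Rules": [{
        "ID": "tiered-log-retention",
        "Status": "Enabled",
        "Filter": {"Prefix": "app/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
            {"Days": 180, "StorageClass": "GLACIER"},     # cold tier
        ],
        "Expiration": {"Days": 2555},  # delete after roughly 7 years
    }]},
)
```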
Log sampling
In large systems, storing every log explodes cost, so reduce volume by sampling. The principle, though, is to store 100% of error logs and sample only successful requests.
| Strategy | Content |
|---|---|
| Fixed rate | Store only 1/100 |
| Tail sampling | Store all on errors |
| Importance-based | Vary by amount, user type |
| Adaptive sampling | Auto-adjust by traffic volume |
OpenTelemetry's tail sampling, which decides what to store only after seeing the whole trace, is the modern answer.
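At the application level, the keep-all-errors / sample-successes rule can be approximated with a logging filter. A minimal Python sketch; the 1% rate is illustrative, and real tail sampling is better done in the collector or an OpenTelemetry pipeline:

```python
import logging
import random

class SuccessSampler(logging.Filter):
    """Keep 100% of WARNING+ records; sample INFO and below at `rate`."""

    def __init__(self, rate: float = 0.01) -> None:
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # errors and warnings are never sampled away
        return random.random() < self.rate

logging.basicConfig(level=logging.INFO)
logging.getLogger().handlers[0].addFilter(SuccessSampler(rate=0.01))
```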
Audit logs
Audit logs are special logs that record "who did what, when," and they require tamper resistance and long-term retention. Manage them on a route separate from general logs, ideally on WORM (Write Once Read Many) storage.
| Required record fields | Content |
|---|---|
| Who | User ID, IP |
| What | Operation contents |
| When | Timestamp |
| Where | System, resource |
| Result | Success / failure |
AWS CloudTrail and GCP Audit Logs provide cloud-level audit logs out of the box. Application-level audit logs are safer kept in separate tables or separate log streams.
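A minimal sketch of separating application-level audit logs into their own stream with Python's logging module. The audit.log file path and field names are illustrative; production would point this at dedicated, access-controlled storage:

```python
import json
import logging
from datetime import datetime, timezone

# Route audit records to their own logger and handler so they never
# mix with application logs. "audit.log" is illustrative; production
# would use a dedicated, access-controlled stream.
audit = logging.getLogger("audit")
audit.propagate = False  # keep audit records out of the app log
audit.addHandler(logging.FileHandler("audit.log"))
audit.setLevel(logging.INFO)

def audit_event(who: str, what: str, where: str, result: str) -> None:
    audit.info(json.dumps({
        "who": who,
        "what": what,
        "when": datetime.now(timezone.utc).isoformat(),
        "where": where,
        "result": result,
    }))

audit_event("u42", "export_customer_list", "admin-console", "success")
```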
PII (personal info) handling
The principle: do not output personal information to logs. This is strictly regulated by the GDPR and the Personal Information Protection Act, and "it leaked from the logs" is no excuse.
| Treatment | Content |
|---|---|
| Masking | Mask like user***@gmail.com |
| Hashing | One-way conversion for analysis |
| Exclusion | Donât output in the first place |
| PII-detection tools | Auto-detect and block |
Credit-card numbers, My Number, passwords, API keys: put safeguards in your frameworks and log libraries so these are never, ever output to logs.
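As one layer of defense, masking can be applied before a message ever reaches the log. A hedged Python sketch; the regexes are illustrative, not an exhaustive PII detector, and real systems combine this with exclusion and detection tools:

```python
import re

# Illustrative patterns only; not an exhaustive PII detector.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-]{1,4})[A-Za-z0-9._%+-]*(@[A-Za-z0-9.-]+)")
CARD_RE = re.compile(r"\b\d{13,16}\b")  # naive card-number match

def mask_pii(text: str) -> str:
    """Mask emails and card-like numbers before the text is logged."""
    text = EMAIL_RE.sub(r"\1***\2", text)   # user***@example.com
    text = CARD_RE.sub("[CARD REDACTED]", text)
    return text

print(mask_pii("contact user.name@gmail.com, card 4242424242424242"))
# -> contact user***@gmail.com, card [CARD REDACTED]
```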
Decision criterion 1: data volume
Log volume drives backend choice. A few GB per month: CloudWatch Logs. A few TB: Loki / Elasticsearch. Beyond that, a full-fledged product like Splunk is needed.
| Monthly log volume | Recommended |
|---|---|
| ~10GB | CloudWatch Logs / Datadog free tier |
| 10GB-1TB | Loki / Grafana Cloud |
| 1TB-10TB | Self-built Elasticsearch / Datadog |
| 10TB+ | Splunk / Elastic Cloud Enterprise |
Decision criterion 2: org regulatory requirements
Industries with strict audit/compliance requirements need log tamper-proofing and long-term retention.
| Industry / regulation | Required |
|---|---|
| Finance / PCI DSS | 1+ year, tamper prevention |
| Medical / HIPAA | 6-year retention, encryption |
| J-SOX | Accounting-related 7 years |
| General companies | 30-90 days + audit logs long-term |
How to choose by case
Personal dev / small web service
CloudWatch Logs or Cloud Logging, plus structured-log output. Start without any additional foundation. Thirty-day retention is enough; just transfer audit logs to S3 for long-term keeping.
Startup / SaaS (tens of GB monthly)
Grafana Cloud (Loki) + OpenTelemetry Logs. Cost-efficient, and the same foundation handles metrics and traces. Defend against PII in layers: pre-commit hooks plus Fluent Bit filters.
Mid-size enterprise / microservices
Self-built Loki + Vector or Fluent Bit + S3 archive. Loki handles hot search, S3 the long-term archive, and audit logs from CloudTrail travel a separate route. Operable with 2-3 SREs.
Finance / medical / regulated industries
Splunk or Elastic Cloud + WORM storage + tamper prevention. Keep audit logs in a separate system with separate permissions and long-term retention (7 years), and sign and hash-chain all logs for tamper detection.
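The hash-chain idea itself is simple: each record carries the hash of the previous record, so any in-place edit invalidates every later hash. A minimal Python sketch; the record layout is illustrative, and production systems also sign and anchor the chain externally:

```python
import hashlib
import json

# Each record embeds the hash of the previous record; editing any
# record in place invalidates every hash that follows it.
def append_chained(records: list[dict], event: dict) -> None:
    prev_hash = records[-1]["hash"] if records else "0" * 64
    body = {"event": event, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    records.append(body)

def verify(records: list[dict]) -> bool:
    prev_hash = "0" * 64
    for rec in records:
        body = {"event": rec["event"], "prev_hash": rec["prev_hash"]}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != prev_hash or rec["hash"] != digest:
            return False
        prev_hash = rec["hash"]
    return True

chain: list[dict] = []
append_chained(chain, {"who": "u42", "what": "login", "result": "ok"})
append_chained(chain, {"who": "u42", "what": "export", "result": "ok"})
assert verify(chain)             # intact chain verifies
chain[0]["event"]["what"] = "x"  # tamper with the first record
assert not verify(chain)         # ...and verification now fails
```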
Common misconceptions
Output everything at INFO level for safety
Volume explodes and cost and search break down. Level design comes first.
OK to leave dev-time print statements in production
The typical route to personal-info leakage. Detecting print statements in a pre-commit hook is the safe play.
Want to store logs forever
Unrealistic. Cost calculations sometimes show tens of millions of yen for 1 year - phased cold-tiering is required.
Just stdout is enough
Correct from the app's point of view, but a collection foundation is still needed separately, per the 12-Factor App.
Log-volume / retention numerical gates
Note: industry baseline values as of April 2026. They will become outdated as technology and market conditions shift, so periodic updates are required.
For logs, "take everything" explodes the bill, so per-use-case retention strategies are required.
| Item | Recommended | Reason |
|---|---|---|
| Production log level | INFO+ | DEBUG forbidden (10x volume) |
| Hot retention period | 7-30 days | Immediate response in investigation |
| Warm retention period | 3-6 months | Analysis, audit |
| Cold retention period (audit) | 1-7 years | Regulation (finance: 7 years, PCI DSS: 1 year) |
| Log volume per request | Under 1KB | Prevent bloat |
| Error-log storage | 100% (no sampling) | Keep all |
| Success-log storage | 1-10% sampling | Cost reduction |
| Log format | Structured JSON | AI / machine-readable |
| Required fields | timestamp / level / service / trace_id | Cross-search |
| PII (personal info) output | Absolutely forbidden | Masking required |
Monthly log-volume cost guidelines: around 10GB on CloudWatch runs from free to thousands of yen; around 1TB on Loki, tens of thousands of yen; around 10TB on self-built Elasticsearch, hundreds of thousands of yen; beyond that, Splunk Enterprise runs into the millions. Past 1TB per month is the line at which to consider migrating to Loki / Grafana Cloud.
With logs, leaving them in a restorable form matters more than merely "taking them." Control cost through phased cold-tiering and sampling.
Log-operation pitfalls and forbidden moves
Typical accident patterns in logging. Each one leads to an information leak, a cost explosion, or an impossible investigation.
| Forbidden move | Why itâs bad |
|---|---|
| Output PII (personal info / passwords / API keys) to logs | The 2018 GitHub plaintext-password incident triggering 4.7M forced resets |
| Free-text logs | Search / aggregation / AI analysis difficult. Structured JSON required |
| Operate microservices without trace_id | Cross-search impossible, incident investigation goes from hours to days |
| Output everything at INFO+ | Leaving DEBUG in production is the typical $10k/month bill case |
| Custom format for just one serviceâs logs | Cross-search clogs. Unify standards within team |
| Save logs in the same storage as the app | Attackers erase logs on a breach. Use a separate account + WORM |
| Aim for permanent retention | Tens of millions of yen yearly storage billing. Phased cold-tiering |
| Sample error logs | The rule for errors is 100% storage. Keep all failures |
| Output whole HTTP body to logs | The 2021 Twitter internal-log API-key-inclusion pattern |
| Leave print statements without pre-commit hooks | Typical of personal-info inclusion. Defend with GitLeaks / detect-secrets |
| Mix audit logs and general logs | Audit response impossible. Separate to different log streams |
The 2018 GitHub plaintext-password log incident (a password-reset implementation error recorded some users' plaintext passwords in internal logs, forcing roughly 4.7M password resets) and the 2021 Twitter internal-log API-key incident (whole-HTTP-body log output left auth tokens in six months of audit logs) are both cases where gaps in designing what gets output to logs led directly to information leaks.
Enforce the ban on outputting PII, tokens, and passwords at the implementation level. Operational rules alone will not protect you.
AI-era perspective
When AI-driven development (vibe coding) and AI operations are the premise, logs are redefined as the material AI reads to diagnose. AI assistants such as Datadog Bits AI and Grafana LLM analyze logs and explain incident causes in natural language.
| Favored in the AI era | Disfavored in the AI era |
|---|---|
| Structured logs (JSON) | Free text |
| Standard schema (OTel Logs) | Custom format |
| Correlation via trace_id | Isolated logs |
| Meaningful messages | Just "an error occurred" |
The era of AI answering "the cause of this error is X; fix Y" is beginning. The premise is output in structures AI can understand - the new standard is messages that state clearly what failed and why.
Logs are a message to AI. Leave them in a form machines can read and diagnose.
Author's note - cases where "writing everything to logs" turned into information leaks
Cases where "output everything to logs for now" led to incidents are perennial industry lessons.
In the 2018 GitHub plaintext-password log incident, a password-reset implementation error recorded some users' plaintext passwords in internal logs. Fortunately nothing leaked externally, but roughly 4.7M forced password resets were performed for the internal investigation - a case told as the collapse of the premise that "logs are a safe place."
In another case, API-key inclusion in Twitter's internal logs was reported in 2021. Developers had whole HTTP bodies written to logs, leaving auth tokens recorded across six months of audit logs - every employee with log-viewing permission effectively knew those tokens. It highlights the structural problem that readable logs mean readable production data.
In both cases, the lethal blow was lax design of what to output and what not to; the discipline of never writing PII, tokens, or passwords to logs must be enforced at the implementation level, not by operational rules.
What to decide - what is your projectâs answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"
- Log format (JSON-structured recommended)
- Log-level strategy (production INFO / dev DEBUG)
- Required-field standardization (timestamp, trace_id, service)
- Collection foundation (Fluent Bit / Vector / each cloud)
- Storage destination (Loki / Elasticsearch / Datadog)
- Retention period (hot / warm / cold)
- PII countermeasure (masking, detection)
How to make the final call
The core of log design is that a log is a letter to your future self half a year from now; its true value is tested at a future incident investigation, not just after writing. The rational decision is to make structured logs (JSON) the standard, unify standard fields like trace_id / service / user_id, and build a state where everything is cross-searchable. Free-text logs are now an antipattern, and what's required is the discipline of never outputting PII (personal information), plus operations that suppress cost through phased cold-tiering. Storing 100% of errors and sampling successes is the standard that doesn't break at scale.
Another decisive axis is that logs are messages to AI. In the era when AI assistants like Datadog Bits AI and Grafana LLM read logs and explain incident causes in natural language, having structure, a standard schema, and trace_id correlation lets you build cause hypotheses in seconds. Conversely, free text and custom formats become liabilities unreadable to both AI and humans.
Selection priorities
- Default to structured logs - JSON / standard items / meaningful messages
- Absolutely forbid PII output - masking / detection / pre-commit; once output, irrecoverable
- Phased cold-tiering - hot 30 days / warm 3-6 months / cold 1-7 years
- Correlate via trace_id - link with metrics / traces, foundation for AI diagnosis
"Logs are messages to AI." Leave them in a form whose true value emerges half a year later, through structure, standards, and correlation.
Summary
This article covered log design, including log levels, structured logs, required fields, collection and storage, retention, sampling, PII protection, and the message-to-AI viewpoint.
Default to structured logs, absolutely forbid PII output, phased cold-tiering, correlate via trace_id. That is the practical answer for log design in 2026.
Next time we'll cover SLO and SLI (reliability targets and error budgets).
I hope youâll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (63/89)