About this article
As the tenth installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains log design.
Logs are a letter to your future self: their real value is tested half a year later, not at the moment you write them. This article covers structured logs (JSON), log levels, correlation IDs, PII masking, retention, and log-aggregation platforms (CloudWatch / Loki / Datadog). The goal is a design that leaves necessary and sufficient information in machine-readable form, kept for an appropriate period.
What is log design, anyway?
Picture an airplane's flight recorder (black box). When an accident happens, the only reason investigators can determine what occurred is that all in-flight data was recorded automatically. "Starting the camera after the crash" is too late.
Log design means deciding in advance what to record about your system's operation, in what format, where to store it, and for how long. Incident investigation, security audits, business analysis: none of it can begin without logs.
Without log design, there are zero clues for identifying the cause when an incident occurs. Nobody can answer "what happened last night," and the same failures keep repeating.
Why log design is needed
Primary information for incident investigation
When an incident occurs, the cause can't be determined without logs. Logs are the only asset for reconstructing past system state.
Audit / compliance response
Records of "who did what when" are always demanded in audits. SOX, PCI DSS, and the Personal Information Protection Act all require audit-log retention.
Business analysis / improvement
Analyzing logs for user behavior, error trends, and performance yields hints for product improvement. Logs carry more detailed information than metrics do.
Log levels
Standard practice is to grade logs by severity. At runtime you can filter by level, for example INFO and above in production and DEBUG and above in development. Logging libraries in every major language support this as standard.
| Level | Use case |
|---|---|
| TRACE | Most detailed, usually disabled |
| DEBUG | Debug info during development |
| INFO | Milestones of normal processing |
| WARN | Notable situations (auto-recovery etc.) |
| ERROR | Errors occurred, needs investigation |
| FATAL | Critical incident, immediate response |
Outputting DEBUG in production is where the trouble starts: volume explodes and storage cost balloons several-fold.
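As a minimal sketch of runtime level filtering, the snippet below reads the level from an environment variable and falls back to INFO. It uses Python's standard `logging` module; the variable name `LOG_LEVEL` and the logger name are assumptions for illustration. Python has no TRACE level and calls FATAL "CRITICAL", so the sketch maps those names.

```python
import logging
import os

# Hypothetical environment variable name; use whatever your deployment defines.
LEVEL_ENV_VAR = "LOG_LEVEL"


def resolve_level(default: str = "INFO") -> int:
    """Map a level name from the environment to a logging constant."""
    name = os.environ.get(LEVEL_ENV_VAR, default).upper()
    # Python has no TRACE level and names FATAL "CRITICAL"; map both.
    aliases = {"TRACE": "DEBUG", "WARN": "WARNING", "FATAL": "CRITICAL"}
    name = aliases.get(name, name)
    level = logging.getLevelName(name)
    # getLevelName() returns a string such as "Level FOO" for unknown names.
    return level if isinstance(level, int) else logging.INFO


def configure_logging() -> logging.Logger:
    """Return a logger whose threshold comes from the environment."""
    logger = logging.getLogger("order-api")
    logger.setLevel(resolve_level())
    return logger
```

With this shape, production can leave the variable unset (INFO and above), while a developer sets `LOG_LEVEL=DEBUG` locally without a code change.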
Structured Logging
Structured logging is the method of outputting logs in machine-readable structures such as JSON. Legacy free-text logs are human-readable, but search, aggregation, and correlation analysis on them are difficult. With structured logs, all three become easy.
```json
{
  "timestamp": "2026-04-18T10:23:45Z",
  "level": "ERROR",
  "service": "order-api",
  "trace_id": "abc123",
  "user_id": "u42",
  "message": "Payment failed",
  "error_code": "CARD_DECLINED",
  "latency_ms": 1234
}
```
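One way to produce records shaped like the JSON above is a custom formatter on Python's standard `logging` module. This is a sketch under assumptions, not a reference implementation: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields under a `context` key are all introduced here for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            # ISO 8601 in UTC, with "Z" instead of "+00:00".
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become record.context.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("order-api").error("Payment failed", extra={"context": {"trace_id": "abc123", "error_code": "CARD_DECLINED"}})` then emits a single JSON line carrying those fields alongside the standard ones.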
Free-text output is now an antipattern. All new projects should start with structured logs.
"A log that says only 'an error occurred' is like a will that says only 'someone died'" is the standard maxim of log design. In late-night incident response, you often face logs that contain nothing but [ERROR] Payment failed, and have to reconstruct "which user, what amount, why the payment failed" from timestamps and cross-referencing the Stripe dashboard alone. Stories of struggling for 3 hours to reach the cause, then adding user_id, amount, and error_code that very night, aren't rare.
What to write in logs
Standardize the information included in logs and unify it across all services. Without agreed standard fields, each service drifts its own way, which makes cross-service search hard.
| Required fields | Content |
|---|---|
| timestamp | ISO 8601, UTC recommended |
| level | ERROR / INFO etc. |
| service | Service name |
| trace_id | Link with distributed tracing |
| user_id / request_id | Subject / request identification |
| message | Content humans read |
| context | Structured additional info |
Never write personal info, passwords, or tokens to logs: this is the iron rule. Once output, they cannot be taken back.
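For the trace_id field, one common approach is to keep the correlation ID in request-scoped context and attach it to every record automatically, so individual log calls cannot forget it. The sketch below uses the standard `contextvars` and `logging` modules; `trace_id_var`, `TraceIdFilter`, and `start_request` are hypothetical names, and in a real service the incoming ID would typically come from a request header or a tracing library.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the correlation ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace_id onto every record so formatters can emit it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only enrich it


def start_request(incoming_trace_id: Optional[str] = None) -> str:
    """Reuse the caller's ID when present, otherwise mint a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id
```

Attaching the filter once with `handler.addFilter(TraceIdFilter())` is enough; because the ID lives in a `ContextVar`, concurrent requests in threads or asyncio tasks each see their own value.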
Log-collection architecture
Beyond outputting logs from apps, you need a platform to aggregate, store, and search them. The modern mainstream is for apps to write to stdout and for a separate process to collect the output (the approach of the 12-Factor App, the cloud-era app-architecture guideline proposed by Heroku), so that apps need not care where logs end up.
| Collection tool | Characteristics |
|---|---|
| Fluent Bit | Lightweight, CNCF graduated |
| Vector | Rust-built, high-perf |
| Fluentd | Veteran, feature-rich |
| Promtail | Loki-dedicated, lightweight |
Storage destinations (log backend)
Choose log storage by search needs, cost, and data volume. Fast search over massive logs is unexpectedly expensive, and "put all logs into Elasticsearch for now" is an approach that tends to break down.
| Backend | Characteristics | Cost |
|---|---|---|
| Elasticsearch | Strongest search, heavy ops | High |
| Loki | Label-based, cheap | Low |
| Datadog Logs | SaaS-integrated | High |
| Splunk | Veteran, high-feature | Highest |
| Cloud Logging (GCP) | Managed | Mid |
| CloudWatch Logs | AWS-integrated | Mid |
| S3 / GCS | Archive | Lowest |
Loki by Grafana Labs is extremely cost-efficient and has spread rapidly in recent years. It is drawing attention as an alternative to Elasticsearch.
Retention and cost management
Log cost grows in proportion to the retention period, so permanent storage isn't realistic. The common approach is phased cold-tiering that takes legal and investigative requirements into account.
| Tier | Period | Use case |
|---|---|---|
| Hot (high-speed search) | 7-30 days | Incident investigation |
| Warm (slightly slow) | 3-6 months | Analysis, audit |
| Cold (archive) | 1-7 years | Audit requirements, legal retention |
| Deletion | Beyond | Erase unneeded info |
Many regulations require audit logs to be kept for 7 years, while 30 days is often enough for app logs. Categorizing logs and handling each category separately is the realistic approach.
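The tiering rule above can be written down as a small decision function. This is only an illustration of the logic: the boundaries are upper ends taken from the table, the function name `storage_tier` is hypothetical, and in practice the same rule would usually be expressed as lifecycle settings on the storage backend rather than in application code.

```python
from datetime import timedelta

# Illustrative boundaries from the table above; tune them per log category.
HOT_MAX = timedelta(days=30)
WARM_MAX = timedelta(days=180)
COLD_MAX = timedelta(days=365 * 7)


def storage_tier(age: timedelta, is_audit: bool) -> str:
    """Return where a log of the given age should live."""
    if age <= HOT_MAX:
        return "hot"
    if not is_audit:
        return "delete"  # app logs: 30 days is often enough
    if age <= WARM_MAX:
        return "warm"
    if age <= COLD_MAX:
        return "cold"
    return "delete"  # past the retention requirement, erase
```

For example, a 45-day-old app log maps to "delete", while a 45-day-old audit log maps to "warm".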
Log sampling
In large systems, storing every log makes cost explode, so reduce volume by sampling. The principle, however, is to store 100% of error logs and sample only successful requests.
| Strategy | Content |
|---|---|
| Fixed rate | Store only 1/100 |
| Tail sampling | Store all on errors |
| Importance-based | Vary by amount, user type |
| Adaptive sampling | Auto-adjust by traffic volume |
OpenTelemetry's tail sampling is the modern answer: it decides whether to store a trace after seeing the whole trace.
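The "keep all errors, sample successes" principle can be sketched in a few lines. Note that this is fixed-rate sampling keyed on trace_id, not OpenTelemetry's tail sampling, which runs in a collector after the whole trace is available. The function name `should_keep` and the 1% default rate are assumptions for illustration.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.01  # keep about 1 in 100 non-error records (illustrative)


def should_keep(level: str, trace_id: str, rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Keep every ERROR/FATAL record; sample everything else by trace_id."""
    if level in {"ERROR", "FATAL"}:
        return True
    # Hash the trace_id so all records of one request get the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64
```

Keying the decision on trace_id rather than on a random draw per record means a sampled request keeps all of its log lines, which preserves the ability to follow one request end to end.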
Audit logs
Audit logs are special logs that record "who did what when," and they require tamper resistance and long-term retention. Manage them through a different route from general logs, ideally storing them on WORM (Write Once Read Many) storage.
| Required record fields | Content |
|---|---|
| Who | User ID, IP |
| What | Operation contents |
| When | Timestamp |
| Where | System, resource |
| Result | Success / failure |
AWS CloudTrail and GCP Audit Logs provide cloud-level audit logs as standard. App-level audit logs are safer when separated into different tables or different log streams.
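To make the five record fields and the tamper-resistance requirement concrete, here is a minimal sketch of an audit event plus a hash chain, one common technique for tamper detection. Everything here (`AuditEvent`, `append`, `verify`, the in-memory list) is hypothetical and for illustration; a real system would persist entries to WORM storage and anchor the latest hash somewhere an attacker cannot rewrite.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

GENESIS_HASH = "0" * 64  # stand-in "previous hash" for the first entry


@dataclass(frozen=True)
class AuditEvent:
    who: str     # user ID (plus IP if needed)
    what: str    # operation performed
    when: str    # ISO 8601 timestamp, UTC
    where: str   # system / resource
    result: str  # "success" or "failure"


def entry_hash(prev_hash: str, event: AuditEvent) -> str:
    """Hash the event together with the previous entry's hash."""
    body = json.dumps(asdict(event), sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, event: AuditEvent) -> None:
    """Add an event, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({"event": event, "hash": entry_hash(prev, event)})


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev = GENESIS_HASH
    for entry in chain:
        if entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Because each hash covers the previous one, changing an old event invalidates every entry after it, so tampering shows up as a failed `verify` rather than passing silently.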
PII (personal info) handling
The principle is simple: don't output personal info to logs. GDPR and the Personal Information Protection Act regulate this strictly, and "it leaked from logs" isn't an excuse.
| Treatment | Content |
|---|---|
| Masking | Mask like user***@gmail.com |
| Hashing | One-way conversion for analysis |
| Exclusion | Don't output in the first place |
| PII-detection tools | Auto-detect and block |
Credit card numbers, My Number, passwords, API keys: take measures in frameworks and log libraries so that these are never written to logs.
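The masking and hashing rows of the table can be sketched as two small helpers. The names `mask_email` and `pseudonymize` and the choice of HMAC-SHA256 are assumptions made for this example; a keyed hash is used so that values cannot be recovered by hashing guesses without the key.

```python
import hashlib
import hmac


def mask_email(email: str, visible: int = 4) -> str:
    """Keep the first few characters and the domain, e.g. user***@gmail.com."""
    local, sep, domain = email.partition("@")
    if not sep:
        return "***"  # not an email address; hide it entirely
    return f"{local[:visible]}***@{domain}"


def pseudonymize(value: str, secret_key: bytes) -> str:
    """One-way keyed hash: the same input always maps to the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Masking keeps a value recognizable to a human during investigation, while the keyed hash lets analysis group records by user without storing the identifier itself. A masked prefix and domain can still narrow down a person, so exclusion remains the safer default where the field is not needed.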
Decision criterion 1: data volume
Backend choice varies with log volume. At a few GB per month, CloudWatch Logs; at a few TB, Loki / Elasticsearch; beyond that, a full-fledged product such as Splunk is needed.
| Monthly log volume | Recommended |
|---|---|
| ~10GB | CloudWatch Logs / Datadog free tier |
| 10GB-1TB | Loki / Grafana Cloud |
| 1TB-10TB | Self-built Elasticsearch / Datadog |
| 10TB+ | Splunk / Elastic Cloud Enterprise |
Decision criterion 2: org regulatory requirements
Industries with strict audit/compliance requirements need log tamper-proofing and long-term retention.
| Industry / regulation | Required |
|---|---|
| Finance / PCI DSS | 1+ year, tamper prevention |
| Medical / HIPAA | 6-year retention, encryption |
| J-SOX | Accounting-related 7 years |
| General companies | 30-90 days + audit logs long-term |
How to choose by case
Personal dev / small web service
CloudWatch Logs or Cloud Logging plus structured-log output. Start without any additional platform. 30-day retention is enough; just transfer audit logs to S3 for long-term retention.
Startup / SaaS (tens of GB monthly)
Grafana Cloud (Loki) + OpenTelemetry Logs. Cost-efficient, and the same platform handles metrics and traces. Handle PII with a multi-layer defense: a pre-commit hook plus a Fluent Bit filter.
Mid-size enterprise / microservices
Self-built Loki + Vector or Fluent Bit + S3 archive. Loki handles hot search, S3 handles long-term archive, and audit logs from CloudTrail go through a separate route. Operable with 2-3 SREs.
Finance / medical / regulated industries
Splunk or Elastic Cloud + WORM storage + tamper prevention. Keep audit logs in a different system under different permissions with long-term retention (7 years), and sign and chain all logs (hash chain) so that tampering can be detected.
Log-volume / retention numerical gates
Note: These are industry baseline values as of April 2026. They will become outdated as technology and the talent market shift, so they require periodic updates.
For logs, "take everything" makes the bill explode, so per-use-case retention strategies are required.
| Item | Recommended | Reason |
|---|---|---|
| Production log level | INFO+ | DEBUG forbidden (10x volume) |
| Hot retention period | 7-30 days | Immediate response in investigation |
| Warm retention period | 3-6 months | Analysis, audit |
| Cold retention period (audit) | 1-7 years | Regulation (finance: 7 years, PCI DSS: 1 year) |
| Log volume per request | Under 1KB | Prevent bloat |
| Error-log storage | 100% (no sampling) | Keep all |
| Success-log storage | 1-10% sampling | Cost reduction |
| Log format | Structured JSON | AI / machine-readable |
| Required fields | timestamp / level / service / trace_id | Cross-search |
| PII output | Absolutely forbidden | Masking required |
Monthly log volume guidelines: up to about 10 GB on CloudWatch costs from nothing to a few thousand yen; about 1 TB on Loki, tens of thousands of yen; about 10 TB on self-built Elasticsearch, hundreds of thousands of yen; beyond that, Splunk Enterprise at millions of yen or more. Passing 1 TB per month is the line at which to consider migrating to Loki / Grafana Cloud.
With logs, leaving them in a form from which events can be reconstructed matters more than simply "taking" them. Control cost via phased cold-tiering and sampling.
Log-operation pitfalls and forbidden moves
These are the typical accident patterns in logging. All of them lead to information leaks, cost explosion, or an investigation that cannot proceed.
| Forbidden move | Why it's bad |
|---|---|
| Output PII to logs | The 2018 GitHub plaintext-password incident, which triggered about 4.7M forced resets |
| Free-text logs | Search, aggregation, and AI analysis become difficult. Structured JSON is required |
| Operate microservices without trace_id | Cross-service search is impossible; incident investigation goes from hours to days |
| Output everything, DEBUG included, in production | Leaving DEBUG on in production is the typical $10k/month bill case |
| Custom format for just one service's logs | Cross-service search clogs. Unify standards within the team |
| Save logs in the same storage as the app | Logs are erased in a breach. Use a different account + WORM |
| Aim for permanent retention | Tens of millions of yen in yearly storage billing. Use phased cold-tiering |
| Sample error logs | The rule for errors is 100% storage. Keep all failures |
| Output whole HTTP bodies to logs | The 2021 Twitter internal-log API-key-inclusion pattern |
| Leave print statements without pre-commit hooks | A typical route for personal info to slip in. Defend with GitLeaks / detect-secrets |
| Mix audit logs and general logs | Audit response becomes impossible. Separate into different log streams |
| "Output everything at INFO level for safety" | Production DEBUG logs are 10x the volume, the typical pattern for $10k/month billing |
| "Want to store logs forever" | Tens of millions in yearly storage billing; phased cold-tiering is required |
In the 2018 GitHub plaintext-password log incident, an implementation error in the password-reset feature recorded some users' plaintext passwords in internal logs and led to about 4.7M forced resets. In the 2021 Twitter internal-log API-key incident, logging whole HTTP bodies left auth tokens in 6 months of audit logs. In both cases, design gaps in what gets written to logs led directly to information leaks.
Enforce the ban on PII / token / password output at the implementation level. Operational rules alone do not protect you.
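One way to enforce the ban in code rather than by convention is a logging filter that scrubs sensitive keys before any handler sees the record. The sketch below assumes structured fields arrive on a `context` attribute (a convention invented for this example) and uses an illustrative deny-list; key-based redaction does not catch secrets embedded in free-text messages, so it complements rather than replaces secret-scanning hooks.

```python
import logging

# Illustrative deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "card_number"}
REDACTED = "[REDACTED]"


def redact(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: REDACTED if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


class RedactingFilter(logging.Filter):
    """Scrub the hypothetical `context` attribute on every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            record.context = redact(context)
        return True  # keep the record; only its sensitive values are replaced
```

Attached to the handler with `handler.addFilter(RedactingFilter())`, this runs on every record regardless of which developer wrote the log call, which is the point of enforcing the rule at the implementation level.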
AI decision axes
| AI-favored | AI-disfavored |
|---|---|
| Structured logs (JSON) | Free text |
| Standard schema (OTel Logs) | Custom format |
| Correlation via trace_id | Isolated logs |
| Meaningful messages | Just "an error occurred" |
- Default to structured logs - JSON / standard items / meaningful messages
- Absolutely forbid PII output - masking / detection / pre-commit; once output, irrecoverable
- Phased cold-tiering - hot 30 days / warm 3-6 months / cold 1-7 years
- Correlate via trace_id - link with metrics / traces, foundation for AI diagnosis
Structured logs enable AI fault diagnosis
When JSON-format logs contain trace_id, service, level, and timestamp, AI can instantly analyze "which service has concentrated errors in this time window" and "where did processing for this trace_id fail." With unstructured text logs, regex parsing is needed first, and analysis accuracy drops.
As of 2026, the major observability tools - Datadog, New Relic, and Honeycomb - all ship "ask AI about logs" features, with structured logs as a prerequisite.
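The question "which service has concentrated errors in this time window" shows why structure matters: with JSON lines it reduces to a few lines of code and no regex. The sketch below assumes one JSON object per line with the `timestamp`, `level`, and `service` fields described earlier; the function names are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; older Pythons reject a trailing "Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def errors_by_service(lines, start: str, end: str) -> Counter:
    """Count ERROR records per service with start <= timestamp < end."""
    lo, hi = parse_ts(start), parse_ts(end)
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record["level"] == "ERROR" and lo <= parse_ts(record["timestamp"]) < hi:
            counts[record["service"]] += 1
    return counts
```

Calling `errors_by_service(lines, "2026-04-18T10:00:00Z", "2026-04-18T11:00:00Z").most_common(1)` returns the service with the most errors in that hour. The same fields are what an AI assistant or a log backend's query language relies on.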
Log-level design and AI
In AI-generated code, log-level design tends to be vague. AI overuses console.log or logger.info, outputting at INFO level even where ERROR or WARN would be appropriate.
Documenting log-level criteria in the project (ERROR = immediate action, WARN = investigation needed, INFO = normal flow confirmation, DEBUG = dev only) and verifying appropriate levels via CI lint is effective.
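As one possible shape for such a CI check, the sketch below uses Python's standard `ast` module to flag `.info(...)` or `.debug(...)` calls inside `except` blocks, where ERROR or WARN is usually the appropriate level. It is a heuristic under stated assumptions: it matches any attribute call with those names, not only loggers, and the function name is hypothetical.

```python
import ast

LOW_LEVELS = {"debug", "info"}


def find_low_level_logs_in_except(source: str) -> list:
    """Return line numbers of .info()/.debug() calls inside except blocks."""
    findings = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Walk everything nested under the handler, including inner blocks.
        for child in ast.walk(node):
            if (
                isinstance(child, ast.Call)
                and isinstance(child.func, ast.Attribute)
                and child.func.attr in LOW_LEVELS
            ):
                findings.add(child.lineno)
    return sorted(findings)
```

A CI step could run this over changed files and fail the build when the list is non-empty, turning the documented level criteria into something that is checked rather than merely remembered.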
Author's note - cases where "writing everything to logs" turned into info leaks
Cases where "outputting everything to logs for now" led to incidents are perennial industry lessons.
In the 2018 GitHub plaintext-password log incident, an implementation error in the password-reset feature recorded some users' plaintext passwords in internal logs. Fortunately there was no external leak, but about 4.7M forced password resets were performed as part of the internal investigation, and the case is told as one where the premise "logs are a safe place" collapsed.
In another case, reported in 2021, API keys ended up in Twitter's internal logs. Developers had whole HTTP bodies output to logs, so auth tokens were recorded in 6 months of audit logs, and every employee with log-viewing permission effectively knew those tokens. The case highlights the structural problem that readable logs mean readable production data.
In both cases, lax design of "what to output and what not to output" was the fatal blow. The discipline of not outputting PII / tokens / passwords to logs must be enforced at the implementation level, not by operational rules.
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these are still vague always invites later questions like "why did we decide this again?"
- Log format (JSON-structured recommended)
- Log-level strategy (production INFO / dev DEBUG)
- Required-field standardization (timestamp, trace_id, service)
- Collection foundation (Fluent Bit / Vector / each cloud)
- Storage destination (Loki / Elasticsearch / Datadog)
- Retention period (hot / warm / cold)
- PII countermeasure (masking, detection)
Related Articles
- https://en.senkohome.com/arch-intro-devops-docs/
- https://en.senkohome.com/arch-intro-devops-observability/
- https://en.senkohome.com/arch-intro-devops-overview/
Summary
This article covered log design, including log levels, structured logs, required fields, collection and storage, retention, sampling, PII protection, and the AI viewpoint.
Default to structured logs, absolutely forbid PII output, use phased cold-tiering, and correlate via trace_id. That is the practical answer for log design in 2026.
Next time we'll cover SLO and SLI (reliability targets and error budgets).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (63/89)