About this article
This article is the final installment (4th) in the Application Architecture category of the Architecture Crash Course for the Generative-AI Era series, and covers error handling.
Happy paths look much the same across implementations; the differences appear when "the DB blipped, the external API slowed, the input was unexpected." Design at these points decides reliability and UX. This article covers error classification, exceptions vs Result types, error boundaries, correlation IDs, retry strategy, idempotency, Circuit Breaker, timeouts, and Bulkhead — design guidance for building "systems that recover after falling."
What is error handling in the first place
Error handling is, in a nutshell, “a design that pre-determines how a program should respond when it encounters unexpected situations.”
In everyday terms, think of a car’s airbags and ABS. They’re invisible while driving normally, but the instant a collision or skid occurs, they activate to minimize damage. Software works the same way: the DB goes down for a moment, an external API stops responding, a user enters unexpected values — for these “inevitable anomalies,” deciding in advance where to detect them, how to recover, and what to tell the user is what error handling is all about.
Why error handling is needed
Without error handling, trivial failures cascade and engulf the entire system. One microservice slows down, the caller's threads fill up, then the next caller's threads fill up — an avalanche-style total stop. Preventing this requires deciding at design time where errors occur, where they're caught, how they propagate, and how they're recovered.
Early in my career, a senior engineer told me to "spend 10x the happy-path effort on the abnormal path," and years of operations gradually proved it true: some 90% of operational incidents come down to gaps in abnormal-path design.
Build “systems that recover after falling,” not “systems that don’t fall.” That is error design’s substance.
Error classification
The first step in error design is classifying expected errors by their nature. Program bugs, input errors, business errors, transient failures, and persistent failures have entirely different causes and correct responses, yet many codebases lump them into one Error or Exception and process them all in the same catch. Looking back, this is a fatal trap.
Three accidents recur if classification is skipped. First, internal errors that shouldn’t reach users (stack traces, DB-failure detail) leak to screens. Second, transient failures to retry can’t be distinguished from non-retryable business errors (insufficient balance, etc.), producing serious accidents like double charges. Third, mixed bugs and expected business errors keep alarm channels firing, “burying important warnings.” Classification is the starting point of every error strategy.
| Type | Examples | Response |
|---|---|---|
| Program bug | NullPointer, type errors | Fix |
| Input error | Validation failure | Return to user, prompt re-entry |
| Business error | Insufficient stock / balance | Process in business flow |
| Transient failure | Network / external-API timeout | Retry to attempt recovery |
| Persistent failure | Auth failure, lack of permission | Cannot retry, fail immediately |
Different errors get different handling. “One common base class catching everything” is the worst design.
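In a language with exceptions, one way to encode this classification is a small exception hierarchy. A minimal Python sketch with illustrative names (not from any particular framework):

```python
class AppError(Exception):
    """Base for expected application errors (never shown raw to users)."""
    retryable = False

class InputError(AppError):
    """Validation failure: return to the user and prompt re-entry."""

class BusinessError(AppError):
    """Rule violation (insufficient stock / balance): handle in the business flow."""

class TransientError(AppError):
    """Network blip / external-API timeout: a retry may recover."""
    retryable = True

class PersistentError(AppError):
    """Auth failure, missing permission: fail immediately, no retry."""

def should_retry(exc: Exception) -> bool:
    # Program bugs (plain Exception subclasses) and persistent failures
    # are never retried; only errors marked retryable are.
    return isinstance(exc, AppError) and exc.retryable
```

With a hierarchy like this, "retry or not" becomes a property of the type rather than a guess made in each catch block.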
Exceptions vs Result types
There are two main ways to express errors in code — exception style and Result-type style — and each language leans toward a default. Dislike of exceptions' implicit control flow has lifted the popularity of Result types, but neither is absolutely correct.
| Method | Languages | Trait |
|---|---|---|
| Exceptions (throw) | Java / C# / Python / JS | Implicit control flow; no handling code needed at each call site |
| Result types | Rust / Go / Elm / Haskell | Explicit; type system forces handling |
```go
// Go: Result-style via multiple return values
value, err := repo.Find(id)
if err != nil {
    return err
}
```

```rust
// Rust: Result type
let value = repo.find(id)?; // the ? operator propagates the error upward
```
Go’s `if err != nil` verbosity versus Rust’s `?` succinctness — the writing experience differs sharply by language. Forcing exception style onto a Result-type language, or vice versa, against the language’s grain just adds complexity.
Which to pick
Exceptions and Result types aren’t opposing concepts; using them differently by error nature is the modern mainstream. Differentiating produces both readability and safety.
| Scene | Recommended | Reason |
|---|---|---|
| Predictable failures (input / business errors) | Result / Either | Force callers to handle |
| Unpredictable failures (bugs, DB outage, OOM — Out Of Memory) | Exceptions | Throwing upward is safer than per-layer handling |
Making everything exceptions hides which functions return which errors; missed catches drop the app. Conversely, making everything Result fills logic with if err != nil and buries the substance. Drawing the line at “predictable vs not” is balanced design.
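That line can be sketched in Python (the `Ok`/`Err` shapes and function names are illustrative): predictable business failures come back as a value the caller must inspect, while infrastructure failures stay exceptions that propagate upward.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: int

@dataclass
class Err:
    reason: str

Result = Union[Ok, Err]

def withdraw(balance: int, amount: int) -> Result:
    """Predictable failure (insufficient balance): returned as a Result,
    so the type forces the caller to handle it."""
    if amount > balance:
        return Err("insufficient balance")  # business error, no exception
    return Ok(balance - amount)

def load_balance(user_id: int) -> int:
    """Unpredictable failure (DB outage): remains an exception and is
    caught once at the top boundary, not at every layer."""
    if user_id < 0:  # stand-in for a failing infrastructure call
        raise ConnectionError("db unreachable")
    return 100
```

The caller pattern-matches on `Ok`/`Err` for business outcomes and simply lets `ConnectionError` fly to the global handler.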
Error boundaries
Errors should be “caught at the appropriate boundary, aggregated” rather than “handled where they occur.” Each layer has a role; technical exceptions shouldn’t leak straight into business or UI layers.
```mermaid
flowchart BT
INFRA["Infrastructure layer<br/>throws technical exceptions<br/>(DB / network)"]
DOMAIN["Domain layer<br/>throws business exceptions<br/>(business-rule violations)"]
APP["Application layer<br/>defines business exceptions<br/>technical exceptions pass through"]
UI["UI / Controller layer<br/>aggregate via global handler<br/>convert to HTTP status"]
USER([User response])
INFRA -->|throw| DOMAIN
DOMAIN -->|throw| APP
APP -->|throw| UI
UI --> USER
classDef infra fill:#fee2e2,stroke:#dc2626;
classDef domain fill:#fef3c7,stroke:#d97706;
classDef app fill:#dbeafe,stroke:#2563eb;
classDef ui fill:#fae8ff,stroke:#a21caf;
classDef user fill:#dcfce7,stroke:#16a34a;
class INFRA infra;
class DOMAIN domain;
class APP app;
class UI ui;
class USER user;
```
The top boundary (controller / API gateway) “catches everything” and converts to HTTP status / JSON response. Try/catch at every layer makes code noisy and raises the swallowing risk.
“Catch everything in one place” — having a global error handler is the modern favorite.
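A framework-agnostic sketch of that single top boundary — here a hypothetical decorator that converts every exception into a status-plus-body pair (in real stacks this role is played by, e.g., Spring's `@ControllerAdvice` or Express error middleware):

```python
import logging

class BusinessError(Exception):
    """Expected rule violation; its message is safe to show the user."""

def global_error_handler(handler):
    """Single top-boundary catch: every exception becomes (status, body)."""
    def wrapped(request):
        try:
            return 200, handler(request)
        except BusinessError as e:
            # Expected error: convert to a user-facing message and 4xx status.
            return 409, {"error": str(e)}
        except Exception:
            # Bug or infra failure: log full detail, hide it from the user.
            logging.exception("unhandled error")
            return 500, {"error": "internal server error"}
    return wrapped

@global_error_handler
def create_order(request):
    if request.get("qty", 0) <= 0:
        raise BusinessError("quantity must be positive")
    return {"order_id": 1}
```

Inner layers just `raise`; only this one boundary knows about HTTP statuses and response bodies.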
Anti-patterns
Error-handling failure patterns that show up in every code review. All of them prioritize a "works for now" state and produce serious incidents later.
```text
❌ catch (e) { /* nothing */ }        ← Exception swallowing (worst)
❌ catch (Exception e) { log(e) }     ← All errors handled the same (no distinction)
❌ throw new Error("error")           ← Zero-information exception
❌ return null / -1 to signal failure ← Caller doesn't notice
❌ Deep try/catch nesting             ← Out of control, unreadable
```
"Catch, log only, and pass through" is especially dangerous: production monitoring doesn't detect it as an outage while the actual data is already broken — a "silent failure." When you catch, catch by specific type and be clear about what can actually be done before processing.
Build a hierarchy of exception types (business / technical / bug) and catch by specific type — that's the rule.
“No timeout” caused the avalanche (industry case)
The November 2020 AWS large-scale outage (us-east-1 Kinesis stop, ~17h impact) is the canonical avalanche case starting from “thread exhaustion.” CloudWatch, Cognito, SQS, and many AWS services were affected in cascade — textbook “external-dependency latency cascade” at real-world scale.
This kind of accident is everyday in ordinary companies too: "an HTTP client calling external APIs, left without a timeout for years" is a story heard in many places. From my own early career: we ran production code with no timeout on an external-API call; one Monday morning the other side's maintenance ran long, and our entire API server went unresponsive.
It had only worked because responses normally came back in tens of milliseconds. The day the other side jammed, the calling threads were held forever, and eventually the whole process stopped responding. The shared lesson: Bulkhead, Circuit Breaker, and timeouts cannot be "added after it happens" — you will never be in time.
External calls without timeout are like placing a time bomb.
User-facing messages
Error messages have “developer-facing” and “end-user-facing” kinds with completely different goals and content. Confusing them either leaks internal info to users or leaves developers unable to debug.
| Audience | Approach |
|---|---|
| End user | Plain text, concrete workaround, no PII or internal info |
| Developer | Stack trace, correlation ID, input values, timestamp |
```text
❌ Show user: "java.sql.SQLIntegrityConstraintViolationException: duplicate key 'email'..."
✅ Show user: "This email address is already registered"
   Log:       detailed stack trace + trace_id: abc123 + user_id: 42
```
Showing technical detail (stack traces, SQL errors) directly to users gives attackers clues. Conversely, just “an error occurred” leaves nothing to investigate when inquiries come. “Tie both to the same correlation ID” is the rule.
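One way to apply that rule — a hypothetical helper that logs the full exception for developers and returns only a safe message plus the shared trace ID to the user:

```python
import logging
import uuid

def to_user_response(exc: Exception) -> dict:
    """Developer side gets the full exception; user side gets a safe
    message and the correlation ID that ties the two together."""
    trace_id = uuid.uuid4().hex[:8]
    # Developer-facing: full detail, keyed by the trace ID.
    logging.error("trace=%s %r", trace_id, exc)
    # User-facing: no stack trace, no SQL, no internals.
    return {
        "message": "Something went wrong. Please contact support with this code.",
        "trace_id": trace_id,
    }
```

When the user quotes the code from the error screen, support can grep the logs for `trace=<code>` and see the real exception.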
Correlation ID
In microservice environments, a single request crosses multiple services, making it hard to trace what happened where. Correlation ID (or Trace ID) solves this: a unique ID assigned at request entry propagates across all services.
```text
[Request  X-Request-Id: abc123]
        ↓
[Service A] → log: trace=abc123 "processing started"
        ↓
[Service B] → log: trace=abc123 "DB write"
        ↓
[Service C] → log: trace=abc123 "notification send failed"
```
Putting that ID on user-facing error screens means that when a user reports "error number abc123," you can immediately trace the request's path and cause across all logs. The field favorite is OpenTelemetry, which automates ID issuance, propagation, and visualization.
For microservices, correlation IDs are required. Bolting on later is painful — install at the start.
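Within a single process, the propagation can be sketched with Python's `contextvars`: every log line picks up the current request's ID without threading it through each function signature (the header name and functions here are illustrative):

```python
import contextvars
import uuid

# Context-local slot: each request carries its own ID across nested calls.
request_id = contextvars.ContextVar("request_id", default="-")

def handle_request(incoming_id=None):
    # Reuse the caller's X-Request-Id if present; otherwise issue a new one.
    rid = incoming_id or uuid.uuid4().hex
    request_id.set(rid)
    return service_a()

def service_a():
    # No ID parameter needed: the context variable is in scope.
    return log("processing started")

def log(message: str) -> str:
    return f"trace={request_id.get()} {message}"
```

Across service boundaries you would forward the ID as an HTTP header; OpenTelemetry instrumentation does exactly this kind of context propagation automatically.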
Retry strategy
Transient failures (network blips, transient external-API errors) often succeed on retry seconds later. But naive retries worsen things; “backoff” + “Jitter” combined is the default.
| Method | Substance |
|---|---|
| Fixed interval | Retry every 1s for N times (simple but concentrates) |
| Exponential backoff | 1 → 2 → 4 → 8s (doubling) |
| With Jitter | Add randomness to backoff |
| Max attempts | 3-5 then give up (prevent infinite loop) |
Many clients retrying simultaneously cause “Thundering Herd”, taking down a recovering external service again. Jitter scatters timing — the rule for distributed systems. AWS SDK, Google Cloud SDK, and major libraries default to Jittered exponential backoff.
The “retry + Jitter + max attempts” triple-set is the rule.
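The triple set above can be sketched in a few lines (an illustrative helper; real projects would lean on a library such as tenacity): exponential backoff, 0-1 s jitter, and a capped attempt count.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff (1, 2, 4... s)
    plus 0-1 s jitter, giving up after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:  # only transient failures are retried
            if attempt == max_attempts - 1:
                raise  # capped attempts: prevent an infinite loop
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

The injectable `sleep` makes the helper testable; the `random.uniform(0, 1)` term is the jitter that scatters simultaneous retriers.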
Idempotency
What must always pair with retry is idempotency. Idempotent means “same request, same result regardless of repeats.” Without this, retries cause double charges, double registrations, double shipments.
❌ POST /users retried 3 times due to network failure
→ Same user created 3 times
✅ POST /users + Idempotency-Key: uuid-abc123
→ Second+ requests with same key return the first result
Implementation pattern:
- UUID issued by the client (Idempotency-Key) included in the request.
- Server records the key (DB unique constraint / Redis with TTL).
- Same-key request returns the first result.
Payment APIs such as Stripe support Idempotency-Key out of the box. For your own APIs, make it mandatory for any process that touches money or has side effects.
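The three steps above can be sketched like this (an in-memory dict stands in for the DB unique constraint / Redis-with-TTL store):

```python
# In-memory stand-in for the key store; production would use a DB unique
# constraint or Redis with a TTL, as described above.
_seen: dict = {}

def create_user(idempotency_key: str, email: str) -> dict:
    """Same key => return the recorded first result instead of
    re-executing the side effect."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    result = {"user_id": len(_seen) + 1, "email": email}  # the side effect
    _seen[idempotency_key] = result
    return result
```

A retried request with the same key hits the recorded result and the user is created exactly once, no matter how many times the network forces a resend.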
Circuit Breaker
When an external service is failing, sending more useless calls can drag you down too. Circuit Breaker prevents this — same idea as electrical breakers.
| State | Behavior |
|---|---|
| Closed | Normal. All requests pass |
| Open | Failure crossed threshold; cut off. Return errors immediately |
| Half-Open | After a wait, send a single test request to check recovery |
Calling a downed external service at 1,000 calls/sec with a 5 s timeout exhausts your own thread pool and takes you down with it. With the Circuit Breaker tripped to Open, failures return immediately and you stay alive; on recovery, calls automatically resume.
Implementations: Resilience4j / Polly / Istio / Linkerd are typical. Required for production services with external dependencies.
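A minimal sketch of the three states (illustrative only; production code would use Resilience4j, Polly, or a mesh): failures past a threshold trip the breaker to Open, calls then fail fast, and after the recovery timeout one test call is let through (Half-Open).

```python
import time

class CircuitBreaker:
    """Minimal Closed -> Open -> Half-Open sketch."""
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None => Closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Open: fail fast, don't touch the struggling service.
                raise RuntimeError("circuit open: failing fast")
            # Half-Open: fall through and allow one test request.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip to Open
            raise
        self.failures = 0
        self.opened_at = None  # success: back to Closed
        return result
```

The injectable `clock` keeps the state machine testable without real waiting.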
Timeout and Bulkhead
Beyond Circuit Breaker, three more patterns are required for any system with external dependencies. They aim to limit the impact of failures, not to prevent them.
| Pattern | Role |
|---|---|
| Timeout | Prevent infinite waits. Always set on every external call |
| Bulkhead | Isolate resources (thread pools, etc.) to prevent blocking cascades |
| Rate Limit | Cap N req/sec, protect from overload |
“No timeout set” is the most frequently seen accident cause. HTTP clients and DB connections used without explicit settings hold threads forever when the other side stops responding, and the lag eventually progresses to a full system stop. Bulkhead, like a ship’s bulkheads, is the design of separating connection pools per function so that one external service’s lag doesn’t consume all threads.
Set default timeouts on every external call. Unspecified always becomes a hotbed of incidents.
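The Bulkhead idea can be sketched with one bounded semaphore per external dependency (pool names and sizes are illustrative): exhausting the payment pool leaves the inventory pool untouched.

```python
import threading

# One bounded pool per external dependency: a slow payment API can exhaust
# its own 2 slots but never the slots reserved for the inventory API.
bulkheads = {
    "payment_api": threading.BoundedSemaphore(2),
    "inventory_api": threading.BoundedSemaphore(2),
}

def call_with_bulkhead(service: str, operation):
    sem = bulkheads[service]
    # Full pool: fail fast instead of queuing threads behind the slow service.
    if not sem.acquire(timeout=0.1):
        raise RuntimeError(f"bulkhead full for {service}")
    try:
        return operation()
    finally:
        sem.release()
```

Real thread-pool bulkheads (Resilience4j's `Bulkhead`, per-dependency connection pools) follow the same shape: a bounded resource plus a fast-fail path when it is exhausted.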
Implementation priority by case
Adopting all patterns from the start is over-engineering. Decide gradually by system scale and external-dependency count.
Personal / internal tools
Exceptions + global handler + timeout. Enough. Circuit Breaker etc. unnecessary.
General web service
Above + correlation ID + retry (with Jitter) + idempotency. Required at launch.
Microservices / heavy external API use
Above + Circuit Breaker + Bulkhead + Rate Limit. Adding from outside via service mesh (Istio, etc.) is efficient.
Payment / financial / inventory
Above + strict idempotency + transaction design (Saga / Outbox). Double processing absolutely forbidden.
Timeout / retry numeric gates
Note: industry rates as of April 2026. Periodic refresh required.
Running error strategy on the vague “appropriately” produces production accidents; set specific numerical baselines at the start. Industry defaults:
| Setting | Recommended | Reason |
|---|---|---|
| HTTP-client timeout | 5s connect / 30s read | Unspecified = time bomb |
| DB-connection timeout | 3s connect / 30s query | DB-failure avalanche prevention |
| Retry max attempts | 3-5 | Infinite loop = attack |
| Exponential-backoff intervals | 1 → 2 → 4 → 8s + Jitter 0-1s | Thundering Herd prevention |
| Circuit Breaker error-rate threshold | 50% (last 10s) | Too sensitive or too dull both bad |
| Circuit Breaker half-open recovery | 30-60s | Service recovery grace |
| Bulkhead parallelism | 10-50 per external service | Resource independence |
| Rate Limit (public API) | 60 req/min/user | Brute-force prevention |
| Idempotency-Key TTL | 24 hours | Retry-possible window |
AWS SDK, Google Cloud SDK, and Stripe SDK default to Jittered exponential backoff, so leaning on libraries beats DIY. Resilience4j (Java) / Polly (.NET) / tenacity (Python) / cockatiel (TypeScript) are the de-facto default libraries as of 2026.
No timeout set is a time bomb. Always set on every external call.
Error-handling traps
Common abnormal-path failure patterns. All cause silent failures, double processing, avalanche stops.
| Forbidden move | Why |
|---|---|
| `catch (e) { /* nothing */ }` swallowing | Silent-failure hotbed. Endless mystery bugs in production |
| `catch (Exception e) { log(e) }` treating all the same | Bugs and business errors mix; alarms keep firing into formality |
| HTTP client / DB connection without timeout | The same avalanche pattern as the November 2020 AWS us-east-1 Kinesis outage |
| Retry without idempotency keys | Network failures cause double payments, double registrations, double inventory decrements |
| Retry without Jitter | Thundering Herd takes down a recovering external service again |
| Showing users stack traces / SQL errors | Attack clues, business-info leakage potential |
| `return null` / `return -1` for error signaling | Callers don’t notice; null propagation eventually hits a NullPointer |
| External API hammering without Circuit Breaker | Thread-pool exhaustion during an external outage drags you into the cascade |
| Correlation ID added later | Cross-service log tracing becomes impossible. Install OpenTelemetry from the start |
| No error-type hierarchy (everything is Error) | Business / technical / bug errors mixed; catch fails to function |
| Assuming Circuit Breaker is for large services only | Required for any service hitting external APIs. Even personal use needs it with external dependencies |
| “try/catch wrapped, so it’s fine” complacency | Default trap of AI-generated code. Spotting swallowing in review is the human’s job |
The 2012 Knight Capital incident ($440M loss in 45 minutes) started from a single server with old code’s “error-handling gap.” Error design “added after the fact” is never in time.
“try/catch wrapped, so it’s fine” is a default trap of AI-generated code. Spotting swallowing in review is the human’s job.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Result types, explicit error returns | Implicit throws, missed catches |
| Type-expressed business errors | string-message generic exceptions |
| Standard instrumentation like OpenTelemetry | Custom log formats |
| Global error handler unifying processing | Disparate try/catch per layer |
- Classify errors by type (bug / input / business / transient / persistent).
- Aggregate at boundaries (convert in global handler).
- Retry + idempotency + Circuit Breaker triple-set (required for external dependencies).
- Bind errors with types (Result / Discriminated Union prevent swallowing).
What you must decide — what’s your project’s answer?
Articulate your project’s answer in 1-2 sentences for each:
- Exceptions vs Result types policy
- Error-type hierarchy (business / technical / bug)
- Correlation ID issuance and propagation (OpenTelemetry, etc.)
- Retry policy (backoff / Jitter / max attempts)
- Idempotency implementation (Idempotency-Key holding)
- Circuit Breaker / Timeout / Rate Limit thresholds
- User-facing error-message format
- Log levels (ERROR / WARN / INFO / DEBUG) usage rules
Summary
This article covered error handling — error classification, exceptions vs Result types, retry strategy, idempotency, Circuit Breaker.
Imagination for the abnormal path is human work. Binding AI-generated code with types and standard libraries is the realistic answer for error design in 2026.
This concludes the “Application Architecture” category’s 5 articles. The next category is “Frontend Architecture” — hosting, rendering, state management, SEO, and other frontend design judgments.
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (29/89)