Application Architecture

Error Handling — Systems That Recover After Falling

About this article

This article, the final installment in the Application Architecture category of the Architecture Crash Course for the Generative-AI Era series, covers error handling.

Happy paths look much the same across implementations; the differences emerge when the DB blips, an external API slows, or input arrives that nobody expected. Design at those moments decides both reliability and UX. This article covers error classification, exceptions vs Result types, error boundaries, correlation IDs, retry strategy, idempotency, Circuit Breaker, timeouts, and Bulkhead: design guidance for “systems that recover after falling.”

What is error handling in the first place

Error handling is, in a nutshell, “a design that pre-determines how a program should respond when it encounters unexpected situations.”

In everyday terms, think of a car’s airbags and ABS. They’re invisible while driving normally, but the instant a collision or skid occurs, they activate to minimize damage. Software works the same way: the DB goes down for a moment, an external API stops responding, a user enters unexpected values — for these “inevitable anomalies,” deciding in advance where to detect them, how to recover, and what to tell the user is what error handling is all about.

Why error handling is needed

Without error handling, trivial failures cascade until they engulf the entire system. One slow microservice fills up the caller’s threads, then the caller’s caller fills up in turn: an avalanche that stops everything. Preventing this requires deciding at design time where errors occur, where they are caught, how they propagate, and how the system recovers.

Early in my career, a senior engineer told me to “spend ten times the happy-path effort on the abnormal path,” and years of operations gradually proved it true. Ninety percent of operational incidents trace back to gaps in abnormal-path design.

Build “systems that recover after falling,” not “systems that never fall.” That is the substance of error design.

Error classification

The first step in error design is classifying expected errors by their character. Program bugs, input errors, business errors, transient failures, persistent failures: their causes and correct responses differ entirely, yet many codebases lump them all into a single Error or Exception and process them in the same catch. That is a fatal trap.

Three kinds of accident recur when classification is skipped. First, internal errors that should never reach users (stack traces, DB-failure details) leak onto screens. Second, transient failures that deserve a retry cannot be distinguished from non-retryable business errors (insufficient balance, etc.), producing serious accidents such as double charges. Third, bugs and expected business errors mix in the same alert channel, which keeps firing until the important warnings are buried. Classification is the starting point of every error strategy.

| Type | Examples | Response |
|---|---|---|
| Program bug | NullPointer, type errors | Fix |
| Input error | Validation failure | Return to user, prompt re-entry |
| Business error | Insufficient stock / balance | Process in business flow |
| Transient failure | Network / external-API timeout | Retry to attempt recovery |
| Persistent failure | Auth failure, lack of permission | Cannot retry; fail immediately |

Different errors get different handling. “One common base class catching everything” is the worst design.

Exceptions vs Result types

There are two main ways to express errors in code, exception style and Result-type style, and each language defaults to one of them. Distaste for implicit control flow has lifted the popularity of Result types, but neither approach is absolutely correct.

| Method | Languages | Trait |
|---|---|---|
| Exceptions (throw) | Java / C# / Python / JS | Implicit control flow; usually nothing extra to write at call sites |
| Result types | Rust / Go / Elm / Haskell | Explicit; the type system forces handling |
// Go: errors returned as plain values, checked explicitly
value, err := repo.Find(id)
if err != nil { return err }

// Rust: a true Result type; the ? operator propagates the error upward
let value = repo.find(id)?;

Go’s if err != nil verbosity versus Rust’s ? succinctness: the writing experience differs sharply by language. Writing exception-style code in a Result-style language, or vice versa, goes against the language’s grain and just adds complexity.

Which to pick

Exceptions and Result types aren’t opposing concepts; the modern mainstream chooses between them by the nature of the error. Differentiating this way yields both readability and safety.

| Scene | Recommended | Reason |
|---|---|---|
| Predictable failures (input / business errors) | Result / Either | Force callers to handle |
| Unpredictable failures (bugs, DB outage, out-of-memory) | Exceptions | Throwing upward is safer than per-layer handling |

Making everything an exception hides which functions return which errors; one missed catch drops the app. Conversely, making everything a Result fills the logic with if err != nil and buries the substance. Drawing the line at “predictable vs. not” is the balanced design.
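As a concrete illustration of that line, here is a minimal TypeScript sketch: predictable business failures come back as a discriminated-union Result the caller must branch on, while unpredictable failures are simply thrown. The Result type, registerUser, and the error kinds are illustrative, not from any specific library.

// A minimal Result type: predictable failures become values, not exceptions.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Business errors expressed in the type system, not as string messages.
type RegistrationError =
  | { kind: "invalid_email"; input: string }
  | { kind: "email_taken"; email: string };

function registerUser(email: string): Result<{ id: number }, RegistrationError> {
  if (!email.includes("@")) {
    return { ok: false, error: { kind: "invalid_email", input: email } };
  }
  // Persistence omitted. A DB outage here would THROW instead,
  // because it is unpredictable and belongs to an upper error boundary.
  return { ok: true, value: { id: 42 } };
}

// The compiler forces the caller to handle the failure branch.
const result = registerUser("alice@example.com");
if (result.ok) {
  console.log(`registered user ${result.value.id}`);
} else {
  console.warn(`registration failed: ${result.error.kind}`);
}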

Error boundaries

Errors should be caught and aggregated at the appropriate boundary rather than handled wherever they occur. Each layer has a role; technical exceptions shouldn’t leak straight into the business or UI layers.

flowchart BT
    INFRA["Infrastructure layer<br/>throws technical exceptions<br/>(DB / network)"]
    DOMAIN["Domain layer<br/>throws business exceptions<br/>(business-rule violations)"]
    APP["Application layer<br/>defines business exceptions<br/>technical exceptions pass through"]
    UI["UI / Controller layer<br/>aggregate via global handler<br/>convert to HTTP status"]
    USER([User response])
    INFRA -->|throw| DOMAIN
    DOMAIN -->|throw| APP
    APP -->|throw| UI
    UI --> USER
    classDef infra fill:#fee2e2,stroke:#dc2626;
    classDef domain fill:#fef3c7,stroke:#d97706;
    classDef app fill:#dbeafe,stroke:#2563eb;
    classDef ui fill:#fae8ff,stroke:#a21caf;
    classDef user fill:#dcfce7,stroke:#16a34a;
    class INFRA infra;
    class DOMAIN domain;
    class APP app;
    class UI ui;
    class USER user;

The top boundary (controller / API gateway) catches everything and converts it to an HTTP status and JSON response. Try/catch at every layer makes code noisy and raises the risk of swallowing errors.

“Catch everything in one place” — having a global error handler is the modern favorite.
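As a sketch of that single aggregation point, here is an Express-style global error handler in TypeScript. Express is just an assumed example here, and BusinessError with its status field is defined purely for illustration.

import express, { Request, Response, NextFunction } from "express";

// An expected business failure carrying the HTTP status it should map to.
class BusinessError extends Error {
  constructor(message: string, public readonly status: number = 422) {
    super(message);
  }
}

const app = express();

app.get("/users/:id", () => {
  // Layers below simply throw; nobody catches here.
  throw new BusinessError("user not found", 404);
});

// The one place everything is caught and converted to an HTTP response.
app.use((err: Error, req: Request, res: Response, _next: NextFunction) => {
  if (err instanceof BusinessError) {
    res.status(err.status).json({ message: err.message }); // safe to show
  } else {
    console.error({ path: req.path, err }); // detail goes to the log only
    res.status(500).json({ message: "An unexpected error occurred" });
  }
});

app.listen(3000);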

Anti-patterns

These error-handling failure patterns show up in code review again and again. All of them prioritize a “works for now” state and produce serious incidents later.

❌ catch (e) { /* nothing */ }   ← Exception swallowing (worst)
❌ catch (Exception e) { log(e) } ← All same handling (no distinction)
❌ throw new Error("error")       ← Zero-info exception
❌ return null / -1 to signal failure ← Caller doesn't notice
❌ Deep try/catch nesting          ← Out of control, unreadable

The worst of these is “catch, log, and carry on”: production monitoring never detects it as an outage, yet the actual data is broken, a classic “silent failure.” When you do catch, catch by specific type and be clear about what can actually be done before processing.

Build a hierarchy of exception types (business / technical / bug) and catch by specific type. That is the rule.
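A minimal TypeScript sketch of such a hierarchy; the class names and placeOrder are illustrative.

// Illustrative exception hierarchy: business / technical / everything else (bugs).
abstract class AppError extends Error {}
class TechnicalError extends AppError {}   // DB, network, external APIs
class BusinessError extends AppError {}    // expected business-rule violations

class InsufficientStockError extends BusinessError {
  constructor(public readonly productId: string) {
    super(`insufficient stock: ${productId}`);
  }
}

function placeOrder(productId: string): void {
  try {
    throw new InsufficientStockError(productId); // order logic stands in here
  } catch (e) {
    if (e instanceof InsufficientStockError) {
      console.warn(`out of stock: ${e.productId}`); // handled in the business flow
    } else if (e instanceof TechnicalError) {
      throw e; // transient: let a retry layer or the boundary deal with it
    } else {
      throw e; // a bug: never swallow, rethrow to the global handler
    }
  }
}

placeOrder("sku-123");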

“No timeout” caused the avalanche (industry case)

The November 2020 AWS large-scale outage (us-east-1 Kinesis stop, ~17h impact) is the canonical avalanche case starting from “thread exhaustion.” CloudWatch, Cognito, SQS, and many AWS services were affected in cascade — textbook “external-dependency latency cascade” at real-world scale.

This kind of accident is everyday inside companies too: “an HTTP client calling external APIs was left without a timeout for years” is a story heard in many places. From my own early career: we ran production code with no timeout on an external API; one Monday morning the other side’s maintenance ran long, and our entire API server went unresponsive.

It had worked only because responses normally came back in tens of milliseconds. The day the other side jammed, the calling threads were held forever, and eventually the whole process stopped responding. The shared lesson: Bulkhead, Circuit Breaker, and timeouts cannot be added after it happens; you will never be in time.

External calls without timeout are like placing a time bomb.

User-facing messages

Error messages come in two kinds, developer-facing and end-user-facing, with completely different goals and content. Confusing the two either leaks internal information to users or leaves developers unable to debug.

| Audience | Approach |
|---|---|
| End user | Plain text, concrete workaround, no PII or internal info |
| Developer | Stack trace, correlation ID, input values, timestamp |
❌ Show user: "java.sql.SQLIntegrityConstraintException: duplicate key 'email'..."
✅ Show user: "This email address is already registered"
   Log:        Detailed stack trace + trace_id: abc123 + user_id: 42

Showing technical detail (stack traces, SQL errors) directly to users hands attackers clues. Conversely, a bare “an error occurred” leaves nothing to investigate when an inquiry comes in. The rule: tie both sides together with the same correlation ID.

Correlation ID

In microservice environments, a single request crosses multiple services, making it hard to trace what happened where. Correlation ID (or Trace ID) solves this: a unique ID assigned at request entry propagates across all services.

[Request  X-Request-Id: abc123]
[Service A] → log: trace=abc123 "processing started"
[Service B] → log: trace=abc123 "DB write"
[Service C] → log: trace=abc123 "notification send failed"

Put that ID on the user-facing error screen, and when a user reports “error number abc123” you can immediately trace the path and cause across every service’s logs. The field favorite is OpenTelemetry, which automates ID issuance, propagation, and visualization.

For microservices, correlation IDs are required. Bolting on later is painful — install at the start.
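In production OpenTelemetry handles this, but the mechanism itself fits in a short sketch. Below is a TypeScript example on Node.js using AsyncLocalStorage to carry the ID through one request; the header name, log format, and handleRequest are illustrative.

import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Holds the correlation ID for the lifetime of one request.
const requestContext = new AsyncLocalStorage<{ traceId: string }>();

function log(message: string): void {
  const traceId = requestContext.getStore()?.traceId ?? "no-trace";
  console.log(`trace=${traceId} ${message}`);
}

function handleRequest(headers: Record<string, string | undefined>): void {
  // Reuse the incoming X-Request-Id, or issue a fresh one at the entry point.
  const traceId = headers["x-request-id"] ?? randomUUID();
  requestContext.run({ traceId }, () => {
    log("processing started");
    // Downstream calls would forward the same X-Request-Id header here.
    log("notification send failed");
  });
}

handleRequest({ "x-request-id": "abc123" });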

Retry strategy

Transient failures (network blips, momentary external-API errors) often succeed on a retry a few seconds later. But naive retries make things worse; the default is backoff combined with Jitter.

| Method | Substance |
|---|---|
| Fixed interval | Retry every 1s up to N times (simple, but retries concentrate) |
| Exponential backoff | 1 → 2 → 4 → 8s (doubling) |
| With Jitter | Add randomness to the backoff |
| Max attempts | 3-5, then give up (prevents infinite loops) |

Many clients retrying at the same moment cause a “Thundering Herd” that takes down the recovering external service all over again. Jitter scatters the timing; it is the rule in distributed systems. The AWS SDK, Google Cloud SDK, and other major libraries default to jittered exponential backoff.

The “retry + Jitter + max attempts” triple-set is the rule.
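A minimal TypeScript sketch of that triple-set. retryWithBackoff and its default values are illustrative; real code should also check that the caught error is actually transient before retrying.

// Exponential backoff with 0-1s of jitter and a hard attempts cap.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (e) {
      lastError = e; // assume callers pass only retryable (transient) operations
      if (attempt === maxAttempts) break; // out of attempts: give up
      // 1s, 2s, 4s, ... doubled each round, plus jitter to scatter the herd.
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError; // surface the last failure to the caller
}

// usage: const user = await retryWithBackoff(() => fetchUserFromApi(id));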

Idempotency

What must always pair with retry is idempotency: the same request produces the same result no matter how many times it is repeated. Without it, retries cause double charges, double registrations, double shipments.

❌ POST /users retried 3 times due to network failure
   → Same user created 3 times

✅ POST /users + Idempotency-Key: uuid-abc123
   → Second+ requests with same key return the first result

Implementation pattern:

  • UUID issued by the client (Idempotency-Key) included in the request.
  • Server records the key (DB unique constraint / Redis with TTL).
  • Same-key request returns the first result.

Payment APIs such as Stripe support Idempotency-Key out of the box. For your own APIs, the rule is “always, for anything touching money or carrying side effects”: a must-introduce pattern.
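A minimal server-side sketch of this pattern in TypeScript. The in-memory Map stands in for Redis (with TTL) or a DB unique constraint, the function names are illustrative, and real code would need an atomic insert to handle two concurrent requests with the same key.

type StoredResponse = { status: number; body: unknown };

// Stand-in for Redis (with TTL) or a DB table with a unique key constraint.
const processedRequests = new Map<string, StoredResponse>();

async function withIdempotency(
  idempotencyKey: string,
  process: () => Promise<StoredResponse>,
): Promise<StoredResponse> {
  // Second and later requests with the same key return the first result.
  const previous = processedRequests.get(idempotencyKey);
  if (previous) return previous;

  const response = await process();
  processedRequests.set(idempotencyKey, response); // production: store with a ~24h TTL
  return response;
}

// usage: await withIdempotency(requestHeaderKey, () => createUser(payload));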

Circuit Breaker

When an external service is failing, sending more useless calls can drag you down too. Circuit Breaker prevents this — same idea as electrical breakers.

| State | Behavior |
|---|---|
| Closed | Normal; all requests pass |
| Open | Failures crossed the threshold; cut off, return errors immediately |
| Half-Open | After a wait, send a single test request to check recovery |

Calling a downed external service 1,000 times per second with a 5-second timeout exhausts your own thread pool and takes you down with it. With the Circuit Breaker in Open, failures return immediately and you stay alive; on recovery, traffic automatically resumes.

Implementations: Resilience4j / Polly / Istio / Linkerd are typical. Required for production services with external dependencies.
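The table above is nearly an implementation spec. Below is a minimal TypeScript sketch of a Circuit Breaker that, for brevity, counts consecutive failures instead of the error-rate window real libraries use; the class and thresholds are illustrative.

// Minimal circuit breaker: Closed -> Open on failures, Half-Open after a cooldown.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000,
  ) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast"); // no useless call downstream
      }
      this.state = "half-open"; // cooldown elapsed: allow one test request
    }
    try {
      const result = await operation();
      this.state = "closed"; // success: fully recovered
      this.failures = 0;
      return result;
    } catch (e) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open"; // trip (or re-trip) the breaker
        this.openedAt = Date.now();
      }
      throw e;
    }
  }
}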

Timeout and Bulkhead

Beyond the Circuit Breaker, three more patterns are required for any system with external dependencies. Their aim is to limit the blast radius of failures, not to prevent them outright.

| Pattern | Role |
|---|---|
| Timeout | Prevent infinite waits; always set on every external call |
| Bulkhead | Isolate resources (thread pools, etc.) to prevent blocking cascades |
| Rate Limit | Cap requests at N req/sec; protect from overload |

“No timeout set” is the most frequently seen accident cause. HTTP clients and DB connections used with nothing specified will, when the other side stops responding, hold threads forever, and the lag eventually progresses to a full system stop. Bulkhead, named after a ship’s watertight compartments, is the design of separating connection pools per function so that one external service’s lag cannot consume every thread.

Set default timeouts on every external call. Anything left unspecified becomes a hotbed of incidents.
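Both patterns fit in a short sketch. Below is a TypeScript example, assuming the standard fetch / AbortController API for the timeout and a hand-rolled semaphore as the bulkhead; the class, names, and limits are illustrative, not from a specific library.

// Timeout: abort the request if the other side doesn't answer in time.
async function fetchWithTimeout(url: string, timeoutMs = 5000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Bulkhead: a semaphore capping concurrent calls to one external service,
// so its slowness cannot absorb every worker in the process.
class Bulkhead {
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(private readonly limit = 10) {}

  async run<T>(operation: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await operation();
    } finally {
      this.active--;
      this.waiting.shift()?.(); // wake one queued caller to re-check the limit
    }
  }
}

const paymentApiBulkhead = new Bulkhead(10);
// usage: await paymentApiBulkhead.run(() => fetchWithTimeout("https://example.com/pay"));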

Implementation priority by case

Adopting every pattern from the start is over-engineering. Decide gradually, based on system scale and the number of external dependencies.

Personal / internal tools

Exceptions + a global handler + timeouts: that is enough. Circuit Breaker and the like are unnecessary at this scale.

General web service

Above + correlation ID + retry (with Jitter) + idempotency. Required at launch.

Microservices / heavy external API use

Above + Circuit Breaker + Bulkhead + Rate Limit. Adding from outside via service mesh (Istio, etc.) is efficient.

Payment / financial / inventory

Above + strict idempotency + transaction design (Saga / Outbox). Double processing absolutely forbidden.

Timeout / retry numeric gates

Note: industry-standard values as of April 2026. Refresh them periodically.

Running an error strategy on a vague “appropriately” produces production accidents; set concrete numerical baselines at the start. Industry defaults:

| Setting | Recommended | Reason |
|---|---|---|
| HTTP-client timeout | 5s connect / 30s read | Unspecified = time bomb |
| DB-connection timeout | 3s connect / 30s query | DB-failure avalanche prevention |
| Retry max attempts | 3-5 | Infinite loop = attack |
| Exponential-backoff intervals | 1 → 2 → 4 → 8s + Jitter 0-1s | Thundering Herd prevention |
| Circuit Breaker error-rate threshold | 50% (last 10s) | Too sensitive or too dull are both bad |
| Circuit Breaker half-open recovery | 30-60s | Grace period for service recovery |
| Bulkhead parallelism | 10-50 per external service | Resource independence |
| Rate Limit (public API) | 60 req/min/user | Brute-force prevention |
| Idempotency-Key TTL | 24 hours | Window in which retries remain possible |

The AWS SDK, Google Cloud SDK, and Stripe SDK all default to jittered exponential backoff, so leaning on libraries beats DIY. Resilience4j (Java) / Polly (.NET) / tenacity (Python) / resilience (TypeScript) are the 2026 default libraries.

No timeout is a time bomb. Always set one on every external call.

Error-handling traps

Common abnormal-path failure patterns. All of them cause silent failures, double processing, or avalanche stops.

| Forbidden move | Why |
|---|---|
| catch (e) { /* nothing */ } swallowing | Hotbed of silent failures; endless mystery bugs in production |
| catch (Exception e) { log(e) }, treating everything the same | Bugs and business errors mix; alarms fire until they become a formality |
| HTTP client / DB connection without timeout | The same avalanche pattern as the November 2020 AWS us-east-1 Kinesis outage |
| Retry without idempotency keys | Network failures cause double payments, double registrations, double inventory decrements |
| Retry without Jitter | Thundering Herd takes down a recovering external service again |
| Showing users stack traces / SQL errors | Attack clues; potential business-info leakage |
| return null / return -1 for error signaling | Callers don’t notice; null propagation eventually hits a NullPointer |
| Hammering an external API without Circuit Breaker | Thread-pool exhaustion during an external outage drags you into the cascade |
| Correlation ID added later | Cross-service log tracing becomes impossible; install OpenTelemetry from the start |
| No error-type hierarchy (everything is Error) | Business / technical / bug errors mixed; catch fails to function |
| Assuming Circuit Breaker is for large services only | Required for any service hitting external APIs, even personal tools with external dependencies |
| “try/catch wrapped, so it’s fine” complacency | Default trap of AI-generated code; spotting swallowing in review is the human’s job |

The 2012 Knight Capital incident ($440M lost in 45 minutes) began with an error-handling gap: a single server left running old code. Error design added after the fact is never in time.

“try/catch wrapped, so it’s fine” is a default trap of AI-generated code. Spotting swallowing in review is the human’s job.

AI decision axes

| AI-era favorable | AI-era unfavorable |
|---|---|
| Result types, explicit error returns | Implicit throws, missed catches |
| Type-expressed business errors | Generic exceptions with string messages |
| Standard instrumentation like OpenTelemetry | Custom log formats |
| Global error handler unifying processing | Disparate try/catch per layer |
  1. Classify errors by type (bug / input / business / transient / persistent).
  2. Aggregate at boundaries (convert in global handler).
  3. Retry + idempotency + Circuit Breaker triple-set (required for external dependencies).
  4. Bind errors with types (Result / Discriminated Union prevent swallowing).

What you must decide — what’s your project’s answer?

Articulate your project’s answer in 1-2 sentences for each:

  • Exceptions vs Result types policy
  • Error-type hierarchy (business / technical / bug)
  • Correlation ID issuance and propagation (OpenTelemetry, etc.)
  • Retry policy (backoff / Jitter / max attempts)
  • Idempotency implementation (Idempotency-Key holding)
  • Circuit Breaker / Timeout / Rate Limit thresholds
  • User-facing error-message format
  • Log levels (ERROR / WARN / INFO / DEBUG) usage rules

Summary

This article covered error handling — error classification, exceptions vs Result types, retry strategy, idempotency, Circuit Breaker.

Imagination for the abnormal path is human work. Binding AI output with types and standard libraries is the realistic 2026 answer for error design.

This concludes the “Application Architecture” category’s 5 articles. The next category is “Frontend Architecture” — hosting, rendering, state management, SEO, and other frontend design judgments.

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.