Software Architecture

Transaction Design — ACID / Eventual Consistency / Saga / Outbox

About this article

This article is the sixth deep dive in the “Software Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering transaction design.

The mechanism that guarantees “all-succeed-or-all-undo” — the design core that prevents accidents where money disappears or appears out of nowhere, as in bank transfers. The article covers ACID properties, isolation levels, distributed transactions, eventual consistency, the Saga and Outbox patterns, the CAP theorem, and idempotency, with decision axes for choosing the right consistency level per business need.

What keeps “money from disappearing”

Transaction design decides “which data needs what granularity of consistency” in the system. Some data — bank-account balances — needs strict consistency; some — SNS “like counts” — can drift slightly without harm. Applying maximum consistency to everything makes the system too heavy; sorting precision per business need is the design core.

It must be considered together with the overall structure (especially microservices). If a single DB is enough, ACID alone suffices; whether you can settle on a configuration that avoids distributed transactions ties directly to operational cost.

Consider it together with overall structure. If 1 DB is enough, ACID alone is enough.

ACID properties

The classic transaction guarantee RDBMSes have offered for decades is ACID — four properties, named by their initials, that strongly guarantee data integrity.

ACID matters because applications’ biggest losses come from “money / inventory / order data inconsistency.” Once data drifts, root-cause investigation, correction, and customer-facing costs balloon, shaking trust in the entire business.

ACID was established alongside RDB development from the 1970s, designed on the assumption that strong integrity is guaranteed within one DB. It worked perfectly when server and DB were one machine each. In the cloud-era distributed environment, maintaining ACID across multiple DBs and services requires locking all nodes; one network outage stops everything, making it operationally impractical. That’s why eventual consistency and Saga (covered later) emerged.

| Property | Substance |
| --- | --- |
| Atomicity | All-or-nothing; no partial state |
| Consistency | Always fail on constraint violations |
| Isolation | Concurrent execution doesn’t interfere |
| Durability | Committed data survives failures |

Within a single DB, ACID is solidly sufficient — the default for finance, accounting, inventory, and other strong-consistency-required businesses.

Isolation levels

In ACID, the realistic trade-off with performance is “I” (Isolation). The level you pick determines how strictly other transactions are kept apart. Stricter raises consistency but lowers concurrent performance.

| Level | Allowed phenomena | Performance |
| --- | --- | --- |
| READ UNCOMMITTED | See others’ uncommitted values (dirty read) | Fastest |
| READ COMMITTED | Same query may produce different results (non-repeatable read) | Fast |
| REPEATABLE READ | New rows may appear mid-transaction (phantom read) | Mid |
| SERIALIZABLE | All prevented (strictest) | Slow |

PostgreSQL defaults to READ COMMITTED, MySQL (InnoDB) to REPEATABLE READ. Higher isolation costs performance, so strengthen only the parts that need it.
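To make the non-repeatable-read distinction concrete, here is a toy in-memory sketch (not a real MVCC engine; the `ToyDb` class and its methods are illustrative): READ COMMITTED re-reads the latest committed value on every read, while REPEATABLE READ reads from a snapshot frozen at transaction start.

```typescript
// Toy model of two isolation levels (illustration only, not a real database).
type Level = "READ_COMMITTED" | "REPEATABLE_READ";

class ToyDb {
  private committed = new Map<string, number>();

  set(key: string, value: number): void {
    this.committed.set(key, value); // models another transaction committing
  }

  begin(level: Level) {
    // REPEATABLE READ: freeze a snapshot at transaction start.
    const snapshot =
      level === "REPEATABLE_READ" ? new Map(this.committed) : null;
    return {
      read: (key: string): number | undefined =>
        snapshot ? snapshot.get(key) : this.committed.get(key),
    };
  }
}

const db = new ToyDb();
db.set("balance", 100);

const rc = db.begin("READ_COMMITTED");
const rr = db.begin("REPEATABLE_READ");

db.set("balance", 50); // a concurrent transaction commits an update

// READ COMMITTED now sees the new value mid-transaction (non-repeatable read);
// REPEATABLE READ still sees its snapshot.
console.log(rc.read("balance")); // 50
console.log(rr.read("balance")); // 100
```

In a real database you would pick the level with `SET TRANSACTION ISOLATION LEVEL …` per transaction rather than globally, in line with “strengthen only the parts that need it.”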

The difficulty of distributed transactions

In microservices and multi-DB configurations, transactions across multiple DBs sometimes become necessary. Traditionally 2PC (Two-Phase Commit — ask all DBs “prepared?” then commit together) solved this, but it has fatal flaws in the cloud era.

2PC forces a two-phase exchange of “all nodes prepared -> all nodes commit”, so any one node going down blocks everything. In cloud environments where network failures are routine, running 2PC in production frequently stops the whole service — “effectively non-functional” in practice.

The cloud-native principle is to avoid 2PC. Use eventual consistency instead.

Eventual Consistency

Eventual consistency allows a relaxed guarantee: “not consistent immediately, but consistent eventually.” Writes succeed instantly; propagation to other replicas and services happens asynchronously. So users may temporarily see stale data on screen.

Bank accounts can’t drift even momentarily, but “like counts,” “follower counts,” “review counts” on e-commerce sites work fine even if 1 second stale. Treating “data where slight drift is fine” as eventually consistent secures the system’s overall availability and scalability.

Examples: Amazon inventory counts, Twitter follower counts, Instagram likes, YouTube view counts.
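As a toy sketch of the idea (assuming a single primary and one replica, with replication modeled as an explicit step rather than a real async process), eventual consistency can be pictured like this:

```typescript
// Toy primary/replica pair: writes land on the primary immediately,
// and a separate replication step copies them to the replica later.
class ReplicatedCounter {
  private primary = 0;
  private replica = 0;

  increment(): void {
    this.primary += 1; // the write succeeds instantly on the primary
  }

  readReplica(): number {
    return this.replica; // readers may see a stale value
  }

  replicate(): void {
    this.replica = this.primary; // async propagation, modeled explicitly
  }
}

const likes = new ReplicatedCounter();
likes.increment();
likes.increment();

console.log(likes.readReplica()); // 0 — stale, but harmless for a like count
likes.replicate();
console.log(likes.readReplica()); // 2 — "eventually" consistent
```

The window between `increment` and `replicate` is exactly the “1 second stale” period users may observe; the design question is only whether the business tolerates it.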

Per data type, judge from business requirements whether strong consistency is required or eventual is sufficient.

Saga pattern

Saga is a pattern that implements distributed transactions as “a chain of small local transactions + compensation.” In a hotel-booking flow, “reservation creation,” “payment,” “inventory reservation,” “notification” run on different services; failure mid-flow triggers compensation to undo earlier steps.

sequenceDiagram
    participant U as User
    participant R as Reservation svc
    participant P as Payment svc
    participant I as Inventory svc
    participant N as Notification svc
    U->>R: Hotel booking
    R->>P: Charge
    P-->>R: Charge success
    R->>I: Reserve inventory
    I-->>R: Inventory NG
    Note over R,I: Failure -> compensate
    R->>P: Compensate: refund
    P-->>R: Refund done
    R-->>U: Booking failed (out of stock)

Unlike 2PC, this doesn’t lock all services; each step completes locally, so it works in cloud environments. But incomplete compensation design risks leaving half-done state.

Pseudocode for orchestration-style Saga: each step completes in its own local TX; on failure, compensation runs in reverse order.

async function bookHotel(input: BookingInput) {
  const completed: Array<() => Promise<void>> = [];
  try {
    const reservation = await reservationSvc.create(input);
    completed.push(() => reservationSvc.cancel(reservation.id));

    const payment = await paymentSvc.charge(input.amount);
    completed.push(() => paymentSvc.refund(payment.id));

    await inventorySvc.reserve(input.roomId);
    completed.push(() => inventorySvc.release(input.roomId));

    await notificationSvc.notify(input.userId);
    return { ok: true, reservationId: reservation.id };
  } catch (err) {
    // Compensate in reverse order (idempotent assumed)
    for (const compensate of completed.reverse()) {
      await compensate().catch(logCompensationFailure);
    }
    throw err;
  }
}

Compensation must be idempotent. If retries running double don’t produce stable results, the saga itself becomes a new source of consistency incidents. In production, combine with the Outbox pattern to reliably synchronize message sending and DB updates — the standard approach.

Two Saga forms

Saga has two implementation styles, differing in where flow control sits.

| Style | Trait | Pros | Cons |
| --- | --- | --- | --- |
| Orchestration | Central orchestrator controls order | Visible flow, easy debugging | Central-aggregation risk |
| Choreography | Event-driven, services act autonomously | Loose coupling, scales | Hard to grasp the whole |

When business is complex and order matters, Orchestration. When simple and services are independent, Choreography.

Default is starting from Orchestration. Systems where the whole isn’t visible become operational hell.

Outbox pattern

The Outbox pattern reconciles “DB write” and “message-queue send” by recording the outgoing message in the same DB transaction as the business data. It prevents the recurring microservices trouble of “DB succeeded but the Kafka send failed” (Kafka: distributed message-queue infrastructure) — the standard pattern for this problem.

  1. Record the message-to-send into the outbox table together with business data.
  2. A separate process (Relay) reads outbox and sends to the message queue.
  3. Update the sent flag.

This prevents the accident of only one of “DB commit” or “send” succeeding. For event-driven microservice architecture, near-mandatory.
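The three steps above can be sketched with in-memory stand-ins for the DB tables and the queue (the names `OutboxRow`, `placeOrder`, and `relay` are illustrative; atomicity of step 1 is modeled by writing both rows in one function call):

```typescript
// In-memory sketch of the Outbox pattern. The "transaction" is modeled by
// writing the business row and the outbox row together in one atomic step.
interface OutboxRow { id: number; payload: string; sent: boolean; }

const orders: string[] = [];
const outbox: OutboxRow[] = [];
const queue: string[] = []; // stands in for Kafka or another message queue

let nextId = 1;

// Step 1: business data and the outgoing message are written together.
function placeOrder(item: string): void {
  orders.push(item);
  outbox.push({ id: nextId++, payload: `ordered:${item}`, sent: false });
}

// Steps 2-3: a separate relay process reads unsent rows, publishes them,
// and marks them sent. Safe to re-run: already-sent rows are skipped.
function relay(): void {
  for (const row of outbox) {
    if (!row.sent) {
      queue.push(row.payload); // send to the message queue
      row.sent = true;         // update the sent flag
    }
  }
}

placeOrder("book");
relay();
console.log(queue); // ["ordered:book"]
```

Because the relay re-reads unsent rows, a crash between sending and flag update can cause a duplicate send — which is why consumers must be idempotent (at-least-once delivery).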

Saga + Outbox is the default; near-mandatory for microservices.

CAP theorem

The foundation of distributed-system architecture is the CAP theorem: Consistency, Availability, and Partition tolerance can’t be simultaneously satisfied — that’s the constraint.

Network partitions happen in reality, so “P” is essentially required. Practical choice becomes “CP (consistency-priority)” vs “AP (availability-priority).” Business requirements determine whether “err out but be accurate” or “keep running with stale data.”

| Choice | Representative systems | Business |
| --- | --- | --- |
| CP (consistency) | Traditional RDB, MongoDB (configurable) | Banking, payments, inventory |
| AP (availability) | DynamoDB, Cassandra, Redis | SNS, e-commerce browsing, analytics |

Reverse from business requirements. Picking distributed DBs without CAP awareness produces unexpected behavior.

Idempotency

In distributed systems, “the same request arriving multiple times” due to network failures and timeouts is routine. Idempotency is the property that guarantees the result is the same no matter how many times the request is executed.

If a payment API is not idempotent, retries cause double charges. Idempotency is typically achieved by the client attaching a request ID and the server deduplicating by that ID — the default approach. By HTTP method, GET / PUT / DELETE are defined as idempotent by the spec; POST is not, requiring care.

In microservices, event-driven, and retry processing, idempotency is required; bolting it on later is hard, so design from the start.
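A minimal sketch of server-side deduplication by request ID (the `Map`-based store and the names are illustrative; production systems persist keys durably, usually with a TTL):

```typescript
// Server-side idempotency: the first call with a given key executes the
// charge; repeats return the cached result instead of charging again.
const results = new Map<string, string>();
let chargeCount = 0;

function chargeOnce(idempotencyKey: string, amount: number): string {
  const cached = results.get(idempotencyKey);
  if (cached !== undefined) return cached; // duplicate request: no new charge

  chargeCount += 1; // the actual side effect happens exactly once per key
  const receipt = `charged ${amount} (receipt #${chargeCount})`;
  results.set(idempotencyKey, receipt);
  return receipt;
}

const first = chargeOnce("req-123", 500);
const retry = chargeOnce("req-123", 500); // e.g. a client retry after timeout

console.log(first === retry); // true — same result, single charge
console.log(chargeCount);     // 1
```

The key must be generated by the client (one key per logical operation, reused across retries); if the server generates it per request, deduplication is impossible.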

Decision criteria: choosing consistency level

The core of transaction design is identifying the consistency level needed per business. Demanding strong consistency for all data crashes performance; making everything eventual breaks the business.

The decision axis: “if this data is even briefly inconsistent, will there be financial, legal, or reputational damage?” Money-moving processing, inventory allocation, records subject to taxation or audit can’t tolerate even one cent / one unit drift, so strong consistency is mandatory; compromising here shakes the business itself.

Conversely, “data that works business-wise even with slight drift” like like counts, view counts, recommendations, log aggregations is fine with eventual consistency; applying strong consistency would sacrifice performance and scalability. Asking the business “how would it hurt if this data were a few seconds stale?” and reasoning back is most reliable.

| Business | Required consistency | Reason |
| --- | --- | --- |
| Bank transfers / payments | Strong (ACID) | Drift = financial damage |
| Inventory | Strong or Saga | Overselling = refunds and trust loss |
| Order history | Strong | Tax / audit requirements |
| Likes, view counts | Eventual | Slight drift is harmless |
| Recommendations | Eventual | Stale is OK if working |
| Logs / analytics | Eventual | Scale over real-time |

Strong consistency is a high-cost requirement; limit it to where it’s truly needed.

Data-type × consistency-level ladder

Note: industry norms as of April 2026. Periodic refresh required.

“All strong consistency” is excess and “all eventual” is dangerous; sorting consistency level per data type is the operational core. Typical industry split:

| Data type | Consistency level | Reason | Implementation |
| --- | --- | --- | --- |
| Bank balance / payments | Strong (ACID, SERIALIZABLE) | 1-cent drift = damage | 1-DB ACID |
| Orders / inventory allocation | Strong | Overselling = refunds and trust loss | 1-DB ACID or Saga |
| Member registration / auth | Strong | Direct security risk | 1-DB ACID |
| Tax / audit records | Strong + tamper protection | Legal requirement | ACID + WORM |
| Cart info | Mid (READ COMMITTED) | Temporary drift OK | 1-DB TX |
| Likes / view counts | Eventual | Seconds-stale OK | Async aggregation |
| Recommendations | Eventual | Stale is OK if working | Batch recompute |
| Logs / analytics | Eventual | No real-time need | Kafka + DWH |

The isolation-level rule of thumb: use PostgreSQL’s default READ COMMITTED as the baseline, elevating only inventory allocation and financial transactions individually to SERIALIZABLE. Applying SERIALIZABLE to all tables drops concurrent performance several-fold; limit it to the necessary parts.

Strong consistency is high-cost; limit it to genuinely necessary business.

Distributed-transaction traps

Common ways microservice-crossing / multi-DB-crossing transactions fail. All produce production data inconsistency.

| Forbidden move | Why |
| --- | --- |
| 2PC (Two-Phase Commit) in cloud production | Network failure blocks all services; effectively nonfunctional |
| Retries without idempotency keys | Network failures cause double payments, double inventory decrements, multiple notifications |
| Saga without compensation design | Half-done state (charged but no inventory) remains; manual repair required |
| DB write and message-queue send in separate TX | Without Outbox, “DB success / Kafka send fail” breaks consistency |
| SERIALIZABLE on all data | Concurrent performance drops several-fold; limit to necessary places |
| Concurrent updates without optimistic lock | Lost-update accident; one update vanishes |
| Treating timeout = failure and retrying | Server might have succeeded; idempotency key required |
| Reading from DB replica then immediately writing | Replication lag (ms-seconds) overwrites with stale value |
| DIY distributed transactions | Building without knowing Saga / Outbox always breaks; lean on libraries (Temporal, etc.) |

Strictly speaking, the Knight Capital 2012 incident wasn’t a distributed-TX accident, but the scenario of “one machine left with old code + retry runaway” producing 45 minutes / $440M loss / company dissolution shows the horror of neglecting idempotency and consistency design (full details in Appendix: Major Incident Catalog).

“Distributed + retry + no idempotency” is the triple-landmine set for double processing.

The AI-era lens

Even with AI-driven development as the assumption, the fundamental difficulty of distributed transactions isn’t solved by AI. If anything, as AI generates massive code in short time, the need for humans to carefully verify “are transactions correctly bracketing the work?” rises.

AI tends to generate “surface-correct-looking code” while missing consistency errors crossing DB boundaries, only surfacing as data inconsistency in production.

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| Complete in 1-DB ACID | Hand-written Saga across services |
| Standard patterns like Outbox | Custom eventual-consistency implementations |
| Trust DB-vendor features (Aurora Global, etc.) | Distributed TX written at app layer |
| Schema and constraints expressed in code | DB triggers and hidden consistency rules |

The AI-era iron rule: “lean on simple design; don’t exceed what AI can write or read.” Sticking with modular monolith + 1 DB ACID is overwhelmingly easier to operate in the AI era than building distributed TX across microservices.

In the AI era, “design that avoids distributed TX” is the smart choice; 1-DB-complete simplicity has value.

Common misreadings

  • “ACID means it’s safe” -> ACID applies only within 1 DB. Across multiple services, different design (Saga + Outbox) is required.
  • “SERIALIZABLE is safest” -> Too-strict isolation crushes concurrent performance. Strengthen only where needed.
  • “Eventual consistency is engineer laziness” -> It’s a choice matched to business characteristics, not laziness. Applying strong consistency to like counts is the sign of immature design.
  • “Retries fix everything” -> Retries without idempotency are the direct cause of double processing. Design idempotency keys before adding retries.

“The day we treated timeout = failure” (industry case)

A project added 3-retry to the client side as timeout protection for the payment API; the result was duplicate charges to a small number of users. Server-side processing had succeeded; only the response was delayed, and all retries went through.

Few developers haven’t watched a similar failure happen nearby. Retry implementation tends to come in optimistically, and the “probably failed, let’s resend” mindset is the textbook path to production double-charges.

Without an idempotency key, “timeout = failure” is almost never a safe assumption. The “don’t know if it succeeded” state across the network is routine in distributed systems. Design idempotency keys before adding retries — that’s the rule.
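The scenario above can be sketched as follows (all names are illustrative; the “lost response” is simulated by having the client ignore the first reply): the server succeeds on the first attempt, the client retries, and because one idempotency key is reused across attempts, the charge still happens exactly once.

```typescript
// Simulated flaky network: the first response is "lost", so the client
// retries. One idempotency key across all attempts prevents double charges.
const seen = new Map<string, string>();
let charges = 0;

function serverCharge(key: string): string {
  const prior = seen.get(key);
  if (prior !== undefined) return prior; // duplicate: return the same receipt
  charges += 1;                          // the real side effect, once per key
  const receipt = `ok-${charges}`;
  seen.set(key, receipt);
  return receipt;
}

function clientChargeWithRetry(key: string, maxAttempts: number): string {
  let lastResult = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastResult = serverCharge(key);       // the server may succeed even if...
    const responseLost = attempt === 1;   // ...the first response times out
    if (!responseLost) return lastResult; // only later responses arrive
  }
  return lastResult;
}

const receipt = clientChargeWithRetry("payment-42", 3);
console.log(receipt); // "ok-1"
console.log(charges); // 1 — retried, but charged exactly once
```

Generate the key once per logical payment, before the first attempt, and reuse it verbatim on every retry; a fresh key per attempt would defeat the deduplication entirely.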

Distributed + retry + no idempotency = double-processing landmine. Triple-set is the safety condition.

What you must decide — what’s your project’s answer?

Articulate your project’s answer in 1-2 sentences for each:

  • Per-data consistency requirement (strong / eventual)
  • Isolation level (the lowest line business tolerates)
  • Distributed-transaction handling (Saga / Outbox / avoid)
  • Retry policy and idempotency
  • Locking strategy (optimistic / pessimistic)
  • CAP choice (CP / AP)

Common failure patterns

  • Demanding strong consistency for all data, performance crashes -> Setting SERIALIZABLE “to be safe” crushes concurrent performance.
  • Solving distributed TX with 2PC -> Effectively nonfunctional in cloud. Use Saga + Outbox.
  • Incomplete compensation, half-done state remains -> In Saga, compensation design is critical. Treat it as seriously as the success path.
  • Multi-execution from retry without idempotency -> Network failure causes double payments, inventory double-decrements, multiple notifications.

Recognize “consistency is a high-cost requirement.” Apply strong consistency only where genuinely needed.

How to make the final call

Transaction design’s core is reasoning back from “how much drift does the business tolerate”; never decide by technical preference. Treating bank-transfer-strict business (no 1-cent drift) and SNS like-count business (1-second stale fine) at the same intensity makes the system too heavy and breaks down.

The selection core is “sort consistency levels per data.” Strong consistency is high-cost; limit it to genuinely necessary business. Misjudging here produces either performance crash or business breakdown.

The modern default is “design avoiding distributed transactions.” 2PC is essentially nonfunctional in cloud, and microservice-crossing transactions can only be built with Saga + Outbox — at very high design cost.

In AI-driven development, “surface-correct but consistency-broken code” is easy to generate; the more distributed the TX, the higher the incident risk. If 1-DB ACID can complete things, that’s strongest; modular monolith + 1 DB is overwhelmingly easier to operate in the AI era. When distributed is unavoidable, build the Saga + Outbox + idempotency triple-set correctly.

Selection priority:

  1. Reverse from business requirements (limit strong consistency to genuinely needed parts).
  2. Prioritize 1-DB completion (distributed TX is high-cost).
  3. If distributed is needed, Saga + Outbox (avoid 2PC).
  4. Idempotency as a hard requirement (bolting on later is hard).

“Consistency is a high-cost requirement.” Concentrate it where needed; keep the rest light with eventual.

Summary

This article covered transaction design: ACID, isolation levels, Saga, Outbox, the CAP theorem, and idempotency.

Sort consistency level per data type, prioritize 1-DB completion, and use the Saga + Outbox + idempotency triple-set when distribution is needed. That is the realistic 2026 answer, AI era included.

The next article is the Software Architecture category’s final installment: authentication and sessions (server session / JWT / OAuth).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.