Data Architecture

[Data Architecture] Streaming - Question Whether It's Truly Needed First

About this article

As the sixth installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains streaming.

Question the “real-time” requirement and 90% of the time the answer turns out to be “a batch with 5 minutes of delay is enough.” This article covers streaming-platform selection (Kafka / Kinesis / Pub/Sub / Flink / ksqlDB), Exactly-Once guarantees, window processing, and decision criteria - presenting the practical iron rule of first questioning whether real-time is truly needed.

What streaming handles

At the core sits a message queue (Kafka, Kinesis, Pub/Sub) that retains events as a persistent log and delivers them to multiple consumers simultaneously. The defining trait is that the data never stops - it keeps flowing - a fundamentally different worldview from batch.

Choose streaming only when truly needed. Operational cost is 10x batch.

Why it’s needed

1. Sub-second delay directly drives business value

Fraud detection, inventory sync, price sync, IoT control - for these, “a few minutes later” is too late. The path from event occurrence to decision must complete in under a second.

2. Batch windows are disappearing

In 24/7 global operations, there is no quiet window left for an “overnight batch.” Constantly flowing data has to be processed as it arrives.

3. Loose coupling between microservices

In microservices architectures, connecting services via events is mainstream. The streaming platform functions as the central nervous system of inter-service communication.

Batch vs streaming

| | Batch processing | Streaming |
| --- | --- | --- |
| Processing unit | Bundled data | One to a few events |
| Delay | Hours to days | Milliseconds to seconds |
| Implementation | Relatively easy | Hard, heavy to operate |
| Cost | Cheap | Expensive |
| Retry | Easy to redo | Hard to design |
| Representative tech | Spark, dbt | Kafka, Flink |

Most business requirements are fine with batch; situations where streaming is truly necessary are limited. If “looks roughly real-time” is enough, a 15-minute microbatch is often a workable substitute.

Main components

A streaming platform splits into “the layer that carries events” and “the layer that processes events.” The former is the message queue (Kafka, etc.), the latter is the stream-processing engine (Flink, etc.) - roles differ, so select them separately.

flowchart LR
    PROD[Producer<br/>business systems/IoT]
    MQ["Message queue<br/>(carrier layer)<br/>Kafka/Kinesis/Pub-Sub"]
    PROC["Processing engine<br/>(aggregate/transform/join)<br/>Flink/ksqlDB"]
    SINK1[(DWH/<br/>BigQuery)]
    SINK2[(Search<br/>Elasticsearch)]
    SINK3[Realtime<br/>dashboard]
    PROD -->|Event| MQ
    MQ --> PROC
    PROC --> SINK1
    PROC --> SINK2
    PROC --> SINK3
    MQ -. direct sink .-> SINK1
    classDef prod fill:#fef3c7,stroke:#d97706;
    classDef mq fill:#dbeafe,stroke:#2563eb,stroke-width:2px;
    classDef proc fill:#fae8ff,stroke:#a21caf;
    classDef sink fill:#dcfce7,stroke:#16a34a;
    class PROD prod;
    class MQ mq;
    class PROC proc;
    class SINK1,SINK2,SINK3 sink;

| Layer | Role | Representatives |
| --- | --- | --- |
| Message queue | Persist and deliver events | Kafka, Kinesis, Pub/Sub |
| Stream-processing engine | Aggregate, transform, join | Flink, ksqlDB, Spark Streaming |
| Schema management | Define message types | Schema Registry, Protobuf |

Apache Kafka (industry-standard message queue)

Apache Kafka is OSS that originated at LinkedIn and is the de facto standard for streaming platforms. Its hallmarks are high throughput (millions of events per second), a design that persists events as a log, and a mechanism that lets multiple consumers read independently - it is adopted by large-scale companies worldwide such as Netflix, Uber, and LINE. Confluent Platform (commercial) and Confluent Cloud (managed) are also options.

The strength is high performance and scalability, but the price is an extremely heavy operational load. ZooKeeper management (KRaft these days), broker and partition design, consumer-group coordination - serious use is hard without a dedicated ops team.

| Pros | Cons |
| --- | --- |
| Overwhelming performance and track record | Heavy operational load |
| OSS with thin vendor lock-in | High learning cost |
| Rich ecosystem (Connect, Streams, etc.) | Excessive at small scale |
| Low latency (millisecond order) | Difficult cluster design |
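
To make the “multiple consumers read independently” point concrete, below is a minimal sketch using the confluent-kafka Python client. The broker address, topic name, and consumer group id are illustrative assumptions.

```python
from confluent_kafka import Producer, Consumer

# Producer: append an event to the "orders" topic (a persistent log)
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("orders", key="order-123", value='{"amount": 4200}')
producer.flush()  # block until the broker acknowledges the write

# Consumer: each group.id keeps its own offset in the log, so a
# fraud-detection group and an analytics group read the same events
# independently, without taking them away from each other.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-detection",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```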

If you can self-operate Kafka, it is the strongest option; if not, consider managed services (Kinesis / Pub/Sub / Confluent Cloud).

Managed queues (Kinesis / Pub/Sub / Event Hubs)

Cloud-vendor-provided Kafka alternatives. The cloud handles operations, eliminating worries about scaling, availability, and backup - the biggest draw is that even small teams can run a streaming platform. The standard choices are Kinesis on AWS, Pub/Sub on GCP, and Event Hubs on Azure.

| Pros | Cons |
| --- | --- |
| Near-zero operations | Cloud lock-in |
| Easy to start small | Can be more expensive at large scale |
| Easy integration with other managed services | Fine-grained tuning is hard |
| Cloud handles failures | Kafka-specific features unavailable |

Representatives: Amazon Kinesis Data Streams, Google Pub/Sub, Azure Event Hubs, Confluent Cloud
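
As a feel for how little code a managed queue needs, here is a minimal sketch publishing to Kinesis with boto3; the stream name, region, and payload are illustrative assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="ap-northeast-1")

# PartitionKey determines the shard; events sharing a key keep their order.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": "u-42", "event": "page_view"}).encode(),
    PartitionKey="u-42",
)
```

Shards, replication, and failover are the cloud’s problem; the application side is one call.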

The modern rule: managed first, migrate to Kafka if you hit throughput limits.

Apache Flink (serious stream processing)

Apache Flink is OSS specialized in stateful stream processing, executing complex aggregations, joins, and event-time processing at millisecond latency. It is used by Uber, Alibaba, and Stripe at the scale of tens of billions of events per day - the serious option, implementing Exactly-Once with the highest reliability.

On the other hand, its operational difficulty exceeds Kafka’s - checkpoint design, state-backend selection, job-restart management - and the learning cost is very high. Managed versions exist, such as Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) and Alibaba Cloud Realtime Compute, and adopting Flink through these to lower the operational load is realistic.

| Pros | Cons |
| --- | --- |
| Low latency, high throughput | High learning cost |
| Flexible enough for complex processing | High operational difficulty |
| Robust Exactly-Once | Excessive at small scale |
| Strong event-time processing | Java/Scala primary (Python also available) |
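
As a taste of Flink’s event-time processing, here is a minimal PyFlink Table API sketch. The topic, field names, and connector options are illustrative assumptions, and the Kafka connector JAR must be available to the job.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table over a Kafka topic, with a watermark that tolerates
# events arriving up to 5 seconds late (event time, not arrival time).
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-orders',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Stateful aggregation: per-minute order totals over event time.
t_env.execute_sql("""
    SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           SUM(amount) AS total
    FROM orders
    GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE)
""").print()
```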

ksqlDB / Kafka Streams (SQL and Java library)

Lightweight processing engines specific to Kafka. ksqlDB is a product that lets you work with Kafka through SQL, expressing aggregation and filtering without writing a full Flink-class job. Kafka Streams is a library, with the appeal that it can be embedded in applications to write stream processing inside the app itself.

Both presuppose Kafka and cannot be used with other queues (Kinesis, etc.). They are effective when the use case can be completed in SQL, or when you want to embed processing in an existing Java app. They cannot handle processing as complex as Flink’s, but the appeal is a learning cost an order of magnitude lower.
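
For comparison with the Flink sketch above, here is roughly the same kind of aggregation expressed through ksqlDB’s REST API; the endpoint, topic, and stream names are illustrative assumptions.

```python
import requests

KSQLDB = "http://localhost:8088"

# Declare a stream over an existing Kafka topic, then a continuously
# maintained per-minute aggregate -- all in SQL, no Flink job to operate.
statements = """
    CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
        WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
    CREATE TABLE views_per_minute AS
        SELECT url, COUNT(*) AS views
        FROM pageviews
        WINDOW TUMBLING (SIZE 1 MINUTE)
        GROUP BY url
        EMIT CHANGES;
"""
resp = requests.post(f"{KSQLDB}/ksql",
                     json={"ksql": statements, "streamsProperties": {}})
resp.raise_for_status()
```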

For workloads that can be completed with Kafka + SQL, ksqlDB is the shortest route. Migrate to Flink when complexity grows.

Typical composition example

A typical streaming-platform composition is shown below. From event source to BI/DB, everything flows in real time - the decisive difference from batch.

flowchart TB
    SRC([Web / mobile / IoT / business systems])
    BUS["Kafka / Kinesis / Pub/Sub"]
    RT["Real-time processing<br/>Flink, ksqlDB"]
    DWH["DWH ingestion<br/>Fivetran / Snowpipe"]
    ACT["Immediate action<br/>notify/block"]
    BI["Analysis / BI"]
    SRC -->|event publish| BUS
    BUS --> RT --> ACT
    BUS --> DWH --> BI
    classDef src fill:#fef3c7,stroke:#d97706;
    classDef bus fill:#dbeafe,stroke:#2563eb,stroke-width:2px;
    classDef rt fill:#fae8ff,stroke:#a21caf;
    classDef bi fill:#dcfce7,stroke:#16a34a;
    class SRC src;
    class BUS bus;
    class RT,ACT rt;
    class DWH,BI bi;

The general pattern is a two-line split: real-time processing bound for immediate action on one side, ingestion bound for analytics on the other.

The difficulty of Exactly-Once

The most troublesome thing in streaming is realizing the guarantee that each message is processed exactly once (Exactly-Once). Network failures, restarts, and timeouts easily cause double processing or message loss. In operations like bank transfers, payments, or inventory updates, a duplicate is critical.

Kafka and Flink support Exactly-Once, but end-to-end guarantees require design: unless the consumer side is also designed to be idempotent (processing the same message twice leaves the same result as processing it once), the guarantee is meaningless.

| Guarantee level | Meaning | Difficulty |
| --- | --- | --- |
| At-Most-Once | Give up on failure (loss possible) | Easy |
| At-Least-Once | Reliably delivered, duplicates possible | Medium |
| Exactly-Once | Strictly once | Hard |
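
What consumer-side idempotency looks like in practice: deduplicate on a unique event id before applying side effects. A minimal sketch follows; the table layout and the charge_customer function are hypothetical placeholders.

```python
import json
import sqlite3

db = sqlite3.connect("payments.db")
db.execute("""CREATE TABLE IF NOT EXISTS processed_events (
                  event_id TEXT PRIMARY KEY,
                  amount   REAL)""")

def charge_customer(event: dict) -> None:
    ...  # placeholder: call the payment API here

def handle(raw: bytes) -> None:
    event = json.loads(raw)
    # INSERT OR IGNORE makes redelivery harmless: a second attempt with the
    # same event_id changes nothing, so At-Least-Once delivery becomes safe.
    cur = db.execute(
        "INSERT OR IGNORE INTO processed_events (event_id, amount) VALUES (?, ?)",
        (event["event_id"], event["amount"]),
    )
    if cur.rowcount == 1:  # first time we have seen this event_id
        charge_customer(event)
    db.commit()
```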

To avoid double processing, the royal road is to design the consumer side idempotent. Exactly-Once is the shield, idempotency is the spear.

Window processing

Streaming sees frequent time-bucketed aggregation (window processing), such as “sales in the last 5 minutes” or “errors per hour.” What is an easy aggregation in batch becomes, in an unending stream, a design question of where to cut.

| Window type | Content | Example |
| --- | --- | --- |
| Tumbling | Fixed length, no overlap | 0-5 min, 5-10 min |
| Sliding | Fixed length, slides forward | Last 5 min, updated every 1 min |
| Session | Until activity breaks off | One user’s visit session |
| Global | All time | Cumulative count |

Additionally, distinguishing event time (occurrence time) from processing time (arrival time) matters - network delays shuffle arrival order, and how to handle late-arriving events becomes a design point.
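
Frameworks like Flink automate windowing (see the PyFlink sketch earlier), but the mechanics are easy to see in plain Python. Below is a minimal sketch of an event-time tumbling window with a lateness tolerance; the window size and field names are illustrative assumptions.

```python
from collections import defaultdict

WINDOW = 300            # 5-minute tumbling windows, in seconds
ALLOWED_LATENESS = 60   # accept events up to 60s behind the watermark

windows: dict[int, float] = defaultdict(float)
watermark = 0.0  # highest event time observed so far

def on_event(event_time: float, amount: float) -> None:
    global watermark
    watermark = max(watermark, event_time)
    if event_time < watermark - ALLOWED_LATENESS:
        return  # beyond tolerance; a real pipeline would route this to a side output
    bucket = int(event_time // WINDOW) * WINDOW  # bucket by occurrence time,
    windows[bucket] += amount                    # not by arrival time

on_event(1000.0, 42.0)
on_event(998.0, 10.0)  # arrives later but within tolerance: still counted
```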

Decision criteria

1. Is real-time truly needed

When considering streaming, the first thing to ask is whether real-time is truly needed. Requirements interviews often reveal that “a few minutes of delay is fine,” which lets batch operate cheaply and stably.

| Requirement | Recommended |
| --- | --- |
| Days of delay OK | Daily batch |
| Hours of delay OK | Hourly batch |
| 5-15 min delay OK | Microbatch |
| Seconds to 1 min OK | Lightweight streaming |
| Sub-100ms required | Serious streaming |

Businesses where sub-100ms is truly needed are limited - “payments, fraud detection, ad bidding, IoT control, exchanges.”

2. Operational regime

Streaming runs 24/7 continuously, so the monitoring and incident-response load is an order of magnitude heavier than batch. Without an ops team, a single incident can cost hours of data.

| Operational regime | Recommended |
| --- | --- |
| No dedicated SRE (Site Reliability Engineering) / data engineers | Don’t choose streaming |
| Can use managed services | Kinesis, Pub/Sub |
| Can self-operate Kafka | Kafka + Flink |
| 24/7 ops team available | Serious streaming platform |

3. Data volume and budget

Streaming-platform pricing scales with data volume. Kafka brokers, Kinesis shards, Pub/Sub message counts - left unattended, monthly bills hit millions of yen.

| Data volume | Composition image | Monthly target |
| --- | --- | --- |
| ~1M events/day | Pub/Sub + Cloud Functions | Tens of thousands of yen |
| ~100M events/day | Kinesis + Lambda | Hundreds of thousands of yen |
| Tens of billions/day | Kafka + Flink, self-operated | Millions of yen and up |

How to choose by case

Small / AWS / SaaS startup

Kinesis Data Streams + Lambda. Managed, near-zero operations. Runs from tens of thousands of yen monthly.

Mid-size / GCP / data-analytics-focused

Pub/Sub + Dataflow. Beam-based with low learning cost. Excellent integration with BigQuery.

Large / can self-operate

Kafka + Flink. The strongest combination, but it requires an ops team - ideally 2-3 dedicated SREs.

Want SQL only

ksqlDB + Kafka or Materialize. SQL-writing analysts can operate it.

Don’t really need immediacy

Microbatch (dbt + 15-min schedule). Avoid streaming operations while still looking close to real time.

Author’s note - the real meaning of “we want real-time”

There is a story often told about a project where the customer said “we want a real-time dashboard,” the team spent 3 months building a Kafka + Flink composition, and only later re-interviewing revealed that the actual business requirement was “30 minutes of delay is fine.” For a job that cron on a 15-minute schedule would have covered, the team then spent the next year chased by midnight incident response - a typical case, always told with that punchline.

Another famous one is the November 2020 large-scale AWS Kinesis outage. In the US-East region, Kinesis Data Streams went down for hours, dragging down even AWS’s own management console and CloudWatch - an event widely cited as the lesson that the moment you depend on a streaming platform, its outage stops all your business. A real-time platform can become not just a convenient tool but a new single point of failure.

I myself have developed the habit, when a customer says “we want to see numbers in real time,” of first asking back “is a few seconds of delay a problem?” or “what about 5 minutes?” - because I have been burned by this kind of project before. Both stories show that straying from the basic rule of “streaming only when truly needed” makes operational load and availability risk rebound at the same time. If microbatch suffices, it is the safest and cheapest - that is the practical conclusion.

When told “real-time,” break it down by numbers first. 5-min and 100ms delays are different worlds.

Phased decision matrix for freshness x streaming adoption

When told “real-time,” first break it down by numbers - that’s practice. The optimum differs per freshness requirement.

| Freshness requirement | Adopted tech | Monthly ops cost | Required SREs |
| --- | --- | --- | --- |
| Daily (yesterday’s numbers by morning) | Daily batch (dbt) | Thousands of yen | 0 (concurrent role) |
| Hours of delay OK | Hourly batch | Tens of thousands of yen | 0 (concurrent role) |
| 5-15 min delay OK | Microbatch (15-min dbt) | Tens of thousands of yen | 0 |
| 1 min to a few seconds | Lightweight streaming (Pub/Sub + Lambda) | Hundreds of thousands of yen | 1 |
| Sub-100ms required | Serious streaming (Kafka + Flink) | Millions of yen and up | 2-3 dedicated |

“The substantive lower bound for adopting streaming is 2+ dedicated SREs.” Adopt below that and the team melts under 24/7 incident response, window design, and Exactly-Once operations. The empirical rule: 90% of business requirements are fine with daily batch or microbatch. To “look real-time-ish,” microbatch is enough.

Real-time is 10x batch’s operational cost. Limit it to truly necessary scenes.

Streaming-operation pitfalls and forbidden moves

Here are the typical accidents in streaming. All of them lead directly to data loss, double processing, or full service stoppage.

| Forbidden move | Why it’s bad |
| --- | --- |
| Adopting streaming the moment a customer says “we want real-time” | Many are actually fine with 30-min delay. Hear out the requirements first |
| At-Least-Once without idempotency | Network failures cause double payments and double inventory deductions |
| Self-operating Kafka without dedicated SREs | ZooKeeper/KRaft management and partition design melt the team |
| Not distinguishing event time from processing time | Late-arriving events break aggregations. Window design must be strict |
| Operating Kafka with free-form JSON (no schema) | Consumers keep breaking. Protobuf/Avro + Schema Registry is required |
| Letting all business depend on the streaming platform | A SPOF, like the November 2020 AWS Kinesis outage |
| Trusting Exactly-Once entirely | Meaningless without consumer-side idempotency. You need both shield and spear |
| Operating without a DLQ (Dead Letter Queue) | Failed messages retry forever and clog the pipeline (see the sketch below) |
| Monitoring at batch granularity | Streaming needs second-level monitoring. Watch message lag and consumer lag constantly |
| No preparation for traffic surges | Kafka partition shortages make processing delays snowball |
| Schema changes without forward compatibility | Old-version consumers all die. Follow Avro/Protobuf compatibility rules |
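
The DLQ row above, as a minimal sketch with confluent-kafka: park poison messages on a dead-letter topic instead of retrying them forever. The topic names and the process function are hypothetical.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-worker",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
dlq = Producer({"bootstrap.servers": "localhost:9092"})

def process(event: dict) -> None:
    ...  # placeholder: real business logic

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    try:
        process(json.loads(msg.value()))
    except Exception:
        # Park the failed message (key preserved) for later inspection,
        # so one poison message cannot clog the whole partition.
        dlq.produce("orders.dlq", key=msg.key(), value=msg.value())
        dlq.flush()
```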

The November 25, 2020 large-scale AWS Kinesis outage was an event where us-east-1’s Kinesis Data Streams stopped for hours, with cascading impact to many AWS services like CloudWatch, Cognito, and SQS. A lesson showing that the moment you depend on a streaming platform, its outage stops all your business.

For serious streaming, 2-3 dedicated SREs and Schema Registry are the minimum requirements.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, streaming gains importance as the path that immediately reflects AI-agent decisions. Fraud/anomaly detection and dynamic recommendation presuppose platforms where AI makes judgments in real time.

| Favored in the AI era | Disfavored in the AI era |
| --- | --- |
| Managed services (Kinesis, Pub/Sub) | Self-built Kafka (operations hard for AI to learn) |
| Schema-driven (Avro, Protobuf) | Free-form JSON |
| Event-driven, documented | Implicit timing dependencies |
| SQL-centric (ksqlDB, Flink SQL) | Custom DSLs (Domain-Specific Languages) |

In the era where AI writes the processing, streams with explicit schemas and contracts are favored. Manage types with Protobuf + Schema Registry, building a state where AI can safely generate processing code - that is the new standard.
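
What “explicit schema” looks like in code: a minimal sketch with confluent-kafka’s Schema Registry client. The registry URL, topic, and schema are illustrative assumptions; the point is that the registry rejects incompatible changes before they reach consumers.

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "double"},
    {"name": "coupon",   "type": ["null", "string"], "default": null}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serialize = AvroSerializer(registry, schema_str)

# Serialization registers/validates the schema against the subject's
# compatibility rules; adding "coupon" with a default stays compatible.
payload = serialize(
    {"order_id": "o-1", "amount": 42.0, "coupon": None},
    SerializationContext("orders", MessageField.VALUE),
)
```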

For real-time platforms, make schemas explicit. AI-era streaming lives or dies by “types.”

Common misconceptions

  • “Real-time is cooler, so adopt it” - operational cost is an order of magnitude higher. 90% of business requirements are fine with batch. Tech selection is decided by necessity
  • “Putting in Kafka solves the problem” - Kafka is only the foundation. A processing engine (Flink, etc.), schema management, monitoring, and an ops regime are also needed - the whole stack is heavy
  • “Exactly-Once means we’re safe” - the guarantee is not omnipotent. Without consumer-side idempotency, the end-to-end path is not protected
  • “Streaming because it’s fast” - depending on the design it can be slower than batch. Wrong window, state, or retry choices yield no performance

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague always invites the later question “why did we decide this again?”

  • Is real-time truly needed (re-confirm requirements)
  • Message queue (Kafka / Kinesis / Pub/Sub)
  • Processing engine (Flink / ksqlDB / Spark Streaming / not needed)
  • Guarantee level (At-Least-Once / Exactly-Once)
  • Schema management (Avro/Protobuf + Schema Registry)
  • Window design (time types, delay tolerance)
  • Monitoring/alerting (SLO (Service Level Objective), metrics, failure notifications)

How to make the final call

The core of streaming is starting from awareness that operational cost is 10x batch. Real-time isn’t chosen because it’s cool - 90% of business requirements are fine with batch or microbatch. Sub-100ms latency truly creates value only in limited cases - payments, fraud detection, ad bidding, IoT control - and choosing streaming elsewhere exhausts the team in 24/7 ops, Exactly-Once design, and window-management difficulty. The correct judgment order is “first ask if truly needed,” then “ask if there’s an ops regime.”

The decisive axis is whether schemas and types are explicit. As streaming grows in importance as the path for AI agents to make real-time decisions, a free-form JSON design leaves AI unable to safely generate processing. Make types explicit with Protobuf/Avro + Schema Registry and write SQL-centric processing (ksqlDB / Flink SQL) - that is the form of a streaming platform that does not become outdated in the AI era.

Selection priorities

  1. Question whether real-time is truly needed - if a few minutes of delay is OK, substitute with microbatch
  2. Managed first - lower the ops load with Kinesis/Pub/Sub; self-operating Kafka requires dedicated SREs
  3. Make schemas explicit - Protobuf/Avro + Schema Registry; no free-form JSON
  4. Design consumers idempotent - Exactly-Once is the shield, idempotency is the spear; you need both

“Streaming only when truly needed.” Settle ops regime and schema design before choosing.

Summary

This article covered streaming: selection among Kafka / Kinesis / Pub/Sub / Flink / ksqlDB, Exactly-Once and window processing, the freshness x operational-cost matrix, and judgment axes for avoiding over-investment in real-time.

Question whether real-time is truly needed, prioritize managed services, make schemas explicit, and design consumers idempotent. That is the practical answer for streaming in 2026.

Next time we’ll cover data governance (master management, catalog, regulatory compliance).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.