About this article
This is the first article in the “Data Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of data architecture.
In 2026, “we want our data trained into AI” has become a default executive request, and whether your data is structured so it can be used across systems is being re-examined as the precondition for AI utilization. This article surveys OLTP/OLAP separation, the typical data flow, datastore choices, and the structural fact that the quality of your data architecture sets the upper limit of AI utilization.
What data architecture covers
There was a time when “data architecture” simply meant “the operational DB,” but as analytics, ML, and AI utilization have become routine, OLTP/OLAP separation, stream processing, and data lakes have entered the picture, and data architecture has solidified into a design domain of its own.
Data is the gasoline of the digital economy. The quality of your data strategy maps directly to business competitiveness.
Why treat it as a separate architecture
1. Operational and analytical demands are fundamentally different
Operational work needs each transaction processed quickly and safely; analytical work needs large volumes aggregated and analyzed. Trying to do both on one DB degrades both, so separating the two is now the mainstream approach.
2. Data spans organizations
Customer and product data are referenced by multiple systems. Holding them in different shapes per system fragments the data and makes enterprise-wide analytics impossible.
3. Data value is determined by accumulation
Applications can be rebuilt; past data is gone forever. A sloppy initial design means no usable data when you finally want to use it five years later.
Major data classifications
| Class | Trait | Typical storage |
|---|---|---|
| Operational (OLTP) | Transactions, orders, customers — daily business | RDB (PostgreSQL, MySQL) |
| Analytical (OLAP) | Operational data shaped for analytics | DWH (Data Warehouse — Snowflake, BigQuery) |
| Unstructured | Images, video, logs, documents | Object storage (S3, etc.) |
| Event | Clicks, page views, sensor values — time-series | Streaming (Kafka, etc.) |
Each data type has different optimal storage and processing, so trying to handle all of it in one DB breaks down.
Typical data flow
The flow from operational systems through an analytics platform to BI (Business Intelligence) tools and ML has converged on roughly this pattern:
```mermaid
flowchart LR
  OP["Operational systems<br/>(OLTP)"] -->|raw data| DL["Data lake<br/>(S3, etc.)"]
  DL -->|ETL/ELT| DWH["Data warehouse<br/>(BigQuery / Snowflake)"]
  DWH --> BI["BI tools<br/>(Looker / Tableau)"]
  DWH --> ML["ML platform"]
  DWH --> AI["AI agents"]
  classDef op fill:#dbeafe,stroke:#2563eb;
  classDef store fill:#fef3c7,stroke:#d97706;
  classDef use fill:#fae8ff,stroke:#a21caf;
  class OP op;
  class DL,DWH store;
  class BI,ML,AI use;
```
Data lake = “a place to dump raw data.” DWH = “a DB shaped for analysis.” ETL/ELT (Extract / Transform / Load) = “the moving and reshaping machinery.” Memorizing those framings is step one.
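To make that flow concrete, here is a minimal extract-and-land sketch under a few assumptions: a PostgreSQL operational DB, an S3 bucket named example-data-lake, an orders table with a created_at column, and the pandas / SQLAlchemy / boto3 / pyarrow libraries installed. It illustrates the ELT shape, not a production pipeline.

```python
# Minimal ELT sketch (illustrative names): extract yesterday's orders from the
# operational DB and land them, untransformed, as Parquet in the data lake.
# The Transform step happens later inside the DWH (e.g. dbt models).
import datetime

import boto3                                  # S3 client
import pandas as pd                           # needs pyarrow for to_parquet
from sqlalchemy import create_engine, text    # DB connectivity for pandas

engine = create_engine("postgresql://app:secret@oltp-host:5432/shop")  # assumed DSN
bucket = "example-data-lake"                                           # assumed bucket

day = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

# Extract: read-only query against the operational DB (ideally a read replica).
with engine.connect() as conn:
    df = pd.read_sql(
        text("SELECT * FROM orders WHERE CAST(created_at AS date) = :day"),
        conn,
        params={"day": day},
    )

# Load (raw): write Parquet into the lake, partitioned by date, no reshaping yet.
local_path = f"orders_{day}.parquet"
df.to_parquet(local_path, index=False)
boto3.client("s3").upload_file(local_path, bucket, f"raw/orders/dt={day}/{local_path}")
```

In practice an orchestrator such as Airflow schedules a job like this per table; the point is simply that raw data lands in the lake first and all reshaping happens downstream in the DWH.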
Datastore choices
| Type | Strong at | Examples |
|---|---|---|
| RDB (relational) | ACID, complex JOINs | PostgreSQL, MySQL |
| KVS (Key-Value) | Ultra-fast single-key reads | Redis, DynamoDB |
| Document DB | JSON-flavored flexible schema | MongoDB, Firestore |
| Columnar DB (OLAP) | Aggregating large volumes | BigQuery, Snowflake |
| Time-series DB | Sensor / metric retention | InfluxDB, TimescaleDB |
| Graph DB | Exploring relationships | Neo4j |
| Vector DB | Similarity search, AI embeddings | Pinecone, pgvector |
What you must decide — what’s your project’s answer?
For each item, articulate your project’s answer in 1-2 sentences.
Datastore / search / cache
| Item | Examples |
|---|---|
| Operational DB | PostgreSQL / MySQL / SQL Server |
| Analytical DWH | Snowflake / BigQuery / Redshift |
| Data lake | S3 / ADLS / GCS |
| Cache | Redis / Memcached |
| Search engine | Elasticsearch / OpenSearch |
| Vector DB | pgvector / Pinecone / Weaviate |
Data integration / governance
| Item | Examples |
|---|---|
| ETL/ELT | dbt / Airflow / Dataflow |
| Streaming | Kafka / Kinesis / Pub/Sub |
| Data modeling | Star / 3NF / Data Vault |
| Master management | Centralized MDM (Master Data Management) / federated SoR (System of Record) |
| Data catalog | DataHub / Glue / Collibra |
| Access control | IAM / Row-Level Security |
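To illustrate the access-control row, here is a minimal PostgreSQL Row-Level Security sketch; the orders table, its tenant_id column, the app.tenant_id session setting, and the connection string are illustrative assumptions.

```python
# Minimal Row-Level Security sketch (illustrative table/column names):
# each session declares its tenant, and the policy filters rows to that tenant,
# whether the reader is a BI tool, an analyst, or an AI agent's service role.
import psycopg2

DDL = """
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Every query on orders is silently filtered by the current tenant setting.
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id')::int);
"""

with psycopg2.connect("dbname=shop user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # A (non-owner) reader session would then run:
        #   SET app.tenant_id = '42';
        #   SELECT * FROM orders;   -- only tenant 42's rows come back
```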
Author’s note — “stored, but not usable” cases
There’s a recurring story: a team spent years “just dumping JSON event logs to analyze later,” only to find, when they finally tried to use them, that field names differed between versions, timezones varied by device and were inconsistent, and rows with missing keys flooded the dataset; almost no records could be reprocessed. The accumulation succeeded, but they hadn’t accumulated the data in a usable form.
Another case: an operational DB and analytics ran on the same PostgreSQL, and every nightly batch dragged daytime order processing. The fix was bolting on a DWH after the fact, then spending six months in dual-maintenance mode. Effort that wouldn’t have been spent if OLTP and OLAP had been separate from day one.
I personally watched a single-PostgreSQL setup running both BI dashboards and daily aggregation slow the entire service down on monthly-batch days. The root cause in every case is accumulating data with the wrong tool or structure for its character; these are visible-from-the-outside examples of how design-time decisions determine your utilization runway five years later.
Data, unlike applications, cannot be rebuilt. Design as a 5-year asset.
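As a minimal sketch of the countermeasure to the first story, validating and normalizing events at ingestion time keeps the lake reprocessable. It assumes the pydantic library; the event fields and the normalize-to-UTC rule are illustrative.

```python
# Minimal ingestion-time validation sketch (pydantic assumed; field names illustrative).
# Events that don't match the declared schema are rejected up front, and timestamps
# are normalized to UTC, so the lake never fills up with unreprocessable JSON.
from datetime import datetime, timezone

from pydantic import BaseModel, ValidationError, field_validator


class PageViewEvent(BaseModel):
    user_id: str
    page: str
    occurred_at: datetime          # explicit type instead of "whatever the client sent"

    @field_validator("occurred_at")
    @classmethod
    def force_utc(cls, ts: datetime) -> datetime:
        # Device-local timestamps are normalized to UTC before storage.
        return ts.astimezone(timezone.utc) if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def ingest(raw: dict) -> PageViewEvent | None:
    try:
        return PageViewEvent(**raw)
    except ValidationError as err:
        # Route to a dead-letter queue / error log instead of polluting the lake.
        print(f"rejected event: {err}")
        return None


ingest({"user_id": "u1", "page": "/pricing", "occurred_at": "2026-01-15T09:00:00+09:00"})
ingest({"page": "/pricing"})   # missing keys are rejected, not silently stored
```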
Volume × use ladder
Note: industry rates as of April 2026. Periodic refresh required.
In data architecture the optimal answer is determined by volume and freshness requirements. Up front: PostgreSQL (plus a read replica as you grow) is enough up to ~1 TB; at 10 TB, split out a DWH; lakehouse and streaming become realistic only past 100 TB.
| Volume | Use case | Recommended DB | Freshness | Monthly cost (est) |
|---|---|---|---|---|
| Up to 100 GB | Operational CRUD | PostgreSQL alone | Real-time | ~$30 |
| Up to 1 TB | Operational + analytics | PostgreSQL + read replica | Daily OK | ~$300 |
| Up to 10 TB | Operational + BI | Aurora + BigQuery / Snowflake | Hourly | ~$3k |
| Up to 100 TB | Analytics-centric | DWH (Snowflake) + S3 | Minutes-to-seconds | ~$3-30k |
| 100 TB+ | ML + streaming | Lakehouse + Kafka + Flink | Real-time | $30k+ |
The rule of thumb is that real-time streaming costs roughly 10x as much to operate as batch, while 90% of business needs are met by a daily batch. When you hear “real-time,” question it first; micro-batching on a 15-minute cycle (sketched below) is enough for most cases.
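As a concrete shape for that 15-minute micro-batch, here is a minimal watermark-based incremental load sketch; the connection string, the orders table with an updated_at column, and the local state file are illustrative assumptions.

```python
# Minimal 15-minute micro-batch sketch (illustrative names): instead of a streaming
# pipeline, re-run this every 15 minutes (cron / Airflow) and pull only the rows
# that arrived since the last successful run (a "watermark").
import json
import pathlib

import pandas as pd
from sqlalchemy import create_engine, text

STATE_FILE = pathlib.Path("orders_watermark.json")                     # last loaded timestamp
engine = create_engine("postgresql://app:secret@oltp-host:5432/shop")  # assumed DSN

watermark = json.loads(STATE_FILE.read_text())["last_ts"] if STATE_FILE.exists() else "1970-01-01"

with engine.connect() as conn:
    df = pd.read_sql(
        text("SELECT * FROM orders WHERE updated_at > :wm ORDER BY updated_at"),
        conn,
        params={"wm": watermark},
    )

if not df.empty:
    # Append the increment to the analytics side (here: a Parquet drop the DWH ingests).
    df.to_parquet(f"increment_{watermark.replace(':', '-')}.parquet", index=False)
    STATE_FILE.write_text(json.dumps({"last_ts": df["updated_at"].max().isoformat()}))
```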
Data architecture evolves in stages with scale and freshness. Aiming at real-time from day one is over-investment.
Architecture-level traps
| Forbidden move | Why |
|---|---|
| Run analytical queries on the operational DB | Operational performance drops, customer perception suffers — OLTP/OLAP separation required |
| Sloppy schemaless JSON storage | Becomes unusable for AI 5 years later — make types explicit from day one |
| UUID v4 as primary key for high-volume inserts | Index locality breaks; switch to UUID v7 (RFC 9562, 2024); see the sketch after this table |
| Drop foreign-key constraints for performance | Integrity destroyed, orphan data accumulates |
| Store binary (images, video) in the DB itself | S3 + URL reference is the rule; classic DB bloat |
| “Maybe we’ll use it” dump-everything to S3 | Becomes a swamp without a catalog; reprocessing costs more than designing fresh |
| Aim for fully Centralized MDM in one go | Major outage to existing core systems; phase via Coexistence |
| Move PII to analytics DB without masking | Invites GDPR-scale penalties (cf. Meta’s EUR 1.2B fine in 2023) |
| Petabyte-scale streaming without a dedicated SRE | Becomes a SPOF, like the 2020 AWS Kinesis outage |
| Operate without a data catalog | Analytics speed drops 10x trying to answer “where is this data?” |
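To illustrate the UUID row above: v7 keys carry a millisecond timestamp prefix, so insert order matches index order. A minimal sketch, assuming Python 3.14+ where uuid.uuid7() is available (older versions can use the third-party uuid6 package):

```python
# Minimal sketch of why UUID v7 beats v4 as a high-insert-rate primary key.
# Assumes Python 3.14+ (uuid.uuid7); on older versions the third-party "uuid6"
# package exposes an equivalent uuid7() function.
import time
import uuid

v4_keys = [uuid.uuid4() for _ in range(5)]

v7_keys = []
for _ in range(5):
    v7_keys.append(uuid.uuid7())
    time.sleep(0.002)            # spread generation across milliseconds

# v4 is fully random: sort order has nothing to do with insert order, so a B-tree
# primary-key index takes writes on random pages (poor locality, page splits).
print(sorted(v4_keys) == v4_keys)   # almost always False

# v7 starts with a millisecond timestamp: insert order matches sort order, so new
# rows land at the right-hand end of the index and locality is preserved.
print(sorted(v7_keys) == v7_keys)   # True
```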
Data, unlike apps, cannot be rebuilt. Design as a 5-year asset.
The AI-era lens
Once AI-driven development (vibe coding) and AI utilization (RAG, agents) are the baseline assumption, keeping data in “a shape AI can consume” matters far more. AI accuracy is a function of clean structured data and meaningful metadata, so data tidiness sets the ceiling on AI utilization.
| AI-era favorable | AI-era unfavorable |
|---|---|
| Schema definitions, explicit types | Sloppy schemaless JSON |
| Data catalog and metadata maintained | No documentation, tribal knowledge |
| Vector DB and RAG (Retrieval-Augmented Generation) ready | Classical RDB only |
| History retained, time-series tracked | Overwrite-only |
When AI agents start manipulating data, structured data that SQL can query, names readable in natural language, and clear permissions matter even more than they did when only human designers were reading the schema.
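As a minimal sketch of “vector DB and RAG ready,” here is pgvector-style similarity search; the docs table, the toy 3-dimensional vectors (real embeddings run to hundreds or thousands of dimensions), and the psycopg2 connection string are illustrative assumptions.

```python
# Minimal pgvector sketch (illustrative names, toy 3-dimensional vectors):
# store document embeddings next to the relational data and retrieve the
# nearest neighbour for a query embedding -- the storage side of RAG.
import psycopg2

SETUP = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS docs (
    id        bigserial PRIMARY KEY,
    body      text NOT NULL,
    embedding vector(3)            -- real embeddings are e.g. 768 or 1536 dims
);
"""

with psycopg2.connect("dbname=shop user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(SETUP)
        cur.execute(
            "INSERT INTO docs (body, embedding) VALUES (%s, %s), (%s, %s)",
            ("refund policy", "[0.9,0.1,0.0]", "release notes", "[0.0,0.2,0.9]"),
        )
        # "<->" is pgvector's distance operator: smaller = more similar.
        cur.execute(
            "SELECT body FROM docs ORDER BY embedding <-> %s LIMIT 1",
            ("[0.8,0.2,0.1]",),
        )
        print(cur.fetchone())    # ('refund policy',)
```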
The quality of your data architecture becomes the AI utilization ceiling.
Common misreadings
- One DB for everything is easier → Running operational (OLTP) and analytical (OLAP) workloads on one DB degrades both, and by the time you notice, both need a redo. Separate from day one.
- Schemaless = future-friendly → Schemaless looks flexible but data quality goes uncontrolled, and in the AI era it becomes “an asset AI cannot read.” Make types explicit from day one.
- Just keep the data, you can use it later → Sloppy accumulation produces a future state of “no usable data.” Without deciding “what / what type / what granularity” to capture, volume produces zero value.
- We’ll figure out analytics later → Bolting on operational-to-analytical (ETL/ELT) flows after the fact creates load on the existing DB and rework. Decide on the data lake + DWH structure up front.
How to make the final call
The core of data architecture is the asymmetry: applications can be rebuilt, but past data is gone forever. To avoid the “no usable data” state when you start to use it five years later, build schema definitions, history retention, and metadata maintenance in from the start. Separate operational (OLTP) from analytical (OLAP) early and pick the right tool for each.
The decisive axis is “the AI utilization ceiling is set by data tidiness.” When AI agents are hitting SQL and RAG is doing vector search, schemaless sloppy JSON is effectively unusable. Type definitions, data catalogs, and vector DB readiness aren’t luxuries — they’re standard equipment in the AI era.
Selection priority
- Separate operational and analytical — different tools for OLTP vs OLAP.
- Make schemas and types explicit from day one — don’t escape into schemaless; lay the foundation for future AI/analytics use.
- Pick the tool to fit the data type — RDB / Vector DB / object storage, mixed appropriately.
- Build metadata and catalogs — manage in a form readable by humans and AI alike.
“Data architecture is a 5-year asset.” Today’s tidiness sets tomorrow’s AI runway.
Summary
This article covered the big picture of data architecture — OLTP/OLAP separation, the per-type storage map, the volume × freshness ladder, and AI-era standard equipment.
Separate operational from analytical, make schemas and types explicit, match tools to data types, and design as a 5-year asset. The realistic answer for 2026.
The next article covers datastore selection (when to use RDB / KVS / columnar / vector DB / etc.).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (39/89)