About this article
This article is the first article in the “Data Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of data architecture.
In 2026, “we want our data trained into AI” has become a default executive request, and whether your data is set up to be handled across systems is being re-evaluated as the precondition for AI utilization. This article surveys OLTP/OLAP separation, the data flow, datastore choices, and the structural fact that the quality of your data architecture sets the upper limit of AI utilization.
A full list of all articles in this category, with summaries and learning points, is available at the following page.
What is data architecture
Think of a company’s vault and ledger. Cash (operational data) is stored securely, while the ledger (analytical data) reveals the state of the business. Both deal with the same “money,” yet how they’re stored and how they’re used are completely different. Mix the two and both daily transactions and financial reporting slow down.
Data architecture is the discipline of designing where data lives, in what format, and how it flows between stores. It covers the separation of operational (OLTP) and analytical (OLAP) systems, datastore selection, and data-flow design.
Without data architecture, data gets siloed per system, making cross-organizational analytics and AI utilization impossible.
What data architecture covers
There was a time when “operational DB = data architecture” was synonymous, but as analytics, ML, and AI utilization have become routine, OLTP/OLAP separation, streaming processing, and data lakes have entered the picture, and data architecture has solidified as its own design domain.
Data is the gasoline of the digital economy. The quality of your data strategy maps directly to business competitiveness.
Why treat it as a separate architecture
1. Operational and analytical demands are fundamentally different
Operational: one transaction, fast and safe. Analytical: aggregate and analyze large volumes. Trying to do both on one DB drops both kinds of performance, so splitting the structure is mainstream.
2. Data spans organizations
Customer and product data are referenced by multiple systems. Holding them in different shapes per system fragments the data and makes enterprise-wide analytics impossible.
3. Data value is determined by accumulation
Applications can be rebuilt; past data is gone forever. A sloppy initial design means no usable data when you finally want to use it five years later.
Major data classifications
| Class | Trait | Typical storage |
|---|---|---|
| Operational (OLTP) | Transactions, orders, customers — daily business | RDB (PostgreSQL, MySQL) |
| Analytical (OLAP) | Operational data shaped for analytics | DWH |
| Unstructured | Images, video, logs, documents | Object storage (S3, etc.) |
| Event | Clicks, page views, sensor values — time-series | Streaming (Kafka, etc.) |
Each data type has different optimal storage and processing, so trying to handle all of it in one DB breaks down.
Typical data flow
The flow from operational systems through an analytics platform to BI tools and ML has converged on roughly this pattern:
Data lake = “a place to dump raw data.” DWH = “a DB shaped for analysis.” ETL/ELT (Extract / Transform / Load) = “the moving and reshaping machinery.” Memorizing those framings is step one.
Datastore choices
| Type | Strong at | Examples |
|---|---|---|
| RDB (relational) | ACID, complex JOINs | PostgreSQL, MySQL |
| KVS | Ultra-fast single-key reads | Redis, DynamoDB |
| Document DB | JSON-flavored flexible schema | MongoDB, Firestore |
| Columnar DB (OLAP) | Aggregating large volumes | BigQuery, Snowflake |
| Time-series DB | Sensor / metric retention | InfluxDB, TimescaleDB |
| Graph DB | Exploring relationships | Neo4j |
| Vector DB | Similarity search, AI embeddings | Pinecone, pgvector |
What you must decide — what’s your project’s answer?
For each item, articulate your project’s answer in 1-2 sentences.
Datastore / search / cache
| Item | Examples |
|---|---|
| Operational DB | PostgreSQL / MySQL / SQL Server |
| Analytical DWH | Snowflake / BigQuery / Redshift |
| Data lake | S3 / ADLS / GCS |
| Cache | Redis / Memcached |
| Search engine | Elasticsearch / OpenSearch |
| Vector DB | pgvector / Pinecone / Weaviate |
Data integration / governance
| Item | Examples |
|---|---|
| ETL/ELT | dbt / Airflow / Dataflow |
| Streaming | Kafka / Kinesis / Pub/Sub |
| Data modeling | Star / 3NF / Data Vault |
| Master management | Centralized MDM / federated SoR (System of Record) |
| Data catalog | DataHub / Glue / Collibra |
| Access control | IAM / Row-Level Security |
Author’s note — “stored, but not usable” cases
There’s a recurring story: a site decided to “just dump JSON event logs and analyze later” for years, only to find at use time that “field names differ between versions, timezones depend on the device and are inconsistent, and rows with missing keys flood the dataset” — almost no records were re-processable. The accumulation succeeded, but they hadn’t accumulated it in a usable form.
Another case: an operational DB and analytics ran on the same PostgreSQL, and every nightly batch dragged daytime order processing. The fix was bolting on a DWH after the fact, then spending six months in dual-maintenance mode. Effort that wouldn’t have been spent if OLTP and OLAP had been separate from day one.
I personally watched a single-PostgreSQL setup running both BI dashboards and daily aggregation slow the entire service down on monthly-batch days. The root cause is “accumulating with the wrong tool/structure for the data’s character.” Design-time decisions determine your utilization runway five years later — these are visible-from-outside cases of that fact.
Data, unlike applications, cannot be rebuilt. Design as a 5-year asset.
Volume × use ladder
Note: industry rates as of April 2026. Periodic refresh required.
In data architecture the optimal answer is determined by volume and freshness requirements. Up front: PostgreSQL alone is enough up to ~1 TB. At 10 TB, separate the DWH. Lakehouse and streaming become realistic only past 100 TB.
| Volume | Use case | Recommended DB | Freshness | Monthly cost (est) |
|---|---|---|---|---|
| Up to 100 GB | Operational CRUD | PostgreSQL alone | Real-time | ~$30 |
| Up to 1 TB | Operational + analytics | PostgreSQL + read replica | Daily OK | ~$300 |
| Up to 10 TB | Operational + BI | Aurora + BigQuery / Snowflake | Hourly | ~$3k |
| Up to 100 TB | Analytics-centric | DWH (Snowflake) + S3 | Minutes-to-seconds | ~$3-30k |
| 100 TB+ | ML + streaming | Lakehouse + Kafka + Flink | Real-time | $30k+ |
“Real-time streaming costs ~10x to operate” is the rule of thumb. 90% of business needs are met by daily batch. When you hear “real-time”, question it first — micro-batching (15-min cycles) is enough for most cases.
Data architecture evolves in stages with scale and freshness. Aiming at real-time from day one is over-investment.
Knowledge structure of this category
This category is composed of 7 articles in total. The structure follows store -> structure -> move -> govern, learning how to handle data step by step.
Store first selects where data lives (RDB, KVS, columnar, vector DB, etc.). Getting this wrong ripples through everything downstream.
Structure designs table schemas (normalization, star, etc.) and separates operational vs analytical platforms (DWH, data lake) on top of the chosen store.
Move designs how to transport structured data from operational DBs to the analytics platform. For most projects batch (ETL/ELT) is enough; streaming enters the picture only when real-time freshness is required.
Govern — data governance — is a cross-cutting theme covering catalog maintenance, quality management, and access-control design. In the AI era, metadata maintenance is a precondition for AI utilization, so it’s worth starting early rather than deferring.
Architecture-level traps
| Forbidden move | Why |
|---|---|
| Run analytical queries on the operational DB | Operational performance drops, customer perception suffers — OLTP/OLAP separation required |
| Sloppy schemaless JSON storage | Becomes unusable for AI 5 years later — make types explicit from day one |
| UUID v4 as primary key for high-volume inserts | Index locality breaks; switch to UUID v7 (RFC standardized 2024) |
| Drop foreign-key constraints for performance | Integrity destroyed, orphan data accumulates |
| Store binary (images, video) in the DB itself | S3 + URL reference is the rule; classic DB bloat |
| ”Maybe we’ll use it” dump-everything to S3 | Becomes a swamp without a catalog; reprocessing costs more than designing fresh |
| Aim for fully Centralized MDM in one go | Major outage to existing core systems; phase via Coexistence |
| Move PII to analytics DB without masking | The Meta 2023 EUR 1.2B fine pattern |
| Petabyte-scale streaming without a dedicated SRE | Becomes a SPOF, like the 2020 AWS Kinesis outage |
| Operate without a data catalog | Analytics speed drops 10x trying to answer “where is this data?” |
| Assuming one DB for everything is easier | Running OLTP and OLAP on one DB drops both performances; by the time you notice, both need a redo |
| Postponing analytics design for later | Bolting on ETL/ELT flows after the fact creates load on the existing DB and rework; decide the structure up front |
Data, unlike apps, cannot be rebuilt. Design as a 5-year asset.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Schema definitions, explicit types | Sloppy schemaless JSON |
| Data catalog and metadata maintained | No documentation, tribal knowledge |
| Vector DB and RAG (Retrieval-Augmented Generation) ready | Classical RDB only |
| History retained, time-series tracked | Overwrite-only |
- Separate operational and analytical — different tools for OLTP vs OLAP.
- Make schemas and types explicit from day one — don’t escape into schemaless; lay the foundation for future AI/analytics use.
- Pick the tool to fit the data type — RDB / Vector DB / object storage, mixed appropriately.
- Build metadata and catalogs — manage in a form readable by humans and AI alike.
Vector DB and RAG have become a new layer of data architecture
In products that leverage LLMs, a RAG pipeline — vectorizing internal documents and past Q&As for retrieval to improve LLM answer accuracy — is becoming a standard configuration. Alongside traditional RDB + object storage, vector DBs (pgvector, Pinecone, Weaviate) are establishing themselves as the third layer of data architecture.
Explicit schemas are the precondition for Text-to-SQL
To have AI generate SQL from natural-language questions like “show last month’s top 10 by revenue,” the table schemas, column meanings, and inter-table relationships must be explicit as metadata. With schemaless JSON storage or tables with unclear naming, AI cannot generate accurate SQL.
Summary
This article covered the big picture of data architecture — OLTP/OLAP separation, the per-type storage map, the volume × freshness ladder, and AI-era standard equipment.
Separate operational from analytical, make schemas and types explicit, match tools to data types, and design as a 5-year asset. The realistic answer for 2026.
The next article covers datastore selection (when to use RDB / KVS / columnar / vector DB / etc.).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (39/89)