Data Architecture Overview — The Foundation for AI in Your Org

About this article

This article is the first article in the “Data Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of data architecture.

In 2026, “we want our data trained into AI” has become a default executive request, and whether your data is set up to be handled across systems is being re-evaluated as the precondition for AI utilization. This article surveys OLTP/OLAP separation, the data flow, datastore choices, and the structural fact that the quality of your data architecture sets the upper limit of AI utilization.

What data architecture covers

There was a time when “operational DB = data architecture” was synonymous, but as analytics, ML, and AI utilization have become routine, OLTP/OLAP separation, streaming processing, and data lakes have entered the picture, and data architecture has solidified as its own design domain.

Data is the gasoline of the digital economy. The quality of your data strategy maps directly to business competitiveness.

Why treat it as a separate architecture

1. Operational and analytical demands are fundamentally different

Operational: one transaction, fast and safe. Analytical: aggregate and analyze large volumes. Trying to do both on one DB drops both kinds of performance, so splitting the structure is mainstream.

2. Data spans organizations

Customer and product data are referenced by multiple systems. Holding them in different shapes per system fragments the data and makes enterprise-wide analytics impossible.

3. Data value is determined by accumulation

Applications can be rebuilt; past data is gone forever. A sloppy initial design means no usable data when you finally want to use it five years later.

Major data classifications

Class	Trait	Typical storage
Operational (OLTP)	Transactions, orders, customers — daily business	RDB (PostgreSQL, MySQL)
Analytical (OLAP)	Operational data shaped for analytics	DWH (Data Warehouse — Snowflake, BigQuery)
Unstructured	Images, video, logs, documents	Object storage (S3, etc.)
Event	Clicks, page views, sensor values — time-series	Streaming (Kafka, etc.)

Each data type has different optimal storage and processing, so trying to handle all of it in one DB breaks down.

Typical data flow

The flow from operational systems through an analytics platform to BI (Business Intelligence) tools and ML has converged on roughly this pattern:

flowchart LR
    OP["Operational systems<br/>(OLTP)"] -->|raw data| DL["Data lake<br/>(S3, etc.)"]
    DL -->|ETL/ELT| DWH["Data warehouse<br/>(BigQuery / Snowflake)"]
    DWH --> BI["BI tools<br/>(Looker / Tableau)"]
    DWH --> ML["ML platform"]
    DWH --> AI["AI agents"]
    classDef op fill:#dbeafe,stroke:#2563eb;
    classDef store fill:#fef3c7,stroke:#d97706;
    classDef use fill:#fae8ff,stroke:#a21caf;
    class OP op;
    class DL,DWH store;
    class BI,ML,AI use;

Data lake = “a place to dump raw data.” DWH = “a DB shaped for analysis.” ETL/ELT (Extract / Transform / Load) = “the moving and reshaping machinery.” Memorizing those framings is step one.

Datastore choices

Type	Strong at	Examples
RDB (relational)	ACID, complex JOINs	PostgreSQL, MySQL
KVS (Key-Value)	Ultra-fast single-key reads	Redis, DynamoDB
Document DB	JSON-flavored flexible schema	MongoDB, Firestore
Columnar DB (OLAP)	Aggregating large volumes	BigQuery, Snowflake
Time-series DB	Sensor / metric retention	InfluxDB, TimescaleDB
Graph DB	Exploring relationships	Neo4j
Vector DB	Similarity search, AI embeddings	Pinecone, pgvector

What you must decide — what’s your project’s answer?

For each item, articulate your project’s answer in 1-2 sentences.

Datastore / search / cache

Item	Examples
Operational DB	PostgreSQL / MySQL / SQL Server
Analytical DWH	Snowflake / BigQuery / Redshift
Data lake	S3 / ADLS / GCS
Cache	Redis / Memcached
Search engine	Elasticsearch / OpenSearch
Vector DB	pgvector / Pinecone / Weaviate

Data integration / governance

Item	Examples
ETL/ELT	dbt / Airflow / Dataflow
Streaming	Kafka / Kinesis / Pub/Sub
Data modeling	Star / 3NF / Data Vault
Master management	Centralized MDM (Master Data Management) / federated SoR (System of Record)
Data catalog	DataHub / Glue / Collibra
Access control	IAM / Row-Level Security

Author’s note — “stored, but not usable” cases

There’s a recurring story: a site decided to “just dump JSON event logs and analyze later” for years, only to find at use time that “field names differ between versions, timezones depend on the device and are inconsistent, and rows with missing keys flood the dataset” — almost no records were re-processable. The accumulation succeeded, but they hadn’t accumulated it in a usable form.

Another case: an operational DB and analytics ran on the same PostgreSQL, and every nightly batch dragged daytime order processing. The fix was bolting on a DWH after the fact, then spending six months in dual-maintenance mode. Effort that wouldn’t have been spent if OLTP and OLAP had been separate from day one.

I personally watched a single-PostgreSQL setup running both BI dashboards and daily aggregation slow the entire service down on monthly-batch days. The root cause is “accumulating with the wrong tool/structure for the data’s character.” Design-time decisions determine your utilization runway five years later — these are visible-from-outside cases of that fact.

Data, unlike applications, cannot be rebuilt. Design as a 5-year asset.

Volume × use ladder

Note: industry rates as of April 2026. Periodic refresh required.

In data architecture the optimal answer is determined by volume and freshness requirements. Up front: PostgreSQL alone is enough up to ~1 TB. At 10 TB, separate the DWH. Lakehouse and streaming become realistic only past 100 TB.

Volume	Use case	Recommended DB	Freshness	Monthly cost (est)
Up to 100 GB	Operational CRUD	PostgreSQL alone	Real-time	~$30
Up to 1 TB	Operational + analytics	PostgreSQL + read replica	Daily OK	~$300
Up to 10 TB	Operational + BI	Aurora + BigQuery / Snowflake	Hourly	~$3k
Up to 100 TB	Analytics-centric	DWH (Snowflake) + S3	Minutes-to-seconds	~$3-30k
100 TB+	ML + streaming	Lakehouse + Kafka + Flink	Real-time	$30k+

“Real-time streaming costs ~10x to operate” is the rule of thumb. 90% of business needs are met by daily batch. When you hear “real-time”, question it first — micro-batching (15-min cycles) is enough for most cases.

Data architecture evolves in stages with scale and freshness. Aiming at real-time from day one is over-investment.

Architecture-level traps

Forbidden move	Why
Run analytical queries on the operational DB	Operational performance drops, customer perception suffers — OLTP/OLAP separation required
Sloppy schemaless JSON storage	Becomes unusable for AI 5 years later — make types explicit from day one
UUID v4 as primary key for high-volume inserts	Index locality breaks; switch to UUID v7 (RFC standardized 2024)
Drop foreign-key constraints for performance	Integrity destroyed, orphan data accumulates
Store binary (images, video) in the DB itself	S3 + URL reference is the rule; classic DB bloat
”Maybe we’ll use it” dump-everything to S3	Becomes a swamp without a catalog; reprocessing costs more than designing fresh
Aim for fully Centralized MDM in one go	Major outage to existing core systems; phase via Coexistence
Move PII to analytics DB without masking	The Meta 2023 EUR 1.2B fine pattern
Petabyte-scale streaming without a dedicated SRE	Becomes a SPOF, like the 2020 AWS Kinesis outage
Operate without a data catalog	Analytics speed drops 10x trying to answer “where is this data?”

Data, unlike apps, cannot be rebuilt. Design as a 5-year asset.

The AI-era lens

With AI-driven development (vibe coding) and AI utilization (RAG, agents) as the assumption, data architecture’s importance for “a shape AI can consume” jumps. AI accuracy is a function of clean structured data and meaningful metadata, so data tidiness = ceiling on AI utilization.

AI-era favorable	AI-era unfavorable
Schema definitions, explicit types	Sloppy schemaless JSON
Data catalog and metadata maintained	No documentation, tribal knowledge
Vector DB and RAG (Retrieval-Augmented Generation) ready	Classical RDB only
History retained, time-series tracked	Overwrite-only

When AI agents start manipulating data, structured data that SQL can hit, names readable in natural language, and clear permissions matter more than they did for human designers.

The quality of your data architecture becomes the AI utilization ceiling.

Common misreadings

One DB for everything is easier → Running operational (OLTP) and analytical (OLAP) on one DB drops both performances, and by the time you notice, both need a redo. Separate from day one.
Schemaless = future-friendly → Schemaless looks flexible but data quality goes uncontrolled, and in the AI era it becomes “an asset AI cannot read.” Make types explicit from day one.
Just keep the data, you can use it later → Sloppy accumulation produces a future state of “no usable data.” Without deciding “what / what type / what granularity” to capture, volume produces zero value.
We’ll figure out analytics later → Bolting on operational-to-analytical (ETL/ELT) flows after the fact creates load on the existing DB and rework. Decide on the data lake + DWH structure up front.

How to make the final call

The core of data architecture is the asymmetry: applications can be rebuilt, but past data is gone forever. To avoid the “no usable data” state when you start to use it five years later, build schema definitions, history retention, and metadata maintenance in from the start. Separate operational (OLTP) from analytical (OLAP) early and pick the right tool for each.

The decisive axis is “the AI utilization ceiling is set by data tidiness.” When AI agents are hitting SQL and RAG is doing vector search, schemaless sloppy JSON is effectively unusable. Type definitions, data catalogs, and vector DB readiness aren’t luxuries — they’re standard equipment in the AI era.

Selection priority

Separate operational and analytical — different tools for OLTP vs OLAP.
Make schemas and types explicit from day one — don’t escape into schemaless; lay the foundation for future AI/analytics use.
Pick the tool to fit the data type — RDB / Vector DB / object storage, mixed appropriately.
Build metadata and catalogs — manage in a form readable by humans and AI alike.

“Data architecture is a 5-year asset.” Today’s tidiness sets tomorrow’s AI runway.

Summary

This article covered the big picture of data architecture — OLTP/OLAP separation, the per-type storage map, the volume × freshness ladder, and AI-era standard equipment.

Separate operational from analytical, make schemas and types explicit, match tools to data types, and design as a 5-year asset. The realistic answer for 2026.

The next article covers datastore selection (when to use RDB / KVS / columnar / vector DB / etc.).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.