Data Architecture

Data Architecture Overview — The Foundation for AI in Your Org

About this article

This is the first article in the “Data Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of data architecture.

In 2026, “we want our data trained into AI” has become a default executive request, and whether your data is organized so it can actually flow between systems is being re-evaluated as the precondition for AI utilization. This article surveys OLTP/OLAP separation, the typical data flow, datastore choices, and the structural fact that the quality of your data architecture sets the upper limit of AI utilization.

What data architecture covers

There was a time when data architecture meant little more than the operational DB, but as analytics, ML, and AI utilization have become routine, OLTP/OLAP separation, stream processing, and data lakes have entered the picture, and data architecture has solidified into a design domain of its own.

Data is the gasoline of the digital economy. The quality of your data strategy maps directly to business competitiveness.

Why treat it as a separate architecture

1. Operational and analytical demands are fundamentally different

Operational work needs to process a single transaction quickly and safely; analytical work needs to aggregate and scan large volumes. Trying to serve both from one DB degrades both, so separating the two structurally is now the mainstream approach.
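
To make the contrast concrete, here is a minimal sketch (Python with in-memory SQLite; the table and column names are invented for this illustration) of the two access patterns hitting the same table: a single-row OLTP lookup versus an OLAP-style full scan and aggregation.

    import sqlite3

    # Minimal sketch: the same "orders" table hit by an OLTP-style and an OLAP-style query.
    # Table and column names are hypothetical, chosen only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            amount      INTEGER NOT NULL,
            ordered_at  TEXT    NOT NULL
        )
    """)
    conn.executemany(
        "INSERT INTO orders (customer_id, amount, ordered_at) VALUES (?, ?, ?)",
        [(c, 1000 + c, f"2026-{(c % 12) + 1:02d}-15") for c in range(1, 10_001)],
    )

    # OLTP: touch one row, quickly and safely.
    one_order = conn.execute("SELECT * FROM orders WHERE id = ?", (42,)).fetchone()

    # OLAP: scan and aggregate the whole table; on a shared DB this competes with OLTP traffic.
    monthly = conn.execute(
        "SELECT substr(ordered_at, 1, 7) AS month, SUM(amount) FROM orders GROUP BY month"
    ).fetchall()
    print(one_order, monthly[:2])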

2. Data spans organizations

Customer and product data are referenced by multiple systems. Holding them in different shapes per system fragments the data and makes enterprise-wide analytics impossible.

3. Data value is determined by accumulation

Applications can be rebuilt; past data is gone forever. A sloppy initial design means no usable data when you finally want to use it five years later.

Major data classifications

Class | Trait | Typical storage
Operational (OLTP) | Transactions, orders, customers (daily business) | RDB (PostgreSQL, MySQL)
Analytical (OLAP) | Operational data shaped for analytics | DWH / Data Warehouse (Snowflake, BigQuery)
Unstructured | Images, video, logs, documents | Object storage (S3, etc.)
Event | Clicks, page views, sensor values (time-series) | Streaming (Kafka, etc.)

Each data type has different optimal storage and processing, so trying to handle all of it in one DB breaks down.

Typical data flow

The flow from operational systems through an analytics platform to BI (Business Intelligence) tools and ML has converged on roughly this pattern:

flowchart LR
    OP["Operational systems<br/>(OLTP)"] -->|raw data| DL["Data lake<br/>(S3, etc.)"]
    DL -->|ETL/ELT| DWH["Data warehouse<br/>(BigQuery / Snowflake)"]
    DWH --> BI["BI tools<br/>(Looker / Tableau)"]
    DWH --> ML["ML platform"]
    DWH --> AI["AI agents"]
    classDef op fill:#dbeafe,stroke:#2563eb;
    classDef store fill:#fef3c7,stroke:#d97706;
    classDef use fill:#fae8ff,stroke:#a21caf;
    class OP op;
    class DL,DWH store;
    class BI,ML,AI use;

Data lake = “a place to dump raw data.” DWH = “a DB shaped for analysis.” ETL/ELT (Extract-Transform-Load / Extract-Load-Transform) = “the machinery that moves and reshapes the data.” Memorizing those framings is step one.
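
As one way to picture the lake-to-DWH leg, here is a minimal ELT sketch in Python using only the standard library; the lake directory, the event field names, and the SQLite "warehouse" are stand-ins for S3, BigQuery / Snowflake, and your real event schema.

    import json
    import sqlite3
    from pathlib import Path

    # Minimal ELT sketch, standard library only. Raw JSON-lines events sit in a "lake"
    # directory; they are loaded as-is, then reshaped into an analytics-friendly table.
    lake_dir = Path("lake/events")              # in reality: the raw zone of your data lake
    warehouse = sqlite3.connect("warehouse.db")

    # Extract + Load: copy raw records into the warehouse without losing anything.
    warehouse.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    for f in sorted(lake_dir.glob("*.jsonl")):
        with f.open() as fh:
            warehouse.executemany(
                "INSERT INTO raw_events (payload) VALUES (?)",
                ((line,) for line in fh if line.strip()),
            )

    # Transform: shape the raw payloads into typed columns for BI / ML to query.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_page_views (user_id TEXT, url TEXT, viewed_at TEXT)"
    )
    for (payload,) in warehouse.execute("SELECT payload FROM raw_events").fetchall():
        e = json.loads(payload)
        warehouse.execute(
            "INSERT INTO fact_page_views VALUES (?, ?, ?)",
            (e.get("user_id"), e.get("url"), e.get("viewed_at")),
        )
    warehouse.commit()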

Datastore choices

Type | Strong at | Examples
RDB (relational) | ACID, complex JOINs | PostgreSQL, MySQL
KVS (Key-Value) | Ultra-fast single-key reads | Redis, DynamoDB
Document DB | JSON-flavored flexible schema | MongoDB, Firestore
Columnar DB (OLAP) | Aggregating large volumes | BigQuery, Snowflake
Time-series DB | Sensor / metric retention | InfluxDB, TimescaleDB
Graph DB | Exploring relationships | Neo4j
Vector DB | Similarity search, AI embeddings (sketch below) | Pinecone, pgvector
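
To show what the similarity-search row means in practice, here is a toy sketch of the core operation a vector DB performs: rank stored embeddings by cosine similarity against a query embedding. The document names and vector values are made up; real systems such as pgvector or Pinecone add approximate-nearest-neighbor indexes so this scales far beyond a brute-force loop.

    import math

    # Toy sketch of the core operation behind a vector DB: store embeddings, rank by
    # cosine similarity. Documents and vector values are made up for illustration.
    docs = {
        "refund policy":   [0.9, 0.1, 0.0],
        "shipping times":  [0.1, 0.8, 0.2],
        "return shipping": [0.6, 0.5, 0.1],
    }

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    query = [0.8, 0.3, 0.0]   # pretend embedding of "how do I get my money back?"
    for name, vec in sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True):
        print(f"{cosine(query, vec):.3f}  {name}")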

What you must decide — what’s your project’s answer?

For each item, articulate your project’s answer in 1-2 sentences.

Datastore / search / cache

Item | Examples
Operational DB | PostgreSQL / MySQL / SQL Server
Analytical DWH | Snowflake / BigQuery / Redshift
Data lake | S3 / ADLS / GCS
Cache | Redis / Memcached
Search engine | Elasticsearch / OpenSearch
Vector DB | pgvector / Pinecone / Weaviate

Data integration / governance

Item | Examples
ETL/ELT | dbt / Airflow / Dataflow (DAG sketch below)
Streaming | Kafka / Kinesis / Pub/Sub
Data modeling | Star schema / 3NF / Data Vault
Master data management | Centralized MDM (Master Data Management) / federated SoR (System of Record)
Data catalog | DataHub / Glue / Collibra
Access control | IAM / Row-Level Security
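
For the ETL/ELT row, here is a minimal sketch of how a nightly pipeline is typically expressed as an orchestrated DAG, assuming Airflow 2.4+; the task bodies are placeholders, and the dag_id and task names are invented for this example.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Minimal sketch of a nightly ELT pipeline as an orchestrated DAG (assumes Airflow 2.4+).
    # Task bodies are placeholders; in practice they would call your lake / DWH / dbt clients.

    def extract_to_lake():
        print("dump yesterday's operational data into the lake")

    def load_to_dwh():
        print("load raw files from the lake into staging tables")

    def transform_in_dwh():
        print("run SQL / dbt models to build analytics tables")

    with DAG(
        dag_id="nightly_elt",
        start_date=datetime(2026, 1, 1),
        schedule="@daily",     # "schedule" is the Airflow 2.4+ keyword; daily batch is the default choice
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_to_lake", python_callable=extract_to_lake)
        load = PythonOperator(task_id="load_to_dwh", python_callable=load_to_dwh)
        transform = PythonOperator(task_id="transform_in_dwh", python_callable=transform_in_dwh)

        extract >> load >> transform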

Author’s note — “stored, but not usable” cases

There’s a recurring story: a team decides to “just dump JSON event logs and analyze later,” and years on discovers at use time that field names differ between versions, timezones depend on the device and are inconsistent, and rows with missing keys flood the dataset; almost no records can be re-processed. The accumulation succeeded, but not in a usable form.
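
A minimal sketch of the countermeasure: validate and normalize events at write time against an explicit type (standard-library Python; the PageView fields, the version-to-version renames, and the "treat naive timestamps as UTC" policy are all assumptions for illustration).

    from dataclasses import dataclass
    from datetime import datetime, timezone

    # Minimal sketch of "make types explicit from day one" for event logs.
    # The point is validating and normalizing at write time instead of discovering drift later.

    @dataclass(frozen=True)
    class PageView:
        user_id: str
        url: str
        viewed_at: datetime          # always stored timezone-aware, in UTC

    def normalize(raw: dict) -> PageView:
        # Absorb known field renames between client versions at the boundary.
        user_id = raw.get("user_id") or raw.get("uid")
        url = raw.get("url") or raw.get("page")
        ts = raw.get("viewed_at") or raw.get("timestamp")
        if not (user_id and url and ts):
            raise ValueError(f"missing required fields: {raw!r}")
        dt = datetime.fromisoformat(ts)
        if dt.tzinfo is None:                       # device-local naive timestamp
            dt = dt.replace(tzinfo=timezone.utc)    # explicit policy instead of silent drift
        return PageView(str(user_id), str(url), dt.astimezone(timezone.utc))

    print(normalize({"uid": 42, "page": "/pricing", "timestamp": "2026-04-01T09:30:00+09:00"}))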

Another case: an operational DB and analytics ran on the same PostgreSQL, and the nightly batch dragged down daytime order processing. The fix was bolting a DWH on after the fact, then spending six months in dual-maintenance mode, effort that would never have been spent if OLTP and OLAP had been separated from day one.

I personally watched a single-PostgreSQL setup running both BI dashboards and daily aggregation slow the entire service down on monthly-batch days. The root cause in all of these is accumulating data with the wrong tool or structure for its character; they are visible-from-outside cases of the fact that design-time decisions determine your utilization runway five years later.

Data, unlike applications, cannot be rebuilt. Design as a 5-year asset.

Volume × use ladder

Note: costs reflect typical industry pricing as of April 2026; refresh periodically.

In data architecture the optimal answer is determined by volume and freshness requirements. Up front: PostgreSQL alone is enough up to ~1 TB. At 10 TB, separate the DWH. Lakehouse and streaming become realistic only past 100 TB.

Volume | Use case | Recommended stack | Freshness | Monthly cost (est.)
Up to 100 GB | Operational CRUD | PostgreSQL alone | Real-time | ~$30
Up to 1 TB | Operational + analytics | PostgreSQL + read replica | Daily OK | ~$300
Up to 10 TB | Operational + BI | Aurora + BigQuery / Snowflake | Hourly | ~$3k
Up to 100 TB | Analytics-centric | DWH (Snowflake) + S3 | Minutes to seconds | ~$3-30k
100 TB+ | ML + streaming | Lakehouse + Kafka + Flink | Real-time | $30k+

“Real-time streaming costs ~10x to operate” is the rule of thumb. 90% of business needs are met by daily batch. When you hear “real-time”, question it first — micro-batching (15-min cycles) is enough for most cases.
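
A minimal sketch of what 15-minute micro-batching looks like in practice; fetch_rows_updated_since and append_to_dwh are placeholders for the operational-DB query and the warehouse load, and the watermark-plus-loop shape is the only point being illustrated.

    import time
    from datetime import datetime, timedelta, timezone

    # Minimal sketch of 15-minute micro-batching instead of true streaming.
    # The two functions are placeholders; only the loop shape matters here.

    def fetch_rows_updated_since(watermark: datetime) -> list:
        return []   # e.g. SELECT ... WHERE updated_at > :watermark

    def append_to_dwh(rows: list) -> None:
        print(f"loaded {len(rows)} rows")

    CYCLE = timedelta(minutes=15)
    watermark = datetime.now(timezone.utc) - CYCLE

    while True:
        batch_start = datetime.now(timezone.utc)
        append_to_dwh(fetch_rows_updated_since(watermark))
        watermark = batch_start                   # the next cycle picks up anything newer
        time.sleep(CYCLE.total_seconds())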

Data architecture evolves in stages with scale and freshness. Aiming at real-time from day one is over-investment.

Architecture-level traps

Forbidden move | Why
Run analytical queries on the operational DB | Operational performance drops and customers feel it; OLTP/OLAP separation is required
Sloppy schemaless JSON storage | Becomes unusable for AI 5 years later; make types explicit from day one
UUID v4 as primary key for high-volume inserts | Index locality breaks; switch to UUID v7 (standardized in RFC 9562, 2024; sketch below)
Drop foreign-key constraints for performance | Integrity destroyed, orphan data accumulates
Store binary (images, video) in the DB itself | Classic DB bloat; S3 + URL reference is the rule
“Maybe we’ll use it” dump-everything to S3 | Becomes a swamp without a catalog; reprocessing costs more than designing fresh
Aim for fully centralized MDM in one go | Major disruption to existing core systems; phase it in via coexistence
Move PII to the analytics DB without masking | The Meta 2023 EUR 1.2B fine pattern
Petabyte-scale streaming without a dedicated SRE | Becomes a SPOF, like the 2020 AWS Kinesis outage
Operate without a data catalog | Analytics speed drops 10x answering "where is this data?"
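
On the UUID row: here is a sketch of a UUIDv7 generator following the RFC 9562 bit layout, showing why v7 keys preserve index locality; the leading 48 bits are a millisecond timestamp, so keys generated close in time also sort close together. In practice, prefer a built-in v7 generator in your language or database if one is available.

    import os
    import time
    import uuid

    # Sketch of a UUIDv7 generator per the RFC 9562 bit layout. The leading 48 bits are a
    # millisecond Unix timestamp, so keys generated close in time sort close together,
    # preserving the B-tree index locality that fully random UUIDv4 keys destroy.
    def uuid7() -> uuid.UUID:
        unix_ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)           # 48-bit timestamp
        rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
        rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
        value = (
            (unix_ts_ms << 80)   # bits 127..80: timestamp
            | (0x7 << 76)        # bits  79..76: version 7
            | (rand_a << 64)     # bits  75..64: random
            | (0b10 << 62)       # bits  63..62: variant
            | rand_b             # bits  61..0 : random
        )
        return uuid.UUID(int=value)

    a = uuid7()
    time.sleep(0.002)       # cross a millisecond boundary so the timestamp advances
    b = uuid7()
    print(a < b)            # True: later keys sort later, so inserts stay index-friendly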

Data, unlike apps, cannot be rebuilt. Design as a 5-year asset.

The AI-era lens

With AI-driven development (vibe coding) and AI utilization (RAG, agents) as the baseline assumption, the importance of keeping data in a shape AI can consume jumps. AI accuracy is a function of clean structured data and meaningful metadata, so data tidiness sets the ceiling on AI utilization.

AI-era favorable | AI-era unfavorable
Schema definitions, explicit types | Sloppy schemaless JSON
Data catalog and metadata maintained | No documentation, tribal knowledge
Vector DB and RAG (Retrieval-Augmented Generation) ready | Classical RDB only
History retained, time-series tracked (sketch below) | Overwrite-only
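
For the "history retained" row, here is a minimal sketch of append-style history (SCD type 2 flavor) instead of overwriting in place, using SQLite; the customer_plan_history table and its columns are invented for this illustration.

    import sqlite3
    from datetime import datetime, timezone

    # Minimal sketch of retaining history instead of overwriting (SCD type 2 flavor).
    # Rather than UPDATE-ing the customer's plan in place, close the old row and append a
    # new one, so time-series analysis and AI training data keep the full trail.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer_plan_history (
            customer_id INTEGER,
            plan        TEXT,
            valid_from  TEXT,
            valid_to    TEXT        -- NULL means "current row"
        )
    """)

    def change_plan(customer_id: int, new_plan: str) -> None:
        now = datetime.now(timezone.utc).isoformat()
        conn.execute(
            "UPDATE customer_plan_history SET valid_to = ? "
            "WHERE customer_id = ? AND valid_to IS NULL",
            (now, customer_id),
        )
        conn.execute(
            "INSERT INTO customer_plan_history VALUES (?, ?, ?, NULL)",
            (customer_id, new_plan, now),
        )

    change_plan(1, "free")
    change_plan(1, "pro")
    print(conn.execute("SELECT * FROM customer_plan_history").fetchall())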

When AI agents start manipulating data, structured data that SQL can reach, names readable in natural language, and clear permissions matter even more than they did when humans were the only consumers.
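
A minimal sketch of that idea: give the agent a SQL tool that is read-only by construction, so permissions are clear up front. The monthly_sales table is a toy, and the SELECT-only check is a simplified stand-in for scoped credentials, read replicas, and row-level security in a real deployment.

    import sqlite3

    # Minimal sketch of exposing SQL to an AI agent through a guarded, read-only tool.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE monthly_sales (month TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO monthly_sales VALUES (?, ?)",
                     [("2026-01", 120), ("2026-02", 135)])

    def run_agent_query(sql: str) -> list:
        if not sql.lstrip().lower().startswith("select"):
            raise PermissionError("agent tool is read-only: only SELECT is allowed")
        return conn.execute(sql).fetchall()    # execute() also rejects multi-statement input

    print(run_agent_query("SELECT month, SUM(amount) FROM monthly_sales GROUP BY month"))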

The quality of your data architecture becomes the AI utilization ceiling.

Common misreadings

  • One DB for everything is easier → Running operational (OLTP) and analytical (OLAP) workloads on one DB degrades both, and by the time you notice, both need a redo. Separate from day one.
  • Schemaless = future-friendly → Schemaless looks flexible but data quality goes uncontrolled, and in the AI era it becomes “an asset AI cannot read.” Make types explicit from day one.
  • Just keep the data, you can use it later → Sloppy accumulation produces a future state of “no usable data.” Without deciding “what / what type / what granularity” to capture, volume produces zero value.
  • We’ll figure out analytics later → Bolting on operational-to-analytical (ETL/ELT) flows after the fact creates load on the existing DB and rework. Decide on the data lake + DWH structure up front.

How to make the final call

The core of data architecture is the asymmetry: applications can be rebuilt, but past data is gone forever. To avoid the “no usable data” state when you start to use it five years later, build schema definitions, history retention, and metadata maintenance in from the start. Separate operational (OLTP) from analytical (OLAP) early and pick the right tool for each.

The decisive axis is “the AI utilization ceiling is set by data tidiness.” When AI agents are hitting SQL and RAG is doing vector search, schemaless sloppy JSON is effectively unusable. Type definitions, data catalogs, and vector DB readiness aren’t luxuries — they’re standard equipment in the AI era.

Selection priority

  1. Separate operational and analytical — different tools for OLTP vs OLAP.
  2. Make schemas and types explicit from day one — don’t escape into schemaless; lay the foundation for future AI/analytics use.
  3. Pick the tool to fit the data type: RDB / vector DB / object storage, mixed appropriately.
  4. Build metadata and catalogs — manage in a form readable by humans and AI alike.

“Data architecture is a 5-year asset.” Today’s tidiness sets tomorrow’s AI runway.

Summary

This article covered the big picture of data architecture: OLTP/OLAP separation, the per-type storage map, the volume × freshness ladder, and AI-era standard equipment.

Separate operational from analytical, make schemas and types explicit, match tools to data types, and design as a 5-year asset. The realistic answer for 2026.

The next article covers datastore selection (when to use RDB / KVS / columnar / vector DB / etc.).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.