Data Architecture

Data Architecture Overview — The Foundation for AI in Your Org

Data Architecture Overview — The Foundation for AI in Your Org

About this article

This article is the first article in the “Data Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of data architecture.

In 2026, “we want our data trained into AI” has become a default executive request, and whether your data is set up to be handled across systems is being re-evaluated as the precondition for AI utilization. This article surveys OLTP/OLAP separation, the data flow, datastore choices, and the structural fact that the quality of your data architecture sets the upper limit of AI utilization.

A full list of all articles in this category, with summaries and learning points, is available at the following page.

Data Architecture — Article Indexen.senkohome.com/arch-intro-index-data/

What is data architecture

Think of a company’s vault and ledger. Cash (operational data) is stored securely, while the ledger (analytical data) reveals the state of the business. Both deal with the same “money,” yet how they’re stored and how they’re used are completely different. Mix the two and both daily transactions and financial reporting slow down.

Data architecture is the discipline of designing where data lives, in what format, and how it flows between stores. It covers the separation of operational (OLTP) and analytical (OLAP) systems, datastore selection, and data-flow design.

Without data architecture, data gets siloed per system, making cross-organizational analytics and AI utilization impossible.

What data architecture covers

There was a time when “operational DB = data architecture” was synonymous, but as analytics, ML, and AI utilization have become routine, OLTP/OLAP separation, streaming processing, and data lakes have entered the picture, and data architecture has solidified as its own design domain.

Data is the gasoline of the digital economy. The quality of your data strategy maps directly to business competitiveness.

Why treat it as a separate architecture

1. Operational and analytical demands are fundamentally different

Operational: one transaction, fast and safe. Analytical: aggregate and analyze large volumes. Trying to do both on one DB drops both kinds of performance, so splitting the structure is mainstream.

2. Data spans organizations

Customer and product data are referenced by multiple systems. Holding them in different shapes per system fragments the data and makes enterprise-wide analytics impossible.

3. Data value is determined by accumulation

Applications can be rebuilt; past data is gone forever. A sloppy initial design means no usable data when you finally want to use it five years later.

Major data classifications

ClassTraitTypical storage
Operational (OLTP)Transactions, orders, customers — daily businessRDB (PostgreSQL, MySQL)
Analytical (OLAP)Operational data shaped for analyticsDWH
UnstructuredImages, video, logs, documentsObject storage (S3, etc.)
EventClicks, page views, sensor values — time-seriesStreaming (Kafka, etc.)

Each data type has different optimal storage and processing, so trying to handle all of it in one DB breaks down.

Typical data flow

The flow from operational systems through an analytics platform to BI tools and ML has converged on roughly this pattern:

Data Flow Overview: From Business Data to BI & Machine Learning A vault (business data) and a ledger (analytics data) differ in storage and usage Business Systems (OLTP) E-Commerce Order DB Customer CRM Access Logs IoT Sensors ETL/ELT Extract Transform Load dbt / Airflow Dataflow Data Lake Store raw data as-is S3 / ADLS / GCS Structured + unstructured data DWH Pre-formatted analytics DB BigQuery / Snowflake Star / snowflake schema Machine Learning / AI Training data → model training Vector DB → RAG search BI Tools Dashboards / reports Tableau / Looker / Metabase Real-time analytics Kafka / Kinesis → stream processing Separating transactional (OLTP) and analytical (OLAP) is mainstream Trying to cover everything with one DB leads to failure Data is the fuel of the digital economy. Apps can be rebuilt, but past data can never be recovered

Data lake = “a place to dump raw data.” DWH = “a DB shaped for analysis.” ETL/ELT (Extract / Transform / Load) = “the moving and reshaping machinery.” Memorizing those framings is step one.

Datastore choices

TypeStrong atExamples
RDB (relational)ACID, complex JOINsPostgreSQL, MySQL
KVSUltra-fast single-key readsRedis, DynamoDB
Document DBJSON-flavored flexible schemaMongoDB, Firestore
Columnar DB (OLAP)Aggregating large volumesBigQuery, Snowflake
Time-series DBSensor / metric retentionInfluxDB, TimescaleDB
Graph DBExploring relationshipsNeo4j
Vector DBSimilarity search, AI embeddingsPinecone, pgvector

What you must decide — what’s your project’s answer?

For each item, articulate your project’s answer in 1-2 sentences.

Datastore / search / cache

ItemExamples
Operational DBPostgreSQL / MySQL / SQL Server
Analytical DWHSnowflake / BigQuery / Redshift
Data lakeS3 / ADLS / GCS
CacheRedis / Memcached
Search engineElasticsearch / OpenSearch
Vector DBpgvector / Pinecone / Weaviate

Data integration / governance

ItemExamples
ETL/ELTdbt / Airflow / Dataflow
StreamingKafka / Kinesis / Pub/Sub
Data modelingStar / 3NF / Data Vault
Master managementCentralized MDM / federated SoR (System of Record)
Data catalogDataHub / Glue / Collibra
Access controlIAM / Row-Level Security

Author’s note — “stored, but not usable” cases

There’s a recurring story: a site decided to “just dump JSON event logs and analyze later” for years, only to find at use time that “field names differ between versions, timezones depend on the device and are inconsistent, and rows with missing keys flood the dataset” — almost no records were re-processable. The accumulation succeeded, but they hadn’t accumulated it in a usable form.

Another case: an operational DB and analytics ran on the same PostgreSQL, and every nightly batch dragged daytime order processing. The fix was bolting on a DWH after the fact, then spending six months in dual-maintenance mode. Effort that wouldn’t have been spent if OLTP and OLAP had been separate from day one.

I personally watched a single-PostgreSQL setup running both BI dashboards and daily aggregation slow the entire service down on monthly-batch days. The root cause is “accumulating with the wrong tool/structure for the data’s character.” Design-time decisions determine your utilization runway five years later — these are visible-from-outside cases of that fact.

Data, unlike applications, cannot be rebuilt. Design as a 5-year asset.

Volume × use ladder

Note: industry rates as of April 2026. Periodic refresh required.

In data architecture the optimal answer is determined by volume and freshness requirements. Up front: PostgreSQL alone is enough up to ~1 TB. At 10 TB, separate the DWH. Lakehouse and streaming become realistic only past 100 TB.

VolumeUse caseRecommended DBFreshnessMonthly cost (est)
Up to 100 GBOperational CRUDPostgreSQL aloneReal-time~$30
Up to 1 TBOperational + analyticsPostgreSQL + read replicaDaily OK~$300
Up to 10 TBOperational + BIAurora + BigQuery / SnowflakeHourly~$3k
Up to 100 TBAnalytics-centricDWH (Snowflake) + S3Minutes-to-seconds~$3-30k
100 TB+ML + streamingLakehouse + Kafka + FlinkReal-time$30k+

“Real-time streaming costs ~10x to operate” is the rule of thumb. 90% of business needs are met by daily batch. When you hear “real-time”, question it first — micro-batching (15-min cycles) is enough for most cases.

Data architecture evolves in stages with scale and freshness. Aiming at real-time from day one is over-investment.

Knowledge structure of this category

This category is composed of 7 articles in total. The structure follows store -> structure -> move -> govern, learning how to handle data step by step.

Data Architecture Section Structure Learn data handling step by step: Store → Structure → Move → Govern Store 1. Data Store Selection RDB / KVS / Columnar / Vector 2. Data Modeling Normalization / star schema / primary keys Getting this wrong affects everything Structure 3. DWH / Data Lake OLTP/OLAP separation 4. Lakehouse Unified architecture Transactional and analytical separate the platforms Move 5. ETL / ELT Batch Integration 6. Streaming Real-time integration 90% covered by daily batch is sufficient Govern 7. Data Governance Catalog / quality management Access Permissions Cross-cutting theme. In the AI era, metadata preparation is a prerequisite for AI utilization Unlike apps, data can't be rebuilt. Design it as an asset for 5 years from now In the AI era, metadata preparation is a prerequisite for AI utilization. Don't postpone it

Store first selects where data lives (RDB, KVS, columnar, vector DB, etc.). Getting this wrong ripples through everything downstream.

Structure designs table schemas (normalization, star, etc.) and separates operational vs analytical platforms (DWH, data lake) on top of the chosen store.

Move designs how to transport structured data from operational DBs to the analytics platform. For most projects batch (ETL/ELT) is enough; streaming enters the picture only when real-time freshness is required.

Govern — data governance — is a cross-cutting theme covering catalog maintenance, quality management, and access-control design. In the AI era, metadata maintenance is a precondition for AI utilization, so it’s worth starting early rather than deferring.

Architecture-level traps

Forbidden moveWhy
Run analytical queries on the operational DBOperational performance drops, customer perception suffers — OLTP/OLAP separation required
Sloppy schemaless JSON storageBecomes unusable for AI 5 years later — make types explicit from day one
UUID v4 as primary key for high-volume insertsIndex locality breaks; switch to UUID v7 (RFC standardized 2024)
Drop foreign-key constraints for performanceIntegrity destroyed, orphan data accumulates
Store binary (images, video) in the DB itselfS3 + URL reference is the rule; classic DB bloat
”Maybe we’ll use it” dump-everything to S3Becomes a swamp without a catalog; reprocessing costs more than designing fresh
Aim for fully Centralized MDM in one goMajor outage to existing core systems; phase via Coexistence
Move PII to analytics DB without maskingThe Meta 2023 EUR 1.2B fine pattern
Petabyte-scale streaming without a dedicated SREBecomes a SPOF, like the 2020 AWS Kinesis outage
Operate without a data catalogAnalytics speed drops 10x trying to answer “where is this data?”
Assuming one DB for everything is easierRunning OLTP and OLAP on one DB drops both performances; by the time you notice, both need a redo
Postponing analytics design for laterBolting on ETL/ELT flows after the fact creates load on the existing DB and rework; decide the structure up front

Data, unlike apps, cannot be rebuilt. Design as a 5-year asset.

AI decision axes

AI-era favorableAI-era unfavorable
Schema definitions, explicit typesSloppy schemaless JSON
Data catalog and metadata maintainedNo documentation, tribal knowledge
Vector DB and RAG (Retrieval-Augmented Generation) readyClassical RDB only
History retained, time-series trackedOverwrite-only
  1. Separate operational and analytical — different tools for OLTP vs OLAP.
  2. Make schemas and types explicit from day one — don’t escape into schemaless; lay the foundation for future AI/analytics use.
  3. Pick the tool to fit the data typeRDB / Vector DB / object storage, mixed appropriately.
  4. Build metadata and catalogs — manage in a form readable by humans and AI alike.

Vector DB and RAG have become a new layer of data architecture

In products that leverage LLMs, a RAG pipeline — vectorizing internal documents and past Q&As for retrieval to improve LLM answer accuracy — is becoming a standard configuration. Alongside traditional RDB + object storage, vector DBs (pgvector, Pinecone, Weaviate) are establishing themselves as the third layer of data architecture.

Explicit schemas are the precondition for Text-to-SQL

To have AI generate SQL from natural-language questions like “show last month’s top 10 by revenue,” the table schemas, column meanings, and inter-table relationships must be explicit as metadata. With schemaless JSON storage or tables with unclear naming, AI cannot generate accurate SQL.

Summary

This article covered the big picture of data architectureOLTP/OLAP separation, the per-type storage map, the volume × freshness ladder, and AI-era standard equipment.

Separate operational from analytical, make schemas and types explicit, match tools to data types, and design as a 5-year asset. The realistic answer for 2026.

The next article covers datastore selection (when to use RDB / KVS / columnar / vector DB / etc.).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.

📚 Series: Architecture Crash Course for the Generative-AI Era (39/89)