Data Architecture - Designing Company-Wide Data as a Strategic Asset

About this article

As the third installment of the “Enterprise Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains EA-perspective Data Architecture (DA).

While the Data Architecture chapter (40 series) covered “implementation of individual systems,” this article covers “cross-enterprise consistency.” For example, “centralizing the customer master” belongs to this article, while “which DB to place it on” belongs to the 40 series. This article covers Master Data Management (MDM), data governance, company-wide data flow, and the roles of the Chief Data Officer (CDO) and data stewards, and is written for CDOs and heads of data departments.

What is EA-perspective Data Architecture in the first place

[Figure: Enterprise Data Flow and Data Catalog]

Imagine a library’s classification system. If each branch organized its books using its own classification scheme, no one could instantly answer “which branch has this book?” Only with a shared classification and search system across all branches can someone at any branch find the book they need.

EA-perspective Data Architecture (DA) is the discipline of systematically organizing the entire enterprise’s data assets. While the individual-system DB design (40 series) handles “that operation’s data,” EA-DA draws a company-wide map of “which data exists where and how it flows.”

Without DA, the same customer data exists in different formats across departments, making company-wide analytics and AI utilization impossible.

Companies where “this month’s revenue” differs by department have broken DA

DA, the second layer of EA, systematically designs the company’s data assets. Its viewpoint differs from the “DB selection / data foundation” topics handled in the data-architecture chapter: the goal is to organize, at company level, the types, relationships, flows, and owners of the data the whole organization handles.

While individual-system data architecture handles “data for running that operation,” EA’s data architecture treats all data as the company’s strategic asset. It draws, on a single company-wide picture, which system holds which data, where the original copy lives, and where it flows.

Individual DB design is tactics; EA’s DA is strategy. The viewpoint sits one level higher.

Why DA is needed

Integrating siloed data

When departments hold data in different systems, you end up with the same customer registered under 3 IDs. The data needs to be organized from a company-wide viewpoint.

Foundation for data-driven management

To use company-wide data for management decisions, you need a map of what is stored where. Companies without DA frequently see “numbers don’t match” problems in management meetings.

A commonly reported scenario: in management meetings, “this month’s revenue” from finance and “this month’s revenue” from sales differ by nearly 5%, and the debate starts from there every month. Tracing the cause shows that the handling of returns, discounts, consumption tax, and accounting timing differs subtly per department, so everyone’s number is correct by their own department’s definition. It is a typical example of how misaligned term definitions across the company are a more serious problem than the data itself.
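The divergence described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python, assuming two hypothetical departmental definitions (`sales_revenue` and `finance_revenue`) and made-up order lines; the field names and rules are illustrative assumptions, not taken from any real system, and the sample data is exaggerated so the gap is easy to see.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class OrderLine:
    net_amount: Decimal       # price before consumption tax
    tax_rate: Decimal         # e.g. Decimal("0.10")
    returned: bool            # True if the goods came back
    shipped_this_month: bool  # False if ordered now but shipped next month


# Made-up sample data for one month (hypothetical).
ORDERS = [
    OrderLine(Decimal("1000"), Decimal("0.10"), False, True),
    OrderLine(Decimal("2000"), Decimal("0.10"), False, True),
    OrderLine(Decimal("500"), Decimal("0.10"), True, True),
    OrderLine(Decimal("1500"), Decimal("0.10"), False, False),
]


def sales_revenue(orders):
    """Assumed sales view: everything ordered, tax included, returns ignored."""
    return sum((o.net_amount * (1 + o.tax_rate) for o in orders), Decimal("0"))


def finance_revenue(orders):
    """Assumed finance view: shipped lines only, tax excluded, returns removed."""
    return sum(
        (o.net_amount for o in orders if o.shipped_this_month and not o.returned),
        Decimal("0"),
    )
```

On this sample, `sales_revenue(ORDERS)` totals 5500 and `finance_revenue(ORDERS)` totals 3000. Both functions are “correct” against their own definition, which is exactly why the disagreement cannot be settled by checking the data alone.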

Regulatory / privacy compliance

Compliance with GDPR and the Personal Information Protection Act in practice requires a map of where personal information is stored. Having DA in place is the precondition for responding to audits.

Main DA components

EA’s data architecture captures company-wide data from multiple viewpoints. Beyond mere ER diagrams, it includes all viewpoints of strategy / operations / technology.

Element	Content
Conceptual data model	Major company-wide entities
Logical data model	Relationship / attribute details
Physical data model	Actual DB design
Data flow diagram	Inter-system data movement
Data catalog	Catalog of all data
Master Data Management	Uniqueness of core data
Data governance	Management regime / rules

Conceptual data model

The conceptual data model depicts the major entities handled company-wide. It expresses the core “things” of corporate activity, such as customer, product, order, employee, and partner, in about 10-30 items. What matters is a granularity that business departments can understand.

[Customer] -- purchase -- [Product]
   |                          |
   |                          |
   +-- delivery to --[Address]-- inventory -- [Warehouse]

The iron rule for conceptual models is to draw them in business language. Write “customer,” not “user_account,” so the model becomes a common language for business and tech.

Data domains

Data domains are areas that group related data, such as the “customer domain,” “product domain,” and “finance domain.” The modern approach is to split by business function and assign a data owner to each domain.

Domain	Major data
Customer	Customer master, behavior history, segments
Product	Product master, categories, prices
Transactions	Orders, deliveries, returns
Finance	Accounting, budget, actuals
HR	Employees, salary, evaluation
Partners	Business partners, contracts

In Data Mesh thinking, each domain holds ownership of and responsibility for its data, and provides other domains with high-quality data that is self-contained within the domain.

Master Data Management (MDM)

MDM is the mechanism for centrally managing core data company-wide. When master data such as “customer ID,” “product code,” and “partner code” differs per department, company-wide analysis becomes impossible. MDM creates a single source of truth.

MDM construction method	Content
Registry	Each system’s data as is, only IDs integrated
Consolidation	Read-only integrated data
Coexistence	Bidirectional sync with each system
Centralized	Aggregated to a single master system

In practice, Coexistence is chosen most often: it achieves consistency in phases without breaking existing systems.

The reason Coexistence is chosen over the others is clear. Centralized is ideal, but the migration cost of stopping existing core systems, CRM, and ERP and consolidating them into a single master is huge, and few companies can complete it without halting running businesses. Consolidation is read-only, so updates still happen in each system and dual management continues. Registry is the lightweight method that only links IDs, but it cannot resolve inconsistent attribute values (the same customer with different addresses, for example). Coexistence keeps updates in existing systems alive while cleaning up the master via bidirectional sync. It combines three benefits (not breaking existing assets, keeping initial cost down, and avoiding the failure risk of full integration), which suits the many real enterprises that assume phased introduction.
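To make the Registry style concrete, here is a minimal sketch in Python of an ID cross-reference table. The system names, local IDs, global IDs, and the helpers `build_registry` and `translate` are all hypothetical; real MDM products implement this with matching rules and persistence that are not shown here.

```python
from collections import defaultdict

# Hypothetical cross-reference rows: (source_system, local_id, global_customer_id)
CROSSWALK = [
    ("crm", "C-001", "G-1"),
    ("erp", "10045", "G-1"),
    ("ecommerce", "u_778", "G-1"),
    ("crm", "C-002", "G-2"),
]


def build_registry(rows):
    """Index the crosswalk both ways: local ID -> global ID, and global ID -> local IDs."""
    to_global = {}
    to_locals = defaultdict(dict)
    for system, local_id, global_id in rows:
        to_global[(system, local_id)] = global_id
        to_locals[global_id][system] = local_id
    return to_global, to_locals


def translate(to_global, to_locals, system, local_id, target_system):
    """Return the ID the target system uses for the same customer, or None if unknown."""
    global_id = to_global.get((system, local_id))
    if global_id is None:
        return None
    return to_locals[global_id].get(target_system)
```

With this index, `translate(to_global, to_locals, "crm", "C-001", "erp")` resolves to the ERP ID for the same customer. Note what the sketch does not do: it never touches names or addresses, which is why Registry alone leaves attribute-value inconsistencies unresolved.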

Company-wide data flow diagram

Visualize inter-system data movement at company-wide scale. Drawing which system receives data from where and sends it where reveals data dependencies.

[Core system] --orders--> [Inventory mgmt]
   |                          |
   |                          v
   +--customer info--> [CRM] --analysis--> [DWH]
                         |                    |
                         v                    v
                   [Email delivery]        [BI]

With a diagram at this level, the impact range when a system stops is visible at a glance. This also feeds directly into incident response.
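Once the flows are written down as data, the impact range can also be computed rather than read off by eye. The sketch below is a plain breadth-first traversal in Python. The edge list is one reading of the diagram above (in particular, the arrow leaving Inventory mgmt is assumed to feed the DWH), and `impacted_systems` is a hypothetical helper name.

```python
from collections import deque

# One reading of the diagram: system -> systems that receive data from it.
FLOWS = {
    "Core system": ["Inventory mgmt", "CRM"],
    "Inventory mgmt": ["DWH"],  # assumption: this arrow feeds the DWH
    "CRM": ["DWH", "Email delivery"],
    "DWH": ["BI"],
}


def impacted_systems(flows, stopped):
    """Breadth-first walk over outgoing flows: every system downstream of `stopped`."""
    seen = set()
    queue = deque([stopped])
    while queue:
        current = queue.popleft()
        for downstream in flows.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    seen.discard(stopped)  # only relevant if the flows contain a cycle
    return sorted(seen)
```

For example, `impacted_systems(FLOWS, "CRM")` returns the DWH, Email delivery, and BI, while stopping BI affects nothing downstream. Keeping the flow list in a machine-readable form is a design choice that lets the same data drive both the diagram and incident tooling.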

Enterprise-level data catalog

This is the data catalog from the data-architecture chapter, deployed company-wide at EA level. It centrally manages data metadata, owners, and usage, providing a “Google Search for data.”

Tool	Characteristics
Collibra	Commercial, enterprise
DataHub	OSS originating at LinkedIn
Alation	AI-equipped, commercial
Apache Atlas	OSS for the Hadoop ecosystem
Informatica EDC	Integrated suite

Integrating departmental catalogs is the EA-level challenge, and it requires a mechanism for bringing disparate catalogs together.

Data-governance regime

This is the organizational structure for managing company-wide data. Beyond technology, the design of roles and authority matters, and establishing a data governance committee is common practice.

Role	Responsibility
Chief Data Officer (CDO)	Company-wide data strategy
Data governance committee	Rules, priorities
Data owner	Responsible for a domain
Data steward	Daily management
Data user	Uses data, obliged to follow the rules

Establishing a CDO has been a trend since around 2015, and the position is essential at companies that treat data as a management asset.

Data and cloud

Modern EA-level DA is designed on the premise of cloud DWHs, data lakes, and lakehouses. The design paradigm has shifted from “aggregating internal DBs” to “an integrated data foundation in the cloud.”

Role	Major tool
DWH	Snowflake, BigQuery
Data lake	S3, GCS, ADLS
Lakehouse	Databricks, BigLake
Streaming	Kafka, Kinesis
ETL / ELT	Fivetran, dbt
Catalog	DataHub, Collibra

Redrawing EA’s DA on the premise of cloud is becoming the job of enterprise architects in the 2020s.

Data security and privacy

EA’s DA also includes classifying data by confidentiality. Labels such as “public / internal / confidential / top secret” govern who may handle which data and how.

Class	Target	Handling
Public	Web pages, IR info	Free
Internal	Employee-facing info	In-house only
Confidential	Sales plans, contract info	Access restricted
Top secret	Personal info, financial secrets	Strong encryption, audit logs

A personal-info location map (PII Inventory, a catalog of Personally Identifiable Information) is an essential deliverable for GDPR compliance, and it cannot be created without EA’s DA in place.
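If classification labels and a PII flag are kept as column-level metadata, a PII Inventory can be derived instead of maintained by hand. The following Python sketch assumes a hypothetical metadata shape (`Column`) and made-up systems and tables; it follows the handling table above in treating personal info as top secret.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    TOP_SECRET = 3


@dataclass(frozen=True)
class Column:
    system: str
    table: str
    name: str
    classification: Classification
    is_pii: bool


# Hypothetical catalog metadata (systems, tables, and columns are made up).
COLUMNS = [
    Column("crm", "customers", "email", Classification.TOP_SECRET, True),
    Column("crm", "customers", "segment", Classification.INTERNAL, False),
    Column("erp", "contracts", "amount", Classification.CONFIDENTIAL, False),
    Column("ecommerce", "users", "shipping_address", Classification.TOP_SECRET, True),
    Column("web", "pages", "title", Classification.PUBLIC, False),
]


def pii_inventory(columns):
    """Group PII columns by system: the location map of personal data."""
    inventory = {}
    for col in columns:
        if col.is_pii:
            inventory.setdefault(col.system, []).append(f"{col.table}.{col.name}")
    return inventory


def misclassified_pii(columns):
    """PII columns labeled below top secret, which the handling table would not allow."""
    return [c for c in columns if c.is_pii and c.classification < Classification.TOP_SECRET]
```

Here `pii_inventory(COLUMNS)` lists the personal-data columns under `crm` and `ecommerce`, and `misclassified_pii(COLUMNS)` is empty because every PII column already carries the top-secret label. The second check is the kind of rule a data steward could run on every catalog update.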

Decision criterion 1: data-utilization strategy

The more a company uses data as a management asset, the more important EA’s DA becomes. For companies that see data only as operational logs, detailed DA is excessive.

Strategy	Recommended
Data as mere records	DA at minimum
Decisions via BI	Conceptual model + catalog
Auto-judgment via AI	Full DA + governance
Data itself is product	CDO + dedicated org

Decision criterion 2: org scale and complexity

The more complex the organization, the higher the cost of setting up DA, but the larger the return on that investment too. The depth of DA needed differs between a single-product small enterprise and a diversified large enterprise.

Org	Recommended
Single business	Conceptual model + main DB design
Multiple businesses	Domain split + MDM
M&A in progress	Master alignment premising integration
Global	Per-region / per-regulation design

How to choose by case

Startup / single business

Conceptual model + BigQuery / Snowflake + dbt. A dedicated CDO is unnecessary; the engineering manager can cover the role. dbt docs are enough as a data catalog, and master integration can start when the need arises.

Mid-size enterprise / BI-driven management

Domain split + DataHub / Alation + data stewards. Split into 3-5 domains and assign each a steward who holds the role alongside other duties. Run MDM in Coexistence style for phased integration, and deliver data to decision-makers via BI tools (Tableau / Looker).

Large enterprise / diversified businesses

CDO + Collibra / Informatica + dedicated MDM team. Place the data-governance committee directly under management and keep a standing master-integration project for M&A. Manage per-region / per-regulation DA in ArchiMate, and auto-generate the PII Inventory for GDPR and the Personal Information Protection Act.

Companies where data is product (advertising, finance, SaaS)

Data Mesh + semantic layer (dbt semantic layer / Cube.js) + AI Ready design. Domains turn their data into products and provide them to other departments and customers, while AI agents query autonomously through the semantic layer. Attach freshness and quality SLAs to all data.

Phased MDM-integration practical matrix

MDM breaks down when it aims for “perfect centralization,” so phased integration that does not break existing systems is the realistic answer.

Phase	Period	Coverage	Investment guideline
1. Current inventory	1-3 months	Grasp ID systems of major masters (customer, product)	Millions
2. Registry integration	6-12 months	Make IDs of each system mutually referenceable	Tens of millions
3. Coexistence bidirectional sync	1-2 years	Bidirectional sync with each system, attribute-value unification	Tens of millions to hundreds of millions
4. Golden Record establishment	3-5 years	Establish single authoritative data	Hundreds of millions
5. Centralized (ideal)	Long-term	Fully consolidate into single master	Practically impossible at many companies

MDM investment pays off in practice only from mid-size enterprises upward. At a startup or small SaaS company, MDM is excessive: PostgreSQL master tables plus common ID-naming conventions are enough. Uber’s 2014 “dashboard wars” (the same “weekly rides” metric coexisting in 3-5 versions, with the CEO’s numbers and the field’s numbers diverging) is often cited as a typical case showing why centrally managed definitions are necessary.
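The “attribute-value unification” of phase 3 and the Golden Record of phase 4 come down to a survivorship rule: given several records already linked to one customer, decide which value wins for each attribute. The Python sketch below uses a “newest non-empty value wins” rule, which is only one common choice and an assumption here; the records and the `golden_record` helper are hypothetical.

```python
from datetime import date

# Hypothetical records for one customer, already linked by a shared global ID.
RECORDS = [
    {"source": "crm", "updated": date(2024, 5, 1),
     "name": "Taro Yamada", "address": "1-2-3 Shibuya", "phone": None},
    {"source": "erp", "updated": date(2023, 11, 20),
     "name": "Yamada Taro", "address": "9-9 Old Town", "phone": "03-0000-0000"},
    {"source": "ecommerce", "updated": date(2024, 8, 15),
     "name": "Taro Yamada", "address": "4-5-6 Meguro", "phone": None},
]


def golden_record(records, attributes):
    """For each attribute, take the value from the most recently updated record
    that actually has one (assumed rule: newest non-empty value wins)."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for attr in attributes:
        golden[attr] = next((r[attr] for r in newest_first if r[attr]), None)
    return golden
```

On the sample, the name and address come from the newest e-commerce record, while the phone number survives from the older ERP record because no newer source has one. Real deployments usually mix rules per attribute (for example, trusting one system for addresses regardless of recency), which is part of why this phase takes years rather than months.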

Do MDM as phased integration via Coexistence. Aiming for perfect centralization from the start tends to fail.

EA-perspective DA pitfalls and forbidden moves

These are typical accident patterns in EA’s DA. All of them lead to “the same customer registered under 3 IDs” or “numbers diverging at management meetings.”

Forbidden move	Why it’s bad
Not aligning term definitions company-wide	“This month’s revenue” diverges 3-8% by department and the debates never converge
Aiming for all-at-once master integration in Centralized style	Migration requires stopping existing core systems, with the risk of halting the business
Installing a data catalog and then abandoning it	No stewards, so metadata goes un-updated and rots
Splitting data domains by org name	Ownership disappears on reorganization; split by capability instead
Not creating a PII Inventory	GDPR compliance impossible, with the risk of fines like Meta’s EUR 1.2B
Connecting AI directly to the DB without a semantic layer	AI misunderstands “revenue” and mass-produces hallucinations
Discussing company-wide data strategy without a CDO	Never reaches the management agenda and stalls in interdepartmental conflict
Not operating data classification (public / internal / confidential / top secret)	Handling of personal info becomes vague, leading to regulatory violations
Introducing MDM by stopping existing systems	The business halts and a major crisis follows; integrate in phases via Coexistence instead
Managing metadata in PDF / Excel	AI can’t read it, it isn’t continuously updated, and it goes stale
Assuming “DB design exists, so EA’s DA is unneeded”	Individual DB design and the company-wide viewpoint are different; domain splitting and master alignment lie outside DB design
Assuming “buying a data catalog completes DA”	Tools are a means; the regime, rules, and operations are the substance. Installation alone leads to neglect and rot
Assuming “a CDO is only for large enterprises”	Mid-size companies now appoint CDOs too; the question is whether data strategy is on the management agenda
Assuming “master-data integration solves all problems”	Integration itself is a hard project; the key to success is proceeding in phases and realistically

Uber’s 2014 dashboard wars are told as a success story: the in-house Michelangelo (ML platform) and Querybuilder (semantic layer) helped root a culture of “metric definitions agreed via GitHub PRs,” converting metric debates into engineering work.

EA-perspective DA is “word definitions before technology.” Aligning terms company-wide is the first step.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague almost always invites later questions like “why did we decide this again?”

Conceptual data model (major 10-30 entities)
Data-domain split (who owns what)
Master-data strategy (integration method)
Data catalog (tool, operation)
Governance regime (CDO, committee)
Data-classification policy (public / internal / confidential / top secret)
Cloud DWH strategy (Snowflake / BigQuery etc.)

Author’s note - “numbers don’t match” that stopped a new project

The real danger of inconsistent company-wide data definitions surfaces not during incidents but at decision-making time.

A DX project to “create a company-wide revenue dashboard” started at a mid-size retailer, and aggregating revenue data from the finance, sales, and e-commerce DBs revealed that the 3 numbers diverged by 3-8% every month. The cause was that each system answered differently to “is revenue counted at order or at shipping?”, “with or without consumption tax?”, and “when are returns reflected?” Investigation and agreement on definitions took more than half a year, and the management-meeting dashboard ended up coming online 1.5 years late. The story is repeated as a standard talking point.

Uber’s 2014 “dashboard wars” are another famous case. During rapid growth, each team at Uber built its own data pipelines, so the same “weekly rides” metric coexisted in 3-5 versions on internal dashboards, and the CEO’s numbers diverged from the field’s. Eventually Uber developed Michelangelo (ML platform) and Querybuilder (semantic layer) in-house and switched to defining each metric once company-wide and reusing it. After that, a culture of “metric definitions agreed via GitHub PRs” took root inside Uber, and metric debates were converted into engineering work.

Both drive home the decisive value of “aligning word definitions company-wide before technology.” At a company without EA’s DA, the moment an AI agent is asked “what’s this month’s revenue?” it can return 3 different answers.

How to choose

The core of EA-perspective DA is the viewpoint of designing not individual DBs but company-wide data as a strategic asset. Siloed data, the same customer registered under 3 IDs, numbers diverging in management meetings: these aren’t DB technology problems but flaws in the enterprise-level data system. The work of EA-level DA is to split domains and assign data owners, secure the uniqueness of core data via MDM, and create a company-wide map via conceptual models, data flows, and catalogs. The realistic approach is phased integration via Coexistence MDM, because aiming for perfect centralization breaks down.

Another decisive axis is building a data space that AI agents can understand autonomously. For an AI that hears “revenue” to reach the correct aggregation logic, a semantic layer (dbt semantic layer, Cube.js) is required. Companies stay competitive in AI usage only when, following Data Mesh, domains provide data products to other departments and to AI, and an API-referenceable, continuously updated catalog is in place.

AI-era decision axes

When AI-driven dev (vibe coding) and AI usage are the premise, EA’s DA is redesigned as a data space accessible by AI agents. In the era when AI autonomously seeks data, whether company-wide data is visible to AI decides competitiveness.

Favored in the AI era	Disfavored in the AI era
Data Mesh (domain ownership)	Centralized silos
API-referenceable data	Excel, files
Semantic layer (term definitions)	Undefined column names
Continuously-updated catalog	Half-year-old snapshots

For AI Ready data architecture, setting up a semantic layer (dbt semantic layer, Cube.js, etc.) is drawing attention. The design must let an AI that hears “revenue” reach the correct aggregation logic.

AI-era DA is designed in vocabulary that AI understands. The semantic layer is key.

Selection priorities

Domain splitting and data owners - clarify ownership, governance foundation
MDM via phased Coexistence - perfect centralization fails, don’t break existing
Data classification for privacy compliance - public / internal / confidential / top secret, PII Inventory setup
Semantic layer to hand AI vocabulary - dbt semantic layer / Cube.js, AI Ready design

“Design data in vocabulary AI understands.” Domain split + MDM + semantic layer is the core.

Semantic layers make AI’s data understanding accurate

When the calculation definitions of business terms are codified in a dbt semantic layer or Cube.js, for example “revenue = sum of amount in the orders table (excluding returns),” AI can generate accurate SQL when asked “what’s this month’s revenue.” Without a semantic layer, AI has to guess the definition of “revenue” and can pick the wrong table.
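The idea can be illustrated without any particular product. The Python sketch below is a toy metric registry, not the configuration syntax of dbt or Cube.js: the `Metric` shape, the `compile_metric` helper, and the table and column names (`orders`, `amount`, `is_returned`, `ordered_at`) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str  # aggregation over columns of `table`
    filters: tuple   # conditions that always belong to the definition


# Hypothetical agreed definition: revenue = sum of amount in orders, excluding returns.
METRICS = {
    "revenue": Metric(
        name="revenue",
        table="orders",
        expression="SUM(amount)",
        filters=("is_returned = FALSE",),
    ),
}


def compile_metric(metric_name, date_column, start, end):
    """Render the agreed definition as SQL with named placeholders for the period.

    An undefined term raises KeyError instead of being guessed.
    """
    metric = METRICS[metric_name]
    conditions = list(metric.filters) + [
        f"{date_column} >= :start",
        f"{date_column} < :end",
    ]
    sql = (
        f"SELECT {metric.expression} AS {metric.name} "
        f"FROM {metric.table} WHERE " + " AND ".join(conditions)
    )
    return sql, {"start": start, "end": end}
```

An agent that is asked for this month’s revenue would call something like `compile_metric("revenue", "ordered_at", "2026-01-01", "2026-02-01")` and run the returned statement, rather than composing its own aggregation. The point of the design is that the definition lives in one reviewed place and the term “revenue” either resolves to it or fails loudly.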

Data mesh and AI compatibility

Data mesh (where domain teams own data quality and publication) pairs well with AI agents that autonomously discover and fetch needed data. When each domain publishes data products via API, AI can find data via catalog and fetch via API — building autonomous workflows.
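The discovery step can be sketched as a catalog lookup that also respects each product’s freshness SLA. Everything in the Python example below is hypothetical: the `DataProduct` fields, the endpoint paths, the tags, and the `discover` helper are illustrative assumptions rather than the API of any catalog tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataProduct:
    name: str
    domain: str
    endpoint: str            # hypothetical API path published by the domain
    tags: frozenset
    last_updated: datetime
    freshness_sla: timedelta


NOW = datetime(2026, 1, 10, 9, 0)

# Made-up catalog entries published by three domains.
PRODUCT_CATALOG = [
    DataProduct("customer_segments", "customer", "/customer/v1/segments",
                frozenset({"customer", "segment"}),
                datetime(2026, 1, 10, 3, 0), timedelta(hours=24)),
    DataProduct("daily_orders", "transactions", "/transactions/v1/daily-orders",
                frozenset({"orders", "revenue"}),
                datetime(2026, 1, 7, 3, 0), timedelta(hours=24)),
    DataProduct("monthly_actuals", "finance", "/finance/v1/actuals",
                frozenset({"revenue", "finance"}),
                datetime(2026, 1, 1, 0, 0), timedelta(days=31)),
]


def discover(catalog, tag, now):
    """Return products carrying `tag` whose data is still within its freshness SLA."""
    return [
        p for p in catalog
        if tag in p.tags and now - p.last_updated <= p.freshness_sla
    ]
```

Searching for `"revenue"` at `NOW` returns only `monthly_actuals`, because `daily_orders` was last updated three days ago against a 24-hour SLA. Filtering on the SLA at discovery time is one way to keep an agent from silently building on stale data.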

https://en.senkohome.com/arch-intro-ea-aa/ https://en.senkohome.com/arch-intro-ea-ba/ https://en.senkohome.com/arch-intro-ea-overview/

Summary

This article covered EA-perspective Data Architecture, including conceptual models, domains, MDM, catalog, PII Inventory, semantic layer, and AI Ready design.

Clarify ownership via domain splitting and data owners, run MDM in phases via Coexistence, classify data for privacy, and hand AI its vocabulary via a semantic layer. That is the practical answer for EA-perspective DA in 2026.

Next time we’ll cover Application Architecture (AA) (system portfolio, integration patterns).


I hope you’ll read the next article as well.