Enterprise Architecture

Data Architecture - Designing Company-Wide Data as Strategic Asset

About this article

As the third installment of the “Enterprise Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains EA-perspective Data Architecture (DA).

While the Data Architecture chapter (40 series) handled “implementation of individual systems,” this article handles “cross-enterprise consistency.” For example, “centralizing the customer master” belongs to this article, while “which DB to place it on” is the 40 series’s job. This article covers MDM, data governance, company-wide data flow, and the CDO and data steward roles, and is written for CDOs and heads of data departments.

What is EA-perspective Data Architecture in the first place

Imagine a library’s classification system. If each branch organized its books using its own classification scheme, no one could instantly answer “which branch has this book?” Only with a shared classification and search system across all branches can someone at any branch find the book they need.

EA-perspective Data Architecture (DA) is the discipline of systematically organizing the entire enterprise’s data assets. While the individual-system DB design (40 series) handles “that operation’s data,” EA-DA draws a company-wide map of “which data exists where and how it flows.”

Without DA, the same customer data exists in different formats across departments, making company-wide analytics and AI utilization impossible.

Companies where “this month’s revenue” differs by department have broken DA

The second EA layer, DA, systematically designs the company’s data assets. Its viewpoint differs from the “DB selection / data foundation” topics handled in the data-architecture chapter: the goal is to organize, at company level, the types, relationships, flows, and owners of the data the whole organization handles.

While individual-system data architecture handles “data for running that operation,” EA’s data architecture handles all data as the company’s strategic asset. Draw on a single company-wide picture which system holds which data, where the original is, and where it flows.

Individual DB design is tactics; EA’s DA is strategy. The viewpoint sits one level higher.

Why DA is needed

Integrating siloed data

When departments hold data in different systems, you end up with the same customer registered under 3 IDs. The data needs to be organized from a company-wide viewpoint.

Foundation for data-driven management

To use company-wide data for management decisions, you need a map of what data lives where. Companies without DA frequently hit “numbers don’t match” problems in management meetings.

A commonly reported scenario: in management meetings, “this month’s revenue” from finance and “this month’s revenue” from sales differ by nearly 5%, and the debate starts from there every month. When the cause is traced, the treatment of returns, discounts, consumption tax, and accounting timing turns out to differ subtly per department, so everyone’s number is correct by their own department’s definition. It is a typical example of how misaligned word definitions across the company are a more serious problem than the data itself.
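To make the mechanism concrete, here is a minimal sketch of how two departments can compute different “revenue” from identical orders. The order data, the field names, and both department definitions are assumptions made up for illustration, not figures from any real company.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Order:
    ordered_on: date
    shipped_on: Optional[date]
    amount_ex_tax: int  # hypothetical: amount without consumption tax
    tax: int
    returned: bool = False


# Hypothetical sample orders around the end of June 2025.
ORDERS = [
    Order(date(2025, 6, 28), date(2025, 7, 1), 10_000, 1_000),
    Order(date(2025, 6, 30), date(2025, 6, 30), 20_000, 2_000),
    Order(date(2025, 6, 15), date(2025, 6, 16), 5_000, 500, returned=True),
]


def in_june(d: Optional[date]) -> bool:
    return d is not None and (d.year, d.month) == (2025, 6)


def sales_revenue(orders) -> int:
    """Assumed sales definition: counted at order date, tax included, returns ignored."""
    return sum(o.amount_ex_tax + o.tax for o in orders if in_june(o.ordered_on))


def finance_revenue(orders) -> int:
    """Assumed finance definition: counted at shipping date, tax excluded, returns removed."""
    return sum(
        o.amount_ex_tax for o in orders if in_june(o.shipped_on) and not o.returned
    )


print("sales:", sales_revenue(ORDERS))
print("finance:", finance_revenue(ORDERS))
```

Neither function is wrong. Each implements its own department’s definition faithfully, which is why the fix is an agreed definition rather than a data correction.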

Regulatory / privacy compliance

For GDPR (General Data Protection Regulation) and Personal Information Protection Act compliance, a personal-info location map is required. DA setup is the premise of audit response.

Main DA components

EA’s data architecture captures company-wide data from multiple viewpoints. Beyond mere ER diagrams, it includes all viewpoints of strategy / operations / technology.

| Element | Content |
| --- | --- |
| Conceptual data model | Major company-wide entities |
| Logical data model | Relationship / attribute details |
| Physical data model | Actual DB design |
| Data flow diagram | Inter-system data movement |
| Data catalog | Catalog of all data |
| Master Data Management | Uniqueness of core data |
| Data governance | Management regime / rules |

Conceptual data model

The conceptual data model draws the major entities handled company-wide. It expresses the core “things” of corporate activity, such as customer, product, order, employee, and partner, in about 10-30 items. What matters is a granularity that business departments can understand.

[Customer] -- purchase -- [Product]
   |                          |
   |                          |
   +-- delivery to --[Address]-- inventory -- [Warehouse]

The iron rule for conceptual models is drawing in business language. Write “customer” not “user_account” - making it the common language of business and tech.

Data domains

Data domains are areas that group related data, such as the “customer domain,” “product domain,” and “finance domain.” The modern approach is to split by business function and assign a data owner to each domain.

| Domain | Major data |
| --- | --- |
| Customer | Customer master, behavior history, segments |
| Product | Product master, categories, prices |
| Transactions | Orders, deliveries, returns |
| Finance | Accounting, budget, actuals |
| HR | Employees, salary, evaluation |
| Partners | Business partners, contracts |

In Data Mesh thinking, each domain holds ownership of and responsibility for its data, and provides high-quality data that is complete within the domain to other domains.
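The idea that every dataset belongs to exactly one domain with a named owner can be sketched as a small registry. The owner titles and dataset names below are hypothetical. The point is only that “who owns this?” has a single, checkable answer.

```python
from dataclasses import dataclass, field


@dataclass
class DataDomain:
    name: str
    owner: str  # hypothetical role title, not taken from the article
    datasets: list[str] = field(default_factory=list)


# Hypothetical registry covering three of the domains in the table above.
DOMAINS = [
    DataDomain("Customer", "Head of CRM",
               ["customer_master", "behavior_history", "segments"]),
    DataDomain("Product", "Head of Merchandising",
               ["product_master", "categories", "prices"]),
    DataDomain("Finance", "Head of Accounting",
               ["accounting", "budget", "actuals"]),
]


def owner_of(dataset: str, domains=DOMAINS) -> str:
    """Return the owner of a dataset, insisting on exactly one owning domain."""
    matches = [d for d in domains if dataset in d.datasets]
    if len(matches) != 1:
        raise LookupError(
            f"{dataset!r} must belong to exactly one domain, found {len(matches)}"
        )
    return matches[0].owner


print(owner_of("prices"))
```

Raising an error for datasets with zero or several owners is a deliberate choice here: unowned and doubly owned data are exactly the cases governance needs to surface.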

Master Data Management (MDM)

MDM is the mechanism for centrally managing core data company-wide. When master data such as “customer ID,” “product code,” and “partner code” differs per department, company-wide analysis becomes impossible. MDM creates the single source of truth.

flowchart TB
    subgraph BEFORE["Without MDM (typical failure)"]
        SYS1[CRM<br/>customer ID=A001] -.-> SYS2[ERP<br/>customer ID=12345]
        SYS2 -.-> SYS3[Accounting<br/>customer ID=Tokyo Taro]
        SYS3 -.-> Q1[Company-wide analysis<br/>impossible]
    end
    subgraph AFTER["Coexistence MDM"]
        MDM[(MDM<br/>customer master)]
        SYS4[CRM] <-->|bidirectional sync| MDM
        SYS5[ERP] <-->|bidirectional sync| MDM
        SYS6[Accounting] <-->|bidirectional sync| MDM
        MDM --> ANALYTICS[Company-wide analysis<br/>possible]
    end
    classDef bad fill:#fee2e2,stroke:#dc2626;
    classDef good fill:#dcfce7,stroke:#16a34a;
    classDef mdm fill:#dbeafe,stroke:#2563eb,stroke-width:2px;
    class BEFORE,SYS1,SYS2,SYS3,Q1 bad;
    class AFTER,SYS4,SYS5,SYS6,ANALYTICS good;
    class MDM mdm;

| MDM construction method | Content |
| --- | --- |
| Registry | Each system’s data as is, only IDs integrated |
| Consolidation | Read-only integrated data |
| Coexistence | Bidirectional sync with each system |
| Centralized | Aggregated to a single master system |

In practice, Coexistence is chosen most often: it is the realistic way to reach consistency in phases without breaking existing systems.

The reason Coexistence wins over the others is clear. Centralized is ideal, but the migration cost of stopping the existing core system, CRM, and ERP and consolidating them into a single master is huge, and few companies can complete it without halting running businesses. Consolidation is read-only, so updates still happen in each system and dual management continues. Registry is the lightweight method that only links IDs, so it cannot resolve inconsistent attribute values (the same customer with different addresses, for example). Coexistence keeps updates in existing systems alive while cleaning up the master via bidirectional sync. It offers three benefits: it does not break existing assets, it keeps initial cost down, and it avoids the failure risk of a full integration. That fits most real enterprises, which have to introduce MDM in phases.
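As a rough illustration of the lightest method, Registry, the sketch below keeps only a cross-reference from each system’s local ID to a shared global ID. The class name `IdRegistry` and the `G000001` ID format are assumptions for this example. The three local IDs are the ones shown in the flowchart above.

```python
class IdRegistry:
    """Registry-style MDM sketch: link local IDs, leave attributes untouched."""

    def __init__(self):
        self._to_global = {}  # (system, local_id) -> global_id
        self._next = 1

    def link(self, system, local_id, global_id=None):
        """Link a local ID to a global ID, minting a new one if none is given.

        If the local ID is already linked, the existing global ID is returned.
        """
        key = (system, local_id)
        if key in self._to_global:
            return self._to_global[key]
        if global_id is None:
            global_id = f"G{self._next:06d}"  # hypothetical ID format
            self._next += 1
        self._to_global[key] = global_id
        return global_id

    def resolve(self, system, local_id):
        """Return the global ID for a local ID, or None if it is not linked."""
        return self._to_global.get((system, local_id))

    def local_ids(self, global_id):
        """List every (system, local_id) pair known for one global ID."""
        return sorted(k for k, g in self._to_global.items() if g == global_id)


registry = IdRegistry()
gid = registry.link("CRM", "A001")
registry.link("ERP", "12345", gid)
registry.link("Accounting", "Tokyo Taro", gid)
print(gid, registry.local_ids(gid))
```

This makes the limitation described above visible: the registry can say that three records are the same customer, but it stores no attributes, so conflicting addresses stay conflicting.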

Company-wide data flow diagram

Visualize inter-system data movement at company-wide scale. Drawing which system receives data from where and sends it where reveals data dependencies.

[Core system] --orders--> [Inventory mgmt]
   |                          |
   |                          v
   +--customer info--> [CRM] --analysis--> [DWH]
                         |                    |
                         v                    v
                   [Email delivery]        [BI]

With a diagram at this level, the impact range when a system stops is visible at a glance. This also feeds directly into incident response.
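Once the flow diagram is held as data, the impact range can be computed instead of read by eye. The sketch below encodes the labeled arrows of the diagram above as an adjacency map and walks it breadth-first. The unlabeled arrow below Inventory mgmt is left out because its target is ambiguous in the diagram, and the function name `impacted_by` is an assumption.

```python
from collections import deque

# Edges taken from the diagram above: source system -> systems it feeds.
FLOWS = {
    "Core system": ["Inventory mgmt", "CRM"],
    "CRM": ["DWH", "Email delivery"],
    "DWH": ["BI"],
}


def impacted_by(system, flows=FLOWS):
    """Return every downstream system that loses input if `system` stops."""
    seen = set()
    queue = deque(flows.get(system, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        queue.extend(flows.get(node, []))
    return sorted(seen)


print(impacted_by("CRM"))
print(impacted_by("Core system"))
```

The same map can be reversed to answer the opposite question, namely which upstream systems a broken report depends on.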

Enterprise-level data catalog

This is the data catalog handled in the data-architecture chapter, deployed company-wide at EA level. It centrally manages data metadata, owners, and usage, providing a “Google Search for data.”

| Tool | Characteristics |
| --- | --- |
| Collibra | Commercial, enterprise |
| DataHub | LinkedIn OSS |
| Alation | AI-equipped, commercial |
| Apache Atlas | Hadoop-system OSS |
| Informatica EDC | Integrated suite |

Integrating departmental catalogs is the EA-level challenge, and it requires mechanisms for bringing disparate catalogs together.

Data-governance regime

This is the organizational regime for managing company-wide data. Beyond technology, the design of roles and authority matters, and establishing a data governance committee is common practice.

| Role | Responsibility |
| --- | --- |
| Chief Data Officer (CDO) | Company-wide data strategy |
| Data governance committee | Rules, priorities |
| Data owner | Domain responsible |
| Data steward | Daily management |
| Data user | User, compliance obligation |

Establishing a CDO has been a trend since 2015, and it is becoming a required position at companies that treat data as a management asset.

Data and cloud

Modern EA’s DA is designed premising cloud DWH (Data Warehouse), data lake, and lakehouse. The design paradigm has shifted from “aggregating internal DBs” to “integrated data foundation in cloud.”

| Role | Major tool |
| --- | --- |
| DWH | Snowflake, BigQuery |
| Data lake | S3, GCS, ADLS |
| Lakehouse | Databricks, BigLake |
| Streaming | Kafka, Kinesis |
| ETL / ELT | Fivetran, dbt |
| Catalog | DataHub, Collibra |

Redrawing EA’s DA on the premise of cloud is becoming the work of 2020s enterprise architects.

Data security and privacy

EA’s DA also includes data-confidentiality classification. Labeling data as “public / internal / confidential / top secret” governs who may handle which data and how.

| Class | Target | Handling |
| --- | --- | --- |
| Public | Web pages, IR info | Free |
| Internal | Employee-facing info | In-house only |
| Confidential | Sales plans, contract info | Access restricted |
| Top secret | Personal info, financial secrets | Strong encryption, audit logs |

A personal-info location map (PII Inventory, the catalog of Personally Identifiable Information) is a required output for GDPR compliance, and it cannot be created without EA’s DA in place.
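One way to see why the inventory depends on DA being in place: if every catalogued column carries a classification label and a PII flag, the location map falls out as a simple query. The catalog entries, column names, and the `is_pii` flag below are hypothetical, a sketch rather than any specific catalog tool’s data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    TOP_SECRET = 3


@dataclass(frozen=True)
class Column:
    system: str
    table: str
    name: str
    classification: Classification
    is_pii: bool = False


# Hypothetical catalog entries spanning several systems.
CATALOG = [
    Column("CRM", "customers", "email", Classification.TOP_SECRET, is_pii=True),
    Column("CRM", "customers", "segment", Classification.INTERNAL),
    Column("ERP", "contracts", "terms", Classification.CONFIDENTIAL),
    Column("Accounting", "invoices", "billing_name", Classification.TOP_SECRET, is_pii=True),
    Column("Web", "pages", "title", Classification.PUBLIC),
]


def pii_inventory(catalog):
    """Map each system to the table.column locations that hold personal info."""
    inventory = {}
    for col in catalog:
        if col.is_pii:
            inventory.setdefault(col.system, []).append(f"{col.table}.{col.name}")
    return inventory


def misclassified_pii(catalog):
    """Personal info labeled below top secret contradicts the table above."""
    return [c for c in catalog if c.is_pii and c.classification < Classification.TOP_SECRET]


print(pii_inventory(CATALOG))
print(misclassified_pii(CATALOG))
```

The second function is a small governance check: it lists any personal-info column whose label would allow weaker handling than the classification table prescribes.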

Decision criterion 1: data-utilization strategy

The more a company utilizes data as a management asset, the more important EA’s DA becomes. For companies that see data only as operational logs, detailed DA is excessive.

| Strategy | Recommended |
| --- | --- |
| Data as mere records | DA at minimum |
| Decisions via BI (Business Intelligence) | Conceptual model + catalog |
| Auto-judgment via AI | Full DA + governance |
| Data itself is product | CDO + dedicated org |

Decision criterion 2: org scale and complexity

The more complex the organization, the higher the cost of setting up DA, but the larger the return on that investment as well. The DA depth needed differs between a single-product small enterprise and a diversified large enterprise.

| Org | Recommended |
| --- | --- |
| Single business | Conceptual model + main DB design |
| Multiple businesses | Domain split + MDM |
| M&A in progress | Master alignment premising integration |
| Global | Per-region / per-regulation design |

How to choose by case

Startup / single business

Conceptual model + BigQuery / Snowflake + dbt. A dedicated CDO is unnecessary; the engineering manager can double in the role. dbt docs is enough for a data catalog, and master integration can start when it is needed.

Mid-size enterprise / BI-driven management

Domain split + DataHub / Alation + data-steward placement. Split into 3-5 domains and assign a part-time steward to each. Use Coexistence MDM for phased integration, and deliver data to decision-makers via BI tools (Tableau / Looker).

Large enterprise / diversified businesses

Establish a CDO + Collibra / Informatica + a dedicated MDM team. Place the data-governance committee directly under management and keep a standing master-integration project for M&A response. Manage region-based and regulation-based DA in ArchiMate, and auto-generate the PII Inventory for GDPR / Personal Information Protection Act.

Companies where data is product (advertising, finance, SaaS)

Data Mesh + semantic layer (dbt semantic layer / Cube.js) + AI Ready design. Domains turn their data into products and provide them to other departments and customers, with AI agents querying autonomously via the semantic layer. Attach freshness and quality SLAs to all data.
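A freshness SLA only helps if something checks it. As a minimal sketch, assuming each data product records its last update time and an agreed maximum age, a stale-product check could look like this. The product names, timestamps, and SLA values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProduct:
    name: str
    last_updated: datetime
    freshness_sla: timedelta  # maximum tolerated age of the data


def stale_products(products, now):
    """Return the names of products whose data is older than their SLA allows."""
    return [p.name for p in products if now - p.last_updated > p.freshness_sla]


# Hypothetical products and a fixed "now" so the result is reproducible.
NOW = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
PRODUCTS = [
    DataProduct("orders_daily",
                datetime(2026, 1, 10, 11, 0, tzinfo=timezone.utc),
                timedelta(hours=2)),
    DataProduct("customer_segments",
                datetime(2026, 1, 8, 12, 0, tzinfo=timezone.utc),
                timedelta(days=1)),
]

print(stale_products(PRODUCTS, NOW))
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the check testable and lets a scheduler decide when to run it.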

Phased MDM-integration practical matrix

MDM breaks down when it aims for “perfect centralization,” so phased integration that does not break existing systems is the realistic answer.

| Phase | Period | Coverage | Investment guideline |
| --- | --- | --- | --- |
| 1. Current inventory | 1-3 months | Grasp ID systems of major masters (customer, product) | Millions |
| 2. Registry integration | 6-12 months | Make IDs of each system mutually referenceable | Tens of millions |
| 3. Coexistence bidirectional sync | 1-2 years | Bidirectional sync with each system, attribute-value unification | Tens of millions to hundreds of millions |
| 4. Golden Record establishment | 3-5 years | Establish single authoritative data | Hundreds of millions |
| 5. Centralized (ideal) | Long-term | Fully consolidate into single master | Practically impossible at many companies |
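Phases 3 and 4 hinge on deciding which system’s value “wins” for each attribute. The sketch below uses one simple survivorship rule, the newest non-empty value wins, to build a Golden Record from three system records. The rule, the record layout, and the sample values are assumptions; real MDM products offer many more rules (source priority, per-attribute trust, manual stewardship).

```python
from datetime import date


def golden_record(records):
    """Merge records so that, per attribute, the newest non-empty value survives."""
    merged = {}
    # Oldest first, so later (newer) records overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["updated_on"]):
        for attr, value in rec["attributes"].items():
            if value not in (None, ""):
                merged[attr] = value
    return merged


# Hypothetical records for the same customer held in three systems.
RECORDS = [
    {"system": "CRM", "updated_on": date(2025, 3, 1),
     "attributes": {"name": "Tokyo Taro", "address": "Shibuya, Tokyo", "phone": ""}},
    {"system": "ERP", "updated_on": date(2025, 5, 20),
     "attributes": {"name": "Tokyo Taro", "address": "Minato, Tokyo", "phone": None}},
    {"system": "Accounting", "updated_on": date(2024, 11, 5),
     "attributes": {"name": "Tokyo Taro", "phone": "03-0000-0000"}},
]

print(golden_record(RECORDS))
```

Here the address comes from the most recently updated system while the phone number survives from the oldest one, because the newer records hold no phone value. That per-attribute behavior is what Registry integration alone cannot provide.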

The practical lower bound for MDM investment is the mid-size enterprise. At a startup or small SaaS company, MDM is excessive: PostgreSQL master tables plus common ID-naming conventions are enough. Uber’s 2014 “dashboard wars” (the same “weekly rides” metric coexisting in 3-5 versions, with CEO and field numbers diverging) is a typical case showing why this kind of central management becomes necessary.

MDM should proceed by phased integration via Coexistence. Aiming for perfect centralization in one go fails.

EA-perspective DA pitfalls and forbidden moves

These are the typical accident patterns in EA’s DA. All of them lead to “the same customer registered with 3 IDs” or “numbers diverge at management meetings.”

| Forbidden move | Why it’s bad |
| --- | --- |
| Not aligning term definitions company-wide | “This month’s revenue” diverges 3-8% by department, and the debate never converges |
| Aiming for Centralized master integration all at once | Migration requires stopping the existing core system, with business-stop risk |
| Installing a data catalog and then abandoning it | No stewards, metadata is not updated, and the catalog rots |
| Splitting data domains by org name | Ownership disappears on reorganization; split by capability units instead |
| Not creating a PII Inventory | GDPR compliance becomes impossible, exposing the company to fine risk of the kind seen in Meta’s EUR 1.2B fine |
| Pointing AI directly at the DB without a semantic layer | AI misunderstands “revenue” and mass-produces hallucinations |
| Discussing company-wide data strategy without a CDO | It never reaches the management agenda and stalls in interdepartmental warfare |
| Not operating data classification (public / internal / confidential / top secret) | Personal-info handling stays vague, leading to regulatory violations |
| Introducing MDM by stopping existing systems | Business stops and a firestorm follows; use phased integration via Coexistence |
| Managing metadata in PDF / Excel | AI cannot read it, it is not continuously updated, and it goes out of date |
| Believing “DB design exists, so EA’s DA is unneeded” | Individual DB design and the company-wide viewpoint are different; domain splitting and master alignment are work outside DB design |
| Believing “buying a data catalog completes DA” | Tools are means; regime, rules, and operations are the substance. Just installing a tool leads to neglect and rot |
| Believing “a CDO is only for large enterprises” | Recently mid-size companies also appoint CDOs; the question is whether data strategy goes on the management agenda |
| Believing “master-data integration solves all problems” | Integration itself is a hard project; the secret of success is to proceed in phases and realistically |

Uber’s 2014 dashboard wars is also told as a success story: the in-house Michelangelo (ML platform) and Querybuilder (semantic layer) helped root a culture of “metric definitions agreed via GitHub PRs,” converting metric debates into engineering work.

EA-perspective DA is “word definitions before technology.” Aligning terms company-wide is the first step.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work with these left vague always invites later questions like “why did we decide this again?”

  • Conceptual data model (major 10-30 entities)
  • Data-domain split (who owns what)
  • Master-data strategy (integration method)
  • Data catalog (tool, operation)
  • Governance regime (CDO, committee)
  • Data-classification policy (public / internal / confidential / top secret)
  • Cloud DWH strategy (Snowflake / BigQuery etc.)

Author’s note - “numbers don’t match” that stopped a new project

The real danger of disparate company-wide data definitions surfaces not during incidents but at decision-making time.

A frequently retold example: a mid-size retailer started a DX project to create a company-wide revenue dashboard, and aggregating revenue data from the finance, sales, and e-commerce DBs revealed that the 3 numbers diverged by 3-8% every month. The cause was per-system differences in “is revenue counted at order or at shipping?”, “with or without consumption tax?”, and “when are returns reflected?”. Investigation and agreement on definitions took more than half a year, and the management-meeting dashboard went live 1.5 years late.

Another famous case is Uber’s 2014 “dashboard wars.” During rapid growth, each Uber team built its own data pipelines, so the same “weekly rides” metric coexisted in 3-5 versions on internal dashboards, and the CEO’s numbers diverged from the field’s. Uber eventually developed Michelangelo (ML platform) and Querybuilder (semantic layer) in-house and switched to a mechanism where metrics are defined once company-wide and reused. After that, a culture of “metric definitions agreed via GitHub PRs” took root inside Uber, and metric debates were converted into engineering work.

Both cases drive home the decisive value of “aligning word definitions company-wide before technology.” At a company without EA’s DA, the moment an AI agent is asked “what’s this month’s revenue?” it can return 3 different answers.

How to choose

The core of EA-perspective DA is the viewpoint of designing company-wide data, not individual DBs, as a strategic asset. Siloed data, the same customer registered with 3 IDs, numbers diverging in management meetings: these are not DB technology problems but flaws in the enterprise-level data system. The work of EA-level DA is to split domains and assign data owners, secure the uniqueness of core data via MDM, and create a company-wide map via conceptual models, data flows, and catalogs. The realistic approach is phased integration via Coexistence MDM; aiming for perfect centralization breaks down.

Another decisive axis is building a data space that AI agents can understand autonomously. For an AI that hears “revenue” to reach the correct aggregation logic, a semantic layer (dbt semantic layer, Cube.js) is required. Only companies where, in Data Mesh fashion, domains provide data products to other departments and to AI, and where API-referenceable, continuously updated catalogs are in place, can stay competitive in AI usage.

AI-era decision axes

When AI-driven development (vibe coding) and AI usage are the premise, EA’s DA is redesigned as a data space accessible to AI agents. In an era when AI seeks out data autonomously, whether company-wide data is visible to AI decides competitiveness.

| Favored in the AI era | Disfavored in the AI era |
| --- | --- |
| Data Mesh (domain ownership) | Centralized silos |
| API-referenceable data | Excel, files |
| Semantic layer (term definitions) | Undefined column names |
| Continuously-updated catalog | Half-year-old snapshots |

For AI Ready data architecture, setting up a semantic layer (dbt semantic layer, Cube.js, etc.) is drawing attention. What is needed is a design in which an AI that hears “revenue” reaches the correct aggregation logic.

AI-era DA is designed in vocabulary AI understands. The semantic layer is key.
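The core idea of a semantic layer, one agreed definition per business term, can be sketched in a few lines. This is a plain-Python stand-in, not the dbt semantic layer or Cube.js API. The metric definition, the SQL text, the table and column names, and the aliases are all hypothetical.

```python
# Hypothetical metric registry: each business term maps to one agreed definition.
METRICS = {
    "revenue": {
        "description": "Shipped, tax-excluded sales net of returns (assumed definition)",
        "sql": (
            "SELECT SUM(amount_ex_tax) FROM orders "
            "WHERE shipped_on IS NOT NULL AND returned = FALSE"
        ),
        "owner": "Finance",
    },
}

# Hypothetical synonyms that people or AI agents might use for the same metric.
ALIASES = {"sales": "revenue", "this month's revenue": "revenue"}


def resolve_metric(term):
    """Resolve a business term to its single agreed definition, or fail loudly."""
    key = term.strip().lower()
    key = ALIASES.get(key, key)
    if key not in METRICS:
        raise KeyError(f"no agreed definition for {term!r}; add one before querying")
    return METRICS[key]


print(resolve_metric("Sales")["sql"])
```

An AI agent that goes through `resolve_metric` either gets the agreed aggregation logic or an explicit error. It never has to guess a definition from column names, which is the failure mode described above.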

Selection priorities

  1. Domain splitting and data owners - clarify ownership, governance foundation
  2. MDM via phased Coexistence - perfect centralization fails, don’t break existing
  3. Data classification for privacy compliance - public / internal / confidential / top secret, PII Inventory setup
  4. Semantic layer to hand AI vocabulary - dbt semantic layer / Cube.js, AI Ready design

“Design data in vocabulary AI understands.” Domain split + MDM + semantic layer is the core.

Summary

This article covered EA-perspective Data Architecture, including conceptual models, domains, MDM, catalog, PII Inventory, semantic layer, and AI Ready design.

Clarify ownership via domain splits and data owners, phase in MDM via Coexistence, classify data for privacy, and hand AI its vocabulary via a semantic layer. That is the practical answer for EA-perspective DA in 2026.

Next time we’ll cover Application Architecture (AA) (system portfolio, integration patterns).

I hope you’ll read the next article as well.