Data Architecture

[Data Architecture] Data Governance - A Foundation Curated as a Dictionary for AI

About this article

As the seventh installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data governance.

A technology-only data platform rots in 3 years. This article covers the components of governance - data catalog, metadata, lineage, quality management, data stewards, and access control - alongside a phased adoption roadmap by scale and regulation, and how, in the AI era, governance functions as a dictionary for AI.

What is data governance in the first place

In a nutshell, data governance is "the mechanism for establishing, and continuously enforcing, company-wide rules on who may use which data and how."

Imagine a condo association. If residents dump trash whenever they want, leave personal belongings in shared areas, and copy keys freely, the whole building falls apart. It’s the management regulations (governance) and the building manager (data steward) that maintain order. Data is the same - without rules for definitions, naming, access permissions, and quality standards, different departments produce different numbers, personal information leaks, and AI learns from bad data. That’s the collapse governance prevents.

What data governance handles

If you only build a data platform, "tables of unknown origin," "same-named columns with different definitions," and "personal data exposed without permission" accumulate over time, until the platform itself becomes untrustworthy. Governance is the organizational mechanism that prevents this.

A technology-only data platform rots in 3 years. It must be defended through institution and operations.

Why it’s needed

1. Company-wide data use turns chaotic

When departments start using data with different definitions, the problem of “sales numbers differ by department” always arises. Unifying definitions, naming, and usage rules is essential.

2. Risk of personal/confidential information leaks

GDPR (General Data Protection Regulation, the EU's personal-data protection regulation), Japan's Personal Information Protection Act, the My Number Act: violations bring fines ranging from hundreds of millions to billions of yen. In May 2023, Meta was fined EUR 1.2B (about JPY 200 billion) for transferring EU citizens' data to the US in violation of GDPR, the largest fine since GDPR took effect. We are in an era where a single storage-location design choice can produce a fine at the company-survival level.

3. A precondition for AI utilization

AI accuracy is decided by data quality. Giving AI data whose definition or reliability is unknown sharply raises the risk of wrong judgments.

Main components

To implement data governance, combine the following elements. Any one alone is insufficient - it functions as a trinity of organization, institution, and technology.

```mermaid
flowchart TB
    subgraph TECH["Technology"]
        CAT[Data catalog]
        META[Metadata management]
        LIN[Lineage]
        QUAL[Quality management]
        AC[Access control]
    end
    subgraph ORG["Organization"]
        STEW[Data stewards]
        CDO[CDO]
    end
    subgraph RULE["Institution"]
        POL[Policy<br/>retention/encryption/disposal]
    end
    GOAL([Trustworthy<br/>data platform])
    TECH --> GOAL
    ORG --> GOAL
    RULE --> GOAL
    classDef tech fill:#dbeafe,stroke:#2563eb;
    classDef org fill:#fef3c7,stroke:#d97706;
    classDef rule fill:#fae8ff,stroke:#a21caf;
    classDef goal fill:#dcfce7,stroke:#16a34a;
    class TECH,CAT,META,LIN,QUAL,AC tech;
    class ORG,STEW,CDO org;
    class RULE,POL rule;
    class GOAL goal;
```

| Element | Role |
|---|---|
| Data catalog | Index of what data exists and where |
| Metadata management | Each data set's definition, owner, update frequency |
| Lineage | Visualization of data transformations and flow |
| Quality management | Detection of definition violations, missing data, duplicates |
| Access control | Permissions on who can see what |
| Data stewards | Person responsible for each data set |
| Policy | Rules for retention period, encryption, disposal |

Data catalog

A data catalog is the organization’s catalog of all data, with the mechanism of central searchability for “where, what data, with what definition, managed by whom.” It originated when Google built an internal mechanism to “search data like Google Search,” and today commercial and OSS tools are provided by various vendors.

Without a catalog, analysts must ask someone every time: "Where is this data?" "What does this number mean?" That slows data utilization to a fraction - the author puts it at one tenth - of its potential speed.

| Pros | Cons |
|---|---|
| Faster data discovery | Heavy initial metadata setup |
| Shorter new-hire onboarding | Information rots if neglected |
| Becomes evidence for audits | Tool fees, operational cost |
| Premise for AI integration | Doesn't function without a steward regime |

At small scale, dbt docs is enough. Consider dedicated tools from mid-scale onward.
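
As a minimal sketch of that "central searchability" idea, the toy catalog below keeps entries in an in-memory list and matches a keyword against table names and descriptions. All table names, descriptions, and owners here are hypothetical; a real catalog (dbt docs, DataHub) backs this with persisted metadata, lineage, and a UI.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """One table's entry in a minimal in-memory data catalog."""
    name: str
    description: str
    owner: str

# Hypothetical entries; a real catalog would hold thousands, auto-scanned.
CATALOG = [
    CatalogEntry("orders", "Confirmed orders, one row per order", "sales-team"),
    CatalogEntry("daily_sales", "Daily revenue aggregated from orders", "analytics"),
    CatalogEntry("users", "Registered users with signup date", "growth-team"),
]

def search_catalog(keyword: str) -> list[str]:
    """Return table names whose name or description matches the keyword."""
    kw = keyword.lower()
    return [e.name for e in CATALOG
            if kw in e.name.lower() or kw in e.description.lower()]
```

The point is not the search itself but that every entry carries a description and an owner: a catalog without those fields is just a list of names.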

Catalog tool options

| Tool | When to choose |
|---|---|
| dbt docs | Already on dbt, analytics-model-centric; lightweight and free |
| DataHub (LinkedIn OSS) | Mid-to-large; want to grow OSS while customizing |
| Amundsen (Lyft OSS) | Lighter and simpler than DataHub; lowers the entry bar |
| Apache Atlas | Legacy DWH environments centered on Hadoop/Hive |
| Collibra | Enterprise; building a serious governance regime |
| Alation | Want AI features (natural-language search, etc.) |

DataHub is becoming the OSS de facto standard. On the commercial side the two giants are Collibra (full-featured) and Alation (strong on AI), both at investment scales suited to large enterprises.

Metadata management

The contents of the data catalog are metadata (data about data). For each table/column, record “what it’s for,” “how to use it,” and “who to ask,” so users don’t get lost.

| Metadata type | Content |
|---|---|
| Business metadata | Definition, meaning, business glossary |
| Technical metadata | Schema, types, constraints, indexes |
| Operational metadata | Update frequency, SLA (Service Level Agreement), job history |
| Owner info | Data steward, contact |
| Quality metadata | Test results, anomaly-detection scores |

Metadata requires both auto-collection (catalog-tool scans) and manual entry (written by stewards); full automation is impossible.
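
A minimal sketch of that division of labor: scanner-collected technical metadata (column names and types) is overlaid with steward-written business metadata, and anything the steward has not covered yet is flagged. The column names, types, and descriptions are illustrative assumptions.

```python
# Sketch: merge auto-collected technical metadata with manually written
# business metadata. All column names and descriptions are illustrative.

def auto_collect(schema: dict) -> dict:
    """Simulate what a catalog scanner extracts automatically: names and types."""
    return {col: {"type": dtype} for col, dtype in schema.items()}

def merge_metadata(technical: dict, business: dict) -> dict:
    """Overlay steward-written descriptions onto the scanned schema; flag gaps."""
    merged = {}
    for col, tech in technical.items():
        merged[col] = {**tech, **business.get(col, {"description": "TODO: steward"})}
    return merged

schema = {"user_id": "INT64", "ltv": "NUMERIC"}
business = {"ltv": {"description": "Lifetime value in JPY, excl. tax",
                    "owner": "analytics"}}
metadata = merge_metadata(auto_collect(schema), business)
```

The "TODO: steward" marker is the part no scanner can fill in: business meaning has to come from a person.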

Data lineage

Data lineage is the mechanism for tracking where data came from, how it’s transformed, and where it flows. The DAG (Directed Acyclic Graph) auto-generated by dbt is one form of lineage, visualizing what columns of which business DB this aggregated result is derived from.

| Use case | Content |
|---|---|
| Impact investigation | "What breaks if I change this column?" |
| Cause tracking | "Where did this dashboard's number go wrong?" |
| Audit response | "Where is personal data flowing to?" |
| Data retirement | "Who's using this table?" |

Without lineage in place, no one can delete old tables with confidence, and mystery tables pile up.
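
The impact-investigation use case above can be sketched as a traversal over a downstream adjacency list. The table names are illustrative; in practice the graph is auto-generated by dbt or a catalog tool rather than hand-written.

```python
# Lineage as a DAG: each table maps to the tables that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["daily_sales", "customer_ltv"],
    "daily_sales": ["exec_dashboard"],
    "customer_ltv": [],
    "exec_dashboard": [],
}

def downstream_impact(table: str) -> set[str]:
    """Everything that would break if `table` changed (transitive closure)."""
    impacted, stack = set(), list(LINEAGE.get(table, []))
    while stack:
        node = stack.pop()
        if node not in impacted:
            impacted.add(node)
            stack.extend(LINEAGE.get(node, []))
    return impacted
```

The same traversal run in reverse (upstream) answers the cause-tracking question: "where did this dashboard's number come from?"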

Data quality

Data quality is measured along six dimensions. The modern best practice is to test these automatically, validating on every pipeline run, with tools such as dbt tests and Great Expectations.

| Viewpoint | Meaning | Example |
|---|---|---|
| Completeness | No missing data | No NULLs in required fields |
| Uniqueness | No duplicates | User IDs are not duplicated |
| Accuracy | Matches reality | Sales amounts match actuals |
| Consistency | No logical contradictions | end_date > start_date |
| Timeliness | Updates are not lagging | Daily data updated daily |
| Referential integrity | Foreign keys exist | Referenced user IDs exist |
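
A minimal sketch of these dimensions as row-level checks on a toy orders dataset. Accuracy is omitted because it requires reconciling against an external source of truth. The field names and the one-day freshness threshold are illustrative; in practice dbt tests or Great Expectations declare such checks instead of hand-rolling them.

```python
from datetime import date

# Toy dataset; field names and values are illustrative.
rows = [
    {"order_id": 1, "user_id": 10, "amount": 1200,
     "start": date(2024, 1, 1), "end": date(2024, 1, 5), "updated": date(2024, 1, 5)},
    {"order_id": 2, "user_id": 11, "amount": 800,
     "start": date(2024, 1, 2), "end": date(2024, 1, 4), "updated": date(2024, 1, 4)},
]
known_users = {10, 11}

def completeness(rows):       # no NULL in required fields
    return all(r["amount"] is not None for r in rows)

def uniqueness(rows):         # no duplicate primary keys
    ids = [r["order_id"] for r in rows]
    return len(ids) == len(set(ids))

def consistency(rows):        # no logical contradictions
    return all(r["end"] >= r["start"] for r in rows)

def timeliness(rows, as_of):  # updates lag at most 1 day
    return all((as_of - r["updated"]).days <= 1 for r in rows)

def referential_integrity(rows):  # foreign keys exist
    return all(r["user_id"] in known_users for r in rows)
```

In dbt these would be declared as `not_null`, `unique`, and `relationships` tests in YAML; the logic is the same.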

Data whose quality is not guaranteed becomes the worst kind of risk: misleading executive decisions.

Data stewards

A data steward is a human role that takes responsibility for each data set. Technology alone cannot decide "what is this data's definition?" or "how may it be used?"; the organization needs people who do.

| Role | Responsibility scope |
|---|---|
| Business steward | Definitions, usage, business terms |
| Technical steward | Schema, pipelines, quality |
| Data owner | Final approval, scope of disclosure |
| Data custodian | Daily operations, access grants |

Assign at least one steward per data set to make the locus of responsibility clear. The basis of governance is never leaving "tables no one manages" alone.
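
That rule can be sketched as a simple coverage check over a hypothetical steward registry: any table without an owner is surfaced for assignment or retirement. The registry shape and names are assumptions.

```python
# Hypothetical steward registry: table -> responsible person (or None).
stewards = {
    "orders": "alice",
    "users": "bob",
    "legacy_tmp_2019": None,   # the kind of table governance must not allow
}

def unowned_tables(assignments: dict) -> list[str]:
    """Tables with no steward; each must be assigned an owner or retired."""
    return [table for table, owner in assignments.items() if not owner]
```

Running this check in CI against the catalog is one lightweight way to keep "no unmanaged tables" enforced rather than aspirational.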

Access control and policy

Access to data containing personal or confidential information must be strictly managed. Mere table-level permissions aren't enough; row-level and column-level access control becomes necessary.

| Control level | Content |
|---|---|
| Table level | Read/write permission per table |
| Column level | e.g. salary column visible only to HR |
| Row level | e.g. see only your own department's data |
| Dynamic masking | Mask PII (Personally Identifiable Information) at query time |
| Retention period | Auto-delete after N days |

BigQuery and Snowflake natively support column-level and row-level policies; filtering automatically on the SQL side removes the need for per-application implementation.
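
A minimal sketch of column-level masking plus row-level filtering applied at query time, mimicking in plain Python what those native policies do. The roles, column names, and masking format are assumptions for illustration, not the BigQuery/Snowflake API.

```python
# Columns visible only to the "hr" role (assumption for this sketch).
MASKED_COLUMNS = {"email", "salary"}

def apply_policies(rows, role, department):
    """Row-level filter plus dynamic column masking, applied per query."""
    visible = []
    for row in rows:
        if row["department"] != department and role != "hr":
            continue                      # row-level: only your department
        out = {}
        for col, val in row.items():
            if col in MASKED_COLUMNS and role != "hr":
                out[col] = "***"          # column-level: dynamic masking
            else:
                out[col] = val
        visible.append(out)
    return visible

rows = [
    {"name": "Tanaka", "email": "t@example.com", "salary": 6_000_000, "department": "sales"},
    {"name": "Sato", "email": "s@example.com", "salary": 7_000_000, "department": "hr"},
]
```

The advantage of doing this in the warehouse instead of in application code is exactly what the sketch hides: the policy lives in one place, not in every app that queries the table.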

Decision criteria

1. Org scale

The need for data governance varies with organization scale. At small scale, putting in excessive mechanisms freezes operations, so phased adoption matched to scale is realistic.

| Scale | Recommended |
|---|---|
| Startup (~30 people) | dbt docs alone is enough |
| Small/mid (~300) | dbt docs + named data stewards |
| Mid (~3,000) | DataHub / Amundsen + organizational regime |
| Large (3,000+) | Collibra / Alation + dedicated department |

Putting in heavy tools from the start outruns the organization's ability to operate them, so the basic approach is to start light and grow.

2. Regulatory requirements

Legal requirements vary by industry, region, and the data handled. In strictly regulated industries, governance investment is mandatory, and non-compliance can mean exiting the business.

| Industry/region | Major regulation | Governance requirement |
|---|---|---|
| Personal data in the EU | GDPR | Right to erasure, consent management, portability |
| Personal information in Japan | Personal Information Protection Act | No use beyond stated purpose, third-party-provision management |
| Medical | HIPAA (US health-information protection law), Medical Information Guidelines | Strong encryption, audit logs |
| Finance | PCI DSS (credit-card industry standard), FISC (financial-institution safety standards) | Strict access control, defense in depth |
| Public companies | J-SOX (Japanese SOX, internal control reporting), SOX | Traceability of accounting data |

3. AI utilization level

If you will seriously use AI, governance must be designed on the premise that AI agents consume the data. AI takes the data it is given at face value, so bad-quality data produces confidently wrong output.

| AI utilization level | Required governance elements |
|---|---|
| BI reference only | Catalog, unified naming |
| Text-to-SQL | Metadata curation required |
| RAG / AI agents | + lineage, quality tests |
| Autonomous AI judgment | + audit logs, explainability of results |

How to choose by case

Startup with only an analytics team using data

dbt + dbt docs + dbt tests. This minimal setup covers definitions, quality, and documentation. Even without a dedicated steward, an analytics engineer can run it as a side role.

Mid-size company with cross-department data use

DataHub or Amundsen + departmental steward designation. In addition to the catalog, build a regime where “this metric’s owner is so-and-so in sales” is clear.

Regulated industries (finance, medical, public)

Collibra or Alation + dedicated governance department. Audit and legal-regulation responses are required, justifying commercial-tool investment.

Companies seriously running AI agents/RAG

DataHub + dbt tests + API-retrievable metadata. Because AI explores data autonomously, metadata must be retrievable via API; PDF or Excel glossaries are unreadable to AI.
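
A minimal sketch of why API-retrievable metadata matters: catalog entries rendered as plain text that can be dropped into an LLM prompt for text-to-SQL or agent use. The entries and field names are illustrative; a real setup would pull them from the catalog's API rather than a hard-coded list.

```python
# Hypothetical catalog entries, as an API might return them.
catalog = [
    {"table": "daily_sales",
     "description": "Daily revenue aggregated from confirmed orders",
     "columns": {"sales_date": "DATE, one row per day",
                 "revenue_jpy": "NUMERIC, tax excluded"}},
]

def to_llm_context(entries: list[dict]) -> str:
    """Render catalog metadata as plain text for an LLM prompt."""
    lines = []
    for e in entries:
        lines.append(f"Table {e['table']}: {e['description']}")
        for col, desc in e["columns"].items():
            lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)
```

A glossary locked in PDF or Excel cannot feed this function; that is the practical meaning of "API-retrievable."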

B2C handling massive personal data

Row-level/column-level access control + dynamic masking, achievable with standard BigQuery/Snowflake features. Retention-period policies also matter: design so that data no longer in use gets deleted.
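
The retention idea can be sketched as a policy function that selects records past a cutoff for automatic deletion. The 365-day window and record shape are assumptions; warehouses like BigQuery express the same thing declaratively via table expiration settings.

```python
from datetime import date, timedelta

RETENTION_DAYS = 365  # illustrative policy window

def expired_records(records, as_of):
    """Records past the retention window: candidates for automatic deletion."""
    cutoff = as_of - timedelta(days=RETENTION_DAYS)
    return [r for r in records if r["last_used"] < cutoff]

records = [
    {"id": 1, "last_used": date(2022, 1, 1)},
    {"id": 2, "last_used": date(2024, 6, 1)},
]
```

Codifying "delete after N days" as an executable policy, rather than a document, is what makes GDPR's right to erasure answerable.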

Phased governance-adoption roadmap by scale and regulation

Introducing governance all at once at large scale freezes operations, so the realistic path is growing it phase by phase. Which phase you need is decided by regulatory requirements and organization scale.

| Phase | Org scale | Regulation | Elements introduced | Monthly investment |
|---|---|---|---|---|
| 1. Minimum | ~30 | None | dbt docs + dbt tests | ¥0 |
| 2. Basic | ~300 | Personal info handled | + named data stewards, naming conventions | Tens of thousands of yen |
| 3. Mid-scale | ~3,000 | GDPR / Personal Info Act | + DataHub / Amundsen (OSS) + row/column access control | Hundreds of thousands of yen |
| 4. Enterprise | 3,000+ | GDPR + industry regs (HIPAA / PCI DSS / FISC) | + Collibra / Alation + dedicated governance dept | Millions of yen+ |
| 5. Regulated/listed | All scales | J-SOX, audit response | History retention, audit logs, full lineage | Varies |

GDPR fines reach the higher of 4% of annual global revenue or EUR 20M (about JPY 3B). The May 2023 Meta fine of EUR 1.2B (about JPY 200B) showed that one data-storage-location design mistake can create a fine at the company-survival level. The iron rule is to input regulatory requirements first.

The moment you handle personal data, governance is a legal obligation. Bolting on later isn’t possible.

Author’s note - mountains of “ownerless tables” and the EUR 1.2B fine

There's an often-heard story about a mid-size SaaS company whose data-analytics team actively used dbt and BigQuery but "never put in governance mechanisms"; within 3 years, thousands of "tables of unknown origin" had piled up. Tables made by ex-employees, intermediate tables from experiments, urgent-aggregation tables from half a year ago: none could be deleted with confidence, leaving only storage cost and confusion.

A more serious case is from May 2023, when Meta was fined EUR 1.2B (about JPY 200B) for violating GDPR by transferring EU citizens' data to the US. The largest fine since GDPR took effect, it remains a talking point as proof that one storage-location design choice can create a fine at the company-survival level. The early-2017 wave of MongoDB ransomware attacks (tens of thousands of instances worldwide breached because authentication had never been configured) is also cited as a representative example of governance absence leading directly to incidents.

I myself, at a previous job, saw groups of tables in the state of "unclear what they are, but no one can confirm it's safe to delete them," and felt how the absence of governance quietly produces debt. Both cases leave the same lesson: putting in tools alone protects nothing. Without the trinity of institution, people, and technology, the platform itself becomes a liability, and that happens regardless of scale.

Governance is a trinity of tools, institution, and people. Missing any one and it doesn’t function.

Governance-operation pitfalls and forbidden moves

Here are the typical accidents in governance operations. All of them are direct causes of audit failures, regulatory violations, or AI misjudgments.

| Forbidden move | Why it's bad |
|---|---|
| Assuming tools alone realize governance | The catalog gets neglected and rots; stewards, policies, and operational processes are required |
| Leaving "tables no one manages" alone | Thousands of mystery tables in 3 years; critical SQL only on retirees' PCs |
| Loading personal data into the DWH unmasked | GDPR / Personal Info Act violation; use row/column access control + dynamic masking |
| Storing audit logs only inside the production account | Logs get erased on breach; use a separate account + WORM storage |
| Accumulating data without retention policies | Can't respond to GDPR's right to erasure; legal risk |
| Verbal, person-locked data definitions | Definitions lost on retirement or transfer; number reliability collapses |
| Applying uniform governance to all data | Cost explodes and operations break; prioritize by importance |
| Managing glossaries in PDF / Excel | Unreadable and unsearchable for AI; move to API-retrievable catalogs |
| Not revoking retirees' access immediately | Unauthorized access and leak incidents; automate same-day deprovisioning |
| Making all stewards concurrent roles | No one truly takes responsibility; one owner per data set |
| Assuming "governance is the audit department's job" | It's everyone's job, including analytics and development; without the field writing metadata, catalogs can't be built |
| Avoiding governance as "governance = restriction" | It's not restriction but a foundation for safe use; good governance accelerates utilization |

The early-2017 MongoDB ransomware wave (instances with authentication left unconfigured breached worldwide) is a representative case of governance absence leading directly to incidents. Meta's 2023 EUR 1.2B fine was likewise the result of dismissing GDPR's storage-location rules.

Governance is a trinity of tools, institution, and people. Missing any one and it doesn’t function.

AI decision axes

| AI-era favorable | AI-era unfavorable |
|---|---|
| Full metadata with column descriptions | Naming like col1, col2 |
| Natural-language definitions and usage descriptions | Schema-only information |
| Explicit lineage and update frequency | Black-box transformations |
| Catalogs AI can retrieve via API | PDF / Excel glossaries |

  1. Phased adoption matched to scale — dbt docs for startups, DataHub for mid-size, Collibra for large enterprises.
  2. Always name stewards — one or more per data set; don't leave tables unmanaged.
  3. Auto-test quality — validate completeness, uniqueness, consistency every time with dbt tests / Great Expectations.
  4. Metadata AI can read — API-retrievable, natural-language definitions, lineage visualized.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"

  • Data catalog (DataHub / Amundsen / Collibra / dbt docs)
  • Data stewards (who owns what)
  • Quality tests (dbt tests, Great Expectations)
  • Access-control method (table/column/row level)
  • Personal info handling (masking, retention)
  • Data classification (public / internal / confidential / top secret)
  • Audit logs (who referenced what when)

Summary

This article covered data governance, including data catalog, metadata, lineage, quality management, stewards, access control, a phased roadmap by scale and regulation, and AI-era governance.

Phased adoption matched to scale, always name stewards, auto-test quality, and curate into AI-readable metadata. That is the practical answer for data governance in 2026.

Next time we’ll start a new category (Security Architecture).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.