Data Architecture

[Data Architecture] Data Governance - A Foundation Curated as a Dictionary for AI

About this article

As the seventh installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data governance.

A technology-only data platform rots in 3 years. This article covers the components of governance - data catalog, metadata, lineage, quality management, data stewards, and access control - alongside a phased adoption roadmap by scale and regulation, and how, in the AI era, governance functions as a dictionary for AI.

What is data governance in the first place

In a nutshell, data governance is "the mechanism for establishing, and continuously enforcing, company-wide rules on who may use which data and how."

Imagine a condo association. If residents dump trash whenever they want, leave personal belongings in shared areas, and copy keys freely, the whole building falls apart. It’s the management regulations (governance) and the building manager (data steward) that maintain order. Data is the same - without rules for definitions, naming, access permissions, and quality standards, different departments produce different numbers, personal information leaks, and AI learns from bad data. That’s the collapse governance prevents.

What data governance handles

If you only build a data platform, "tables of unknown origin," "same-named columns with different definitions," and "personal data exposed without permission" accumulate over time, until the platform itself becomes untrustworthy. Governance is the organizational mechanism that prevents this.

A technology-only data platform rots in 3 years. It must be defended through institution and operations.

Why it’s needed

1. Company-wide data use turns chaotic

When departments start using data with different definitions, the problem of “sales numbers differ by department” always arises. Unifying definitions, naming, and usage rules is essential.

2. Risk of personal/confidential information leaks

GDPR (General Data Protection Regulation, the EU's personal-data protection regulation), Japan's Personal Information Protection Act, the My Number Act: violations bring fines ranging from hundreds of millions to billions of yen. In May 2023, Meta was fined EUR 1.2B (about JPY 200 billion) for transferring EU citizens' data to the US in violation of GDPR, the largest fine since GDPR took effect. We are in an era where a single storage-location design choice can produce a fine at the company-survival level.

3. A precondition for AI utilization

AI accuracy is decided by data quality. Giving AI data whose definition or reliability is unknown sharply raises the risk of wrong judgments.

Main components

To implement data governance, combine the following elements. Any one alone is insufficient - it functions as a trinity of organization, institution, and technology.

```mermaid
flowchart TB
    subgraph TECH["Technology"]
        CAT[Data catalog]
        META[Metadata management]
        LIN[Lineage]
        QUAL[Quality management]
        AC[Access control]
    end
    subgraph ORG["Organization"]
        STEW[Data stewards]
        CDO[CDO]
    end
    subgraph RULE["Institution"]
        POL[Policy<br/>retention/encryption/disposal]
    end
    GOAL([Trustworthy<br/>data platform])
    TECH --> GOAL
    ORG --> GOAL
    RULE --> GOAL
    classDef tech fill:#dbeafe,stroke:#2563eb;
    classDef org fill:#fef3c7,stroke:#d97706;
    classDef rule fill:#fae8ff,stroke:#a21caf;
    classDef goal fill:#dcfce7,stroke:#16a34a;
    class TECH,CAT,META,LIN,QUAL,AC tech;
    class ORG,STEW,CDO org;
    class RULE,POL rule;
    class GOAL goal;
```

| Element | Role |
|---|---|
| Data catalog | Index of what data exists and where |
| Metadata management | Each data set's definition, owner, update frequency |
| Lineage | Visualization of data transformations and flow |
| Quality management | Detection of definition violations, missing data, duplicates |
| Access control | Permissions on who can see what |
| Data stewards | Person responsible for each data set |
| Policy | Rules for retention period, encryption, disposal |

Data catalog

A data catalog is the organization’s catalog of all data, with the mechanism of central searchability for “where, what data, with what definition, managed by whom.” It originated when Google built an internal mechanism to “search data like Google Search,” and today commercial and OSS tools are provided by various vendors.

Without a catalog, analysts must ask someone every time: "Where is this data?" "What does this number mean?" That slows data utilization to a fraction - the author puts it at one tenth - of its potential speed.

| Pros | Cons |
|---|---|
| Faster data discovery | Heavy initial metadata setup |
| Shorter new-hire onboarding | Information rots if neglected |
| Becomes evidence for audits | Tool fees, operational cost |
| Premise for AI integration | Doesn't function without a steward regime |

At small scale, dbt docs is enough. Consider dedicated tools from mid-scale onward.
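
As a minimal sketch of that "central searchability" idea, the toy catalog below keeps entries in an in-memory list and matches a keyword against table names and descriptions. All table names, descriptions, and owners here are hypothetical; a real catalog (dbt docs, DataHub) backs this with persisted metadata, lineage, and a UI.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """One table's entry in a minimal in-memory data catalog."""
    name: str
    description: str
    owner: str

# Hypothetical entries; a real catalog would hold thousands, auto-scanned.
CATALOG = [
    CatalogEntry("orders", "Confirmed orders, one row per order", "sales-team"),
    CatalogEntry("daily_sales", "Daily revenue aggregated from orders", "analytics"),
    CatalogEntry("users", "Registered users with signup date", "growth-team"),
]

def search_catalog(keyword: str) -> list[str]:
    """Return table names whose name or description matches the keyword."""
    kw = keyword.lower()
    return [e.name for e in CATALOG
            if kw in e.name.lower() or kw in e.description.lower()]
```

The point is not the search itself but that every entry carries a description and an owner: a catalog without those fields is just a list of names.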

Catalog tool options

| Tool | When to choose |
|---|---|
| dbt docs | Already on dbt, analytics-model-centric; lightweight and free |
| DataHub (LinkedIn OSS) | Mid-to-large; want to grow OSS while customizing |
| Amundsen (Lyft OSS) | Lighter and simpler than DataHub; lowers the entry bar |
| Apache Atlas | Legacy DWH environments centered on Hadoop/Hive |
| Collibra | Enterprise; building a serious governance regime |
| Alation | Want AI features (natural-language search, etc.) |

DataHub is becoming the OSS de facto standard. On the commercial side the two giants are Collibra (full-featured) and Alation (strong on AI), both at investment scales suited to large enterprises.

Metadata management

The contents of the data catalog are metadata (data about data). For each table/column, record “what it’s for,” “how to use it,” and “who to ask,” so users don’t get lost.

| Metadata type | Content |
|---|---|
| Business metadata | Definition, meaning, business glossary |
| Technical metadata | Schema, types, constraints, indexes |
| Operational metadata | Update frequency, SLA (Service Level Agreement), job history |
| Owner info | Data steward, contact |
| Quality metadata | Test results, anomaly-detection scores |

Metadata requires both auto-collection (catalog-tool scans) and manual entry (written by stewards); full automation is impossible.
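
A minimal sketch of that division of labor: scanner-collected technical metadata (column names and types) is overlaid with steward-written business metadata, and anything the steward has not covered yet is flagged. The column names, types, and descriptions are illustrative assumptions.

```python
# Sketch: merge auto-collected technical metadata with manually written
# business metadata. All column names and descriptions are illustrative.

def auto_collect(schema: dict) -> dict:
    """Simulate what a catalog scanner extracts automatically: names and types."""
    return {col: {"type": dtype} for col, dtype in schema.items()}

def merge_metadata(technical: dict, business: dict) -> dict:
    """Overlay steward-written descriptions onto the scanned schema; flag gaps."""
    merged = {}
    for col, tech in technical.items():
        merged[col] = {**tech, **business.get(col, {"description": "TODO: steward"})}
    return merged

schema = {"user_id": "INT64", "ltv": "NUMERIC"}
business = {"ltv": {"description": "Lifetime value in JPY, excl. tax",
                    "owner": "analytics"}}
metadata = merge_metadata(auto_collect(schema), business)
```

The "TODO: steward" marker is the part no scanner can fill in: business meaning has to come from a person.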

Data lineage

Data lineage is the mechanism for tracking where data came from, how it’s transformed, and where it flows. The DAG (Directed Acyclic Graph) auto-generated by dbt is one form of lineage, visualizing what columns of which business DB this aggregated result is derived from.

| Use case | Content |
|---|---|
| Impact investigation | "What breaks if I change this column?" |
| Cause tracking | "Where did this dashboard's number go wrong?" |
| Audit response | "Where is personal data flowing to?" |
| Data retirement | "Who's using this table?" |

Without lineage in place, no one can delete old tables with confidence, and mystery tables pile up.
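
The impact-investigation use case above can be sketched as a traversal over a downstream adjacency list. The table names are illustrative; in practice the graph is auto-generated by dbt or a catalog tool rather than hand-written.

```python
# Lineage as a DAG: each table maps to the tables that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["daily_sales", "customer_ltv"],
    "daily_sales": ["exec_dashboard"],
    "customer_ltv": [],
    "exec_dashboard": [],
}

def downstream_impact(table: str) -> set[str]:
    """Everything that would break if `table` changed (transitive closure)."""
    impacted, stack = set(), list(LINEAGE.get(table, []))
    while stack:
        node = stack.pop()
        if node not in impacted:
            impacted.add(node)
            stack.extend(LINEAGE.get(node, []))
    return impacted
```

The same traversal run in reverse (upstream) answers the cause-tracking question: "where did this dashboard's number come from?"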

Data quality

Data quality is measured along six dimensions. The modern best practice is to test these automatically, validating on every pipeline run, with tools such as dbt tests and Great Expectations.

| Viewpoint | Meaning | Example |
|---|---|---|
| Completeness | No missing data | No NULLs in required fields |
| Uniqueness | No duplicates | User IDs are not duplicated |
| Accuracy | Matches reality | Sales amounts match actuals |
| Consistency | No logical contradictions | end_date > start_date |
| Timeliness | Updates are not lagging | Daily data updated daily |
| Referential integrity | Foreign keys exist | Referenced user IDs exist |
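
A minimal sketch of these dimensions as row-level checks on a toy orders dataset. Accuracy is omitted because it requires reconciling against an external source of truth. The field names and the one-day freshness threshold are illustrative; in practice dbt tests or Great Expectations declare such checks instead of hand-rolling them.

```python
from datetime import date

# Toy dataset; field names and values are illustrative.
rows = [
    {"order_id": 1, "user_id": 10, "amount": 1200,
     "start": date(2024, 1, 1), "end": date(2024, 1, 5), "updated": date(2024, 1, 5)},
    {"order_id": 2, "user_id": 11, "amount": 800,
     "start": date(2024, 1, 2), "end": date(2024, 1, 4), "updated": date(2024, 1, 4)},
]
known_users = {10, 11}

def completeness(rows):       # no NULL in required fields
    return all(r["amount"] is not None for r in rows)

def uniqueness(rows):         # no duplicate primary keys
    ids = [r["order_id"] for r in rows]
    return len(ids) == len(set(ids))

def consistency(rows):        # no logical contradictions
    return all(r["end"] >= r["start"] for r in rows)

def timeliness(rows, as_of):  # updates lag at most 1 day
    return all((as_of - r["updated"]).days <= 1 for r in rows)

def referential_integrity(rows):  # foreign keys exist
    return all(r["user_id"] in known_users for r in rows)
```

In dbt these would be declared as `not_null`, `unique`, and `relationships` tests in YAML; the logic is the same.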

Data whose quality is not guaranteed becomes the worst kind of risk: misleading executive decisions.

Data stewards

A data steward is a human role that takes responsibility for each data set. Technology alone cannot decide "what is this data's definition?" or "how may it be used?"; the organization needs people who do.

| Role | Responsibility scope |
|---|---|
| Business steward | Definitions, usage, business terms |
| Technical steward | Schema, pipelines, quality |
| Data owner | Final approval, scope of disclosure |
| Data custodian | Daily operations, access grants |

Assign at least one steward per data set to make the locus of responsibility clear. The basis of governance is never leaving "tables no one manages" alone.
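
That rule can be sketched as a simple coverage check over a hypothetical steward registry: any table without an owner is surfaced for assignment or retirement. The registry shape and names are assumptions.

```python
# Hypothetical steward registry: table -> responsible person (or None).
stewards = {
    "orders": "alice",
    "users": "bob",
    "legacy_tmp_2019": None,   # the kind of table governance must not allow
}

def unowned_tables(assignments: dict) -> list[str]:
    """Tables with no steward; each must be assigned an owner or retired."""
    return [table for table, owner in assignments.items() if not owner]
```

Running this check in CI against the catalog is one lightweight way to keep "no unmanaged tables" enforced rather than aspirational.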

Access control and policy

Access to data containing personal or confidential information must be strictly managed. Mere table-level permissions aren't enough; row-level and column-level access control becomes necessary.

| Control level | Content |
|---|---|
| Table level | Read/write permission per table |
| Column level | e.g. salary column visible only to HR |
| Row level | e.g. see only your own department's data |
| Dynamic masking | Mask PII (Personally Identifiable Information) at query time |
| Retention period | Auto-delete after N days |

BigQuery and Snowflake natively support column-level and row-level policies; filtering automatically on the SQL side removes the need for per-application implementation.
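
A minimal sketch of column-level masking plus row-level filtering applied at query time, mimicking in plain Python what those native policies do. The roles, column names, and masking format are assumptions for illustration, not the BigQuery/Snowflake API.

```python
# Columns visible only to the "hr" role (assumption for this sketch).
MASKED_COLUMNS = {"email", "salary"}

def apply_policies(rows, role, department):
    """Row-level filter plus dynamic column masking, applied per query."""
    visible = []
    for row in rows:
        if row["department"] != department and role != "hr":
            continue                      # row-level: only your department
        out = {}
        for col, val in row.items():
            if col in MASKED_COLUMNS and role != "hr":
                out[col] = "***"          # column-level: dynamic masking
            else:
                out[col] = val
        visible.append(out)
    return visible

rows = [
    {"name": "Tanaka", "email": "t@example.com", "salary": 6_000_000, "department": "sales"},
    {"name": "Sato", "email": "s@example.com", "salary": 7_000_000, "department": "hr"},
]
```

The advantage of doing this in the warehouse instead of in application code is exactly what the sketch hides: the policy lives in one place, not in every app that queries the table.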

Decision criteria

1. Org scale

The need for data governance varies with organization scale. At small scale, putting in excessive mechanisms freezes operations, so phased adoption matched to scale is realistic.

| Scale | Recommended |
|---|---|
| Startup (~30 people) | dbt docs alone is enough |
| Small/mid (~300) | dbt docs + named data stewards |
| Mid (~3,000) | DataHub / Amundsen + organizational regime |
| Large (3,000+) | Collibra / Alation + dedicated department |

Putting in heavy tools from the start outruns the organization's ability to operate them, so the basic approach is to start light and grow.

2. Regulatory requirements

Legal requirements vary by industry, region, and the data handled. In strictly regulated industries, governance investment is mandatory, and non-compliance can mean exiting the business.

| Industry/region | Major regulation | Governance requirement |
|---|---|---|
| Personal data in the EU | GDPR | Right to erasure, consent management, portability |
| Personal information in Japan | Personal Information Protection Act | No use beyond stated purpose, third-party-provision management |
| Medical | HIPAA (US health-information protection law), Medical Information Guidelines | Strong encryption, audit logs |
| Finance | PCI DSS (credit-card industry standard), FISC (financial-institution safety standards) | Strict access control, defense in depth |
| Public companies | J-SOX (Japanese SOX, internal control reporting), SOX | Traceability of accounting data |

3. AI utilization level

If you will seriously use AI, governance must be designed on the premise that AI agents consume the data. AI takes the data it is given at face value, so bad-quality data produces confidently wrong output.

| AI utilization level | Required governance elements |
|---|---|
| BI reference only | Catalog, unified naming |
| Text-to-SQL | Metadata curation required |
| RAG / AI agents | + lineage, quality tests |
| Autonomous AI judgment | + audit logs, explainability of results |

How to choose by case

Startup with only an analytics team using data

dbt + dbt docs + dbt tests. This minimal setup covers definitions, quality, and documentation. Even without a dedicated steward, an analytics engineer can run it as a side role.

Mid-size company with cross-department data use

DataHub or Amundsen + departmental steward designation. In addition to the catalog, build a regime where “this metric’s owner is so-and-so in sales” is clear.

Regulated industries (finance, medical, public)

Collibra or Alation + dedicated governance department. Audit and legal-regulation responses are required, justifying commercial-tool investment.

Companies seriously running AI agents/RAG

DataHub + dbt tests + API-retrievable metadata. Because AI explores data autonomously, metadata must be retrievable via API; PDF or Excel glossaries are unreadable to AI.
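
A minimal sketch of why API-retrievable metadata matters: catalog entries rendered as plain text that can be dropped into an LLM prompt for text-to-SQL or agent use. The entries and field names are illustrative; a real setup would pull them from the catalog's API rather than a hard-coded list.

```python
# Hypothetical catalog entries, as an API might return them.
catalog = [
    {"table": "daily_sales",
     "description": "Daily revenue aggregated from confirmed orders",
     "columns": {"sales_date": "DATE, one row per day",
                 "revenue_jpy": "NUMERIC, tax excluded"}},
]

def to_llm_context(entries: list[dict]) -> str:
    """Render catalog metadata as plain text for an LLM prompt."""
    lines = []
    for e in entries:
        lines.append(f"Table {e['table']}: {e['description']}")
        for col, desc in e["columns"].items():
            lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)
```

A glossary locked in PDF or Excel cannot feed this function; that is the practical meaning of "API-retrievable."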

B2C handling massive personal data

Row-level/column-level access control + dynamic masking, achievable with standard BigQuery/Snowflake features. Retention-period policies also matter: design so that data no longer in use gets deleted.
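
The retention idea can be sketched as a policy function that selects records past a cutoff for automatic deletion. The 365-day window and record shape are assumptions; warehouses like BigQuery express the same thing declaratively via table expiration settings.

```python
from datetime import date, timedelta

RETENTION_DAYS = 365  # illustrative policy window

def expired_records(records, as_of):
    """Records past the retention window: candidates for automatic deletion."""
    cutoff = as_of - timedelta(days=RETENTION_DAYS)
    return [r for r in records if r["last_used"] < cutoff]

records = [
    {"id": 1, "last_used": date(2022, 1, 1)},
    {"id": 2, "last_used": date(2024, 6, 1)},
]
```

Codifying "delete after N days" as an executable policy, rather than a document, is what makes GDPR's right to erasure answerable.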

Phased governance-adoption roadmap by scale and regulation

Introducing governance all at once at large scale freezes operations, so the realistic path is growing it phase by phase. Which phase you need is decided by regulatory requirements and organization scale.

| Phase | Org scale | Regulation | Elements introduced | Monthly investment |
|---|---|---|---|---|
| 1. Minimum | ~30 | None | dbt docs + dbt tests | ¥0 |
| 2. Basic | ~300 | Personal info handled | + named data stewards, naming conventions | Tens of thousands of yen |
| 3. Mid-scale | ~3,000 | GDPR / Personal Info Act | + DataHub / Amundsen (OSS) + row/column access control | Hundreds of thousands of yen |
| 4. Enterprise | 3,000+ | GDPR + industry regs (HIPAA / PCI DSS / FISC) | + Collibra / Alation + dedicated governance dept | Millions of yen+ |
| 5. Regulated/listed | All scales | J-SOX, audit response | History retention, audit logs, full lineage | Varies |

GDPR fines reach the higher of 4% of annual global revenue or EUR 20M (about JPY 3B). The May 2023 Meta fine of EUR 1.2B (about JPY 200B) showed that one data-storage-location design mistake can create a fine at the company-survival level. The iron rule is to input regulatory requirements first.

The moment you handle personal data, governance is a legal obligation. Bolting on later isn’t possible.

Author’s note - mountains of “ownerless tables” and the EUR 1.2B fine

There's an often-heard story about a mid-size SaaS company whose data-analytics team actively used dbt and BigQuery but "never put in governance mechanisms"; within 3 years, thousands of "tables of unknown origin" had piled up. Tables made by ex-employees, intermediate tables from experiments, urgent-aggregation tables from half a year ago: none could be deleted with confidence, leaving only storage cost and confusion.

A more serious case is from May 2023, when Meta was fined EUR 1.2B (about JPY 200B) for violating GDPR by transferring EU citizens' data to the US. The largest fine since GDPR took effect, it remains a talking point as proof that one storage-location design choice can create a fine at the company-survival level. The early-2017 wave of MongoDB ransomware attacks (tens of thousands of instances worldwide breached because authentication had never been configured) is also cited as a representative example of governance absence leading directly to incidents.

I myself, at a previous job, saw groups of tables in the state of "unclear what they are, but no one can confirm it's safe to delete them," and felt how the absence of governance quietly produces debt. Both cases leave the same lesson: putting in tools alone protects nothing. Without the trinity of institution, people, and technology, the platform itself becomes a liability, and that happens regardless of scale.

Governance is a trinity of tools, institution, and people. Missing any one and it doesn’t function.

Governance-operation pitfalls and forbidden moves

Here are the typical accidents in governance operations. All of them are direct causes of audit failures, regulatory violations, or AI misjudgments.

| Forbidden move | Why it's bad |
|---|---|
| Assuming tools alone realize governance | The catalog gets neglected and rots; stewards, policies, and operational processes are required |
| Leaving "tables no one manages" alone | Thousands of mystery tables in 3 years; critical SQL only on retirees' PCs |
| Loading personal data into the DWH unmasked | GDPR / Personal Info Act violation; use row/column access control + dynamic masking |
| Storing audit logs only inside the production account | Logs get erased on breach; use a separate account + WORM storage |
| Accumulating data without retention policies | Can't respond to GDPR's right to erasure; legal risk |
| Verbal, person-locked data definitions | Definitions lost on retirement or transfer; number reliability collapses |
| Applying uniform governance to all data | Cost explodes and operations break; prioritize by importance |
| Managing glossaries in PDF / Excel | Unreadable and unsearchable for AI; move to API-retrievable catalogs |
| Not revoking retirees' access immediately | Unauthorized access and leak incidents; automate same-day deprovisioning |
| Making all stewards concurrent roles | No one truly takes responsibility; one owner per data set |
| Assuming "governance is the audit department's job" | It's everyone's job, including analytics and development; without the field writing metadata, catalogs can't be built |
| Avoiding governance as "governance = restriction" | It's not restriction but a foundation for safe use; good governance accelerates utilization |

The early-2017 MongoDB ransomware wave (instances with authentication left unconfigured breached worldwide) is a representative case of governance absence leading directly to incidents. Meta's 2023 EUR 1.2B fine was likewise the result of dismissing GDPR's storage-location rules.

Governance is a trinity of tools, institution, and people. Missing any one and it doesn’t function.

AI decision axes

| AI-era favorable | AI-era unfavorable |
|---|---|
| Full metadata with column descriptions | Naming like col1, col2 |
| Natural-language definitions and usage descriptions | Schema-only information |
| Explicit lineage and update frequency | Black-box transformations |
| Catalogs AI can retrieve via API | PDF / Excel glossaries |

  1. Phased adoption matched to scale — dbt docs for startups, DataHub for mid-size, Collibra for large enterprises.
  2. Always name stewards — one or more per data set; don't leave tables unmanaged.
  3. Auto-test quality — validate completeness, uniqueness, consistency every time with dbt tests / Great Expectations.
  4. Metadata AI can read — API-retrievable, natural-language definitions, lineage visualized.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"

  • Data catalog (DataHub / Amundsen / Collibra / dbt docs)
  • Data stewards (who owns what)
  • Quality tests (dbt tests, Great Expectations)
  • Access-control method (table/column/row level)
  • Personal info handling (masking, retention)
  • Data classification (public / internal / confidential / top secret)
  • Audit logs (who referenced what when)

Summary

This article covered data governance, including data catalog, metadata, lineage, quality management, stewards, access control, a phased roadmap by scale and regulation, and AI-era governance.

Phased adoption matched to scale, always name stewards, auto-test quality, and curate into AI-readable metadata. That is the practical answer for data governance in 2026.

Next time we’ll start a new category (Security Architecture).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.