About this article
As the seventh installment of the "Data Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains data governance.
A technology-only data platform rots in 3 years. This article covers the components of governance (data catalog, metadata, lineage, quality management, data stewards, and access control), an adoption roadmap by scale and regulation, and how, in the AI era, governance functions as a dictionary for AI.
What is data governance in the first place
In a nutshell, data governance is "the mechanism for establishing and continuously enforcing rules on who can use what data and how across the entire company."
Imagine a condo association. If residents dump trash whenever they want, leave personal belongings in shared areas, and copy keys freely, the whole building falls apart. It's the management regulations (governance) and the building manager (data steward) that maintain order. Data is the same - without rules for definitions, naming, access permissions, and quality standards, different departments produce different numbers, personal information leaks, and AI learns from bad data. That's the collapse governance prevents.
What data governance handles
Just building a data platform causes "tables of unknown origin," "same-named columns with different definitions," and "personal data leaking without permission" to accumulate over time, until the platform itself becomes untrustworthy. Governance is the organizational mechanism preventing this.
A technology-only data platform rots in 3 years. It must be defended through institution and operations.
Why itâs needed
1. Company-wide data use turns chaotic
When departments start using data with different definitions, the problem of "sales numbers differ by department" always arises. Unifying definitions, naming, and usage rules is essential.
2. Risk of personal/confidential information leaks
GDPR (General Data Protection Regulation, the EU's personal-data protection regulation), Japan's Personal Information Protection Act, the My Number Act - violations bring fines ranging from hundreds of millions to billions of yen. In May 2023, Meta was fined EUR 1.2B (about JPY 200 billion) for transferring EU citizens' data to the US in violation of GDPR. The largest fine since GDPR took effect, it shows we are in an era where a single storage-location design choice can produce a fine at the company-survival level.
3. A precondition for AI utilization
AI accuracy is decided by data quality. Feeding AI data whose definition or reliability is unknown sharply increases the risk of wrong judgments.
Main components
To implement data governance, combine the following elements. Any one alone is insufficient - it functions as a trinity of organization, institution, and technology.
```mermaid
flowchart TB
subgraph TECH["Technology"]
CAT[Data catalog]
META[Metadata management]
LIN[Lineage]
QUAL[Quality management]
AC[Access control]
end
subgraph ORG["Organization"]
STEW[Data stewards]
CDO[CDO]
end
subgraph RULE["Institution"]
POL[Policy<br/>retention/encryption/disposal]
end
GOAL([Trustworthy<br/>data platform])
TECH --> GOAL
ORG --> GOAL
RULE --> GOAL
classDef tech fill:#dbeafe,stroke:#2563eb;
classDef org fill:#fef3c7,stroke:#d97706;
classDef rule fill:#fae8ff,stroke:#a21caf;
classDef goal fill:#dcfce7,stroke:#16a34a;
class TECH,CAT,META,LIN,QUAL,AC tech;
class ORG,STEW,CDO org;
class RULE,POL rule;
class GOAL goal;
```
| Element | Role |
|---|---|
| Data catalog | Index of what data exists and where |
| Metadata management | Each dataset's definition, owner, update frequency |
| Lineage | Visualization of how data is transformed and where it flows |
| Quality management | Detection of definition violations, missing data, duplicates |
| Access control | Permissions on who can see what |
| Data stewards | Person responsible for each data set |
| Policy | Rules for retention period, encryption, disposal |
Data catalog
A data catalog is the organization-wide index of its data: a mechanism for centrally searching "where what data is, with what definition, and who manages it." It originated when Google built an internal mechanism to search data the way you search the web, and today commercial and OSS tools are provided by various vendors.
Without a catalog, analysts must ask around every time: "where is this data?" and "what does this number mean?", dropping data-utilization speed to a tenth.
| Pros | Cons |
|---|---|
| Faster data discovery | Heavy initial metadata setup |
| Shorter new-hire onboarding | Information rots if neglected |
| Becomes evidence for audits | Tool fees, operational cost |
| Premise for AI integration | Doesn't function without a steward regime |
At small scale, dbt docs is enough. Consider dedicated tools from mid-scale onward.
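To make the catalog's job concrete, here is a minimal, purely illustrative Python sketch of the core contract any catalog tool provides: register datasets with their definition and owner, then search them by keyword. The `CatalogEntry` fields and class names are invented for illustration and do not come from any real tool.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset in the catalog (fields invented for illustration)."""
    name: str
    definition: str
    owner: str
    tags: list = field(default_factory=list)

class DataCatalog:
    """Tiny in-memory catalog: register datasets, then search by keyword."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list:
        """Match the keyword against name, definition, and tags."""
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.definition.lower()
            or kw in (t.lower() for t in e.tags)
        ]
```

A real tool adds schema scanning, a UI, and permissions on top, but the essence is this small: register once, find from anywhere.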
Catalog tool options
| Tool | When to choose |
|---|---|
| dbt docs | Already on dbt, analytics-model-centric. Lightweight and free |
| DataHub (LinkedIn OSS) | Mid-to-large, want to grow OSS while customizing |
| Amundsen (Lyft OSS) | Lighter and simpler than DataHub. Want to lower the entry bar |
| Apache Atlas | Legacy DWH environment centered on Hadoop/Hive |
| Collibra | Enterprise, want to seriously build a governance regime |
| Alation | Want AI features (natural-language search etc.) |
DataHub is becoming the OSS de facto standard. On the commercial side, the two giants are Collibra (full-featured) and Alation (strong AI features), with investment at large-enterprise scale.
Metadata management
The contents of the data catalog are metadata (data about data). For each table and column, record "what it's for," "how to use it," and "who to ask," so users don't get lost.
| Metadata type | Content |
|---|---|
| Business metadata | Definition, meaning, business glossary |
| Technical metadata | Schema, types, constraints, indexes |
| Operational metadata | Update frequency, SLA (Service Level Agreement), job history |
| Owner info | Data steward, contact |
| Quality metadata | Test results, anomaly-detection scores |
Metadata needs both auto-collection (catalog tool scans) and manual entry (steward writes), and full automation is impossible.
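The split between auto-collection and manual entry can be sketched as follows. The column names and dictionary shapes are hypothetical, standing in for what a catalog scanner would extract from a schema versus what a steward writes by hand.

```python
# Technical metadata: what a catalog scanner could read from the DB schema.
# Column names and values are hypothetical.
technical = {
    "orders.amount": {"type": "NUMERIC", "nullable": False},
    "orders.user_id": {"type": "STRING", "nullable": False},
}

# Business metadata: what only a human steward can write.
business = {
    "orders.amount": {"definition": "Order total incl. tax, in JPY",
                      "owner": "sales-steward"},
}

def merge_metadata(technical, business):
    """Overlay manual business metadata on scanned technical metadata,
    flagging columns that still lack a human-written definition."""
    merged = {}
    for col, tech in technical.items():
        entry = dict(tech)
        entry.update(business.get(col, {}))
        entry["documented"] = col in business
        merged[col] = entry
    return merged
```

The `documented` flag is the point: automation fills the technical layer for free, but the list of still-undocumented columns is exactly the steward's backlog.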
Data lineage
Data lineage is the mechanism for tracking where data came from, how it's transformed, and where it flows. The DAG (Directed Acyclic Graph) auto-generated by dbt is one form of lineage, visualizing what columns of which business DB this aggregated result is derived from.
| Use case | Content |
|---|---|
| Impact investigation | "What breaks if I change this column?" |
| Cause tracking | "Where did this dashboard's number go wrong?" |
| Audit response | "Where is personal data flowing to?" |
| Data retirement | "Who's using this table?" |
Without lineage in place, you can't delete old tables, and mystery tables pile up.
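Lineage is ultimately a DAG, and the "impact investigation" use case in the table above is a graph traversal. A minimal sketch, with invented table names in the dbt naming style:

```python
from collections import deque

# Edges point downstream: source -> staging -> marts (table names are invented).
lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_sales", "fct_refunds"],
    "fct_sales": ["dashboard_revenue"],
}

def downstream(table, edges):
    """Return every table affected if `table` changes (breadth-first walk)."""
    seen, queue = set(), deque([table])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Here `downstream("raw_orders", lineage)` returns all four downstream tables, which is exactly the answer to "what breaks if I change this?" Reversing the edges answers the retirement question "who is using this table?".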
Data quality
Data quality is measured along six viewpoints. Testing these automatically and validating during pipeline execution is the modern best practice; dbt's built-in tests and Great Expectations are commonly used.
| Viewpoint | Meaning | Example |
|---|---|---|
| Completeness | No missing data | No NULL in required items |
| Uniqueness | No duplicates | User IDs not duplicated |
| Accuracy | Matches reality | Sales amounts match actuals |
| Consistency | No logical contradictions | end_date > start_date |
| Timeliness | Updates arenât lagging | Daily data updated daily |
| Referential integrity | Foreign keys exist | User ID exists |
Data whose quality is not guaranteed poses the worst kind of risk: misleading executive decisions.
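Three of the six viewpoints (completeness, uniqueness, consistency) can be sketched as plain Python checks; they correspond roughly to dbt's built-in `not_null` and `unique` tests plus a custom consistency test. The sample rows and column names are invented for illustration.

```python
from datetime import date

# Invented sample rows: the second violates consistency (end before start),
# and the third duplicates the first row's user_id.
rows = [
    {"user_id": "u1", "start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1)},
    {"user_id": "u2", "start_date": date(2024, 3, 1), "end_date": date(2024, 2, 1)},
    {"user_id": "u1", "start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1)},
]

def check_completeness(rows, col):
    """No NULL in a required column (dbt: not_null)."""
    return all(r.get(col) is not None for r in rows)

def check_uniqueness(rows, col):
    """No duplicate values in a key column (dbt: unique)."""
    values = [r[col] for r in rows]
    return len(values) == len(set(values))

def check_consistency(rows):
    """No logical contradictions: end_date must come after start_date."""
    return all(r["end_date"] > r["start_date"] for r in rows)
```

Accuracy, timeliness, and referential integrity need external context (actuals, update timestamps, the referenced table), which is why they are usually wired into the pipeline rather than checked row-locally.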
Data stewards
A data steward is a human role taking responsibility for each data set. The organization needs people deciding the things technology alone can't solve - "what's this data's definition?" and "how may it be used?"
| Role | Responsibility scope |
|---|---|
| Business steward | Manage definition, use, business terms |
| Technical steward | Schema, pipeline, quality |
| Data owner | Final approval, scope of disclosure |
| Data custodian | Daily operations, access grants |
Stewards are assigned at least one per data set to make the locus of responsibility clear. Not leaving "tables no one manages" unattended is the foundation of governance.
Access control and policy
Access to data containing personal or confidential information must be strictly controlled. Mere table-level permissions are not enough; row-level and column-level access control becomes necessary.
| Control level | Content |
|---|---|
| Table level | Read/write permission per table |
| Column level | Salary column visible only to HR |
| Row level | See only your departmentâs data |
| Dynamic masking | Mask PII (Personally Identifiable Information) at query time |
| Retention period | Auto-delete after N days |
BigQuery and Snowflake natively support column-level and row-level policies; filtering automatically on the SQL side removes the need for per-application implementation.
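What the database does natively can be illustrated with an app-side sketch: a row-level filter on department plus query-time masking of an email column. This is purely illustrative; in practice you would define such policies in BigQuery or Snowflake themselves rather than in application code, and the field names and roles here are invented.

```python
def mask_email(value):
    """Query-time masking: hide the local part, keep the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain

def apply_policies(rows, user):
    """Row level: users see only their own department's rows.
    Column level: the email column is masked unless the user's role is 'hr'."""
    visible = []
    for row in rows:
        if row["department"] != user["department"]:
            continue  # row-level filter
        row = dict(row)  # copy so masking does not mutate the source data
        if user["role"] != "hr":
            row["email"] = mask_email(row["email"])  # dynamic masking
        visible.append(row)
    return visible
```

The value of the native database features is that this logic lives in one place and applies to every query path, instead of being re-implemented (and forgotten) in each application.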
Decision criteria
1. Org scale
The need for data governance varies with organization scale. At small scale, putting in excessive mechanisms freezes operations, so phased adoption matched to scale is realistic.
| Scale | Recommended |
|---|---|
| Startup (~30 people) | dbt docs only is enough |
| Small/mid (~300) | dbt docs + named data stewards |
| Mid (~3,000) | DataHub / Amundsen + organizational regime |
| Large (3,000+) | Collibra / Alation + dedicated department |
Heavy tools introduced from the start outrun the organization's capacity to operate them, so the rule is to start light and grow.
2. Regulatory requirements
Regulatory requirements vary by industry, region, and the data handled. In strictly regulated industries, governance investment is mandatory, and non-compliance can mean exiting the business.
| Industry/region | Major regulation | Governance requirement |
|---|---|---|
| Handle personal data in EU | GDPR | Right to erasure, consent management, portability |
| Japan, personal information | Personal Information Protection Act | No use beyond purpose, third-party-provision management |
| Medical | HIPAA (US health-info protection law), Medical Information Guidelines | Strong encryption, audit logs |
| Finance | PCI DSS (credit-card industry standard), FISC (financial institution safety standards) | Strict access control, defense in depth |
| Public companies | J-SOX (Japanese SOX, internal control reporting), SOX | Traceability of accounting data |
3. AI utilization level
If you will seriously use AI, governance must be designed on the premise that AI agents use the data. AI takes the data it is given at face value, so poor-quality data directly produces wrong answers from AI.
| AI utilization level | Required governance elements |
|---|---|
| BI-reference only | Catalog, naming unification |
| Text-to-SQL | Metadata setup required |
| RAG / AI agents | + lineage, quality tests |
| Autonomous AI judgment | + audit logs, result explainability |
How to choose by case
Startup with only an analytics team using data
dbt + dbt docs + dbt tests. This minimal setup keeps definitions, quality, and documentation running. Even without a dedicated steward, an analytics engineer can handle it as a side duty.
Mid-size company with cross-department data use
DataHub or Amundsen + departmental steward designation. In addition to the catalog, build a regime where it is clear that "this metric's owner is so-and-so in sales."
Regulated industries (finance, medical, public)
Collibra or Alation + a dedicated governance department. Audit and regulatory compliance are mandatory, which justifies the commercial-tool investment.
Companies seriously running AI agents/RAG
DataHub + dbt tests + metadata retrievable via API. Because AI explores data autonomously, metadata must be machine-retrievable; PDF and Excel glossaries are unreadable to AI.
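"API-retrievable metadata" means, at minimum, that catalog entries can be serialized into a machine-readable form an agent can consume. A toy sketch follows; the entries are hypothetical, standing in for what you would fetch from a real catalog API (DataHub, for example, exposes REST and GraphQL endpoints).

```python
import json

# Hypothetical catalog entries, standing in for a catalog API response.
catalog = [
    {"table": "fct_sales",
     "definition": "Daily sales per store, JPY incl. tax",
     "owner": "sales-steward",
     "update_frequency": "daily"},
]

def to_ai_context(entries):
    """Serialize catalog metadata as JSON an AI agent can parse,
    unlike a PDF or Excel glossary."""
    return json.dumps(entries, ensure_ascii=False, indent=2)
```

A Text-to-SQL or RAG agent can paste this JSON into its prompt or parse it programmatically, which is exactly what a PDF glossary cannot offer.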
B2C handling massive personal data
Row-level/column-level access control + dynamic masking, achievable with BigQuery/Snowflake standard features. Retention-period policies also matter: design for deleting data no longer in use.
Phased governance-introduction roadmap by regulation/scale
Introducing governance all at once at full scale freezes operations, so the realistic path is growing it phase by phase. The phase is determined by regulatory requirements and organization scale.
| Phase | Org scale | Regulation | Introduction elements | Monthly investment |
|---|---|---|---|---|
| 1. Minimum | ~30 | None | dbt docs + dbt tests | $0 |
| 2. Basic | ~300 | Has personal info | + named data stewards, naming convention | Tens of thousands of yen |
| 3. Mid-scale | ~3,000 | GDPR / Personal Info Act | + DataHub / Amundsen OSS + row/column access control | Hundreds of thousands of yen |
| 4. Enterprise | 3,000+ | GDPR + industry regs (HIPAA / PCI DSS / FISC) | + Collibra / Alation + dedicated governance dept | Millions of yen+ |
| 5. Regulated/listed | All scales | J-SOX, audit response | History retention, audit logs, full lineage | varies |
GDPR fines reach the higher of 4% of annual global revenue or EUR 20M (about JPY 3B). The May 2023 Meta fine of EUR 1.2B (about JPY 200B) showed that a single data-storage-location design mistake can create a fine at the company-survival level. The iron rule is to input regulatory requirements first.
The moment you handle personal data, governance is a legal obligation. It cannot be bolted on later.
Author's note - mountains of "ownerless tables" and the EUR 1.2B fine
There's a story often heard about a mid-size SaaS company whose data-analytics team was actively using dbt and BigQuery but had put no governance mechanisms in place: in 3 years, thousands of "tables of unknown origin" piled up. Tables left by ex-employees, intermediate tables from experiments, urgent-aggregation tables from half a year ago - none could be deleted with confidence, leaving only storage cost and confusion.
A more serious case is the May 2023 incident in which Meta was fined EUR 1.2B (about JPY 200B) for a GDPR violation: transferring EU citizens' data to the US. The largest fine since GDPR took effect, it remains a talking point as a case showing that "one storage-location design choice can create a fine at the company-survival level." The early-2017 large-scale MongoDB ransomware case (tens of thousands of instances worldwide breached because authentication had been left unconfigured) is also told as a representative example of governance absence connecting directly to incidents.
In a previous job, I myself saw groups of tables in the state of "unclear what they are, but no way to confirm whether they can be deleted," and felt how the absence of governance quietly produces debt. Both cases leave the common lesson that putting in tools alone does not protect you. The reality that, without the trinity of institution, people, and technology, the platform actually becomes a liability appears regardless of scale.
Governance is a trinity of tools, institution, and people. If any one is missing, it does not function.
Governance-operation pitfalls and forbidden moves
Here are the typical accidents in governance operation. All of them are direct causes of audit-response failures, regulatory violations, and AI misjudgments.
| Forbidden move | Why itâs bad |
|---|---|
| Mistakenly think governance is realized just by putting in tools | The catalog gets neglected and rots. Stewards, policies, and operational processes are required |
| Leave "tables no one manages" alone | Thousands of mystery tables in 3 years; secret SQL only on retirees' PCs |
| Load personal data into the DWH without masking | GDPR / Personal Info Act violation. Use row/column access control + dynamic masking |
| Store audit logs only inside the production account | Logs get erased on breach. Use a separate account + WORM storage |
| Accumulate data without retention policies | Cannot respond to GDPR's right to erasure. Legal risk |
| Verbal, person-locked data definitions | Definitions are lost on retirement or transfer; number reliability collapses |
| Apply uniform governance to all data | Cost explodes and operations break. Prioritize by importance |
| Manage with PDF / Excel glossaries | Unreadable to AI, unsearchable. Move to API-retrievable catalogs |
| Fail to revoke retirees' access permissions immediately | Unauthorized access and leak incidents. Automate same-day deprovisioning |
| Make all stewards concurrent roles | A state where no one takes responsibility. One owner per data set |
| Assume "governance is the audit department's job" | It is everyone's job, including analytics and development; without the field writing metadata, catalogs cannot be built |
| Avoid governance assuming "governance = restriction" | Not restriction but a foundation for safe utilization; good governance accelerates use |
The early-2017 large-scale MongoDB ransomware case (instances worldwide breached because authentication had been left unconfigured) is a representative case of governance absence connecting directly to incidents. Meta's 2023 EUR 1.2B fine was also the result of dismissing GDPR's rules on where data may be stored.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Full metadata, with column descriptions | Naming like col1, col2 |
| Natural-language definitions, use descriptions | Schema-only info |
| Lineage and update-frequency explicit | Black-box transformations |
| Catalogs AI can retrieve via API | PDF / Excel glossaries |
- Phased adoption matched to scale → dbt docs for startups, DataHub for mid-size, Collibra for large enterprises.
- Always name stewards → one or more per data set; don't leave tables unmanaged.
- Auto-test quality → validate completeness, uniqueness, consistency every time with dbt tests / Great Expectations.
- Metadata AI can read → API-retrievable, natural-language definitions, lineage visualized.
What to decide - what is your projectâs answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague always invites later questions like "why did we decide this again?"
- Data catalog (DataHub / Amundsen / Collibra / dbt docs)
- Data stewards (who owns what)
- Quality tests (dbt tests, Great Expectations)
- Access-control method (table/column/row level)
- Personal info handling (masking, retention)
- Data classification (public / internal / confidential / top secret)
- Audit logs (who referenced what when)
Summary
This article covered data governance, including data catalog, metadata, lineage, quality management, stewards, access control, a phased roadmap by scale and regulation, and AI-era governance.
Phased adoption matched to scale, always name stewards, auto-test quality, and curate into AI-readable metadata. That is the practical answer for data governance in 2026.
Next time weâll start a new category (Security Architecture).
I hope youâll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (45/89)