Data Architecture

[Data Architecture] Data Modeling - Schemas Readable by Both AI and Humans

About this article

As the third installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data modeling.

DBs outlive code, and changing the structure of a table that has grown to tens of millions of rows can require hours of downtime. Done right, the first round of modeling stays effective for a decade. This article covers the 3 stages of conceptual/logical/physical modeling, normalization and denormalization, primary-key design, indexes, schema-change strategy, and soft delete and history management - showing how to build “schemas readable by both AI and humans.”

What data modeling handles

Once a DB is in operation, table-structure changes are extremely high-cost. Adding even one column to a table with tens of millions of rows can require hours of downtime, and modeling mistakes become long-term debt.

Apps can be rewritten. Data models cannot be rewritten - that’s reality.

Why modeling matters most

1. You must capture the essence of the business

Rather than just lining up “columns shown on screen,” you need to translate business rules and relationships into structure. Get this wrong and you can’t keep up with later business changes.

2. Impact ripples to surrounding systems

Business DBs are referenced from many directions - other systems, analytics platforms, reports, BI. The impact range of schema changes is wider than imagined, and careful design is needed.

3. Performance characteristics are decided in design

Decisions made at the modeling stage - indexes, normalization level, partition strategy - determine the future performance ceiling. Fixing them later requires major rewrites.

The 3 stages of modeling

Data modeling proceeds in 3 stages as a rule. Diving straight into table definitions (physical design) means you lose sight of the business essence, and it is a common source of breakdowns.

```mermaid
flowchart LR
    BIZ([Business understanding])
    CON["Conceptual model<br/>list business entities<br/>entity list / ER diagram"]
    LOG["Logical model<br/>define attributes and relations<br/>normalization"]
    PHY["Physical model<br/>implementation matched to DB product<br/>CREATE TABLE/INDEX"]
    DB[(production DB)]
    BIZ --> CON --> LOG --> PHY --> DB
    BAD["Diving straight into physical design<br/>= can't grasp the business, breakdown"]
    BAD -.->|antipattern| PHY
    classDef biz fill:#fef3c7,stroke:#d97706;
    classDef step fill:#dbeafe,stroke:#2563eb;
    classDef db fill:#dcfce7,stroke:#16a34a;
    classDef bad fill:#fee2e2,stroke:#dc2626;
    class BIZ biz;
    class CON,LOG,PHY step;
    class DB db;
    class BAD bad;
```

| Stage | What you do | Output |
|---|---|---|
| Conceptual model | List the “things” handled in the business | Entity list, ER diagram |
| Logical model | Define attributes and relations, normalize | Logical ER diagram, attribute spec |
| Physical model | Implementation form fit to the DB product | CREATE TABLE, indexes |

Large-scale projects clearly separate these 3 stages, but in small/mid-scale projects, combining logical and physical is realistic. Still, always do the conceptual model first.

What is normalization

Normalization is a design principle for eliminating data duplication and preventing inconsistency on update. It is a theory proposed by E.F. Codd in the 1970s, with normal forms ranging from 1NF to 5NF. In practice, 3NF is the norm; beyond that is largely academic territory.

For example, if you write a customer’s name directly into the orders table, every order record must be updated when the customer’s name changes. Normalization is the move of splitting this out: create a separate customer table, and have each order hold only the customer ID. The result is a state where one change keeps the whole consistent.
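The split just described might look like the following sketch (PostgreSQL syntax; table and column names are hypothetical):

```sql
-- The customer's name lives in exactly one place; orders
-- reference it by ID instead of copying it.
CREATE TABLE customers (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE orders (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers (id),
    ordered_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- A name change is now a single-row update:
-- UPDATE customers SET name = 'New Name' WHERE id = 42;
```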

| Stage | Content |
|---|---|
| 1NF (1st Normal Form) | Each cell has one value. Move repeating items to a separate table |
| 2NF (2nd Normal Form) | Separate columns that don’t depend on the entire primary key |
| 3NF (3rd Normal Form) | Separate columns that depend on non-key columns (transitive dependency) |

In practice, 3NF is enough. 4NF and 5NF are theory.

The thinking behind denormalization

Normalization excels at consistency, but has the side effect that JOINs multiply and read performance drops. Deliberately leaving redundancy in place is denormalization. In analytics DBs, denormalization is standard; in business DBs it is applied selectively where read performance is needed.

| Normalized | Denormalized |
|---|---|
| High update consistency | Update consistency drops |
| JOIN-heavy, slow reads | No JOINs, fast reads |
| Suited for OLTP (business DB) | Suited for OLAP (analytics DB) |
| Less data duplication | More data duplication |

The basic motion for business DBs is design at 3NF, then denormalize only where needed - and “denormalize from the start” is dangerous.
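One way to get denormalized reads without abandoning the 3NF base is a pre-joined, read-optimized copy. The sketch below uses a PostgreSQL materialized view over hypothetical orders/customers tables; staleness until refresh is the accepted trade-off:

```sql
-- Keep orders/customers normalized; materialize a denormalized
-- copy for heavy report queries (hypothetical names).
CREATE MATERIALIZED VIEW order_report AS
SELECT o.id AS order_id,
       o.ordered_at,
       c.name AS customer_name   -- redundant copy, by design
FROM   orders o
JOIN   customers c ON c.id = o.customer_id;

-- Refresh on a schedule (cron, etc.); reads pay no JOIN cost.
REFRESH MATERIALIZED VIEW order_report;
```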

Star schema (for analytics DBs)

The most-used analytics-DB schema is the star schema. With a “fact” table at the center and “dimension” tables around it - a star shape - it’s optimized for fast aggregation queries.

For sales analysis, for instance:

  • Fact table: sales line items (timestamp, product ID, customer ID, amount)
  • Dimension tables: product, customer, date, store

In this shape, multi-axis aggregation like “sales by region, month, product category” runs extremely fast. Almost all analytics platforms - BigQuery, Snowflake, Tableau - assume this shape.
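A minimal star-schema sketch for the sales example above (generic SQL; all names are hypothetical):

```sql
-- Dimension tables: the descriptive "axes" of analysis.
CREATE TABLE dim_product  (product_key  BIGINT PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key BIGINT PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date     (date_key     INT    PRIMARY KEY, full_date DATE, year INT, month INT);

-- Fact table: one row per sales line item, keys plus measures.
CREATE TABLE fact_sales (
    date_key     INT    REFERENCES dim_date (date_key),
    product_key  BIGINT REFERENCES dim_product (product_key),
    customer_key BIGINT REFERENCES dim_customer (customer_key),
    amount       NUMERIC(12, 2)
);

-- "Sales by region and month" is a single star join:
SELECT c.region, d.year, d.month, SUM(f.amount) AS sales
FROM   fact_sales f
JOIN   dim_customer c USING (customer_key)
JOIN   dim_date     d USING (date_key)
GROUP  BY c.region, d.year, d.month;
```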

Business DB: 3NF. Analytics DB: star schema. Change shape per use case.

Primary key design

The primary key is the ID that uniquely identifies each record, and the choice strongly affects later design. In practice, surrogate keys (sequential numbers or UUIDs) are mainstream, and the iron rule is to avoid using business values (email, phone number) as primary keys.

If you make a business value the primary key, every foreign reference must be updated when that value changes - a design that can’t withstand real business events like a name change or an email change.

| PK type | Characteristics | Suited for |
|---|---|---|
| Sequential (BIGINT) | Small and fast, ordered | Business DBs, internal use only |
| UUID v4 | Distributed generation possible, unordered | Distributed systems, public-facing |
| UUID v7 | Time-ordered, distributed generation possible | Post-2024 new DBs |
| Business value | Meaning is readable | Don’t use, as a rule |

Today, UUID v7 (finalized in 2024 as RFC 9562) is regarded as the optimum that combines time ordering and randomness.

For new-DB primary keys: UUID v7 or BIGINT. v4 invites index performance degradation.
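A sketch of the recommended pattern, assuming PostgreSQL 18+ (which ships a uuidv7() function); on older versions, generate v7 IDs in the application or via an extension instead. Table and column names are hypothetical:

```sql
CREATE TABLE users (
    id         UUID PRIMARY KEY DEFAULT uuidv7(),
    email      TEXT NOT NULL UNIQUE,  -- unique, but deliberately NOT the PK
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

The email stays unique via a constraint, so the business rule is enforced, while the surrogate key absorbs any future email change.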

Index design

Indexes are auxiliary data structures that speed up search - placed correctly, queries can get hundreds of times faster, but place too many and updates slow down and the DB bloats. The basic policy is “place them on columns used in WHERE and JOIN, and nowhere else.”

Composite-index order matters: only left-prefix matching works. An index of (a, b, c) works for WHERE a=?, WHERE a=? AND b=?, and WHERE a=? AND b=? AND c=?, but not for WHERE b=? alone.
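The left-prefix rule in SQL form (hypothetical names):

```sql
CREATE INDEX idx_orders_cust_status_date
    ON orders (customer_id, status, ordered_at);

-- Can use the index (a left prefix is present):
--   WHERE customer_id = ?
--   WHERE customer_id = ? AND status = ?
--   WHERE customer_id = ? AND status = ? AND ordered_at >= ?

-- Cannot use it (no left prefix):
--   WHERE status = ?
--   WHERE ordered_at >= ?
```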

| Index type | Use case |
|---|---|
| B-Tree | Standard; equality search, range search |
| Hash | Equality only (narrow use cases) |
| GIN / GiST | Full-text search, JSON, geospatial |
| Partial index | Conditional (e.g., excluding deleted rows) |

Initially place the bare minimum, and add later only on actually-slow queries - that’s the rule.
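The partial index from the table above, sketched in PostgreSQL syntax (hypothetical names, assuming a soft-delete deleted_at column):

```sql
-- Index only live rows; deleted rows add no index bloat
-- and are skipped automatically by matching queries.
CREATE INDEX idx_users_email_active
    ON users (email)
    WHERE deleted_at IS NULL;
```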

Schema-change strategy (migration)

Schema changes on a running DB are the most accident-prone area. Column add/delete/type-change/constraint-change each have different danger levels, and the rule is to “change incrementally while preserving backward compatibility.”

| Change | Danger | Strategy |
|---|---|---|
| Column add (NULLable) | Low | Add directly |
| Column add (NOT NULL) | Mid | NULL first → backfill → NOT NULL |
| Column delete | Mid | Stop usage in code first → delete later |
| Type change | High | Add new column → migrate data → switch |
| Rename | High | Keep both → phased switch |

Use a migration tool (Flyway, Liquibase, Prisma Migrate, dbt, etc.) and manage history in Git - this is required. Modifying the DB by hand is an accident factory.
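The medium-danger “NOT NULL column add” from the table above can be staged like this (PostgreSQL syntax; names and the default value are hypothetical):

```sql
-- 1. Add the column NULLable first; this is cheap and non-blocking.
ALTER TABLE orders ADD COLUMN currency TEXT;

-- 2. Backfill existing rows (batch this on large tables to
--    avoid one giant locking UPDATE).
UPDATE orders SET currency = 'JPY' WHERE currency IS NULL;

-- 3. Only then tighten the constraint.
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
```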

Soft delete and history management

How to handle deleted records is something the design must always settle. There are roughly 3 options, decided by business requirements and legal obligations.

| Method | Content | Suited for |
|---|---|---|
| Physical delete | Remove the record from the DB | Caches, logs, data with no audit need |
| Soft delete | Logical delete via a deleted_at column | General business data |
| History table | Accumulate changes in a separate table | Finance, audit targets, legal requirements |

When personal-data deletion requests arise (GDPR etc.), physical delete becomes mandatory in some cases. Reconciling soft delete with the law requires nailing down requirements upfront.

For finance/medical/public, history tables are often legally mandatory - check business requirements.
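A soft-delete sketch (PostgreSQL syntax, hypothetical names). The partial unique index keeps emails unique among live rows only, so a deleted user’s email can later be reused:

```sql
CREATE TABLE users (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email      TEXT NOT NULL,
    deleted_at TIMESTAMPTZ              -- NULL = live row
);

CREATE UNIQUE INDEX uq_users_email_live
    ON users (email) WHERE deleted_at IS NULL;

-- "Delete" is an UPDATE, not a DELETE:
UPDATE users SET deleted_at = now() WHERE id = 42;

-- Normal reads filter deleted rows out:
SELECT * FROM users WHERE deleted_at IS NULL;
```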

Decision criteria

1. Business complexity

Required modeling depth varies with business complexity. Simple CRUD apps need only 3NF, but the more complex the business rules, the more you need modeling that’s conscious of DDD (Domain-Driven Design) aggregate boundaries.

| Complexity | Recommended approach |
|---|---|
| Simple CRUD (in-house tools, admin panels) | 3NF, straightforwardly |
| Mid-size business (e-commerce, SaaS) | 3NF + awareness of aggregate boundaries |
| Advanced business (finance, insurance, medical) | DDD + event/history retention |

2. Scale assumption

Assumed scale affects modeling. If you’ll handle massive data in the future, consider partitioning (internally splitting one table by time/customer ID into multiple regions) and sharding (horizontal distribution across multiple DBs) at initial design.

  • Under 1M records: don’t worry about anything
  • 10M+: consider partitioning major tables
  • 100M+: partitioning/sharding required, also consider denormalization

3. Team capability

Modeling depends on the team’s overall SQL/DB understanding. Advanced models (EAV (Entity-Attribute-Value), Polymorphic, event sourcing, etc.) become hell if the team can’t manage them.

| Team capability | Recommended level |
|---|---|
| SQL beginners, first-time DB design | Plain 3NF, no extra tricks |
| Intermediate, DB design experience | 3NF + soft delete + history |
| Advanced, DDD experience | DDD aggregates + event sourcing |

How to choose by case

In-house admin panel / simple CRUD

Design 3NF straightforwardly. UUID v7 + soft delete + audit columns (created_at, updated_at) is enough.

EC / SaaS business DB

3NF + history retention. Hold order history and price-change history in separate tables. Prepare an analytics DB separately and sync with ETL/ELT.

Finance / medical / public sector

History tables required, physical delete forbidden. Accumulate all changes in separate tables for audit response.

Analytics platform (DWH)

Star schema. ETL business-DB data, denormalize, and load. Follow BigQuery/Snowflake best practices.

Numerical gates and index thresholds for tables

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

Tracking modeling quality by numbers is the modern standard. Below are industry-standard guidelines.

| Metric | Threshold | What to do if exceeded |
|---|---|---|
| Columns per table | 20 or fewer | Mixed responsibilities; consider splitting |
| Rows per table (typical) | ~100M rows | Beyond that, consider partitioning |
| Indexes per table | 5 or fewer | Update performance drops; review |
| JOIN depth | Up to 3-4 levels | Beyond that, denormalize or organize via views |
| Schema change (large table) | Within 1 hour, no downtime | Use online DDL / pg_repack etc. |
| Composite-index column count | 3 or fewer | Beyond that, use cases narrow |
| Proportion of NULLable columns | NOT NULL whenever possible | Make constraints work |
| PK type | UUID v7 or BIGINT | v4 degrades index performance |

Tables with 10M+ rows can experience minutes to hours of locking during ALTER TABLE, so non-stop changes via PostgreSQL’s pg_repack or MySQL’s gh-ost / pt-online-schema-change are mandatory. Operating with the expand/contract pattern (4 stages: add column → write to both → switch → drop old column) is the standard.
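Expand/contract applied to a column rename, sketched in SQL (hypothetical names; each stage is a separate deploy):

```sql
-- Stage 1: expand - add the new column alongside the old one.
ALTER TABLE orders ADD COLUMN amount NUMERIC(12, 2);

-- Stage 2: the application writes to BOTH columns;
--          backfill historical rows (batched on large tables).
UPDATE orders SET amount = total WHERE amount IS NULL;

-- Stage 3: switch all reads to the new column,
--          then stop writing the old one.

-- Stage 4: contract - drop the old column once nothing uses it.
ALTER TABLE orders DROP COLUMN total;
```

At every stage the schema remains compatible with both the old and new application version, so each deploy can be rolled back independently.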

Numerical gates are the DB version of ESLint / SonarQube. Mechanically checkable.

Author’s note - the full-scan hell born of “metadata JSONB”

There’s a story sometimes told about a business app where every attribute of a user profile was crammed into a single JSONB (PostgreSQL’s binary JSON type, which supports indexed search) column called metadata. The motivation was “flexibility for future expansion,” but merely filtering for “users whose company name contains X” triggered a full scan every time, and the admin panel became unable to load as data grew.

I’ve also seen a similar site where a UI searching hundreds of thousands of users started taking 3+ seconds, and the problem only surfaced then. Even with first-aid GIN indexes, performance was far slower than normal columns, and the case is often told paired with the punchline: “we eventually spent six months on a major refactor separating major attributes into normal columns.”

A similar case: adopting UUID v4 as the PK, where as records grew the index’s physical placement became scattered and writes got heavy - the degradation became prominent at the tens-of-millions scale. Since 2024, UUID v7 (with time ordering) has been standardized, and this problem can be avoided at the design stage. The lesson common to both cases: “flexible” and “sloppy” are paper-thin apart.

Limit JSON to the partial areas where the schema fluctuates. Split search targets out into normal columns.
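A sketch of that split (PostgreSQL, hypothetical names): the searched attribute becomes a real, indexed column, while the genuinely fluctuating long tail stays in JSONB:

```sql
CREATE TABLE users (
    id           BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    company_name TEXT,                               -- searched -> normal column
    metadata     JSONB NOT NULL DEFAULT '{}'::jsonb  -- the long tail only
);

-- Search targets get an ordinary B-Tree index.
CREATE INDEX idx_users_company ON users (company_name);

-- Optional: a GIN index only if the JSONB itself must be queried.
CREATE INDEX idx_users_metadata ON users USING GIN (metadata);
```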

Data modeling pitfalls and forbidden moves

Here are the typical accidents in modeling. The schema is the DB’s skeleton, so it’s the most expensive area to fix later.

| Forbidden move | Why it’s bad |
|---|---|
| Physical delete with no history retained for business | Audit/legal requirements stop you instantly. History tables required for finance/medical/tax |
| Build all columns as NULLable | Constraints don’t work, data quality decays. Required items NOT NULL |
| Cram search targets into JSONB | Even GIN-indexed it’s slower than normal columns. Major attributes go in normal columns |
| Use UUID v4 as primary key for mass insert | Index locality degrades. Use UUID v7 (time-ordered) or BIGINT |
| Email/phone as primary key | Major refactor of all foreign keys on name/email change |
| Leave tmp_ / bak_ / old_ tables in production | Unclear “is it OK to delete?” debt years later |
| Drop all foreign-key constraints for performance | Integrity destroyed, orphan data accumulates. Add unless you’re advanced |
| Migrating via GUI by hand | History/reproducibility/rollback impossible. Manage in Git via Flyway / Prisma Migrate / Liquibase |
| Naming in Japanese romaji (chushin_jusho) | Neither AI nor humans can read it. English snake_case is standard |
| Place preventive indexes on every column | Every update updates every index, write latency. Place after measuring queries |
| Schema change during business hours | Production stops on lock. Maintenance window or online DDL |

UUID v7, finalized in 2024 as RFC 9562, is a new version that combines UUID v4’s randomness with a timestamp - v7 is the standard for new projects. Index performance degradation in legacy systems that adopted UUID v4 was a frequently reported problem in the 2020s.

Foreign-key constraints are the last bastion of integrity. Cutting them for performance is a textbook bad move.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, what matters in data modeling is whether the structure is one AI can read and understand. DBs with explicit schemas and table/column names that carry meaning as natural language let AI generate SQL accurately. Conversely, DBs full of abbreviations, meaningless naming, and JSON abuse drop AI generation accuracy.

| Favored in the AI era | Disfavored in the AI era |
|---|---|
| Clear 3NF structure, clear naming | JSON abuse, schema-less |
| Natural English naming (users, orders) | Abbreviations, Japanese-romaji naming |
| Explicit foreign-key constraints | No constraints; relations must be guessed |
| Comments describing business meaning | Bare tables without comments |

In an era where AI agents operate DBs via Text-to-SQL, “the schema itself becomes documentation.” The quality of naming and comments directly correlates with AI generation accuracy.
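One concrete way to make the schema self-documenting is database-level comments, which many Text-to-SQL tools feed into the model’s context. A PostgreSQL sketch with hypothetical names:

```sql
COMMENT ON TABLE orders IS
    'One row per customer order. Soft-deleted via deleted_at.';

COMMENT ON COLUMN orders.status IS
    'Order state: pending | paid | shipped | cancelled.';
```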

Schemas friendly to AI are friendly to humans. Explicit 3NF + natural English naming is strongest.

Common misconceptions

  • For now, JSON columns for flexibility is safe - JSON abuse is the schema-less trap. Search and indexes work poorly, breaking down with scale. Limit JSON to “the partial areas where schemas fluctuate”
  • UUID v4 is the only choice for IDs - v4 is too random and index efficiency drops. UUID v7 is the modern optimum that combines time order and randomness
  • Foreign-key constraints are annoying so cut them - foreign-key constraints are the last bastion of integrity. Cutting for performance is an advanced judgment; in early days, always add them
  • The more normalized the better - 4NF and beyond are academic. The practical answer is to stop at 3NF and “secure JOIN performance”

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are vague invites later questions like “why did we decide this again?”

  • Normalization level (3NF as base)
  • PK strategy (UUID v7 recommended)
  • Soft-delete policy (deleted_at column / history table)
  • Audit columns (created_at, updated_at, created_by)
  • Naming convention (English snake_case is standard)
  • Migration tool (Flyway, Prisma Migrate, etc.)
  • Initial set of indexes

How to make the final call

The core of data modeling is translating business essence into structure, not lining up columns shown on screen. Once a DB is in operation, schema changes become extremely high-cost - tables of tens of millions of rows require hours of downtime just to add a column. So firming the business skeleton in conceptual → logical → physical order, and preserving the basic forms of “business DB at 3NF, analytics DB at star schema,” is rational. Both excessive normalization (4NF+) and denormalization from the start are unneeded detours in practice.

The decisive axis is the question “is this a schema AI can read?” In an era where AI operates DBs via Text-to-SQL, the schema itself functions as the spec. A DB with natural English naming, explicit foreign-key constraints, and comments describing business meaning dramatically raises AI generation accuracy. Conversely, abbreviations, Japanese-romaji, and JSON abuse drop AI generation accuracy.

Selection priorities

  1. Business DB at 3NF - 4NF+ is academic; stop at 3NF and secure JOIN performance
  2. UUID v7 + soft delete + audit columns - the standard set for post-2024 new DBs
  3. Natural English naming + comments - raise AI generation accuracy, prepare for future Text-to-SQL
  4. Manage migrations in Git - keep history traceable via Flyway / Prisma Migrate

“Schemas readable by both AI and humans.” Explicit 3NF, naming conveys intent.

Summary

This article covered data modeling, including the 3 conceptual/logical/physical stages, 3NF and denormalization, primary-key design, indexes, schema-change strategy, soft delete, and history management.

Business DBs at 3NF, UUID v7 + soft delete + audit columns as the standard set, raise AI generation accuracy with natural English naming. That is the practical answer for data modeling in 2026.

Next time we’ll cover data platforms (DWH, data lake, BI integration).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.