About this article
As the third installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data modeling.
DBs outlive code, and changing the structure of a table that has grown to tens of millions of rows can require hours of downtime. The first round of modeling can stay in effect for a decade. This article covers the 3 stages of conceptual/logical/physical modeling, normalization and denormalization, primary-key design, indexes, schema-change strategy, and soft delete and history management, and shows how to build “schemas readable by both AI and humans.”
What data modeling handles
Once a DB is in operation, table-structure changes are extremely high-cost. Adding even one column to a table with tens of millions of rows can require hours of downtime, and modeling mistakes become long-term debt.
Apps can be rewritten. Data models cannot be rewritten - that’s reality.
Why modeling matters most
1. You must capture the essence of the business
Rather than just lining up “columns shown on screen,” you need to translate business rules and relationships into structure. Get this wrong and you can’t keep up with later business changes.
2. Impact ripples to surrounding systems
Business DBs are referenced from many directions - other systems, analytics platforms, reports, BI. The impact range of schema changes is wider than imagined, and careful design is needed.
3. Performance characteristics are decided in design
Decisions made at the modeling stage - indexes, normalization level, partition strategy - determine the future performance ceiling. Fixing them later requires major rework.
The 3 stages of modeling
Data modeling proceeds in 3 stages as a rule. Diving straight into table definitions (physical design) skips understanding the business essence and is a common source of breakdowns.
```mermaid
flowchart LR
BIZ([Business understanding])
CON["Conceptual model<br/>list business entities<br/>entity list / ER diagram"]
LOG["Logical model<br/>define attributes and relations<br/>normalization"]
PHY["Physical model<br/>implementation matched to DB product<br/>CREATE TABLE/INDEX"]
DB[(production DB)]
BIZ --> CON --> LOG --> PHY --> DB
BAD["Diving straight into physical<br/>= can't grasp business, breakdown"]
BAD -.->|antipattern| PHY
classDef biz fill:#fef3c7,stroke:#d97706;
classDef step fill:#dbeafe,stroke:#2563eb;
classDef db fill:#dcfce7,stroke:#16a34a;
classDef bad fill:#fee2e2,stroke:#dc2626;
class BIZ biz;
class CON,LOG,PHY step;
class DB db;
class BAD bad;
```
| Stage | What you do | Output |
|---|---|---|
| Conceptual model | List the “things” handled in the business | Entity list, ER diagram |
| Logical model | Define attributes and relations, normalize | Logical ER diagram, attribute spec |
| Physical model | Implementation form fit to the DB product | CREATE TABLE, indexes |
Large-scale projects clearly separate these 3 stages, but in small/mid-scale projects, combining logical and physical is realistic. Still, always do the conceptual model first.
What is normalization
Normalization is a design principle for eliminating data duplication and preventing inconsistency on update. The theory originates with E.F. Codd's relational model in the 1970s, and normal forms are defined from 1NF through 5NF. In practice, 3NF is the norm; anything beyond that is largely theoretical/academic territory.
For example, if you write a customer's name directly into the orders table, every order record must be updated when that customer's name changes. Splitting it out - a separate customer table, with the order holding only the customer ID - is normalization. You build a state where a single change keeps the whole dataset consistent.
| Stage | Content |
|---|---|
| 1NF (1st Normal Form) | Each cell has one value. Move repeating items to a separate table |
| 2NF (2nd Normal Form) | Separate columns that don’t depend on the entire primary key |
| 3NF (3rd Normal Form) | Separate columns that depend on non-key columns (transitive dependency) |
In practice, 3NF is enough. 4NF and 5NF are theory.
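As a minimal sketch of the customer/order split described above - assuming PostgreSQL, with illustrative table and column names - a 3NF layout looks like this:

```sql
-- Denormalized: customer_name copied into every order row, so a name
-- change must update all of that customer's orders.
-- CREATE TABLE orders (order_id BIGINT PRIMARY KEY, customer_name TEXT, ...);

-- Normalized (3NF): the name lives in exactly one place.
CREATE TABLE customers (
    customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers (customer_id),
    ordered_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    amount      NUMERIC(12, 2) NOT NULL
);

-- One UPDATE is enough to keep every order consistent.
UPDATE customers SET name = 'New Name' WHERE customer_id = 42;
```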
The thinking behind denormalization
Normalization excels at consistency, but has the side effect of JOINs increasing and read performance dropping. The deliberate move to “leave redundancy on purpose” is denormalization. In analytics DBs, denormalization is standard, and in business DBs it’s applied partially where read performance is needed.
| Normalized | Denormalized |
|---|---|
| High update consistency | Update consistency drops |
| JOIN-heavy, slow reads | No JOIN, fast reads |
| Suited for OLTP (business DB) | Suited for OLAP (analytics DB) |
| Less data duplication | More data duplication |
The basic approach for business DBs is to design at 3NF, then denormalize only where needed - and “denormalize from the start” is dangerous.
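One way to “leave redundancy on purpose” without touching the normalized source tables is a pre-joined copy for reads. A minimal sketch, assuming PostgreSQL and the customers/orders tables from the earlier sketch:

```sql
-- Read-optimized copy that pre-joins orders and customers, so report
-- queries need no JOIN. Staleness until the next refresh is the trade-off.
CREATE MATERIALIZED VIEW order_report AS
SELECT o.order_id,
       o.ordered_at,
       o.amount,
       c.customer_id,
       c.name AS customer_name   -- deliberately duplicated (denormalization)
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;

-- Refresh periodically (cron, the ETL job, etc.):
-- REFRESH MATERIALIZED VIEW order_report;
```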
Star schema (for analytics DBs)
The most-used analytics-DB schema is the star schema. With a “fact” table at the center and “dimension” tables around it - a star shape - it’s optimized for fast aggregation queries.
For sales analysis, for instance:
- Fact table: sales line items (timestamp, product ID, customer ID, amount)
- Dimension tables: product, customer, date, store
In this shape, multi-axis aggregation like “sales by region, month, product category” runs extremely fast. Almost all analytics platforms - BigQuery, Snowflake, Tableau - assume this shape.
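A minimal star-schema sketch for the sales example, written as PostgreSQL DDL (an actual DWH such as BigQuery or Snowflake has its own dialect, and the names here are illustrative):

```sql
-- Dimension tables: descriptive attributes, relatively small.
CREATE TABLE dim_product (
    product_id BIGINT PRIMARY KEY,
    category   TEXT NOT NULL,
    name       TEXT NOT NULL
);

CREATE TABLE dim_store (
    store_id BIGINT PRIMARY KEY,
    region   TEXT NOT NULL
);

-- Fact table: one row per sales line item, keyed to the dimensions.
CREATE TABLE fact_sales (
    sold_at    TIMESTAMPTZ NOT NULL,
    product_id BIGINT NOT NULL REFERENCES dim_product (product_id),
    store_id   BIGINT NOT NULL REFERENCES dim_store (store_id),
    amount     NUMERIC(12, 2) NOT NULL
);

-- "Sales by region, month, product category" becomes a simple star join.
SELECT s.region,
       date_trunc('month', f.sold_at) AS month,
       p.category,
       sum(f.amount) AS sales
FROM   fact_sales f
JOIN   dim_store s   ON s.store_id   = f.store_id
JOIN   dim_product p ON p.product_id = f.product_id
GROUP  BY 1, 2, 3;
```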
Business DB: 3NF. Analytics DB: star schema. Change shape per use case.
Primary key design
The primary key is the ID that uniquely identifies each record, and the choice strongly affects later design. In practice, surrogate keys (sequential numbers or UUIDs) are mainstream, and the iron rule is to avoid using business values (email, phone) as primary keys.
If you make a business value the primary key, when that value changes, all foreign references must be updated - which can’t withstand real business changes like name change or email change.
| PK type | Characteristics | Suited for |
|---|---|---|
| Sequential (BIGINT) | Small and fast, ordered | Business DB, internal use only |
| UUID v4 | Distributed-generation possible, unordered | Distributed systems, public-facing |
| UUID v7 | Time-ordered, distributed-generation possible | Post-2024 new DBs |
| Business value | Meaning is readable | Don’t use as a rule |
Today, UUID v7 (standardized in RFC 9562 in 2024) is regarded as the best balance of time order and randomness.
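A hedged sketch of the two recommended options. Note the assumption: a native uuidv7() function ships with PostgreSQL 18; on older versions you generate v7 IDs in the application or via an extension.

```sql
-- Option 1: BIGINT sequence - small, fast, ordered; fine for internal IDs.
CREATE TABLE accounts (
    account_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE   -- business value stays UNIQUE, never the PK
);

-- Option 2: UUID v7 - time-ordered, safe to expose, generatable anywhere.
CREATE TABLE external_accounts (
    account_id UUID PRIMARY KEY DEFAULT uuidv7(),   -- built-in from PostgreSQL 18
    email      TEXT NOT NULL UNIQUE
);
```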
For new-DB primary keys: UUID v7 or BIGINT. v4 invites index performance degradation.
Index design
Indexes are auxiliary data structures that speed up search - correctly placed they get hundreds of times faster, but place too many and updates slow down and the DB bloats. The basic policy is “place on columns used in WHERE and JOIN, not on others.”
Composite-index order matters: only left-prefix matching works. An index of (a, b, c) works for WHERE a=?, WHERE a=? AND b=?, and WHERE a=? AND b=? AND c=?, but not for WHERE b=? alone.
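A hedged illustration of the left-prefix rule (table and column names are for illustration only):

```sql
CREATE INDEX idx_orders_customer_status_date
    ON orders (customer_id, status, ordered_at);

-- Can use the index (left-prefix match):
--   WHERE customer_id = ?
--   WHERE customer_id = ? AND status = ?
--   WHERE customer_id = ? AND status = ? AND ordered_at >= ?
-- Cannot use it:
--   WHERE status = ?        -- skips the leading column
--   WHERE ordered_at >= ?   -- likewise
```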
| Index type | Use case |
|---|---|
| B-Tree | Standard, equality search, range search |
| Hash | Equality only (narrow use cases) |
| GIN / GiST | Full-text search, JSON, geospatial |
| Partial index | Conditional (excluding deleted etc.) |
Initially place the bare minimum, and add later only on actually-slow queries - that’s the rule.
Schema-change strategy (migration)
Schema changes on a running DB are the most accident-prone area. Column add/delete/type-change/constraint-change each have different danger levels, and the rule is to “change incrementally while preserving backward compatibility.”
| Change | Danger | Strategy |
|---|---|---|
| Column add (NULLable) | Low | Add directly |
| Column add (NOT NULL) | Mid | NULL first → backfill → NOT NULL |
| Column delete | Mid | Stop usage in code first → delete later |
| Type change | High | Add new column → migrate data → switch |
| Rename | High | Have both → phased switch |
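A minimal sketch of the “NULL first → backfill → NOT NULL” row above, assuming PostgreSQL and a hypothetical currency column:

```sql
-- Step 1: add the column as NULLable (a cheap change; no immediate rewrite
--         of existing rows in recent PostgreSQL/MySQL versions).
ALTER TABLE orders ADD COLUMN currency TEXT;

-- Step 2: backfill existing rows in batches to avoid long locks.
UPDATE orders SET currency = 'JPY'
WHERE currency IS NULL AND order_id BETWEEN 1 AND 100000;
-- ...repeat for the remaining ranges...

-- Step 3: once every row is filled, enforce the constraint.
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
ALTER TABLE orders ALTER COLUMN currency SET DEFAULT 'JPY';
```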
Use a migration tool (Flyway, Liquibase, Prisma Migrate, dbt, etc.) and manage history in Git - this is required. Modifying the DB by hand is an accident factory.
Soft delete and history management
How to handle deleted records is something the design must always settle. There are roughly 3 options, decided by business requirements and legal obligations.
| Method | Content | Suited for |
|---|---|---|
| Physical delete | Remove the record from the DB | Cache, logs, data with no audit need |
| Soft delete | Logical delete via deleted_at column | General business data |
| History table | Accumulate changes in a separate table | Finance, audit-target, with legal requirements |
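A minimal soft-delete sketch in PostgreSQL (the email column and names are illustrative):

```sql
-- NULL = alive, non-NULL = logically deleted.
ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMPTZ;

-- "Delete" without losing the row:
UPDATE customers SET deleted_at = now() WHERE customer_id = 42;

-- A view keeps the "WHERE deleted_at IS NULL" filter in one place.
CREATE VIEW active_customers AS
    SELECT * FROM customers WHERE deleted_at IS NULL;

-- Partial unique index: uniqueness applies only to live rows, so a
-- deleted account's email can be registered again.
CREATE UNIQUE INDEX uq_customers_email_active
    ON customers (email) WHERE deleted_at IS NULL;
```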
When personal-data deletion requests arise (GDPR etc.), physical delete becomes mandatory in some cases. Reconciling soft delete with the law requires nailing down requirements upfront.
For finance/medical/public, history tables are often legally mandatory - check business requirements.
Decision criteria
1. Business complexity
Required modeling depth varies with business complexity. Simple CRUD apps need only 3NF, but the more complex the business rules, the more you need modeling that’s conscious of DDD (Domain-Driven Design) aggregate boundaries.
| Complexity | Recommended approach |
|---|---|
| Simple CRUD (in-house tools, admin panels) | 3NF straightforwardly |
| Mid-size business (e-commerce, SaaS) | 3NF + aware of aggregate boundaries |
| Advanced business (finance, insurance, medical) | DDD + event/history retention |
2. Scale assumption
Assumed scale affects modeling. If you’ll handle massive data in the future, consider partitioning (internally splitting one table by time/customer ID into multiple regions) and sharding (horizontal distribution across multiple DBs) at initial design.
- Under 1M records: don’t worry about anything
- 10M+: consider partitioning major tables
- 100M+: partitioning/sharding required, also consider denormalization
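For the 10M+ case, a minimal sketch of PostgreSQL declarative range partitioning (table and column names illustrative):

```sql
-- Time-based range partitioning: one logical table, many physical regions.
CREATE TABLE events (
    event_id    UUID NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL,
    payload     JSONB,
    PRIMARY KEY (event_id, occurred_at)   -- the partition key must be part of the PK
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

-- Queries filtering on occurred_at touch only the relevant partitions, and
-- old data can be removed by dropping a partition instead of DELETEing rows.
```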
3. Team capability
Modeling depends on the team’s overall SQL/DB understanding. Advanced models (EAV (Entity-Attribute-Value), Polymorphic, event sourcing, etc.) become hell if the team can’t manage them.
| Team capability | Recommended level |
|---|---|
| SQL beginner, first-time DB design | Plain 3NF, no extra tricks |
| Intermediate, has DB design experience | 3NF + soft delete + history |
| Advanced, has DDD experience | DDD aggregates + event sourcing |
How to choose by case
In-house admin panel / simple CRUD
Design 3NF straightforwardly. UUID v7 + soft delete + audit columns (created_at, updated_at) is enough.
EC / SaaS business DB
3NF + history retention. Hold order history and price-change history in separate tables. Prepare an analytics DB separately and sync with ETL/ELT.
Finance / medical / public sector
History tables required, physical delete forbidden. Accumulate all changes in separate tables for audit response.
Analytics platform (DWH)
Star schema. ETL business-DB data, denormalize, and load. Follow BigQuery/Snowflake best practices.
Numerical gates: thresholds for tables and indexes
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
Tracking modeling quality by numbers is the modern standard. Below are the industry standard guidelines.
| Metric | Threshold | What to do if exceeded |
|---|---|---|
| Columns per table | 20 or fewer | Mixed responsibilities. Consider splitting |
| Rows per table (typical) | ~100M rows | Beyond that consider partitioning |
| Indexes per table | 5 or fewer | Update performance drops. Review |
| JOIN depth | Up to 3-4 levels | Beyond that denormalize or organize via views |
| Schema change (large table) | Within 1 hour, no downtime | Use online DDL / pg_repack etc. |
| Composite-index column count | 3 or fewer | Beyond that, narrow use cases |
| Proportion of NULLable columns | NOT NULL whenever possible | Make constraints work |
| PK type | UUID v7 or BIGINT | v4 degrades index performance |
Tables with “10M+ rows” can experience minutes-to-hours of locking during ALTER TABLE, so zero-downtime changes via PostgreSQL's pg_repack or MySQL's gh-ost / pt-online-schema-change are effectively mandatory. Operating with the expand/contract pattern (4 stages: add column → write to both → switch → drop the old column) is the standard.
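A hedged sketch of the expand/contract stages for a type change (order_items, item_id, and quantity are hypothetical names; the dual-write in stage 2 lives in application code or a trigger):

```sql
-- Goal: change quantity from INT to BIGINT on a large table without locking it.

-- Stage 1 (expand): add the new column next to the old one.
ALTER TABLE order_items ADD COLUMN quantity_v2 BIGINT;

-- Stage 2 (dual-write + backfill): the application writes both columns;
-- old rows are backfilled in small batches in the background.
UPDATE order_items SET quantity_v2 = quantity
WHERE quantity_v2 IS NULL AND item_id BETWEEN 1 AND 100000;
-- ...repeat for the remaining ranges...

-- Stage 3 (switch): reads move to quantity_v2 once the backfill is verified.

-- Stage 4 (contract): drop the old column in a later release.
ALTER TABLE order_items DROP COLUMN quantity;
```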
Numerical gates are the DB version of ESLint / SonarQube. Mechanically checkable.
Author’s note - the full-scan hell born of “metadata JSONB”
There's a story sometimes told about a business app where every attribute of a user profile was crammed into a single JSONB (PostgreSQL's binary JSON type, which supports indexed search) column called metadata. The motivation was “flexibility for future expansion,” but even a simple filter like “users whose company name contains X” ran a full scan every time, and as data grew the admin panel could no longer load.
I've also seen a similar system where a screen searching hundreds of thousands of users started taking 3+ seconds, and the problem only surfaced at that point. Even with stopgap GIN indexes, performance stayed far behind normal columns, and the story usually ends with the punchline: “we eventually spent six months on a major refactor moving the major attributes into normal columns.”
A related case: adopting UUID v4 as the primary key, so that as records grew the index's physical placement became scattered and writes got heavy - the degradation became prominent at the tens-of-millions scale. Since 2024, UUID v7 (time-ordered) has been standardized, and this problem can be avoided at the design stage. The lesson common to both cases: the line between “flexible” and “sloppy” is paper-thin.
Limit JSON to the parts where the schema genuinely fluctuates. Split searched attributes out into normal columns.
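A minimal sketch of that split, assuming PostgreSQL and a hypothetical users.metadata column:

```sql
-- Before: everything lives in one JSONB column, so even simple filters
-- read and parse the whole document for every row.
-- SELECT * FROM users WHERE metadata->>'company_name' = 'Acme';

-- After: promote the frequently searched attribute to a plain column.
ALTER TABLE users ADD COLUMN company_name TEXT;
UPDATE users SET company_name = metadata->>'company_name';

CREATE INDEX idx_users_company_name ON users (company_name);
-- For "contains" searches, a trigram index on the plain column helps:
-- CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- CREATE INDEX idx_users_company_trgm ON users USING gin (company_name gin_trgm_ops);

-- Keep metadata only for the long tail of rarely queried, schema-fluid fields.
```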
Data modeling pitfalls and forbidden moves
Here are the typical accidents in modeling. The schema is the DB’s skeleton, so it’s the most expensive area to fix later.
| Forbidden move | Why it’s bad |
|---|---|
| Physical delete with no history retained for business | Audit/legal requirements stop you instantly. History tables required for finance/medical/tax |
| Build all columns as NULLable | Constraints don’t work, data quality decays. Required items NOT NULL |
| Cram search targets into JSONB | Even GIN-indexed it’s slower than normal columns. Major attributes go in normal columns |
| Use UUID v4 as primary key for mass insert | Index locality degrades. Use UUID v7 (time-ordered) or BIGINT |
| Email/phone as primary key | Major refactor of all foreign keys on name/email change |
| Leave tmp_ / bak_ / old_ tables in production | Unclear “is it OK to delete?” debt years later |
| Drop all foreign-key constraints for performance | Integrity destroyed, orphan data accumulates. Add unless you’re advanced |
| Migrating via GUI by hand | History/reproducibility/rollback impossible. Manage in Git via Flyway / Prisma Migrate / Liquibase |
| Naming in Japanese romaji (chushin_jusho) | Neither AI nor humans can read it. English snake_case is standard |
| Place preventive indexes on every column | Every update updates every index, write latency. Place after measuring queries |
| Schema change during business hours | Production stops on lock. Maintenance window or online DDL |
UUID v7, finalized in RFC 9562 in 2024, is a new version that combines UUID v4-style randomness with a timestamp prefix - v7 is the standard for new projects. Index performance degradation in legacy systems that adopted UUID v4 was a frequently reported problem in the 2020s.
Foreign-key constraints are the last bastion of integrity. Cutting them for performance is a textbook bad move.
AI-era perspective
When AI-driven development (vibe coding) and AI usage are the premise, what matters in data modeling is producing a structure AI can read and understand. DBs with explicit schemas and table/column names that read as natural language let AI generate SQL accurately. Conversely, abbreviations, meaningless naming, and JSON abuse drop AI generation accuracy.
| Favored in the AI era | Disfavored in the AI era |
|---|---|
| Clear 3NF structure, clear naming | JSON abuse, schema-less |
| Natural English naming (users, orders) | Abbreviations, Japanese-romaji naming |
| Explicit foreign-key constraints | No constraints, relations need to be guessed |
| Comments describing business meaning | Bare tables without comments |
In an era where AI agents operate DBs via Text-to-SQL, “the schema itself becomes documentation.” The quality of naming and comments directly correlates with AI generation accuracy.
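A hedged sketch of schema comments in PostgreSQL (the wording is illustrative); many Text-to-SQL setups feed these comments to the model together with the DDL:

```sql
-- Table/column comments become part of the schema that a Text-to-SQL
-- agent (or a new teammate) can read directly.
COMMENT ON TABLE orders IS
    'One row per confirmed customer order';
COMMENT ON COLUMN orders.ordered_at IS
    'Timestamp when the customer completed checkout (UTC)';
COMMENT ON COLUMN orders.customer_id IS
    'FK to customers; IDs are never reused';
```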
Schemas friendly to AI are friendly to humans. Explicit 3NF + natural English naming is strongest.
Common misconceptions
- “JSON columns give us flexibility, so they're safe for now” - JSON abuse is the schema-less trap. Search and indexes work poorly, breaking down with scale. Limit JSON to “the parts where the schema genuinely fluctuates”
- “UUID v4 is the only choice for IDs” - v4 is too random and index efficiency drops. UUID v7 is the modern optimum that combines time order and randomness
- “Foreign-key constraints are annoying, so cut them” - foreign-key constraints are the last bastion of integrity. Cutting them for performance is an advanced judgment; in the early days, always add them
- “The more normalized the better” - 4NF and beyond are academic. The practical answer is to stop at 3NF and secure JOIN performance
What to decide - what is your project’s answer?
For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these are still vague invites later questions like “why did we decide this again?”
- Normalization level (3NF as base)
- PK strategy (UUID v7 recommended)
- Soft-delete policy (deleted_at column / history table)
- Audit columns (created_at, updated_at, created_by)
- Naming convention (English snake_case is standard)
- Migration tool (Flyway, Prisma Migrate, etc.)
- Initial set of indexes
How to make the final call
The core of data modeling is translating business essence into structure, not lining up columns shown on screen. Once a DB is in operation, schema changes become extremely high-cost - tables of tens of millions of rows require hours of downtime just to add a column. So firming the business skeleton in conceptual → logical → physical order, and preserving the basic forms of “business DB at 3NF, analytics DB at star schema,” is rational. Both excessive normalization (4NF+) and denormalization from the start are unneeded detours in practice.
The decisive axis is the question “is this a schema AI can read?” In an era where AI operates DBs via Text-to-SQL, the schema itself functions as the spec. A DB with natural English naming, explicit foreign-key constraints, and comments describing business meaning dramatically raises AI generation accuracy. Conversely, abbreviations, Japanese-romaji, and JSON abuse drop AI generation accuracy.
Selection priorities
- Business DB at 3NF - 4NF+ is academic; stop at 3NF and secure JOIN performance
- UUID v7 + soft delete + audit columns - the standard set for post-2024 new DBs
- Natural English naming + comments - raise AI generation accuracy, prepare for future Text-to-SQL
- Manage migrations in Git - keep history traceable via Flyway / Prisma Migrate
“Schemas readable by both AI and humans.” Explicit 3NF, naming conveys intent.
Summary
This article covered data modeling, including the 3 conceptual/logical/physical stages, 3NF and denormalization, primary-key design, indexes, schema-change strategy, soft delete, and history management.
Business DBs at 3NF, UUID v7 + soft delete + audit columns as the standard set, raise AI generation accuracy with natural English naming. That is the practical answer for data modeling in 2026.
Next time we’ll cover data platforms (DWH, data lake, BI integration).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (41/89)