About this article
As the third installment of the âData Architectureâ category in the series âArchitecture Crash Course for the Generative-AI Era,â this article explains data modeling.
DBs outlive code, and changing the structure of a table grown to tens of millions of rows can require hours of downtime. The first round of modeling stays effective for a decade. This article covers the 3 stages of conceptual/logical/physical modeling, normalization and denormalization, primary-key design, indexes, schema-change strategy, and soft delete and history management - presenting how to build âschemas readable by both AI and humans.â
What is data modeling in the first place
In a nutshell, data modeling is âdeciding the organizational rules for the data your application handles.â
Imagine packing for a move. If you just shove clothes, dishes, and documents into cardboard boxes, you wonât know where anything is at the new place. But if you set classification rules - âclothes go in garment cases, dishes in cushioned boxes, documents in file boxesâ - anyone can quickly retrieve what they need. Data modeling works the same way: itâs the process of designing âin what structure and with what relationships to storeâ the information handled by the business. If this design is sloppy, youâll later face slow searches, inability to aggregate, and changes breaking everything.
What data modeling handles
Once a DB is in operation, table-structure changes are extremely high-cost. Adding even one column to a table with tens of millions of rows can require hours of downtime, and modeling mistakes become long-term debt.
Apps can be rewritten. Data models cannot be rewritten - thatâs reality.
Why modeling matters most
1. You must capture the essence of the business
Rather than just lining up âcolumns shown on screen,â you need to translate business rules and relationships into structure. Get this wrong and you canât keep up with later business changes.
2. Impact ripples to surrounding systems
Business DBs are referenced from many directions - other systems, analytics platforms, reports, BI. The impact range of schema changes is wider than imagined, and careful design is needed.
3. Performance characteristics are decided in design
Decisions made at the modeling stage - indexes, normalization level, partition strategy - decide the future performance ceiling. Reaching back later requires major rewrites.
The 3 stages of modeling
Data modeling proceeds in 3 stages as a rule. Diving straight into table definitions (physical design) means you canât grasp the business essence and is the source of breakdowns.
| Stage | What you do | Output |
|---|---|---|
| Conceptual model | List the âthingsâ handled in the business | Entity list, ER diagram |
| Logical model | Define attributes and relations, normalize | Logical ER diagram, attribute spec |
| Physical model | Implementation form fit to the DB product | CREATE TABLE, indexes |
Large-scale projects clearly separate these 3 stages, but in small/mid-scale projects, combining logical and physical is realistic. Still, always do the conceptual model first.
What is normalization
Normalization is a design principle for eliminating data duplication and preventing inconsistency on update. Itâs a theory proposed by E.F. Codd in the 1970s, with normal forms from 1NF to 5NF. In practice, 3NF is general; beyond that is theoretical/academic territory.
For example, writing a customerâs name into the âorders tableâ means updating all order records when the customerâs name changes. Splitting it into âmake a customer table separately, the order only holds customer IDâ is normalization. You build a state where one change keeps the whole consistent.
| Stage | Content |
|---|---|
| 1NF (1st Normal Form) | Each cell has one value. Move repeating items to a separate table |
| 2NF (2nd Normal Form) | Separate columns that donât depend on the entire primary key |
| 3NF (3rd Normal Form) | Separate columns that depend on non-key columns (transitive dependency) |
In practice, 3NF is enough. 4NF and 5NF are theory.
The thinking behind denormalization
Normalization excels at consistency, but has the side effect of JOINs increasing and read performance dropping. The deliberate move to âleave redundancy on purposeâ is denormalization. In analytics DBs, denormalization is standard, and in business DBs itâs applied partially where read performance is needed.
| Normalized | Denormalized |
|---|---|
| High update consistency | Update consistency drops |
| JOIN-heavy, slow reads | No JOIN, fast reads |
| Suited for OLTP (business DB) | Suited for OLAP (analytics DB) |
| Less data duplication | More data duplication |
The basic motion for business DBs is design at 3NF, then denormalize only where needed - and âdenormalize from the startâ is dangerous.
Star schema (for analytics DBs)
The most-used analytics-DB schema is the star schema. With a âfactâ table at the center and âdimensionâ tables around it - a star shape - itâs optimized for fast aggregation queries.
For sales analysis, for instance:
- Fact table: sales line items (timestamp, product ID, customer ID, amount)
- Dimension tables: product, customer, date, store
In this shape, multi-axis aggregation like âsales by region, month, product categoryâ runs extremely fast. Almost all analytics platforms - BigQuery, Snowflake, Tableau - assume this shape.
Business DB: 3NF. Analytics DB: star schema. Change shape per use case.
Primary key design
The primary key is the ID uniquely identifying each record, and the choice strongly affects later design. In practice, auto-numbered surrogate keys (sequential, UUID) are mainstream, and the iron rule is to avoid using business values (email, phone) as primary keys.
If you make a business value the primary key, when that value changes, all foreign references must be updated - which canât withstand real business changes like name change or email change.
| PK type | Characteristics | Suited for |
|---|---|---|
| Sequential (BIGINT) | Small and fast, ordered | Business DB, internal use only |
| UUID v4 | Distributed-generation possible, unordered | Distributed systems, public-facing |
| UUID v7 | Time-ordered, distributed-generation possible | Post-2024 new DBs |
| Business value | Meaning is readable | Donât use as a rule |
Today, UUID v7 (RFC-finalized in 2024) is evaluated as the optimum that combines time order and randomness.
For new-DB primary keys: UUID v7 or BIGINT. v4 invites index performance degradation.
Index design
Indexes are auxiliary data structures that speed up search - correctly placed they get hundreds of times faster, but place too many and updates slow down and the DB bloats. The basic policy is âplace on columns used in WHERE and JOIN, not on others.â
Composite-index order matters: only left-prefix matching works. An index of (a, b, c) works for WHERE a=?, WHERE a=? AND b=?, and WHERE a=? AND b=? AND c=?, but not for WHERE b=? alone.
| Index type | Use case |
|---|---|
| B-Tree | Standard, equality search, range search |
| Hash | Equality only (narrow use cases) |
| GIN / GiST | Full-text search, JSON, geospatial |
| Partial index | Conditional (excluding deleted etc.) |
Initially place the bare minimum, and add later only on actually-slow queries - thatâs the rule.
Schema-change strategy (migration)
Schema changes on a running DB are the most accident-prone area. Column add/delete/type-change/constraint-change each have different danger levels, and the rule is to âchange incrementally while preserving backward compatibility.â
| Change | Danger | Strategy |
|---|---|---|
| Column add (NULLable) | Low | Add directly |
| Column add (NOT NULL) | Mid | NULL first â backfill â NOT NULL |
| Column delete | Mid | Stop usage in code first â delete later |
| Type change | High | Add new column â migrate data â switch |
| Rename | High | Have both â phased switch |
Use a migration tool (Flyway, Liquibase, Prisma Migrate, dbt, etc.) and manage history in Git - this is required. Modifying the DB by hand is an accident factory.
Soft delete and history management
How to handle deleted records is something the design must always settle. There are roughly 3 options, decided by business requirements and legal obligations.
| Method | Content | Suited for |
|---|---|---|
| Physical delete | Remove the record from the DB | Cache, logs, data with no audit need |
| Soft delete | Logical delete via deleted_at column | General business data |
| History table | Accumulate changes in a separate table | Finance, audit-target, with legal requirements |
When personal-data deletion requests arise (GDPR etc.), physical delete becomes mandatory in some scenes. Reconciling soft delete and law requires nailing down requirements upfront.
For finance/medical/public, history tables are often legally mandatory - check business requirements.
Decision criteria
1. Business complexity
Required modeling depth varies with business complexity. Simple CRUD apps need only 3NF, but the more complex the business rules, the more you need modeling thatâs conscious of DDD aggregate boundaries.
| Complexity | Recommended approach |
|---|---|
| Simple CRUD (in-house tools, admin panels) | 3NF straightforwardly |
| Mid-size business (e-commerce, SaaS) | 3NF + aware of aggregate boundaries |
| Advanced business (finance, insurance, medical) | DDD + event/history retention |
2. Scale assumption
Assumed scale affects modeling. If youâll handle massive data in the future, consider partitioning (internally splitting one table by time/customer ID into multiple regions) and sharding (horizontal distribution across multiple DBs) at initial design.
- Under 1M records: donât worry about anything
- 10M+: consider partitioning major tables
- 100M+: partitioning/sharding required, also consider denormalization
3. Team capability
Modeling depends on the teamâs overall SQL/DB understanding. Advanced models (EAV (Entity-Attribute-Value), Polymorphic, event sourcing, etc.) become hell if the team canât manage them.
| Team capability | Recommended level |
|---|---|
| SQL beginner, first-time DB design | Plain 3NF, no extra tricks |
| Intermediate, has DB design experience | 3NF + soft delete + history |
| Advanced, has DDD experience | DDD aggregates + event sourcing |
How to choose by case
In-house admin panel / simple CRUD
Design 3NF straightforwardly. UUID v7 + soft delete + audit columns (created_at, updated_at) is enough.
EC / SaaS business DB
3NF + history retention. Hold order history and price-change history in separate tables. Prepare an analytics DB separately and sync with ETL/ELT.
Finance / medical / public sector
History tables required, physical delete forbidden. Accumulate all changes in separate tables for audit response.
Analytics platform (DWH)
Star schema. ETL business-DB data, denormalize, and load. Follow BigQuery/Snowflake best practices.
Numerical gates and index thresholds for tables
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
Tracking modeling quality by numbers is the modern standard. Below are the industry standard guidelines.
| Metric | Threshold | What to do if exceeded |
|---|---|---|
| Columns per table | 20 or fewer | Mixed responsibilities. Consider splitting |
| Rows per table (typical) | ~100M rows | Beyond that consider partitioning |
| Indexes per table | 5 or fewer | Update performance drops. Review |
| JOIN depth | Up to 3-4 levels | Beyond that denormalize or organize via views |
| Schema change (large table) | Within 1 hour, no downtime | Use online DDL / pg_repack etc. |
| Composite-index column count | 3 or fewer | Beyond that, narrow use cases |
| Proportion of NULLable columns | NOT NULL whenever possible | Make constraints work |
| PK type | UUID v7 or BIGINT | v4 degrades index performance |
Tables with â10M+ rowsâ experience minutes-to-hours of lock during ALTER TABLE, so non-stop changes via PostgreSQL pg_repack / MySQL gh-ost or pt-online-schema-change are mandatory. Operating with the expand/contract pattern (4 stages: column add â write to both â switch â drop old column) is the standard.
Numerical gates are the DB version of ESLint / SonarQube. Mechanically checkable.
Authorâs note - the full-scan hell born of âmetadata JSONBâ
Thereâs a story sometimes told about a business app where every attribute of a user profile was crammed into one JSONB (JSON Binary, the JSON storage type with fast indexed search) column called metadata. The motivation was âflexibility for future expansion,â but just filtering âusers where company name contains Xâ started running a full scan every time, and the admin panel became unable to load as data grew.
Iâve also seen a similar site where a UI searching hundreds of thousands of users started taking 3+ seconds, and the problem only surfaced then. Even with first-aid GIN indexes, performance was far slower than normal columns, and the case is often told paired with the punchline: âwe eventually spent six months on a major refactor separating major attributes into normal columns.â
A similar case: adopting UUID v4 as PK, with records growing the âphysical placement of the index becoming dispersed,â and writes getting heavy - performance degradation became prominent at the tens-of-millions scale. Since 2024, UUID v7 (with time order) has been standardized, and this problem can be avoided at the design stage. The lesson common to both cases is âflexibleâ and âsloppyâ are paper-thin apart.
Limit JSON to the partial areas where schemas fluctuate. Search targets get split into normal columns.
Data modeling pitfalls and forbidden moves
Here are the typical accidents in modeling. The schema is the DBâs skeleton, so itâs the most expensive area to fix later.
| Forbidden move | Why itâs bad |
|---|---|
| Physical delete with no history retained for business | Audit/legal requirements stop you instantly. History tables required for finance/medical/tax |
| Build all columns as NULLable | Constraints donât work, data quality decays. Required items NOT NULL |
| Cram search targets into JSONB | Even GIN-indexed itâs slower than normal columns. Major attributes go in normal columns |
| Use UUID v4 as primary key for mass insert | Index locality degrades. Use UUID v7 (time-ordered) or BIGINT |
| Email/phone as primary key | Major refactor of all foreign keys on name/email change |
Leave tmp_ / bak_ / old_ tables in production | Unclear âis it OK to delete?â debt years later |
| Drop all foreign-key constraints for performance | Integrity destroyed, orphan data accumulates. Add unless youâre advanced |
| Migrating via GUI by hand | History/reproducibility/rollback impossible. Manage in Git via Flyway / Prisma Migrate / Liquibase |
Naming in Japanese romaji (chushin_jusho) | Neither AI nor humans can read. English snake_case is standard |
| Place preventive indexes on every column | Every update updates every index, write latency. Place after measuring queries |
| Schema change during business hours | Production stops on lock. Maintenance window or online DDL |
| Over-normalizing on the assumption âthe more normalized the betterâ | 4NF and beyond are academic; JOIN proliferation causes performance problems in practice |
| Abusing JSON columns assuming âflexible JSON is safeâ | Limit JSON to the partial areas where schemas fluctuate; search breaks down at scale |
The 2024 RFC-finalization of UUID v7 (RFC 9562) is a new version that combines UUID v4âs randomness with timestamps - v7 is standard for new projects. Index performance degradation in legacy systems that adopted UUID v4 was a frequently reported problem in the 2020s.
Foreign-key constraints are the last bastion of integrity. Cutting them for performance is a textbook bad move.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Clear 3NF structure, clear naming | JSON abuse, schema-less |
| Natural English naming (users, orders) | Abbreviations, Japanese-romaji naming |
| Explicit foreign-key constraints | No constraints, relations need to be guessed |
| Comments describing business meaning | Bare tables without comments |
- Business DB at 3NF â 4NF+ is academic; stop at 3NF and secure JOIN performance.
- UUID v7 + soft delete + audit columns â the standard set for post-2024 new DBs.
- Natural English naming + comments â raise AI generation accuracy, prepare for future Text-to-SQL.
- Manage migrations in Git â keep history traceable via Flyway / Prisma Migrate.
English naming + COMMENTs decisively change AI-generated SQL accuracy
When table names use natural English like users, orders, order_items, and each column has COMMENTs (âorder confirmed date,â âtax-included amount,â etc.), AI can generate accurate queries via Text-to-SQL. With abbreviated naming like tbl_001 or kbn_cd, AI canât infer column meanings and SQL becomes inaccurate.
Explicit foreign key constraints help AI generate JOINs
When foreign key constraints are explicit in the DB, AI accurately grasps inter-table relationships and doesnât mistake JOIN conditions. When relationships are managed only in app code without constraints, AI must guess âwhich table to JOIN,â increasing the possibility of writing incorrect JOIN conditions.
What to decide - what is your projectâs answer?
For each of the following, try to articulate your projectâs answer in 1-2 sentences. Starting work with these vague always invites later questions like âwhy did we decide this again?â
- Normalization level (3NF as base)
- PK strategy (UUID v7 recommended)
- Soft-delete policy (deleted_at column / history table)
- Audit columns (created_at, updated_at, created_by)
- Naming convention (English snake_case is standard)
- Migration tool (Flyway, Prisma Migrate, etc.)
- Initial set of indexes
Related Articles
https://en.senkohome.com/arch-intro-data-overview/ https://en.senkohome.com/arch-intro-data-platform/ https://en.senkohome.com/arch-intro-index-data/
Summary
This article covered data modeling, including the 3 conceptual/logical/physical stages, 3NF and denormalization, primary-key design, indexes, schema-change strategy, soft delete, and history management.
Business DBs at 3NF, UUID v7 + soft delete + audit columns as the standard set, raise AI generation accuracy with natural English naming. That is the practical answer for data modeling in 2026.
Next time weâll cover data platforms (DWH, data lake, BI integration).
I hope youâll read the next article as well.
đ Series: Architecture Crash Course for the Generative-AI Era (41/89)