[Data Architecture] Data Store Selection - RDB-Centric + Use-Case-Specific

About this article

As the second installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data store selection.

In 2026, using multiple stores per app for different use cases has become normal. This article specializes in app-level detailed selection, presenting the strengths and weaknesses of RDB/KVS/document/columnar/time-series/search/vector and a phased matrix by data volume x use case (the system-wide placement strategy lives in the separate “System Architecture” article).

This article’s coverage

Data store selection splits into two stages: system-wide placement strategy and individual app detailed selection. This article specializes in the latter - the practical work of “choosing a concrete data store per app.”

Article	Coverage
10/06 Data store	System-wide placement strategy / Polyglot Persistence / overall map of scaling strategies
This article	Individual app detailed selection / early decision flow / phased selection by data volume x use case
40/02 Data modeling	Table design, normalization, primary keys, indexes
40/03 Data platform	DWH / data lake / BI integration

The question of this article is “which DB type fits this app?” Whole-system placement is handled in the system chapter.

Data stores are “houses you can’t move out of”

Choose wrong and you hit the fundamental issues of no speed, no scale, dropping dev efficiency. On the other hand, the categories are too many to scan, so the realistic approach is to first understand the shape of business requirements and then pick the appropriate category.

Data stores are the foundation of all application design. Without a settled DB, entity design, API design, and transaction boundaries can’t be settled. Furthermore, once operations begin, data migration is extremely high-cost - “changing later is practically impossible” as a decision. On top of that, data stores are tightly coupled to cloud vendors - even between “relatively close products” like AWS Aurora to GCP Cloud SQL, the bar is high.

“For now, RDB” is correct in many cases. But knowing the exceptions is the architect’s job.

Main classification

flowchart TB
    DATA{What's the<br/>nature of the data?}
    DATA -->|structured, integrity required| RDB[RDB<br/>PostgreSQL/MySQL]
    DATA -->|simple Key->Value<br/>ultra-fast| KVS[KVS<br/>Redis/DynamoDB]
    DATA -->|flexible schema, JSON| DOC[Document<br/>MongoDB/Firestore]
    DATA -->|aggregation/analysis| OLAP[Columnar OLAP<br/>BigQuery/Snowflake]
    DATA -->|time-series/sensors| TS[Time-series DB<br/>InfluxDB/Timescale]
    DATA -->|full-text search| SE[Search engine<br/>Elasticsearch]
    DATA -->|similarity/AI| VEC[Vector DB<br/>pgvector/Pinecone]
    classDef root fill:#fef3c7,stroke:#d97706;
    classDef rdb fill:#dbeafe,stroke:#2563eb;
    classDef kvs fill:#fee2e2,stroke:#dc2626;
    classDef doc fill:#fae8ff,stroke:#a21caf;
    classDef olap fill:#dcfce7,stroke:#16a34a;
    classDef other fill:#f0f9ff,stroke:#0369a1;
    class DATA root;
    class RDB rdb;
    class KVS kvs;
    class DOC doc;
    class OLAP olap;
    class TS,SE,VEC other;

Category	Strength	Representatives
RDB (relational DB)	ACID integrity, JOIN, mature	PostgreSQL, MySQL
KVS (Key-Value)	Ultra-fast single-key lookup	Redis, DynamoDB
Document DB	Flexible schema, direct JSON storage	MongoDB, Firestore
Columnar DB (OLAP)	Aggregation/analysis orders of magnitude faster	BigQuery, Snowflake
Time-series DB	Accumulating metrics/sensor values	InfluxDB, TimescaleDB
Search engine	Full-text search, faceted search	Elasticsearch, OpenSearch
Vector DB	Similarity search, AI embeddings	pgvector, Pinecone

Early decision flow

Going down the questions in order, your top candidate gets pinned for your use case.

ACID integrity required?
  └ Yes → RDB (PostgreSQL)
  └ No  → ultra-fast on a single key?
            └ Yes → KVS (Redis/DynamoDB)
            └ No  → schema fluctuates?
                      └ Yes → Document DB
                      └ No  → mass-data aggregation/analysis?
                                └ Yes → Columnar DB (BigQuery/Snowflake)
                                └ No  → similarity search/AI embeddings?
                                          └ Yes → Vector DB (pgvector)
                                          └ No  → full-text search?
                                                    └ Yes → OpenSearch
                                                    └ No  → Time-series DB

The default of PostgreSQL + cache (Redis) as needed in two layers covers 90% of systems.

RDB (relational database)

The most traditional and powerful data store, handling row-and-column structured data via SQL. ACID integrity (Atomicity, Consistency, Isolation, Durability) is guaranteed, making it the standard for money calculations and inventory management - work that doesn’t tolerate mistakes. The flexibility of combining multiple tables with JOIN is unmatched by other DBs.

Weaknesses are “horizontal scaling (boosting performance by adding servers) is hard” and “schema changes are heavy,” but cloud-era managed RDBs (Aurora, Cloud SQL) have largely solved these, and at mid-scale and below, RDB is the only choice - safe to say.

Pros	Cons
Robust ACID integrity	Horizontal scaling is hard
Flexible data fetching with JOIN	Schema-change costs are high
Standard SQL with rich learning resources	Not great with unstructured data
Mature ecosystem	Performance degrades with massive data

Representatives: PostgreSQL, MySQL, SQL Server, Oracle

Without a special reason: PostgreSQL. It comes with full-text search, JSON, and geospatial as standard - very rare to be in trouble.

KVS (Key-Value Store)

A DB specialized for the simple structure of getting one value from one key. In exchange for losing search and JOIN, it has sub-millisecond response and near-infinite horizontal scalability. It shines where “lots of simple lookups” arise - sessions, cache, rankings, rate limiting.

There are also “KVS usable as the main DB” like DynamoDB, but JOIN and transaction constraints are severe, so it’s unsuitable as the main DB for business systems. The standard pattern in many sites is to use Redis as a cache layer alongside RDB.

Pros	Cons
Ultra-fast read/write	No JOIN, no complex search
Easy horizontal scaling	Few schema constraints
Simple to implement	Hard to keep consistency across multiple keys
Excellent as cache	High design difficulty as main DB

Representatives: Redis, Memcached, DynamoDB

For cache layer, Redis is the only choice. As main DB, DynamoDB is an option, but design carefully.

Document DB

A DB that stores JSON-format documents as is. Without pre-defining schemas, records with different structures can be stored - suitable for early phases with frequent spec changes or accumulating log data with undefined structure. The convenience of saving app-side objects directly is also attractive.

On the other hand, schemas tend to become vague, and problems like dropping data quality and JOIN-less, hard-to-design are likely - the larger the scale, the easier RDB becomes. After the MongoDB heyday of the 2010s, the modern trend is “substituting with PostgreSQL JSONB (JSON Binary, a JSON storage type with fast indexed search)” in growing numbers of cases.

Pros	Cons
Schema-less and flexible	Data quality drops easily
Save JSON as-is	Bad at JOIN
Fast initial development	Breaks down at scale
Natural representation of nested data	Limited ACID guarantees (depends on product)

Representatives: MongoDB, Firestore, CouchDB

The mainstream today is substituting with PostgreSQL’s JSONB type. Pure document DBs are seeing fewer use cases.

Columnar DB (OLAP)

A DB with structure specialized for analytical queries. Because it stores data column-wise rather than row-wise, aggregation like “summing sales over 10 million rows” runs orders of magnitude faster (seconds → milliseconds) - used as the heart of DWHs (data warehouses).

Modern composition is to use a business DB (RDB) and analysis DB (columnar) together, moving data from business DB to analysis DB via ETL/ELT (Extract/Transform/Load, the mechanism for transferring data to another DB). Running analytics on the RDB drops business-side performance, so this separation has become the de facto standard.

Pros	Cons
Aggregation/analysis ultra-fast	Bad at per-record updates
Scales to TB-PB class	Unsuitable for real-time updates
SQL-analyzable	Initial cost may be high
Cost-effective due to column compression	Cannot be used for OLTP

Representatives: BigQuery, Snowflake, Redshift, ClickHouse

The analytics platform is between BigQuery or Snowflake. Latecomer ClickHouse is rising for cost.

Time-series DB, search, vector

Use-case-specific emerging categories - basics is to adopt for specific use cases, not as the lead role. Cases of using one alone as the main DB are rare; introduce as a complement to RDB or columnar DB.

Category	Use case	Representatives
Time-series DB	Metrics, IoT sensors, stock prices	InfluxDB, TimescaleDB, Prometheus
Search engine	Full-text search, log search, faceted	Elasticsearch, OpenSearch
Vector DB	Similarity search of AI embeddings, RAG	pgvector, Pinecone, Weaviate

Vector DBs are an area that surged with the LLM (Large Language Model) / RAG craze, with options spanning from existing-DB extensions (pgvector) to dedicated products (Pinecone).

For ranges substitutable by PostgreSQL extensions (pgvector etc.), it’s better not to introduce a new vector DB. The advantage of doing it in one DB is large.

Decision criteria

1. Nature of the data

Data nature	Suited category
Structured, with relations (transactions, customers)	RDB
Simple key lookup (sessions, cache)	KVS
Flexible JSON structure (logs, settings)	RDB JSONB or Document DB
Mass aggregation/analysis	Columnar DB (DWH)
Continuous time-aligned data	Time-series DB
Full-text search, scoring	Search engine
AI embedding vectors	Vector DB

The default of RDB + auxiliary DB as needed is the safest. The “add a dedicated DB only for parts RDB can’t handle” approach is the practical optimum.

2. Scale and growth

Scale	Typical composition
Small (~1M records)	RDB only
Mid (~hundreds of millions)	RDB + Redis (cache)
Large (billions+)	RDB + Redis + columnar DW
Super-large (multi-region, multi-PB)	RDB + KVS + columnar + search + time-series

“Multi-DB from the start is over-design” - but you also need to avoid two-way mismatch: “scale is visible, RDB-only is insufficient.”

3. Team skills

DB operations are extremely important. The iron rule is “never introduce a DB no one can operate” - introducing a DB the team isn’t familiar with means inability to recover during outages, putting business continuity at risk.

PostgreSQL and MySQL are the standards in the sense that “lots of people know them.” Emerging DBs (CockroachDB, Neon, PlanetScale, etc.) have attractive features, but you need to “weigh adoption track records and in-house skills.”

“Can the team operate it?” is the first criterion of tech selection. You can’t copy Netflix just because they use it.

How to choose by case

Personal dev / small SaaS MVP

PostgreSQL alone. JSONB, full-text search, and geospatial in one shot, with very few cases of needing to add a dedicated DB. For managed: rich options like Supabase, Neon, RDS.

Mid-size business app (SaaS, in-house systems)

PostgreSQL + Redis. Main data ensures integrity in RDB, sessions and cache get speed from Redis. This composition holds up to a few million users.

B2C with full-text search important (e-commerce, job sites)

PostgreSQL + Redis + Elasticsearch/OpenSearch. Leave product search and faceted filtering to the dedicated engine. RDB LIKE search slows down with scale.

Analytics dashboards important (BI, KPI management)

Business DB (RDB) + analytics DB (BigQuery/Snowflake). Direct analysis on the business DB drops performance, so move via ETL/ELT to the analytics DB. At small scale, a PostgreSQL read replica can substitute.

Embedding AI / RAG (chatbots, search augmentation)

PostgreSQL + pgvector is the fastest route. At larger scale, consider dedicated vector DBs like Pinecone or Weaviate. Redis also supports Vector Search.

IoT / mass metrics ingestion

TimescaleDB (PostgreSQL extension) or InfluxDB. Mass time-series ingestion into RDB is an antipattern.

Phased selection table by data volume / access pattern

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For data stores, “selecting by guesswork” is a losing line. The practical rule is to narrow down mechanically by data volume and RPS (Requests Per Second).

Data volume	Write RPS	Read RPS	Main DB	Auxiliary composition
~1GB / 1M rows	~10	~100	PostgreSQL alone	None
~100GB / 10M rows	~100	~1,000	PostgreSQL + Redis	-
~1TB / 1B rows	~1,000	~10,000	Aurora + Redis + OpenSearch	Consider DWH
~10TB / 10B rows	~10,000	~100,000	Aurora + DynamoDB + ClickHouse	Kafka
10TB+ / 10B+ rows	10,000+	100,000+	DynamoDB / Spanner / Cassandra	Fully distributed platform

Numerical gates by row count: 10M rows triggers index-design review, 100M rows triggers partitioning consideration, 1B rows triggers sharding consideration, 10B rows makes NoSQL options realistic. Building a distributed composition before data volume is visible is the typical over-design.

When in doubt, expand in the order PostgreSQL alone → add cache → add replica → shard.

Author’s note - the cost of “schema-less for speed”

In early 2017, “tens of thousands of MongoDB instances exposed to the internet without auth set up” were hit by a wave of ransom attacks - data deleted and ransom demanded. The issue was less a flaw in MongoDB itself and more that compositions advertised as “schema-less for speed” were put into production without deep understanding, leaving security settings behind - as has been pointed out since.

Another story: companies running years of business data on the “MongoDB + Node.js” stack popular in the 2010s, when they tried to use it for analysis, found “field names are different by era, types fluctuate, arrays and scalars mixed,” and spent six months reprocessing nearly all records. Effort that could have been avoided by using PostgreSQL JSONB type with stricter typing.

I also watched - around 2018 - a colleague stand up a service saying “MongoDB is fast enough for now,” then witness hellish reckoning the moment aggregation reports were needed a year later. The cost of “schema-less for speed” rebounds across operations, security, and analysis. This lesson laid the groundwork for the PostgreSQL re-evaluation of the 2020s.

When in doubt: managed PostgreSQL + PITR + multi-AZ. This triad prevents 90% of accidents.

Data store selection / operation pitfalls

Here are the typical accidents in data stores. As the most one-way-door area, failure is irreversible.

Forbidden move	Why it’s bad
Adopting MongoDB as main DB for “schema-less convenience”	Aggregation/Joins needed a year later, 3 months lost migrating to PostgreSQL
Running DBs in production with default settings (no auth)	Same pattern as the early-2017 MongoDB ransomware (tens of thousands hit)
Storing images/videos in the DB itself	DB bloat, backup time explosion. S3 + URL reference is the rule
Adopting NoSQL for work needing strong consistency	Inconsistencies in payments/inventory/accounting, refunds and credit loss
Self-implementing DB sharding	Operational load and learning cost melt the team. Use managed Aurora/Spanner
No restore drills for backups	Same outcome as the GitLab January-2017 incident. Quarterly restore drills
Running DBs in single AZ in production	Whole stoppage on AZ failure. Multi-AZ is the production minimum
Using PostgreSQL UUID v4 as primary key for mass insert	Index locality degrades, insert performance several times slower. Switch to UUID v7 (RFC-finalized 2024)
Operating with PITR (Point-in-Time Recovery) disabled	A wrong operation loses hours of data. Always enable managed-DB PITR
Self-running on EC2 when a managed version exists	Personnel cost explodes on security patches, backups, failover

The early-2017 large-scale MongoDB ransomware (incidents where unauthenticated public MongoDB / Redis / Elasticsearch were globally encrypted/deleted) demonstrated the cost of putting default settings into production without questioning.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, the decisive selection criterion for data stores becomes whether AI can understand and operate it easily. RDBs that come with SQL, type definitions, and explicit schemas are the easiest DBs for AI to handle - their evaluation has actually risen in the AI era.

Favored in the AI era	Disfavored in the AI era
PostgreSQL (all-inclusive incl. pgvector)	Minor emerging DBs (thin training data)
Standard SQL, explicit schemas	Sloppy schema-less JSON
Managed (Supabase, Neon, RDS)	Self-operated custom DBs
BigQuery, Snowflake (standard SQL)	DBs with custom query languages

In an era where AI agents directly operate DBs (Text-to-SQL and DB operations via MCP (Model Context Protocol, the standard protocol bridging AI and tools)), “an RDB with tidy schemas” is the strongest choice. Choose a non-mainstream DB and AI generation accuracy drops.

“PostgreSQL-centric design” is the de facto standard answer for the AI era.

Common misconceptions

NoSQL scales better - modern PostgreSQL handles several TB / several thousand QPS comfortably. The scale where NoSQL is required is much larger than imagined
MongoDB is the latest and modern - MongoDB’s heyday was in the 2010s. Today, many sites have replaced it with PostgreSQL JSONB type. “Schema-less and fast” is paid for in operations, security, and analysis
The optimum is lining up dedicated DBs per use case - multi-DB rapidly raises operational cost. If one DB is enough, one DB is always best
SQL is outdated - rather, it’s being re-evaluated in the AI era. Types are explicit, AI reads/writes them easily, and Text-to-SQL/MCP have made SQL the primary battlefield

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work with these vague always invites later questions like “why did we decide this again?”

What’s the main DB (PostgreSQL / MySQL / other)
Whether to introduce a cache (Redis, Memcached)
Whether to introduce an analytics DB (BigQuery, Snowflake)
Is full-text search needed (Elasticsearch, OpenSearch)
Is vector search needed (pgvector, Pinecone)
Use a managed service (self-operation is mostly discouraged)
Backup and restore policy

How to make the final call

The core of data-store selection is awareness that it’s a one-way door. Application code can be rewritten, but migrating a running DB to a different DB is effectively a rebuild - and crossing cloud vendors is even harder. That’s why the conservative starting point of “when in doubt, RDB” is correct in many cases, and the safe judgment axis is “add dedicated DBs only when the exception conditions are clearly visible.” Don’t introduce a DB you can’t operate - a DB the team can’t recover halts the business during incidents.

The decisive axis is whether AI can handle it easily. In an era where AI agents directly operate DBs via Text-to-SQL and MCP, PostgreSQL with standard SQL and explicit schemas has abundant training data, and AI generation accuracy is overwhelmingly higher. Minor emerging DBs and DBs with custom query languages become unfavorable choices in the AI era.

Selection priorities

When in doubt, PostgreSQL - all-inclusive with JSONB / pgvector / full-text search; until exception conditions hit, one DB is enough
Can the team operate it - choose DBs many people know. Don’t reference Netflix cases
Use-case specialization is added at exception - Redis, search, analytics DB are added after RDB visibly falls short
Managed first - self-running is mostly discouraged; lower operational load with Supabase / Neon / RDS

“PostgreSQL-centric + auxiliary DBs as needed” is the de facto modern standard answer.

Summary

This article covered data store selection, including the strengths and weaknesses of RDB, KVS, document, columnar, time-series, search, and vector DBs, plus a phased data-volume x use-case matrix and AI-era favored options.

When in doubt, PostgreSQL; add use-case specialization at exception; managed first. That is the practical answer for data store selection in 2026.

Next time we’ll cover data modeling (table design, normalization, primary keys, indexes).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.