About this article
As the second installment of the "Data Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains data store selection.
In 2026, using multiple stores per app for different use cases has become normal. This article specializes in app-level detailed selection, presenting the strengths and weaknesses of RDB/KVS/document/columnar/time-series/search/vector and a phased matrix by data volume x use case (the system-wide placement strategy lives in the separate "System Architecture" article).
Other articles in this category
This article's coverage
Data store selection splits into two stages: system-wide placement strategy and individual app detailed selection. This article specializes in the latter - the practical work of "choosing a concrete data store per app."
| Article | Coverage |
|---|---|
| 10/06 Data store | System-wide placement strategy / Polyglot Persistence / overall map of scaling strategies |
| This article | Individual app detailed selection / early decision flow / phased selection by data volume x use case |
| 40/02 Data modeling | Table design, normalization, primary keys, indexes |
| 40/03 Data platform | DWH / data lake / BI integration |
The question of this article is "which DB type fits this app?" Whole-system placement is handled in the system chapter.
Data stores are "houses you can't move out of"
Choose wrong and you face fundamental problems: no speed, no scale, and declining development efficiency. On the other hand, there are too many categories to scan exhaustively, so the realistic approach is to first understand the shape of the business requirements and then pick the appropriate category.
Data stores are the foundation of all application design. Until the DB is settled, entity design, API design, and transaction boundaries cannot be settled either. Furthermore, once operations begin, data migration is extremely costly - as a decision, it is practically impossible to change later. On top of that, data stores are tightly coupled to cloud vendors: even migrating between relatively close products, such as AWS Aurora to GCP Cloud SQL, is a high bar.
"For now, RDB" is correct in many cases. But knowing the exceptions is the architect's job.
Main classification
flowchart TB
DATA{What's the<br/>nature of the data?}
DATA -->|structured, integrity required| RDB[RDB<br/>PostgreSQL/MySQL]
DATA -->|simple Key->Value<br/>ultra-fast| KVS[KVS<br/>Redis/DynamoDB]
DATA -->|flexible schema, JSON| DOC[Document<br/>MongoDB/Firestore]
DATA -->|aggregation/analysis| OLAP[Columnar OLAP<br/>BigQuery/Snowflake]
DATA -->|time-series/sensors| TS[Time-series DB<br/>InfluxDB/Timescale]
DATA -->|full-text search| SE[Search engine<br/>Elasticsearch]
DATA -->|similarity/AI| VEC[Vector DB<br/>pgvector/Pinecone]
classDef root fill:#fef3c7,stroke:#d97706;
classDef rdb fill:#dbeafe,stroke:#2563eb;
classDef kvs fill:#fee2e2,stroke:#dc2626;
classDef doc fill:#fae8ff,stroke:#a21caf;
classDef olap fill:#dcfce7,stroke:#16a34a;
classDef other fill:#f0f9ff,stroke:#0369a1;
class DATA root;
class RDB rdb;
class KVS kvs;
class DOC doc;
class OLAP olap;
class TS,SE,VEC other;
| Category | Strength | Representatives |
|---|---|---|
| RDB (relational DB) | ACID integrity, JOIN, mature | PostgreSQL, MySQL |
| KVS (Key-Value) | Ultra-fast single-key lookup | Redis, DynamoDB |
| Document DB | Flexible schema, direct JSON storage | MongoDB, Firestore |
| Columnar DB (OLAP) | Aggregation/analysis orders of magnitude faster | BigQuery, Snowflake |
| Time-series DB | Accumulating metrics/sensor values | InfluxDB, TimescaleDB |
| Search engine | Full-text search, faceted search | Elasticsearch, OpenSearch |
| Vector DB | Similarity search, AI embeddings | pgvector, Pinecone |
Early decision flow
Go down the questions in order, and the top candidate for your use case gets pinned down.
ACID integrity required?
├─ Yes → RDB (PostgreSQL)
└─ No → ultra-fast access on a single key?
   ├─ Yes → KVS (Redis/DynamoDB)
   └─ No → does the schema fluctuate?
      ├─ Yes → Document DB
      └─ No → mass-data aggregation/analysis?
         ├─ Yes → Columnar DB (BigQuery/Snowflake)
         └─ No → similarity search / AI embeddings?
            ├─ Yes → Vector DB (pgvector)
            └─ No → full-text search?
               ├─ Yes → OpenSearch
               └─ No → Time-series DB
The default of PostgreSQL + cache (Redis) as needed in two layers covers 90% of systems.
RDB (relational database)
The most traditional and powerful data store, handling row-and-column structured data via SQL. ACID integrity (Atomicity, Consistency, Isolation, Durability) is guaranteed, making it the standard for money calculations and inventory management - work that doesn't tolerate mistakes. The flexibility of combining multiple tables with JOIN is unmatched by other DBs.
Its weaknesses are that horizontal scaling (boosting performance by adding servers) is hard and that schema changes are heavy, but cloud-era managed RDBs (Aurora, Cloud SQL) have largely solved these; at mid-scale and below, it is safe to say RDB is the only real choice.
| Pros | Cons |
|---|---|
| Robust ACID integrity | Horizontal scaling is hard |
| Flexible data fetching with JOIN | Schema-change costs are high |
| Standard SQL with rich learning resources | Not great with unstructured data |
| Mature ecosystem | Performance degrades with massive data |
Representatives: PostgreSQL, MySQL, SQL Server, Oracle
Without a special reason: PostgreSQL. It ships with full-text search, JSON, and geospatial support as standard - you will very rarely run into trouble.
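To make the ACID and JOIN strengths concrete, here is a small sketch using Python's stdlib `sqlite3` as a stand-in for PostgreSQL (the SQL shown is portable). The `customers`/`orders` tables are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount INTEGER NOT NULL
    );
""")

with conn:  # transaction: both inserts commit atomically, or neither does
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders VALUES (1, 1, 300), (2, 1, 200)")

# JOIN: combine tables flexibly at query time, no denormalization needed
row = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
""").fetchone()
print(row)  # ('Alice', 500)
```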
KVS (Key-Value Store)
A DB specialized for the simple structure of getting one value from one key. In exchange for losing search and JOIN, it offers sub-millisecond response and near-infinite horizontal scalability. It shines where "lots of simple lookups" arise - sessions, cache, rankings, rate limiting.
There are also "KVS usable as the main DB" products like DynamoDB, but their JOIN and transaction constraints are severe, so they are unsuitable as the main DB for most business systems. The standard pattern at many sites is to run Redis as a cache layer alongside an RDB.
| Pros | Cons |
|---|---|
| Ultra-fast read/write | No JOIN, no complex search |
| Easy horizontal scaling | No schema constraints (data-quality risk) |
| Simple to implement | Hard to keep consistency across multiple keys |
| Excellent as cache | High design difficulty as main DB |
Representatives: Redis, Memcached, DynamoDB
For the cache layer, Redis is the default choice. As a main DB, DynamoDB is an option, but design carefully.
Document DB
A DB that stores JSON-format documents as is. Without pre-defining schemas, records with different structures can be stored - suitable for early phases with frequent spec changes or accumulating log data with undefined structure. The convenience of saving app-side objects directly is also attractive.
On the other hand, schemas tend to become vague, and problems such as declining data quality and JOIN-less designs that are hard to maintain are likely - the larger the scale, the more attractive an RDB becomes. After the MongoDB heyday of the 2010s, the modern trend in a growing number of cases is to substitute PostgreSQL JSONB (JSON Binary, a JSON storage type with fast indexed search).
| Pros | Cons |
|---|---|
| Schema-less and flexible | Data quality drops easily |
| Save JSON as-is | Bad at JOIN |
| Fast initial development | Breaks down at scale |
| Natural representation of nested data | Limited ACID guarantees (depends on product) |
Representatives: MongoDB, Firestore, CouchDB
The mainstream today is substituting PostgreSQL's JSONB type. Pure document DBs are seeing fewer use cases.
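The "schemas tend to become vague" failure mode above can be shown in a few lines of pure Python. The three documents below are invented examples of what years of schema-less writes typically leave behind, and the `normalize` helper is the kind of reconciliation work that later falls on the analysis team.

```python
# Three documents written by different code versions over the years:
docs = [
    {"userName": "alice", "tags": "admin"},   # early writer: camelCase, scalar tag
    {"user_name": "bob", "tags": ["staff"]},  # later writer: snake_case, array tag
    {"name": "carol"},                        # field renamed again, tags missing
]

def normalize(doc: dict) -> dict:
    """Reconcile historical field names and scalar/array type drift."""
    name = doc.get("userName") or doc.get("user_name") or doc.get("name")
    tags = doc.get("tags", [])
    if isinstance(tags, str):   # scalar vs array mix-up
        tags = [tags]
    return {"name": name, "tags": tags}

cleaned = [normalize(d) for d in docs]
print(cleaned[0])  # {'name': 'alice', 'tags': ['admin']}
```

With a typed JSONB column plus CHECK constraints (or plain columns) in PostgreSQL, this drift is rejected at write time instead of being repaired years later.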
Columnar DB (OLAP)
A DB with a structure specialized for analytical queries. Because it stores data column-wise rather than row-wise, aggregation like "summing sales over 10 million rows" runs orders of magnitude faster (seconds → milliseconds) - it is used as the heart of DWHs (data warehouses).
The modern composition uses a business DB (RDB) and an analytics DB (columnar) together, moving data from the business DB to the analytics DB via ETL/ELT (Extract/Transform/Load, the mechanism for transferring data to another DB). Running analytics directly on the RDB drops business-side performance, so this separation has become the de facto standard.
| Pros | Cons |
|---|---|
| Aggregation/analysis ultra-fast | Bad at per-record updates |
| Scales to TB-PB class | Unsuitable for real-time updates |
| SQL-analyzable | Initial cost may be high |
| Cost-effective due to column compression | Cannot be used for OLTP |
Representatives: BigQuery, Snowflake, Redshift, ClickHouse
The analytics platform choice usually comes down to BigQuery or Snowflake. Latecomer ClickHouse is rising on cost.
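The row-wise vs column-wise distinction can be sketched in a toy example. Real columnar engines add compression and vectorized execution on top of this layout; the point here is only that an aggregation touches one contiguous column instead of every full row.

```python
# Row-oriented: each record stored together (OLTP-friendly).
rows = [
    {"date": "2026-04-01", "region": "JP", "sales": 120},
    {"date": "2026-04-01", "region": "US", "sales": 80},
    {"date": "2026-04-02", "region": "JP", "sales": 150},
]

# Column-oriented: each column stored contiguously (OLAP-friendly).
columns = {
    "date":   [r["date"] for r in rows],
    "region": [r["region"] for r in rows],
    "sales":  [r["sales"] for r in rows],
}

# Summing sales reads a single column; the row layout would force a
# scan over every field of every record.
total_sales = sum(columns["sales"])
print(total_sales)  # 350
```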
Time-series DB, search, vector
Use-case-specific emerging categories - the basic rule is to adopt them for specific use cases, not as the lead role. Using one alone as the main DB is rare; introduce them as complements to an RDB or columnar DB.
| Category | Use case | Representatives |
|---|---|---|
| Time-series DB | Metrics, IoT sensors, stock prices | InfluxDB, TimescaleDB, Prometheus |
| Search engine | Full-text search, log search, faceted | Elasticsearch, OpenSearch |
| Vector DB | Similarity search of AI embeddings, RAG | pgvector, Pinecone, Weaviate |
Vector DBs are an area that surged with the LLM (Large Language Model) / RAG craze, with options spanning from existing-DB extensions (pgvector) to dedicated products (Pinecone).
For ranges substitutable by PostgreSQL extensions (pgvector etc.), it's better not to introduce a new vector DB. The advantage of doing it all in one DB is large.
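The core operation a vector DB (or pgvector's cosine-distance operator) optimizes is similarity search. Here is a brute-force sketch in pure Python; the three-dimensional "embeddings" and document names are toy assumptions - real vectors come from an embedding model and have hundreds of dimensions, which is exactly why dedicated indexes (HNSW, IVF) exist.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy corpus: document id -> embedding vector
corpus = {
    "doc-cats": [0.9, 0.1, 0.0],
    "doc-dogs": [0.8, 0.2, 0.1],
    "doc-tax":  [0.0, 0.1, 0.9],
}

def nearest(query: list[float]) -> str:
    """Brute-force nearest neighbour by cosine similarity."""
    return max(corpus, key=lambda k: cosine_similarity(query, corpus[k]))

print(nearest([1.0, 0.0, 0.0]))  # doc-cats
```

In pgvector the same query is a one-liner (`ORDER BY embedding <=> $1 LIMIT 1`), which is why staying inside PostgreSQL is attractive at small-to-mid scale.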
Decision criteria
1. Nature of the data
| Data nature | Suited category |
|---|---|
| Structured, with relations (transactions, customers) | RDB |
| Simple key lookup (sessions, cache) | KVS |
| Flexible JSON structure (logs, settings) | RDB JSONB or Document DB |
| Mass aggregation/analysis | Columnar DB (DWH) |
| Continuous time-aligned data | Time-series DB |
| Full-text search, scoring | Search engine |
| AI embedding vectors | Vector DB |
The default of RDB + auxiliary DB as needed is the safest. The "add a dedicated DB only for the parts RDB can't handle" approach is the practical optimum.
2. Scale and growth
| Scale | Typical composition |
|---|---|
| Small (~1M records) | RDB only |
| Mid (~hundreds of millions) | RDB + Redis (cache) |
| Large (billions+) | RDB + Redis + columnar DW |
| Super-large (multi-region, multi-PB) | RDB + KVS + columnar + search + time-series |
"Multi-DB from the start" is over-design - but you must also avoid the opposite mismatch, where growth is clearly coming and RDB alone will be insufficient.
3. Team skills
DB operations are extremely important. The iron rule is "never introduce a DB no one can operate" - introducing a DB the team isn't familiar with means being unable to recover during outages, putting business continuity at risk.
PostgreSQL and MySQL are the standards in the sense that "lots of people know them." Emerging DBs (CockroachDB, Neon, PlanetScale, etc.) have attractive features, but you need to weigh adoption track records against in-house skills.
"Can the team operate it?" is the first criterion of tech selection. You can't copy Netflix just because they use it.
How to choose by case
Personal dev / small SaaS MVP
PostgreSQL alone. JSONB, full-text search, and geospatial in one shot, with very few cases of needing to add a dedicated DB. For managed: rich options like Supabase, Neon, RDS.
Mid-size business app (SaaS, in-house systems)
PostgreSQL + Redis. Keep the main data in the RDB for integrity; serve sessions and cache from Redis for speed. This composition holds up to a few million users.
B2C with full-text search important (e-commerce, job sites)
PostgreSQL + Redis + Elasticsearch/OpenSearch. Leave product search and faceted filtering to the dedicated engine. RDB LIKE search slows down with scale.
Analytics dashboards important (BI, KPI management)
Business DB (RDB) + analytics DB (BigQuery/Snowflake). Direct analysis on the business DB drops performance, so move via ETL/ELT to the analytics DB. At small scale, a PostgreSQL read replica can substitute.
Embedding AI / RAG (chatbots, search augmentation)
PostgreSQL + pgvector is the fastest route. At larger scale, consider dedicated vector DBs like Pinecone or Weaviate. Redis also supports Vector Search.
IoT / mass metrics ingestion
TimescaleDB (PostgreSQL extension) or InfluxDB. Mass time-series ingestion into RDB is an antipattern.
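What a time-series DB does natively (e.g. TimescaleDB continuous aggregates or InfluxDB tasks) is roll raw points up into fixed time buckets. The sketch below shows that downsampling idea in pure Python; the sensor readings and the 60-second bucket are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Raw (unix_seconds, value) points from a hypothetical sensor.
points = [
    (0, 20.0), (30, 22.0), (60, 21.0), (90, 25.0), (120, 24.0),
]

def downsample(points, bucket_seconds=60):
    """Average raw points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}

print(downsample(points))  # {0: 21.0, 60: 23.0, 120: 24.0}
```

A time-series DB runs this rollup continuously and expires raw data by retention policy - doing the same with plain RDB tables and cron jobs is where the antipattern pain starts.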
Phased selection table by data volume / access pattern
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
For data stores, "selecting by guesswork" is a losing line. The practical rule is to narrow down mechanically by data volume and RPS (Requests Per Second).
| Data volume | Write RPS | Read RPS | Main DB | Auxiliary composition |
|---|---|---|---|---|
| ~1GB / 1M rows | ~10 | ~100 | PostgreSQL alone | None |
| ~100GB / 10M rows | ~100 | ~1,000 | PostgreSQL + Redis | None |
| ~1TB / 1B rows | ~1,000 | ~10,000 | Aurora + Redis + OpenSearch | Consider DWH |
| ~10TB / 10B rows | ~10,000 | ~100,000 | Aurora + DynamoDB + ClickHouse | Kafka |
| 10TB+ / 10B+ rows | 10,000+ | 100,000+ | DynamoDB / Spanner / Cassandra | Fully distributed platform |
Numerical gates by row count: 10M rows triggers index-design review, 100M rows triggers partitioning consideration, 1B rows triggers sharding consideration, 10B rows makes NoSQL options realistic. Building a distributed composition before data volume is visible is the typical over-design.
When in doubt, expand in the order PostgreSQL alone â add cache â add replica â shard.
Author's note - the cost of "schema-less for speed"
In early 2017, tens of thousands of MongoDB instances exposed to the internet without authentication were hit by a wave of ransom attacks - data deleted and ransom demanded. As has been pointed out since, the issue was less a flaw in MongoDB itself and more that setups advertised as "schema-less for speed" were put into production without deep understanding, leaving security settings behind.
Another story: companies that ran years of business data on the "MongoDB + Node.js" stack popular in the 2010s found, when they tried to use it for analysis, that "field names differ by era, types fluctuate, arrays and scalars are mixed" - and spent six months reprocessing nearly all records. Effort that could have been avoided with the stricter typing of PostgreSQL's JSONB type.
I also watched a colleague - around 2018 - stand up a service saying "MongoDB is fast enough for now," then face a hellish reckoning the moment aggregation reports were needed a year later. The cost of "schema-less for speed" rebounds across operations, security, and analysis. This lesson laid the groundwork for the PostgreSQL re-evaluation of the 2020s.
When in doubt: managed PostgreSQL + PITR + multi-AZ. This triad prevents 90% of accidents.
Data store selection / operation pitfalls
Here are the typical accidents with data stores. This is the most "one-way door" of areas - failures are hard to reverse.
| Forbidden move | Why itâs bad |
|---|---|
| Adopting MongoDB as main DB for âschema-less convenienceâ | Aggregation/Joins needed a year later, 3 months lost migrating to PostgreSQL |
| Running DBs in production with default settings (no auth) | Same pattern as the early-2017 MongoDB ransomware (tens of thousands hit) |
| Storing images/videos in the DB itself | DB bloat, backup time explosion. S3 + URL reference is the rule |
| Adopting NoSQL for work needing strong consistency | Inconsistencies in payments/inventory/accounting, refunds and credit loss |
| Self-implementing DB sharding | Operational load and learning cost melt the team. Use managed Aurora/Spanner |
| No restore drills for backups | Same outcome as the GitLab January-2017 incident. Quarterly restore drills |
| Running DBs in single AZ in production | Whole stoppage on AZ failure. Multi-AZ is the production minimum |
| Using PostgreSQL UUID v4 as primary key for mass insert | Index locality degrades, insert performance several times slower. Switch to UUID v7 (RFC-finalized 2024) |
| Operating with PITR (Point-in-Time Recovery) disabled | A wrong operation loses hours of data. Always enable managed-DB PITR |
| Self-running on EC2 when a managed version exists | Personnel cost explodes on security patches, backups, failover |
The early-2017 large-scale MongoDB ransomware (incidents where unauthenticated public MongoDB / Redis / Elasticsearch were globally encrypted/deleted) demonstrated the cost of putting default settings into production without questioning.
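One pitfall in the table above concerns UUID v4 primary keys: random IDs scatter inserts across the B-tree, while UUID v7 puts a timestamp in the high bits so inserts stay append-like. The sketch below shows only that ordering idea - it is NOT a compliant UUID v7 implementation (RFC 9562 version/variant bits are omitted), and the function name is invented.

```python
import os
import struct

def time_ordered_id(unix_ms: int) -> bytes:
    """Simplified UUID v7-style id: 48-bit big-endian millisecond
    timestamp prefix + 80 random bits. Later timestamps sort later."""
    return struct.pack(">Q", unix_ms)[2:] + os.urandom(10)

# IDs generated at increasing timestamps compare in creation order,
# which keeps index pages hot at the right-hand edge of the B-tree.
ids = [time_ordered_id(ms) for ms in (1_000, 2_000, 3_000)]
assert ids == sorted(ids)
```

With random UUID v4 keys the equivalent assertion would fail almost always, which is exactly the index-locality problem the table describes.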
AI-era perspective
When AI-driven development (vibe coding) and AI usage are the premise, the decisive selection criterion for data stores becomes whether AI can understand and operate it easily. RDBs that come with SQL, type definitions, and explicit schemas are the easiest DBs for AI to handle - their evaluation has actually risen in the AI era.
| Favored in the AI era | Disfavored in the AI era |
|---|---|
| PostgreSQL (all-inclusive incl. pgvector) | Minor emerging DBs (thin training data) |
| Standard SQL, explicit schemas | Sloppy schema-less JSON |
| Managed (Supabase, Neon, RDS) | Self-operated custom DBs |
| BigQuery, Snowflake (standard SQL) | DBs with custom query languages |
In an era where AI agents directly operate DBs (Text-to-SQL and DB operations via MCP (Model Context Protocol, the standard protocol bridging AI and tools)), "an RDB with tidy schemas" is the strongest choice. Choose a non-mainstream DB and AI generation accuracy drops.
âPostgreSQL-centric designâ is the de facto standard answer for the AI era.
Common misconceptions
- NoSQL scales better - modern PostgreSQL handles several TB / several thousand QPS comfortably. The scale where NoSQL is required is much larger than imagined
- MongoDB is the latest and modern - MongoDB's heyday was in the 2010s. Today, many sites have replaced it with the PostgreSQL JSONB type. "Schema-less and fast" is paid for in operations, security, and analysis
- The optimum is lining up dedicated DBs per use case - multi-DB rapidly raises operational cost. If one DB is enough, one DB is always best
- SQL is outdated - rather, it's being re-evaluated in the AI era. Types are explicit, AI reads and writes it easily, and Text-to-SQL/MCP have made SQL the primary battlefield
What to decide - what is your projectâs answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague invites later questions like "why did we decide this again?"
- What's the main DB (PostgreSQL / MySQL / other)
- Whether to introduce a cache (Redis, Memcached)
- Whether to introduce an analytics DB (BigQuery, Snowflake)
- Is full-text search needed (Elasticsearch, OpenSearch)
- Is vector search needed (pgvector, Pinecone)
- Use a managed service (self-operation is mostly discouraged)
- Backup and restore policy
How to make the final call
The core of data-store selection is the awareness that it's a one-way door. Application code can be rewritten, but migrating a running DB to a different DB is effectively a rebuild - and crossing cloud vendors is even harder. That's why the conservative starting point of "when in doubt, RDB" is correct in many cases, and the safe judgment axis is "add dedicated DBs only when the exception conditions are clearly visible." Don't introduce a DB you can't operate - a DB the team can't recover halts the business during incidents.
The decisive axis is whether AI can handle it easily. In an era where AI agents directly operate DBs via Text-to-SQL and MCP, PostgreSQL with standard SQL and explicit schemas has abundant training data, and AI generation accuracy is overwhelmingly higher. Minor emerging DBs and DBs with custom query languages become unfavorable choices in the AI era.
Selection priorities
- When in doubt, PostgreSQL - all-inclusive with JSONB / pgvector / full-text search; until exception conditions hit, one DB is enough
- Can the team operate it - choose DBs many people know. Don't reference Netflix cases
- Use-case specialization is added at exception - Redis, search, analytics DB are added after RDB visibly falls short
- Managed first - self-running is mostly discouraged; lower operational load with Supabase / Neon / RDS
"PostgreSQL-centric + auxiliary DBs as needed" is the de facto modern standard answer.
Summary
This article covered data store selection, including the strengths and weaknesses of RDB, KVS, document, columnar, time-series, search, and vector DBs, plus a phased data-volume x use-case matrix and AI-era favored options.
When in doubt, PostgreSQL; add use-case specialization at exception; managed first. That is the practical answer for data store selection in 2026.
Next time weâll cover data modeling (table design, normalization, primary keys, indexes).
I hope youâll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (40/89)