Data Architecture

[Data Architecture] Data Store Selection

[Data Architecture] Data Store Selection

About this article

As the second installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data store selection.

In 2026, using multiple stores per app for different use cases has become normal. This article specializes in app-level detailed selection, presenting the strengths and weaknesses of RDB/KVS/document/columnar/time-series/search/vector and a phased matrix by data volume x use case (the system-wide placement strategy lives in the separate “System Architecture” article).

What is data store selection in the first place

In a nutshell, data store selection is “deciding which type of database to entrust this app’s data to.”

Imagine a toolbox. A screwdriver for screws, a hammer for nails, a saw for cutting wood - no single all-purpose tool handles every job. Databases are the same: an RDB for “accurate transaction records,” a KVS for “high-speed responses to massive access,” and a search engine for “full-text search.” The optimal tool differs by use case. Modern apps routinely combine multiple data stores, and how you pick them determines the speed, extensibility, and development efficiency of the entire application.

This article’s coverage

Data store selection splits into two stages: system-wide placement strategy and individual app detailed selection. This article specializes in the latter - the practical work of “choosing a concrete data store per app.”

ArticleCoverage
10/06 Data storeSystem-wide placement strategy / Polyglot Persistence / overall map of scaling strategies
This articleIndividual app detailed selection / early decision flow / phased selection by data volume x use case
40/02 Data modelingTable design, normalization, primary keys, indexes
40/03 Data platformDWH / data lake / BI integration

The question of this article is “which DB type fits this app?” Whole-system placement is handled in the system chapter.

Why data store selection matters

What happens if you pick wrong? A bad choice directly leads to fundamental problems: no speed, no scale, dropping dev efficiency. And once operations begin, data migration is extremely high-cost - “changing later is practically impossible” as a decision.

Data stores are the foundation of all application design. Without a settled DB, entity design, API design, and transaction boundaries can’t be settled. Furthermore, data stores are tightly coupled to cloud vendors - even between “relatively close products” like AWS Aurora to GCP Cloud SQL, the migration bar is high.

“For now, RDB” is correct in many cases. But knowing the exceptions is the architect’s job.

Main classification

Major Data Store Categories Like a toolbox. Screwdriver for screws, hammer for nails. Choose tools by use case RDB Relational DB ACID consistency / JOIN PostgreSQL / MySQL Amounts, inventory, etc. Mission-critical operations When in doubt, choose this KVS Key-Value Ultra-fast single-key lookup Redis / DynamoDB Sessions / cache Rankings / counters Document DB Flexible JSON storage MongoDB / Firestore Schema is unstable Prototypes / CMS Columnar DB OLAP Aggregation & analytics orders of magnitude faster BigQuery / Snowflake Mass data aggregation BI / Reporting Time-series DB Metrics / sensor values InfluxDB / TimescaleDB IoT / Monitoring Search Engine Full-text search / facets Elasticsearch / OpenSearch Product search / log search Vector DB Similarity search / AI embeddings pgvector / Pinecone RAG / Semantic Search Essential for the AI era Graph DB Exploring relationships Neo4j Social / Recommendations Decision starting point: Need ACID? → RDB | Ultra-fast key lookup? → KVS | Mass aggregation? → Columnar | AI search? → Vector Default: PostgreSQL + Redis two-tier. Covers 90% of systems Add specialized DBs only when "RDB isn't enough" becomes clear
CategoryStrengthRepresentatives
RDB (relational DB)ACID integrity, JOIN, maturePostgreSQL, MySQL
KVSUltra-fast single-key lookupRedis, DynamoDB
Document DBFlexible schema, direct JSON storageMongoDB, Firestore
Columnar DB (OLAP)Aggregation/analysis orders of magnitude fasterBigQuery, Snowflake
Time-series DBAccumulating metrics/sensor valuesInfluxDB, TimescaleDB
Search engineFull-text search, faceted searchElasticsearch, OpenSearch
Vector DBSimilarity search, AI embeddingspgvector, Pinecone

Early decision flow

Going down the questions in order, your top candidate gets pinned for your use case.

ACID integrity required?
  └ Yes → RDB (PostgreSQL)
  └ No  → ultra-fast on a single key?
            └ Yes → KVS (Redis/DynamoDB)
            └ No  → schema fluctuates?
                      └ Yes → Document DB
                      └ No  → mass-data aggregation/analysis?
                                └ Yes → Columnar DB (BigQuery/Snowflake)
                                └ No  → similarity search/AI embeddings?
                                          └ Yes → Vector DB (pgvector)
                                          └ No  → full-text search?
                                                    └ Yes → OpenSearch
                                                    └ No  → Time-series DB

The default of PostgreSQL + cache (Redis) as needed in two layers covers 90% of systems.

RDB (relational database)

The most traditional and powerful data store, handling row-and-column structured data via SQL. ACID integrity (Atomicity, Consistency, Isolation, Durability) is guaranteed, making it the standard for money calculations and inventory management - work that doesn’t tolerate mistakes. The flexibility of combining multiple tables with JOIN is unmatched by other DBs.

Weaknesses are “horizontal scaling (boosting performance by adding servers) is hard” and “schema changes are heavy,” but cloud-era managed RDBs (Aurora, Cloud SQL) have largely solved these, and at mid-scale and below, RDB is the only choice - safe to say.

ProsCons
Robust ACID integrityHorizontal scaling is hard
Flexible data fetching with JOINSchema-change costs are high
Standard SQL with rich learning resourcesNot great with unstructured data
Mature ecosystemPerformance degrades with massive data

Representatives: PostgreSQL, MySQL, SQL Server, Oracle

Without a special reason: PostgreSQL. It comes with full-text search, JSON, and geospatial as standard - very rare to be in trouble.

KVS (Key-Value Store)

A DB specialized for the simple structure of getting one value from one key. In exchange for losing search and JOIN, it has sub-millisecond response and near-infinite horizontal scalability. It shines where “lots of simple lookups” arise - sessions, cache, rankings, rate limiting.

There are also “KVS usable as the main DB” like DynamoDB, but JOIN and transaction constraints are severe, so it’s unsuitable as the main DB for business systems. The standard pattern in many sites is to use Redis as a cache layer alongside RDB.

ProsCons
Ultra-fast read/writeNo JOIN, no complex search
Easy horizontal scalingFew schema constraints
Simple to implementHard to keep consistency across multiple keys
Excellent as cacheHigh design difficulty as main DB

Representatives: Redis, Memcached, DynamoDB

For cache layer, Redis is the only choice. As main DB, DynamoDB is an option, but design carefully.

Document DB

A DB that stores JSON-format documents as is. Without pre-defining schemas, records with different structures can be stored - suitable for early phases with frequent spec changes or accumulating log data with undefined structure. The convenience of saving app-side objects directly is also attractive.

On the other hand, schemas tend to become vague, and problems like dropping data quality and JOIN-less, hard-to-design are likely - the larger the scale, the easier RDB becomes. After the MongoDB heyday of the 2010s, the modern trend is “substituting with PostgreSQL JSONB (JSON Binary, a JSON storage type with fast indexed search)” in growing numbers of cases.

ProsCons
Schema-less and flexibleData quality drops easily
Save JSON as-isBad at JOIN
Fast initial developmentBreaks down at scale
Natural representation of nested dataLimited ACID guarantees (depends on product)

Representatives: MongoDB, Firestore, CouchDB

The mainstream today is substituting with PostgreSQL’s JSONB type. Pure document DBs are seeing fewer use cases.

Columnar DB (OLAP)

A DB with structure specialized for analytical queries. Because it stores data column-wise rather than row-wise, aggregation like “summing sales over 10 million rows” runs orders of magnitude faster (seconds → milliseconds) - used as the heart of DWHs (data warehouses).

Modern composition is to use a business DB (RDB) and analysis DB (columnar) together, moving data from business DB to analysis DB via ETL/ELT (Extract/Transform/Load, the mechanism for transferring data to another DB). Running analytics on the RDB drops business-side performance, so this separation has become the de facto standard.

ProsCons
Aggregation/analysis ultra-fastBad at per-record updates
Scales to TB-PB classUnsuitable for real-time updates
SQL-analyzableInitial cost may be high
Cost-effective due to column compressionCannot be used for OLTP

Representatives: BigQuery, Snowflake, Redshift, ClickHouse

The analytics platform is between BigQuery or Snowflake. Latecomer ClickHouse is rising for cost.

Time-series DB, search, vector

Use-case-specific emerging categories - basics is to adopt for specific use cases, not as the lead role. Cases of using one alone as the main DB are rare; introduce as a complement to RDB or columnar DB.

CategoryUse caseRepresentatives
Time-series DBMetrics, IoT sensors, stock pricesInfluxDB, TimescaleDB, Prometheus
Search engineFull-text search, log search, facetedElasticsearch, OpenSearch
Vector DBSimilarity search of AI embeddings, RAGpgvector, Pinecone, Weaviate

Vector DBs are an area that surged with the LLM / RAG craze, with options spanning from existing-DB extensions (pgvector) to dedicated products (Pinecone).

For ranges substitutable by PostgreSQL extensions (pgvector etc.), it’s better not to introduce a new vector DB. The advantage of doing it in one DB is large.

Decision criteria

1. Nature of the data

Data natureSuited category
Structured, with relations (transactions, customers)RDB
Simple key lookup (sessions, cache)KVS
Flexible JSON structure (logs, settings)RDB JSONB or Document DB
Mass aggregation/analysisColumnar DB (DWH)
Continuous time-aligned dataTime-series DB
Full-text search, scoringSearch engine
AI embedding vectorsVector DB

The default of RDB + auxiliary DB as needed is the safest. The “add a dedicated DB only for parts RDB can’t handle” approach is the practical optimum.

2. Scale and growth

ScaleTypical composition
Small (~1M records)RDB only
Mid (~hundreds of millions)RDB + Redis (cache)
Large (billions+)RDB + Redis + columnar DW
Super-large (multi-region, multi-PB)RDB + KVS + columnar + search + time-series

“Multi-DB from the start is over-design” - but you also need to avoid two-way mismatch: “scale is visible, RDB-only is insufficient.”

3. Team skills

DB operations are extremely important. The iron rule is “never introduce a DB no one can operate” - introducing a DB the team isn’t familiar with means inability to recover during outages, putting business continuity at risk.

PostgreSQL and MySQL are the standards in the sense that “lots of people know them.” Emerging DBs (CockroachDB, Neon, PlanetScale, etc.) have attractive features, but you need to “weigh adoption track records and in-house skills.”

“Can the team operate it?” is the first criterion of tech selection. You can’t copy Netflix just because they use it.

How to choose by case

Personal dev / small SaaS MVP

PostgreSQL alone. JSONB, full-text search, and geospatial in one shot, with very few cases of needing to add a dedicated DB. For managed: rich options like Supabase, Neon, RDS.

Mid-size business app (SaaS, in-house systems)

PostgreSQL + Redis. Main data ensures integrity in RDB, sessions and cache get speed from Redis. This composition holds up to a few million users.

B2C with full-text search important (e-commerce, job sites)

PostgreSQL + Redis + Elasticsearch/OpenSearch. Leave product search and faceted filtering to the dedicated engine. RDB LIKE search slows down with scale.

Analytics dashboards important (BI, KPI management)

Business DB (RDB) + analytics DB (BigQuery/Snowflake). Direct analysis on the business DB drops performance, so move via ETL/ELT to the analytics DB. At small scale, a PostgreSQL read replica can substitute.

Embedding AI / RAG (chatbots, search augmentation)

PostgreSQL + pgvector is the fastest route. At larger scale, consider dedicated vector DBs like Pinecone or Weaviate. Redis also supports Vector Search.

IoT / mass metrics ingestion

TimescaleDB (PostgreSQL extension) or InfluxDB. Mass time-series ingestion into RDB is an antipattern.

Phased selection table by data volume / access pattern

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For data stores, “selecting by guesswork” is a losing line. The practical rule is to narrow down mechanically by data volume and RPS (Requests Per Second).

Data volumeWrite RPSRead RPSMain DBAuxiliary composition
~1GB / 1M rows~10~100PostgreSQL aloneNone
~100GB / 10M rows~100~1,000PostgreSQL + Redis-
~1TB / 1B rows~1,000~10,000Aurora + Redis + OpenSearchConsider DWH
~10TB / 10B rows~10,000~100,000Aurora + DynamoDB + ClickHouseKafka
10TB+ / 10B+ rows10,000+100,000+DynamoDB / Spanner / CassandraFully distributed platform

Numerical gates by row count: 10M rows triggers index-design review, 100M rows triggers partitioning consideration, 1B rows triggers sharding consideration, 10B rows makes NoSQL options realistic. Building a distributed composition before data volume is visible is the typical over-design.

When in doubt, expand in the order PostgreSQL alone → add cache → add replica → shard.

Author’s note - the cost of “schema-less for speed”

In early 2017, “tens of thousands of MongoDB instances exposed to the internet without auth set up” were hit by a wave of ransom attacks - data deleted and ransom demanded. The issue was less a flaw in MongoDB itself and more that compositions advertised as “schema-less for speed” were put into production without deep understanding, leaving security settings behind - as has been pointed out since.

Another story: companies running years of business data on the “MongoDB + Node.js” stack popular in the 2010s, when they tried to use it for analysis, found “field names are different by era, types fluctuate, arrays and scalars mixed,” and spent six months reprocessing nearly all records. Effort that could have been avoided by using PostgreSQL JSONB type with stricter typing.

I also watched - around 2018 - a colleague stand up a service saying “MongoDB is fast enough for now,” then witness hellish reckoning the moment aggregation reports were needed a year later. The cost of “schema-less for speed” rebounds across operations, security, and analysis. This lesson laid the groundwork for the PostgreSQL re-evaluation of the 2020s.

When in doubt: managed PostgreSQL + PITR + multi-AZ. This triad prevents 90% of accidents.

Data store selection / operation pitfalls

Here are the typical accidents in data stores. As the most one-way-door area, failure is irreversible.

Forbidden moveWhy it’s bad
Adopting MongoDB as main DB for “schema-less convenience”Aggregation/Joins needed a year later, 3 months lost migrating to PostgreSQL
Running DBs in production with default settings (no auth)Same pattern as the early-2017 MongoDB ransomware (tens of thousands hit)
Storing images/videos in the DB itselfDB bloat, backup time explosion. S3 + URL reference is the rule
Adopting NoSQL for work needing strong consistencyInconsistencies in payments/inventory/accounting, refunds and credit loss
Self-implementing DB shardingOperational load and learning cost melt the team. Use managed Aurora/Spanner
No restore drills for backupsSame outcome as the GitLab January-2017 incident. Quarterly restore drills
Running DBs in single AZ in productionWhole stoppage on AZ failure. Multi-AZ is the production minimum
Using PostgreSQL UUID v4 as primary key for mass insertIndex locality degrades, insert performance several times slower. Switch to UUID v7 (RFC-finalized 2024)
Operating with PITR (Point-in-Time Recovery) disabledA wrong operation loses hours of data. Always enable managed-DB PITR
Self-running on EC2 when a managed version existsPersonnel cost explodes on security patches, backups, failover
Choosing NoSQL assuming “it scales better”Modern PostgreSQL handles several TB / several thousand QPS comfortably; the scale where NoSQL is required is much larger than imagined
Designing with dedicated DBs lined up per use caseMulti-DB rapidly raises operational cost; if one DB is enough, one DB is always best

The early-2017 large-scale MongoDB ransomware (incidents where unauthenticated public MongoDB / Redis / Elasticsearch were globally encrypted/deleted) demonstrated the cost of putting default settings into production without questioning.

AI decision axes

AI-era favorableAI-era unfavorable
PostgreSQL (all-inclusive incl. pgvector)Minor emerging DBs (thin training data)
Standard SQL, explicit schemasSloppy schema-less JSON
Managed (Supabase, Neon, RDS)Self-operated custom DBs
BigQuery, Snowflake (standard SQL)DBs with custom query languages
  1. When in doubt, PostgreSQL — all-inclusive with JSONB / pgvector / full-text search; until exception conditions hit, one DB is enough.
  2. Can the team operate it — choose DBs many people know. Don’t reference Netflix cases.
  3. Use-case specialization is added at exception — Redis, search, analytics DB are added after RDB visibly falls short.
  4. Managed first — self-running is mostly discouraged; lower operational load with Supabase / Neon / RDS.

The structural reason PostgreSQL became the “safe bet” in the AI era

AI coding tools’ SQL generation accuracy directly correlates with the target DB’s training data volume. PostgreSQL is open-source with massive usage examples, so Stack Overflow, GitHub, and official documentation are overwhelmingly abundant. As a result, AI-generated SQL targeting PostgreSQL has higher accuracy than other DBs.

PostgreSQL All-in-One Strategy One instance covers multiple specialized DB roles. The safe bet for the AI era PostgreSQL 1 DB Serving 4 Roles Standard RDB Features ACID / JOIN / Foreign Keys Foundation for business CRUD Orders, customers, inventory Standard SQL with rich learning resources JSONB Document Storage MongoDB alternative Flexible schema Achieved within RDB pgvector Vector Search Pinecone alternative RAG / Similarity Search AI embedding vectors pg_trgm Full-text Search Elasticsearch alternative Fuzzy Search Japanese language support 1 PostgreSQL = fewer DBs, simpler operations Multiple parallel DBs = more complexity, harder for AI to grasp Mongo + Redis + Elastic + Pinecone... When in doubt, PostgreSQL. Highest AI generation accuracy, extensions cover specialized use cases

Furthermore, PostgreSQL handles vector search with pgvector, full-text search with pg_trgm, and document storage with JSONB - fulfilling purpose-specific DB roles in a single instance. The more DBs increase, the more operational complexity grows and the harder it is for AI to grasp context, making PostgreSQL’s “all-in-one” strategy rational in the AI era.

Minor DBs or DBs with proprietary query languages have less training data, so AI generation accuracy drops. Be aware at selection time that choosing niche DBs for “technically interesting” reasons alone means losing AI utilization benefits.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work with these vague always invites later questions like “why did we decide this again?”

  • What’s the main DB (PostgreSQL / MySQL / other)
  • Whether to introduce a cache (Redis, Memcached)
  • Whether to introduce an analytics DB (BigQuery, Snowflake)
  • Is full-text search needed (Elasticsearch, OpenSearch)
  • Is vector search needed (pgvector, Pinecone)
  • Use a managed service (self-operation is mostly discouraged)
  • Backup and restore policy

How to record your reasoning

Data store selection is hard to redo once decided, because data migration costs are enormous. Recording why you chose that data store in an ADR supports future review decisions.

ItemContent
TitleAdopt Snowflake for the analytics platform
StatusApproved
ContextAnalytical queries on the business DB (Aurora PostgreSQL) are degrading production performance. About 200 million rows of data are aggregated in daily batches, and a dedicated analytics platform is needed
DecisionAdopt Snowflake as the analytics platform and sync data daily from the business DB
Rationale- Compute and storage are separated, so analytical query load doesn’t affect the business DB
- Auto-suspend of warehouses means zero cost during idle time
- Semi-structured data (JSON) can be ingested directly, simplifying the ETL transformation step
Rejected alternativesBigQuery: existing infrastructure is unified on AWS, creating cross-cloud data transfer cost and governance issues. Redshift: fixed-cluster billing makes cost efficiency poor during nights/weekends
ConsequencesBuilding a data pipeline to Snowflake (Fivetran or dbt) becomes an additional task. Data catalog maintenance should proceed in parallel

ADRs are best managed as Markdown in docs/adr/ within the code repository, not in spreadsheets or wikis. The greatest value of an ADR is that when you look back, “why this choice was made” is immediately clear.

https://en.senkohome.com/arch-intro-data-etl/ https://en.senkohome.com/arch-intro-data-overview/ https://en.senkohome.com/arch-intro-data-platform/

Summary

This article covered data store selection, including the strengths and weaknesses of RDB, KVS, document, columnar, time-series, search, and vector DBs, plus a phased data-volume x use-case matrix and AI-era favored options.

When in doubt, PostgreSQL; add use-case specialization at exception; managed first. That is the practical answer for data store selection in 2026.

Next time we’ll cover data modeling (table design, normalization, primary keys, indexes).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.