[Data Architecture] Data Platform - Choosing Between DWH, Data Lake, and Lakehouse

About this article

As the fourth installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data platforms.

A platform that only stores is debt; a platform you can pull data from is an asset. This article covers the characteristics of the 3 options DWH/data lake/lakehouse, BI tool integration, recommended compositions by scale, and the structure where the platform turns into a “data swamp” the moment “storing” becomes the goal.

What a data platform handles

BI dashboards, executive KPIs, machine learning, AI agents - the foundation of all of these is the data platform. Without a platform, you end up gathering data per individual system, and cross-department company-wide analysis becomes impossible.

Whether or not you have a data platform is the dividing line of DX (Digital Transformation, the shift to digital-first operations) success. Cutting corners here means you can’t reach company-wide AI utilization.

Why it’s needed

1. Using business DBs for analysis drops their performance

Business DBs are optimized for “fast and safe per transaction,” and running analytical queries like “aggregate all data” slows down the business side. Separating business and analysis is the de facto standard.

2. Data crosses organizations

To connect sales/marketing/accounting/CS data company-wide, “a platform that consolidates in one place” is needed. Without it, the problem of “the same number is different per department” arises.

3. AI accuracy is decided by data quality

ML models and the RAG of LLMs only stand up with “tidy data.” A poor data platform sets the ceiling on AI utilization.

3 options

flowchart LR
    SRC[Business systems<br/>logs/IoT/SNS]
    DWH["DWH<br/>structured-data only<br/>(prepare and load)<br/>BigQuery/Snowflake"]
    LAKE["Data lake<br/>raw data anything<br/>(just store)<br/>S3/Azure Blob"]
    LH["Lakehouse<br/>best of both<br/>Delta Lake/Iceberg"]
    SRC -->|structured only| DWH
    SRC -->|format-agnostic| LAKE
    SRC -->|structured + unstructured| LH
    DWH -.- L1[tidy at the cost of<br/>flexibility]
    LAKE -.- L2[flexible but<br/>swamp risk]
    LH -.- L3[the front-runner since 2024]
    classDef src fill:#fef3c7,stroke:#d97706;
    classDef dwh fill:#dbeafe,stroke:#2563eb;
    classDef lake fill:#fae8ff,stroke:#a21caf;
    classDef lh fill:#dcfce7,stroke:#16a34a,stroke-width:2px;
    class SRC src;
    class DWH dwh;
    class LAKE lake;
    class LH lh;

Option	Rough description
DWH (Data Warehouse)	Analysis-only DB for structured data. Prepare and load
Data lake	Massive storage that holds raw data of any format
Lakehouse	Best of both. SQL is applied directly to the data lake

There are cases where one is enough, and there are large enterprises that operate “DWH + data lake together.” Decide by scale and use case.

Data warehouse (DWH)

A columnar DB optimized for analysis, where structured data is prepared and loaded. A historical concept dating from the 1980s, it’s the main battlefield for “aggregation analysis” - monthly reports, executive dashboards, KPI monitoring.

All modern DWHs are cloud-managed, with BigQuery, Snowflake, and Redshift as the big three. They aggregate at TB-PB scale in seconds, are accessed via SQL, and have moderate learning costs.

Pros	Cons
Aggregation analysis ultra-fast	Unstructured data doesn’t fit
Anyone can use it via SQL	Storing raw data is uneconomical
Fine-grained permission control	Pricing models are unique per cloud
Almost no operations needed (managed)	Strong vendor lock-in

Representatives: BigQuery, Snowflake, Amazon Redshift, Azure Synapse

The default is BigQuery or Snowflake. The reasons to choose Redshift are diminishing unless you’re locked to AWS.

Data lake

“Massive storage” that holds raw data regardless of format. CSV, JSON, images, videos, PDFs, logs - the flexibility to hold anything is the feature, operated with the stance of “first store everything, think about how to use it later.” A cloud’s object storage (S3, GCS, ADLS) becomes the platform as is.

While DWH is “prepare before loading,” the data lake’s idea is load and prepare later. Machine learning, unstructured-data analysis, audit-log retention - it covers “areas DWH can’t handle.”

Pros	Cons
Anything fits regardless of format	Sloppy use turns it into a “data swamp”
Storage cost is extremely cheap	Not SQL-able as is (separate engine needed)
Unlimited scale	Permission management/governance is hard
Optimal for ML preprocessing	Search and aggregation are slow

Representatives: Amazon S3, Google Cloud Storage, Azure Data Lake Storage

Just storing without operational rules turns into a data swamp. Always set up catalog and naming conventions.

Lakehouse

A modern approach that places DWH features on top of a data lake. Apply SQL directly to Parquet files on S3, support ACID transactions - a “best of both” composition. Proposed by Databricks, with Delta Lake, Apache Iceberg, and Apache Hudi establishing themselves as standard data formats.

The concept was born from the issue “operating both a DWH and a data lake is too heavy,” and lakehouse is becoming the top candidate for new builds. However, operational know-how is still developing, so the difficulty rises if the team isn’t familiar.

Pros	Cons
One platform covers both use cases	Operational know-how still maturing
Storage cost is cheap	Team learning cost required
Weak vendor lock-in	Toolchain still incomplete
Handles both structured and unstructured	Can be excessive at small scale

Representatives: Databricks, Snowflake (Iceberg-supporting), BigLake

Since 2024, lakehouse has become the mainstream for new builds. Migration from existing DWH proceeds in stages.

Comparison of the 3

Viewpoint	DWH	Data lake	Lakehouse
Structured data analysis	Excellent	Marginal	Excellent
Unstructured data	No	Excellent	Good
Storage cost	High	Low	Low-Mid
Direct SQL use	Excellent	No	Good
Operational simplicity	Excellent	Marginal	Marginal
Vendor lock-in	Strong	Weak	Mid
Compatibility with ML	Marginal	Excellent	Excellent

The lakehouse looks balanced and new, but the reality is that organizations already running on a DWH don’t need to forcibly switch.

BI tool integration

Even building a data platform, business departments can’t use it without visualization tools (BI). BI tools issue SQL to the DWH/lakehouse and display dashboards and reports.

BI tool	Characteristics	Suited for
Tableau	Strongest features, industry standard	Large enterprises, advanced analysis
Power BI	Pairs well with Microsoft	Microsoft 365 companies
Looker	Strong modeling layer, Google integration	BigQuery users
Metabase	OSS, lightweight, free	Small/mid scale, personal
Redash	OSS, SQL-centric	Engineer-led organizations

Whether business departments can use it themselves decides BI penetration. Choosing one with a UI usable by non-engineers is the rule.

Decision criteria

1. Type of data

The optimal platform varies with data type. Structured data only is DWH only, but with images/videos/unstructured data involved, a data lake or lakehouse becomes mandatory.

Data type	Recommended
Structured business data only	DWH (BigQuery, Snowflake)
With ML / AI preprocessing	Lakehouse or DWH + lake combo
Mass log/image/video retention	Data lake-centric
Long-term retention for history/audit	Data lake (S3 Glacier etc.)

2. Scale and budget

DWHs primarily charge by query billing, and useless queries spike the bill. Data lakes have extremely cheap storage and suit “store first.” Budget management philosophy also affects selection.

3. Cloud vendor

Data platforms are strongly tied to cloud vendors. If you’re already on AWS, Redshift + S3; on GCP, BigQuery; on Azure, Synapse + ADLS - aligning with existing cloud is advantageous in operations and billing.

Vendor	DWH	Data lake
AWS	Redshift	S3 + Glue
Google Cloud	BigQuery	GCS + BigLake
Azure	Synapse	ADLS + Fabric
Multi-cloud	Snowflake	-

Snowflake is the only multi-cloud DWH, supported by enterprises wishing to avoid vendor lock-in.

How to choose by case

Mid-size company executive KPI visualization

BigQuery + Metabase/Redash. Cheap (from tens of thousands of yen monthly), simple, low learning cost. Manage ETL from business DB with dbt (data build tool, the tool defining data-transformation pipelines in SQL).

Large enterprise / multi-department / multi-cloud

Snowflake. Multi-cloud support, strong permission management, contract-scale discounts. BI is Tableau or Power BI.

ML / AI utilization-focused

Lakehouse (Databricks). End-to-end including notebooks and MLOps (DevOps for ML). With existing DWH, start from parallel operation.

Startups / MVP phase

Don’t build a data platform. PostgreSQL read replicas + Metabase are enough. Once revenue grows, consider proper introduction.

Phased platform-selection table by org scale and data volume

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

Choosing a data platform “by trends” breaks down in operations. The practical rule is to phase by scale and monthly cost.

Org scale	Data volume	Recommended platform	Monthly target	BI tool
Personal/MVP	~10GB	PostgreSQL only	$0-50	Metabase (free)
Startup	~1TB	BigQuery	tens-hundreds of dollars	Metabase / Looker Studio
Mid-size SaaS	~10TB	BigQuery or Snowflake	hundreds-thousands of dollars	Looker / Tableau
Large enterprise / multi-dept	~100TB	Snowflake (multi-cloud)	thousands-tens-of-thousands	Tableau / Power BI
Super-large / ML-centric	100TB+	Databricks (lakehouse)	tens-of-thousands+	Dedicated dashboards

Query-billing blow-up pattern: Using SELECT * without limit on BigQuery is the classic story of monthly bills hitting hundreds of thousands overnight. Control costs with the triad partitioning required, column specification required, leverage query cache. Snowflake is also billed by warehouse size x runtime, so the standard is to fix dev environments at XS and Auto-scale on production only.

Snowflake at startup scale is over-investment. BigQuery free tier handles several GB to several hundred GB easily.

Author’s note - cases of “stored but no one can use” swamps

A business unit “stored 3 years of all-department logs in S3” thinking “we’ll analyze it someday,” but with no schema or naming conventions, hundreds of millions of JSON files piled up - and the result was no one could use them, turning into a swamp. Date format differs per file, field names fluctuate with each service revision, and the same value has multiple representations. The processing cost for analysis ended up exceeding the cost of newly designing log collection - reaching the “putting the cart before the horse” punchline.

In another field, conversely, “putting everything in BigQuery is safe” led to ramming image and video binaries into the DWH, with query bills jumping to hundreds of thousands of dollars monthly - the kind of joke-like case told often. Ignoring the basic separation - DWH for structured, data lake for unstructured - rebounds on cost.

I myself once thought lightly about log design and judged “JSON for now” on a past project, only to be told six months later by the analytics person that “we can’t read this.” Both are cases that left the common lesson that when storing itself becomes the goal, the platform becomes a junkyard, not a foundation. Catalogs, naming, and use-case-based platform splits are the basic gear that prevents the platform from becoming a swamp.

A data platform’s purpose is not “storing” but excavating. Catalog and retention policy prevent debt accumulation.

Data platform pitfalls and forbidden moves

Here are the typical accidents in data platforms. The moment “storing” becomes the goal, the platform becomes a junkyard.

Forbidden move	Why it’s bad
Pile raw data in S3 with no schema or naming convention	”No one can dig” swamp 3 years later. Processing cost exceeds new collection design
Putting image/video binaries in DWH	Query billing hits hundreds of thousands monthly. Unstructured goes to data lake
`SELECT *` without limit in BigQuery	Monthly bill jumps overnight. Column specification + LIMIT required
Always-on Large warehouse in Snowflake	Large for dev/staging is wasteful. Control with XS + Auto-scale
Run direct analysis on the business DB	Business-side performance drops, customer impact. Always separate via ETL/ELT to DWH
Hand-written SQL scattered across the org without dbt/ELT	Transformation logic gets person-locked. Secret SQL only on retirees’ PCs
Loading personal data unmasked into DWH	GDPR / personal info law violation. Same risk as Meta’s 2023 EUR 1.2B fine
Operating with no data catalog	”Where’s this data?” asked of people every time. Analysis speed at 1/10
No retention policy	All data from 5 years ago bleeding through Standard-class billing. Move to tiered storage
Adopting lakehouse from small scale	Operational know-how still maturing. Phase in from mid-scale up

The May 2023 Meta GDPR fine of EUR 1.2B (about JPY 200 billion) is a case showing that data-storage-location design mistakes lead directly to fines at the company-survival level. “Inputting industry-specific regulatory requirements (GDPR / HIPAA / PCI DSS) first” is the starting point of platform design.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, a data platform’s importance as a platform AI agents access jumps. In an era where LLM RAG, Text-to-SQL, and AI agents query business data, a tidy data platform sets AI accuracy ceiling directly.

Favored in the AI era	Disfavored in the AI era
BigQuery / Snowflake (mainstream, abundant training data)	Minor DWHs
dbt + data catalog maintenance	Raw SQL with no documentation
Tidy schema, naming	Sloppy CSV ingestion
Data lineage (tracking origin and transformation paths) visualized	Black-box ETL

The new standard is aiming for a state where the data platform can introduce itself to AI agents. With metadata, catalogs, and lineage in place, AI can autonomously select data and analyze.

A data platform in the AI era is a place AI agents touch. Build with explainability, explicit structure, and mainstream technology.

Common misconceptions

You’ll regret not building a data lake from the start - excessive at small scale. DWHs alone like BigQuery handle several TB. Add a data lake “after unstructured data grows”
With a DWH, you don’t need a business DB - completely different things. Business DB (OLTP) handles transactions, DWH does analysis. Both are needed
Buying expensive BI tools advances analysis - what matters more than tools is how tidy the data is. Pointing premium tools at sloppy data yields nothing
Lakehouse is a superset of DWH so replace everything - still maturing. With thin operational know-how, no need to forcibly switch a stable DWH

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work with these vague always invites later questions like “why did we decide this again?”

Type of platform (DWH / data lake / lakehouse)
Cloud vendor (AWS / GCP / Azure / multi)
BI tool (Tableau / Power BI / Metabase etc.)
Data ingestion method (ETL / ELT / streaming)
Retention period and cost tier (hot / cold / archive)
Permission management method (Row-Level Security / IAM)
Governance regime (catalog, lineage)

How to make the final call

The core of a data platform is separating business and analysis, holding the discipline as an organization of not throwing analytical queries at the business DB. Without a platform, the company-wide problem of “the same number is different per department” is guaranteed, and BI, ML, and AI utilization all proceed lacking a foundation. While the scale is small, a DWH alone is enough; add a data lake at the stage where unstructured data grows; converge on lakehouse once operational know-how matures - this phased decision is rational.

The decisive axis is the perspective of a platform AI agents access. In an era where LLM RAG and Text-to-SQL directly query the data platform, a platform with mainstream DWHs (BigQuery / Snowflake) plus metadata and lineage in place lifts AI’s accuracy ceiling directly. Minor DWHs and black-box ETL become liabilities in the AI era.

Selection priorities

Separate business and analysis - don’t run analysis on the business DB; this is the starting point of everything
Phase by scale - small starts on BigQuery / Snowflake alone, large migrates phased to lakehouse
Lean on existing cloud - prioritize integration benefits in billing/IAM/operations; multi-cloud means Snowflake
Build out catalog and lineage - aim for a platform where AI agents can introduce themselves

“Default to BigQuery or Snowflake.” Choose the mainstream, build a readable AI-era data platform.

Summary

This article covered data platforms, including the 3 options DWH/data lake/lakehouse, BI tool integration, phased recommendations by scale, and catalog operations to avoid data swamps.

Separate business and analysis, phase by scale, lean on existing cloud, and build a platform AI agents touch via catalog. That is the practical answer for a data platform in 2026.

Next time we’ll cover ETL / ELT (the mechanism for extracting, transforming, and loading data).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.