[Data Architecture] Data Platform - Choosing Between DWH, Data Lake, and Lakehouse

About this article

As the fourth installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data platforms.

A platform that only stores data is debt; a platform you can pull data out of is an asset. This article covers the characteristics of the 3 options (DWH, data lake, lakehouse), BI tool integration, recommended configurations by scale, and how a platform turns into a “data swamp” the moment storing becomes the goal.

What a data platform handles

BI dashboards, executive KPIs, machine learning, AI agents - the foundation of all of these is the data platform. Without one, you end up gathering data system by system, and cross-department, company-wide analysis becomes impossible.

Whether or not you have a data platform is the dividing line between success and failure in DX (Digital Transformation, the shift to digital-first operations). Cut corners here and you can’t reach company-wide AI utilization.

Why it’s needed

1. Using business DBs for analysis drops their performance

Business DBs are optimized to be fast and safe per transaction; running analytical queries like “aggregate all the data” slows down the business side. Separating business and analysis is the de facto standard.
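
To make the contrast concrete, here is a minimal sketch with hypothetical table and column names - the first query is the transactional pattern a business DB is built for, the second is the analytical pattern that belongs on a separate platform:

```sql
-- OLTP: point lookup that touches one row via an index (fine on the business DB)
SELECT status
FROM orders
WHERE order_id = 12345;

-- OLAP: full scan plus aggregation over every row (run this on the DWH instead)
SELECT customer_id, SUM(amount) AS lifetime_value
FROM orders
GROUP BY customer_id;
```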

2. Data crosses organizations

Connecting sales, marketing, accounting, and CS data company-wide requires a platform that consolidates everything in one place. Without one, you get the problem of “the same number differs by department.”

3. AI accuracy is decided by data quality

ML models and LLM RAG only work on top of tidy data. A poor data platform caps how far AI utilization can go.

3 options

```mermaid
flowchart LR
    SRC[Business systems<br/>logs/IoT/SNS]
    DWH["DWH<br/>structured-data only<br/>(prepare and load)<br/>BigQuery/Snowflake"]
    LAKE["Data lake<br/>any raw data<br/>(just store)<br/>S3/Azure Blob"]
    LH["Lakehouse<br/>best of both<br/>Delta Lake/Iceberg"]
    SRC -->|structured only| DWH
    SRC -->|format-agnostic| LAKE
    SRC -->|structured + unstructured| LH
    DWH -.- L1[tidy at the cost of<br/>flexibility]
    LAKE -.- L2[flexible but<br/>swamp risk]
    LH -.- L3[the front-runner since 2024]
    classDef src fill:#fef3c7,stroke:#d97706;
    classDef dwh fill:#dbeafe,stroke:#2563eb;
    classDef lake fill:#fae8ff,stroke:#a21caf;
    classDef lh fill:#dcfce7,stroke:#16a34a,stroke-width:2px;
    class SRC src;
    class DWH dwh;
    class LAKE lake;
    class LH lh;
```

| Option | Rough description |
| --- | --- |
| DWH (Data Warehouse) | Analysis-only DB for structured data. Prepare, then load |
| Data lake | Massive storage that holds raw data of any format |
| Lakehouse | Best of both. SQL is applied directly to the data lake |

One of them alone may be enough, and some large enterprises operate a DWH and a data lake side by side. Decide by scale and use case.

Data warehouse (DWH)

A columnar DB optimized for analysis, into which structured data is prepared and loaded. A long-established concept dating from the 1980s, it is the home turf of aggregation analysis - monthly reports, executive dashboards, KPI monitoring.

All modern DWHs are cloud-managed, with BigQuery, Snowflake, and Redshift as the big three. They aggregate at TB-PB scale in seconds, are accessed via SQL, and have moderate learning costs.

| Pros | Cons |
| --- | --- |
| Ultra-fast aggregation analysis | Unstructured data doesn’t fit |
| Anyone can use it via SQL | Storing raw data is uneconomical |
| Fine-grained permission control | Pricing models differ per cloud |
| Almost no operations needed (managed) | Strong vendor lock-in |

Representatives: BigQuery, Snowflake, Amazon Redshift, Azure Synapse

The default is BigQuery or Snowflake. The reasons to choose Redshift are diminishing unless you’re locked to AWS.

Data lake

“Massive storage” that holds raw data regardless of format. CSV, JSON, images, videos, PDFs, logs - the ability to hold anything is its defining feature, operated with the stance of “store everything first, think about how to use it later.” Cloud object storage (S3, GCS, ADLS) serves as the platform as-is.

While the DWH philosophy is “prepare before loading,” the data lake’s is “load now, prepare later.” Machine learning, unstructured-data analysis, audit-log retention - it covers the areas a DWH can’t handle.

| Pros | Cons |
| --- | --- |
| Anything fits, regardless of format | Sloppy use turns it into a “data swamp” |
| Storage cost is extremely cheap | Not queryable with SQL as-is (separate engine needed) |
| Unlimited scale | Permission management and governance are hard |
| Optimal for ML preprocessing | Search and aggregation are slow |

Representatives: Amazon S3, Google Cloud Storage, Azure Data Lake Storage

Just storing data without operational rules produces a data swamp. Always set up a catalog and naming conventions.
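
A minimal sketch of what “catalog plus naming conventions” can look like, assuming hypothetical bucket paths and an Athena/Hive-style DDL that registers the lake files in a catalog (AWS Glue, in this sketch):

```sql
-- Hypothetical convention: s3://<company>-datalake/raw/<source>/dt=<YYYY-MM-DD>/
-- Registering the location in the catalog makes the raw files discoverable and queryable.
CREATE EXTERNAL TABLE raw_app_logs (
  event_id STRING,
  user_id  STRING,
  payload  STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://example-datalake/raw/app_logs/';
```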

Lakehouse

A modern approach that layers DWH features on top of a data lake: apply SQL directly to Parquet files on S3, with ACID transaction support - a “best of both” composition. Proposed by Databricks, with Delta Lake, Apache Iceberg, and Apache Hudi establishing themselves as the standard open table formats.

The concept was born from the pain of “operating both a DWH and a data lake is too heavy,” and the lakehouse is becoming the top candidate for new builds. However, operational know-how is still maturing, so difficulty rises if the team isn’t familiar with it.
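
To show the idea rather than prescribe it, here is a minimal Spark SQL sketch assuming a Delta Lake environment and hypothetical names - the table is nothing but files on object storage, yet it accepts SQL and ACID-safe updates:

```sql
-- Create a Delta table whose storage is plain Parquet files on S3
CREATE TABLE events (
  event_id STRING,
  ts       TIMESTAMP,
  payload  STRING
)
USING DELTA
LOCATION 's3://example-lakehouse/events/';

-- An ACID transaction applied directly to files in object storage
UPDATE events
SET payload = '{"fixed": true}'
WHERE event_id = 'e-123';
```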

| Pros | Cons |
| --- | --- |
| One platform covers both use cases | Operational know-how still maturing |
| Storage cost is cheap | Team learning cost required |
| Weak vendor lock-in | Toolchain still incomplete |
| Handles both structured and unstructured data | Can be excessive at small scale |

Representatives: Databricks, Snowflake (with Iceberg support), BigLake

Since 2024, lakehouse has become the mainstream for new builds. Migration from existing DWH proceeds in stages.

Comparison of the 3

| Viewpoint | DWH | Data lake | Lakehouse |
| --- | --- | --- | --- |
| Structured-data analysis | Excellent | Marginal | Excellent |
| Unstructured data | No | Excellent | Good |
| Storage cost | High | Low | Low-Mid |
| Direct SQL use | Excellent | No | Good |
| Operational simplicity | Excellent | Marginal | Marginal |
| Vendor lock-in | Strong | Weak | Mid |
| Compatibility with ML | Marginal | Excellent | Excellent |

The lakehouse looks balanced and new, but in reality, organizations already running fine on a DWH have no need to force a switch.

BI tool integration

Even with a data platform built, business departments can’t use it without visualization (BI) tools. BI tools issue SQL to the DWH or lakehouse and render dashboards and reports.

| BI tool | Characteristics | Suited for |
| --- | --- | --- |
| Tableau | Strongest feature set, industry standard | Large enterprises, advanced analysis |
| Power BI | Pairs well with Microsoft | Microsoft 365 companies |
| Looker | Strong modeling layer, Google integration | BigQuery users |
| Metabase | OSS, lightweight, free | Small/mid scale, personal use |
| Redash | OSS, SQL-centric | Engineer-led organizations |

Whether business departments can use it by themselves decides how far BI penetrates. The rule is to choose one with a UI non-engineers can operate.

Decision criteria

1. Type of data

The optimal platform varies with the type of data. With structured data only, a DWH alone suffices; once images, videos, or other unstructured data are involved, a data lake or lakehouse becomes mandatory.

| Data type | Recommended |
| --- | --- |
| Structured business data only | DWH (BigQuery, Snowflake) |
| With ML / AI preprocessing | Lakehouse, or DWH + lake combo |
| Mass log/image/video retention | Data-lake-centric |
| Long-term retention for history/audit | Data lake (S3 Glacier etc.) |

2. Scale and budget

DWHs primarily charge per query, so wasteful queries spike the bill. Data lakes have extremely cheap storage and suit a “store first” approach. Your budget-management philosophy also affects the selection.

3. Cloud vendor

Data platforms are strongly tied to cloud vendors. If you’re already on AWS, Redshift + S3; on GCP, BigQuery; on Azure, Synapse + ADLS - aligning with existing cloud is advantageous in operations and billing.

| Vendor | DWH | Data lake |
| --- | --- | --- |
| AWS | Redshift | S3 + Glue |
| Google Cloud | BigQuery | GCS + BigLake |
| Azure | Synapse | ADLS + Fabric |
| Multi-cloud | Snowflake | - |

Snowflake is the only multi-cloud DWH, favored by enterprises that want to avoid vendor lock-in.

How to choose by case

Mid-size company executive KPI visualization

BigQuery + Metabase/Redash. Cheap (from tens of thousands of yen per month), simple, low learning cost. Manage ETL from the business DB with dbt (data build tool - a tool that defines data-transformation pipelines in SQL).
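
For a flavor of what dbt manages, here is a minimal model sketch with hypothetical file and table names - each transformation is one SQL file, and ref() declares dependencies so dbt builds models in the right order:

```sql
-- models/marts/monthly_sales.sql (hypothetical dbt model, BigQuery dialect)
SELECT
  DATE_TRUNC(order_date, MONTH) AS order_month,
  SUM(amount)                   AS total_sales
FROM {{ ref('stg_orders') }}
GROUP BY order_month
```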

Large enterprise / multi-department / multi-cloud

Snowflake. Multi-cloud support, strong permission management, contract-scale discounts. BI is Tableau or Power BI.

ML / AI utilization-focused

Lakehouse (Databricks). Covers everything end to end, including notebooks and MLOps (DevOps for ML). If you have an existing DWH, start with parallel operation.

Startups / MVP phase

Don’t build a data platform. PostgreSQL read replicas + Metabase are enough. Once revenue grows, consider a proper platform.
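
On the Postgres side, that setup is a few grants - a minimal sketch with hypothetical names that gives the BI tool a read-only role pointed at the replica, so analysis can never write to business data:

```sql
-- Read-only role for Metabase, connected to the read replica
CREATE ROLE bi_reader LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE app TO bi_reader;
GRANT USAGE ON SCHEMA public TO bi_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO bi_reader;
-- Also cover tables created in the future
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO bi_reader;
```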

Phased platform-selection table by org scale and data volume

Note: industry baseline values as of April 2026. They will become outdated as technology and the talent market shift, so revisit them periodically.

Choosing a data platform by chasing trends breaks down in operation. The practical rule is to phase the choice by scale and monthly cost.

| Org scale | Data volume | Recommended platform | Monthly target | BI tool |
| --- | --- | --- | --- | --- |
| Personal / MVP | ~10 GB | PostgreSQL only | $0-50 | Metabase (free) |
| Startup | ~1 TB | BigQuery | tens to hundreds of dollars | Metabase / Looker Studio |
| Mid-size SaaS | ~10 TB | BigQuery or Snowflake | hundreds to thousands of dollars | Looker / Tableau |
| Large enterprise / multi-dept | ~100 TB | Snowflake (multi-cloud) | thousands to tens of thousands | Tableau / Power BI |
| Super-large / ML-centric | 100 TB+ | Databricks (lakehouse) | tens of thousands+ | Dedicated dashboards |

Query-billing blow-up pattern: running SELECT * without a LIMIT on BigQuery is the classic story behind monthly bills exploding overnight. Control costs with the triad: partitioning required, column specification required, and leverage the query cache. Snowflake likewise bills by warehouse size × runtime, so the standard is to pin dev environments at XS and enable auto-scaling only in production.
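
A minimal BigQuery sketch of the first two rules, with hypothetical table names - the OPTIONS clause makes BigQuery reject any query that omits a partition filter, so the full-scan accident can’t happen:

```sql
-- Partitioned table that refuses unfiltered full scans
CREATE TABLE analytics.events (
  event_id STRING,
  user_id  STRING,
  ts       TIMESTAMP
)
PARTITION BY DATE(ts)
OPTIONS (require_partition_filter = TRUE);

-- Scans one day and two columns instead of the whole table
SELECT user_id, COUNT(*) AS event_count
FROM analytics.events
WHERE DATE(ts) = '2026-04-01'
GROUP BY user_id;
```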

Snowflake at startup scale is over-investment. BigQuery’s free tier comfortably handles several GB to several hundred GB.

Author’s note - cases of “stored but no one can use” swamps

A business unit stored three years of all-department logs in S3, thinking “we’ll analyze it someday” - but with no schema or naming conventions, hundreds of millions of JSON files piled up, and in the end no one could use them: a swamp. The date format differed per file, field names drifted with every service revision, and the same value had multiple representations. The processing cost of making the data analyzable ended up exceeding the cost of designing log collection from scratch - the cart placed squarely before the horse.

In another shop, the opposite belief - “putting everything in BigQuery is safe” - led to ramming image and video binaries into the DWH, and the query bill jumped to hundreds of thousands of dollars a month; the kind of case told so often it sounds like a joke. Ignore the basic separation - DWH for structured data, data lake for unstructured - and it rebounds on you as cost.

I myself once took log design lightly and went with “JSON for now” on a past project, only to be told six months later by the analytics lead that “we can’t read this.” All of these cases share the same lesson: when storing itself becomes the goal, the platform becomes a junkyard, not a foundation. Catalogs, naming conventions, and use-case-based platform splits are the basic gear that keeps the platform from becoming a swamp.

A data platform’s purpose is not storing data but digging it back out. A catalog and a retention policy prevent debt from accumulating.

Data platform pitfalls and forbidden moves

Here are the typical accidents in data platforms. The moment “storing” becomes the goal, the platform becomes a junkyard.

| Forbidden move | Why it’s bad |
| --- | --- |
| Piling raw data into S3 with no schema or naming convention | A “no one can dig” swamp three years later; processing cost exceeds designing collection anew |
| Putting image/video binaries in the DWH | Query billing hits hundreds of thousands monthly; unstructured data goes to the data lake |
| SELECT * without LIMIT in BigQuery | Monthly bill jumps overnight; column specification + LIMIT required |
| Always-on Large warehouse in Snowflake | Large for dev/staging is wasteful; control with XS + auto-scale |
| Running analysis directly on the business DB | Business-side performance drops, impacting customers; always separate into a DWH via ETL/ELT |
| Hand-written SQL scattered across the org without dbt/ELT | Transformation logic gets person-locked; the secret SQL lives only on a retiree’s PC |
| Loading personal data unmasked into the DWH | GDPR / personal-data law violations; the same risk as Meta’s 2023 EUR 1.2B fine |
| Operating with no data catalog | “Where is this data?” gets asked of people every time; analysis speed drops to 1/10 |
| No retention policy (see the sketch after this table) | Data from five years ago keeps billing at the standard storage tier; move it to tiered storage |
| Adopting a lakehouse from small scale | Operational know-how still maturing; phase it in from mid-scale up |
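
For the retention-policy row, a minimal BigQuery-flavored sketch with hypothetical names - old partitions expire automatically instead of billing forever (on S3, the equivalent is a lifecycle rule that moves objects to Glacier):

```sql
-- Partitions older than a year fall out of active storage automatically
ALTER TABLE analytics.events
SET OPTIONS (partition_expiration_days = 365);
```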

The May 2023 Meta GDPR fine of EUR 1.2B (about JPY 200 billion) shows that mistakes in where data is stored lead directly to fines at a company-survival level. Inputting industry-specific regulatory requirements (GDPR / HIPAA / PCI DSS) first is the starting point of platform design.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are the premise, the data platform’s importance jumps as the platform AI agents access. In an era where LLM RAG, Text-to-SQL, and AI agents query business data, how tidy the data platform is directly sets the ceiling on AI accuracy.

| Favored in the AI era | Disfavored in the AI era |
| --- | --- |
| BigQuery / Snowflake (mainstream, abundant training data) | Minor DWHs |
| dbt + a maintained data catalog | Raw SQL with no documentation |
| Tidy schemas and naming | Sloppy CSV ingestion |
| Visualized data lineage (tracking origin and transformation paths) | Black-box ETL |

The new standard is a data platform that can introduce itself to AI agents. With metadata, catalogs, and lineage in place, AI can autonomously select data and analyze it.
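
What “introducing itself” can mean concretely, as a minimal sketch assuming BigQuery and a hypothetical dataset - if column descriptions are maintained, an agent discovers the whole schema with one query:

```sql
-- Names, types, and human-written descriptions: enough for an agent to pick the right table
SELECT table_name, column_name, data_type, description
FROM analytics.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
ORDER BY table_name, column_name;
```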

A data platform in the AI era is a place AI agents touch. Build with explainability, explicit structure, and mainstream technology.

Common misconceptions

  • “You’ll regret not building a data lake from the start” - excessive at small scale. A DWH alone, like BigQuery, handles several TB. Add a data lake after unstructured data grows
  • “With a DWH, you don’t need a business DB” - completely different things. The business DB (OLTP) handles transactions; the DWH handles analysis. You need both
  • “Buying expensive BI tools advances analysis” - what matters more than the tool is how tidy the data is. Pointing a premium tool at sloppy data yields nothing
  • “Lakehouse is a superset of DWH, so replace everything” - still maturing. With thin operational know-how, there is no need to forcibly switch away from a stable DWH

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in one or two sentences. Starting work while these are still vague invites the later question “why did we decide this again?”

  • Type of platform (DWH / data lake / lakehouse)
  • Cloud vendor (AWS / GCP / Azure / multi)
  • BI tool (Tableau / Power BI / Metabase etc.)
  • Data ingestion method (ETL / ELT / streaming)
  • Retention period and cost tier (hot / cold / archive)
  • Permission management method (Row-Level Security / IAM) - see the sketch after this list
  • Governance regime (catalog, lineage)
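
For the permission-management item, a minimal BigQuery row-level-security sketch with hypothetical group and table names - each analyst group sees only its own region’s rows:

```sql
-- Row-level security: APAC analysts can read only APAC rows
CREATE ROW ACCESS POLICY apac_only
ON analytics.sales
GRANT TO ('group:apac-analysts@example.com')
FILTER USING (region = 'APAC');
```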

How to make the final call

The core of a data platform is separating business and analysis - maintaining the organizational discipline of never throwing analytical queries at the business DB. Without a platform, the company-wide problem of “the same number differs by department” is guaranteed, and BI, ML, and AI utilization all proceed without a foundation. While you’re small, a DWH alone is enough; add a data lake at the stage where unstructured data grows; converge on a lakehouse once operational know-how matures - this phased decision is the rational one.

The decisive axis is the perspective of “a platform AI agents access.” In an era where LLM RAG and Text-to-SQL query the data platform directly, a mainstream DWH (BigQuery / Snowflake) with metadata and lineage in place directly lifts AI’s accuracy ceiling. Minor DWHs and black-box ETL become liabilities in the AI era.

Selection priorities

  1. Separate business and analysis - never run analysis on the business DB; this is the starting point of everything
  2. Phase by scale - start small with BigQuery / Snowflake alone; at large scale, migrate to a lakehouse in phases
  3. Lean on your existing cloud - prioritize the integration benefits in billing, IAM, and operations; for multi-cloud, Snowflake
  4. Build out catalog and lineage - aim for a platform where AI agents can introduce themselves

“Default to BigQuery or Snowflake.” Choose the mainstream, build a readable AI-era data platform.

Summary

This article covered data platforms: the 3 options (DWH, data lake, lakehouse), BI tool integration, phased recommendations by scale, and the catalog operations that avoid data swamps.

Separate business and analysis, phase by scale, lean on your existing cloud, and build a platform AI agents can touch via the catalog. That is the practical answer for a data platform in 2026.

Next time we’ll cover ETL / ELT (the mechanism for extracting, transforming, and loading data).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.