About this article
As the fourth installment of the "Data Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains data platforms.
A platform that only stores data is debt; a platform you can pull data from is an asset. This article covers the characteristics of the three options (DWH, data lake, lakehouse), BI tool integration, recommended setups by scale, and how a platform turns into a "data swamp" the moment "storing" becomes the goal.
What a data platform handles
BI dashboards, executive KPIs, machine learning, AI agents - the foundation of all of these is the data platform. Without a platform, you end up gathering data per individual system, and cross-department company-wide analysis becomes impossible.
Whether or not you have a data platform is the dividing line for DX (Digital Transformation, the shift to digital-first operations) success. Cutting corners here means you can't reach company-wide AI utilization.
Why itâs needed
1. Using business DBs for analysis drops their performance
Business DBs are optimized to be fast and safe per transaction; running analytical queries that aggregate all data slows down the business side. Separating business and analysis workloads is the de facto standard.
2. Data crosses organizations
To connect sales, marketing, accounting, and CS data company-wide, you need a platform that consolidates it in one place. Without it, you get the problem of "the same number differs by department."
3. AI accuracy is decided by data quality
ML models and LLM RAG only work with tidy data. A poor data platform sets the ceiling on AI utilization.
3 options
flowchart LR
SRC[Business systems<br/>logs/IoT/SNS]
DWH["DWH<br/>structured-data only<br/>(prepare and load)<br/>BigQuery/Snowflake"]
LAKE["Data lake<br/>raw data anything<br/>(just store)<br/>S3/Azure Blob"]
LH["Lakehouse<br/>best of both<br/>Delta Lake/Iceberg"]
SRC -->|structured only| DWH
SRC -->|format-agnostic| LAKE
SRC -->|structured + unstructured| LH
DWH -.- L1[tidy at the cost of<br/>flexibility]
LAKE -.- L2[flexible but<br/>swamp risk]
LH -.- L3[the front-runner since 2024]
classDef src fill:#fef3c7,stroke:#d97706;
classDef dwh fill:#dbeafe,stroke:#2563eb;
classDef lake fill:#fae8ff,stroke:#a21caf;
classDef lh fill:#dcfce7,stroke:#16a34a,stroke-width:2px;
class SRC src;
class DWH dwh;
class LAKE lake;
class LH lh;
| Option | Rough description |
|---|---|
| DWH (Data Warehouse) | Analysis-only DB for structured data. Prepare and load |
| Data lake | Massive storage that holds raw data of any format |
| Lakehouse | Best of both. SQL is applied directly to the data lake |
There are cases where one is enough, and there are large enterprises that operate a DWH and a data lake together. Decide by scale and use case.
Data warehouse (DWH)
A columnar DB optimized for analysis, into which structured data is cleaned and loaded. The concept dates back to the 1980s; its home turf is aggregation analysis: monthly reports, executive dashboards, KPI monitoring.
Modern DWHs are cloud-managed services, with BigQuery, Snowflake, and Redshift as the big three. They aggregate TB-PB scale data in seconds, are accessed via SQL, and have moderate learning costs.
| Pros | Cons |
|---|---|
| Ultra-fast aggregation analysis | Unstructured data doesn't fit |
| Anyone can use it via SQL | Storing raw data is uneconomical |
| Fine-grained permission control | Pricing models are unique per cloud |
| Almost no operations needed (managed) | Strong vendor lock-in |
Representatives: BigQuery, Snowflake, Amazon Redshift, Azure Synapse
The default choice is BigQuery or Snowflake. Unless you're locked into AWS, the reasons to choose Redshift are diminishing.
Data lake
"Massive storage" that holds raw data regardless of format. CSV, JSON, images, videos, PDFs, logs: the defining feature is the flexibility to hold anything, operated with the stance of "store everything first, think about how to use it later." A cloud's object storage (S3, GCS, ADLS) serves as the platform as-is.
While a DWH is "prepare before loading," the data lake's idea is "load first, prepare later." Machine learning, unstructured-data analysis, audit-log retention: it covers the areas a DWH can't handle.
| Pros | Cons |
|---|---|
| Anything fits regardless of format | Sloppy use turns it into a "data swamp" |
| Storage cost is extremely cheap | Not SQL-able as is (separate engine needed) |
| Unlimited scale | Permission management/governance is hard |
| Optimal for ML preprocessing | Search and aggregation are slow |
Representatives: Amazon S3, Google Cloud Storage, Azure Data Lake Storage
Just storing data without operational rules turns the lake into a data swamp. Always set up a catalog and naming conventions.
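One concrete way to enforce a naming convention is to generate object keys from a single helper function rather than by hand. Below is a minimal Python sketch; the layout and names are illustrative assumptions, not an S3 standard. It uses a Hive-style `year=/month=/day=` partition path, which query engines such as Athena and Spark can prune on:

```python
from datetime import date

def lake_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key following a Hive-style partition layout.

    Layout (an illustrative convention for this sketch):
      <source>/<dataset>/year=YYYY/month=MM/day=DD/<filename>
    Explicit date partitions let engines prune by date instead of
    scanning the whole bucket.
    """
    if not filename or "/" in filename:
        raise ValueError("filename must be a non-empty leaf name")
    return (
        f"{source}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )

# Example: a key for app logs ingested on 2026-01-15
print(lake_key("app", "access_logs", date(2026, 1, 15), "part-0000.json.gz"))
# -> app/access_logs/year=2026/month=01/day=15/part-0000.json.gz
```

Funneling every writer through one helper is what keeps field names and date formats from drifting per team, which is exactly the failure mode that creates swamps.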
Lakehouse
A modern approach that places DWH features on top of a data lake. Apply SQL directly to Parquet files on S3, with ACID transaction support: a "best of both" composition. Proposed by Databricks, with Delta Lake, Apache Iceberg, and Apache Hudi establishing themselves as standard table formats.
The concept was born from the pain of operating both a DWH and a data lake, and the lakehouse is becoming the top candidate for new builds. However, operational know-how is still maturing, so difficulty rises if the team isn't familiar with it.
| Pros | Cons |
|---|---|
| One platform covers both use cases | Operational know-how still maturing |
| Storage cost is cheap | Team learning cost required |
| Weak vendor lock-in | Toolchain still incomplete |
| Handles both structured and unstructured | Can be excessive at small scale |
Representatives: Databricks, Snowflake (with Iceberg support), BigLake
Since 2024, lakehouse has become the mainstream for new builds. Migration from existing DWH proceeds in stages.
Comparison of the 3
| Viewpoint | DWH | Data lake | Lakehouse |
|---|---|---|---|
| Structured data analysis | Excellent | Marginal | Excellent |
| Unstructured data | No | Excellent | Good |
| Storage cost | High | Low | Low-Mid |
| Direct SQL use | Excellent | No | Good |
| Operational simplicity | Excellent | Marginal | Marginal |
| Vendor lock-in | Strong | Weak | Mid |
| Compatibility with ML | Marginal | Excellent | Excellent |
The lakehouse looks balanced and modern, but organizations already running smoothly on a DWH don't need to force a switch.
BI tool integration
Even with a data platform built, business departments can't use it without visualization (BI) tools. BI tools issue SQL to the DWH/lakehouse and display dashboards and reports.
| BI tool | Characteristics | Suited for |
|---|---|---|
| Tableau | Strongest features, industry standard | Large enterprises, advanced analysis |
| Power BI | Pairs well with Microsoft | Microsoft 365 companies |
| Looker | Strong modeling layer, Google integration | BigQuery users |
| Metabase | OSS, lightweight, free | Small/mid scale, personal |
| Redash | OSS, SQL-centric | Engineer-led organizations |
Whether business departments can use it themselves decides BI penetration. Choosing one with a UI usable by non-engineers is the rule.
Decision criteria
1. Type of data
The optimal platform varies with data type. With structured data only, a DWH alone suffices; once images, video, or other unstructured data is involved, a data lake or lakehouse becomes mandatory.
| Data type | Recommended |
|---|---|
| Structured business data only | DWH (BigQuery, Snowflake) |
| With ML / AI preprocessing | Lakehouse or DWH + lake combo |
| Mass log/image/video retention | Data lake-centric |
| Long-term retention for history/audit | Data lake (S3 Glacier etc.) |
2. Scale and budget
DWHs primarily charge per query, and wasteful queries spike the bill. Data lakes have extremely cheap storage and suit a "store first" approach. Your budget-management philosophy also affects the selection.
3. Cloud vendor
Data platforms are strongly tied to cloud vendors. If you're already on AWS, Redshift + S3; on GCP, BigQuery; on Azure, Synapse + ADLS. Aligning with your existing cloud is advantageous in operations and billing.
| Vendor | DWH | Data lake |
|---|---|---|
| AWS | Redshift | S3 + Glue |
| Google Cloud | BigQuery | GCS + BigLake |
| Azure | Synapse | ADLS + Fabric |
| Multi-cloud | Snowflake | - |
Snowflake is the best-known multi-cloud DWH, favored by enterprises that want to avoid vendor lock-in.
How to choose by case
Mid-size company executive KPI visualization
BigQuery + Metabase/Redash. Cheap (from tens of thousands of yen monthly), simple, low learning cost. Manage ETL from the business DB with dbt (data build tool, which defines data-transformation pipelines in SQL).
Large enterprise / multi-department / multi-cloud
Snowflake. Multi-cloud support, strong permission management, contract-scale discounts. BI is Tableau or Power BI.
ML / AI utilization-focused
Lakehouse (Databricks). End-to-end coverage including notebooks and MLOps (DevOps for ML). If you have an existing DWH, start with parallel operation.
Startups / MVP phase
Don't build a data platform yet. PostgreSQL read replicas + Metabase are enough. Once revenue grows, consider a proper introduction.
Phased platform-selection table by org scale and data volume
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
Choosing a data platform "by trend" breaks down in operation. The practical rule is to phase by scale and monthly cost.
| Org scale | Data volume | Recommended platform | Monthly target | BI tool |
|---|---|---|---|---|
| Personal/MVP | ~10GB | PostgreSQL only | $0-50 | Metabase (free) |
| Startup | ~1TB | BigQuery | tens-hundreds of dollars | Metabase / Looker Studio |
| Mid-size SaaS | ~10TB | BigQuery or Snowflake | hundreds-thousands of dollars | Looker / Tableau |
| Large enterprise / multi-dept | ~100TB | Snowflake (multi-cloud) | thousands-tens-of-thousands | Tableau / Power BI |
| Super-large / ML-centric | 100TB+ | Databricks (lakehouse) | tens-of-thousands+ | Dedicated dashboards |
Query-billing blow-up pattern: running SELECT * without a LIMIT on BigQuery is the classic story of a monthly bill exploding overnight. Control costs with the triad: require partition filters, require explicit column lists, and leverage the query cache. Snowflake bills by warehouse size x runtime, so the standard practice is to fix dev environments at XS and enable auto-scaling only in production.
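To see why explicit column lists matter, here is a back-of-the-envelope cost model in Python. The $6.25/TiB on-demand rate is an assumption for illustration; check current BigQuery pricing for your region before relying on it:

```python
TIB = 1024 ** 4  # bytes in one tebibyte

def on_demand_cost_usd(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Estimate BigQuery on-demand query cost.

    On-demand billing charges by bytes scanned, so selecting only the
    needed columns and filtering on the partition column cuts the bill
    directly. The rate is an illustrative assumption, not official pricing.
    """
    return bytes_scanned / TIB * usd_per_tib

# SELECT * over a 10 TiB table vs. only the two needed columns (~300 GiB):
print(on_demand_cost_usd(10 * TIB))        # 62.5
print(on_demand_cost_usd(300 * 1024**3))   # ~1.83
```

Because columnar storage reads only the columns a query names, the narrow query scans a small fraction of the table, and the bill shrinks proportionally.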
Snowflake at startup scale is over-investment. BigQuery free tier handles several GB to several hundred GB easily.
Author's note - cases of "stored but no one can use" swamps
A business unit stored three years of all-department logs in S3, thinking "we'll analyze it someday," but with no schema or naming conventions, hundreds of millions of JSON files piled up, and in the end no one could use them: a swamp. The date format differed per file, field names drifted with every service revision, and the same value had multiple representations. The processing cost to make them analyzable ended up exceeding the cost of designing log collection from scratch; a textbook cart-before-the-horse ending.
In another shop, conversely, "putting everything in BigQuery is safe" led to ramming image and video binaries into the DWH, and query bills jumped to hundreds of thousands of dollars a month; the kind of joke-like case told surprisingly often. Ignoring the basic split (DWH for structured, data lake for unstructured) rebounds as cost.
I myself once took log design lightly and settled for "JSON for now" on a past project, only to be told six months later by the analytics lead that "we can't read this." All of these cases share the same lesson: when storing itself becomes the goal, the platform becomes a junkyard, not a foundation. Catalogs, naming conventions, and use-case-based platform splits are the basic gear that keeps a platform from becoming a swamp.
A data platform's purpose is not storing data but pulling value out of it. A catalog and a retention policy prevent debt from accumulating.
Data platform pitfalls and forbidden moves
Here are the typical accidents in data platforms. The moment "storing" becomes the goal, the platform becomes a junkyard.
| Forbidden move | Why itâs bad |
|---|---|
| Pile raw data in S3 with no schema or naming convention | A swamp no one can dig through three years later; processing cost exceeds designing collection anew |
| Putting image/video binaries in the DWH | Query billing hits hundreds of thousands monthly. Unstructured data goes to the data lake |
| SELECT * without LIMIT in BigQuery | Monthly bill jumps overnight. Explicit columns + LIMIT required |
| Always-on Large warehouse in Snowflake | Large for dev/staging is wasteful. Control with XS + Auto-scale |
| Run direct analysis on the business DB | Business-side performance drops, customer impact. Always separate via ETL/ELT to DWH |
| Hand-written SQL scattered across the org without dbt/ELT | Transformation logic becomes person-dependent; secret SQL survives only on departed employees' PCs |
| Loading personal data into the DWH unmasked | GDPR / privacy-law violation. Same risk class as Meta's 2023 EUR 1.2B fine |
| Operating with no data catalog | "Where is this data?" must be asked of a person every time; analysis speed drops to 1/10 |
| No retention policy | Five-year-old data keeps billing at the Standard storage tier. Move it to tiered storage |
| Adopting lakehouse from small scale | Operational know-how still maturing. Phase in from mid-scale up |
Meta's May 2023 GDPR fine of EUR 1.2B (about JPY 200 billion) shows that mistakes in where data is stored lead directly to fines at a company-survival level. Inputting industry-specific regulatory requirements (GDPR / HIPAA / PCI DSS) first is the starting point of platform design.
AI-era perspective
When AI-driven development (vibe coding) and AI usage are the premise, the data platform's importance jumps as the platform AI agents access. In an era where LLM RAG, Text-to-SQL, and AI agents query business data, the tidiness of the data platform directly sets the ceiling on AI accuracy.
| Favored in the AI era | Disfavored in the AI era |
|---|---|
| BigQuery / Snowflake (mainstream, abundant training data) | Minor DWHs |
| dbt + data catalog maintenance | Raw SQL with no documentation |
| Tidy schema, naming | Sloppy CSV ingestion |
| Data lineage (tracking origin and transformation paths) visualized | Black-box ETL |
The new standard is a data platform that can introduce itself to AI agents: with metadata, catalogs, and lineage in place, an AI can autonomously select data and run analyses.
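As one hedged sketch of what "introducing itself" could look like: render each table's metadata as a compact JSON card that an agent reads before writing SQL. The card shape, schema, and table names here are made up for illustration, not a standard catalog format:

```python
import json

def catalog_card(table: str, description: str, columns: dict[str, str],
                 lineage: list[str]) -> str:
    """Render table metadata as a JSON 'card' an LLM agent can consume.

    The field layout is an illustrative assumption for this sketch.
    """
    return json.dumps({
        "table": table,
        "description": description,
        "columns": columns,    # name -> meaning, so Text-to-SQL picks the right fields
        "upstream": lineage,   # where the data comes from (lineage)
    }, ensure_ascii=False, indent=2)

card = catalog_card(
    "analytics.daily_sales",
    "One row per store per day; revenue is tax-included JPY.",
    {"store_id": "store master key",
     "day": "business date",
     "revenue_jpy": "gross sales for the day"},
    ["raw.pos_transactions", "dbt model: stg_pos"],
)
print(card)
```

Injecting cards like this into an agent's context is far cheaper than letting it probe schemas by trial-and-error queries, and it makes column semantics explicit instead of guessed.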
A data platform in the AI era is a place AI agents touch. Build with explainability, explicit structure, and mainstream technology.
Common misconceptions
- "You'll regret not building a data lake from the start": excessive at small scale. A DWH alone, such as BigQuery, handles several TB. Add a data lake after unstructured data grows
- "With a DWH, you don't need a business DB": completely different things. The business DB (OLTP) handles transactions; the DWH handles analysis. You need both
- "Buying expensive BI tools advances analysis": what matters more than the tool is how tidy the data is. Pointing a premium tool at sloppy data yields nothing
- "The lakehouse is a superset of DWH, so replace everything": still maturing. With thin operational know-how, there is no need to forcibly replace a stable DWH
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work with these left vague invites later questions like "why did we decide this again?"
- Type of platform (DWH / data lake / lakehouse)
- Cloud vendor (AWS / GCP / Azure / multi)
- BI tool (Tableau / Power BI / Metabase etc.)
- Data ingestion method (ETL / ELT / streaming)
- Retention period and cost tier (hot / cold / archive)
- Permission management method (Row-Level Security / IAM)
- Governance regime (catalog, lineage)
How to make the final call
The core of a data platform is separating business and analysis: maintaining the organizational discipline of never throwing analytical queries at the business DB. Without a platform, the company-wide problem of "the same number differs by department" is guaranteed, and BI, ML, and AI utilization all proceed without a foundation. While scale is small, a DWH alone is enough; add a data lake at the stage where unstructured data grows; converge on a lakehouse once operational know-how matures. This phased decision is the rational one.
The decisive axis is the perspective of a platform AI agents access. In an era where LLM RAG and Text-to-SQL query the data platform directly, a platform with a mainstream DWH (BigQuery / Snowflake) plus metadata and lineage in place directly lifts AI's accuracy ceiling. Minor DWHs and black-box ETL become liabilities in the AI era.
Selection priorities
- Separate business and analysis - don't run analysis on the business DB; this is the starting point of everything
- Phase by scale - small starts on BigQuery / Snowflake alone, large migrates phased to lakehouse
- Lean on existing cloud - prioritize integration benefits in billing/IAM/operations; multi-cloud means Snowflake
- Build out catalog and lineage - aim for a platform where AI agents can introduce themselves
"Default to BigQuery or Snowflake." Choose the mainstream and build a readable, AI-era data platform.
Summary
This article covered data platforms, including the 3 options DWH/data lake/lakehouse, BI tool integration, phased recommendations by scale, and catalog operations to avoid data swamps.
Separate business and analysis, phase by scale, lean on existing cloud, and build a platform AI agents touch via catalog. That is the practical answer for a data platform in 2026.
Next time we'll cover ETL / ELT (the mechanisms for extracting, transforming, and loading data).
I hope youâll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (42/89)