About this article
As the fourth installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains data platforms.
A platform that only stores is debt; a platform you can pull data from is an asset. This article covers the characteristics of the 3 options DWH/data lake/lakehouse, BI tool integration, recommended compositions by scale, and the structure where the platform turns into a “data swamp” the moment “storing” becomes the goal.
What is a data platform
In a nutshell, a data platform is “the foundation that collects company-wide data and makes it accessible for anyone to analyze and use.”
Picture a library’s catalog system. If each department manages its own books (data) independently, you can’t even find out whether another department’s books exist. Build a centralized catalog (DWH) and anyone can locate what they need. In the AI era, this data platform becomes the “source of truth” that LLMs and BI tools read from, so platform quality directly sets the ceiling on AI utilization.
What a data platform handles
BI dashboards, executive KPIs, machine learning, AI agents - the foundation of all of these is the data platform. Without a platform, you end up gathering data per individual system, and cross-department company-wide analysis becomes impossible.
Whether or not you have a data platform is the dividing line of DX success. Cutting corners here means you can’t reach company-wide AI utilization.
Why it’s needed
1. Using business DBs for analysis drops their performance
Business DBs are optimized for “fast and safe per transaction,” and running analytical queries like “aggregate all data” slows down the business side. Separating business and analysis is the de facto standard.
2. Data crosses organizations
To connect sales/marketing/accounting/CS data company-wide, “a platform that consolidates in one place” is needed. Without it, the problem of “the same number is different per department” arises.
3. AI accuracy is decided by data quality
ML models and the RAG of LLMs only stand up with “tidy data.” A poor data platform sets the ceiling on AI utilization.
3 options
| Option | Rough description |
|---|---|
| DWH (Data Warehouse) | Analysis-only DB for structured data. Prepare and load |
| Data lake | Massive storage that holds raw data of any format |
| Lakehouse | Best of both. SQL is applied directly to the data lake |
There are cases where one is enough, and there are large enterprises that operate “DWH + data lake together.” Decide by scale and use case.
Data warehouse (DWH)
A columnar DB optimized for analysis, where structured data is prepared and loaded. A historical concept dating from the 1980s, it’s the main battlefield for “aggregation analysis” - monthly reports, executive dashboards, KPI monitoring.
All modern DWHs are cloud-managed, with BigQuery, Snowflake, and Redshift as the big three. They aggregate at TB-PB scale in seconds, are accessed via SQL, and have moderate learning costs.
| Pros | Cons |
|---|---|
| Aggregation analysis ultra-fast | Unstructured data doesn’t fit |
| Anyone can use it via SQL | Storing raw data is uneconomical |
| Fine-grained permission control | Pricing models are unique per cloud |
| Almost no operations needed (managed) | Strong vendor lock-in |
Representatives: BigQuery, Snowflake, Amazon Redshift, Azure Synapse
The default is BigQuery or Snowflake. The reasons to choose Redshift are diminishing unless you’re locked to AWS.
Data lake
“Massive storage” that holds raw data regardless of format. CSV, JSON, images, videos, PDFs, logs - the flexibility to hold anything is the feature, operated with the stance of “first store everything, think about how to use it later.” A cloud’s object storage (S3, GCS, ADLS) becomes the platform as is.
While DWH is “prepare before loading,” the data lake’s idea is load and prepare later. Machine learning, unstructured-data analysis, audit-log retention - it covers “areas DWH can’t handle.”
| Pros | Cons |
|---|---|
| Anything fits regardless of format | Sloppy use turns it into a “data swamp” |
| Storage cost is extremely cheap | Not SQL-able as is (separate engine needed) |
| Unlimited scale | Permission management/governance is hard |
| Optimal for ML preprocessing | Search and aggregation are slow |
Representatives: Amazon S3, Google Cloud Storage, Azure Data Lake Storage
Just storing without operational rules turns into a data swamp. Always set up catalog and naming conventions.
Lakehouse
A modern approach that places DWH features on top of a data lake. Apply SQL directly to Parquet files on S3, support ACID transactions - a “best of both” composition. Proposed by Databricks, with Delta Lake, Apache Iceberg, and Apache Hudi establishing themselves as standard data formats.
The concept was born from the issue “operating both a DWH and a data lake is too heavy,” and lakehouse is becoming the top candidate for new builds. However, operational know-how is still developing, so the difficulty rises if the team isn’t familiar.
| Pros | Cons |
|---|---|
| One platform covers both use cases | Operational know-how still maturing |
| Storage cost is cheap | Team learning cost required |
| Weak vendor lock-in | Toolchain still incomplete |
| Handles both structured and unstructured | Can be excessive at small scale |
Representatives: Databricks, Snowflake (Iceberg-supporting), BigLake
Since 2024, lakehouse has become the mainstream for new builds. Migration from existing DWH proceeds in stages.
Comparison of the 3
| Viewpoint | DWH | Data lake | Lakehouse |
|---|---|---|---|
| Structured data analysis | Excellent | Marginal | Excellent |
| Unstructured data | No | Excellent | Good |
| Storage cost | High | Low | Low-Mid |
| Direct SQL use | Excellent | No | Good |
| Operational simplicity | Excellent | Marginal | Marginal |
| Vendor lock-in | Strong | Weak | Mid |
| Compatibility with ML | Marginal | Excellent | Excellent |
The lakehouse looks balanced and new, but the reality is that organizations already running on a DWH don’t need to forcibly switch.
BI tool integration
Even building a data platform, business departments can’t use it without visualization tools (BI). BI tools issue SQL to the DWH/lakehouse and display dashboards and reports.
| BI tool | Characteristics | Suited for |
|---|---|---|
| Tableau | Strongest features, industry standard | Large enterprises, advanced analysis |
| Power BI | Pairs well with Microsoft | Microsoft 365 companies |
| Looker | Strong modeling layer, Google integration | BigQuery users |
| Metabase | OSS, lightweight, free | Small/mid scale, personal |
| Redash | OSS, SQL-centric | Engineer-led organizations |
Whether business departments can use it themselves decides BI penetration. Choosing one with a UI usable by non-engineers is the rule.
Decision criteria
1. Type of data
The optimal platform varies with data type. Structured data only is DWH only, but with images/videos/unstructured data involved, a data lake or lakehouse becomes mandatory.
| Data type | Recommended |
|---|---|
| Structured business data only | DWH (BigQuery, Snowflake) |
| With ML / AI preprocessing | Lakehouse or DWH + lake combo |
| Mass log/image/video retention | Data lake-centric |
| Long-term retention for history/audit | Data lake (S3 Glacier etc.) |
2. Scale and budget
DWHs primarily charge by query billing, and useless queries spike the bill. Data lakes have extremely cheap storage and suit “store first.” Budget management philosophy also affects selection.
3. Cloud vendor
Data platforms are strongly tied to cloud vendors. If you’re already on AWS, Redshift + S3; on GCP, BigQuery; on Azure, Synapse + ADLS - aligning with existing cloud is advantageous in operations and billing.
| Vendor | DWH | Data lake |
|---|---|---|
| AWS | Redshift | S3 + Glue |
| Google Cloud | BigQuery | GCS + BigLake |
| Azure | Synapse | ADLS + Fabric |
| Multi-cloud | Snowflake | - |
Snowflake is the only multi-cloud DWH, supported by enterprises wishing to avoid vendor lock-in.
How to choose by case
Mid-size company executive KPI visualization
BigQuery + Metabase/Redash. Cheap (from tens of thousands of yen monthly), simple, low learning cost. Manage ETL from business DB with dbt (data build tool, the tool defining data-transformation pipelines in SQL).
Large enterprise / multi-department / multi-cloud
Snowflake. Multi-cloud support, strong permission management, contract-scale discounts. BI is Tableau or Power BI.
ML / AI utilization-focused
Lakehouse (Databricks). End-to-end including notebooks and MLOps (DevOps for ML). With existing DWH, start from parallel operation.
Startups / MVP phase
Don’t build a data platform. PostgreSQL read replicas + Metabase are enough. Once revenue grows, consider proper introduction.
Phased platform-selection table by org scale and data volume
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
Choosing a data platform “by trends” breaks down in operations. The practical rule is to phase by scale and monthly cost.
| Org scale | Data volume | Recommended platform | Monthly target | BI tool |
|---|---|---|---|---|
| Personal/MVP | ~10GB | PostgreSQL only | $0-50 | Metabase (free) |
| Startup | ~1TB | BigQuery | tens-hundreds of dollars | Metabase / Looker Studio |
| Mid-size SaaS | ~10TB | BigQuery or Snowflake | hundreds-thousands of dollars | Looker / Tableau |
| Large enterprise / multi-dept | ~100TB | Snowflake (multi-cloud) | thousands-tens-of-thousands | Tableau / Power BI |
| Super-large / ML-centric | 100TB+ | Databricks (lakehouse) | tens-of-thousands+ | Dedicated dashboards |
Query-billing blow-up pattern: Using SELECT * without limit on BigQuery is the classic story of monthly bills hitting hundreds of thousands overnight. Control costs with the triad partitioning required, column specification required, leverage query cache. Snowflake is also billed by warehouse size x runtime, so the standard is to fix dev environments at XS and Auto-scale on production only.
Snowflake at startup scale is over-investment. BigQuery free tier handles several GB to several hundred GB easily.
Author’s note - cases of “stored but no one can use” swamps
A business unit “stored 3 years of all-department logs in S3” thinking “we’ll analyze it someday,” but with no schema or naming conventions, hundreds of millions of JSON files piled up - and the result was no one could use them, turning into a swamp. Date format differs per file, field names fluctuate with each service revision, and the same value has multiple representations. The processing cost for analysis ended up exceeding the cost of newly designing log collection - reaching the “putting the cart before the horse” punchline.
In another field, conversely, “putting everything in BigQuery is safe” led to ramming image and video binaries into the DWH, with query bills jumping to hundreds of thousands of dollars monthly - the kind of joke-like case told often. Ignoring the basic separation - DWH for structured, data lake for unstructured - rebounds on cost.
I myself once thought lightly about log design and judged “JSON for now” on a past project, only to be told six months later by the analytics person that “we can’t read this.” Both are cases that left the common lesson that when storing itself becomes the goal, the platform becomes a junkyard, not a foundation. Catalogs, naming, and use-case-based platform splits are the basic gear that prevents the platform from becoming a swamp.
A data platform’s purpose is not “storing” but excavating. Catalog and retention policy prevent debt accumulation.
Data platform pitfalls and forbidden moves
Here are the typical accidents in data platforms. The moment “storing” becomes the goal, the platform becomes a junkyard.
| Forbidden move | Why it’s bad |
|---|---|
| Pile raw data in S3 with no schema or naming convention | ”No one can dig” swamp 3 years later. Processing cost exceeds new collection design |
| Putting image/video binaries in DWH | Query billing hits hundreds of thousands monthly. Unstructured goes to data lake |
SELECT * without limit in BigQuery | Monthly bill jumps overnight. Column specification + LIMIT required |
| Always-on Large warehouse in Snowflake | Large for dev/staging is wasteful. Control with XS + Auto-scale |
| Run direct analysis on the business DB | Business-side performance drops, customer impact. Always separate via ETL/ELT to DWH |
| Hand-written SQL scattered across the org without dbt/ELT | Transformation logic gets person-locked. Secret SQL only on retirees’ PCs |
| Loading personal data unmasked into DWH | GDPR / personal info law violation. Same risk as Meta’s 2023 EUR 1.2B fine |
| Operating with no data catalog | ”Where’s this data?” asked of people every time. Analysis speed at 1/10 |
| No retention policy | All data from 5 years ago bleeding through Standard-class billing. Move to tiered storage |
| Adopting lakehouse from small scale | Operational know-how still maturing. Phase in from mid-scale up |
| Building a data lake from the start assuming “you’ll regret not having one” | Excessive at small scale; DWH alone handles several TB. Add a data lake after unstructured data grows |
| Assuming “buying expensive BI tools will advance analysis” | What matters more than tools is how tidy the data is; pointing premium tools at sloppy data yields nothing |
The May 2023 Meta GDPR fine of EUR 1.2B (about JPY 200 billion) is a case showing that data-storage-location design mistakes lead directly to fines at the company-survival level. “Inputting industry-specific regulatory requirements (GDPR / HIPAA / PCI DSS) first” is the starting point of platform design.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| BigQuery / Snowflake (mainstream, abundant training data) | Minor DWHs |
| dbt + data catalog maintenance | Raw SQL with no documentation |
| Tidy schema, naming | Sloppy CSV ingestion |
| Data lineage visualized | Black-box ETL |
- Separate business and analysis — don’t run analysis on the business DB; this is the starting point of everything.
- Phase by scale — small starts on BigQuery / Snowflake alone, large migrates phased to lakehouse.
- Lean on existing cloud — prioritize integration benefits in billing/IAM/operations; multi-cloud means Snowflake.
- Build out catalog and lineage — aim for a platform where AI agents can introduce themselves.
Mainstream DWHs have high AI-generated SQL accuracy
BigQuery, Snowflake, and Redshift have massive SQL patterns in training data, so AI-generated SQL accuracy is stable. With minor DWHs, dialect SQL accuracy drops and manual corrections increase. Including “can AI write for it” as an evaluation axis in platform selection is the current era.
Data platform Text-to-SQL readiness
For natural-language-to-SQL (Text-to-SQL) to work accurately, the platform needs: well-named tables/columns, rich metadata (COMMENTs, descriptions), and catalog API access. BigQuery and Snowflake both have native AI SQL-generation features (Gemini in BigQuery, Cortex in Snowflake), but their accuracy depends entirely on metadata quality. Building the data platform with “AI will query this” as a design premise from day one is the 2026 standard.
What to decide - what is your project’s answer?
For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work with these vague always invites later questions like “why did we decide this again?”
- Type of platform (DWH / data lake / lakehouse)
- Cloud vendor (AWS / GCP / Azure / multi)
- BI tool (Tableau / Power BI / Metabase etc.)
- Data ingestion method (ETL / ELT / streaming)
- Retention period and cost tier (hot / cold / archive)
- Permission management method (Row-Level Security / IAM)
- Governance regime (catalog, lineage)
Related Articles
https://en.senkohome.com/arch-intro-data-datastore/ https://en.senkohome.com/arch-intro-data-etl/ https://en.senkohome.com/arch-intro-data-modeling/
Summary
This article covered data platforms, including the 3 options DWH/data lake/lakehouse, BI tool integration, phased recommendations by scale, and catalog operations to avoid data swamps.
Separate business and analysis, phase by scale, lean on existing cloud, and build a platform AI agents touch via catalog. That is the practical answer for a data platform in 2026.
Next time we’ll cover ETL / ELT (the mechanism for extracting, transforming, and loading data).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (42/89)