About this article
As the fifth installment in the "Data Architecture" category of the series "Architecture Crash Course for the Generative-AI Era," this article explains ETL / ELT.
In the AI era, transformation logic that isn't written as code can't be read by AI and becomes a liability. This article explains the difference between ETL and ELT, typical modern stacks (Fivetran + dbt + BigQuery, etc.), data-quality testing and lineage, recommended stacks by scale, and why GUI ETL tools become liabilities in the AI era.
What is ETL/ELT in the first place
In a nutshell, ETL/ELT is "the mechanism for gathering scattered data into one place and shaping it into a form that is easy to analyze."
Imagine prepping ingredients in a kitchen. You procure ingredients from the greengrocer, fishmonger, and butcher (business DBs, SaaS, logs) - that's Extract. You wash, cut, and season them - that's Transform. You organize and store them in the fridge - that's Load. Without this prep, you'd have to run out and buy ingredients every time an order comes in. ETL automates this prep process in the world of data.
What ETL / ELT handles
Traditionally, ETL (extract → transform → load) was mainstream, but now that cloud DWHs (BigQuery, Snowflake) have become extremely fast, ELT - loading into the DWH first and then transforming with SQL - has become standard. ETL is the standard of the on-prem era; ELT, of the cloud era.
Data pipelines are the blood vessels of the data platform. Clog them and company-wide analysis stops.
Why it's needed
1. Business DBs canât be used directly for analysis
Business DBs (OLTP) are optimized for transaction processing, and aggregation queries get in the way of business processing. Data needs to be copied to a separate DB for analysis.
2. Data must be aggregated from multiple sources
Business DBs, SaaS (Salesforce, Stripe, Google Analytics, etc.), external APIs, logs - without "gathering them in one place," company-wide analysis can't happen.
3. Unprepared data can't be used
Raw data has many notation inconsistencies, duplicates, and missing values - analyzing it as is yields no meaningful results. Cleansing, joining, and aggregation are needed.
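The cleansing described above can be sketched in a few lines. This is a minimal illustration, not a real pipeline; the records and field names are hypothetical, showing the three typical problems (notation inconsistencies, duplicates, missing values) and their fixes.

```python
# Hypothetical raw records with notation inconsistencies, duplicates, and missing values.
raw = [
    {"id": 1, "region": "Tokyo ", "amount": "1000"},
    {"id": 1, "region": "tokyo",  "amount": "1000"},  # duplicate id
    {"id": 2, "region": "OSAKA",  "amount": None},    # missing amount
    {"id": 3, "region": "Tokyo",  "amount": "2500"},
]

def clean(rows):
    seen, out = set(), []
    for r in rows:
        # Drop duplicates and rows with missing required values.
        if r["id"] in seen or r["amount"] is None:
            continue
        seen.add(r["id"])
        out.append({
            "id": r["id"],
            "region": r["region"].strip().title(),  # normalize notation
            "amount": int(r["amount"]),             # normalize type
        })
    return out

print(clean(raw))  # only ids 1 and 3 survive
```

In a real pipeline this logic would live in SQL inside the DWH (the T of ELT), but the shape of the work is the same.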
ETL vs ELT
```mermaid
flowchart LR
    subgraph ETL_FLOW["ETL (legacy, 2000s)"]
        E1[Extract] --> T1[Transform<br/>dedicated ETL server]
        T1 --> L1[Load → DWH]
    end
    subgraph ELT_FLOW["ELT (cloud, 2015+)"]
        E2[Extract] --> L2[Load → DWH]
        L2 --> T2[Transform<br/>in-DWH SQL/dbt]
    end
    classDef old fill:#fee2e2,stroke:#dc2626;
    classDef new fill:#dcfce7,stroke:#16a34a;
    class ETL_FLOW,E1,T1,L1 old;
    class ELT_FLOW,E2,L2,T2 new;
```
| | ETL (legacy) | ELT (cloud) |
|---|---|---|
| Order | Extract → Transform → Load | Extract → Load → Transform |
| Transform location | Dedicated server (ETL tool) | In-DWH SQL |
| Lead era | On-prem, 2000s | Cloud, 2015+ |
| Transform language | GUI, custom DSL | SQL-centric (dbt etc.) |
| Representative tools | Informatica, Talend | dbt, Fivetran |
Since cloud DWHs got fast enough, ELT - transforming inside the DWH - has become mainstream. The era of preparing a separate transformation server with ETL is ending.
Typical modern stack
Modern data pipelines flow as "extract → load → transform → utilize," with role-specialized tools used at each stage. The following stack has become established as best practice.
```
[Business DB / SaaS]
        │
        ▼  Fivetran / Airbyte (extract + load = EL)
[DWH: BigQuery / Snowflake]
        │
        ▼  dbt (transform = T)
[tidy data marts]
        │
        ▼  Looker / Tableau / Metabase
[Dashboards / BI]
```
Extract + load (EL) is handled by Fivetran, transform (T) by dbt - this layer split is the modern standard pattern.
Extract + load tools (EL layer)
Tools that automate ingestion from business DBs and SaaS to the DWH. Previously it was normal to write custom scripts per system, but today connector-equipped SaaS tools are common, ingesting from hundreds of sources with just connection settings.
Fivetran is the industry-standard managed EL service. It provides 300+ source connectors, syncing automatically from Salesforce, Stripe, PostgreSQL, and others to the DWH with just connection settings. Differential sync, schema-change tracking, and retries are all automated, and its strength is that it barely needs engineer time - but its data-volume billing can run to hundreds of thousands or even millions of yen per month.
Airbyte is the OSS counterpart to Fivetran, available for free via self-hosting (a paid cloud version also exists). Its connector count keeps growing, and it's the top choice for cost-conscious mid-size companies. Stability and support don't yet match Fivetran's, but the functional gap has narrowed year by year.
| Tool | When to choose |
|---|---|
| Fivetran | Large enterprises with budget wanting minimal operational load |
| Airbyte (OSS or Cloud) | Cost-focused mid-size with self-hosting capability |
| Stitch | Small scale, simple sources only |
| Hevo | Mid-size companies in Japan that prioritize support |
| Custom scripts | Special sources, very small scale (not recommended) |
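To make concrete what an EL tool automates, here is a minimal sketch of cursor-based incremental sync, using sqlite3 in place of a real source DB and DWH. The table and column names are hypothetical; real tools add schema-change tracking, retries, and hundreds of connectors on top of this core loop.

```python
import sqlite3

# Stand-ins for a source business DB and a warehouse.
src = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, updated_at INTEGER)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])
dwh.execute("CREATE TABLE raw_orders (id INTEGER, updated_at INTEGER)")

def sync(last_cursor: int) -> int:
    """Copy only rows newer than last_cursor; return the new cursor value."""
    rows = src.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_cursor,),
    ).fetchall()
    dwh.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    return rows[-1][1] if rows else last_cursor

cursor = sync(0)       # initial sync copies all three rows
src.execute("INSERT INTO orders VALUES (4, 400)")
cursor = sync(cursor)  # differential sync picks up only the new row
print(dwh.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0])  # → 4
```

The fragility the table warns about lives outside this loop: every source schema change breaks the `SELECT`, which is exactly the maintenance burden managed connectors absorb.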
Transform tools (T layer)
Tools that transform data inside the DWH with SQL to prepare it for analysis. The modern de facto standard is dbt (data build tool), which defines data-transformation pipelines in SQL and provides a way to version-control, test, and document that transformation logic.
Before dbt, transformation logic was scattered across SQL files, stored procedures, and Excel macros, and it was common for numbers that no one knew who had built, or how, to show up in executive meetings. dbt revolutionized data platforms by solving this problem with "SQL + Git + tests."
| Tool | Characteristics |
|---|---|
| dbt Core | OSS, transform with SQL, most prevalent |
| dbt Cloud | Managed version of dbt Core |
| Dataform | Google's dbt equivalent (BigQuery-integrated) |
| Matillion | GUI-centric, enterprise-oriented |
It's harder to find a reason not to use dbt - it's the de facto standard for modern data transformation.
Orchestration (scheduling)
Pipelines need scheduling and dependency management, such as "run at 2am daily" or "run the next job when the prior job succeeds." This is handled by an orchestrator - Airflow has long been the de facto standard, with Prefect and Dagster also on the rise today.
| Tool | Characteristics | Suited for |
|---|---|---|
| Apache Airflow | Industry standard, largest community | Large enterprises, existing assets |
| Prefect | Modern UI, strong failure handling | New builds, mid-size |
| Dagster | Data-centric design, easy to test | Data-engineer-led |
| Cloud Composer | Managed Airflow (GCP) | GCP-using companies |
| Cron + scripts | Ultra-light | Small, simple |
At scales where the scheduling features of dbt Cloud / Fivetran suffice, a dedicated orchestrator isn't needed.
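The core of what every orchestrator does - running jobs in dependency order - can be sketched with the standard library's topological sorter. The job names below are hypothetical, and real orchestrators add scheduling, retries, and state tracking on top of this.

```python
from graphlib import TopologicalSorter

# A pipeline DAG: each job maps to the set of jobs that must finish first.
dag = {
    "extract_salesforce": set(),
    "extract_stripe": set(),
    "load_dwh": {"extract_salesforce", "extract_stripe"},
    "dbt_run": {"load_dwh"},
    "dbt_test": {"dbt_run"},
}

# static_order() yields jobs so that every job comes after its upstream jobs.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

This is also why a cycle in your job dependencies is a configuration error: `TopologicalSorter` (like Airflow) refuses to schedule a graph that isn't acyclic.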
Data quality testing
Automatically testing whether transformed data is correct is the modern norm. Conditions like "no NULLs," "uniqueness," "value range," and "referential integrity" are written as code via dbt's tests feature and validated automatically at every pipeline run.
In the era without tests, the problem of "the monthly report numbers are wrong" happened often, with cause-tracking sometimes taking weeks. With dbt's testing feature, you can stop the pipeline the moment bad data enters.
| Test type | Content |
|---|---|
| not_null | No NULLs |
| unique | No duplicates |
| relationships | Foreign keys actually exist |
| accepted_values | Only allowed values |
| custom SQL | Arbitrary business rules |
A pipeline without tests can't be trusted. If it's used for executive decisions, write tests.
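The test types in the table above are declared in a dbt `schema.yml` file. The following is a minimal sketch; the model and column names (`orders`, `customers`, `status`, etc.) are hypothetical.

```yaml
# schema.yml - dbt's built-in tests, declared next to the model definition.
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null        # no NULLs
          - unique          # no duplicates
      - name: customer_id
        tests:
          - relationships:  # foreign key actually exists
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered']
```

Running `dbt test` compiles each entry into a SQL query that counts violating rows; a nonzero count fails the run, which is what lets CI stop the pipeline.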
Data lineage
Data lineage is the mechanism for tracking "from which column of which business DB, through what transformations, did this 'sales amount' on a dashboard come?" In large organizations the question "can this number be trusted?" arises daily, so without lineage, the reliability of numbers can't be guaranteed.
dbt automatically generates lineage graphs. Inter-model dependencies are visualized as a DAG (directed acyclic graph), so you can know in advance what breaks if you delete a given table.
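The impact question lineage answers - "what breaks downstream if I delete this table?" - is a reachability walk over the dependency graph. A minimal sketch, with hypothetical model names in the dbt staging/fact naming style:

```python
# Edges point upstream → downstream (who reads from whom).
downstream = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_sales", "fct_refunds"],
    "fct_sales": ["dashboard_revenue"],
    "fct_refunds": [],
    "dashboard_revenue": [],
}

def impact(node: str) -> set:
    """Everything transitively downstream of `node` - i.e. what breaks if it goes."""
    hit, stack = set(), [node]
    while stack:
        for child in downstream[stack.pop()]:
            if child not in hit:
                hit.add(child)
                stack.append(child)
    return hit

print(sorted(impact("raw_orders")))
# → ['dashboard_revenue', 'fct_refunds', 'fct_sales', 'stg_orders']
```

dbt builds exactly this graph from the `ref()` calls in your models, which is why lineage comes for free once transformations live in dbt.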
| Tool | Lineage visualization |
|---|---|
| dbt | Auto-generated (DAG) |
| DataHub | Org-wide catalog + lineage |
| OpenLineage | OSS standard spec |
| Collibra / Alation | Enterprise commercial |
Decision criteria
1. Org scale
| Scale | Recommended stack |
|---|---|
| Small (a few people, hundreds of thousands of rows) | cron + Python scripts + BigQuery |
| Mid (SaaS companies) | Fivetran + dbt Cloud + BigQuery/Snowflake |
| Large (multi-dept, multi-source) | Fivetran/Airbyte + dbt + Airflow + catalog |
| Super-large, public companies | Full set of enterprise commercial tools |
2. Data freshness requirements
Whether "a morning report is enough" or "a real-time dashboard is required" changes the stack greatly.
| Freshness need | Implementation |
|---|---|
| Daily (morning is OK) | Batch ETL/ELT (overnight) |
| Few hours of delay OK | Hourly batch |
| Few minutes OK | Microbatch (15-min cycle etc.) |
| Real-time (seconds) | Streaming (covered next chapter) |
Most business requirements are fine with daily. Going for real-time carelessly explodes cost and complexity.
3. Team skills
dbt is usable if you can write SQL, but Airflow requires Python knowledge. The team's skill mix changes the selection.
- SQL-centric org → dbt (almost everything can be done with SQL alone)
- Python-engineer-led → Airflow / Prefect / Dagster
- Including non-engineers → Fivetran + dbt Cloud (more GUI-leaning)
How to choose by case
Startup, 0 data engineers
Fivetran + dbt Cloud + BigQuery. Minimum composition that runs if you can write SQL. From around tens of thousands of yen monthly.
Mid-size company, in-house SRE (Site Reliability Engineering)
Airbyte (OSS) + dbt Core + BigQuery/Snowflake + Airflow. Cost-controlled, flexible operation.
Large enterprise, multi-source, governance-focused
Fivetran + dbt + DataHub (catalog) + Airflow. Secure commercial support and governance.
With real-time requirements
Kafka / Kinesis / Pub/Sub + Flink / ksqlDB. Streaming platform (see next chapter).
Numerical gates for pipeline freshness and runtime
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
For data pipelines, "it's running" isn't enough - tracking freshness and runtime with numbers is the modern operational standard.
| Metric | Recommended | Reason |
|---|---|---|
| Daily-job runtime | Within 1 hour | Reserve room for re-runs |
| Hourly-job runtime | Within 15 min | If it doesn't finish before the next hour, runs pile up |
| dbt run time | Within 10 min | Keep dev cycle turning |
| Data-freshness SLO | Set per business requirements | Daily delay monitoring |
| Missing rate | < 0.01% (1 in 10,000) | Investigate root cause above this |
| Duplicate count | 0 (unique constraint) | Verify with dbt tests every time |
| Foreign-key integrity | 0 errors | relationships test |
| NULL rate (required columns) | 0% | not_null test |
| Failure retries | Up to 3 + exponential backoff | Auto-recovery from transient failures |
| Failure notification | Within 5 min | Slack + PagerDuty |
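The "up to 3 retries + exponential backoff" gate from the table can be sketched as a small wrapper. This is an illustration, not a library API; the delays and the simulated flaky job are hypothetical.

```python
import time

def run_with_retries(job, max_retries: int = 3, base_delay: float = 1.0):
    """Run `job`, retrying up to max_retries times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return job()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted → let the failure notification fire
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated transient failure: fails twice, then succeeds.
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky_job, base_delay=0.01))  # → ok
```

Orchestrators like Airflow and Prefect provide this as configuration (retry count and delay per task), which is one reason declarative pipelines beat manually re-run scripts.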
The modern standard is to make dbt tests required in CI and stop pipelines on quality violations. It's the single biggest investment for preventing the classic accident of "monthly report numbers are wrong → weeks of cause-tracking."
Data quality is auto-tested in CI. Noticing after bad data flows to production is too late.
Author's note - the person-month swallowed by "the army of self-built scripts"
There's an often-heard story about a mid-size company that hand-built 10+ Python scripts for SaaS integrations, one per service, for Salesforce, Stripe, Zendesk, and so on. One would break with each vendor schema change, and before anyone noticed, a full junior engineer's effort was going entirely to pipeline maintenance. The self-building started from a desire to avoid SaaS fees, but the typical pattern is that, counted in personnel costs, it ends up several times more expensive than the fees.
In another case, an organization had transformation logic scattered across SQL files, stored procedures, and Excel macros, and critical transformations survived only on the PC of a person who had left the company. Executive reporting stopped for months, and rebuilding took half a year - effort that would never have been spent had the logic been Git-managed in dbt.
I myself once watched a Python-script integration where, right after a junior engineer took it over, vendor API changes broke it weekly and patches were rushed out at night. These cases teach the value of making things visible and standardized as code. The tens of thousands of yen per month for Fivetran + dbt is overwhelmingly cheap as an investment compared to the risk of self-building and key-person lock-in - that's the modern conclusion.
Self-built scripts always lose on an hourly-cost basis. Fivetran/Airbyte fees are recouped in personnel costs.
ETL/ELT pitfalls and forbidden moves
Here are the typical accidents in data pipelines. All of them are direct causes of breaking number reliability.
| Forbidden move | Why it's bad |
|---|---|
| Self-building 10+ SaaS-integration Python scripts | Vendor schema changes break them one by one; consumes a full junior engineer |
| Scattering transformation logic across SQL files, stored procedures, and Excel macros | Hell where secret transformations survive only on a retiree's PC |
| Re-loading all data every time with TRUNCATE + INSERT | Load time explodes and downtime occurs. Switch to incremental loads |
| Operating pipelines without tests | "This number is wrong" surfaces at month-end, followed by weeks of cause-tracking |
| Newly adopting GUI ETL tools (Informatica / Talend) | Outdated in the AI era. Move to dbt / code-based |
| No notifications on errors | You notice 3 days later via a dashboard. Slack + PagerDuty required |
| Running production with manual parameter changes | No record remains; irreproducible. Go declarative via Airflow / Prefect |
| Deduplicating records on the app side | The proper path is enforcing unique constraints in the pipeline |
| No mechanism to notice a 2-hour batch delay | Build in monitoring; alert immediately on SLO violation |
| Editing dbt files directly in production | Bypassing Git makes history vanish |
| Overestimating real-time requirements and adopting streaming | 10x the operational cost on requirements solvable by batch |
| Assuming "dbt is SQL, so engineers aren't needed" | Test design, model splitting, and documentation maintenance are separate skills; data engineers' work remains |
| Adopting Airflow assuming "it solves orchestration" | Airflow is heavy to operate; at small scale, dbt Cloud or Prefect is easier |
dbt, which emerged from Fishtown Analytics (now dbt Labs) in 2016 and went mainstream in the years that followed, is the turning point that revolutionized data transformation with SQL + Git + tests. Before dbt, ETL meant vendor-specific tools and GUI configurations turning into black boxes, and accidents of "numbers no one knew who built, or how" appearing in executive meetings happened often.
GUI ETL tools become liabilities in the AI era. For new builds: dbt + Git only.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| dbt (code-based, SQL + YAML) | Informatica/Talend GUI configurations |
| Airbyte (OSS, code-extensible) | Black-box custom ETL |
| OpenLineage (standard lineage) | Custom lineage formats |
| Version-controlled in Git | Direct production edits |
- ELT + dbt as the basic form → leverage cloud DWH compute power, complete transformations in SQL.
- EL layer with Fivetran / Airbyte → self-built scripts eat person-months; SaaS fees pay for themselves in personnel costs.
- Make data-quality tests required → validate NULL / unique / referential integrity on every run with dbt tests.
- Code-based + Git-managed → GUI configs become outdated in the AI era; lean on YAML and SQL.
What to decide - what is your projectâs answer?
For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these are still vague invites later questions like "why did we decide this again?"
- Extract tool (Fivetran / Airbyte / self-build)
- Transform tool (dbt as top candidate)
- Orchestrator (Airflow / Prefect / not needed if light)
- Data catalog (DataHub / dbt docs / not needed)
- Data quality tests (required dbt test items)
- Schedule frequency (daily / hourly / real-time)
- Failure notification destinations (Slack, PagerDuty)
Summary
This article covered ETL / ELT: the difference between ETL and ELT, typical stacks, data-quality testing, lineage, recommended stacks by scale, and why GUI ETL becomes a liability in the AI era.
ELT + dbt as the basic form, extraction with Fivetran/Airbyte, mandatory quality testing, and lean on a code-based + Git-managed approach. That is the practical answer for ETL/ELT design in 2026.
Next time we'll cover streaming (handling real-time data).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (43/89)