[Data Architecture] ETL/ELT - Fivetran + dbt + DWH Is the Modern Standard

About this article

As the fifth installment of the “Data Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains ETL / ELT.

In the AI era, transformation logic that isn't written as code can't be read by AI and becomes a liability. This article explains the difference between ETL and ELT, the typical modern stack (Fivetran + dbt + BigQuery, etc.), data-quality testing and lineage, recommended stacks by scale, and why GUI ETL tools become liabilities in the AI era.

What is ETL/ELT in the first place

In a nutshell, ETL/ELT is “the mechanism for gathering scattered data into one place and shaping it into a form easy to analyze.”

Imagine prepping ingredients in a kitchen. You procure ingredients from the greengrocer, fishmonger, and butcher (business DBs, SaaS, logs) - that's Extract. You wash, cut, and season them - that's Transform. You organize and store them in the fridge - that's Load. Without this prep, every incoming order would send you running out to buy ingredients. ETL automates this prep work in the world of data.

What ETL / ELT handles

Traditionally, ETL (extract → transform → load) was mainstream, but now that cloud DWHs (BigQuery, Snowflake) have become extremely fast, ELT - loading into the DWH first and then transforming with SQL - has become the standard. ETL was the standard of the on-prem era; ELT is the standard of the cloud era.

Data pipelines are the blood vessels of the data platform. Clog them and company-wide analysis stops.

Why it’s needed

1. Business DBs can’t be used directly for analysis

Business DBs (OLTP) are optimized for transaction processing, and aggregation queries get in the way of business processing. Data needs to be copied to a separate DB for analysis.

2. Data must be aggregated from multiple sources

Business DBs, SaaS (Salesforce, Stripe, Google Analytics, etc.), external APIs, logs - without “gathering them in one place,” company-wide analysis can’t happen.

3. Unprepared data can't be used

Raw data is riddled with inconsistent formats, duplicates, and missing values - analyzing it as-is yields no meaningful results. It needs cleansing, joining, and aggregation.

ETL vs ELT

```mermaid
flowchart LR
    subgraph ETL_FLOW["ETL (legacy, 2000s)"]
        E1[Extract] --> T1[Transform<br/>dedicated ETL server]
        T1 --> L1[Load → DWH]
    end
    subgraph ELT_FLOW["ELT (cloud, 2015+)"]
        E2[Extract] --> L2[Load → DWH]
        L2 --> T2[Transform<br/>in-DWH SQL/dbt]
    end
    classDef old fill:#fee2e2,stroke:#dc2626;
    classDef new fill:#dcfce7,stroke:#16a34a;
    class ETL_FLOW,E1,T1,L1 old;
    class ELT_FLOW,E2,L2,T2 new;
```
| | ETL (legacy) | ELT (cloud) |
|---|---|---|
| Order | Extract → Transform → Load | Extract → Load → Transform |
| Transform location | Dedicated server (ETL tool) | In-DWH SQL |
| Era | On-prem, 2000s | Cloud, 2015+ |
| Transform language | GUI, custom DSL | SQL-centric (dbt etc.) |
| Representative tools | Informatica, Talend | dbt, Fivetran |

Now that cloud DWHs are fast enough, ELT - transforming inside the DWH - is mainstream. The era of standing up a dedicated transformation server for ETL is ending.

The typical modern stack

Modern data pipelines flow "extract → load → transform → utilize," with a specialized tool at each stage. The following stack has settled in as best practice.

```
[Business DB / SaaS]
      │
      ▼ Fivetran / Airbyte (extract + load = EL)
[DWH: BigQuery / Snowflake]
      │
      ▼ dbt (transform = T)
[tidy data marts]
      │
      ▼ Looker / Tableau / Metabase
[Dashboards / BI]
```

Extract and load (EL) are handled by Fivetran, transform (T) by dbt - this layered split is the modern standard pattern.
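To make that handoff concrete, here is a minimal sketch of the dbt side of the split: the tables that Fivetran/Airbyte load are declared as dbt "sources," so downstream SQL never hardcodes raw table names. The schema and table names are hypothetical and are reused as a running example through the rest of this article.

```yaml
# models/staging/sources.yml - hypothetical names, used as a running example
version: 2

sources:
  - name: shop_db            # the schema the EL tool loads into (assumed)
    tables:
      - name: orders
      - name: customers
```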

Extract + load tools (EL layer)

Tools that automate ingestion from business DBs and SaaS into the DWH. It used to be normal to write custom scripts for each system, but today connector-based SaaS tools are the norm, ingesting from hundreds of sources with nothing but connection settings.

Fivetran is the industry-standard managed EL service. It provides 300+ source connectors and syncs Salesforce, Stripe, PostgreSQL, and more to the DWH with just connection settings. Incremental sync, schema-change tracking, and retries are all automated, and its strength is that it barely needs engineering effort - but its volume-based pricing can run from hundreds of thousands to millions of yen per month.

Airbyte is an open-source alternative to Fivetran, free when self-hosted (a paid cloud version also exists). Its connector count keeps growing, and it's the top choice for cost-conscious mid-size companies. Stability and support don't yet match Fivetran, but the gap has narrowed year by year.

| Tool | When to choose |
|---|---|
| Fivetran | Large enterprises with budget that want minimal operational load |
| Airbyte (OSS or Cloud) | Cost-focused mid-size companies with self-hosting capability |
| Stitch | Small scale, simple sources only |
| Hevo | Mid-size companies in Japan that prioritize support |
| Custom scripts | Special sources, very small scale (not recommended) |

Transform tools (T layer)

Tools that transform data inside the DWH with SQL to make it analysis-ready. The modern de facto standard is dbt (data build tool), which defines data-transformation pipelines in SQL and provides the mechanism to version-control, test, and document transformation logic.

Before dbt, transformation logic was scattered across SQL files, stored procedures, and Excel macros, and numbers that no one knew who had built, or how, routinely showed up in executive meetings. dbt revolutionized data platforms by solving this problem with "SQL + Git + tests."
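As a minimal sketch of what this looks like (continuing the hypothetical shop_db example), a dbt model is just a SELECT statement in a Git-managed .sql file; dbt materializes it as a view or table in the DWH:

```sql
-- models/staging/stg_orders.sql - hypothetical cleansing model
-- Renames, casts, and filters raw data into an analysis-ready shape.
with source as (

    select * from {{ source('shop_db', 'orders') }}

)

select
    id                      as order_id,
    customer_id,
    cast(amount as numeric) as amount_jpy,
    lower(status)           as status,
    created_at
from source
where id is not null
```

Because the file lives in Git, every change to a metric's definition has an author, a review, and a history.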

| Tool | Characteristics |
|---|---|
| dbt Core | OSS, transforms with SQL, most prevalent |
| dbt Cloud | Managed version of dbt Core |
| Dataform | Google's dbt equivalent (BigQuery-integrated) |
| Matillion | GUI-centric, enterprise-oriented |

It's hard to find a reason not to use dbt - it's the de facto standard for modern data transformation.

Orchestration (scheduling)

Pipelines need scheduling and dependency management: "run at 2 a.m. daily," "run the next job when the previous one succeeds." This is the orchestrator's job - Airflow has long been the de facto standard, with Prefect and Dagster on the rise.

| Tool | Characteristics | Suited for |
|---|---|---|
| Apache Airflow | Industry standard, largest community | Large enterprises, existing assets |
| Prefect | Modern UI, strong failure handling | New builds, mid-size |
| Dagster | Data-centric design, easy to test | Data-engineer-led teams |
| Cloud Composer | Managed Airflow (GCP) | GCP-using companies |
| Cron + scripts | Ultra-light | Small, simple |

At scales where dbt Cloud's or Fivetran's built-in scheduling is enough, a dedicated orchestrator isn't needed.

Data quality testing

Automatically testing whether transformed data is correct is now the norm. Checks like "no NULLs," "no duplicates," "within value range," and "referential integrity holds" are written as code with dbt's test feature and validated automatically on every pipeline run.

In the days before tests, "the monthly report numbers are wrong" was a frequent problem, and tracking down the cause could take weeks. With dbt's tests, the pipeline stops the moment bad data enters.

| Test type | What it checks |
|---|---|
| not_null | No NULLs |
| unique | No duplicates |
| relationships | Foreign keys actually exist |
| accepted_values | Only allowed values |
| custom SQL | Arbitrary business rules |

A pipeline without tests can’t be trusted. If it’s used for executive decisions, write tests.
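As a sketch of how the first four test types map onto the hypothetical stg_orders model from earlier, the tests live in a YAML file next to the SQL:

```yaml
# models/staging/stg_orders.yml - hypothetical tests for the running example
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null            # no NULLs
          - unique              # no duplicates
      - name: customer_id
        tests:
          - relationships:      # foreign key actually exists
              to: ref('stg_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:    # only allowed values
              values: ['pending', 'paid', 'shipped', 'cancelled']
```

The last row of the table, a custom business rule, is simply a SELECT placed in the tests/ directory that returns violating rows; dbt fails the run if any rows come back.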

Data lineage

The mechanism that tracks "from which column of which business DB, through what transformations, did this 'sales amount' on the dashboard come?" is data lineage. In large organizations the question "can this number be trusted?" comes up daily; without lineage, the reliability of the numbers can't be guaranteed.

dbt generates lineage graphs automatically. Inter-model dependencies are visualized as a DAG (directed acyclic graph), so you know in advance what breaks if you delete a given table.
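dbt can do this automatically because models reference each other through ref() instead of hardcoded table names. Continuing the hypothetical example, a mart model built on stg_orders:

```sql
-- models/marts/fct_daily_sales.sql - hypothetical mart model
-- dbt parses the ref() call below and adds an edge to the lineage DAG.
select
    date_trunc('day', created_at) as sales_date,
    count(*)                      as order_count,
    sum(amount_jpy)               as total_sales
from {{ ref('stg_orders') }}
group by 1
```

Running dbt docs generate renders the resulting graph, so the blast radius of deleting stg_orders is visible before anything breaks.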

| Tool | Lineage visualization |
|---|---|
| dbt | Auto-generated (DAG) |
| DataHub | Org-wide catalog + lineage |
| OpenLineage | OSS standard spec |
| Collibra / Alation | Enterprise commercial |

Decision criteria

1. Org scale

| Scale | Recommended stack |
|---|---|
| Small (a few people, hundreds of thousands of rows) | cron + Python scripts + BigQuery |
| Mid (SaaS companies) | Fivetran + dbt Cloud + BigQuery/Snowflake |
| Large (multi-dept, multi-source) | Fivetran/Airbyte + dbt + Airflow + catalog |
| Super-large, public companies | Full set of enterprise commercial tools |

2. Data freshness requirements

Whether "a morning report is enough" or "a real-time dashboard is required" changes the stack dramatically.

| Freshness need | Implementation |
|---|---|
| Daily (next morning is fine) | Batch ETL/ELT (overnight) |
| A few hours of delay OK | Hourly batch |
| A few minutes OK | Microbatch (e.g. 15-min cycle) |
| Real-time (seconds) | Streaming (covered next chapter) |

Most business requirements are fine with daily. Going for real-time carelessly explodes cost and complexity.
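On the dbt side, a freshness requirement can be written down instead of being left implicit. A sketch extending the hypothetical source definition, assuming Fivetran's _fivetran_synced sync-timestamp column:

```yaml
# models/staging/sources.yml - freshness checks added to the running example
version: 2

sources:
  - name: shop_db
    loaded_at_field: _fivetran_synced   # Fivetran's per-row sync timestamp
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
```

dbt source freshness then warns past 12 hours of staleness and errors past 24 - a fit for a daily-batch expectation.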

3. Team skills

dbt is usable by anyone who can write SQL, but Airflow requires Python knowledge. The team's skill mix changes the selection.

  • SQL-centric org → dbt (almost completes with SQL only)
  • Python-engineer-led → Airflow / Prefect / Dagster
  • Including non-engineers → Fivetran + dbt Cloud (more GUI-leaning)

How to choose by case

Startup, 0 data engineers

Fivetran + dbt Cloud + BigQuery. A minimal stack that runs as long as you can write SQL. From around tens of thousands of yen per month.

Mid-size company with in-house SRE (Site Reliability Engineering) capability

Airbyte (OSS) + dbt Core + BigQuery/Snowflake + Airflow. Cost-controlled, flexible operation.

Large enterprise, multi-source, governance-focused

Fivetran + dbt + DataHub (catalog) + Airflow. Secure commercial support and governance.

With real-time requirements

Kafka / Kinesis / Pub/Sub + Flink / ksqlDB. A streaming platform (see next chapter).

Numerical gates for pipeline freshness and runtime

Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.

For data pipelines, “it’s running” isn’t enough - tracking freshness and runtime by numbers is the modern operational standard.

| Metric | Recommended | Reason |
|---|---|---|
| Daily-job runtime | Within 1 hour | Leaves room for re-runs |
| Hourly-job runtime | Within 15 min | Must finish well before the next run, or jobs back up |
| dbt run time | Within 10 min | Keeps the dev cycle fast |
| Data-freshness SLO | Set per business requirement | Monitor delays against it daily |
| Missing-data rate | < 0.01% (1 in 10,000) | Investigate root cause beyond this |
| Duplicates | 0 (enforced uniqueness) | Verify with dbt tests on every run |
| Foreign-key integrity | 0 errors | relationships test |
| NULL rate (required columns) | 0% | not_null test |
| Failure retries | Up to 3, with exponential backoff | Auto-recovery from transient failures |
| Failure notification | Within 5 min | Slack + PagerDuty |

The modern standard is to make dbt tests required in CI and stop the pipeline on quality violations. It's the highest-leverage investment for preventing the classic accident of "monthly report numbers are wrong → weeks of cause-tracking."

Data quality is auto-tested in CI. Noticing after bad data flows to production is too late.
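A minimal sketch of that CI gate, assuming GitHub Actions and a BigQuery target (the workflow file, adapter choice, and credentials handling are all hypothetical):

```yaml
# .github/workflows/dbt-ci.yml - hypothetical CI gate
name: dbt CI
on: [pull_request]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-bigquery
      # dbt build runs models and tests together;
      # any failing test fails the pull request.
      - run: dbt build --target ci
        env:
          DBT_PROFILES_DIR: .   # assumes a checked-in profiles.yml with a ci target
```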

Author's note - the person-month swallowed by the "army of self-built scripts"

There's an often-heard story about a mid-size company that hand-built 10+ Python scripts, one per SaaS integration, for Salesforce, Stripe, Zendesk, and so on. One broke with every vendor schema change, and before anyone noticed, a full junior engineer's worth of effort was going entirely to pipeline maintenance. The self-building started out of a desire to avoid SaaS fees, but the typical pattern is that it ends up costing several times more in personnel costs than the fees would have.

In another case, an organization had transformation logic scattered across SQL files, stored procedures, and Excel macros, and critical transformations existed only on the PC of someone who had left the company. The numbers for executive meetings stopped for months, and rebuilding took half a year - effort that would never have been spent had the logic been Git-managed in dbt.

I myself once watched a Python-script integration where, the moment a rookie took over, vendor API changes broke it weekly and patches were rushed out at night. These cases teach the value of putting everything in code and standardizing. The tens of thousands of yen per month for Fivetran + dbt is overwhelmingly cheap compared to the risk of self-building and person-dependency - that's the modern conclusion.

Self-built scripts always lose once you count the hours. Fivetran/Airbyte fees pay for themselves in saved personnel costs.

ETL/ELT pitfalls and forbidden moves

Here are the typical accidents in data pipelines. Every one of them directly destroys trust in the numbers.

| Forbidden move | Why it's bad |
|---|---|
| Self-building 10 Python scripts for SaaS integrations | Vendor schema changes break them one by one; eats a junior engineer whole |
| Scattering transformation logic across SQL files, stored procedures, and Excel macros | Hell where secret transformations survive only on a retiree's PC |
| Reloading all data every time with TRUNCATE + INSERT | Load time explodes and downtime occurs; switch to incremental loads (see the sketch after this table) |
| Operating pipelines without tests | "This number is wrong" surfaces at month-end; weeks of cause-tracking follow |
| Newly adopting GUI ETL tools (Informatica / Talend) | Outdated in the AI era; go with dbt / code-based instead |
| No notifications on errors | You notice 3 days later via a dashboard; Slack + PagerDuty are required |
| Running production with manual parameter changes | No record remains and runs are irreproducible; make it declarative via Airflow / Prefect |
| Deduplicating records on the app side | The proper fix is enforcing uniqueness in the pipeline |
| No mechanism to notice a 2-hour batch delay | Build in monitoring; alert immediately on SLO violation |
| Editing dbt files directly in production | Bypassing Git destroys the history |
| Overestimating real-time requirements and adopting streaming | 10x the operational cost for requirements batch could solve |
| Assuming "dbt is just SQL, so engineers aren't needed" | Test design, model splitting, and documentation upkeep are separate skills; data engineers' work remains |
| Adopting Airflow assuming it "solves orchestration" | Airflow is heavy to operate; at small scale, dbt Cloud or Prefect is easier |
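For the TRUNCATE + INSERT row above, the code-based fix is a dbt incremental model: each run appends only rows newer than what is already loaded instead of reloading everything. A sketch, continuing the hypothetical example:

```sql
-- models/staging/stg_orders.sql as an incremental model (hypothetical)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    id                      as order_id,
    customer_id,
    cast(amount as numeric) as amount_jpy,
    created_at
from {{ source('shop_db', 'orders') }}

{% if is_incremental() %}
  -- Only on incremental runs: pull rows newer than the latest already loaded.
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```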

dbt, which emerged in 2016 from Fishtown Analytics (now dbt Labs) and became mainstream in the years that followed, was the turning point that revolutionized data transformation with "SQL + Git + tests." Before dbt, ETL meant vendor-specific tools and GUI configurations that turned into black boxes, and "numbers no one knew who built or how" showing up in executive meetings was a routine accident.

GUI ETL tools become liabilities in the AI era. For new adoption: dbt + Git, code-based only.

AI decision axes

| AI-era favorable | AI-era unfavorable |
|---|---|
| dbt (code-based, SQL + YAML) | Informatica/Talend GUI configurations |
| Airbyte (OSS, code-extensible) | Black-box custom ETL |
| OpenLineage (standard lineage) | Custom lineage formats |
| Version-controlled in Git | Direct production edits |

Four principles follow:

  1. ELT + dbt as the basic form - leverage cloud DWH compute power and complete transformations in SQL.
  2. EL layer via Fivetran / Airbyte - self-built scripts eat person-months; the SaaS fees pay for themselves.
  3. Make data-quality tests mandatory - validate NULLs, uniqueness, and referential integrity on every run with dbt tests.
  4. Code-based + Git-managed - GUI configurations become outdated in the AI era; lean on YAML and SQL.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are vague invariably invites the later question, "why did we decide this again?"

  • Extract tool (Fivetran / Airbyte / self-build)
  • Transform tool (dbt as top candidate)
  • Orchestrator (Airflow / Prefect / not needed if light)
  • Data catalog (DataHub / dbt docs / not needed)
  • Data quality tests (required dbt test items)
  • Schedule frequency (daily / hourly / real-time)
  • Failure notification destinations (Slack, PagerDuty)

Summary

This article covered ETL / ELT, including the difference between ETL and ELT, typical stacks, data-quality testing, lineage, phased recommendations by scale, and how GUI ETL becomes a liability in the AI era.

ELT + dbt as the basic form, extraction with Fivetran/Airbyte, mandatory quality testing, and a code-based, Git-managed approach - that is the practical answer for ETL/ELT design in 2026.

Next time we’ll cover streaming (handling real-time data).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.