Data Transformation Tools

ETL vs ELT: Key Differences

Multi-chapter guide | Chapter 6

ETL and ELT are two architectural paradigms used in data integration. ETL (Extract-Transform-Load) pulls data from various sources, transforms it using a processing engine, and loads the results into the destination warehouse or downstream system. In contrast, ELT (Extract-Load-Transform) flips the order of the transform and load steps, relying on the destination system's ability to transform the data after it is loaded.

Choosing between ETL and ELT for a data integration system is difficult because their pros and cons are evenly matched and their capabilities overlap. This article dives into the key features of ETL and ELT, explains their differences, and provides suggestions for choosing the right approach.

Summary of key ETL vs. ELT differences

Definition
ETL: Stands for extract-transform-load; a separate process transforms data before loading it into the destination database or warehouse.
ELT: Stands for extract-load-transform; transformation happens in the destination database, warehouse, or lakehouse platform.

Schema flexibility and evolving requirements
ETL: Works best in use cases with a rigid destination schema that does not change frequently.
ELT: Offers much more flexibility for schema evolution; frequent schema changes are easier to handle.

Transformation complexity
ETL: Because transformation happens as a separate process, ETL allows for more complex transformations using SQL or code-based frameworks such as Apache Spark or Databricks.
ELT: Because transformation relies mainly on the destination warehouse or lakehouse SQL layer, ELT offers less granular control over transformation logic.

Transformation cost
ETL: Offers options for optimizing transformation cost, since transformation requirements can be aggregated and planned in advance.
ELT: Tends to have higher transformation costs, because each new requirement tends to become a new transformation and platforms charge based on the data volume scanned by each one.

Unstructured data
ETL: Suited to structured data where the processing logic is clearly known beforehand.
ELT: Much better suited to unstructured data where processing logic varies frequently or is unknown beforehand.

Storage requirements
ETL: Storage requirements at the destination are typically lower, since only processed or aggregated data is loaded.
ELT: Requires more storage at the destination, since raw data is first landed there and then transformed per use case.

Enforcing data governance
ETL: Offers granular control over enforcing governance policies, including PII masking and row-based access control.
ELT: Offers less granular control; governance enforcement is limited to what the destination warehouse or lakehouse supports.

Evolution of ETL and ELT

In ETL, transformation generally happens in a data processing engine like Hadoop or Spark. Developers can use any programming language supported by the processing engine and not limit themselves to the SQL paradigm. Most processing engines also support writing logic in SQL through an additional layer like Spark SQL for Spark. 

Since transformation is not dependent on the ability of the source or destination system, this pattern allows for more flexibility in transformation logic and granular control over data manipulation.

Extract Transform Load
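To make the flow concrete, here is a minimal ETL-style sketch in PySpark. The source path, table, and column names (orders, region, order_amount) are hypothetical; the point is that the transformation runs in the processing engine, expressed either through the DataFrame API or through the Spark SQL layer, and only the result is loaded to the destination.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw order data from a (hypothetical) source location
orders = spark.read.parquet("s3://raw-zone/orders/")

# Transform in the processing engine, using the DataFrame API ...
revenue_by_region = orders.groupBy("region").agg(
    spark_sum("order_amount").alias("total_revenue")
)

# ... or express the same logic in SQL through the Spark SQL layer
orders.createOrReplaceTempView("orders")
revenue_by_region_sql = spark.sql("""
    SELECT region, SUM(order_amount) AS total_revenue
    FROM orders
    GROUP BY region
""")

# Load: write only the transformed result to the destination table
revenue_by_region.write.mode("overwrite").saveAsTable("analytics.revenue_by_region")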

In contrast, ELT first loads data to the destination system and then transforms it on demand according to the requirement. The simplest way to do this is to use the destination database or data warehouse's SQL engine to transform the data once it is loaded. This delays transformation until the target system requirements are fully known and makes data available quickly for development teams to experiment with before deciding on the final structure.

Extract Load Transform
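For comparison, a minimal ELT sketch might look like the following, assuming a lakehouse destination (for example, a Delta-backed catalog) where spark.sql stands in for the destination's SQL engine; the paths, table names, and columns are again hypothetical. The raw data is landed first, and the curated table is built later, on demand.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt_sketch").getOrCreate()

# Extract + Load: land the raw source data in the destination as-is
raw_orders = spark.read.json("s3://landing-zone/orders/")
raw_orders.write.mode("append").saveAsTable("raw.orders")

# Transform (later, on demand) with the destination's SQL engine;
# CREATE OR REPLACE TABLE ... AS assumes a lakehouse table format such as Delta
spark.sql("""
    CREATE OR REPLACE TABLE analytics.revenue_by_region AS
    SELECT region, SUM(order_amount) AS total_revenue
    FROM raw.orders
    GROUP BY region
""")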

The timelines of the origin of the ETL and ELT patterns are debatable. ELT has existed in some form since the early days of relational databases, through their bulk-loading capabilities and SQL engines. As data volumes exploded, traditional databases struggled to keep up, and distributed processing engines like Hadoop and Spark became popular, making ETL the dominant architecture for data integration platforms. The recent advent of distributed data warehouses with strong SQL engines and greater processing power has brought ELT back to prominence.

ETL vs. ELT - Key deciding factors

Here are some factors to consider when choosing between the two approaches.

Data requirements

ETL works well when requirements do not change often and the schema is constant. Development teams can plan transformation logic in advance and apply standard software engineering practices to ensure reusability, modularity, and maintainability.

That said, ETL involves separate processing frameworks and requires more engineering expertise. A data management platform like Dataforge offers the best of both worlds for ETL: it abstracts away the complexities of the underlying frameworks while still allowing scalable software engineering practices.

ELT works best when the data has an evolving schema. You can implement changing requirements quickly because the load process is decoupled from the final schema. Analytics teams can settle on the final schema according to their needs and manipulate the data as new requirements emerge.

However, the side effect of an ELT approach is that teams generally lose sight of core engineering principles like reusability and modularity. It is also challenging to capture lineage while working in a rapidly changing environment. A platform like Dataforge, with its automatic lineage capturing and functional development style, can minimize these limitations and support ELT implementation.

Transformation complexity

In ETL, transformations are implemented in a separate processing engine like Apache Spark using SQL or programming languages like Python, Scala, etc. You can also use a streaming ETL pattern to perform real-time data transformations as data moves from source to destination systems. 

ETL provides greater flexibility when implementing complex transformation logic. To reduce complications while using a highly flexible data processing framework, consider using an abstraction layer like DBT over Spark or a data management platform like Dataforge. 

For example, consider a retail organization that has data on each customer's spending by category. A typical requirement in retail customer analytics is to create a vector representing a customer's spending habits, which is then used to train machine learning models. Consider the data below.

customer_id,purchase_value,product_id,category
C001,45,DETERGENT_001,SANITARY
C001,2,DRINK_001,BEVERAGE
C003,4,DRINK_001,BEVERAGE
C004,5,DRINK_003,BEVERAGE
C005,7,MEAT_005,FOOD
C001,8,MEAT_005,FOOD

Let's write PySpark code to convert this into a customer vector where each vector element represents the customer's spend in that category.

from pyspark.sql.functions import sum as spark_sum
from pyspark.ml.feature import VectorAssembler

# Aggregate spending by customer and category
spending_by_category = df.groupBy("customer_id", "category") \
    .agg(spark_sum("purchase_value").alias("total_spent"))

# Pivot so that each category becomes a column of total spend per customer
pivoted_df = spending_by_category.groupBy("customer_id") \
    .pivot("category") \
    .sum("total_spent") \
    .fillna(0)

# Assemble the per-category spend columns into a single feature vector
feature_cols = [c for c in pivoted_df.columns if c != "customer_id"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
vectorized_df = assembler.transform(pivoted_df)

Complex transformations like the snippet above are common in organizations that run machine-learning-driven analytics.

In comparison, ELT relies on the capabilities of the destination system to implement transformations. The destination can be a data warehouse like Snowflake or Redshift, or a lakehouse like Delta Lake.

Data warehouses typically limit you to implementing transformations through their SQL layer, which limits the complexity of the transformations you can build. Some warehouses extend transformation capabilities through user-defined functions, but these can result in long stretches of spaghetti code that are difficult to maintain. For example, implementing the customer vector generation above in SQL requires user-defined functions that are less optimized.

However, if the destination system is a lakehouse, transformation complexity is less of a limiting factor, since lakehouses like Delta Lake allow both code-based and SQL-based data manipulation.
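As an illustration of that trade-off, the SQL-only portion of the earlier customer-vector example might look like the sketch below, run through a lakehouse SQL layer (the table name customer_purchases is hypothetical, and the categories are hard-coded from the sample data). Turning the pivoted columns into an actual ML feature vector still requires code or a user-defined function, which is where a pure warehouse SQL approach becomes limiting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pivot by conditional aggregation in plain SQL (categories hard-coded)
pivoted = spark.sql("""
    SELECT
        customer_id,
        SUM(CASE WHEN category = 'SANITARY' THEN purchase_value ELSE 0 END) AS sanitary_spend,
        SUM(CASE WHEN category = 'BEVERAGE' THEN purchase_value ELSE 0 END) AS beverage_spend,
        SUM(CASE WHEN category = 'FOOD'     THEN purchase_value ELSE 0 END) AS food_spend
    FROM customer_purchases
    GROUP BY customer_id
""")
# Building the ML feature vector from these columns still needs code or a UDF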

Time to production

In most organizations, the development teams responsible for transformation are separate from the analytics teams that eventually use the data. In ETL, implementing a new transformation or creating a new data product requires inter-team communication, which results in more friction than in ELT setups.

ELT delays data transformation until the point of use. It is designed for use cases where analytics requirements keep evolving and developers must transform data quickly to meet them. Since the data already resides in a distributed data warehouse or lakehouse, teams can implement transformations and take them to production quickly. From that perspective, ELT offers a shorter time to production than ETL.

Transformation costs

In ETL, transformation is a well-planned activity requiring collaboration between multiple teams. This results in well-thought-out transformation modules with higher reusability. Since requirements are known beforehand, ETL provides opportunities to aggregate multiple data product requirements and create reusable modules for cost optimization. 

ELT transformations are more dynamic than ETL's. The push to get results to production quickly, combined with the availability of a powerful built-in transformation engine, often leads to code duplication across transformation jobs. Since cloud data warehouses like Redshift, Snowflake, and BigQuery charge based on the amount of data scanned, duplication can lead to unnecessary expenses.

Support for unstructured data

Unstructured data integration poses two main challenges:

  • Traditional relational processing logic is not suitable for handling a wide range of formats like text, video, images, social media content, etc.

  • Engineering teams typically don’t know what to do with unstructured data until the point in the pipeline when they understand the format and define a way to integrate it with existing data points.

ELT’s flexibility in delaying transformation to the point of use makes it better suited to handling unstructured or semi-structured data. ETL is better suited to structured data where processing logic is known early in the lifecycle.

For example, consider a retail organization procuring data about customer browsing behavior from a third-party data provider. The data arrives in JSON format and is stored in the data warehouse. Consider the data below.

customer_id, age, gender, behavior
C001, 23, M, {"sports":"0.6","automobile":"0.8","gardening":"0","watches":"0"}
C002, 34, F, {"sports":"0.6","automobile":"0.8","gardening":"0","watches":"0.7"}
C003, 55, M, {"sports":"0","automobile":"0","gardening":"0.7","watches":"0.8"}

Initially, engineers may have no use case for this data, but during a watch sale, they can extract the specific customers who prefer watches. The SQL snippet would look like this:

SELECT
    customer_id,
    gender,
    CAST(JSON_EXTRACT_PATH_TEXT(behavior, 'watches') AS FLOAT) AS preference_for_watches
FROM customers;
  

Recent advancements in lakehouses help engineers dump unstructured data into a staging area and handle transformation later using Spark-based processing engines.
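For instance, the same watch-preference extraction could be deferred and run with Spark over a staging table in the lakehouse, as in the sketch below (the staging.customers table name is hypothetical; the column names follow the sample data above).

from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object, col

spark = SparkSession.builder.getOrCreate()

# Read the raw, semi-structured rows landed in the staging area
customers = spark.read.table("staging.customers")

# Extract the "watches" preference from the JSON behavior column
watch_fans = (
    customers
    .withColumn(
        "preference_for_watches",
        get_json_object(col("behavior"), "$.watches").cast("float"),
    )
    .filter(col("preference_for_watches") > 0.5)
    .select("customer_id", "gender", "preference_for_watches")
)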

Storage requirements 

While using ETL, data in the final destination is already aggregated per the data product requirements. Since data is transformed before loading, the overall data volume at the final destination is low. Before the transformation step, data is typically staged on inexpensive commodity storage.

In contrast, ELT demands significant storage capacity, since raw data is loaded into the system before any transformation. This necessitates using data lakehouses or warehouses with high processing power to store the raw data, so storage costs in ELT are typically higher.

Enforcing data governance

ETL offers more granularity in enforcing data governance. Requirements like role-based access control (RBAC) and PII masking can be enforced in the transformation layer before data reaches the warehouse.
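As a simple illustration, a governance step in the ETL transformation layer might hash a PII column and apply a row-level restriction before anything reaches the warehouse. The sketch below uses PySpark with hypothetical table and column names (staging.customers, email, region).

from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.getOrCreate()

customers = spark.read.table("staging.customers")

governed = (
    customers
    .withColumn("email_hash", sha2(col("email"), 256))  # mask PII before loading
    .drop("email")                                       # raw PII never reaches the warehouse
    .filter(col("region") == "EU")                       # row-level restriction applied upstream
)

governed.write.mode("overwrite").saveAsTable("analytics.customers_eu")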

ELT’s data governance mechanisms are limited to what the lakehouse or warehouse offers. Since ELT's core design principle gives end users more control, enforcing data governance while preserving that flexibility is difficult.

Choosing the right platform

ETL excels at handling structured data with high-complexity transformations and provides many opportunities for optimizing cost and performance. ELT does well with dynamic requirements and evolving schemas; it offers a shorter time to production in specific scenarios and handles unstructured data well. The choice between the two should therefore be driven by the use case.

With data integration being a key factor behind business success, modern organizations must deal with a wide variety of use cases. The advent of lakehouses with strong SQL engines and the flexibility to use code-based processing engines has blurred the lines between ETL and ELT.

Considering the above, committing to either ETL or ELT and staying loyal to that choice is unwise. Organizations are better off retaining the flexibility to use ETL or ELT according to the use case rather than adhering to one in particular.

Instead, a data management platform can help implement a neutral ETL-ELT strategy. You want a tool that abstracts the complexity of the underlying storage layer and processing engine while allowing developers to build transformations with flexibility.

The following table summarizes how some popular data management tools fare on the key factors described above.

Ease of implementation
Databricks: Difficult; uses code and SQL
DBT: Jinja templating
Dataforge: Functional
Talend: Low code + SQL
Fivetran: No code

Evolving processing logic and schema
Databricks: Low support
DBT: Low support
Dataforge: Comprehensive support
Talend: Low support
Fivetran: Low support

Scalable software engineering practices
Databricks: Low support
DBT: Comprehensive support
Dataforge: Comprehensive support
Talend: Low support
Fivetran: Low support

Observability
Databricks: Overwatch, a separate observability tool
DBT: No built-in support; can be implemented manually
Dataforge: Comprehensive support
Talend: No built-in support
Fivetran: No built-in support

Auto orchestration
Databricks: Manual orchestration
DBT: Manual orchestration
Dataforge: Comprehensive support
Talend: Manual orchestration
Fivetran: Supported

Ease of implementation

Transformations are generally implemented using SQL or programming languages like Python or Scala. While SQL is simple to learn, it can become unnecessarily verbose and lead to thousands of lines of spaghetti code. The complexity of general-purpose programming languages, meanwhile, makes implementing transformations a task that requires deep engineering expertise. A low-code development approach or a domain-specific language can reduce the complexity of implementing transformations.

For example, Dataforge allows you to implement transformations through functional code and helps remove boilerplate code. Its cloud IDE provides templates, auto-complete, and real-time syntax checking. 

For example, let's say you want to aggregate the amount a customer spent within a specific date range. The Dataforge declarative code snippet for this requirement looks like this.

output_name: feature_customer
columns: #columns in SQL CREATE TABLE ( ... ) DDL format
  - customer string
  - line_item_receipt_date date
  - line_item_net_price_100 integer
channels:
  #Can optionally have more than one source mapped to a single output via channels
  - source_name: cust_lineitem
    #Optional filter condition using raw attributes or rules from source
    filter: "[This].l_receiptdate BETWEEN '2013-01-01' AND '2013-12-31'"
    #Optional operation type. Options are Aggregate and None
    operation_type: "Aggregate"
    mappings:
      #Basic mappings
      - c_name customer
      - l_receiptdate line_item_receipt_date
      #Aggregate mapping
      - sum(net_price_int) line_item_net_price_100

Support for evolving processing logic

In the data integration world, nothing is ever static. Requirements keep changing, and the transformations you build today become outdated tomorrow. A platform that supports adding logic without updating existing pipelines is very valuable in such scenarios. Dataforge enables your code to adapt to data changes and reduces regression testing effort.

Another factor that keeps evolving in the data integration world is the schema. Data schemas change as attributes are added or removed, and integration jobs that break with schema changes can slow down development teams. An even worse problem is identifying the reasons for job failures caused by those changes. Dataforge’s schema evolution mechanisms, type safety, and automatic validation help avoid surprises from schema changes.

Scalable software engineering practices

Building transformations through SQL often means losing sight of the basic software engineering practices of reusability, modularity, and extensibility, and it frequently results in spaghetti code that is difficult to maintain. A data management platform can help enforce standard software engineering practices without compromising time to production.

Observability

ETL and ELT systems involve thousands of interconnected jobs that must run reliably every day, so a platform with built-in observability supports quick troubleshooting. Platforms like Dataforge simplify troubleshooting by letting developers use SQL to search through the codebase, improving auditability and compliance while providing a mechanism for continuous monitoring.

Auto orchestration

Defining directed acyclic graphs (DAGs) is the typical way of orchestrating jobs with dependencies. Doing this manually is error-prone and can lead to convoluted job sequences that are hard to troubleshoot. A platform where the transformation code itself acts as the orchestration code enables quicker time to production and better maintainability.
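For context, a hand-written orchestration sketch using Apache Airflow (a common orchestrator; this assumes a recent 2.x release, and the task names and callables are hypothetical) looks like the following. Every new transformation means editing these dependency declarations by hand, which is exactly the maintenance burden auto-orchestration removes by deriving the DAG from the transformation code itself.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders(): ...
def transform_orders(): ...
def load_orders(): ...

with DAG(dag_id="orders_pipeline", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)

    # Dependencies are declared by hand; every new job means editing this file
    extract >> transform >> load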

Last thoughts

ETL and ELT differ in when and where transformation happens. ETL transforms data before loading it into the destination warehouse or lakehouse, while ELT delays transformation to the point of use and is better at handling unstructured data and evolving requirements. Making a hard choice between the two is no longer required: organizations can exploit the power of both paradigms by using a data management platform that abstracts the underlying processing engine and storage layer.
