Data Transformation Tools

Modern Data Transformation Process & Techniques

Multi-chapter guide | Chapter 2

The data transformation process converts raw data into usable insights that drive business intelligence. This article explains the process using examples and recommends best practices for improving data quality, flexibility, speed, scalability, and usability. 

In its last section, this article introduces the industry’s leading open-source project that helps data engineers implement data transformations using YAML configuration files, greatly simplifying large-scale data pipelines.

Summary of key data transformation process concepts

Concept Description
Data discovery Understanding and analyzing raw data to assess its structure and quality.
Multi-stage data processing Sequentially processing data through various stages for better quality and usability.
Data transformation rules Rules applied during data transformation to ensure consistency, accuracy, and compliance.
Data enrichment Adding additional context or information to enhance data value.
Tracking data lineage Visualizing the flow and transformation path of data.
Data quality Embedding data quality checks directly into transformation scripts.
Data governance Ensuring data integrity, security, and compliance.
Automation and efficiency Techniques and tools for automating the data transformation process to improve efficiency and reduce manual intervention.
Handling semi-structured data Extracting and transforming data from comma-separated values (CSV), image meta-data, or logs.
Schema evolution Adjusting to changes in data structures dynamically.

Before exploring the concepts above, let’s start with a refresher on high-level data transformation techniques.

Approaches to data transformation

Depending on when transformations are performed in the data pipeline, there are three primary approaches. The choice between the three impacts the cost and efficiency of your data transformation processes.

ETL

ETL (Extract, transform, load) extracts data from source systems, transforms it to fit operational needs, and loads it into a target data warehouse or database. ETL ensures that only cleaned and transformed data enters your target system, which makes it ideal for structured data environments. For example, in the financial industry, ETL processes ensure that transaction data is clean and accurate before it is loaded into downstream systems.

ETL requires dedicated storage and computing resources, which is helpful when the technology stack and costs associated with transformation are better separated from the target system (e.g., a data lake).
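To make the pattern concrete, here is a minimal ETL sketch in Python, assuming a hypothetical transactions.csv source file and a SQLite database standing in for the target system:

Python
import sqlite3
import pandas as pd

# Extract: read raw transactions from the source file (hypothetical path)
raw = pd.read_csv('transactions.csv')

# Transform: clean the data in a separate step, outside the target system
clean = raw.dropna(subset=['transaction_id', 'amount'])
clean = clean[clean['amount'] > 0]
clean['amount'] = clean['amount'].round(2)

# Load: only the cleaned, transformed data lands in the target database
with sqlite3.connect('warehouse.db') as conn:
    clean.to_sql('transactions', conn, if_exists='replace', index=False)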

ELT

ELT (Extract, load, transform) changes the order of operations—you first extract data and then load it into the data warehouse or data lake for transformation. ELT applies transformations within the storage system. 

It is more suited to big data scenarios where data velocity, volume, and variety make transformation processes challenging to perform on the fly. ELT can also simplify the tech stack compared to ETL as it utilizes the computational power of the data warehouse or data lake instead of a separate transformation engine.
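For comparison, a minimal ELT sketch (again with SQLite standing in for the warehouse) loads the raw data first and then transforms it with SQL inside the target system:

Python
import sqlite3
import pandas as pd

raw = pd.read_csv('transactions.csv')  # hypothetical source file

with sqlite3.connect('warehouse.db') as conn:
    # Load: raw data lands in the warehouse untransformed
    raw.to_sql('raw_transactions', conn, if_exists='replace', index=False)

    # Transform: the warehouse engine does the work in SQL
    conn.execute("""
        CREATE TABLE IF NOT EXISTS clean_transactions AS
        SELECT transaction_id, ROUND(amount, 2) AS amount
        FROM raw_transactions
        WHERE transaction_id IS NOT NULL AND amount > 0
    """)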

Reverse ETL

In simple terms, reverse ETL moves data out of the data warehouse or lakehouse and back into operational systems such as CRMs or marketing platforms. It starts from the ELT platform rather than directly from the source systems, taking advantage of the data models and logic already built there.
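As a minimal sketch, assuming a curated customer_segments table already exists in the warehouse and a hypothetical CRM REST endpoint, reverse ETL reads the modeled data and pushes it back into an operational tool:

Python
import sqlite3
import pandas as pd
import requests

# Read a model that the ELT platform has already built (hypothetical table)
with sqlite3.connect('warehouse.db') as conn:
    segments = pd.read_sql('SELECT customer_id, segment FROM customer_segments', conn)

# Push each record to an operational system (hypothetical CRM endpoint)
for record in segments.to_dict(orient='records'):
    requests.post('https://crm.example.com/api/contacts', json=record, timeout=10)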

Data transformation process techniques

No matter where the data transformation process occurs in your data pipeline, it typically involves multiple stages, as depicted in the medallion architecture below. 

Medallion Architecture is a layered data organization design pattern that progressively refines data through Bronze (raw), Silver (cleansed), and Gold (curated) stages for optimized analytics and consumption.

Logical data flow from raw data to useful insights (source)

Data discovery

Data discovery is the first step that sets the stage for all subsequent steps in the data transformation process. It involves understanding raw data structure, content, and integrity from various sources like CSV files, databases, or data warehouses. You ensure the data is consistent, complete, and ready for the next stages of transformation.
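A quick way to build that first picture is to profile the data, for example with Pandas (the file name below is a placeholder):

Python
import pandas as pd

df = pd.read_csv('raw_customers.csv')

print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # basic statistics for numeric columns
print(df.head())          # a first look at actual records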

Entity relationships

Understanding how different data tables relate is vital for planning effective data transformations. Entity-relationship diagrams (ERD) can help identify relationships and spot redundancies. An ERD visualizes the relationships between entities in a database, such as how customers relate to orders or products. It helps in understanding the database structure and planning data transformations.

Sample entity-relationship diagram (ERD) (source)

Data integrity

The next step is to ensure data integrity. This includes running queries to identify missing values, validate primary keys, and understand data distribution. Below are examples of some basic integrity checks in the Python programming language using Pandas.
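
DataForge
Python
import pandas as pd

df = pd.read_csv('reviews.csv')

def check_missing_values(df):
    # Count missing values per column
    return df.isnull().sum()

def check_duplicate_rows(df):
    # Return fully duplicated rows
    return df[df.duplicated()]

def check_null_values_in_column(df, column):
    # True if the column has no nulls
    return df[column].isnull().sum() == 0

def check_primary_key(df, key_column):
    # A valid primary key is unique and never null
    is_unique = df[key_column].is_unique
    no_nulls = check_null_values_in_column(df, key_column)
    return is_unique and no_nulls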

Data transformation rules

Data transformation rules dictate how raw data is converted into a structured, usable format. For example, data may be transformed to: 

  • map data models between two applications,

  • align with a data model aggregated for reporting purposes,

  • feed into a machine learning or AI model for further analysis.

Data transformation rules include a variety of operations, each serving a distinct purpose in refining the data.

Filtering removes unnecessary or irrelevant data from the dataset. For instance, you filter out transactions below a certain amount.

DataForge
Python
import pandas as pd
# Sample data
data = {'TransactionID': [1, 2, 3, 4],
        'Amount': [100, 200, 50, 300]}
df = pd.DataFrame(data)
# Filtering transactions greater than 100
filtered_df = df[df['Amount'] > 100]
print(filtered_df)
  

Aggregation combines multiple data components into a single summary value. For example, summing up sales figures to get a total sales value.

DataForge
Python
# Aggregating total sales
total_sales = df['Amount'].sum()
print(f"Total Sales: {total_sales}")
  

Joining merges data from different sources based on a common key.

DataForge
Python
# Sample data for joining
df_customers = pd.DataFrame({'CustomerID': [1, 2, 3],
                             'Name': ['Alice', 'Bob', 'Carol']})
df_transactions = pd.DataFrame({'CustomerID': [1, 2, 2],
                                'Amount': [100, 200, 50]})
# Joining dataframes on CustomerID
merged_df = pd.merge(df_customers, df_transactions, on='CustomerID')
print(merged_df)
  

Data enrichment

Data enrichment enhances existing datasets by integrating additional data from external or internal sources. This added information helps to provide more context, fill in missing details, or offer deeper insights that the original dataset alone could not provide.

Data enrichment typically involves steps like:

  • Data Matching: Aligning records from different datasets.

  • Data Aggregation: Summarizing or combining data from multiple sources.

  • Data Transformation: Converting the added data into a consistent format.

For example, if you have a customer database with basic information like company name and address, data enrichment might involve appending data like purchase history or social media engagements. 

In the basic example below meant to illustrate the concept, we will use SQL to populate two tables with customer purchase history and social media engagements and use the data to enrich the customer information.

The purchase information:

DataForge
SQL
CREATE TABLE purchases (
    customer_id INT,
    purchase_date DATE,
    amount DECIMAL(10, 2)
);

INSERT INTO purchases (customer_id, purchase_date, amount) VALUES
(1, '2024-01-15', 150.00),
(2, '2024-01-16', 85.00),
(3, '2024-01-17', 200.00);
  

The social media information:

DataForge
SQL
CREATE TABLE social_media (
    customer_id INT,
    engagement_type VARCHAR(50),
    engagement_date DATE
);

INSERT INTO social_media (customer_id, engagement_type, engagement_date) VALUES
(1, 'like', '2024-01-10'),
(2, 'comment', '2024-01-12'),
(4, 'share', '2024-01-14');
  

And the data enrichment query:

DataForge
SQL
SELECT 
    p.customer_id,
    p.purchase_date,
    p.amount,
    CASE 
        WHEN sm.customer_id IS NOT NULL THEN 'Engaged'
        ELSE 'Not Engaged'
    END AS social_media_engagement
FROM 
    purchases p
LEFT JOIN 
    social_media sm
ON 
    p.customer_id = sm.customer_id
    AND sm.engagement_date BETWEEN DATEADD(day, -30, p.purchase_date) AND p.purchase_date;
  

The enriched result will be:

customer_id  purchase_date  amount  social_media_engagement
1            2024-01-15     150.00  Engaged
2            2024-01-16      85.00  Engaged
3            2024-01-17     200.00  Not Engaged

The customer data in this example was enriched with purchase history and social media engagement data to provide a more complete profile for each customer.

Best practices in data transformation

Follow the best practices below when implementing your data transformation process.

Track data lineage through the data journey

Data lineage traces data flow through a pipeline. It shows where data originates and how it is transformed and used across various systems, which is useful for compliance, auditing, and ensuring data integrity throughout the data lifecycle.

Data lineage technology started at a table level. For example, a table-level data lineage diagram can help identify data origins during auditing to ensure that Personally Identifiable Information (PII) subject to industry regulation isn’t shared inappropriately. 

However, table-level lineage has limited value for detailed troubleshooting, so the new industry standard in data lineage technology is tracking data at a column or field level. This means you can trace every column within a table back to the columns that produced its values. The most advanced systems, like DataForge, keep logs of every line of code applied to every table cell.

Data lineage example in the DataForge user interface (Source)
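To illustrate the idea (this is a generic sketch, not DataForge’s internal format), column-level lineage can be captured as a mapping from each output column to the upstream columns and logic that produced it:

Python
# Illustrative column-level lineage for a derived sales table
column_lineage = {
    'total_sales.revenue': {
        'inputs': ['orders.quantity', 'products.price'],
        'logic': 'orders.quantity * products.price',
    },
    'total_sales.order_date': {
        'inputs': ['orders.order_date'],
        'logic': 'direct copy',
    },
}

def upstream_columns(output_column):
    # Trace an output column back to the columns that produced its values
    return column_lineage.get(output_column, {}).get('inputs', [])

print(upstream_columns('total_sales.revenue'))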

Implement data quality checks in code

Data transformation requires stringent quality checks for reliability and accuracy. Data quality checks should not be an afterthought—they are an integral part of the process. This is where data quality in code comes into play. Code tests validate the data at every stage, ensuring the output meets the required quality standards. They enhance the reliability of the data transformation process through automation. Here are some examples of common checks:

  • Missing values

  • Duplicate records 

  • Data type (e.g., text, date)

  • Column count

  • Data range (e.g., score between 0 and 100)

  • Identifier uniqueness 

You can also deploy custom data quality rules tailored to specific needs. Implementing and maintaining data quality rules for large or continuously updated datasets can be challenging. However, if you adopt a declarative approach, you can version-control and test these rules like any other code, which gives you consistent application and easier debugging of quality checks.
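As a minimal sketch, such rules can be written as small, testable functions and run against every load (the column names and thresholds below are hypothetical):

Python
import pandas as pd

def check_unique_identifier(df, key):
    assert df[key].is_unique, f"Duplicate values found in {key}"

def check_value_range(df, column, low=0, high=100):
    assert df[column].between(low, high).all(), f"{column} values outside {low}-{high}"

def check_column_count(df, expected):
    assert len(df.columns) == expected, f"Expected {expected} columns, got {len(df.columns)}"

df = pd.read_csv('scores.csv')  # placeholder dataset
check_unique_identifier(df, 'student_id')
check_value_range(df, 'score')
check_column_count(df, expected=3)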

Establish data governance

Data governance is the framework of policies, processes, and standards that ensures the proper management, quality, security, and use of an organization’s data.

People, processes, and technology must come together to establish a data governance program. For example, an organization might decide to handle data errors in a way that prevents them from recurring. Instead of rejecting bad data or forcing it through the system, this approach creates a separate workflow for questionable records. This allows high-quality data to proceed to its destination while flagging and isolating problematic data for human review.

The process involves:

  1. Data validation: Checking incoming records against defined rules or quality criteria.

  2. Branching: Separating good records from bad ones.

  3. Human intervention: Routing problematic records to data stewards for review.

  4. Remediation: Fixing issues either at the source or by improving the data transformation logic.

This method helps maintain data quality while providing a mechanism for continuous improvement of the data pipeline, making it a data governance practice.
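A minimal sketch of this validate-and-branch pattern in pandas, assuming a hypothetical rule that order amounts must be present and positive:

Python
import pandas as pd

df = pd.read_csv('incoming_orders.csv')  # placeholder source

# 1. Data validation: check records against the quality rule
is_valid = df['amount'].notnull() & (df['amount'] > 0)

# 2. Branching: good records continue, questionable records are quarantined
good_records = df[is_valid]
quarantined = df[~is_valid]

# 3. Human intervention: route quarantined records to data stewards
quarantined.to_csv('quarantine_for_review.csv', index=False)

# 4. Remediation happens at the source or in the transformation logic,
#    after which quarantined records can be reprocessed
good_records.to_csv('validated_orders.csv', index=False)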

Automate for efficiency

Automation in data transformation utilizes technology to perform tasks that would otherwise require human intervention. It enhances efficiency and reduces errors through consistent data handling, speeds up processing times, and frees people to focus on strategic tasks.

Here are some examples of data automation:

  • Automated Data Validation: Running scripts to check data quality, consistency, and integrity as soon as data is ingested, flagging or correcting errors automatically.

  • Automated Reporting and Alerts: Automatically generate and distribute reports or trigger alerts based on specific data thresholds or anomalies detected in the pipeline.

  • Trigger-Based Workflows: Initiating specific data processes based on predefined events or conditions.

However, vendors define automation differently. Let’s take data pipeline orchestration, for example. 

Traditional orchestration platforms like Apache Airflow use Directed Acyclic Graphs (DAGs) to define the dependencies between jobs and orchestrate their sequential execution. The DAG schedules each job to run only after the jobs it depends on have completed. This level of orchestration might be interpreted as automation. However, defining a DAG is a manual process in itself.

The example below shows how a DAG would be programmed node by node in Python, including the edges that define the direction of each dependency:

DataForge
Python
class Node:
    def __init__(self, data):
        self.data = data
        self.children = []

def create_dag_node_by_node():
    # Create nodes
    A = Node('A')
    B = Node('B')
    C = Node('C')
    D = Node('D')

    # Define edges (A -> B, A -> C, B -> D, C -> D)
    A.children = [B, C]
    B.children = [D]
    C.children = [D]

    # A is the root node
    return A

# Example usage
root = create_dag_node_by_node()
  

The challenge with scaling this approach is that data engineers must programmatically define the DAG’s graph of dependencies, which takes time and must be updated whenever those dependencies change.

A new generation of tools like DataForge uses functional programming concepts to let data engineers define data transformations in YAML, replacing the procedural scripts used to schedule jobs. This approach eliminates the need to hand-code job scheduling and orchestration, bringing true automation to the data pipeline. Read this page to learn more about how this approach works.


The key stages in a data transformation process

Each stage has a specific role and contributes to the overall transformation. The stages below are defined in the context of the medallion architecture diagram shared earlier.

Bronze Layer: Raw, unprocessed data is ingested from data sources. This layer contains the original data in its native format.

Silver Layer: Cleansed and transformed data derived from the Bronze layer. Data is filtered, cleansed, and validated, making it more structured and usable for analytics.

Gold Layer: Highly refined, aggregated, or enriched data, ready for consumption by business intelligence tools or machine learning algorithms.
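A compact sketch of the three layers with pandas (file names and rules are placeholders):

Python
import pandas as pd

# Bronze: raw data ingested in its native form
bronze = pd.read_csv('raw_orders.csv')

# Silver: filtered, cleansed, and validated
silver = bronze.dropna(subset=['order_id', 'amount'])
silver = silver[silver['amount'] > 0]

# Gold: aggregated and ready for BI or machine learning consumption
gold = silver.groupby('order_date', as_index=False)['amount'].sum()
gold.to_csv('daily_revenue.csv', index=False)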

Data transformation process challenges

Data transformation processes are not without their challenges. Here are some of the common ones.

Repetitive operations

The data transformation process often has to be repeated on the same data from the same source for reasons like:

  • Addition of new data

  • Schema changes

  • Backfill due to pipeline failure

  • Addition of specific rules in the data access layer. 

Since the stages in the data transformation process are not isolated from one another, this can lead to redundant effort and repeated data discovery. However, some tools can help isolate these stages and create smaller, reusable units. For example, DataForge, a company that offers a free, open-source tool and a hosted solution based on functional programming paradigms, explains different ways to address these challenges in this article about overcoming ETL challenges.

Semi-structured data

In the world of data, not everything is neatly organized into rows and columns as in a relational database. Much of the data ingested into data transformation processes is unstructured (like text) or semi-structured, presenting unique challenges. 

Here are examples of semi-structured data:

  • JSON (JavaScript Object Notation) files

  • XML (eXtensible Markup Language) documents

  • Email messages

  • Log files

  • NoSQL databases

  • HTML web pages

  • CSV (Comma-Separated Values) files with inconsistent formatting

  • YAML files

  • EXIF (Exchangeable Image File Format) 

Often, what’s being extracted from semi-structured data isn’t the data itself but the metadata. For example, EXIF includes information like camera settings, date, time, and GPS coordinates, which must be extracted and used in applications.
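For instance, a short sketch using the Pillow library can pull EXIF metadata out of an image file (the file name is a placeholder):

Python
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open('photo.jpg')
exif = img.getexif()

# Map numeric EXIF tag IDs to human-readable names
metadata = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

for name in ('DateTime', 'Model', 'Software'):
    print(name, metadata.get(name))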

Several tools can help manage and analyze unstructured and semi-structured data effectively. For example, Pandas is helpful for data manipulation, and NLTK (Natural Language Toolkit) is great for text processing as unstructured data often takes the form of text. 

One key challenge in dealing with semi-structured data is integrating it with structured data. This involves normalizing and aligning different data formats. To illustrate how this can be done, let’s consider an example using Python and the Pandas library. Consider the following CSV dataset:

DataForge
CSV
# Sample CSV before transformation
Movie, Score, Notes
"Avengers", "8", "Favorite: Yes | Popular: Yes"
"Snow White", "7", "Favorite: No | Popular: Yes"
"Spirited Away", "9", "Favorite: Yes | Popular: No"
  

In this case, the Notes field is a compound column that needs to be parsed and separated into multiple columns. To achieve this transformation using Python and Pandas, you can define a custom function to parse the Notes field and apply it to the dataframe:

DataForge
Python
import pandas as pd

# Custom function to split the 'Notes' column
def extract_notes(note):
   attributes = dict(item.split(': ') for item in note.split(' | '))
   return pd.Series(attributes)

df = pd.read_csv('movies.csv', skipinitialspace=True)  # skip the spaces after commas in the sample file

df[['Favorite', 'Popular']] = df['Notes'].apply(extract_notes)

# Convert 'Yes'/'No' to Boolean values
df['Favorite'] = df['Favorite'].map({'Yes': True, 'No': False})
df['Popular'] = df['Popular'].map({'Yes': True, 'No': False})

df = df.drop(columns=['Notes'])

df.to_csv('transformed_movies.csv', index=False)
  
DataForge
CSV
# Sample CSV after transformation
Movie, Score, Favorite, Popular
"Avengers", "8", "True", "True"
"Snow White", "7", "False", "True"
"Spirited Away", "9", "True", "False"
  

While tools like Pandas simplify the handling of semi-structured data, they can be time-consuming for complex transformations on a large scale. That's why modern data engineering tools offer a unified platform for seamlessly integrating diverse data sources of different formats (logs vs. CSV), velocities (real-time vs. batch), and interfaces (REST API vs. database views) to ingest, cleanse, and transform data within a single framework.

Schema evolution

Schema changes can occur frequently due to: 

  • Business requirements change—such as when launching new products or services.

  • Addition or modification of data sources—such as adding a new API or database that introduces new fields and data types.

  • Laws and regulatory changes that require additional data collection.

  • Data types change—such as when transitioning from text-based data to more complex data types like JSON or XML.

Dynamic schema management is essential for detecting and handling unexpected schema changes or anomalies, such as changed data types or missing fields. You must identify and address these issues promptly to maintain data integrity and reliability.
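A minimal sketch of schema-drift detection, assuming a hand-maintained expected schema for an incoming pandas dataframe:

Python
import pandas as pd

EXPECTED_SCHEMA = {'order_id': 'int64', 'quantity': 'int64', 'order_date': 'object'}

def detect_schema_drift(df, expected):
    missing = set(expected) - set(df.columns)
    unexpected = set(df.columns) - set(expected)
    type_changes = {col: (expected[col], str(df[col].dtype))
                    for col in expected
                    if col in df.columns and str(df[col].dtype) != expected[col]}
    return missing, unexpected, type_changes

df = pd.read_csv('orders.csv')  # placeholder source
missing, unexpected, type_changes = detect_schema_drift(df, EXPECTED_SCHEMA)
if missing or unexpected or type_changes:
    print('Schema drift detected:', missing, unexpected, type_changes)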

It is also important to implement monitoring and automated response systems. Here are key strategies: 

  • Data observability 

  • Anomaly detection

  • Real-time event tracking

Tools like DataForge can help with real-time auto-detection and remediation. Learn more about this aspect of the process here.

Complex operations

Data transformation often involves complex operations requiring SQL scripting within the database. Such operations result in monolithic code and complex patterns. Any alteration to a schema requires updates to all downstream SQL scripts. This results in tedious troubleshooting and the need for regression testing upon every change.

To address this scaling challenge, DataForge has created the industry’s leading open-source project and a hosted solution to allow data engineers to declaratively define data sources, relationships, and transformation rules in YAML files using functional programming paradigms, avoiding the burden of maintaining interdependent SQL code and scripts.

In the YAML example below, the source table is “orders,” and the input fields are order_id, quantity, and order_date. The “rules” clause defines the transformations. The “relations” clause defines the relationships between tables across sources, and the “channels” clause maps the source data to the output. The last section of the file defines the output table and its fields (order_date and rev_by_date).

DataForge
YAML
source_name: orders
source_description: "Table containing order data."
target_table_name: total_sales
raw_attributes:
- order_id int
- quantity int
- order_date date
rules:
 - name: revenue 
   expression: "[This].quantity * [products].price"

relations:
 - name: "Order-Product Relationship"
   expression: "[orders].product_id = [products].product_id"
   cardinality: "M-1"

output_name: rev_by_date
columns:
  - order_date date
  - rev_by_date decimal(18,2)
channels:
  source_name: orders
  operation_type: "Aggregate"
  mappings:
    - order_date order_date
    - SUM(revenue) rev_by_date
  

This approach makes adding additional rules or table columns in the YAML file easy and avoids the need for rewriting SQL scripts when the schema changes or new data is added to the pipeline.

Read this two-part blog (part 1 and part 2) to learn more.

Last thoughts

This article covered the core components of a data transformation process and recommended best practices for implementing them. 

However, scaling data transformation processes at an enterprise level results in difficult-to-maintain code justifying the need for adopting a new approach. DataForge offers a solution to simplify the data transformation process with a declarative approach. Visit the DataForge product page to learn more.
