Data Transformation Tools

SQL Transformation: Best Practices & Modern Techniques

Multi-chapter guide | Chapter 3

Data engineers have used SQL code for decades to convert, clean, combine, and order data. However, current methods of continuous software delivery with massive data volumes introduce new challenges in development, regression testing, and troubleshooting when relying on conventional SQL statements for data transformation. 

Despite its power and popularity, SQL has limitations stemming from its origins as an interactive command tool with a table-centric approach to data manipulation. These limitations, such as the lack of object-oriented features and the inability to order execution without the help of external scripts, can make complex data transformations challenging.

This article explores popular SQL transformation best practices that help scale SQL in larger application environments. It also explains the advancements that functional programming has brought to SQL in recent years and introduces the leading open-source project that offers such transformation technology.

Summary of key SQL transformation concepts

Modularity: Break down tasks into smaller, reusable functions that promote code reuse and maintainability.
Immutability: Use transformations that produce new datasets without altering the original data, ensuring consistent outputs.
Declarative configurations: Use declarative constructs that emphasize “what” over “how” in the transformation.
Error handling: Isolate errors to specific steps, provide detailed messages, and prevent cascading failures.
Scalability & Performance: Adapt to growing data needs without significant performance degradation, supporting various scaling strategies and enhancing performance.
Readability and maintainability: Enhance code clarity, reduce technical debt, and improve long-term sustainability.

#1 Modularity

Common steps like cleaning data, standardizing formats, and performing aggregations introduce redundancy if written multiple times throughout a project, making code updates harder to manage.

Instead, you can create modular scripts for smaller functions and promote their reuse across the project. Isolated functions are easier to understand, test, and modify without affecting the entire codebase.
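
As a minimal sketch of that idea (the function name and schema below are illustrative, and the syntax follows SQL Server), a small scalar function can encapsulate one cleaning step so that every query or procedure reuses the same logic instead of repeating it:

SQL
CREATE FUNCTION dbo.TruncateName (@name VARCHAR(255), @max_len INT)
RETURNS VARCHAR(255)
AS
BEGIN
    -- One cleaning rule, defined once and reused everywhere
    RETURN CASE
               WHEN LEN(@name) > @max_len THEN LEFT(@name, @max_len) + '...'
               ELSE @name
           END;
END;
GO

-- Any query or procedure can now reuse the same cleaning step
SELECT dbo.TruncateName(product_name, 30) AS product_name,
       ROUND(price, 2) AS price
FROM Products;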

Stored procedures

Stored procedures promote modularization and prevent repetitive code in SQL. They allow you to encapsulate frequently used queries into a single callable routine. Below is an example of a stored procedure that truncates product names and rounds off prices.

SQL
CREATE PROCEDURE CleanAndTransformProducts
AS
BEGIN
    UPDATE Products
    SET price = ROUND(price, 2),
        product_name = CASE
            WHEN LEN(product_name) > 30 THEN LEFT(product_name, 30) + '...'
            ELSE product_name
        END;
END
  

Stored procedures reduce code duplication, centralize logic, and, because they are precompiled, enhance performance. However, as the number and complexity of transformations increase, they can become monolithic and hard to maintain. Each stored procedure becomes rigidly tied to the table it references, which makes it difficult to reuse in other pipelines and ultimately compromises modularity.

In a functional programming approach, you can segregate SQL transformations as reusable “rules” closely tied to your table definition. You can easily modify and scale these rules without altering the underlying table definition. 

If you are new to applying functional programming paradigms to SQL, read this article to understand how the open-source project DataForge Core decomposes table logic into smaller scopes of data operations by defining column-pure and row-pure functions.

For example, DataForge Core lets you write YAML code to define the Products source and list the associated transformations as rules.

YAML
source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
 - name: "Round Average Cost"
   expression: "ROUND([This].average_cost, 2)"
 - name: "Truncate Product Name"
   expression: "CASE
     WHEN LENGTH([This].product_name) > 30
     THEN CONCAT(LEFT([This].product_name, 30), '...')
     ELSE [This].product_name
   END"
  

As data transformations grow, the functional approach allows you to isolate transformation logic at the grain of an individual rule (attribute/column). This promotes the reuse of rules and addresses stored procedures’ shortcomings. The expressions never refer directly to the underlying table. Structuring commonly used transformations this way also makes them more readable to a new engineer.

#2 Immutability

Once created, immutable or read-only data is unmodifiable. SQL transformations act on the immutable data and produce new data sets to maintain data integrity. 

Maintaining immutability creates a data audit trail that preserves the evolution of data throughout a transformation. This allows for accountability, easier troubleshooting, and error reproduction. If a transformation causes failures in a production environment, you can quickly access and revert to the previous data state to avoid downtime.

Immutable data also produces the same output every time it is transformed. Your data pipelines remain reliable, and you can trust that outputs remain unchanged when multiple transformations are applied in a sequence or when debugging.
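
As a minimal sketch of this principle (CREATE TABLE AS syntax, supported by most analytical databases; the Products_Cleaned name is illustrative), a transformation can write its output to a new table instead of updating rows in place, leaving the raw data untouched:

SQL
-- The raw Products table is never modified; the transformation materializes a new dataset
CREATE TABLE Products_Cleaned AS
SELECT product_id,
       CASE
           WHEN LENGTH(product_name) > 30 THEN CONCAT(LEFT(product_name, 30), '...')
           ELSE product_name
       END AS product_name,
       ROUND(price, 2) AS price
FROM Products;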

Views

SQL views are often used to maintain immutability. Consider a Sales table that captures the details of every sale a particular shop makes.

SQL
CREATE TABLE Sales (
	TransactionID INT PRIMARY KEY,
	CustomerID INT,
	ProductID INT,
	Quantity INT
);
  

To calculate the individual revenue from each sale, you can create a view SalesDetails, as shown below.

SQL
CREATE VIEW SalesDetails AS
SELECT Sales.TransactionID,
   	Sales.CustomerID,
   	Sales.ProductID,
   	Sales.Quantity,
   	Sales.Quantity * Products.Price AS Revenue
FROM Sales
JOIN Products ON Sales.ProductID = Products.ProductID;
  

Assuming a separate Products table, the view joins the two original tables and exposes all the original columns plus the associated revenue (Price * Quantity). You can then query this view like any other table.

SQL
SELECT TransactionID, CustomerID, ProductID, Quantity, Revenue
FROM SalesDetails;
  

By doing the above, you maintain the integrity of your original Sales table while achieving the required transformation result. For improved performance, you can also use a materialized view. The key difference is that a materialized view is physically stored on disk. Because non-materialized views are virtual, retrieving data from them requires querying the underlying tables, which makes them slower. Materialized views pre-compute and persist their query results, but in exchange they require more storage. They are preferred for large data sets when real-time freshness is not a priority. You can define a materialized view by simply including the MATERIALIZED keyword, as below:

SQL
CREATE MATERIALIZED VIEW SalesDetails AS
SELECT Sales.TransactionID,
       Sales.CustomerID,
       Sales.ProductID,
       Sales.Quantity,
       Sales.Quantity * Products.Price AS Revenue
FROM Sales
JOIN Products ON Sales.ProductID = Products.ProductID;
  

Much like stored procedures, views and materialized views tend to be rigidly tied to the associated tables, reducing the ability to reuse or modularize them.

A functional approach avoids separate views and lets you define new attributes without impacting the underlying data, all while improving modularity. For example, consider this functional code block from DataForge.

YAML
source_name: "Sales"
raw_attributes:
  - TransactionID int
  - CustomerID int
  - ProductID int
  - Quantity int
rules:
  - attribute_name: "Revenue"
    expression: "[This].Quantity * [Products].Price"

# Relations define the connections between sources
relations:
  - name: "[Sales]- ProductID -[Products]"
    expression: "[Sales].ProductID = [Products].ProductID"
    cardinality: "M-1"
  

The new attribute, Revenue, transforms data from the Sales and Products tables to generate a calculated value. By defining this new attribute within the rules section, not only does the original data remain untouched, maintaining the immutability of the Sales table, but the scope of immutability is at the column level rather than the entire view or table. Relations, which define the connections between sources, are typically defined separately, but we have included them below the table definition for simplicity.

The functional approach makes complicated transformations well-documented and easily accessible, all while preserving immutability. You don’t need to learn a new way of writing code: DataForge’s cloud offering lets you perform most transformations through its web application. Find more examples here.

Overview of DataForge Cloud for data management (Source)

#3 Declarative configurations

Clarity of objectives becomes increasingly important as data complexity grows. Data engineers are more concerned with “what” a transformation does than “how” it does so. Procedural structures such as loops and conditionals increase verbosity and decrease readability. In comparison, a declarative approach conveys the required outcome and lets the database management system handle the underlying tasks. It promotes consistency and standardization across transformations. 

SQL provides several declarative constructs like SELECT, JOIN, and INSERT for expressing transformations. Using a simple SQL transformation as an example:

SQL
CREATE VIEW employee_details AS
WITH full_names AS (
  SELECT CONCAT(first_name, ' ', last_name) AS full_name, employee_id
  FROM employees
),
ages AS (
  SELECT employee_id, DATEDIFF(CURRENT_DATE(), date_of_birth) / 365.25 AS age
  FROM employees
)
SELECT fn.full_name, a.age
FROM full_names fn
JOIN ages a ON fn.employee_id = a.employee_id;
  

It is evident from the code that you want to retrieve every employee's full name and age from their data. On the other hand, a procedural approach in Python would look like this:

Python
import sqlite3
import datetime

conn = sqlite3.connect('employees.db')
cursor = conn.cursor()

cursor.execute("SELECT first_name, last_name, date_of_birth FROM employees")

results = []
for first_name, last_name, date_of_birth in cursor:
    full_name = f"{first_name} {last_name}"
    # SQLite stores dates as ISO-formatted text, so parse before doing arithmetic
    dob = datetime.date.fromisoformat(date_of_birth)
    age = (datetime.date.today() - dob).days / 365.25
    results.append((full_name, age))

for full_name, age in results:
    print(f"Name: {full_name}, Age: {age:.2f}")

cursor.close()
conn.close()
  

The line count increases due to the results list, two for-loops, and the logic within them. As query complexity grows, the logic becomes even more verbose. In contrast, SQL's declarative approach lets the engineer focus on the transformation's result while the SQL engine handles execution. The downside is that the process can't be restarted at an intermediate step, which creates challenges for backfilling and regression testing. It is also important to note that while individual SQL statements are declarative, multi-statement SQL scripts often are not, because they dictate the exact order of data transformations and persistence.
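
For instance, here is a minimal sketch of such a multi-statement script (the staging_employees and employee_report tables are illustrative): each statement is declarative on its own, but the script as a whole prescribes the order of execution and where intermediate results are persisted:

SQL
-- Step 1: persist an intermediate result
CREATE TABLE staging_employees AS
SELECT employee_id,
       CONCAT(first_name, ' ', last_name) AS full_name,
       date_of_birth
FROM employees;

-- Step 2: must run after step 1, because it depends on the staged table
INSERT INTO employee_report (employee_id, full_name, age)
SELECT employee_id,
       full_name,
       DATEDIFF(CURRENT_DATE(), date_of_birth) / 365.25 AS age
FROM staging_employees;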

Like SQL, a functional approach is also effective in avoiding code complexity. For example, in DataForge:

YAML
rules:
- name: "full_name"
  expression: "CONCAT([This].first_name, ' ', [This].last_name)"

- name: "age"
  expression: "DATEDIFF(CURRENT_DATE(), [This].date_of_birth) / 365.25"
  

After defining the employee table, you can express the associated functions full_name and age as rules and reuse them whenever needed. Ultimately, you get the benefits of being declarative while abstracting the data processing and separating it from the application code. This separation also encourages modular testing and a more foolproof transformation. DataForge adds storage and orchestration without breaking the declarative paradigm by using an event-based workflow engine rather than static DAGs (directed acyclic graphs) or procedural definitions.

#4 Error handling

Working with long, complex transformations may produce errors that cascade into downstream processes and potentially corrupt data. Errors can arise from erroneous data, constraint violations, or failing DBMS connections. Error handling prevents the transformation flow from halting completely. A strategic way to deal with errors in SQL is automatic retries.

Automatic Retries

Below is an example of automatic retries in SQL Server while trying to connect to a cloud service:

SQL
DECLARE @retryCount INT = 0;
DECLARE @maxRetries INT = 3;
DECLARE @success BIT = 0;

WHILE @retryCount < @maxRetries AND @success = 0
BEGIN
    BEGIN TRY
        EXECUTE [CloudService].[InsertStudent] @ID = 1, @Name = 'Joe';
        SET @success = 1;
    END TRY
    BEGIN CATCH
        SET @retryCount = @retryCount + 1;
        PRINT 'Temporary error occurred. Retrying...';
        WAITFOR DELAY '00:00:05';
    END CATCH
END
  

In terms of integration, many issues are temporary and mainly caused by outages, network problems, or race conditions. Differentiating between transient errors and systemic or logical issues provides major operational benefits. Automatically retrying these connection issues reduces the workload on data teams by allowing them to focus their resources on issues that require closer attention.
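
Building on the retry loop above, a hedged sketch of that differentiation in T-SQL checks ERROR_NUMBER() in the CATCH block and retries only codes on a known transient list (the specific numbers to treat as transient depend on your platform; the ones below are examples, so adjust the list for your environment):

SQL
DECLARE @retryCount INT = 0;
DECLARE @maxRetries INT = 3;
DECLARE @success BIT = 0;

WHILE @retryCount < @maxRetries AND @success = 0
BEGIN
    BEGIN TRY
        EXECUTE [CloudService].[InsertStudent] @ID = 1, @Name = 'Joe';
        SET @success = 1;
    END TRY
    BEGIN CATCH
        -- Retry only errors on a known transient list (example codes; adjust per platform)
        IF ERROR_NUMBER() IN (40501, 40613)
        BEGIN
            SET @retryCount = @retryCount + 1;
            WAITFOR DELAY '00:00:05';
        END
        ELSE
        BEGIN
            -- Systemic or logical issue: fail fast and surface it to the caller
            THROW;
        END
    END CATCH
END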

A key point concerning SQL’s error handling is that it occurs at runtime. This can lead to more downtime and higher operational costs, and it requires engineers to be very careful while scripting queries, which is time-consuming. SQL error handling is also tightly coupled with database transactions: when an error occurs, the database platform typically rolls back the entire transaction, so data engineers must carefully monitor the scope of transaction execution and create retry code for each potential failure point. DataForge, on the contrary, handles errors at compile time. It validates SQL expressions and data types when compiling a project to SQL. This frees engineers from worrying about error-handling techniques like transactions and try-catch blocks, as the framework builds them into its transactions and write operations. At scale, such automated error handling can save data teams a great deal of time and effort.

#5 Scalability & Performance

Scalability ensures the database is manageable as transformations grow in size and complexity. Growth is typically measured in terms of the data volume involved, the number of concurrent users, and business requirements. A well-thought-out scalable solution avoids a decrease in performance when the project inevitably grows. 

SQL supports both vertical and horizontal scaling. You can improve CPU power, RAM, and storage hardware or distribute the transformation workloads across nodes. 

Sharding distributes data across multiple databases, called shards, to manage increasing data volumes. For instance, in a user schema with fields like userID, name, email, and userRegion, you can shard the data based on the user’s region, storing North American users in UsersDB1, Asian users in UsersDB2, and so forth. This method effectively reduces the load on any single database and allows for targeted queries based on regions.
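
One simple way to implement that routing in SQL is a lookup table that maps each region to its shard; the shard_map table and database names below follow the running example and are purely illustrative:

SQL
-- Illustrative shard map the application consults to pick the right database
CREATE TABLE shard_map (
    user_region VARCHAR(50) PRIMARY KEY,
    shard_name  VARCHAR(50) NOT NULL
);

INSERT INTO shard_map (user_region, shard_name) VALUES
    ('North America', 'UsersDB1'),
    ('Asia', 'UsersDB2');

-- A region-targeted query then touches only the shard returned by the lookup, e.g.:
-- SELECT userID, name, email FROM UsersDB2.users WHERE userRegion = 'Asia';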

A diagram describing the process of sharding (Source: DataForge)

Cross-shard queries

When data is spread across multiple shards, you may need to perform cross-shard queries, such as retrieving all users with a Gmail account across all regions. A sharding router coordinates and optimizes queries across the relevant shards, ensuring that only the necessary shards are queried. 
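
Without a dedicated router, a cross-shard query can be expressed manually by unioning the relevant shards. A minimal sketch using the example above (database and column names are illustrative):

SQL
-- Gather Gmail users from every regional shard
SELECT userID, name, email, userRegion
FROM UsersDB1.users
WHERE email LIKE '%@gmail.com'
UNION ALL
SELECT userID, name, email, userRegion
FROM UsersDB2.users
WHERE email LIKE '%@gmail.com';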

Managing transactions across shards can be complex, as you must ensure data consistency and integrity across multiple databases. Eventual consistency models accept that data across shards may not be immediately consistent but will eventually reach a consistent state. Adopting eventual consistency provides a more scalable solution. Read more about how DataForge can help in containing data engineering costs.

A diagram describing the process of cross-sharding (Source: DataForge)

Certain queries require redistributing data among partitions to allow for different groupings. This is known as “shuffling” and is an expensive operation due to copying and rewriting data. For example, an operation like map() does not require data to be shuffled, but sortByKey(), reduceByKey(), and groupByKey() do, because they create a new RDD (Resilient Distributed Dataset). Modern data formats like Parquet, Delta, and Iceberg rely on the underlying query engine (e.g., Spark) to automatically “shard” data using one or more columns. These pre-defined partitions significantly reduce the overhead that shuffling creates. SQL Server databases, on the other hand, require careful configuration to enable partitioning, especially with large datasets, and avoiding shuffling is a manual, time-consuming process. Modern platforms like Spark are therefore highly recommended when scaling out databases, due to their efficiency advantages.
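
As a sketch of that pre-partitioning in Spark SQL (table and column names are illustrative), declaring a partition column at table-creation time lets the engine lay data out so that filtered queries skip irrelevant partitions:

SQL
-- Spark SQL: data files are physically organized by the partition column at write time
CREATE TABLE sales_partitioned (
    transaction_id BIGINT,
    product_id     INT,
    quantity       INT,
    sale_date      DATE
)
USING PARQUET
PARTITIONED BY (sale_date);

-- A query filtering on the partition column reads only the matching partitions,
-- avoiding a full scan and reducing shuffling during the aggregation
SELECT product_id, SUM(quantity) AS total_quantity
FROM sales_partitioned
WHERE sale_date = DATE'2024-01-01'
GROUP BY product_id;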

Parallel Processing

Performance is key in data transformation, especially when dealing with large datasets. Parallel processing distributes transformation tasks across multiple processors or threads, drastically improving efficiency. For instance, suppose a SELECT operation on a table with 1 billion rows takes 300 seconds on a single CPU core. Partitioning the data into eight segments and processing it on eight CPU cores could reduce execution time to around 40 seconds, a 7.5x improvement.

Consider the Sales table with tens of thousands of rows. To optimize processing, you can partition the data into separate tables based on ProductID, as shown:

SQL
CREATE TABLE Sales_Partition1 AS
SELECT *
FROM Sales
WHERE MOD(ProductID, 4) = 0;

CREATE TABLE Sales_Partition2 AS
SELECT *
FROM Sales
WHERE MOD(ProductID, 4) = 1;
  

To find the total quantity of every product sold, you run the relevant query on each partitioned table, as shown below. The UNION ALL keyword combines the results in the end.

SQL
SELECT ProductID, SUM(Quantity) AS TotalQuantity
FROM (
   SELECT ProductID, Quantity
   FROM Sales_Partition1
   UNION ALL
   SELECT ProductID, Quantity
   FROM Sales_Partition2
   -- UNION ALL the remaining partitions here
) AS ParallelSales
GROUP BY ProductID;
  

As you can see, SQL requires explicitly defining each partition and handling their aggregation through separate queries, leading to more maintenance.

The rule-based approach in functional programming supports partitioning more efficiently.

YAML
# Source File
source_name: "sales"
source_query: "SELECT * FROM sales"
raw_attributes:
- transactionId int
- productid int
- quantity decimal


# Output File
output_name: "sales_aggregate"
columns:
  - productid int
  - total_sales decimal(18,2)
channels:
  - source_name: "sales"
    operation_type: "Aggregate"
    mappings:
      - SUM(quantity) total_sales
      - productid productid
  

As shown above, you can use DataForge to define modular transformations without manually managing underlying partitions or writing complex SQL queries. The output file contains a channel that aggregates the sales quantity of every product in the associated sales table, and with DataForge Cloud you can define a parameter that partitions the dataset accordingly (in this case, MOD([This].productid, 4)). Configuring partitioning this way implements parallelism without having to build the solution from scratch or write lengthy scripts.

#6 Readability and maintainability

Code that updates data, some of which might be sensitive, must be very clear in its aim. Clarity smooths collaboration within teams, eases the onboarding of new developers, and reduces the chance of mishandling or deleting important data, especially months or years after the code was written. Some important principles of readable SQL transformations are provided below.

Comments

A logical code flow with frequent, concise comments increases readability. The below transformation is easily readable and provides extra information through comments.

SQL
-- Start transaction
BEGIN TRANSACTION;

-- Create a view to calculate total sales
CREATE VIEW total_sales AS
SELECT
    o.order_date,
    SUM(p.price * o.quantity) AS revenue_by_date
FROM
    orders o
JOIN
    products p ON o.product_id = p.product_id
GROUP BY
    o.order_date;

-- Commit transaction
COMMIT;
  

Naming conventions

Use meaningful variable names and self-explanatory function names. For example, when creating a view joining the tables users and purchases, it is more intuitive to call the view user_purchases rather than u_p.
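
A minimal sketch of that convention (illustrative schema):

SQL
-- The view name states exactly what it contains
CREATE VIEW user_purchases AS
SELECT u.user_id,
       u.name,
       p.purchase_id,
       p.amount
FROM users u
JOIN purchases p ON u.user_id = p.user_id;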

Advanced constructs

Using SQL-provided features such as stored procedures, CTEs (Common table expressions), and views can make transformations more concise and scalable and prevent an engineer from being overwhelmed trying to understand the script.

Functional programming approach

SQL code can become verbose and complex in large data transformations. A combination of SQL and functional programming can neatly organize transformation components. As seen below, using the DataForge framework, the SQL code converted into its declarative YAML counterpart looks more intuitive. The data source, transformation rules, and relationships are organized in a way that’s easy to follow, even if the number of transformation rules grows to hundreds.

YAML
source_name: orders
source_description: "Table containing order data."
target_table_name: total_sales
raw_attributes:
- order_id int
- product_id int
- quantity int
- order_date date
rules:
 - name: revenue 
   expression: "[This].quantity * [products].price"
relations:
 - name: "Order-Product Relationship"
   expression: "[orders].product_id = [products].product_id"
   cardinality: "M-1"

output_name: rev_by_date
columns:
  - order_date date
  - rev_by_date decimal(18,2)
channels:
  - source_name: orders
    operation_type: "Aggregate"
    mappings:
      - order_date order_date
      - SUM(revenue) rev_by_date
  

Last thoughts

This article reviewed a few pitfalls and best practices of SQL in data transformation. As data volumes and complexity increase, SQL presents challenges stemming from its monolithic nature. Data engineers can enhance the efficiency and effectiveness of SQL transformations by focusing on modularity and reusability. Functional programming models like DataForge retain immutability, declarative configurations, and other existing SQL capabilities while better supporting the error handling, scalability, and readability that complex data projects demand.
