Data Transformation Tools
SQL Transformation: Best Practices & Modern Techniques
Multi-chapter guide | Chapter 3
Data engineers have used SQL code for decades to convert, clean, combine, and order data. However, current methods of continuous software delivery with massive data volumes introduce new challenges in development, regression testing, and troubleshooting when relying on conventional SQL statements for data transformation.
Despite its power and popularity, SQL has limitations stemming from its origins as an interactive command tool with a table-centric approach to data manipulation. These limitations, including the lack of object-oriented features and the inability to order execution without the help of external scripts, can make complex data transformations challenging.
This article explores popular SQL transformation best practices that help scale SQL in larger application environments. It also explains the advancements that functional programming has brought to SQL in recent years and introduces the leading open-source project that offers such transformation technology.
Summary of key SQL transformation concepts
Concept | Description |
---|---|
Modularity | Break down tasks into smaller, reusable functions that promote code reuse and maintainability. |
Immutability | Use transformations that produce new datasets without altering the original data, ensuring consistent outputs. |
Declarative configurations | Use declarative constructs that emphasize “what” over “how” in the transformation. |
Error handling | Isolate errors to specific steps, provide detailed messages, and prevent cascading failures. |
Scalability & Performance | Adapt to growing data needs without significant performance degradation, supporting various scaling strategies and enhancing performance. |
Readability and maintainability | Enhance code clarity, reduce technical debt, and improve long-term sustainability. |
#1 Modularity
Common steps like cleaning data, standardizing formats, and performing aggregations introduce redundancy when written out multiple times throughout a project, which makes code updates harder to manage.
Instead, you can create modular scripts for smaller functions and promote their reuse across the project. Isolated functions are easier to understand, test, and modify without affecting the entire codebase.
Stored procedures
Stored procedures promote modularization and prevent repetitive code in SQL. They allow you to encapsulate frequently used queries into a single callable routine. Below is an example of a SQL stored procedure that truncates product names and rounds off prices.
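A minimal sketch in T-SQL (the procedure name is illustrative; the Products columns follow the attributes used later in this chapter):

CREATE PROCEDURE CleanProducts
AS
BEGIN
    -- Truncate long product names to 30 characters (with an ellipsis)
    -- and round the average cost to two decimal places
    UPDATE Products
    SET product_name = CASE
            WHEN LEN(product_name) > 30 THEN CONCAT(LEFT(product_name, 30), '...')
            ELSE product_name
        END,
        average_cost = ROUND(average_cost, 2);
END;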
Pre-compiling stored procedures can reduce code duplication, centralize the logic, and enhance performance. However, as the number of transformations and complexity increases, they can become monolithic and hard to maintain. Each stored procedure becomes rigidly tied to the table that it references, which makes it difficult to reuse in other pipelines, and ultimately compromises modularity.
In a functional programming approach, you can segregate SQL transformations as reusable “rules” closely tied to your table definition. You can easily modify and scale these rules without altering the table's underlying code.
If you are new to applying functional programming paradigms to SQL, read this article to understand how the open-source project DataForge Core decomposes table logic into smaller scopes of data operations by defining column-pure and row-pure functions.
For example, DataForge Core lets you write YAML code that defines the Products source and lists the associated transformations as rules.
source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
- name: "Round Average Cost"
expression: "ROUND([This].average_cost, 2)"
- name: "Truncate Product Name"
expression: “CASE
WHEN LENGTH([This].product_name) > 30
THEN CONCAT(LEFT([This].product_name, 30), '...')
ELSE [This].product_name
END”
As data transformations grow, the functional approach lets you isolate transformation logic at the grain of an individual rule (attribute/column). This promotes the reuse of rules, addressing stored procedures’ shortcomings. The expressions do not refer directly to the underlying table. Structuring commonly used transformations this way also makes them more readable to a new engineer.
#2 Immutability
Immutable, or read-only, data cannot be modified once created. SQL transformations act on immutable data and produce new datasets, maintaining data integrity.
Maintaining immutability creates a data audit trail that preserves the evolution of data throughout a transformation. This allows for accountability, easier troubleshooting, and error reproduction. If a transformation causes failures in a production environment, you can quickly access and revert to the previous data state to avoid downtime.
Immutable data also produces the same output every time it is transformed. Your data pipelines remain reliable, and you can trust that outputs remain unchanged when multiple transformations are applied in a sequence or when debugging.
Views
SQL views are often used to maintain immutability. Consider a Sales table that captures the details of every sale a particular shop makes.
To store the individual revenue from each sale, you can create a view SalesDetails as shown below.
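One possible definition, assuming the Sales columns used later in this chapter and a Products table with a Price column:

CREATE VIEW SalesDetails AS
SELECT
    s.TransactionID,
    s.CustomerID,
    s.ProductID,
    s.Quantity,
    p.Price * s.Quantity AS Revenue  -- revenue derived per sale; the original tables are untouched
FROM Sales s
JOIN Products p ON s.ProductID = p.ProductID;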
Assuming a separate Products table, the view joins both original datasets and contains all the original columns plus the associated revenue (Price * Quantity). You can then refer to this view like any other table.
By doing the above, you maintain the integrity of your original Sales table while achieving the required transformation result. For improved performance, you can also use a materialized view. The key difference is that a materialized view is physically stored on disk. Since non-materialized views are virtual, retrieving data from them relies on querying the underlying tables, which makes them slower. Materialized views can pre-compute queries because they exist physically, but on the flip side, they require more storage. They are preferred when dealing with large datasets and when retrieving real-time data is not a priority. You can define a materialized view by simply including the ‘MATERIALIZED’ keyword, as below:
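For instance, in PostgreSQL-style syntax (other platforms differ; the view name is illustrative):

-- Results are physically stored and can be refreshed on a schedule
CREATE MATERIALIZED VIEW SalesDetailsMaterialized AS
SELECT
    s.TransactionID,
    s.ProductID,
    s.Quantity,
    p.Price * s.Quantity AS Revenue
FROM Sales s
JOIN Products p ON s.ProductID = p.ProductID;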
Much like stored procedures, views and materialized views tend to be too rigidly tied to the associated tables, reducing the ability to reuse or modularize them.
A functional approach avoids separate views and lets you define new attributes without impacting data, while improving modularity. For example, consider the functional code block from DataForge.
source_name: "Sales"
raw_attributes:
- TransactionID int
- CustomerID int
- ProductID int
- Quantity int
rules:
- attribute_name: "Revenue"
expression: "[This].Quantity * [Products].Price"
#Relations
- name: "[Sales]- ProductID -[Products]"
expression: "[Sales].ProductID = [Products].ProductID"
cardinality: "M-1"
The new attribute, Revenue, transforms data from the Sales and Products tables to generate a calculated value. By defining this new attribute within the rules section, the original data remains untouched and the Sales table stays immutable; moreover, the scope of immutability is at the column level rather than the entire view or table. Relations, which define the connections between sources, are typically defined separately, but we have included one below the table definition for simplicity.
The functional approach makes complicated transformations well-documented and easily accessible, all while preserving immutability. You don’t need to learn a new way of writing code, since DataForge has a cloud offering that lets you use its web application for most transformations. Find more examples here.
Overview of DataForge Cloud for data management (Source)
#3 Declarative configurations
Clarity of objectives becomes increasingly important as data complexity grows. Data engineers are more concerned with “what” a transformation does than “how” it does so. Procedural structures such as loops and conditionals increase verbosity and decrease readability. In comparison, a declarative approach conveys the required outcome and lets the database management system handle the underlying tasks. It promotes consistency and standardization across transformations.
SQL provides several declarative constructs, like SELECT, JOIN, and INSERT, for expressing transformations. Using a simple SQL transformation as an example:
CREATE VIEW employee_details AS
WITH full_names AS (
SELECT CONCAT(first_name, ' ', last_name) AS full_name, employee_id
FROM employees
),
ages AS (
SELECT employee_id, DATEDIFF(CURRENT_DATE(), date_of_birth) / 365.25 AS age
FROM employees
)
SELECT fn.full_name, a.age
FROM full_names fn
JOIN ages a ON fn.employee_id = a.employee_id;
It is evident from the code that you want to retrieve every employee's full name and age from their data. On the other hand, if you were to use a procedural approach using Python, it would look like this:
import datetime
import sqlite3

conn = sqlite3.connect('employees.db')
cursor = conn.cursor()
cursor.execute("SELECT first_name, last_name, date_of_birth FROM employees")

results = []
for first_name, last_name, date_of_birth in cursor:
    # Build the full name and compute the age in years
    # (assumes dates are stored as ISO-8601 strings, e.g. '1990-05-17')
    full_name = f"{first_name} {last_name}"
    birth_date = datetime.date.fromisoformat(date_of_birth)
    age = (datetime.date.today() - birth_date).days / 365.25
    results.append((full_name, age))

for full_name, age in results:
    print(f"Name: {full_name}, Age: {age:.2f}")

cursor.close()
conn.close()
The line count increases because of the results list, the two for-loops, and the extra logic inside them. As query complexity increases, the logic also becomes more verbose. In contrast, the declarative approach of SQL allows the engineer to prioritize the transformation’s result while the SQL engine handles the execution. However, the downside is that the process can’t be restarted at an intermediate step, creating challenges for backfilling and regression testing. Also, it is important to note that while individual SQL statements are declarative, multi-statement SQL scripts often are not, because such scripts dictate the exact order of data transformations and persistence.
Like SQL, a functional approach is also effective in avoiding code complexity. In DataForge, for example, after defining the employee table you can express the full_name and age functions as rules and reuse them whenever needed. Ultimately, you get the benefits of being declarative while abstracting the data processing and separating it from the application code. This separation also encourages modular testing and a more foolproof transformation. DataForge adds storage and orchestration without breaking the declarative paradigm by using an event-based workflow engine rather than static DAGs (directed acyclic graphs) or procedural definitions.
#4 Error handling
Working with long, complex transformations may produce errors that cascade into downstream processes and potentially corrupt data. Errors can arise from erroneous data, constraint violations, or failing DBMS connections. Error handling prevents the transformation flow from halting completely. A strategic way to deal with errors in SQL is using automatic retries:
Automatic Retries
Below is an example of automatic retries in SQL Server while trying to connect to a cloud service:
DECLARE @retryCount INT = 0;
DECLARE @maxRetries INT = 3;
DECLARE @success BIT = 0;

WHILE @retryCount < @maxRetries AND @success = 0
BEGIN
    BEGIN TRY
        EXECUTE [CloudService].[InsertStudent] @ID = 1, @Name = 'Joe';
        SET @success = 1;
    END TRY
    BEGIN CATCH
        SET @retryCount = @retryCount + 1;
        PRINT 'Temporary error occurred. Retrying...';
        WAITFOR DELAY '00:00:05';
    END CATCH
END
In terms of integration, many issues are temporary and mainly caused by outages, network problems, or race conditions. Differentiating between transient errors and systemic or logical issues provides major operational benefits. Automatically retrying these connection issues reduces the workload on data teams by allowing them to focus their resources on issues that require closer attention.
A key point to note concerning SQL’s handling of errors is that it occurs at runtime. This can lead to various issues resulting in more downtime and operational costs. Runtime handling also requires engineers to be very careful while scripting queries, which can be time-consuming. SQL error handling is also tightly coupled with database transactions. When an error occurs, the database platform typically rolls back the entire transaction, so data engineers need to carefully monitor the scope of transaction execution and create retry code for multiple specific entry points for each potential failure. DataForge, on the contrary, handles errors at compile time. It validates SQL expressions and data types when compiling a project to SQL. This frees engineers from worrying about error-handling techniques like transactions and try-catch blocks, as they are built into the transactions and write operations by the framework. On a large scale, such automated error handling can save data teams a great deal of time and effort.
#5 Scalability & Performance
Scalability ensures the database is manageable as transformations grow in size and complexity. Growth is typically measured in terms of the data volume involved, the number of concurrent users, and business requirements. A well-thought-out scalable solution avoids a decrease in performance when the project inevitably grows.
SQL supports both vertical and horizontal scaling. You can improve CPU power, RAM, and storage hardware or distribute the transformation workloads across nodes.
Sharding distributes data across multiple databases, called shards, to manage increasing data volumes. For instance, in a user schema with fields like userID, name, email, and userRegion, you can shard the data based on the user’s region, storing North American users in UsersDB1, Asian users in UsersDB2, and so forth. This method effectively reduces the load on any single database and allows for targeted queries based on regions.
A diagram describing the process of sharding (Source: DataForge)
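For illustration, with region-based sharding in place, a query for North American users only needs to touch the shard that holds them (the database, schema, and table names below are hypothetical):

-- All North American users live in UsersDB1, so only that shard is queried
SELECT userID, name, email
FROM UsersDB1.dbo.Users;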
Cross-shard queries
When data is spread across multiple shards, you may need to perform cross-shard queries, such as retrieving all users with a Gmail account across all regions. A sharding router coordinates and optimizes queries across the relevant shards, ensuring that only the necessary shards are queried.
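Without a router, such a cross-shard query has to be written manually against every shard and combined, which is exactly the coordination the router automates (shard names continue the hypothetical example above):

-- Find Gmail users in every region by querying each shard and unioning the results
SELECT userID, name, email FROM UsersDB1.dbo.Users WHERE email LIKE '%@gmail.com'
UNION ALL
SELECT userID, name, email FROM UsersDB2.dbo.Users WHERE email LIKE '%@gmail.com';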
Managing transactions across shards can be complex, as you must ensure data consistency and integrity across multiple databases. Eventual consistency models accept that data across shards may not be immediately consistent but will eventually reach a consistent state. Adopting eventual consistency provides a more scalable solution. Read more about how DataForge can help in containing data engineering costs.
A diagram describing the process of cross-sharding (Source: DataForge)
Certain queries require the redistribution of data among partitions to allow for different grouping. This is known as “shuffling” and is an expensive operation due to copying and rewriting data. For example, an operation like map() does not require data to be shuffled, but sortByKey(), reduceByKey(), and groupByKey() do, due to the creation of a new RDD (Resilient Distributed Dataset). Modern data formats like Parquet, Delta, and Iceberg let the underlying query engine (e.g., Spark) automatically “shard” data using one or more columns. These pre-defined partitions significantly reduce the overhead that shuffling creates. SQL Server databases, on the other hand, require careful configuration to enable partitioning, especially when dealing with large datasets, and avoiding shuffling is a manual, time-consuming process. Modern platforms like Spark are therefore highly recommended when scaling out databases, due to their efficiency advantages.
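As a rough Spark SQL sketch of this idea, a Delta table can declare its partition column up front so the engine prunes partitions instead of shuffling the full dataset (the table and column names are illustrative):

-- Delta table partitioned by productid; queries filtered on productid read only the matching partitions
CREATE TABLE sales_delta (
    transactionId INT,
    productid INT,
    quantity DECIMAL(18, 2)
)
USING DELTA
PARTITIONED BY (productid);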
Parallel Processing
Performance is key in data transformation, especially when dealing with large datasets. Parallel processing distributes transformation tasks across multiple processors or threads, drastically improving efficiency. For instance, a SELECT operation on a table with 1 billion rows takes 300 seconds on a single CPU core. However, partitioning the data into eight segments and using eight CPU cores reduces execution time to around 40 seconds—a 7.5x improvement.
Consider the Sales table with tens of thousands of rows. To optimize processing, you can partition the data into separate tables based on ProductID, as shown:
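A T-SQL-style sketch of this manual partitioning, with illustrative table names:

-- Split the Sales table into per-product tables
SELECT * INTO Sales_Product1 FROM Sales WHERE ProductID = 1;
SELECT * INTO Sales_Product2 FROM Sales WHERE ProductID = 2;
-- ...and so on for the remaining products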
To find the total quantity of every product sold, you run the relevant query on each partitioned table, as shown below. The UNION ALL keyword combines the results in the end.
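Continuing the sketch above, the per-partition aggregates are combined with UNION ALL:

SELECT ProductID, SUM(Quantity) AS TotalQuantity FROM Sales_Product1 GROUP BY ProductID
UNION ALL
SELECT ProductID, SUM(Quantity) AS TotalQuantity FROM Sales_Product2 GROUP BY ProductID;
-- ...one SELECT per partitioned table, combined at the end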
As you can see, SQL requires explicitly defining each partition and handling their aggregation through separate queries, leading to more maintenance.
The rule-based approach in functional programming supports partitioning more efficiently.
# Source File
source_name: "sales"
source_query: "SELECT * FROM sales"
raw_attributes:
  - transactionId int
  - productid int
  - quantity decimal

# Output File
output_name: "sales_aggregate"
columns:
  - productid int
  - total_sales decimal(18,2)
channels:
  - source_name: "sales"
    operation_type: "Aggregate"
    mappings:
      - SUM(quantity) partition_1_total
      - productid productid
As shown above, you can use DataForge to define modular transformations without manually managing underlying partitions or creating complex SQL queries. The output file contains a channel that aggregates the sales quantity of every product in the associated ‘sales’ table, and using DataForge Cloud you can define a parameter that partitions the dataset accordingly (in this case, MOD([THIS].product_id, 4)). The ability to configure partitioning this way implements parallelism without having to build the solution from scratch and write lengthy scripts.
#6 Readability and maintainability
It is important that code related to updating data, some of which might be sensitive, is very clear in its aim. Clarity smooths collaboration within teams, lets new developers get up to speed quickly, and reduces the chance of mishandling or deleting important data, especially months or years after the code was written. Some important principles of readable code within SQL transformations are provided below.
Comments
A logical code flow with frequent, concise comments increases readability. The transformation below is easy to read and provides extra information through comments.
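For instance, a short commented transformation might read as follows (the orders table and its columns are hypothetical):

-- Step 1: keep only completed orders from the last 30 days
WITH recent_orders AS (
    SELECT order_id, customer_id, order_total
    FROM orders
    WHERE status = 'completed'
      AND order_date >= DATEADD(DAY, -30, CAST(GETDATE() AS DATE))
)
-- Step 2: aggregate revenue per customer
SELECT customer_id, SUM(order_total) AS revenue_last_30_days
FROM recent_orders
GROUP BY customer_id;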
Naming conventions
Use meaningful variable and self-explanatory function names. For example, when creating a view consisting of the tables users and purchases, it is more intuitive to call the view user_purchases rather than u_p.
Advanced constructs
Using SQL-provided features such as stored procedures, CTEs (Common table expressions), and views can make transformations more concise and scalable and prevent an engineer from being overwhelmed trying to understand the script.
Functional programming approach
SQL code can be verbose and complex when dealing with large data transformations. A combination of SQL and functional programming can neatly organize transformation components. As seen below, using the DataForge framework, SQL code converted into its declarative YAML counterpart looks more intuitive. The data source, transformation rules, and relationships are organized in a way that’s easy to follow, even if the number of transformation rules grows to hundreds.
source_name: orders
source_description: "Table containing order data."
target_table_name: total_sales
raw_attributes:
  - order_id
  - quantity
  - order_date
rules:
  - name: revenue
    expression: "[This].quantity * [products].price"
relations:
  - name: "Order-Product Relationship"
    expression: "[orders].product_id = [products].product_id"
    cardinality: "M-1"

output_name: rev_by_date
columns:
  - order_date date
  - rev_by_date decimal(18,2)
channels:
  source_name: orders
  operation_type: "Aggregate"
  mappings:
    - order_date order_date
    - SUM(revenue) rev_by_date
Last thoughts
This article reviewed a few pitfalls and best practices of SQL in data transformation. As data volumes and complexity increase, SQL presents challenges stemming from its monolithic nature. Data engineers can enhance the efficiency and effectiveness of SQL transformations by focusing on modularity and reusability. Functional programming models like DataForge retain immutability, declarative configurations, and other existing SQL capabilities while providing better error handling, scalability, and readability for complex data projects.