Introduction to the DataForge Declarative Transformation Framework
Building, maintaining, and extending data pipelines and transformation logic is the backbone of every data product. Without consolidated and clean datasets, the innovations produced by analytics, data science, AI models, and LLMs never get off the starting line.
At DataForge, we have spent the past 6+ years researching the fastest, easiest, and most extensible ways to build and maintain transformation code. Our mission is to help teams focus on the logic and skip as many tedious or complex steps as possible - without sacrificing a developer’s ability to build custom logic.
The Problem
The majority of enterprise data engineering code deployed today is structured as a monolith, combining transformation logic with persisted database tables. Monolithic code lacks flexibility and often devolves into an unmanageable mess of spaghetti code as new source and target datasets are added over time. Software engineering has solved this problem by separating transformation logic from state. This separation, combined with concepts such as functional programming, microservices, and other principles, makes code easier to build, extend, and manage.
There are two primary reasons data engineering has lagged behind software best practices and principles:
Data transformation code by nature changes the state of external objects (database tables).
This side-effecting forces code to follow imperative computer science patterns rather than declarative ones.
Many data transformation tools are scoped at the table/dataset level and can’t be split into modular and composable functional code blocks.
To illustrate these points, let’s use SQL: the most popular data transformation language today. SQL DML is an immensely powerful scripting language that can handle very complex transformations. SQL SELECT statements are declarative in isolation: every SQL platform includes a query engine that translates a SQL statement into an optimal physical data execution plan, allowing developers to focus on logic.
However, SQL has some key traits that make it challenging to use for complex transformation development:
SQL was created as an interactive command language. It does not support key modern programming language features such as object-oriented or functional programming structures, libraries, or packages.
The minimal unit of transformation is a dataset (table), limiting the way transformation code can be structured, managed, and re-used.
Multi-statement queries with INSERT, DELETE, UPDATE, and other built-in statements are imperative and require external procedural scripting harnesses to ensure proper order of operations.
In addition to these key traits, SQL alone lacks structures or frameworks to prevent multiple statements or scripts from side-effecting the same object or table. This opens the door to imperative patterns and complex webs of state-dependent logic.
As illustrated above, every black-box of transformation logic not only operates at the table level, but multiple different transformation scripts can side-effect or modify a target table. Target tables can also be used as inputs to the same script, creating stateful “can only run once” or “must be run exactly twice” patterns in the processing chain.
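To make the stateful pattern concrete, here is a minimal sketch (using Python's built-in sqlite3 with an invented invoice table) of two scripts side-effecting the same target: the final state depends entirely on execution order.

```python
import sqlite3

# Two "scripts" side-effect the same table; the final state depends on
# the order in which they run -- an imperative, stateful pattern.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoice (id INTEGER, revenue REAL)")
con.execute("INSERT INTO invoice VALUES (1, 100.0)")

def script_a(c):
    # Doubles every revenue value in place.
    c.execute("UPDATE invoice SET revenue = revenue * 2")

def script_b(c):
    # Adds a flat fee of 5 in place.
    c.execute("UPDATE invoice SET revenue = revenue + 5")

script_a(con)
script_b(con)
value = con.execute("SELECT revenue FROM invoice").fetchone()[0]
# a-then-b yields 205.0; running b-then-a instead would yield 210.0.
```

Neither script is wrong in isolation, yet neither can be reasoned about without knowing every other script that touches the same table.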
Apache Spark took the first step towards declarative programming schemes by introducing the concept of immutable DataFrames (and later Datasets). Spark also introduced Scala and Python APIs that, with careful use, support declarative patterns.
Tools like dbt then furthered this trend by replacing DataFrames with dbt models and simplifying the code from Python or Scala to something more SQL-like. The primary tradeoff is that, given their SQL-like nature, transformations can only be defined at the table level.
By forcing the target data of each Spark function/dbt model to be immutable and protected from side effects by other functions/models, Spark and dbt push developers to think more declaratively when developing data pipelines and transformation code, resulting in improved pipeline architectures and easier-to-manage data platforms.
So if Spark and dbt have made it possible to be more declarative, why are so many data platforms still built as monoliths and devolve into spaghetti code?
Two main reasons:
The table scope of immutability is too broad, resulting in code duplication.
Developers mix and match different transformation types freely and custom design the pipeline framework and procedure independent of logic contained in other data flows.
TL;DR Modern data transformation frameworks like Spark and dbt do a great job helping teams build transformation code quickly, but fall short when large codebases need to be extended, managed, or augmented due to their wide scope of immutability and lack of structure defining when in the processing chain to perform specific types of transformation.
Breaking up the monolith
In order to introduce additional declarative patterns and functional coding best practices, the scope of immutability must be broken down to a lower level than tables.
The reason the minimum scope of immutability for both DataFrames and dbt models is a table is that data engineers most commonly think about transformations as set-based operations written in SQL. Written as a function, a SQL statement takes the shape f(T1, T2, …, Tn) → T, where each T represents an input or output table of the function.
With Spark and dbt, multiple immutable tables are passed into a function that produces a single immutable output table. While straightforward and useful for simple calculations and transformations, a single function's definition can potentially contain thousands of lines of code (or more!), making it unwieldy to develop, manage, debug, and modify. This means seemingly simple tasks such as adding a single column to a SQL pipeline can require months of development.
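In code terms, this table-in/table-out shape can be sketched as follows (a plain dict of column lists stands in for a DataFrame or dbt model; all names are invented):

```python
# A "table" here is a dict mapping column names to equal-length value
# lists, standing in for a Spark DataFrame or dbt model.

def enrich_invoices(invoice: dict, customer: dict) -> dict:
    """Table-level function: immutable tables in, one immutable table out.
    Adding even a single derived column means editing this one body."""
    names = dict(zip(customer["id"], customer["name"]))
    return {
        "customer_id": invoice["customer_id"],
        "revenue": invoice["revenue"],
        "customer_name": [names[i] for i in invoice["customer_id"]],
    }

invoice = {"customer_id": [1, 2, 1], "revenue": [100, 75, 50]}
customer = {"id": [1, 2], "name": ["Acme", "Globex"]}
result = enrich_invoices(invoice, customer)
```

The whole table is the unit of both input and output, so every change, however small, flows through the same single function body.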
Spark DataFrames and dbt models give developers the option to break these functions down into smaller and more manageable chunks. However, even when broken down to their most atomic levels, they still operate at the table level for input and output arguments:
While a step in the right direction, limiting developers to table-level functions as the minimum scope restricts their ability to encapsulate and compose certain types of logical transforms. In addition to this minimum-scoping problem, developers must design the structure and organization of these functions by balancing ease of development, performance, and consistency with platform-wide patterns.
While skilled developers and architects can manage this balance during a green-field build of a few hundred pipelines, deciding when or how to restructure a solution with thousands of pipelines in response to new requirements becomes an impossible equation to solve effectively without a different approach.
Grouping transforms by dimension
In order to build up a lower-level framework to address these problems with table-level functions, the table logic must be decomposed and reorganized into smaller parts using a smaller scope of operation (immutability).
Taking inspiration from linear algebra, table-level functions can be broken down by what dimension the component functional logic operates on: cell, column, or row. To maximize composability and downstream benefits, it is also important to ensure that the logic is broken down in a way that results in pure functions. This means that no other function or external process can affect the same output data (cell, column, or row, respectively).
Cell-pure Functions:
Starting with the lowest-level and most versatile group, cell-pure functions take a set of cells as arguments and generate a single value.
These can be most easily understood as “Excel-like” functions. While they are the most basic and atomic version of a pure function, they are not traditionally useful on their own in the context of data engineering, as they do not follow set-based logic or relational algebra.
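A minimal sketch (names invented): like a spreadsheet formula, a cell-pure function maps individual cell values to a single value, with no external state.

```python
def gross_margin(revenue: float, tax: float, refunds: float) -> float:
    # Cell-pure: the output depends only on the input cells -- no table
    # context, no external state, exactly like a spreadsheet formula.
    return revenue - tax - refunds

gross_margin(100.0, 10.0, 5.0)  # 85.0
```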
Column-pure and Row-pure Functions:
In order to make cell-pure functions more useful and applicable to data transformations, both the inputs and output must be contextualized within the structure of a table.
To do so, and continuing on our application of linear algebra to data engineering, one of the two dimensions (row or column) of the table must be held constant while the other is allowed to change. Applying cell-pure functions with this constraint creates a new type of pure function in the dimension of change.
Thus:
Column-pure functions are transformations where the number of rows does not change between input and output
Row-pure functions are transformations where the number of columns does not change between input and output
Column-pure Functions:
- Are always in the context of a single immutable starting table (T1)
- T1 is the table that is constrained to one dimension via no row changes
- Are made up of a cell-pure function(s) evaluated for every row in T1
- The output is stored as new immutable column(s) (X1..n) appended to T1, creating a new immutable table R
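These rules can be sketched in a few lines of Python (an illustrative model, not DataForge's implementation), with a table represented as a dict of column lists:

```python
def apply_column_pure(t1: dict, new_col: str, fn) -> dict:
    """Evaluate a cell-pure function for every row of T1 and append the
    result as a new column, producing a new table R. T1 is never mutated,
    and the row count is preserved by construction."""
    n_rows = len(next(iter(t1.values())))
    x = [fn({col: vals[i] for col, vals in t1.items()}) for i in range(n_rows)]
    return {**t1, new_col: x}

t1 = {"A": [1, 2], "B": [10, 20], "C": [5, 6]}
r = apply_column_pure(t1, "X1", lambda row: row["A"] + row["B"])
# r carries A, B, C unchanged plus X1 = [11, 22]; C played no part in X1.
```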
The usefulness of these column-pure functions starts to become apparent when the independence of the inputs and outputs is analyzed:
- Columns A, B, and C are all independent of each other
- X1 is dependent on A and B, but not C
With column-pure functions, transformation dependencies can be broken down to the columnar level rather than the table level. Additionally, the dependencies are implicit and transparent from the functional definition of the transformation, rather than a “black box” at the table level.
Name | Description | SQL |
---|---|---|
Alias | Rename a column | SELECT comment AS invoice_comments FROM sales.invoice |
Scalar Operation | Perform a calculation on a static variable | SELECT revenue * 100 AS revenue_100 FROM sales.invoice |
Multiple Column Operation | Perform a calculation using a selection of columns from a single row | SELECT revenue - tax - refunds AS "Gross Margin" FROM sales.invoice |
Aggregate | Use window function syntax to calculate an aggregate without changing grain | SELECT SUM(revenue) OVER (PARTITION BY region_id) AS revenue_total_by_region FROM sales.invoice |
Ranking | Use window function syntax to do cross-row calculations, comparing values and ranking the results | SELECT RANK() OVER (PARTITION BY region_id ORDER BY revenue DESC) AS revenue_rank_by_region FROM sales.invoice |
Another powerful feature of column-pure functions is their ability to be combined together without the need for intermediary storage or orchestration.
Extending our original example above, the result table R can now be treated as the starting table T1 for a new column-pure function:
In this second example:
- A, B, and C are independent columns
- X1 is dependent on A and B, but not C
- X2 is dependent on B and X1
- Transitively: X2 is dependent on A and B, but not C
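The chaining above can be sketched as follows (again an illustrative dict-of-columns model; helper and column names are invented):

```python
def add_column(table: dict, name: str, fn) -> dict:
    # Column-pure: compute one new column row-wise; input table untouched.
    n = len(next(iter(table.values())))
    col = [fn({k: v[i] for k, v in table.items()}) for i in range(n)]
    return {**table, name: col}

t1 = {"A": [1, 2], "B": [10, 20], "C": [5, 6]}
r = add_column(t1, "X1", lambda row: row["A"] + row["B"])    # X1 from A, B
r2 = add_column(r, "X2", lambda row: row["B"] * row["X1"])   # X2 from B, X1
# X2 transitively depends on A and B, never on C, and no intermediate
# table had to be persisted or orchestrated between the two steps.
```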
These combined examples highlight how developers can now think in columns rather than tables, and by maintaining pure functional transforms, key benefits emerge:
Logic can be written and managed at the column-level, while processing can be managed at the column, multi-column, or full table level for performance.
New logic can be branched from any intermediary function or column, preventing code duplication and promoting DRY principles.
New column-pure logic never impacts existing target data, eliminating the need for impact analysis or regression testing.
If an existing function must be modified, impact analysis is simple to calculate due to implicit functional chains.
Here is an alternative visual aid to help illustrate how new logic can be added to the existing codebase without modifying existing functions, thus eliminating regression testing:
Complex Column-pure Functions (JOINs)
Arguably the most powerful and important operator in SQL, the JOIN statement is complex and overloaded with many different types of functional transformations. To support JOIN operations within the context of column-pure functions, the different JOIN types (LEFT, INNER, etc.) must be segregated and the cardinality of each JOIN must be considered. This is because a JOIN statement can potentially change the number of rows, which would immediately disqualify it from being column-pure.
With the definition of column-pure including “number of output rows is determined by number of input rows (no grain change),” the starting table and direction of the JOIN are critically important in adhering to these constraints. The easiest way to approach this problem is to assume all JOINs are LEFT JOINs and that the number of output rows is defined by the first table argument (the table furthest to the “left”).
Now that multiple tables are involved, a sample data model is helpful for illustration:
Complex column-pure functions that include a join must:
Start with a set grain of data (the “left”-most table)
Define JOIN condition(s) as a boolean-typed column-pure function
Optionally apply a pre-calculation to the “right-most” table in the JOIN chain to ensure no new rows are created (e.g. aggregation)
Examples:
Name | Description | Right Table Row Function | SQL |
---|---|---|---|
Lookup via M->1 JOIN | Retrieve a column from another table. No additional calculation is needed due to the JOIN cardinality of M->1. | None | SELECT c.name AS customer_name FROM sales.invoice i LEFT JOIN sales.customer c ON i.customer_id = c.id |
Rollup via 1->M JOIN | Perform an aggregation on the right table to roll the data up to the grain of the left-most table, then do a 1->1 join to avoid changing the number of rows and retain column-purity. | Aggregate | WITH invoice_rollup AS (SELECT customer_id, SUM(revenue) AS revenue FROM sales.invoice GROUP BY customer_id) SELECT ir.revenue FROM sales.customer c LEFT JOIN invoice_rollup ir ON c.id = ir.customer_id |
These two examples show how even complex calculations and SQL statements can adhere to the restrictions of column-pure functions.
In the Rollup example, a WITH statement (Common Table Expression) is written first to calculate a rollup of the invoice table and generate a set at the same grain as the customer table. Now that the grains match, a JOIN can be executed without changing the number of rows in sales.customer (the left-most table) as part of the operation.
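This rollup pattern can be verified end-to-end with an in-memory SQLite database (sample rows are invented): two customer rows go in, and exactly two rows come out of the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE invoice  (customer_id INTEGER, revenue REAL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice  VALUES (1, 100), (1, 50), (2, 75);
""")

# The CTE rolls invoice up to customer grain; the LEFT JOIN is then 1->1,
# so the row count of customer (the left-most table) is preserved.
rows = con.execute("""
    WITH invoice_rollup AS (
        SELECT customer_id, SUM(revenue) AS revenue
        FROM invoice GROUP BY customer_id
    )
    SELECT c.name, ir.revenue
    FROM customer c
    LEFT JOIN invoice_rollup ir ON c.id = ir.customer_id
    ORDER BY c.id
""").fetchall()
# Two customers in, two rows out: [('Acme', 150.0), ('Globex', 75.0)]
```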
Row-pure Functions:
While column-pure functions can cover a large variety of different transformations, they only handle one dimension of the table.
Row-pure functions output a table with a different number of rows:
- Output is a table with more or fewer rows (a different grain)
- Output columns (schema) are identical to input columns
- No columns are added or removed
- Can be first-order (Union) or higher-order (Filter) functions
Examples:
Name | Description | SQL | Scope |
---|---|---|---|
Filter | Remove rows | SELECT * FROM sales.invoice WHERE revenue > 100 | All columns in sales.invoice |
Union | Add rows from another table | SELECT * FROM sales.invoice UNION ALL SELECT * FROM sales.invoice_history | All columns in sales.invoice (same columns as sales.invoice_history) |
Unlike column-pure functions, where the examples discussed above were just a small sample of different possible transformations, row-pure functions are much more limited in diversity, but are equally powerful constructs.
Because they require zero change in the column set (schema), row-pure functions are by definition limited to simply adding or removing rows.
Any SQL transformation that changes the number of rows in a table or result set can be decomposed into the two row-pure functions shown in the table above. For example:
INSERT is different syntax for UNION
DELETE is different syntax for FILTER
DISTINCT is a column-pure ranking window function followed by a FILTER
UPDATE is a DELETE followed by an INSERT
CROSS JOIN is UNION repeated in a loop
GROUP BY is DISTINCT combined with column-pure aggregate window functions
HAVING is a FILTER
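As one concrete decomposition, DISTINCT can be rebuilt exactly as described: a column-pure ranking step appends a row number per duplicate group, a row-pure FILTER keeps only rank 1, and a final projection drops the helper column (a sketch over a list-of-dicts table; names invented).

```python
def add_rank(rows, key):
    # Column-pure: append a running rank per duplicate group ("rn");
    # the number of rows does not change.
    seen = {}
    out = []
    for row in rows:
        seen[row[key]] = seen.get(row[key], 0) + 1
        out.append({**row, "rn": seen[row[key]]})
    return out

def filter_rows(rows, pred):
    # Row-pure: drop rows; the column set does not change.
    return [r for r in rows if pred(r)]

rows = [{"region": "east"}, {"region": "west"}, {"region": "east"}]
ranked = add_rank(rows, "region")
distinct = [
    {k: v for k, v in r.items() if k != "rn"}   # drop the helper column
    for r in filter_rows(ranked, lambda r: r["rn"] == 1)
]
# distinct == [{'region': 'east'}, {'region': 'west'}]
```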
Row-Pure Function Summary:
Row-pure functions operate at the table-level, but specifically avoid schema changes to improve composability with other row-pure functions.
Operations that change the number of rows are often blended with column-creating functions (such as GROUP BY), and must be decomposed into UNION and FILTER statements to identify the row-pure component(s) of the higher-order function.
Bring it all together:
With these two strictly defined functional types, all SQL transformation logic can be broken down into either column-pure or row-pure functions:
With the pure functions now defined and table-level transformations decomposed, the original problem statements can now be addressed:
Data engineering code is monolithic and hard to extend or modify once built due to its freeform procedural nature.
SQL is the gold standard for transformation code, but has some major limitations that make it difficult to follow software engineering best practices and principles.
Key findings from breaking down the logic:
Column-pure functions are easy to combine with each other and new logic can be added without the need for downstream regression testing.
Row-pure functions are easily composed with each other and should be separated from column logic to avoid table-level dependencies.
Unlike a new column-pure function, a new row-pure function creates an entirely new logical table (rather than just adding a column to an existing table) and could affect downstream processes.
Due to their minimized dependencies, both Column and Row-pure functions can be easily reused and integrated into various data pipelines, akin to the mixin pattern.
Thus, to make data engineering code as declarative as possible and avoid the monolithic patterns and architectures, the process is as follows:
Analyze transformation logic and decompose the relational algebra into column-pure and row-pure functions.
Group as many column-pure functions and row-pure functions together into as few processing stages as possible (each stage contains only one type).
The large majority of column-pure functions should be grouped together to maximize advantages and extensibility.
As new requirements or feature additions are made, take advantage of the functional nature of the pure processing stages and add more functions to an existing stage if columnar, or an additional branch stage at the very end for row-pure logic.
Take all logic and decompose it into column-pure and row-pure functions, eliminating any higher-order functions that combine row and column operations, such as GROUP BY.
Determine the fewest number of stages possible to process the datasets while grouping as many column-pure functions together as possible to maximize framework efficiency.
New column-pure logic can be added as a new function in the column-pure stage, with no downstream regression impact as none of the existing target columns are affected.
A new row-pure function naturally creates a new table definition, so a dedicated parallel processing stage is often the most efficient design and avoids any regression impact on existing processes.
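The staged design can be sketched as follows (an illustrative model, not DataForge's implementation): one stage holds all the column-pure functions, and a separate stage holds the row-pure filter.

```python
def column_stage(table: dict, column_fns) -> dict:
    # One stage holding every column-pure function; row count never changes.
    n = len(next(iter(table.values())))
    out = dict(table)
    for name, fn in column_fns:
        out[name] = [fn({k: v[i] for k, v in out.items()}) for i in range(n)]
    return out

def row_stage(table: dict, pred) -> dict:
    # A separate row-pure stage; the column set never changes.
    n = len(next(iter(table.values())))
    keep = [i for i in range(n) if pred({k: v[i] for k, v in table.items()})]
    return {k: [v[i] for i in keep] for k, v in table.items()}

invoices = {"revenue": [100, 40, 250], "tax": [10, 4, 25]}
staged = row_stage(
    column_stage(invoices, [("net", lambda r: r["revenue"] - r["tax"])]),
    lambda r: r["net"] > 50,
)
# Adding a new column function later touches only the column_fns list;
# existing columns and the row-pure stage are untouched.
```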
This simplified example shows how developers can use these concepts and frameworks to develop a better pipeline architecture that is easier to manage and change over time.
Rather than a web of spaghetti code and monolithic table-level custom pipeline flows, the majority of logic can be broken down into column-level expressions and standardized stages.
This allows teams to scale up their solution to thousands of pipelines without compromise - meaning the 1,000th new column in a target dataset is just as quick to build and deploy as the 1st.
At DataForge, we developed an enterprise data management platform around this framework for pipeline development.
To learn more about the extended framework built around column-pure and row-pure functions, check out Part 2 of the blog series.
Our open source product DataForge Core is a free way for engineers to explore pure functional development and start to think about data transformations declaratively rather than procedurally.
DataForge Cloud extends DataForge Core with additional framework components, a web-based development environment, automated infrastructure, and many more powerful features that allow developers to build, run, and manage functional data transformation code faster and easier than any alternative.
In the next blog in this technical deep-dive series, we will detail how DataForge applied these base concepts and extended the framework to handle complex incremental loads, circular functional dependencies, and other advanced patterns.