Mastering Schema Evolution & Type Safety with DataForge

Introduction

According to a recent poll in the Reddit data engineering community, the most common cause of production data pipeline failures is source data schema changes: attributes being added or removed, or their data types changing.

SQL is the leading language for data transformation in modern data engineering and is widely supported by most contemporary data tools. However, as discussed in a previous blog, despite its power, SQL has some significant limitations. Today, we’ll take a closer look at two often-overlooked but crucial issues: Type Safety and Schema Evolution. This blog will explore these challenges, and in the linked video, I’ll demonstrate how DataForge addresses and solves them.

Type Safety

Type Safety is an important part of Compile-time validation. Together they minimize the risk of runtime crashes and exceptions.

Almost all modern programming languages support type safety and strong typing. While SQL is a strongly-typed language, it typically does not enforce type safety, for several reasons:

  • SQL scripts often don’t have a distinct compilation phase. Compilation occurs dynamically, immediately prior to execution of the script, so any syntax or type errors surface only at execution time.

  • Variable and column data types used by SQL scripts are highly dependent on the schema of the tables, and it’s often impossible to perform type checking without having a schema of each table referenced by the script.

    • Example: it’s impossible to validate a SELECT * query without the schema of the table.

  • The SQL expression type is also dependent on the data contained in the table. Consider this example in Spark SQL:

price * quantity,  where price is DECIMAL(4,2) and quantity is INT

Depending on the values contained in the quantity column, the expression result type may vary between DECIMAL(6,2) and DECIMAL(15,2) when executing on different batches of data.

This means the result data type of a SQL query or transformation can change unpredictably. The typical way to handle this is to apply a manual CAST to the result of the transformation. Unfortunately, this is not only cumbersome, but it also does not fully address the problem, because the CAST target type may not be compatible with the result type of the latest batch of data. Even worse, in Spark SQL specifically, a failed CAST silently produces a NULL value for that row rather than failing the query by default.
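A minimal sketch of this behavior, assuming a hypothetical orders table with price DECIMAL(4,2) and quantity INT, and Spark’s legacy non-ANSI mode (the default in many runtimes):

    -- The typical manual workaround: pin the result type with an explicit CAST
    SELECT CAST(price * quantity AS DECIMAL(6,2)) AS order_total
    FROM orders;

    -- If a batch arrives where price * quantity no longer fits in DECIMAL(6,2),
    -- e.g. 99.99 * 9999, the CAST silently returns NULL for that row instead of
    -- failing the query.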

These unsafe type changes lead to cascading failures when the changed attribute is referenced in downstream transformations. These issues are notoriously difficult to debug and fix because changes may flow through long chains of transform and store operations. 

Schema Evolution

Modern data platforms live in a permanent state of change: new datasets are constantly added, and existing dataset schemas and business logic change frequently. These changes require a data transformation pipeline to manage Schema Evolution for each dataset it stores; otherwise, the pipeline will fail when writing to intermediate or target tables.

Here are common column or data type changes that need to be addressed by a schema evolution process:

  • New column added to the dataset

  • Column removed from the dataset

  • Data type of the existing column has changed

These changes may be caused by schema changes in the most recent batch of data, or by business logic changes that require a data type change in one of the existing rules.
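To see why this is tedious to handle by hand, consider the simplest case: a new column arriving in the batch typically requires an explicit DDL change on every table that stores the dataset. A minimal sketch in Databricks SQL, assuming a hypothetical Delta table named hub_orders:

    -- Add the new column to the stored schema; existing rows read as NULL
    ALTER TABLE hub_orders ADD COLUMNS (color STRING);

The same change must then be repeated for every intermediate and downstream table that carries the attribute.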

Note that a data type change can also include changes to the schema of a nested, complex data type column, e.g. STRUCT. It is for this reason that Type Safety and Schema Evolution are closely linked and need to be considered as a whole. In my experience, these two linked issues represent one of the most tedious aspects of maintaining data pipelines. The problem is exacerbated by the number of downstream schemas that need to change: multiple layers of the data lake medallion architecture, intermediate persisted stages, and all impacted downstream datasets.

After decades of frustration with this common challenge, we asked ourselves at DataForge: how could we simplify and automate the schema evolution process? Is there a set of common patterns we can pre-build and re-use depending on the type of evolution?

The DataForge Approach

First, we need to implement compile/design-time Type Safety for each columnar transformation (Rule). This requires, at design time:

  1. Validating that all references used by expressions are valid: Hub (data lake table) names, Raw Attribute (source column) names, and Rule (calculated column) names.

  2. Checking the DataForge columnar expression syntax by evaluating the expression on the target SQL platform (Databricks SQL in our case).

  3. Determining the transformation’s result data type and saving it to a metadata repository separate from the lake/database schema.

  4. Recursively querying all downstream attributes and transformations that reference the current attribute in this metadata repository and performing the same checks.

Only when all of the above validations succeed do we allow the developer to save the Rule and update the logic.

This process ensures that each Rule data type will never change inadvertently, and any allowed changes are persisted downstream.
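As an illustration of steps 2 and 3 (not DataForge’s internal implementation), the target SQL engine itself can report an expression’s validity and result type without reading any data. A minimal sketch on Databricks SQL, using a hypothetical hub_orders table and rule expression:

    -- Returns the output column name and its result data type,
    -- without scanning the table
    DESCRIBE QUERY SELECT price * quantity AS extended_price FROM hub_orders;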

Next, we looked at the different types of changes that can occur and categorized them, thus creating dataset-specific options to handle the schema evolution triggered by schema changes in the ingested data batch.

Each configuration category can be either:

  • Enabled: Performs the specific action described below.

  • Disabled: Stops ingestion and generates an error. This setting is typically used in highly controlled environments, as it requires human approval before ingesting any new attribute.

The available options and the actions they perform when enabled are:

  • Add: Automatically adds the new column to the schema of the dataset, across all layers of the data lake. Missing values are backfilled with NULLs.

  • Remove: Fills columns that don’t exist in the batch with NULL values.

  • Upcast: When the data types of the attribute in the batch and in the Hub differ, and the incoming batch data type is a subtype of the Hub data type (it can be cast without loss of data), the batch attribute is automatically upcast to the Hub dataset data type. Example: the Hub type is LONG and the batch type is INT; the INT values are upcast and stored as LONG.

  • Extend: When the data types of the Raw Attribute in the batch and in the Hub differ, and the Hub data type is a subtype of the batch data type, the dataset attribute is altered to the batch data type, automatically upcasting all existing values. All downstream rules and mappings are validated and altered as well. Example: the Hub type is INT and the batch type is LONG; the INT dataset attribute is ALTERed to LONG.

  • Clone: When the data types of the attribute in the batch and in the data lake are incompatible (they cannot be converted without data loss), a clone of the attribute is created in the dataset with a new alias. This enables data engineers to handle the new attribute type via conditional rules (e.g. CASE or COALESCE).
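For example, suppose an incompatible change turns a numeric product_code attribute into a free-form string, so the new values land in a cloned attribute. A hypothetical sketch of a conditional Rule that reconciles the two (the column names and the clone’s alias are illustrative, not names DataForge guarantees):

    -- Prefer the original numeric attribute, falling back to the cloned string attribute
    COALESCE(CAST(product_code AS STRING), product_code_str)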

Upcast and Extend for Complex Types

For complex/nested data types (STRUCT, ARRAY) the Upcast/Extend option applies the schema evolution parameters recursively to each nested attribute.

To illustrate this, let’s use a column named product. In the latest batch of data, its type differs from the type defined in the dataset: the Hub schema lacks the color attribute present in the batch. After ingestion, the system applies the Upcast/Extend settings recursively, adds color to the stored struct (populated with NULL where no value exists), and stores the result in the existing table.

Hub data type:

  STRUCT<
    id INT,
    product_name STRING,
    price DECIMAL(9,2)
  >

Hub value:

  {
    id: 1,
    product_name: 'Bicycle',
    price: 1250.00
  }

Batch data type:

  STRUCT<
    id INT,
    product_name STRING,
    price DECIMAL(9,2),
    color STRING
  >

Batch value:

  {
    id: 1,
    product_name: 'Bicycle',
    price: 1250.00,
    color: null
  }
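For reference, a manual Spark SQL equivalent of this recursive alignment would rebuild the struct and supply the missing nested field explicitly. This is only a sketch of the underlying operation; DataForge derives and applies it automatically from the dataset’s Upcast/Extend settings, and the column names come from the example above:

    -- Rebuild the product struct so it matches the wider schema,
    -- filling the missing nested attribute with NULL
    named_struct(
      'id', product.id,
      'product_name', product.product_name,
      'price', product.price,
      'color', CAST(NULL AS STRING)
    ) AS product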

With DataForge, we have simplified schema evolution strategies into easy-to-understand, configurable settings. This saves you from having to custom-build this logic for every dataset and protects you from expensive failures caused by changing data structures. You can select exactly which types of changes are allowed on a per-dataset basis while also ensuring that downstream logic continues to operate correctly.

Conclusion

DataForge provides customizable schema evolution functionality combined with robust, compile-time type safety.

This results in the following advantages:

  • Massively improves pipeline reliability, reducing time developers spend debugging code

  • Speeds up development by automating what would otherwise require tedious and time-consuming custom code

  • Enforces standard architectural patterns for the entire data lakehouse

Take advantage of this innovative approach and accelerate your development by starting a 30-day trial of DataForge Cloud today!
