Introducing Complex Types with Extended Schema Evolution in DataForge Cloud 8.0
DataForge Cloud 8.0 introduces full support for struct and array complex types. For those unfamiliar with Apache Spark's complex types: a struct is akin to a table row containing one or more fields (similar to table columns) with specific data types. Structs can be nested at multiple levels, with fields containing other structs. Arrays represent collections of elements of a single type, such as integers or strings. Arrays of structs are particularly noteworthy because they essentially encapsulate an entire table within a single cell of a parent table.
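To make this concrete, here is a minimal PySpark sketch of these types; the field names (customer_id, address, orders) are purely illustrative and not part of any DataForge schema:

```python
# Minimal sketch of Spark complex types.
# "address" is a struct (a nested row); "orders" is an array of structs,
# effectively a small table stored inside a single cell of each customer row.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
)

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("customer_id", IntegerType()),
    StructField("address", StructType([             # struct: nested "row"
        StructField("city", StringType()),
        StructField("zip", StringType()),
    ])),
    StructField("orders", ArrayType(StructType([    # array<struct>: nested "table"
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
    ]))),
])

df = spark.createDataFrame(
    [(1, ("Austin", "78701"), [("A-100", 25.0), ("A-101", 40.5)])],
    schema,
)
df.printSchema()
```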
Common Uses
These types are widely used in scenarios involving messaging, events, APIs, and other systems generating semi-structured (JSON) data. Despite their power, working with nested struct types has historically been complex, especially when managing evolving schemas that include multi-level nested arrays of structs commonly found in many messaging protocols and API responses.
Our Approach
At DataForge, our philosophy for managing schemas has always been KISS + YAGNI: keep it simple, and don't build what you aren't going to need. DataForge Cloud automates most data type decisions, from detecting the data type during ingestion and capturing it for each columnar transformation to mapping it to the corresponding output type. We've applied the same principles to complex types by:
Automatically capturing schemas of raw attributes during ingestion
Determining transformed column types during rule validation
Mapping output column types to appropriate complex types, like Snowflake’s VARIANT
This approach helps users navigate complex attribute hierarchies when defining transformations.
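As a rough illustration in plain PySpark (not DataForge internals, and with a hypothetical event payload), this is what automatic schema capture and hierarchy navigation look like; DataForge performs the equivalent steps, plus the mapping to output types such as Snowflake's VARIANT, behind the scenes:

```python
# Sketch: Spark infers struct and array<struct> types from raw JSON,
# and dot notation / explode are used to navigate the hierarchy
# when defining a transformation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

events = spark.read.json(spark.sparkContext.parallelize([
    '{"event_id": 1, "user": {"id": 7, "name": "Ada"},'
    ' "items": [{"sku": "X1", "qty": 2}, {"sku": "X2", "qty": 1}]}'
]))

events.printSchema()  # nested struct and array<struct> types inferred automatically

# Navigating the captured hierarchy to define a flattened transformation.
flat = (events
        .select("event_id",
                col("user.name").alias("user_name"),
                explode("items").alias("item"))
        .select("event_id", "user_name", "item.sku", "item.qty"))
flat.show()
```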
Schema Evolution
Handling schema evolution for complex types adds another layer of difficulty: Spark complex types are rigid and cannot be cast as freely as simple types. Recognizing this, we extended our schema evolution model to accommodate complex types, with five new configuration options that control schema evolution for each DataForge Cloud source.
The Upcast option is particularly powerful for handling JSON data with sparse attributes, enabling seamless backfilling of incoming data into existing column types when schemas are compatible.
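As a conceptual sketch of what an Upcast-style backfill means (plain PySpark with made-up field names, not DataForge's actual implementation), sparse incoming JSON can be read against the existing, wider column types so that missing nested attributes land as NULLs rather than forcing a breaking schema change:

```python
# Sketch: backfilling sparse incoming JSON into an existing, compatible schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, LongType, StringType, ArrayType
)

spark = SparkSession.builder.getOrCreate()

# Existing (target) column types, including nested complex types.
existing = StructType([
    StructField("event_id", LongType()),
    StructField("user", StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])),
    StructField("tags", ArrayType(StringType())),
])

# The new batch is sparse: user.name and tags are absent.
incoming = spark.sparkContext.parallelize(
    ['{"event_id": 2, "user": {"id": 9}}']
)

# Reading with the existing schema backfills the missing attributes as NULL.
backfilled = spark.read.schema(existing).json(incoming)
backfilled.show(truncate=False)
```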
These options give DataForge Cloud users fine-grained control over schema changes. The default settings of "Add, Remove, Upcast" work for most scenarios, with additional options providing further control.
Future Enhancements
We plan to introduce the following features to enhance complex type management and schema evolution:
A sub-source feature for working with nested array<struct> attributes in the same way as regular DataForge sources.
A manual "Merge" option to consolidate multiple column clones with compatible schemas.
These enhancements will complete our vision for managing complex types and schema evolution in DataForge Cloud.