Engineering Choices and Stage Design with Traditional ETL
Constructing a dependable BI data model through traditional ETL processes requires careful attention to each phase of data transformation. Engineering teams face several critical design choices that impact the model’s stability, performance, and scalability. Recently, Joe Swanson, co-founder and lead developer at DataForge, walked through a data modeling example using the Coalesce platform. His demonstration highlighted essential design choices involved in building a robust data pipeline. Here, we analyze these engineering decisions and their implications.
1. Specifying Data Types from Untyped Files
One of the first design choices in building a BI data model is whether to define data types at the earliest stages. Joe advocates for establishing "type stages" for raw data tables, where data types are validated and set at the outset. Assigning types early removes repeated casting downstream, reduces the error-handling logic required later, and simplifies subsequent transformations. Without this initial type-setting step, downstream processes face an increased risk of casting errors and more complex data validation. While introducing type stages adds a processing layer, it improves pipeline durability and ensures data integrity for later stages.
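To make this concrete, here is a minimal SQL sketch of a type stage, assuming a warehouse such as Snowflake and a hypothetical raw landing table RAW_ORDERS whose columns all arrive as strings (the table and column names are illustrative, not taken from Joe's demo):

```sql
-- Hypothetical type stage: cast untyped string columns once, so no
-- downstream stage ever has to repeat the casts or handle bad values.
CREATE TABLE STG_ORDERS_TYPED AS
SELECT
    CAST(order_id        AS INTEGER)       AS order_id,
    CAST(customer_id     AS INTEGER)       AS customer_id,
    TRY_CAST(order_date  AS DATE)          AS order_date,    -- TRY_CAST yields NULL instead of failing on malformed values
    TRY_CAST(order_total AS DECIMAL(12,2)) AS order_total
FROM RAW_ORDERS;
```

TRY_CAST is available in engines such as Snowflake and SQL Server; others offer similar safe-cast functions (for example, SAFE_CAST in BigQuery). The point is the same everywhere: bad values surface in one place, at the type stage, rather than in every downstream query.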
2. Grouping Data to Achieve Consistent Granularity
Engineering teams frequently encounter source tables with different levels of granularity, requiring careful consideration of data aggregation methods. For example, customer data might be available at a delivery-detail level instead of the invoice-detail level, necessitating aggregation for consistent granularity. Joe demonstrated the use of a "group stage" to align customer and delivery data, which facilitates accurate joins and preserves data fidelity. Here, engineers face a trade-off: creating a separate group stage enhances modularity and reuse but can add processing steps. Consolidating aggregation logic into fewer steps can reduce complexity but at the cost of flexibility.
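A group stage of this kind might be sketched as follows, with hypothetical table and column names, rolling delivery details up to invoice-line grain before the join:

```sql
-- Hypothetical group stage: aggregate delivery details to invoice-line grain
-- so the later join to invoice lines is one-to-one rather than one-to-many.
CREATE TABLE STG_DELIVERIES_GROUPED AS
SELECT
    invoice_id,
    invoice_line_id,
    SUM(delivered_qty) AS delivered_qty,
    MAX(delivery_date) AS last_delivery_date
FROM STG_DELIVERIES_TYPED
GROUP BY invoice_id, invoice_line_id;
```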
3. Managing Multi-Item Orders Through Unpivoting
In many data models, complex record structures, such as rows containing multiple items, complicate aggregation and reporting. Joe showcased an unpivoting technique in which each item in a multi-item order becomes its own row, creating a consistent granularity that simplifies calculations. This lets BI tools produce accurate results without performing complex transformations of their own. However, unpivoting introduces additional stages and consumes more processing resources. Engineers must weigh the finer granularity against processing efficiency: unpivoted structures support analysis better but add to the pipeline's complexity.
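One portable way to sketch this unpivot, using hypothetical column names and plain UNION ALL rather than a vendor-specific UNPIVOT clause, is to project each item column into its own row:

```sql
-- Hypothetical unpivot stage: an order row in STG_ORDERS_WIDE carrying up to
-- three item columns becomes one row per item, so every downstream stage
-- works at the same grain.
CREATE TABLE STG_ORDER_ITEMS AS
SELECT order_id, 1 AS item_slot, item_1 AS item_id
FROM STG_ORDERS_WIDE WHERE item_1 IS NOT NULL
UNION ALL
SELECT order_id, 2 AS item_slot, item_2 AS item_id
FROM STG_ORDERS_WIDE WHERE item_2 IS NOT NULL
UNION ALL
SELECT order_id, 3 AS item_slot, item_3 AS item_id
FROM STG_ORDERS_WIDE WHERE item_3 IS NOT NULL;
```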
4. Choosing When to Materialize Stages for Data Stability
Materializing intermediate stages (creating physical tables to store transformed data) adds stability to the data pipeline, allowing validated data to serve as a reliable source for downstream processes. Joe highlighted the choice to materialize the type and aggregation stages so that cleaned data remains consistently available, which strengthens later transformations. However, fully materializing every stage increases storage costs and can add maintenance overhead. Virtual stages, on the other hand, keep data transient, passing it through without permanent storage. That approach saves storage but forces the transformation to be recomputed on every access, which becomes expensive in high-use scenarios. Engineers must weigh access patterns against computational cost to make the most effective choice.
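In SQL terms this choice often reduces to persisting a stage as a table or defining it as a view over the same logic. Reusing the hypothetical type stage from earlier, the two options look like this:

```sql
-- Materialized stage: the cleaned result is stored, so downstream queries read
-- precomputed rows, at the cost of storage and a refresh step on new loads.
CREATE TABLE STG_ORDERS_TYPED AS
SELECT CAST(order_id AS INTEGER)              AS order_id,
       TRY_CAST(order_total AS DECIMAL(12,2)) AS order_total
FROM RAW_ORDERS;

-- Virtual stage: the same logic as a view costs no storage, but the casts are
-- re-executed every time a downstream query touches it.
CREATE VIEW STG_ORDERS_TYPED_V AS
SELECT CAST(order_id AS INTEGER)              AS order_id,
       TRY_CAST(order_total AS DECIMAL(12,2)) AS order_total
FROM RAW_ORDERS;
```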
5. Balancing Modularity and Performance
One overarching engineering design choice in ETL pipeline development is balancing modularity against performance. Joe’s example emphasized how separating tasks into discrete stages (e.g., typing, grouping, unpivoting) improves modularity and debugging. However, each additional stage introduces processing demands and can increase the pipeline’s maintenance requirements. Consolidating logic by using common table expressions (CTEs) can reduce the number of stages, streamlining operations but potentially complicating troubleshooting and limiting flexibility. Engineering teams must evaluate these trade-offs in light of data complexity, operational load, and resource constraints.
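As a rough illustration (names remain hypothetical), the typing and grouping steps sketched above could be collapsed into a single statement with CTEs, trading two independently inspectable stages for one larger query:

```sql
-- Consolidated alternative: typing and grouping live inside one statement.
-- Fewer objects to build and schedule, but the intermediate typed result can
-- no longer be inspected, reused, or materialized on its own.
CREATE TABLE STG_DELIVERIES_GROUPED AS
WITH typed AS (
    SELECT
        CAST(invoice_id      AS INTEGER)   AS invoice_id,
        CAST(invoice_line_id AS INTEGER)   AS invoice_line_id,
        TRY_CAST(delivered_qty AS INTEGER) AS delivered_qty
    FROM RAW_DELIVERIES
)
SELECT invoice_id, invoice_line_id, SUM(delivered_qty) AS delivered_qty
FROM typed
GROUP BY invoice_id, invoice_line_id;
```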
Summary
These engineering design choices are fundamental in crafting BI data pipelines that balance simplicity, performance, and adaptability. By critically assessing each choice based on organizational needs and technical requirements, engineering teams can build scalable, resilient data models that consistently deliver accurate insights and support evolving analytics demands.
Part 2: Eliminating Stage Design with DataForge
In part 2 of this blog series, Joe will explore how the DataForge framework eliminates or automates these many decision points, enabling developers to spend more time building core logic and less time worrying about code reuse, performance, or architecture.
Avoid these design decisions and common pitfalls by signing up for a 30-day trial of DataForge to explore the future of data engineering.
Star and follow our open-source DataForge library on GitHub to stay updated on new features and enhancements to DataForge Core.