Introducing Stream Processing in DataForge: Real-Time Data Integration and Enrichment

At DataForge, we continue to innovate by simplifying the complex and automating the tedious. We are excited to announce the release of Stream Processing, enabling users to combine both streaming and batch data seamlessly. This powerful feature provides real-time data enrichment and processing, allowing for the creation of dynamic, scalable data pipelines.

With these new capabilities, DataForge strengthens its alignment with the principles of Lambda Architecture, a robust framework for building data processing systems that can handle both real-time (speed layer) and historical (batch layer) data in parallel. By integrating streaming data with batch datasets, DataForge provides the foundation for comprehensive real-time analytics, significantly enhancing your ability to manage, analyze, and react to fast-moving data.

Stream Processing: Key Features

DataForge’s stream processing capabilities include:

  • Kafka and Streaming Delta Table Integration: Seamlessly ingest data from Kafka topics and streaming Delta tables into the DataForge platform.

  • Batch Data Enrichment: Enrich real-time data streams with historical batch data already residing in the DataForge Managed Lakehouse, allowing comprehensive, up-to-the-moment insights.

  • Downstream Data Processing: Write enriched data back to Kafka topics and Delta tables for real-time consumption by downstream systems.

Building a Lambda Architecture with DataForge

Lambda Architecture is based on the idea that real-time systems benefit from both a speed layer (handling real-time data) and a batch layer (processing large volumes of historical data). With DataForge’s new stream processing capabilities, you can easily implement both layers, allowing for fast responses to new data while simultaneously maintaining the accuracy and depth of batch processing.
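To make the two layers concrete, here is a minimal plain-Python sketch of the Lambda pattern described above. Plain dicts stand in for DataForge's managed tables, and the names (batch_view, speed_view, order totals) are illustrative, not the actual DataForge API:

```python
from collections import defaultdict

# Batch layer: periodically recomputed totals over all historical orders.
historical_orders = [
    {"item_id": "A", "amount": 10.0},
    {"item_id": "B", "amount": 4.5},
    {"item_id": "A", "amount": 2.5},
]
batch_view = defaultdict(float)
for order in historical_orders:
    batch_view[order["item_id"]] += order["amount"]

# Speed layer: incremental totals over orders that have arrived since
# the last batch recomputation.
speed_view = defaultdict(float)

def on_stream_event(order):
    speed_view[order["item_id"]] += order["amount"]

on_stream_event({"item_id": "A", "amount": 1.0})

# Serving: a query merges both views for an up-to-date answer.
def total_for(item_id):
    return batch_view[item_id] + speed_view[item_id]

print(total_for("A"))  # 13.5: 12.5 from the batch view plus 1.0 from the stream
```

The point of the pattern is that the batch view stays accurate and cheap to rebuild, while the speed layer keeps answers current between rebuilds.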

In the video above, Co-founder and Lead Software Engineer Joe Swanson showcases how to build a pipeline that exemplifies this architecture. We pulled streaming order data from Confluent’s Apache Kafka Cloud, enriched it with batch data already stored in DataForge, and output the results back into Kafka, ensuring real-time availability for downstream consumers.

Key Steps in the Demo:

  1. Real-Time Data Ingestion: DataForge ingested streaming order data from a Kafka topic, with each message containing fields such as order ID, time, and item ID.

  2. Enriching with Batch Data: Using DataForge’s rules engine, the real-time stream was enriched with item details (item cost, type) from a preloaded batch dataset. The transformation logic was implemented as a simple, automated rule.

  3. Output to Kafka: The enriched order data, now containing additional information from the batch layer, was written back to Kafka for real-time processing by downstream systems.
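The three demo steps can be sketched in plain Python, with in-memory lists and dicts standing in for the Kafka topics and the batch item table. The field names (order_id, item_id, item_cost, item_type) follow the demo; everything else is illustrative:

```python
# 1. Real-time ingestion: messages consumed from the orders topic.
incoming = [
    {"order_id": 1, "time": "2024-05-01T12:00:00Z", "item_id": "A"},
    {"order_id": 2, "time": "2024-05-01T12:00:05Z", "item_id": "B"},
]

# 2. Batch enrichment: item details already loaded in the lakehouse.
items = {
    "A": {"item_cost": 9.99, "item_type": "widget"},
    "B": {"item_cost": 4.50, "item_type": "gadget"},
}

def enrich(order):
    # The enrichment rule: join each streaming order to its batch item row.
    return {**order, **items[order["item_id"]]}

# 3. Output: enriched records produced back to a Kafka topic
# (collected in a list here).
enriched_topic = [enrich(msg) for msg in incoming]
print(enriched_topic[0]["item_cost"])  # 9.99
```

In DataForge the join itself is expressed as a rule rather than hand-written code, but the data flow is the same: stream in, look up against batch, stream out.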

Technical Advantages of a Lambda Architecture

By leveraging a Lambda Architecture, you unlock several benefits:

  • Real-Time and Batch Unification: A single tool and technology stack serves both the real-time (speed layer) and historical (batch layer) data, making it easy to blend data from any integration point and to match delivery to the target application's demands.

  • Full Lifecycle ML/AI Solution: Use batch data processing for feature engineering, model training, and evaluation. Then leverage stream processing combined with Databricks Model Serving to easily serve real-time AI solutions in production, all in one platform.

  • Simplified Operations: With all the technologies under one roof, it becomes easy to track, monitor, and maintain all data sets and pipelines across the enterprise, regardless of their real-time or batch nature.

Why DataForge's Stream Processing Matters

DataForge’s integration of stream processing provides a fully managed Lambda Architecture without custom staging or processing design. By connecting real-time streams to historical batch data, DataForge enables organizations to act on up-to-date insights and expand into advanced, high-performance analytics use cases.

Technical Highlights:

  • Accelerated Development: Build a streaming data solution integrated with your Lakehouse in minutes using easy-to-use configuration.

  • Zero-Effort Infrastructure: DataForge automates cluster management in Databricks, dynamically provisioning job clusters for stream processing and enrichment, minimizing the need for manual intervention.

  • Flexible Rules Engine: DataForge’s intuitive rules engine simplifies the complexity of data transformation, allowing for a common coding structure and patterns across both streaming and batch transformations.

Build Your Lambda Architecture with DataForge

The Stream Processing Integration makes it easier than ever to implement Lambda Architecture in your data pipelines, providing both real-time responsiveness and batch data accuracy. Whether you are working with IoT data, live transaction streams, or operational analytics, DataForge gives you the tools to handle data at any scale.

Get started with stream processing in DataForge today and transform your data infrastructure to meet the demands of modern real-time data processing.

For technical documentation, visit our developer portal, academy, or contact our team for a tailored demo.
