DataForge Releases Version 7.1.0
It is now easier than ever to transform, process, and store your data to create your dream Data Intelligence Platform
DataForge is on a mission to simplify and eliminate the tedious Data Engineering tasks required to manage data pipelines. In support of that mission, DataForge 7.1.0 is here with a revamped user interface, new features, and developer productivity enhancements!
At DataForge, we make data management, integration, and analysis faster and easier for Databricks users. Our proprietary technology is built to ingest, consolidate, and process raw data from any source. As the only data engineering architecture extension to the Databricks platform, DataForge helps organizations solve integration hurdles quickly without compromising customization.
The DataForge team is committed to simplifying the data engineering workflow for Databricks users. To that end, this release includes several key enhancements.
DataForge is proud to support Unity Catalog in Databricks natively. Data Engineers can now ingest data directly from tables stored in Unity Catalog.
Data discovery sucks! DataForge is excited to release Automatic Keys and Relations for database connections to make the process easier. By scanning the information schema and metadata, DataForge can now generate reusable ETL code automatically. Data Engineers can skip the data discovery process and immediately ingest and process complex source system models - without manually re-defining tables and join conditions.
See DataForge in action:
This demo is best viewed on a computer rather than mobile.
For more details, please see the release notes below or visit DataForgeLabs.com.
Table of Contents
Revamped User Interface
Automatic Keys and Relations for Database Connections
Unity Catalog Ingestion and Output
Simplified Rule Expressions for Relation Traversals
Enhanced Project Import/Export
Project variables and options improve the developer experience
Watermark Column for Custom/None Refresh Incremental Loads
New Agent Authentication Protocol (2.0) and MSI download button
Optional Read Replica for Postgres Metadata Database
Revamped User Interface
DataForge workspaces have a new user interface with revamped colors and tools for developers.
Color updates guide the eye to where attention is needed. Following the DataForge color palette, users will see a familiar layout on pages but with enhanced color schemes that draw the eye toward warnings and places of action.
Relation graphs have been modified to enhance the user experience. When users view relations in graph form, the graph now displays from bottom to top. As nodes are expanded or collapsed, the graph re-renders to optimize visibility.
Source statuses on the main Source page now sort in order of actions needed. Clicking the Status column will sort the statuses so the user is immediately drawn to any sources with failures and warnings that should be reviewed.
Processing pages now jump to the dates that processes were run, so users can quickly navigate backward or forward using the date selection without wasting time scrolling through dates with no processes.
Automatic Keys and Relations for Database Connections
DataForge is committed to speeding up and easing the developer workflow. With database connections, users can choose what metadata to collect from the source system, managed by two new connection parameters:
Metadata Refresh (includes three options):
Tables and Keys (default) collects the most granular information for each table
Tables collects only table/view names
None disables metadata collection for the connection
Metadata Schema Pattern (optional):
Specifies a LIKE pattern that filters which schemas are included in metadata collection
When the Tables and Keys collection option is enabled on the Metadata Refresh parameter, the Connection Metadata tab will show Primary Keys and Referenced Tables for each table. Users can bulk create sources for any tables needed, as well as referenced tables recursively, and automatically create relations between the sources in just a few clicks. When bulk creating, users have options to change the source name patterns and trigger initial data pulls on all sources.
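For context, the metadata gathered by a Tables and Keys refresh is conceptually similar to what the sketch below pulls from a source database's information schema. This is only an illustration of the idea, not DataForge's implementation; the driver, connection string, and schema pattern are hypothetical, and DataForge performs this discovery for you.

    # Illustrative sketch: the kind of primary-key metadata a "Tables and Keys"
    # refresh needs from a relational source. Driver and connection are hypothetical.
    import psycopg2  # any DB-API driver works similarly for its own database

    conn = psycopg2.connect("dbname=sales_db user=reader")  # hypothetical source database
    with conn.cursor() as cur:
        cur.execute("""
            SELECT tc.table_schema, tc.table_name, kcu.column_name
            FROM information_schema.table_constraints tc
            JOIN information_schema.key_column_usage kcu
              ON tc.constraint_name = kcu.constraint_name
             AND tc.constraint_schema = kcu.constraint_schema
            WHERE tc.constraint_type = 'PRIMARY KEY'
              AND tc.table_schema LIKE %s  -- corresponds to the Metadata Schema Pattern
        """, ("sales%",))
        for schema, table, column in cur.fetchall():
            print(schema, table, column)  # one row per primary key column

A similar query against the referential constraints yields the referenced tables and foreign keys that drive the automatically created relations.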
Unity Catalog Ingestion and Output
DataForge is proud to natively support Unity Catalog in Databricks. Users can now ingest data directly from tables stored in Unity Catalog. Data can also be output to Unity Catalog through the Delta table output type. Unity Catalog requires a new connection to be created with the catalog saved on the connection. By default, all connections will continue to use the Hive Metastore unless otherwise specified.
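For a rough picture of what this enables inside Databricks, a Unity Catalog table is addressed with the three-level catalog.schema.table namespace. The sketch below is a minimal PySpark illustration with hypothetical catalog, schema, and table names, not DataForge-specific code:

    # Minimal sketch on a Databricks cluster with Unity Catalog enabled;
    # catalog/schema/table names below are hypothetical.
    df = spark.read.table("main.sales.orders")  # ingest directly from a Unity Catalog table
    df.write.mode("append").saveAsTable("main.analytics.orders_clean")  # Delta table output to Unity Catalog
    # Connections without a catalog configured continue to resolve to the Hive Metastore by default.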
Simplified Rule Expressions for Relation Traversals
Relations and rule expressions can be hard to navigate when there is a long chain of relations. DataForge now takes this difficulty out of play by simplifying rule expressions to just the target source name in brackets, without writing the relation chain by hand. Relations and relation chains are expanded into an Expression Parameters section below the rule expression, making relation chains legible and easy to update with a drop-down. After typing [ into the expression, DataForge lists all sources reachable from the [This] source via active relations.
Pre 7.1 Rule Expression: [This]~{Relation Name 1}~[Source 2]~{Relation Name 2}~[Source 3].attribute
Post 7.1 Rule Expression: [Source 3].attribute (relations shown in expression parameters)
Once the user selects the destination source, DataForge will pick the best relation path and display it in the Expression Parameters section below the expression. Users can change any part of the path using the presented drop-downs. Where applicable, additional hops in the relation chain can be added with an "Add Next" button to expand the traversal. Relation paths displayed in Expression Parameters have intuitive labels formatted as [From Source Name]->relation name->[To Source Name].
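One way to picture how a "best" relation path can be chosen is a shortest-path search over the graph of active relations. The sketch below is purely conceptual and reuses the placeholder source and relation names from the example above; it is not DataForge's actual algorithm:

    from collections import deque

    # Hypothetical graph of active relations: source -> [(relation name, neighboring source), ...]
    relations = {
        "This":     [("Relation Name 1", "Source 2")],
        "Source 2": [("Relation Name 2", "Source 3")],
    }

    def shortest_relation_path(start, target):
        """Breadth-first search returning the fewest-hop chain of relations between two sources."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            source, path = queue.popleft()
            if source == target:
                return path
            for relation, neighbor in relations.get(source, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, path + [(relation, neighbor)]))
        return None

    # [('Relation Name 1', 'Source 2'), ('Relation Name 2', 'Source 3')]
    print(shortest_relation_path("This", "Source 3"))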
Hovering over the selected relation path gives users an extra level of detail, showing the relation expression used, including primary and foreign keys. The editor also tracks the cardinality of the path and, depending on whether the expression is wrapped in an aggregate function, informs the user when an aggregate must be used in the expression.
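As a hypothetical illustration (source and attribute names are made up, and the aggregate shown is a standard SQL function): a traversal that lands on a single matching row can reference the attribute directly, while a traversal that fans out to many rows needs an aggregate.

    Single matching row:  [Source 3].attribute
    Many matching rows:   SUM([Source 3].attribute)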
Users can click on each attribute name in the parameters to automatically highlight where that attribute is referenced in the rule expression, which can be extremely helpful in complicated rule expressions.
Rule templates also follow the new simplified syntax, and include additional parsing to generate and save expression parameters. This achieves faster and more robust linking of templates to sources.
Enhanced Project Import/Export
Importing a project now includes an automatic check for whether any sources will be deleted as part of the import. If one or more sources would be deleted because they are not included in the import files, the import is paused and assigned a pause status. Opening the import logs shows the list of sources that will be deleted if the import proceeds. Clicking the pause status icon displays the warning message along with options to cancel the import, fail the import, or proceed with the import.
Project exports no longer include created-by and updated-by users and datetimes. When a project import runs, all new or changed objects are created and marked with a Created by or Updated by user of "Import " plus the import number, which can be found on the Project Imports screen. Created and Updated datetimes are set to the time the import was started. This massively improves auditability and debugging, as users can see exactly when a configuration was changed and with which import files.
Logs for Project Imports now include detailed counts of all objects that were updated and deleted during the import.
Project variables and options improve the developer experience
DataForge projects overhaul how developers manage CI/CD pipelines, with the ability to integrate DevOps best-practice tools like GitHub. In this release, projects have matured to improve the developer experience through a number of new tools.
Project variables can now be added to a project so that configuration names are replaced by a variable name when the project is exported and matched to a target value when the project is imported back into a workspace. In practice, this allows users to manage multiple sets of the same project (e.g. Dev, Test, Prod) within the same workspace without having to worry whether multiple outputs are pointed at the same table across projects. Variables are added to a project before it is exported from a workspace; when it is imported back into a workspace, the variables are matched and replaced with the target values the user configures. When project export files that include variables are imported into a project that does not yet have the variables configured, the import fails and the variables are automatically created as placeholders. The user only has to populate the variables with values and restart the import!
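As a purely hypothetical illustration of the idea: the Dev copy of a project might write an output to a table named analytics_dev.orders. If that table name is captured in a project variable before export, importing the same export files into the Prod copy of the project only requires giving the variable a target value such as analytics_prod.orders, so the two copies never point at the same table.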
Projects now include a Disable Ingestion flag that users can toggle on or off to turn off ingestion for all sources within a project. This setting can be configured when the project is created or at any point afterward. When a project has ingestion disabled, a grey in-progress icon with a slash through it appears next to the project name in the site banner.
Watermark Column for Custom/None Refresh Incremental Loads
Sources with Custom or None Refresh now include additional parameters that allow users to use a <latest_watermark> token in source queries. The <latest_watermark> token is substituted with the maximum value of the column defined in the Watermark Column parameter (see the example after the parameter list below).
Watermark Column: the source attribute whose MAX(Watermark Column) value is substituted for the <latest_watermark> token in source queries
Watermark Initial Value: the value substituted for <latest_watermark> when no data has been ingested yet (e.g. 1900-01-01)
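For example (the table and column names here are illustrative, not a DataForge requirement), a source query of

    SELECT * FROM orders WHERE updated_at > <latest_watermark>

with Watermark Column set to updated_at runs with <latest_watermark> replaced by MAX(updated_at) from previously ingested data. On the very first pull, when nothing has been ingested yet, the Watermark Initial Value (e.g. 1900-01-01) is used instead.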
New Agent Authentication Protocol (2.0) and MSI download button
A new agent authentication method has been added as the 2.0 protocol, which removes the need for Auth0 to be used for authentication. The 2.0 protocol uses an agent token that is hashed/encrypted and stored in the agent-config.bin file. Customers should begin converting to the 2.0 protocol when possible, as DataForge will discontinue support for the 1.0 authentication protocol in the future.
Customers can convert existing agents by changing the Auth Protocol parameter on the agent UI page to 2.0 and saving the change. This temporarily stops the agent from communicating with DataForge. After saving the change, download the Agent Config file again from the Agents page and replace the existing agent-config.bin file on the machine where the agent is installed. Then restart the Agent service on that machine, and the agent will begin communicating with DataForge again.
The Agents page now includes a button to download the Agent MSI file directly from the UI. Users no longer need to navigate to their respective cloud storage system to download the MSI file for new installations.
Optional Read Replica for Postgres Metadata Database
DataForge can now run with a failover node for the Postgres metadata database where needed. To deploy a second Postgres node, add a Terraform variable named "read_replica_enabled" with a value of "yes". A second Postgres node provides redundancy if the cloud provider has an issue with the cluster running Postgres. Note that adding the read replica has additional cost implications. If the first Postgres node goes down for any reason, traffic and metadata are diverted to the second node.
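For example, in the Terraform variable values used for the deployment (the exact file and module layout depend on your environment), this could be as simple as:

    read_replica_enabled = "yes"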