Advanced SQL Concepts
Hashing in Data Engineering: Key Concepts
Multi-chapter guide | Chapter 3
Hashing is a mathematical process that transforms data of any size into a shorter, fixed-length string. A hash function is one-way, deterministic (the same input always produces the same output), and produces output of a predictable size. Due to its versatility, hashing has been used for decades in domains such as cryptography, data storage, and authentication.
Hashing is widely used in data engineering pipelines to handle sensitive information storage, efficient data retrieval and indexing, data partitioning, data integrity assurance, and more. In SQL, hashing applications revolve around improving query performance, monitoring data structure changes, and deduplicating database entries.
This article explains the properties of hashing and the types of hash functions used in data engineering. We explore techniques like hash-based indexing, hash joins, partitioning, and data compression, along with best practices for implementing them in data engineering pipelines. You will find answers on which hash functions to use, how to apply them to large datasets efficiently, and how to handle the resulting outputs.
Summary of key hash in SQL concepts
Concept | Description |
---|---|
Hashing | A transformation operation that takes any given input and maps it to another, fixed-length value. |
Hash-based indexing | Hash functions map data entries to an index table or hash table. |
Hash-based data deduplication | Hash functions create unique hash values for each record and detect duplicates without performing full record comparisons. |
Hash-based change data capture | CDC relies on identifying and capturing changes; since hashing excels at producing unique, compact data representations, you can use it to detect even minor changes between datasets. |
Hash joins | Hash rows from one table by the join key, creating a quick-access map. Check each row against this map while scanning the other table. Avoids a full comparison of both tables for fast matching. |
Hash functions for data sharding | Hash output modulo the number of available shards (hash(key) % shard_count) |
Hash functions for data obfuscation | Hash functions obfuscate sensitive records in data engineering pipelines, such as personal identities, financial data, medical records, and other confidential information. |
Hash partitioning | Divide a large dataset into smaller, more manageable partitions based on the hash values of one or more columns. |
Data compression | Hash function outputs are used to represent data in a more compact form without losing the underlying information. |
Hashing overview
Hashing is a one-way transformation process composed of three main parts: the input, the hashing function, and the output. The input can be any data passed through a one-way hashing function. The resulting value is usually shorter and of fixed length, making it easier to store, index, and compare than the original string.
You can think of a hash value as the shadow of an object; if you see only the shadow, you cannot visualize every detail of the original object, but if you have the original object, you can always recreate the exact shadow.
Ideally, this “shadow” is unique within a given set of inputs, meaning the hash function produces a distinct value for each input. However, this is not always the case: two different inputs can produce the same hash value, resulting in a “hash collision”. Although the risk of collision exists for any hash function, it is extremely rare for modern hashing algorithms such as SHA-256.
Ordered and unordered lists or structured trees require traversal, leading to variable and often non-constant access times. In contrast, hashing allows for nearly constant-time data access.
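As a quick illustration, and separate from the library example introduced later, PostgreSQL's built-in md5() function shows the fixed-length property: inputs of very different sizes produce outputs of the same size.
-- Two inputs of very different sizes both hash to a 32-character hexadecimal string
SELECT
    md5('Les Miserables') AS short_input_hash,
    md5(repeat('a very long string ', 1000)) AS long_input_hash;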
Hash function types
Hash functions are characterized by their design and intended application:
Group | Description | Examples | Use Cases |
---|---|---|---|
Cryptographic | Designed to be secure against attacks. | SHA-256, SHA-512, SHA-3 | Digital signatures, password hashing, and data integrity verification. |
Non-cryptographic | Prioritize speed and efficiency over security. | FNV, MurmurHash | Hash tables, checksums, and data deduplication. |
Message-digest | Produce a fixed-size output representing a "digest" of the input data. | MD5 | Data storage and retrieval. |
Universal | Designed to minimize the chance of collisions between distinct keys, which can affect performance in hash table operations. | Multiplication-based, Polynomial, Universal hash families | High efficiency and collision resistance requirements. |
Hashing in data engineering
Hashing is widely used in data engineering pipelines to optimize performance and ensure data integrity across databases and applications.
Cryptographic hash functions are used in secure password storage, digital signatures, and data integrity checks.
Non-cryptographic hash functions are used in hash tables for quick lookups, deduplication, and in-memory data processing tasks.
Checksum algorithms, a subset of non-cryptographic hash functions, are widely used for error-checking and data integrity verification, including data deduplication checks.
Universal hashing is used in high-demand, intensive applications such as distributed systems and hash tables in large datasets.
For each category, we have preferred functions depending on the desired outcome. Below are just some examples:
Use case | Preferred hash category (example) |
---|---|
Secure password storage | Cryptographic (e.g., SHA-256) |
Data deduplication | Non-cryptographic (e.g., MurmurHash) |
Error-checking in data transfers | Checksum (e.g., CRC32) |
Distributed systems | Universal (e.g., Multiplication-based) |
In-memory data processing | Non-cryptographic (e.g., FNV) |
A hands-on example
Let’s look at how some hash algorithms work by following a hands-on example using PostgreSQL. Consider a PostgreSQL database designed for a public library management system. This database tracks books, authors, and borrowers and consists of four main tables, plus an auxiliary table for tracking changes that we will use later in the CDC example:
Entity-Relationship Diagram (ERD) for a simplified library management system.
Create the tables:
-- 1. Authors Table
CREATE TABLE authors (
author_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
birth_date DATE,
nationality VARCHAR(50)
);
-- 2. Books Table
CREATE TABLE books (
book_id SERIAL PRIMARY KEY,
title VARCHAR(255),
author_id INT REFERENCES authors(author_id),
isbn VARCHAR(13) UNIQUE,
published_year INT,
genre VARCHAR(50),
barcode VARCHAR(50) UNIQUE
);
-- 3. Borrowers Table
CREATE TABLE borrowers (
borrower_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
membership_date DATE,
email VARCHAR(100) UNIQUE
);
-- 4. Borrowing records Table (tracking the borrowed books)
CREATE TABLE borrowings (
borrowing_id SERIAL PRIMARY KEY,
book_id INT REFERENCES books(book_id),
borrower_id INT REFERENCES borrowers(borrower_id),
borrow_date DATE,
return_date DATE
);
-- 5. Table to track changes (deltas, to discuss later)
CREATE TABLE book_changes (
book_id INT,
title TEXT,
author_id INT,
new_row_hash TEXT,
change_timestamp TIMESTAMP DEFAULT NOW(),
UNIQUE (book_id)
);
Populate the tables:
-- 1. Authors Table
INSERT INTO authors (author_id, first_name, last_name)
VALUES
(1, 'Victor', 'Hugo'),
(2, 'Fyodor', 'Dostoevsky'),
(3, 'Joris-Karl', 'Huysmans'),
(4, 'Thomas', 'Pynchon'),
(5, 'Herman', 'Melville'),
(6, 'Leo', 'Tolstoy'),
(7, 'George', 'Orwell'),
(8, 'Franz', 'Kafka'),
(9, 'James', 'Joyce');
-- 2. Books Table
INSERT INTO books (title, author_id, isbn, published_year, genre, barcode)
VALUES
('Les Miserables', 1, '9781234567890', 1862, 'Novel', '12345'),
('Crime and Punishment', 2, '9780987654321', 1866, 'Philosophical', '23456'),
('À Rebours', 3, '9781112131415', 1884, 'Novel', '34567'),
('Gravitys Rainbow', 4, '9782223242526', 1973, 'Novel', '45678'),
('Moby-Dick', 5, '9783334354647', 1851, 'Adventure', '56789'),
('War and Peace', 6, '9784445464748', 1869, 'Historical', '67890'),
('The Brothers Karamazov', 2, '9785556575859', 1880, 'Philosophical', '78901'),
('1984', 7, '9786667686970', 1949, 'Dystopian', '89012'),
('The Trial', 8, '9787778797081', 1925, 'Novel', '90123'),
('Ulysses', 9, '9788889808192', 1922, 'Modernist', '01234');
-- 3. Borrowers Table
INSERT INTO borrowers (first_name, last_name, membership_date, email)
VALUES
('Napoleon', 'Bonaparte', '2024-01-01', 'napoleon@example.com'),
('Albert', 'Einstein', '2024-01-02', 'einstein@example.com'),
('Oscar', 'Wilde', '2024-01-03', 'oscar@example.com'),
('George', 'Washington', '2024-01-04', 'george@example.com'),
('Mark', 'Twain', '2024-01-05', 'mark@example.com'),
('Karl', 'Marx', '2024-01-06', 'karl@example.com'),
('Charles', 'Darwin', '2024-01-07', 'darwin@example.com'),
('Cleopatra', '', '2024-01-08', 'cleo@example.com'),
('Leonardo', 'da Vinci', '2024-01-09', 'leo@example.com'),
('Queen', 'Victoria', '2024-01-10', 'victoria@example.com');
-- 4. Borrowings Table with deliberate duplicates for demonstration
INSERT INTO borrowings (book_id, borrower_id, borrow_date, return_date)
VALUES
(1, 1, '2024-11-01', '2024-11-10'),
(1, 1, '2024-11-01', '2024-11-10'), -- Duplicate entry
(2, 2, '2024-11-02', '2024-11-12'),
(3, 3, '2024-11-03', '2024-11-13'),
(4, 4, '2024-11-04', '2024-11-14'),
(5, 5, '2024-11-05', '2024-11-15'),
(5, 5, '2024-11-05', '2024-11-15'), -- Duplicate entry
(6, 6, '2024-11-06', '2024-11-16'),
(7, 7, '2024-11-07', '2024-11-17'),
(8, 8, '2024-11-08', '2024-11-18'),
(9, 9, '2024-11-09', '2024-11-19'),
(10, 10, '2024-11-10', '2024-11-20');
The example purposely includes duplicate entries in the borrowings table to illustrate some deduplication methods in the next sections.
Hash-based indexing
Hash-based indexing uses hash functions to map data entries to an index table or hash table. A hash function takes a table key as input and produces a hash value that determines the position of the corresponding data entry in the index. This way, the index points directly to the data location without reading through multiple entries, allowing records to be looked up quickly.
Below is a simplified diagram of how hash tables work.
Hash-based indexing structure
For example, let’s create a hash index on the barcode column of the books table.
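The statement is minimal; its form matches the index definition that appears in the listing below.
-- Create a hash-based index on the barcode column
CREATE INDEX idx_books_barcode ON books USING HASH (barcode);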
In the SQL code above, we create a new index called idx_books_barcode and tell PostgreSQL that it is a hash-based index.
PostgreSQL does not change actual tables when adding hash indexes. The books table remains unchanged, and you can see the newly created indexes in the PostgreSQL indexes table. When you query a specific title, PostgreSQL uses the index to quickly locate rows, making lookups faster when there are many book records.
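One way to list them is a small query against the standard pg_indexes catalog view, whose columns match the output shown below.
-- List all indexes defined on the books table
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename = 'books';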
It returns the following.
schemaname | tablename | indexname | indexdef |
---|---|---|---|
public | books | books_pkey | CREATE UNIQUE INDEX books_pkey ON public.books USING btree (book_id) |
public | books | books_isbn_key | CREATE UNIQUE INDEX books_isbn_key ON public.books USING btree (isbn) |
public | books | books_barcode_key | CREATE UNIQUE INDEX books_barcode_key ON public.books USING btree (barcode) |
public | books | idx_books_barcode | CREATE INDEX idx_books_barcode ON public.books USING hash (barcode) |
It’s important to note that hash indexes in PostgreSQL are optimized for equality comparisons, such as = or IN, but are unsuitable for range queries (e.g., <, >, or BETWEEN).
Why use hash-based indexing?
Hash-based indexing is beneficial for queries such as equality searches, where the query looks for exact key matches. Since we’re combining two concepts, index tables and hashing, we get the benefits of both:
Rapid data retrieval
Minimal amount of memory used when compared to other indexing methods
Consistent performance as the table size grows.
Hash-based indexing is particularly useful for large datasets. It provides almost constant time complexity for lookups because hash functions directly map keys to specific locations in a hash table.
Hash-based data deduplication
Data deduplication is a common data engineering practice that consists of removing duplicate records, saving storage, and improving data accuracy. You can use hash functions to create unique hash values for each record and then detect duplicates without performing full record comparisons.
Data deduplication methods
Several hash-based data deduplication techniques can be directly applied in a data engineering pipeline context.
Method | Description | Use cases |
---|---|---|
Unique constraints | Compute a hash for each entry and apply constraints on this column, preventing duplicate records from being added. | Duplicate blocking. |
Hash aggregation | Use aggregation functions on hash values to identify duplicates within a single query. | Identification and removal of duplicate entries based on combinations of columns. |
Cross-table hash comparison | Create hash values in both the source and target tables and then use these to match and remove duplicates. | Deduplication of data across tables, such as when merging data. |
Rolling hash | Use a rolling hash to create a "sliding window" of hashes. | Detection of duplicate records in real-time. |
Bloom filters | Use a probabilistic data structure that checks if an element is likely in a set by hashing it into bit arrays. | Extensive dataset filtering since it prevents the need to store full data for deduplication. |
As an example, let’s consider two of the most popular data deduplication methods:
The unique constraints method is very simple and creates a constraint for the specified column. If you try to add duplicate records to this column, PostgreSQL returns an error.
-- Add a unique constraint to the email column of the borrowers table
ALTER TABLE borrowers ADD CONSTRAINT unique_email UNIQUE (email);
-- Try to add duplicated entries
INSERT INTO borrowers (first_name, last_name, membership_date, email)
VALUES ('John', 'Doe', '2024-11-11', 'napoleon@example.com'); -- Duplicate email
The insert fails with a duplicate key violation referencing the unique_email constraint.
You can also use hash aggregation to detect duplicate borrowings, as shown below.
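A possible form of the query (a sketch; the exact columns included in the hash may differ from the original) hashes each row's key fields, groups on that hash, and keeps only groups with more than one row.
-- Detect duplicate borrowings by grouping on a per-row hash
SELECT
    book_id,
    borrower_id,
    COUNT(*) AS duplicates
FROM borrowings
GROUP BY
    MD5(book_id::TEXT || borrower_id::TEXT || borrow_date::TEXT || return_date::TEXT),
    book_id,
    borrower_id
HAVING COUNT(*) > 1;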
Hash aggregation correctly flags the duplicate entries added earlier when populating the tables. Since it is a detection method, we end up with the following:
book_id | borrower_id | duplicates |
---|---|---|
1 | 1 | 2 |
5 | 5 | 2 |
Hash-based CDC
Change Data Capture (CDC) is a set of data engineering practices that detect changed data in a given dataset. These changes are called “deltas,” and the output dataset is called a “delta-driven dataset.”
Any CDC approach includes three main components that execute sequentially:
Identification detects rows where changes have occurred using methods like hash comparisons or timestamps.
Capture extracts identified changes from the source, preparing them for downstream processing.
Delivery sends captured changes to the target system for updated data synchronization.
CDC relies on identifying and capturing changes; since hashing excels at producing unique, compact data representations, you can use it to detect even minor changes between datasets, similar to how we would monitor duplicate entries in a deduplication approach.
Hash functions for CDC
Below are the four main hash functions used in CDC, along with their main characteristics:
Method | Speed | Security against collisions | Efficiency for small data | Scalability | Consistency |
---|---|---|---|---|---|
MD5 | Y | N | Y | Y | Y |
SHA-256 | Y* | Y | Y | Y | Y |
CRC32 | Y | N | Y | Y | Y |
FNV | Y | N | Y | Y | Y |
*SHA-256 is considered moderately fast when compared to other functions (e.g., MD5)
When selecting a hash function for CDC, consider the following:
Speed: A fast hash function minimizes overhead when comparing rows.
Security against collisions: Collisions can lead to inaccurate change detection. A hash function with higher collision resistance ensures more reliable tracking, but it may not always be critical in less sensitive environments.
Efficiency for small increments: A hash function that can handle small data increments, such as individual rows or changes, reduces latency and resource consumption.
Scalability: The hash function must process millions of rows without significant degradation in performance.
Consistency: The same input must always produce the same hash, which is critical for CDC. Without this, identifying new changes would be unreliable.
Hash CDC example
Let’s introduce a change data capture (CDC) system that monitors changes to a given column in a table from the public library database example. CDC involves several sequential steps.
Step 1- Declare the CDC function:
-- 1. Create a function to compute a hash value for a row (using MD5)
CREATE OR REPLACE FUNCTION calculate_row_hash(book_id INT, title TEXT, author_id INT)
RETURNS TEXT AS $$
BEGIN
RETURN MD5(COALESCE(book_id::TEXT, '') || COALESCE(title, '') || COALESCE(author_id::TEXT, ''));
END;
$$ LANGUAGE plpgsql;
Step 2—Create a row_hash column in the books table to store a hash value that uniquely represents each row. This hash is calculated from the book's book_id, title, and author_id, acting like a digital fingerprint for each row; a sketch of this step follows below.
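A minimal sketch of this step, assuming the column is simply added and backfilled with the calculate_row_hash() function from step 1:
-- Add a row_hash column to books and populate it for every existing row
ALTER TABLE books ADD COLUMN IF NOT EXISTS row_hash TEXT;
UPDATE books
SET row_hash = calculate_row_hash(book_id, title, author_id);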
Every time a change happens (e.g., a book's title or author changes), the row_hash identifies what changed. The book_changes table stores these changes, but only for books that have been modified. If the book already exists in book_changes, it updates the record with the new data and the new hash:
-- 3. Capture and Store Changes (Deltas) - Insert only changed rows into book_changes table
INSERT INTO book_changes (book_id, title, author_id, new_row_hash)
SELECT
book_id,
title,
author_id,
row_hash AS new_row_hash
FROM books
WHERE book_id = 1
ON CONFLICT (book_id) DO UPDATE
SET
title = excluded.title,
author_id = excluded.author_id,
new_row_hash = excluded.new_row_hash,
change_timestamp = NOW();
Step 3—To see which rows have changed, join the book_changes table (which tracks changes) with the books table (which contains the original data). This query, shown below in step 5 (query 7), retrieves the book's book_id, the original title and author, and the new hash value created after the change, then sorts the results by when the change happened so the most recent changes appear first.
Step 4—To test the change detection, update the title of a book (book_id = 1). The row_hash for that row is recalculated from the new title and stored in the row_hash column, and the change is then captured in the book_changes table, as sketched below.
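One way to do this manually (a sketch; the original setup may instead recompute the hash via a trigger) is:
-- Update the title of book 1 and recompute its row hash
UPDATE books
SET title = 'Les Miserables Updated',
    row_hash = calculate_row_hash(book_id, 'Les Miserables Updated', author_id)
WHERE book_id = 1;
Re-running the capture statement from the earlier snippet (comment 3) then upserts the new hash into book_changes.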
Step 5—List the changed records using two different formats:
-- 6. View changed records
SELECT
book_id,
title,
author_id,
new_row_hash,
change_timestamp
FROM book_changes;
-- 7. View changed records with their original and new hash values
SELECT
b.book_id,
b.title AS original_title,
b.author_id AS original_author_id,
bc.new_row_hash AS new_hash,
b.row_hash AS original_hash,
bc.change_timestamp
FROM
book_changes bc
JOIN
books b ON bc.book_id = b.book_id
ORDER BY
bc.change_timestamp DESC;
Final SQL result.
book_id | original_title | original_author_id | new_hash | original_hash | change_timestamp |
---|---|---|---|---|---|
1 | Les Miserables Updated | 1 | 806c0d25073ccc6a31629e3caa864419 | 4eefca8fcb17c75cde341219dd370eaf | 33:33.2 |
Hash-based CDC vs. other CDC methods
Hash-based CDC methods are designed to excel in real-time environments. They quickly process large datasets with minimal overhead for tracking changes in high-frequency, high-volume systems. Plus, they’re easy to implement, with less complexity than other approaches.
However, hash collisions introduce challenges depending on the monitoring type; other methods might be better if you need precise, field-level tracking or stronger data integrity. Log-based CDC, for example, offers granular information about exactly which fields changed and avoids collision risks altogether.
DataForge’s hash-based CDC is unique in that it processes changes efficiently and addresses these challenges by enabling users to track hash-based data lineage through its UI. This UI simplifies monitoring and minimizes the error rate, ensuring reliability even in scenarios where common hash-based methods might fall short due to collision risks or lack of granularity.
Hash-based vs. log-based CDC
Hash joins
Hash joins leverage hashing to organize data from each table into hash tables, allowing quick lookups and minimizing the need for full-table scans. They do this by first hashing rows from one table by the join key, creating a quick-access map. Then, as the other table is scanned, each row is checked against this map for fast matching, avoiding a full comparison of both tables.
Figure N: A generic hash join diagram
This efficiency makes hash joins particularly valuable in ETL processes, where large datasets need to be joined and transformed efficiently, or in distributed databases, where minimizing data movement and optimizing join performance is critical.
Approaches for implementing hash joins
Hash joins can be implemented in various ways, depending on the context we’re working in. The main differentiating factor is memory usage; each method is designed to optimize join performance based on memory constraints and dataset size.
Join | Description | Pro | Con |
---|---|---|---|
In-memory hash join | Uses available memory to store the hash table. | It’s fast and efficient for small datasets that fit within RAM. | Can’t handle large tables due to memory constraints. |
Partitioned hash join | Splits tables into smaller partitions and hashes each separately. | Reduces memory requirements and scales to larger datasets. | Introduces some overhead for partition management. |
Grace hash join | A variation of the partitioned join—writes partitions to disk. | Handles very large datasets by processing them in manageable chunks. | Introduces some overhead for disk management. |
Hybrid hash join | Starts with an in-memory hash but adapts to partitioned processing if memory overflows. | Balances efficiency and scalability; ideal for varying data sizes. | More complex to implement and tune than the other approaches. |
Hash join example
Let us go back to the library example and illustrate how to perform a hash join using two tables.
-- 1. Create a hash index on the book_id column of borrowings
CREATE INDEX idx_borrowings_book_id ON borrowings USING HASH (book_id);
-- 2. Perform the actual join
SELECT
b.book_id,
b.title,
b.author_id,
bo.borrower_id,
bo.borrow_date,
bo.return_date
FROM
books b
JOIN
borrowings bo
ON
b.book_id = bo.book_id;
If the code above looks like a conventional join operation, that’s because it is; in PostgreSQL, joins are optimized by the query planner, so the actual HASH JOIN operation happens internally rather than being written explicitly. The hash index on book_id simply gives the planner another access path (you can check which join strategy it picked with EXPLAIN, as sketched after the results below). The query returns the following:
book_id | title | author_id | borrower_id | borrow_date | return_date |
---|---|---|---|---|---|
1 | Les Miserables Updated | 1 | 1 | 11/1/2024 | 11/10/2024 |
1 | Les Miserables Updated | 1 | 1 | 11/1/2024 | 11/10/2024 |
2 | Crime and Punishment | 2 | 2 | 11/2/2024 | 11/12/2024 |
3 | À Rebours | 3 | 3 | 11/3/2024 | 11/13/2024 |
4 | Gravitys Rainbow | 4 | 4 | 11/4/2024 | 11/14/2024 |
5 | Moby-Dick | 5 | 5 | 11/5/2024 | 11/15/2024 |
5 | Moby-Dick | 5 | 5 | 11/5/2024 | 11/15/2024 |
6 | War and Peace | 6 | 6 | 11/6/2024 | 11/16/2024 |
7 | The Brothers Karamazov | 2 | 7 | 11/7/2024 | 11/17/2024 |
8 | 1984 | 7 | 8 | 11/8/2024 | 11/18/2024 |
9 | The Trial | 8 | 9 | 11/9/2024 | 11/19/2024 |
10 | Ulysses | 9 | 10 | 11/10/2024 | 11/20/2024 |
Note that a hash join doesn't deduplicate data during the join process; it simply matches rows based on the hash of the book_id column.
When the same book_id appears more than once in borrowings, those duplicate entries are matched with the books table, leading to multiple output rows for the same book.
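To confirm which join strategy the planner actually chose, you can prefix the query with EXPLAIN. This is a quick diagnostic sketch; the plan you see depends on table sizes and statistics.
-- Inspect the query plan; look for a "Hash Join" node in the output
EXPLAIN
SELECT b.book_id, b.title, bo.borrower_id, bo.borrow_date
FROM books b
JOIN borrowings bo ON b.book_id = bo.book_id;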
Hash functions for data sharding
Data sharding consists of splitting a large dataset across multiple database nodes or partitions to distribute load, improve performance, and enhance scalability. By dividing data into "shards," each node handles a subset of the data, reducing bottlenecks and making the system more efficient and resilient to traffic spikes.
Hash-based sharding assigns data to shards based on the output of a hash function. Each data record is hashed using a specific key (like a user ID), and the resulting hash value determines the target shard. For example, a common approach is to take the hash output modulo the number of available shards (hash(key) % shard_count). This approach balances data across shards by distributing records in a way that’s both consistent and computationally efficient.
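A minimal sketch of the modulo approach in SQL, assuming PostgreSQL's built-in hashtext() helper (whose exact values are not guaranteed to be stable across versions) and an arbitrary shard count of 4:
-- Assign each borrower to one of 4 shards based on a hash of their email
SELECT
    borrower_id,
    email,
    abs(hashtext(email)) % 4 AS shard_id
FROM borrowers;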
Benefits of hash-based sharding:
Distributes data evenly, avoiding hot spots or overloaded shards.
Adding or removing shards only requires adjusting the modulo divisor, though doing so remaps most existing keys, so resharding should be planned as data volume grows.
Each key reliably maps to a specific shard, simplifying data retrieval and maintaining quick lookup times.
Generalized hash-based data sharding process flow
Hash functions for data obfuscation
Data obfuscation is the process of transforming information in such a way that the original value is not accessible to unauthorized users. This practice is widely used in data engineering pipelines dealing with sensitive records such as personal identities, financial data, medical records, and other confidential information.
Since hashing is a one-way process, it is a solid choice for data obfuscation. By transforming sensitive data into fixed-length, seemingly random strings, hashing can effectively conceal the underlying information while preserving certain structural properties necessary for downstream processing.
Anonymizing data using hash-based obfuscation
The specific hashing approach used for obfuscation can vary depending on the requirements of the data pipeline. For smaller datasets, in-memory hashing may be a fast and efficient solution. However, larger volumes of sensitive data may call for partitioned hashing strategies, which divide the data into manageable chunks to reduce memory demands.
More advanced obfuscation techniques, such as salted hashing, incorporate unique "salt" values into the hashing process to provide an additional layer of protection. This helps defend against attacks like rainbow table lookups and further secures the concealed data.
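A minimal sketch of a salted variant, assuming the pgcrypto extension is enabled (as in the next snippet) and using a hypothetical application-level secret as the salt:
-- Salted SHA-256: prepend a secret salt before hashing to resist rainbow table lookups
SELECT
    borrower_id,
    encode(digest('app_secret_salt' || email, 'sha256'), 'hex') AS salted_hashed_email
FROM borrowers;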
For example, let’s obfuscate a sensitive field in the borrowers table:
-- 0. Enable pgcrypto extension (only needs to be done once)
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- 1. Add a column for the hashed email (IF NOT EXISTS makes the step idempotent)
ALTER TABLE borrowers
ADD COLUMN IF NOT EXISTS hashed_email TEXT;
-- 2. Create a function to hash the email using SHA-256 for stronger security
CREATE OR REPLACE FUNCTION hash_email(input_email TEXT)
RETURNS TEXT AS $$
-- Hash the email using SHA-256 and return as a hex string
SELECT encode(digest(input_email, 'sha256'), 'hex');
$$ LANGUAGE sql;
-- 3. Update the borrowers table with the hashed email
UPDATE borrowers
SET hashed_email = hash_email(email);
-- 4. View the updated borrowers with hashed emails
SELECT borrower_id, first_name, last_name, email, hashed_email
FROM borrowers;
By now, we should be familiar with how a hash value looks, which is exactly what we get from the code above.
borrower_id | first_name | last_name | email | hashed_email |
---|---|---|---|---|
1 | Napoleon | Bonaparte | napoleon@example.com | c4ba331b66a0dcbc08757a771b31eeb87850c3373b6bda24ddee05b80ada8a88 |
2 | Albert | Einstein | einstein@example.com | 5efce675f54c30064875bca8f75ba2acedf594f326434f4c9c1a545cc7597984 |
3 | Oscar | Wilde | oscar@example.com | 0ece59855504dda4281f7fe85cd8d8a653a6aeefd9b1885767d6b89b08ba19a9 |
4 | George | Washington | george@example.com | 47508cb4512f7b46aaeb34b4a514c63d63ec240bf28cbc4410addd4867c10066 |
5 | Mark | Twain | mark@example.com | 70b4f63574fbce65d870a6ba773f57eaabcc39e8f94b63bb840776771919dc65 |
6 | Karl | Marx | karl@example.com | 28d554c51d4197942ae283cb2dee2241919224956f1e9aaff605a747a7ef318d |
7 | Charles | Darwin | darwin@example.com | f94769776eba1492b0a10f92e9ca50395192454dd15f714d5663aa41e06f6e52 |
8 | Cleopatra | | cleo@example.com | b11b6ad3e195b530c96c9f94774644c46383b81f62116126608814c021d70f04 |
9 | Leonardo | da Vinci | leo@example.com | 6473c0297478fe29d2e428c34026021e02a59655b7f5aa2c10c2d607f3fe4354 |
10 | Queen | Victoria | victoria@example.com | 0015778c4d4f8637d659bf95806a07ee06368ee94623ec9d1346f1096336593f |
Hash partitioning
Hash partitioning is a common technique used in data engineering to improve the performance and scalability of data processing pipelines. The basic idea is to divide a large dataset into smaller, more manageable partitions based on the hash function values of one or more columns.
Hash vs. range partitioning
Hash partitioning leverages the uniform distribution properties of hash functions to assign data to partitions. By hashing the value of a chosen partitioning column, the data is evenly distributed across partitions, ensuring a balanced workload. This contrasts with range partitioning, which assigns data to partitions based on the actual value ranges of the partitioning column(s).
For example, consider a table of customer orders. With range partitioning, we might partition the data by order date, creating partitions for orders from 2022-Q1, 2022-Q2, and so on. In hash partitioning, we could instead partition by hashing the customer ID, which would distribute the orders more evenly regardless of the dates.
The choice between hash and range partitioning depends on the specific requirements of our data pipeline. Hash partitioning excels when you want to:
Evenly distribute a large dataset for parallel processing.
Avoid hotspots or skew in our partitions.
Quickly locate data based on a unique identifier.
Range partitioning, on the other hand, is preferable when you want to:
Query data within a specific range (e.g., orders from a given quarter).
Require more human-readable partition names (e.g., date ranges).
However, in most cases, a hybrid approach can be used.
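As a sketch of what hash partitioning can look like in practice, PostgreSQL (version 11 and later) supports declarative hash partitioning. The table below is hypothetical and mirrors the borrowings table from the running example, split into four partitions on borrower_id.
-- Parent table partitioned by hash of borrower_id
CREATE TABLE borrowings_hashed (
    borrowing_id INT,
    book_id INT,
    borrower_id INT,
    borrow_date DATE,
    return_date DATE
) PARTITION BY HASH (borrower_id);
-- Four partitions; each row lands where hash(borrower_id) % 4 matches the remainder
CREATE TABLE borrowings_hashed_p0 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE borrowings_hashed_p1 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE borrowings_hashed_p2 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE borrowings_hashed_p3 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 3);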
Data compression
Data compression represents data in a more compact form without losing the underlying information. It is a fundamental technique used in data engineering to reduce the storage footprint.
Hash-based compression techniques leverage the unique properties of hashing to achieve the following:
Deduplication: Identical data elements produce the same hash value, allowing you to easily identify and remove duplicate entries.
Indexing: The uniform distribution of hash values makes it easy to create efficient indexes for quickly locating specific data within the compressed dataset.
Compact representations: In many cases, the hash value can be stored in place of the original data, achieving substantial space savings.
Integrity checks: Hash values can generate checksums or other validation mechanisms to verify that decompressed data has retained its completeness and accuracy.
Most of the methods can be applied in combination with each other, as shown.
-- 1. Deduplication: remove duplicate rows based on a hash of book details
--    (reusing the hash_email() helper from earlier as a generic text hasher)
WITH hashed_books AS (
SELECT book_id, hash_email(title || author_id || isbn) AS book_hash
FROM books
)
DELETE FROM books
WHERE book_id NOT IN (SELECT MIN(book_id) FROM hashed_books GROUP BY book_hash);
-- 2. Indexing: Creating a hash-based index on a computed concatenated column (book details)
ALTER TABLE books ADD COLUMN IF NOT EXISTS title_author_isbn_hash TEXT;
UPDATE books SET title_author_isbn_hash = hash_email(title || author_id || isbn);
CREATE INDEX idx_books_hash ON books (title_author_isbn_hash);
-- 3. Compact Representations: Storing only hash values for titles
ALTER TABLE books ADD COLUMN IF NOT EXISTS title_hash TEXT;
UPDATE books SET title_hash = hash_email(title);
-- 4. Integrity Checks: Generating a checksum to validate data integrity after compression
SELECT book_id, title, encode(digest(title || author_id || isbn, 'sha256'), 'hex') AS integrity_checksum
FROM books;
The resulting “books” table after hash-based data compression transformations.
book_id | title | integrity_checksum |
---|---|---|
2 | Crime and Punishment | a26fb63c594e0f293f09a66bc9415ef2392891afac725ffd304e22e044d4f991 |
3 | À Rebours | 59f0c92af15edd76aad8d39443beaacde73ea449b9555c61ee84d45206de40bb |
4 | Gravitys Rainbow | 514f6f450f92c9ceebbf047bcf3dc01c5de30b717bdbffafd5e15cf1bb62cba4 |
5 | Moby-Dick | ca87ea21b996cb794832225189c305a9d6b18d5d3cd948d099d9bfe7449dfcdb |
6 | War and Peace | f5f5b5ab2078a6b46d1c9625ac5319cd090700a58a9117ab9ea1a2e497045f9c |
7 | The Brothers Karamazov | 69c09b753fb984864460f03c75be1a6615321e6a0969efd30f39c20ab094986c |
8 | 1984 | fcd70a1a71f6f633544851bbbad9bdad4702e04fe713543e014ccd12aeec7641 |
9 | The Trial | 89a72df81ee222bf6181e37a90ce3c7e94ccfb89073890a304b13a4874887b80 |
10 | Ulysses | 2af29ed5d64df3d9a4884a2b1af9b5e16a6fbea4cb763f0b6a1862430fda6db1 |
1 | Les Miserables Updated | 0ed876acf21033709f9111b6b17125856f92f06cf3a0a2864805d2e98fcf620c |
Last thoughts
The flexibility and versatility of hashing make it suitable for various applications and adaptable to virtually any use case. Applying hash functions in data engineering pipelines unlocks a powerful toolset that enhances data integrity, reduces redundancy, and optimizes performance across ETL pipelines. For example, DataForge uses hash-based CDC to monitor data changes at per-row granularity, allowing for real-time updates and facilitating reliable auditing processes.