Advanced SQL Concepts
Hashing in Data Engineering: Key Concepts
Multi-chapter guide | Chapter 3
Hashing is a mathematical process that transforms data of any size into a shorter, fixed-length string. A hash function is one-way, deterministic (the same input always produces the same output), and produces output of a predictable size. Due to its versatility, hashing has been used for decades in domains such as cryptography, data storage, and authentication.
Hashing is widely used in data engineering pipelines to handle sensitive information storage, efficient data retrieval and indexing, data partitioning, data integrity assurance, and more. In SQL, hashing applications revolve around improving query performance, monitoring data structure changes, and deduplicating database entries.
This article explains the properties of hashing and the types of hash functions used in data engineering. We explore techniques like hash-based indexing, hash joins, partitioning, and data compression, along with best practices for implementing them in data engineering pipelines. You will find answers on which hash functions to use, how to apply them to large datasets efficiently, and how to handle the resulting outputs.
Summary of key hash in SQL concepts
Concept | Description |
---|---|
Hashing | A transformation operation that takes any given input and maps it to another, fixed-length value. |
Hash-based indexing | Hash functions map data entries to an index table or hash table. |
Hash-based data deduplication | Hash functions create unique hash values for each record and detect duplicates without performing full record comparisons. |
Hash-based change data capture | CDC relies on identifying and capturing changes; since hashing excels at producing unique, compact data representations, you can use it to detect even minor changes between datasets. |
Hash joins | Hash rows from one table by the join key, creating a quick-access map. Check each row against this map while scanning the other table. Avoids a full comparison of both tables for fast matching. |
Hash functions for data sharding | Hash output modulo the number of available shards (hash(key) % shard_count) |
Hash functions for data obfuscation | Hash functions obfuscate sensitive records in data engineering pipelines, such as personal identities, financial data, medical records, and other confidential information. |
Hash partitioning | Divide a large dataset into smaller, more manageable partitions based on the hash values of one or more columns. |
Data compression | Hash function outputs are used to represent data in a more compact form without losing the underlying information. |
Hashing overview
Hashing is a one-way transformation process composed of three main parts: the input, the hashing function, and the output. The input can be any data passed through a one-way hashing function. The resulting value is usually shorter and of fixed length, making it easier to store, index, and compare than the original string.
You can think of a hash value as the shadow of an object; if you see only the shadow, you cannot visualize every detail of the original object, but if you have the original object, you can always recreate the exact shadow.
Ideally, this “shadow” is unique within a given set of inputs, meaning the hash function produces a distinct value for each input. However, this is not always the case: two different inputs can produce the same hash value, resulting in a “hash collision”. Although the risk of collision exists for any hash function, it is extremely rare for modern hashing algorithms such as SHA-256.
Ordered and unordered lists or structured trees require traversal, leading to variable and often non-constant access times. In contrast, hashing allows for nearly constant-time data access.
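As a quick illustration, and separate from the library example introduced later, PostgreSQL's built-in md5() function shows the fixed-length property: inputs of very different sizes produce outputs of the same size.
-- Two inputs of very different sizes both hash to a 32-character hexadecimal string
SELECT
    md5('Les Miserables') AS short_input_hash,
    md5(repeat('a very long string ', 1000)) AS long_input_hash;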
Hash function types
Hash functions are characterized by their design and intended application:
Group | Description | Examples | Use Cases |
---|---|---|---|
Cryptographic | Designed to be secure against attacks. | SHA-256, SHA-512, SHA-3 | Digital signatures, password hashing, and data integrity verification. |
Non-cryptographic | Prioritize speed and efficiency over security. | FNV, MurmurHash | Hash tables, checksums, and data deduplication. |
Message-digest | Produce a fixed-size output representing a "digest" of the input data. | MD5 | Data storage and retrieval. |
Universal | Designed to minimize the chance of collisions between distinct keys, which can affect performance in hash table operations. | Multiplication-based, Polynomial, Universal hash families | High efficiency and collision resistance requirements. |
Hashing in data engineering
Hashing is widely used in data engineering pipelines to optimize performance and ensure data integrity across databases and applications.
Cryptographic hash functions are used in secure password storage, digital signatures, and data integrity checks.
Non-cryptographic hash functions are used in hash tables for quick lookups, deduplication, and in-memory data processing tasks.
Checksum algorithms, a subset of non-cryptographic hash functions, are widely used for error-checking and data integrity verification, including data deduplication checks.
Universal hashing is used in high-demand, intensive applications such as distributed systems and hash tables in large datasets.
For each category, we have preferred functions depending on the desired outcome. Below are just some examples:
Use case | Preferred hash category (example) |
---|---|
Secure password storage | Cryptographic (e.g., SHA-256) |
Data deduplication | Non-cryptographic (e.g., MurmurHash) |
Error-checking in data transfers | Checksum (e.g., CRC32) |
Distributed systems | Universal (e.g., Multiplication-based) |
In-memory data processing | Non-cryptographic (e.g., FNV) |
A hands-on example
Let’s look at how some hash algorithms work by following a hands-on example using PostgreSQL. Consider a PostgreSQL database designed for a public library management system. This database tracks books, authors, and borrowers and consists of four main tables, plus an auxiliary table for tracking changes that we will use later in the CDC example:
Entity-Relationship Diagram (ERD) for a simplified library management system.
Create the tables:
-- 1. Authors Table
CREATE TABLE authors (
author_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
birth_date DATE,
nationality VARCHAR(50)
);
-- 2. Books Table
CREATE TABLE books (
book_id SERIAL PRIMARY KEY,
title VARCHAR(255),
author_id INT REFERENCES authors(author_id),
isbn VARCHAR(13) UNIQUE,
published_year INT,
genre VARCHAR(50),
barcode VARCHAR(50) UNIQUE
);
-- 3. Borrowers Table
CREATE TABLE borrowers (
borrower_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
membership_date DATE,
email VARCHAR(100) UNIQUE
);
-- 4. Borrowing records Table (tracking the borrowed books)
CREATE TABLE borrowings (
borrowing_id SERIAL PRIMARY KEY,
book_id INT REFERENCES books(book_id),
borrower_id INT REFERENCES borrowers(borrower_id),
borrow_date DATE,
return_date DATE
);
-- 5. Table to track changes (deltas, to discuss later)
CREATE TABLE book_changes (
book_id INT,
title TEXT,
author_id INT,
new_row_hash TEXT,
change_timestamp TIMESTAMP DEFAULT NOW(),
UNIQUE (book_id)
);
Populate the tables:
-- 1. Authors Table
INSERT INTO authors (author_id, first_name, last_name)
VALUES
(1, 'Victor', 'Hugo'),
(2, 'Fyodor', 'Dostoevsky'),
(3, 'Joris-Karl', 'Huysmans'),
(4, 'Thomas', 'Pynchon'),
(5, 'Herman', 'Melville'),
(6, 'Leo', 'Tolstoy'),
(7, 'George', 'Orwell'),
(8, 'Franz', 'Kafka'),
(9, 'James', 'Joyce');
-- 2. Books Table
INSERT INTO books (title, author_id, isbn, published_year, genre, barcode)
VALUES
('Les Miserables', 1, '9781234567890', 1862, 'Novel', '12345'),
('Crime and Punishment', 2, '9780987654321', 1866, 'Philosophical', '23456'),
('À Rebours', 3, '9781112131415', 1884, 'Novel', '34567'),
('Gravitys Rainbow', 4, '9782223242526', 1973, 'Novel', '45678'),
('Moby-Dick', 5, '9783334354647', 1851, 'Adventure', '56789'),
('War and Peace', 6, '9784445464748', 1869, 'Historical', '67890'),
('The Brothers Karamazov', 2, '9785556575859', 1880, 'Philosophical', '78901'),
('1984', 7, '9786667686970', 1949, 'Dystopian', '89012'),
('The Trial', 8, '9787778797081', 1925, 'Novel', '90123'),
('Ulysses', 9, '9788889808192', 1922, 'Modernist', '01234');
-- 3. Borrowers Table
INSERT INTO borrowers (first_name, last_name, membership_date, email)
VALUES
('Napoleon', 'Bonaparte', '2024-01-01', 'napoleon@example.com'),
('Albert', 'Einstein', '2024-01-02', 'einstein@example.com'),
('Oscar', 'Wilde', '2024-01-03', 'oscar@example.com'),
('George', 'Washington', '2024-01-04', 'george@example.com'),
('Mark', 'Twain', '2024-01-05', 'mark@example.com'),
('Karl', 'Marx', '2024-01-06', 'karl@example.com'),
('Charles', 'Darwin', '2024-01-07', 'darwin@example.com'),
('Cleopatra', '', '2024-01-08', 'cleo@example.com'),
('Leonardo', 'da Vinci', '2024-01-09', 'leo@example.com'),
('Queen', 'Victoria', '2024-01-10', 'victoria@example.com');
-- 4. Borrowings Table with deliberate duplicates for demonstration
INSERT INTO borrowings (book_id, borrower_id, borrow_date, return_date)
VALUES
(1, 1, '2024-11-01', '2024-11-10'),
(1, 1, '2024-11-01', '2024-11-10'), -- Duplicate entry
(2, 2, '2024-11-02', '2024-11-12'),
(3, 3, '2024-11-03', '2024-11-13'),
(4, 4, '2024-11-04', '2024-11-14'),
(5, 5, '2024-11-05', '2024-11-15'),
(5, 5, '2024-11-05', '2024-11-15'), -- Duplicate entry
(6, 6, '2024-11-06', '2024-11-16'),
(7, 7, '2024-11-07', '2024-11-17'),
(8, 8, '2024-11-08', '2024-11-18'),
(9, 9, '2024-11-09', '2024-11-19'),
(10, 10, '2024-11-10', '2024-11-20');
The example purposely includes duplicate entries in the borrowings table to illustrate some deduplication methods in the next sections.
Hash-based indexing
Hash-based indexing uses hash functions to map data entries to an index table or hash table. A hash function takes a table key as input and produces a hash value that determines the position of the corresponding data entry in the index. This way, the index points directly to the data location without reading through multiple entries, allowing records to be looked up quickly.
Below is a simplified diagram of how hash tables work.
Hash-based indexing structure
For example, let’s create a hash index on the barcode column of the books table.
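The statement is minimal; its form matches the index definition that appears in the listing below.
-- Create a hash-based index on the barcode column
CREATE INDEX idx_books_barcode ON books USING HASH (barcode);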
In the SQL code above, we create a new index called idx_books_barcode and tell PostgreSQL that it is a hash-based index.
PostgreSQL does not change actual tables when adding hash indexes. The books table remains unchanged, and you can see the newly created indexes in the PostgreSQL indexes table. When you query a specific title, PostgreSQL uses the index to quickly locate rows, making lookups faster when there are many book records.
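One way to list them is a small query against the standard pg_indexes catalog view, whose columns match the output shown below.
-- List all indexes defined on the books table
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename = 'books';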
It returns the following.
schemaname | tablename | indexname | indexdef |
---|---|---|---|
public | books | books_pkey | CREATE UNIQUE INDEX books_pkey ON public.books USING btree (book_id) |
public | books | books_isbn_key | CREATE UNIQUE INDEX books_isbn_key ON public.books USING btree (isbn) |
public | books | books_barcode_key | CREATE UNIQUE INDEX books_barcode_key ON public.books USING btree (barcode) |
public | books | idx_books_barcode | CREATE INDEX idx_books_barcode ON public.books USING hash (barcode) |
It’s important to note that hash indexes in PostgreSQL are optimized for equality comparisons, such as = or IN, but are unsuitable for range queries (e.g., <, >, or BETWEEN).
Why use hash-based indexing?
Hash-based indexing is beneficial for queries such as equality searches, where the query looks for exact key matches. Since we’re combining two concepts, index tables and hashing, we get the benefits of both:
Rapid data retrieval
Minimal amount of memory used when compared to other indexing methods
Consistent performance as the table size grows.
Hash-based indexing is particularly useful for large datasets. It provides almost constant time complexity for lookups because hash functions directly map keys to specific locations in a hash table.
Hash-based data deduplication
Data deduplication is a common data engineering practice that consists of removing duplicate records, saving storage, and improving data accuracy. You can use hash functions to create unique hash values for each record and then detect duplicates without performing full record comparisons.
Data deduplication methods
Several hash-based data deduplication techniques can be directly applied in a data engineering pipeline context.
Method | Description | Use cases |
---|---|---|
Unique constraints | Compute a hash for each entry and apply constraints on this column, preventing duplicate records from being added. | Duplicate blocking. |
Hash aggregation | Use aggregation functions on hash values to identify duplicates within a single query. | Identification and removal of duplicate entries based on combinations of columns. |
Cross-table hash comparison | Create hash values in both the source and target tables and then use these to match and remove duplicates. | Deduplication of data across tables, such as when merging data. |
Rolling hash | Use a rolling hash to create a "sliding window" of hashes. | Detection of duplicate records in real-time. |
Bloom filters | Use a probabilistic data structure that checks if an element is likely in a set by hashing it into bit arrays. | Extensive dataset filtering since it prevents the need to store full data for deduplication. |
As an example, let’s consider two of the most popular data deduplication methods:
The unique constraints method is very simple and creates a constraint for the specified column. If you try to add duplicate records to this column, PostgreSQL returns an error.
-- Add a unique constraint to the email column of the borrowers table
ALTER TABLE borrowers ADD CONSTRAINT unique_email UNIQUE (email);
-- Try to add duplicated entries
INSERT INTO borrowers (first_name, last_name, membership_date, email)
VALUES ('John', 'Doe', '2024-11-11', 'napoleon@example.com'); -- Duplicate email
The insert fails with a duplicate key violation referencing the unique_email constraint.
You can also use hash aggregation to detect duplicate borrowings, as shown below.
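A possible form of the query (a sketch; the exact columns included in the hash may differ from the original) hashes each row's key fields, groups on that hash, and keeps only groups with more than one row.
-- Detect duplicate borrowings by grouping on a per-row hash
SELECT
    book_id,
    borrower_id,
    COUNT(*) AS duplicates
FROM borrowings
GROUP BY
    MD5(book_id::TEXT || borrower_id::TEXT || borrow_date::TEXT || return_date::TEXT),
    book_id,
    borrower_id
HAVING COUNT(*) > 1;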
Hash aggregation correctly flags the duplicate entries added earlier when populating the tables. Since it is a detection method, we end up with the following:
book_id | borrower_id | duplicates |
---|---|---|
1 | 1 | 2 |
5 | 5 | 2 |
Hash-based CDC
Change Data Capture (CDC) is a set of data engineering practices that detect changed data in a given dataset. These changes are called “deltas,” and the output dataset is called a “delta-driven dataset.”
Any CDC approach includes three main components that execute sequentially:
Identification detects rows where changes have occurred using methods like hash comparisons or timestamps.
Capture extracts identified changes from the source, preparing them for downstream processing.
Delivery sends captured changes to the target system for updated data synchronization.
CDC relies on identifying and capturing changes; since hashing excels at producing unique, compact data representations, you can use it to detect even minor changes between datasets, similar to how we would monitor duplicate entries in a deduplication approach.
Hash functions for CDC
Below are the four main hash functions used in CDC, along with their main characteristics:
Method | Speed | Security against collisions | Efficiency for small data | Scalability | Consistency |
---|---|---|---|---|---|
MD5 | Y | N | Y | Y | Y |
SHA-256 | Y* | Y | Y | Y | Y |
CRC32 | Y | N | Y | Y | Y |
FNV | Y | N | Y | Y | Y |
*SHA-256 is considered moderately fast when compared to other functions (e.g., MD5)
When selecting a hash function for CDC, consider the following:
Speed: A fast hash function minimizes overhead when comparing rows.
Security against collisions: Collisions can lead to inaccurate change detection. A hash function with higher collision resistance ensures more reliable tracking, but it may not always be critical in less sensitive environments.
Efficiency for small increments: A hash function that can handle small data increments, such as individual rows or changes, reduces latency and resource consumption.
Scalability: The hash function must process millions of rows without significant degradation in performance.
Consistency: The same input must always produce the same hash, which is critical for CDC. Without this, identifying new changes would be unreliable.
Hash CDC example
Let’s introduce a change data capture (CDC) system that monitors changes to a given column in a table from the public library database example. CDC involves several sequential steps.
Step 1- Declare the CDC function:
-- 1. Create a function to compute a hash value for a row (using MD5)
CREATE OR REPLACE FUNCTION calculate_row_hash(book_id INT, title TEXT, author_id INT)
RETURNS TEXT AS $$
BEGIN
RETURN MD5(COALESCE(book_id::TEXT, '') || COALESCE(title, '') || COALESCE(author_id::TEXT, ''));
END;
$$ LANGUAGE plpgsql;
Step 2—Create a row_hash column in the books table to store a hash value that uniquely represents each row. This hash is calculated from the book's book_id, title, and author_id, acting like a digital fingerprint for each row; a sketch of this step follows below.
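A minimal sketch of this step, assuming the column is simply added and backfilled with the calculate_row_hash() function from step 1:
-- Add a row_hash column to books and populate it for every existing row
ALTER TABLE books ADD COLUMN IF NOT EXISTS row_hash TEXT;
UPDATE books
SET row_hash = calculate_row_hash(book_id, title, author_id);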
Every time a change happens (e.g., a book's title or author changes), the row_hash identifies what changed. The book_changes table stores these changes, but only for books that have been modified. If the book already exists in book_changes, it updates the record with the new data and the new hash:
-- 3. Capture and Store Changes (Deltas) - Insert only changed rows into book_changes table
INSERT INTO book_changes (book_id, title, author_id, new_row_hash)
SELECT
book_id,
title,
author_id,
row_hash AS new_row_hash
FROM books
WHERE book_id = 1
ON CONFLICT (book_id) DO UPDATE
SET
title = excluded.title,
author_id = excluded.author_id,
new_row_hash = excluded.new_row_hash,
change_timestamp = NOW();
Step 3—To see which rows have changed, join the book_changes table (which tracks changes) with the books table (which contains the original data). This query, shown below in step 5 (query 7), retrieves the book's book_id, the original title and author, and the new hash value created after the change, then sorts the results by when the change happened so the most recent changes appear first.
Step 4—To test the change detection, update the title of a book (book_id = 1). The row_hash for that row is recalculated from the new title and stored in the row_hash column, and the change is then captured in the book_changes table, as sketched below.
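One way to do this manually (a sketch; the original setup may instead recompute the hash via a trigger) is:
-- Update the title of book 1 and recompute its row hash
UPDATE books
SET title = 'Les Miserables Updated',
    row_hash = calculate_row_hash(book_id, 'Les Miserables Updated', author_id)
WHERE book_id = 1;
Re-running the capture statement from the earlier snippet (comment 3) then upserts the new hash into book_changes.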
Step 5—List the changed records using two different formats:
-- 6. View changed records
SELECT
book_id,
title,
author_id,
new_row_hash,
change_timestamp
FROM book_changes;
-- 7. View changed records with their original and new hash values
SELECT
b.book_id,
b.title AS original_title,
b.author_id AS original_author_id,
bc.new_row_hash AS new_hash,
b.row_hash AS original_hash,
bc.change_timestamp
FROM
book_changes bc
JOIN
books b ON bc.book_id = b.book_id
ORDER BY
bc.change_timestamp DESC;
Final SQL result.
book_id | original_title | original_author_id | new_hash | original_hash | change_timestamp |
---|---|---|---|---|---|
1 | Les Miserables Updated | 1 | 806c0d25073ccc6a31629e3caa864419 | 4eefca8fcb17c75cde341219dd370eaf | 33:33.2 |
Hash-based CDC vs. other CDC methods
Hash-based CDC methods are designed to excel in real-time environments. They quickly process large datasets with minimal overhead for tracking changes in high-frequency, high-volume systems. Plus, they’re easy to implement, with less complexity than other approaches.
However, hash collisions introduce challenges depending on the monitoring type; other methods might be better if you need precise, field-level tracking or stronger data integrity. Log-based CDC, for example, offers granular information about exactly which fields changed and avoids collision risks altogether.
DataForge’s hash-based CDC is unique in that it processes changes efficiently and addresses these challenges by enabling users to track hash-based data lineage through its UI. This UI simplifies monitoring and minimizes the error rate, ensuring reliability even in scenarios where common hash-based methods might fall short due to collision risks or lack of granularity.
Hash-based vs. log-based CDC
Hash joins
Hash joins leverage hashing to organize data from each table into hash tables, allowing quick lookups and minimizing the need for full-table scans. They do this by first hashing rows from one table by the join key, creating a quick-access map. Then, as the other table is scanned, each row is checked against this map for fast matching, avoiding a full comparison of both tables.
Figure N: A generic hash join diagram
This efficiency makes hash joins particularly valuable in ETL processes, where large datasets need to be joined and transformed efficiently, or in distributed databases, where minimizing data movement and optimizing join performance is critical.
Approaches for implementing hash joins
Hash joins can be implemented in various ways, depending on the context we’re working in. The main differentiating factor is memory usage; each method is designed to optimize join performance based on memory constraints and dataset size.
Join | Description | Pro | Con |
---|---|---|---|
In-memory hash join | Uses available memory to store the hash table. | It’s fast and efficient for small datasets that fit within RAM. | Can’t handle large tables due to memory constraints. |
Partitioned hash join | Splits tables into smaller partitions and hashes each separately. | Reduces memory requirements and scales to larger datasets. | Introduces some overhead for partition management. |
Grace hash join | A variation of the partitioned join—writes partitions to disk. | Handles very large datasets by processing them in manageable chunks. | Introduces some overhead for disk management. |
Hybrid hash join | Starts with an in-memory hash but adapts to partitioned processing if memory overflows. | Balances efficiency and scalability; ideal for varying data sizes. | More complex to implement and tune than the other approaches. |
Hash join example
Let us go back to the library example and illustrate how to perform a hash join using two tables.
-- 1. Create a hash index on the book_id column of borrowings
CREATE INDEX idx_borrowings_book_id ON borrowings USING HASH (book_id);
-- 2. Perform the actual join
SELECT
b.book_id,
b.title,
b.author_id,
bo.borrower_id,
bo.borrow_date,
bo.return_date
FROM
books b
JOIN
borrowings bo
ON
b.book_id = bo.book_id;
If the code above looks like a conventional join operation, that’s because it is; in PostgreSQL, joins are optimized by the query planner, so the actual HASH JOIN operation happens internally rather than being written explicitly. The hash index on book_id simply gives the planner another access path (you can check which join strategy it picked with EXPLAIN, as sketched after the results below). The query returns the following:
book_id | title | author_id | borrower_id | borrow_date | return_date |
---|---|---|---|---|---|
1 | Les Miserables Updated | 1 | 1 | 11/1/2024 | 11/10/2024 |
1 | Les Miserables Updated | 1 | 1 | 11/1/2024 | 11/10/2024 |
2 | Crime and Punishment | 2 | 2 | 11/2/2024 | 11/12/2024 |
3 | À Rebours | 3 | 3 | 11/3/2024 | 11/13/2024 |
4 | Gravitys Rainbow | 4 | 4 | 11/4/2024 | 11/14/2024 |
5 | Moby-Dick | 5 | 5 | 11/5/2024 | 11/15/2024 |
5 | Moby-Dick | 5 | 5 | 11/5/2024 | 11/15/2024 |
6 | War and Peace | 6 | 6 | 11/6/2024 | 11/16/2024 |
7 | The Brothers Karamazov | 2 | 7 | 11/7/2024 | 11/17/2024 |
8 | 1984 | 7 | 8 | 11/8/2024 | 11/18/2024 |
9 | The Trial | 8 | 9 | 11/9/2024 | 11/19/2024 |
10 | Ulysses | 9 | 10 | 11/10/2024 | 11/20/2024 |
Note that a hash join doesn't deduplicate data during the join process; it simply matches rows based on the hash of the book_id column.
When the same book_id appears more than once in borrowings, those duplicate entries are matched with the books table, leading to multiple output rows for the same book.
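To confirm which join strategy the planner actually chose, you can prefix the query with EXPLAIN. This is a quick diagnostic sketch; the plan you see depends on table sizes and statistics.
-- Inspect the query plan; look for a "Hash Join" node in the output
EXPLAIN
SELECT b.book_id, b.title, bo.borrower_id, bo.borrow_date
FROM books b
JOIN borrowings bo ON b.book_id = bo.book_id;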
Hash functions for data sharding
Data sharding consists of splitting a large dataset across multiple database nodes or partitions to distribute load, improve performance, and enhance scalability. By dividing data into "shards," each node handles a subset of the data, reducing bottlenecks and making the system more efficient and resilient to traffic spikes.
Hash-based sharding assigns data to shards based on the output of a hash function. Each data record is hashed using a specific key (like a user ID), and the resulting hash value determines the target shard. For example, a common approach is to take the hash output modulo the number of available shards (hash(key) % shard_count). This approach balances data across shards by distributing records in a way that’s both consistent and computationally efficient.
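A minimal sketch of the modulo approach in SQL, assuming PostgreSQL's built-in hashtext() helper (whose exact values are not guaranteed to be stable across versions) and an arbitrary shard count of 4:
-- Assign each borrower to one of 4 shards based on a hash of their email
SELECT
    borrower_id,
    email,
    abs(hashtext(email)) % 4 AS shard_id
FROM borrowers;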
Benefits of hash-based sharding:
Distributes data evenly, avoiding hot spots or overloaded shards.
Adding or removing shards only requires adjusting the modulo divisor, though doing so remaps most existing keys, so resharding should be planned as data volume grows.
Each key reliably maps to a specific shard, simplifying data retrieval and maintaining quick lookup times.
Generalized hash-based data sharding process flow
Hash functions for data obfuscation
Data obfuscation is the process of transforming information in such a way that the original value is not accessible to unauthorized users. This practice is widely used in data engineering pipelines dealing with sensitive records such as personal identities, financial data, medical records, and other confidential information.
Since hashing is a one-way process, it is a solid choice for data obfuscation. By transforming sensitive data into fixed-length, seemingly random strings, hashing can effectively conceal the underlying information while preserving certain structural properties necessary for downstream processing.
Anonymizing data using hash-based obfuscation
The specific hashing approach used for obfuscation can vary depending on the requirements of the data pipeline. For smaller datasets, in-memory hashing may be a fast and efficient solution. However, larger volumes of sensitive data may call for partitioned hashing strategies, which divide the data into manageable chunks to reduce memory demands.
More advanced obfuscation techniques, such as salted hashing, incorporate unique "salt" values into the hashing process to provide an additional layer of protection. This helps defend against attacks like rainbow table lookups and further secures the concealed data.
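A minimal sketch of a salted variant, assuming the pgcrypto extension is enabled (as in the next snippet) and using a hypothetical application-level secret as the salt:
-- Salted SHA-256: prepend a secret salt before hashing to resist rainbow table lookups
SELECT
    borrower_id,
    encode(digest('app_secret_salt' || email, 'sha256'), 'hex') AS salted_hashed_email
FROM borrowers;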
For example, let’s obfuscate a sensitive field in the borrowers table:
-- 0. Enable pgcrypto extension (only needs to be done once)
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- 1. Add a column for the hashed email (IF NOT EXISTS makes the step idempotent)
ALTER TABLE borrowers
ADD COLUMN IF NOT EXISTS hashed_email TEXT;
-- 2. Create a function to hash the email using SHA-256 for stronger security
CREATE OR REPLACE FUNCTION hash_email(input_email TEXT)
RETURNS TEXT AS $$
-- Hash the email using SHA-256 and return as a hex string
SELECT encode(digest(input_email, 'sha256'), 'hex');
$$ LANGUAGE sql;
-- 3. Update the borrowers table with the hashed email
UPDATE borrowers
SET hashed_email = hash_email(email);
-- 4. View the updated borrowers with hashed emails
SELECT borrower_id, first_name, last_name, email, hashed_email
FROM borrowers;
By now, we should be familiar with how a hash value looks, which is exactly what we get from the code above.
borrower_id | first_name | last_name | email | hashed_email |
---|---|---|---|---|
1 | Napoleon | Bonaparte | napoleon@example.com | c4ba331b66a0dcbc08757a771b31eeb87850c3373b6bda24ddee05b80ada8a88 |
2 | Albert | Einstein | einstein@example.com | 5efce675f54c30064875bca8f75ba2acedf594f326434f4c9c1a545cc7597984 |
3 | Oscar | Wilde | oscar@example.com | 0ece59855504dda4281f7fe85cd8d8a653a6aeefd9b1885767d6b89b08ba19a9 |
4 | George | Washington | george@example.com | 47508cb4512f7b46aaeb34b4a514c63d63ec240bf28cbc4410addd4867c10066 |
5 | Mark | Twain | mark@example.com | 70b4f63574fbce65d870a6ba773f57eaabcc39e8f94b63bb840776771919dc65 |
6 | Karl | Marx | karl@example.com | 28d554c51d4197942ae283cb2dee2241919224956f1e9aaff605a747a7ef318d |
7 | Charles | Darwin | darwin@example.com | f94769776eba1492b0a10f92e9ca50395192454dd15f714d5663aa41e06f6e52 |
8 | Cleopatra | | cleo@example.com | b11b6ad3e195b530c96c9f94774644c46383b81f62116126608814c021d70f04 |
9 | Leonardo | da Vinci | leo@example.com | 6473c0297478fe29d2e428c34026021e02a59655b7f5aa2c10c2d607f3fe4354 |
10 | Queen | Victoria | victoria@example.com | 0015778c4d4f8637d659bf95806a07ee06368ee94623ec9d1346f1096336593f |
Hash partitioning
Hash partitioning is a common technique used in data engineering to improve the performance and scalability of data processing pipelines. The basic idea is to divide a large dataset into smaller, more manageable partitions based on the hash function values of one or more columns.
Hash vs. range partitioning
Hash partitioning leverages the uniform distribution properties of hash functions to assign data to partitions. By hashing the value of a chosen partitioning column, the data is evenly distributed across partitions, ensuring a balanced workload. This contrasts with range partitioning, which assigns data to partitions based on the actual value ranges of the partitioning column(s).
For example, consider a table of customer orders. With range partitioning, we might partition the data by order date, creating partitions for orders from 2022-Q1, 2022-Q2, and so on. In hash partitioning, we could instead partition by hashing the customer ID, which would distribute the orders more evenly regardless of the dates.
The choice between hash and range partitioning depends on the specific requirements of our data pipeline. Hash partitioning excels when you want to:
Evenly distribute a large dataset for parallel processing.
Avoid hotspots or skew in our partitions.
Quickly locate data based on a unique identifier.
Range partitioning, on the other hand, is preferable when you want to:
Query data within a specific range (e.g., orders from a given quarter).
Require more human-readable partition names (e.g., date ranges).
However, in most cases, a hybrid approach can be used.
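As a sketch of what hash partitioning can look like in practice, PostgreSQL (version 11 and later) supports declarative hash partitioning. The table below is hypothetical and mirrors the borrowings table from the running example, split into four partitions on borrower_id.
-- Parent table partitioned by hash of borrower_id
CREATE TABLE borrowings_hashed (
    borrowing_id INT,
    book_id INT,
    borrower_id INT,
    borrow_date DATE,
    return_date DATE
) PARTITION BY HASH (borrower_id);
-- Four partitions; each row lands where hash(borrower_id) % 4 matches the remainder
CREATE TABLE borrowings_hashed_p0 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE borrowings_hashed_p1 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE borrowings_hashed_p2 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE borrowings_hashed_p3 PARTITION OF borrowings_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 3);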
Data compression
Data compression represents data in a more compact form without losing the underlying information. It is a fundamental technique used in data engineering to reduce the storage footprint.
Hash-based compression techniques leverage the unique properties of hashing to achieve the following:
Deduplication: Identical data elements produce the same hash value, allowing you to easily identify and remove duplicate entries.
Indexing: The uniform distribution of hash values makes it easy to create efficient indexes for quickly locating specific data within the compressed dataset.
Compact representations: In many cases, the hash value can be stored in place of the original data, achieving substantial space savings.
Integrity checks: Hash values can generate checksums or other validation mechanisms to verify that decompressed data has retained its completeness and accuracy.
Most of the methods can be applied in combination with each other, as shown.
-- 1. Deduplication: remove duplicate rows based on a hash of book details
--    (reusing the hash_email() helper from earlier as a generic text hasher)
WITH hashed_books AS (
SELECT book_id, hash_email(title || author_id || isbn) AS book_hash
FROM books
)
DELETE FROM books
WHERE book_id NOT IN (SELECT MIN(book_id) FROM hashed_books GROUP BY book_hash);
-- 2. Indexing: Creating a hash-based index on a computed concatenated column (book details)
ALTER TABLE books ADD COLUMN IF NOT EXISTS title_author_isbn_hash TEXT;
UPDATE books SET title_author_isbn_hash = hash_email(title || author_id || isbn);
CREATE INDEX idx_books_hash ON books (title_author_isbn_hash);
-- 3. Compact Representations: Storing only hash values for titles
ALTER TABLE books ADD COLUMN IF NOT EXISTS title_hash TEXT;
UPDATE books SET title_hash = hash_email(title);
-- 4. Integrity Checks: Generating a checksum to validate data integrity after compression
SELECT book_id, title, encode(digest(title || author_id || isbn, 'sha256'), 'hex') AS integrity_checksum
FROM books;
The resulting “books” table after hash-based data compression transformations.
book_id | title | integrity_checksum |
---|---|---|
2 | Crime and Punishment | a26fb63c594e0f293f09a66bc9415ef2392891afac725ffd304e22e044d4f991 |
3 | À Rebours | 59f0c92af15edd76aad8d39443beaacde73ea449b9555c61ee84d45206de40bb |
4 | Gravitys Rainbow | 514f6f450f92c9ceebbf047bcf3dc01c5de30b717bdbffafd5e15cf1bb62cba4 |
5 | Moby-Dick | ca87ea21b996cb794832225189c305a9d6b18d5d3cd948d099d9bfe7449dfcdb |
6 | War and Peace | f5f5b5ab2078a6b46d1c9625ac5319cd090700a58a9117ab9ea1a2e497045f9c |
7 | The Brothers Karamazov | 69c09b753fb984864460f03c75be1a6615321e6a0969efd30f39c20ab094986c |
8 | 1984 | fcd70a1a71f6f633544851bbbad9bdad4702e04fe713543e014ccd12aeec7641 |
9 | The Trial | 89a72df81ee222bf6181e37a90ce3c7e94ccfb89073890a304b13a4874887b80 |
10 | Ulysses | 2af29ed5d64df3d9a4884a2b1af9b5e16a6fbea4cb763f0b6a1862430fda6db1 |
1 | Les Miserables Updated | 0ed876acf21033709f9111b6b17125856f92f06cf3a0a2864805d2e98fcf620c |
Last thoughts
The flexibility and versatility of hashing make it suitable for various applications and adaptable to virtually any use case. Applying hash functions in data engineering pipelines unlocks a powerful toolset that enhances data integrity, reduces redundancy, and optimizes performance across ETL pipelines. For example, DataForge uses hash-based CDC to monitor data changes at per-row granularity, allowing for real-time updates and facilitating reliable auditing processes.