Advanced SQL Concepts for Data Engineers

Multi-chapter guide | Chapter 1

A decade ago, simple tabular structures were enough to represent most of the data that data engineers worked with. There were not many devices actively and passively collecting information. Additionally, the data volume that data pipelines managed was limited as cloud computing was more expensive. 


Today, things are different — data does not quite fit into neat tabular arrangements anymore, and the annual information volume has seen more than a 10-fold increase since 2014.


Due to these challenges, data engineers have become increasingly creative with data querying, transformation, and storage, and many have adopted advanced SQL techniques that go far beyond the basics. This article explores five advanced SQL concepts actively used in production environments. Each represents a practical solution to real-world problems that arise daily when working with today's larger and more complex datasets.

Summary of key advanced SQL concepts

Concept Description
Data modeling with JSON and semi-structured data Use the ->> operator to retrieve text from a JSON object, then cast it to the required data type, for example, a numeric value for comparison.
Advanced query patterns with Common Table Expressions (CTEs) Use the WITH clause to define a temporary result set.
Window functions Apply ROW_NUMBER(), RANK(), SUM(), etc., to perform cumulative calculations within defined partitions, enabling calculations across rows without collapsing them into groups.
Subqueries & lateral joins Use correlated subqueries to access values from the outer query, and lateral joins (the JOIN LATERAL syntax) to reference columns from earlier tables in the query, allowing row-by-row operations.
Temporal data management in data warehouses Create temporal tables to track the state of other tables, use the PARTITION BY RANGE syntax for time-based partitioning, run time travel queries to access historical data as it existed at a point in time, and use the OVER clause for window functions.

We illustrate these concepts using a PostgreSQL e-commerce website, starting with a basic schema and incrementally building on it. The website database contains:

Table Description
Products Product information such as barcode, description, price
Customers Customer information such as ID, name, email, address
Orders Order information such as order timestamp, customer
Reviews Review information such as customer name, rating, etc.
Categories Category ID and description. Each product is associated with a single category.

#1 Data modeling with JSON and semi-structured data

Modern APIs use semi-structured formats such as JSON, XML, and YAML due to their flexibility, ease of use, and lightweight nature. Many modern SQL platforms have mechanisms for working with semi-structured JSON data. For example, PostgreSQL has two main JSON data types: 

  • JSON stores an exact copy of the input text, which processing functions must reparse on each execution.

  • JSONB stores data in a decomposed binary format and supports indexing. 

Both hold the same JSON documents, but their performance characteristics differ. JSONB's binary representation does not need to be reparsed on every read, saving time and resources in read-heavy transformation workloads, and only JSONB supports indexing. However, JSONB ingestion is slower because the input text must be parsed into the binary format on write. 
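As a brief sketch of why indexing matters, the snippet below uses a hypothetical events table (not part of the article's schema) with a GIN index, which PostgreSQL can use for JSONB containment lookups:

```sql
-- Hypothetical table with a JSONB payload (for illustration only)
CREATE TABLE events (
    event_id SERIAL PRIMARY KEY,
    payload JSONB NOT NULL
);

-- A GIN index supports the containment (@>) and key-existence (?)
-- operators on JSONB columns
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- This lookup can use the GIN index instead of scanning every row
SELECT *
FROM events
WHERE payload @> '{"status": "shipped"}';
```

An equivalent column of plain JSON type could not be indexed this way, which is why JSONB is usually preferred for frequently queried fields.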

To see this in practice, let's create the “Products” and “Reviews” tables of the e-commerce website. The “Reviews” table ingests data from an external API. It includes fields such as the review ID and product barcode, along with an api_response field that stores the original JSON response from the API, which holds keys such as the reviewer name, rating, and comment. The api_response field uses the JSONB data type.

-- Create Products Table
CREATE TABLE Products (
	barcode VARCHAR(20) PRIMARY KEY,
	name VARCHAR(100) NOT NULL,
	description TEXT,
	price DECIMAL(10, 2) NOT NULL
);

-- Create Reviews Table
CREATE TABLE Reviews (
	review_id SERIAL PRIMARY KEY,
	product_barcode VARCHAR(20) REFERENCES Products(barcode),
	api_response JSONB NOT NULL
);
  

You can then populate the “Products” and “Reviews” tables as follows.

-- Insert two products
INSERT INTO Products (barcode, name, description, price)
VALUES
('123456789', 'The Optimists'' Calendar', 'Perpetual calendar that always shows tomorrow as a "better day". Requires a monthly subscription.', 49.99),
('987654321', 'Polystyrene Rosebud', 'Decorative rosebud that emits beautiful smells throughout your home.', 199.99);

-- Insert reviews for the first product
INSERT INTO Reviews (product_barcode, api_response)
VALUES
('123456789',
	'{"reviewer": "John Doe",
	"rating": 5.0,
	"comment": "Great product!",
	"verified": true}'
),
('123456789',
	'{"reviewer": "Jane Smith",
	"rating": 3.8,
	"comment": "Starting to feel more positive after 1 month. Worth it!",
	"verified": true}'
),
('123456789',
	'{"reviewer": "Bob Johnson",
	"rating": 1.0,
	"comment": "Subscription? Are you kiddin''?",
	"verified": false}'
);

-- Insert reviews for the second product
INSERT INTO Reviews (product_barcode, api_response)
VALUES
('987654321',
	'{"reviewer": "Alice Brown",
	"rating": 1.5,
	"comment": "Terrible quality. Broke after one use.",
	"verified": true}'
),
('987654321',
	'{"reviewer": "Charlie Davis",
	"rating": 5.0,
	"comment": "Choking hazard! Otherwise good smells!!",
	"verified": true}'
),
('987654321',
	'{"reviewer": "Eve Wilson",
	"rating": 1.0,
	"comment": "Worst purchase ever. Got a massive rash just by looking at it!!",
	"verified": true}'
),
('987654321',
	'{"reviewer": "Frank Miller",
	"rating": 1.2,
	"comment": "This product smells like the public loo. Will never buy again.",
	"verified": false}'
);  
  

Check the resulting tables.

  
-- Select all products & reviews
SELECT * FROM Products;
SELECT * FROM Reviews;  
  

Products table

barcode name description price
123456789 The Optimists’ Calendar Perpetual calendar that always shows tomorrow as a “better day”. Requires a monthly subscription. 49.99
987654321 Polystyrene Rosebud Decorative rosebud that emits beautiful smells throughout your home. 199.99

Reviews table

review_id product_barcode api_response
1 123456789 {"rating": 5.0, "comment": "Great product!", "reviewer": "John Doe", "verified": true}
2 123456789 {"rating": 3.8, "comment": "Starting to feel more positive after 1 month. Worth it!", "reviewer": "Jane Smith", "verified": true}
3 123456789 {"rating": 1.0, "comment": "Subscription? Are you kiddin'?", "reviewer": "Bob Johnson", "verified": false}
4 987654321 {"rating": 1.5, "comment": "Terrible quality. Broke after one use.", "reviewer": "Alice Brown", "verified": true}
5 987654321 {"rating": 5.0, "comment": "Choking hazard! Otherwise good smells!!", "reviewer": "Charlie Davis", "verified": true}
6 987654321 {"rating": 1.0, "comment": "Worst purchase ever. Got a massive rash just by looking at it!!", "reviewer": "Eve Wilson", "verified": true}
7 987654321 {"rating": 1.2, "comment": "This product smells like the public loo. Will never buy again.", "reviewer": "Frank Miller", "verified": false}

We inserted the data using conventional JSON notation, but it is not stored that way: PostgreSQL parses the input on insertion and converts it to JSONB's binary format. Retrieval reverses the process, so if we query the table, we get back the same data we inserted.

To avoid JSON in the retrieved results, you can use the ->> operator to retrieve a value from the JSON object as text, then cast it to the required data type, for example, to numeric for comparison.

-- Return JSON values
SELECT product_barcode, api_response->>'reviewer' AS reviewer,
   	(api_response->>'rating')::numeric AS rating
FROM Reviews
WHERE (api_response->>'rating')::numeric > 3; 
  

Product ratings with rating greater than 3

product_barcode reviewer rating
123456789 John Doe 5.0
123456789 Jane Smith 3.8
987654321 Charlie Davis 5.0

By using JSONB, you can also create indexes on specific fields within the JSON structure, just like with regular columns in a relational database. These indexes allow PostgreSQL to quickly locate records, potentially improving query performance for large datasets. 

For example, you can create an index on the 'rating' field:

-- Create index
CREATE INDEX idx_review_rating ON Reviews ((api_response->>'rating')); 
  

We have outlined one approach above. While various approaches exist to parse and store semi-structured data within data pipelines, the challenge increases when dealing with large nested arrays and hierarchical structures. Modern data platforms have developed solutions to address this - for instance, DataForge recently introduced sub-sources, which allow the creation of virtual tables from nested complex arrays (ARRAY<STRUCT<..>>) without physical normalization.

#2 Advanced query patterns with Common Table Expressions (CTEs)

Common Table Expressions (CTEs) are temporary subqueries that exist in the context of the larger query. You can conceptualize them as “virtual tables” that can be referenced multiple times within a single SQL statement. They provide another level of abstraction to facilitate complex querying, reusability, and scalability.

For example, let’s say you want to analyze product performance based on their reviews. You can create a query that:

  • Calculates the average product rating

  • Determines above-average products and returns their data

-- Calculate average for each product
WITH avg_product_ratings AS (
	SELECT
    	product_barcode,
    	AVG((api_response->>'rating')::numeric) as avg_rating
	FROM Reviews
	GROUP BY product_barcode
) 
  

The above CTE transforms raw review data into a simple table containing each product's average rating. It extracts the rating from the JSONB structure using the same technique we discussed earlier.

-- Extend calculation to all products
, overall_average AS (
	SELECT AVG(avg_rating) as overall_avg
	FROM avg_product_ratings
) 
  

The above CTE calculates the average rating across all products:

-- Combine both expressions to a single query
SELECT
	p.name,
	p.price,
	apr.avg_rating,
	oa.overall_avg
FROM avg_product_ratings apr
JOIN Products p ON p.barcode = apr.product_barcode
CROSS JOIN overall_average oa
WHERE apr.avg_rating > oa.overall_avg
ORDER BY apr.avg_rating DESC;  
  

Combining both CTEs, you get the products whose average rating is greater than the overall average.

name price avg_rating overall_avg
The Optimists' Calendar 49.99 3.2666 2.7208

Without CTEs, you would need to either nest these queries, making them harder to read and maintain, or create temporary tables, introducing a wide array of extra work and potential conflicts with current tables.


For comparison purposes, here is the same query using temporary tables.

-- Create first temporary table
CREATE TEMPORARY TABLE avg_product_ratings AS
	SELECT product_barcode,
       	AVG((api_response->>'rating')::numeric) as avg_rating
	FROM Reviews
	GROUP BY product_barcode;

-- Create second temporary table
CREATE TEMPORARY TABLE overall_average AS
	SELECT AVG(avg_rating) as overall_avg
	FROM avg_product_ratings;

-- Final query
SELECT [...]

-- Cleanup
DROP TABLE avg_product_ratings;
DROP TABLE overall_average; 
  

CTEs simplify query writing and are popular. However, they can constrain the planner and lead to suboptimal query plans, and a CTE can only be referenced within the single SQL statement in which it is defined. For example, when PostgreSQL inlines a CTE that the main query references more than once, its results may be recomputed for each reference, creating performance bottlenecks. This often forces developers to use temporary tables when the same data transformation is needed across multiple stages or queries.
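Since PostgreSQL 12, you can steer this behavior explicitly. As a sketch against the article's Reviews table, NOT MATERIALIZED asks the planner to inline the CTE so that predicates from the outer query can be pushed into it:

```sql
-- PostgreSQL 12+: request inlining so the planner can push the
-- rating filter down into the CTE's scan of Reviews
WITH product_ratings AS NOT MATERIALIZED (
    SELECT product_barcode,
           (api_response->>'rating')::numeric AS rating
    FROM Reviews
)
SELECT product_barcode, rating
FROM product_ratings
WHERE rating > 3;
```

The opposite hint, MATERIALIZED, forces the CTE to be computed once and reused, which is the subject of the next section.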

Materialized CTEs

Materialized CTEs offer a way to compute a CTE's results once and reuse them for the duration of the query. Instead of recalculating these results every time they are referenced, PostgreSQL (version 12 onwards) reads back the stored result set. This improves query performance where:

  1. The CTE is referenced multiple times in the query.

  2. Each reference would be costly to compute.

However, materialized CTEs can also negatively impact performance if the CTE result set is large enough or it’s used sparsely throughout the query. Here’s an example creating a materialized CTE for average product ratings.

-- Define materialized CTE explicitly
WITH MATERIALIZED avg_product_ratings AS (
	SELECT
    	product_barcode,
    	AVG((api_response->>'rating')::numeric) as avg_rating
	FROM Reviews
	GROUP BY product_barcode
)
  

The MATERIALIZED keyword tells PostgreSQL to compute the average ratings once and store the results in memory.

You can add another CTE to calculate the overall average rating and use both the same way as before.

  , overall_average AS (
	SELECT AVG(avg_rating) as overall_avg
	FROM avg_product_ratings
)
  
  SELECT
	p.name,
	p.price,
	apr.avg_rating,
	oa.overall_avg
FROM avg_product_ratings apr
JOIN Products p ON p.barcode = apr.product_barcode
CROSS JOIN overall_average oa
WHERE apr.avg_rating > oa.overall_avg
ORDER BY apr.avg_rating DESC;
  

The avg_product_ratings expression is materialized and referenced twice: once to calculate the overall average and again in the final SELECT statement. Materialization ensures the potentially expensive work of extracting and averaging ratings from the JSONB data is performed only once, even though the results are used multiple times.

Recursive CTEs

Recursive CTEs are relevant with hierarchical or self-referential data structures, allowing you to query deeply nested organizational structures, large category trees, etc. Like programmatic recursion, they consist of two main components: a base case (initial query) and a recursive part that references the CTE itself. The recursive expression is evaluated upon query execution, building up the result set until no more rows are produced.

While recursive CTEs might sound like a natural fit for nested structures, they are used sparingly because of their potential impact on query performance: poorly designed recursive parts can loop indefinitely, and computational cost often grows steeply with recursion depth, making them difficult to get right.
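As a minimal sketch, here is what a recursive CTE over a category hierarchy could look like. It assumes a self-referencing parent_category_id column on the Categories table, which is not part of this article's schema, and includes a depth guard against the runaway recursion discussed above:

```sql
-- Walk a category tree (parent_category_id is an assumed column)
WITH RECURSIVE category_tree AS (
    -- Base case: top-level categories with no parent
    SELECT category_id, description, 1 AS depth
    FROM Categories
    WHERE parent_category_id IS NULL

    UNION ALL

    -- Recursive part: children of the rows found so far
    SELECT c.category_id, c.description, ct.depth + 1
    FROM Categories c
    JOIN category_tree ct ON c.parent_category_id = ct.category_id
    WHERE ct.depth < 10  -- guard against infinite recursion
)
SELECT category_id, description, depth
FROM category_tree
ORDER BY depth, category_id;
```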

#3 Window functions

Window functions, introduced in the SQL:2003 standard, compute values over a set of related rows, referred to as a "window" of rows. Unlike an aggregate query with GROUP BY, they return a value for each row without collapsing the rows involved.

In essence, window functions operate on a data "window" that can be based on partitions, ordered sets, or even a specified number of rows before and after the current row. This capability enables complex operations without resorting to unnecessarily complex queries.

All window functions consist of two main components:

  1. The window function itself, which can be an aggregate function such as SUM, AVG, or COUNT, or a ranking function such as RANK, DENSE_RANK, or ROW_NUMBER.

  2. An OVER clause that specifies how to partition the rows (PARTITION BY) and how to order the rows (ORDER BY).

For example, let's say you want to calculate a running total of each product's sales and rank products by sales amount. Create the “Orders” table and add some entries to it.

  -- Create new table “Orders”
CREATE TABLE Orders (
	order_id SERIAL PRIMARY KEY,
	product_barcode VARCHAR(20) REFERENCES Products(barcode),
	order_date DATE,
	sales_amount DECIMAL(10, 2)
);

-- Insert 3 orders into the Orders table
INSERT INTO Orders (product_barcode, order_date, sales_amount)
VALUES
      -- Order for 'The Optimists' Calendar'
	('123456789', '2024-10-01', 49.99),
      -- Order for 'Polystyrene Rosebud' 
	('987654321', '2024-10-05', 199.99),
      -- Another optimist order for 'The Optimists' Calendar'
	('123456789', '2024-10-07', 49.99);
  

Introduce window functions to perform the analysis. The query calculates a running sales total for each product, ordered by date, and ranks products by sales amount.

  -- Perform window function
SELECT
	p.name AS product_name,
	o.order_date,
	o.sales_amount,
	SUM(o.sales_amount) OVER (
    	PARTITION BY p.name
    	ORDER BY o.order_date
	) AS product_running_total,
	RANK() OVER (
    	ORDER BY o.sales_amount DESC
	) AS sales_rank
FROM
	Orders o
JOIN
	Products p ON o.product_barcode = p.barcode
ORDER BY
	p.name, o.order_date;
  

The window function in the above query is composed of two main parts.

  1. The SUM() OVER() function creates a running total for each product: PARTITION BY p.name restarts the sum for every product, and ORDER BY o.order_date accumulates it in date order.

  2. The RANK() OVER() function ranks orders by sales amount across all products, assigning the same rank to ties.

The PARTITION BY and ORDER BY clauses inside OVER are part of the window definition itself: they determine how rows are grouped into partitions and how they are ordered within each partition.

You’ll get the following products with their sales and rankings.

product_name order_date sales_amount product_running_total sales_rank
Polystyrene Rosebud 2024-10-05 199.99 199.99 1
The Optimists' Calendar 2024-10-01 49.99 49.99 2
The Optimists' Calendar 2024-10-07 49.99 99.98 2

Without window functions you would have needed multiple subqueries and self-joins to achieve the same result.

-- Perform without window function
WITH sales_data AS (
	SELECT
    	p.name AS product_name,
    	o.order_date,
    	o.sales_amount
	FROM
    	Orders o
	JOIN
    	Products p ON o.product_barcode = p.barcode
),
running_totals AS (
	SELECT
    	sd1.product_name,
    	sd1.order_date,
    	sd1.sales_amount,
    	(SELECT SUM(sd2.sales_amount)
     	FROM sales_data sd2
     	WHERE sd2.product_name = sd1.product_name
       	AND sd2.order_date <= sd1.order_date) AS product_running_total
	FROM
    	sales_data sd1
),
sales_ranks AS (
	SELECT
    	product_name,
    	order_date,
    	sales_amount,
    	RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank
	FROM
    	sales_data
)
SELECT
	rt.product_name,
	rt.order_date,
	rt.sales_amount,
	rt.product_running_total,
	sr.sales_rank
FROM
	running_totals rt
JOIN
	sales_ranks sr ON rt.product_name = sr.product_name
              	AND rt.order_date = sr.order_date
              	AND rt.sales_amount = sr.sales_amount
ORDER BY
	rt.product_name, rt.order_date;
  

The above query achieves the same result as our window function example, but it's significantly more complex. You need multiple CTEs to break down the logic and correlated subqueries that are less efficient for large datasets. The query is also much longer and harder to read, making it more error-prone and difficult to maintain. 

#4 Subqueries and lateral joins

Subqueries & lateral joins are related SQL concepts that can be used standalone or in combination.

Subqueries

Subqueries, also known as inner queries or nested queries, are SQL queries embedded within a larger query. They are classified by the result type they return.

  • Scalar subqueries return a single value (one row with one column).

  • Row subqueries return a single row with multiple columns.

  • Column subqueries return a single column with multiple rows.

  • Table subqueries return multiple rows and multiple columns.
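As a brief sketch against the article's tables, here are two of these classes in use (the average-price alias is illustrative):

```sql
-- Scalar subquery: returns one row with one column, usable
-- wherever a single value is expected
SELECT name,
       price,
       (SELECT AVG(price) FROM Products) AS avg_catalog_price
FROM Products;

-- Column subquery: returns one column with many rows,
-- typically consumed with IN
SELECT name
FROM Products
WHERE barcode IN (SELECT product_barcode FROM Orders);
```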

Correlated subqueries

In addition to the subquery types above, there is one more type: correlated subqueries. They can return a scalar value, a single row, or multiple rows, depending on how they are written. They reference one or more columns from the outer query, typically in the WHERE clause, and execute once for each row processed by the outer query.

Unlike other subquery types, correlated subqueries do not execute independently of the outer query; their results change for each row of the outer query, as they use the current row's values in their calculation.
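A minimal sketch of a correlated subquery against the article's schema: the inner query references p.barcode from the current outer row, so it is evaluated once per product.

```sql
-- Correlated subquery: re-evaluated for each row of Products
SELECT p.name,
       (SELECT COUNT(*)
        FROM Reviews r
        WHERE r.product_barcode = p.barcode) AS review_count
FROM Products p;
```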

For example, you can use subqueries to analyze product performance by comparing individual products to category averages and retrieving top customer feedback. A correlated subquery performs row-specific comparisons against category averages, and lateral joins fetch the top-rated review for each product.

For this example, let’s first add a new category column to the “Products” table, more products with varying categories, and more reviews and orders for added variability.

  -- Add the new category column to the Products table
ALTER TABLE Products ADD COLUMN category VARCHAR(50);

-- Insert more products with varying categories
INSERT INTO Products (barcode, name, description, price, category)
VALUES
('112233445', 'Inspirational Desk Quotes', 'A set of motivational quotes for your desk.', 19.99, 'Stationery'),
('998877665', 'Wooden Scented Candle Holder', 'Handmade candle holder that emits a natural aroma.', 39.99, 'Decorative'),
('443322110', 'Self-Care Journal', 'A journal designed for personal growth and self-care.', 29.99, 'Stationery'),
('554433221', 'Ceramic Flower Pot', 'A minimalist flower pot for indoor plants.', 59.99, 'Decorative');

-- Insert more reviews
INSERT INTO Reviews (product_barcode, api_response)
VALUES

-- Insert reviews for 'The Optimists' Calendar'
('123456789',
	'{"reviewer": "Mia Kim",
	"rating": 4.5,
	"comment": "Good concept, but a tad expensive.",
	"verified": true}'
),
('123456789',
	'{"reviewer": "Ethan Taylor",
	"rating": 2.0,
	"comment": "I don't really get it.",
	"verified": false}'
),
-- Insert more reviews for 'Polystyrene Rosebud'
('987654321',
	'{"reviewer": "Hannah Lee",
	"rating": 4.0,
	"comment": "Pretty good, but my dog ate it on the second day of use. He’s OK though.",
	"verified": true}'
),
('987654321',
	'{"reviewer": "James White",
	"rating": 2.0,
	"comment": "Not what I expected. Smells…funny.",
	"verified": true}'
),

-- Insert reviews for new products
('112233445',
	'{"reviewer": "Liam Carter",
	"rating": 5.0,
	"comment": "Love the daily inspiration!",
	"verified": true}'
),
('998877665',
	'{"reviewer": "Sophie Green",
	"rating": 4.2,
	"comment": "Beautiful craftsmanship.",
	"verified": true}'
),
('443322110',
	'{"reviewer": "Olivia Martin",
	"rating": 3.9,
	"comment": "Helps me stay organized, but could be better.",
	"verified": true}'
),
('554433221',
	'{"reviewer": "Jack Cooper",
	"rating": 1.8,
	"comment": "Too fragile, broke quickly.",
	"verified": false}'
);

-- Insert more orders
INSERT INTO Orders (product_barcode, order_date, sales_amount)
VALUES
	-- Orders for new products
	('112233445', '2024-09-30', 19.99),
	('998877665', '2024-10-02', 39.99),
	('443322110', '2024-10-03', 29.99),
	('554433221', '2024-10-05', 59.99),

	-- Additional orders for existing products
	('123456789', '2024-10-06', 49.99),
	('987654321', '2024-10-08', 199.99);
  

We can then calculate the review count versus the category average as shown.

  -- Return products with number of reviews greater than category averages
SELECT p.name,
   	p.category,
   	review_count,
   	r.api_response->>'reviewer' AS reviewer,
   	r.api_response->>'rating' AS rating,
   	r.api_response->>'comment' AS comment
FROM Products p
JOIN LATERAL (
	SELECT COUNT(*) AS review_count
	FROM Reviews r
	WHERE r.product_barcode = p.barcode
) subq ON true
JOIN LATERAL (
	SELECT r.api_response
	FROM Reviews r
	WHERE r.product_barcode = p.barcode
	ORDER BY (r.api_response->>'rating')::numeric DESC  -- cast so ratings sort numerically, not as text
	LIMIT 1
) r ON true
WHERE review_count > (
	SELECT AVG(review_count)
	FROM (
    	SELECT COUNT(*) AS review_count
    	FROM Reviews r2
    	JOIN Products p2 ON r2.product_barcode = p2.barcode
    	WHERE p2.category = p.category
    	GROUP BY r2.product_barcode
	) AS category_avg
);
  

This query combines the two methods:

  1. The correlated subquery identifies which products have above-average review counts for their category.

  2. The lateral join - JOIN LATERAL() syntax - fetches the top review based on the rating for each product that the correlated subquery identified.

The key idea is that this would not be possible with a regular join: LATERAL lets each subquery reference columns from the current row of the tables that precede it, enabling row-by-row operations and reasonably complex logic within a single query.

The query returns each qualifying product with its review count and its top-rated review.

name category review_count reviewer rating comment
The Optimists' Calendar Stationery 5 John Doe 5.0 Great product!
Polystyrene Rosebud Decorative 6 Charlie Davis 5.0 Choking hazard! Otherwise good smells!!

#5 Temporal data management in data warehouses

Temporal data management is the practice of storing, querying, and analyzing time-based data in databases or data warehouses. You can use it to:

  • Maintain a history of data changes over defined periods for historical tracking and auditing.

  • Analyze a sequence of data points collected over time.

  • Monitor data quality over time.

  • Identify and resolve bugs by backtracking through historical data states.

Temporal data management is not a single method but a set of practices outlined below.

Method Description Use
Temporal tables Tables that track the state of other tables, recording the history of data changes. Maintain an audit trail of data modifications over time.
Time-based partitioning Divide larger tables into smaller chunks delimited by time intervals. Improve query performance on large datasets.
Time travel queries Access historical data as it existed at a given point in time. Auditing, debugging, historical analysis, and comparing database states across different periods.
Window functions Use values from one or multiple rows and return a value for each row without collapsing the rows involved. Define windows as time frames, enabling time-series analyses such as rolling averages and moving totals.
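Time-based partitioning in PostgreSQL (10+) uses declarative PARTITION BY RANGE. Below is a sketch with an illustrative partitioned variant of the Orders table; the table and partition names are assumptions, not part of the article's schema:

```sql
-- Partitioned variant of Orders, split by order date
CREATE TABLE OrdersPartitioned (
    order_id SERIAL,
    product_barcode VARCHAR(20),
    order_date DATE NOT NULL,
    sales_amount DECIMAL(10, 2)
) PARTITION BY RANGE (order_date);

-- One partition per quarter; queries filtered on order_date only
-- scan the matching partitions (partition pruning)
CREATE TABLE orders_2024_q4 PARTITION OF OrdersPartitioned
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
```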

Different platforms offer distinct approaches to implementing temporal data management methods across their services. For instance, Databricks provides time travel features through services like Delta Lake to access historical data versions. You can use it to simplify auditing, facilitate data rollback in cases of accidental changes, and enhance the reproducibility of experiments and reports.

Snowflake offers similar capabilities, supporting tasks such as restoring deleted tables, duplicating data from specific points in time, and analyzing data usage changes. 
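As a hedged sketch of what such time travel queries look like on these platforms (syntax is platform-specific, and the table name reuses the article's Orders table purely for illustration):

```sql
-- Snowflake: query a table as of a past moment
SELECT *
FROM Orders
    AT(TIMESTAMP => '2024-10-01 00:00:00'::TIMESTAMP_LTZ);

-- Databricks (Delta Lake): query an older snapshot of a table
SELECT *
FROM Orders TIMESTAMP AS OF '2024-10-01';
```

Both platforms also support addressing history by version or offset rather than timestamp, subject to each platform's retention settings.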

For example, consider tracking and analyzing product prices and sales trends over time. You can implement it with a combination of temporal tables, window functions, and time travel queries.  

Create a table for tracking price changes and populate it with data entries:

-- Create a temporal table for tracking product price changes
CREATE TABLE ProductPriceHistory (
	barcode VARCHAR(20),
	price DECIMAL(10, 2),
	valid_from TIMESTAMP,
	valid_to TIMESTAMP,
	PRIMARY KEY (barcode, valid_from)
);

-- Insert initial product prices
INSERT INTO ProductPriceHistory (barcode, price, valid_from, valid_to)
SELECT barcode, price, '2024-01-01 00:00:00'::TIMESTAMP, '9999-12-31 23:59:59'::TIMESTAMP
FROM Products;
  

Simulate a price change for 'The Optimists' Calendar':

-- Simulate a price change for 'The Optimists' Calendar'
UPDATE ProductPriceHistory
SET valid_to = '2024-06-30 23:59:59'::TIMESTAMP
WHERE barcode = '123456789' AND valid_to = '9999-12-31 23:59:59'::TIMESTAMP;

INSERT INTO ProductPriceHistory (barcode, price, valid_from, valid_to)
VALUES ('123456789', 59.99, '2024-07-01 00:00:00'::TIMESTAMP, '9999-12-31 23:59:59'::TIMESTAMP);
  

You then create the current product prices view as shown.

-- Create a view for the current product prices
CREATE VIEW CurrentProductPrices AS
SELECT p.barcode, p.name, pph.price
FROM Products p
JOIN ProductPriceHistory pph ON p.barcode = pph.barcode
WHERE pph.valid_to = '9999-12-31 23:59:59'::TIMESTAMP;
  

Use window functions to analyze sales trends.

-- Analyze sales trends using window functions
WITH monthly_sales AS (
	SELECT
    	DATE_TRUNC('month', o.order_date) AS month,
    	p.name AS product_name,
    	SUM(o.sales_amount) AS total_sales,
    	COUNT(*) AS order_count
	FROM Orders o
	JOIN Products p ON o.product_barcode = p.barcode
	GROUP BY DATE_TRUNC('month', o.order_date), p.name
)
SELECT
	month,
	product_name,
	total_sales,
	order_count,
	AVG(total_sales) OVER (
    	PARTITION BY product_name
    	ORDER BY month
    	ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
	) AS three_month_moving_avg,
	SUM(total_sales) OVER (
    	PARTITION BY product_name
    	ORDER BY month
	) AS cumulative_sales
FROM monthly_sales
ORDER BY product_name, month;
  

You get sales trends by product per month as shown.

month product_name total_sales order_count three_month_moving_avg cumulative_sales
2024-10-01 00:00:00+00 Ceramic Flower Pot 59.99 1 59.99 59.99
2024-09-01 00:00:00+00 Inspirational Desk Quotes 19.99 1 19.99 19.99
2024-10-01 00:00:00+00 Polystyrene Rosebud 399.98 2 399.98 399.98
2024-10-01 00:00:00+00 Self-Care Journal 29.99 1 29.99 29.99
2024-10-01 00:00:00+00 The Optimists' Calendar 149.97 3 149.97 149.97
2024-10-01 00:00:00+00 Wooden Scented Candle Holder 39.99 1 39.99 39.99

You can also analyze the impact of price changes on sales as shown.

-- Analyze the impact of price changes on sales
SELECT
	p.name AS product_name,
	pph.price,
	pph.valid_from,
	pph.valid_to,
	COUNT(o.order_id) AS orders_in_period,
	SUM(o.sales_amount) AS sales_in_period,
	AVG(o.sales_amount) AS avg_sale_in_period
FROM ProductPriceHistory pph
JOIN Products p ON pph.barcode = p.barcode
LEFT JOIN Orders o ON p.barcode = o.product_barcode
	AND o.order_date >= pph.valid_from
	AND o.order_date < COALESCE(pph.valid_to, '9999-12-31'::TIMESTAMP)
GROUP BY p.name, pph.price, pph.valid_from, pph.valid_to
ORDER BY p.name, pph.valid_from;
  

Sales vs. price changes results are as shown.

product_name price valid_from valid_to orders_in_period sales_in_period avg_sale_in_period
Ceramic Flower Pot 59.99 2024-01-01 00:00:00 9999-12-31 23:59:59 1 59.99 59.99
Inspirational Desk Quotes 19.99 2024-01-01 00:00:00 9999-12-31 23:59:59 1 19.99 19.99
Polystyrene Rosebud 199.99 2024-01-01 00:00:00 9999-12-31 23:59:59 2 399.98 199.99
Self-Care Journal 29.99 2024-01-01 00:00:00 9999-12-31 23:59:59 1 29.99 29.99
The Optimists' Calendar 49.99 2024-01-01 00:00:00 2024-06-30 23:59:59 0
The Optimists' Calendar 59.99 2024-07-01 00:00:00 9999-12-31 23:59:59 3 149.97 49.99
Wooden Scented Candle Holder 39.99 2024-01-01 00:00:00 9999-12-31 23:59:59 1 39.99 39.99

Perform a time travel query to compare product prices at different points in time.

-- Time travel query to compare product prices at different points in time
SELECT
	p.name AS product_name,
	pph_past.price AS price_6_months_ago,
	pph_current.price AS current_price,
	(pph_current.price - pph_past.price) / pph_past.price * 100 AS price_change_percentage
FROM Products p
JOIN ProductPriceHistory pph_past ON p.barcode = pph_past.barcode
	AND '2024-04-01'::TIMESTAMP BETWEEN pph_past.valid_from AND pph_past.valid_to
JOIN ProductPriceHistory pph_current ON p.barcode = pph_current.barcode
	AND CURRENT_TIMESTAMP BETWEEN pph_current.valid_from AND pph_current.valid_to
WHERE pph_past.price != pph_current.price;
  

You get the price change percentage vs. 6 months ago.

product_name price_6_months_ago current_price price_change_percentage
The Optimists' Calendar 49.99 59.99 20.0040

You can see that "The Optimists' Calendar" is the only product that experienced a price change, with a 20% increase from $49.99 to $59.99 over the past six months.

Last thoughts

Advanced SQL concepts are a practical response to real challenges in modern data engineering. You can use them to handle complex data types, optimize performance for large datasets, and build more efficient data pipelines.

However, knowing about these techniques is just one part of the equation; implementing them efficiently at scale is still challenging. This is where tools like DataForge come into play. DataForge allows data engineers to create reusable transformation code blocks, encapsulating complex SQL logic into modular, maintainable units.

DataForge also provides the ability to automate the orchestration of these code blocks, addressing one of the key pain points in implementing complex SQL methods: managing dependencies and execution order. Automation frees data engineers to focus on transformation logic rather than orchestration mechanics and execution reliability.

As available data grows in volume, a more interdisciplinary approach is required; it's not just about writing complex SQL queries anymore but about architecting efficient, scalable data systems.
