Advanced SQL Concepts for Data Engineers
Multi-chapter guide | Chapter 1
A decade ago, simple tabular structures were enough to represent most of the data that data engineers worked with. There were not many devices actively and passively collecting information. Additionally, the data volume that data pipelines managed was limited as cloud computing was more expensive.
Today, things are different — data does not quite fit into neat tabular arrangements anymore, and the annual information volume has seen more than a 10-fold increase since 2014.
Due to these challenges, data engineers have become increasingly creative with data querying, transformation, and storage. Many adopt advanced SQL techniques going far beyond the basics. This article explores five advanced SQL concepts actively used in production environments. Each represents a practical solution to daily real-world problems when working with today's larger and more complex datasets.
Summary of key advanced SQL concepts
Concept | Description |
---|---|
Data modeling with JSON and semi-structured data | Use the ->> operator to retrieve text from the JSON object, then cast to the required data type - for example, numeric value for comparison. |
Advanced query patterns with Common Table Expressions (CTEs) | Use the WITH clause to define a temporary result set |
Window functions | Apply ROW_NUMBER(), RANK(), SUM(), etc., to perform cumulative calculations within defined partitions. This avoids grouping while enabling calculations across rows. |
Subqueries & lateral joins | Use correlated subqueries to access values from the outer query, and lateral joins (the JOIN LATERAL syntax) to access columns from earlier tables in the query, allowing for row-by-row operations. |
Temporal data management in data warehouses | Create temporal tables for tracking the state of the other tables, use the PARTITION BY RANGE syntax for time-based partitioning, perform time travel queries for accessing historical data as it existed at that time, and use the OVER syntax for window functions. |
We illustrate these concepts using a PostgreSQL e-commerce website, starting with a basic schema and incrementally building on it. The website database contains:
Table | Description |
---|---|
Products | Product information such as barcode, description, price |
Customers | Customer information such as ID, name, email, address |
Orders | Order information such as order timestamp, customer |
Reviews | Review information such as customer name, rating, etc. |
Categories | Category ID and description. Each product is associated with a single category. |
#1 Data modeling with JSON and semi-structured data
Modern APIs use semi-structured formats such as JSON, XML, and YAML due to their flexibility, ease of use, and lightweight nature. Many modern SQL platforms have mechanisms for working with semi-structured JSON data. For example, PostgreSQL has two main JSON data types:
JSON stores an exact copy of the input text, which processing functions must reparse on each execution.
JSONB stores data in a decomposed binary format and supports indexing.
Both store JSON data, but only JSONB supports indexing, and there are performance trade-offs as well: JSONB’s binary representation is not reparsed on every read, saving time and resources for large or frequent transformation operations. However, JSONB ingestion is slower because the input text must be parsed into the binary format on write.
To see this in practice, let’s create the “Products” and “Reviews” tables of the e-commerce website. The “Reviews” table ingests data from an external API. It includes fields for review ID, product barcode, etc., along with a hierarchical structure for storing the original API JSON response with multiple key fields such as customer name, rating, etc. The JSONB data type is used for the api_response field.
-- Create Products Table
CREATE TABLE Products (
barcode VARCHAR(20) PRIMARY KEY,
name VARCHAR(100) NOT NULL,
description TEXT,
price DECIMAL(10, 2) NOT NULL
);
-- Create Reviews Table
CREATE TABLE Reviews (
review_id SERIAL PRIMARY KEY,
product_barcode VARCHAR(20) REFERENCES Products(barcode),
api_response JSONB NOT NULL
);
You can then populate the “Products” and “Reviews” tables as follows.
-- Insert two products
INSERT INTO Products (barcode, name, description, price)
VALUES
('123456789', 'The Optimists'' Calendar', 'Perpetual calendar that always shows tomorrow as a "better day". Requires a monthly subscription.', 49.99),
('987654321', 'Polystyrene Rosebud', 'Decorative rosebud that emits beautiful smells throughout your home.', 199.99);
-- Insert reviews for the first product
INSERT INTO Reviews (product_barcode, api_response)
VALUES
('123456789',
'{"reviewer": "John Doe",
"rating": 5.0,
"comment": "Great product!",
"verified": true}'
),
('123456789',
'{"reviewer": "Jane Smith",
"rating": 3.8,
"comment": "Starting to feel more positive after 1 month. Worth it!",
"verified": true}'
),
('123456789',
'{"reviewer": "Bob Johnson",
"rating": 1.0,
"comment": "Subscription? Are you kiddin''?",
"verified": false}'
);
-- Insert reviews for the second product
INSERT INTO Reviews (product_barcode, api_response)
VALUES
('987654321',
'{"reviewer": "Alice Brown",
"rating": 1.5,
"comment": "Terrible quality. Broke after one use.",
"verified": true}'
),
('987654321',
'{"reviewer": "Charlie Davis",
"rating": 5.0,
"comment": "Choking hazard! Otherwise good smells!!",
"verified": true}'
),
('987654321',
'{"reviewer": "Eve Wilson",
"rating": 1.0,
"comment": "Worst purchase ever. Got a massive rash just by looking at it!!",
"verified": true}'
),
('987654321',
'{"reviewer": "Frank Miller",
"rating": 1.2,
"comment": "This product smells like the public loo. Will never buy again.",
"verified": false}'
);
Check the resulting tables.
Products table
barcode | name | description | price |
---|---|---|---|
123456789 | The Optimists’ Calendar | Perpetual calendar that always shows tomorrow as a “better day”. Requires a monthly subscription. | 49.99 |
987654321 | Polystyrene Rosebud | Decorative rosebud that emits beautiful smells throughout your home. | 199.99 |
Reviews table
review_id | product_barcode | api_response |
---|---|---|
1 | 123456789 | {"rating": 5.0, "comment": "Great product!", "reviewer": "John Doe", "verified": true} |
2 | 123456789 | {"rating": 3.8, "comment": "Starting to feel more positive after 1 month. Worth it!", "reviewer": "Jane Smith", "verified": true} |
3 | 123456789 | {"rating": 1.0, "comment": "Subscription? Are you kiddin'?", "reviewer": "Bob Johnson", "verified": false} |
4 | 987654321 | {"rating": 1.5, "comment": "Terrible quality. Broke after one use.", "reviewer": "Alice Brown", "verified": true} |
5 | 987654321 | {"rating": 5.0, "comment": "Choking hazard! Otherwise good smells!!", "reviewer": "Charlie Davis", "verified": true} |
6 | 987654321 | {"rating": 1.0, "comment": "Worst purchase ever. Got a massive rash just by looking at it!!", "reviewer": "Eve Wilson", "verified": true} |
7 | 987654321 | {"rating": 1.2, "comment": "This product smells like the public loo. Will never buy again.", "reviewer": "Frank Miller", "verified": false} |
We insert the data using conventional JSON notation. However, JSONB does not store the text this way: it is parsed into a binary format at write time and converted back to JSON when queried. If we query the table directly, we therefore get back the same data we inserted, rendered as JSON.
To avoid raw JSON in the retrieved results, you can use the ->> operator to extract a field as text, then cast it to the required data type, for example a numeric value for comparison.
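A query along these lines might do it (a sketch; the column aliases are illustrative):

```sql
-- Extract fields as text with ->>, cast the rating to numeric for comparison
SELECT r.product_barcode,
       r.api_response->>'reviewer' AS reviewer,
       (r.api_response->>'rating')::numeric AS rating
FROM Reviews r
WHERE (r.api_response->>'rating')::numeric > 3;
```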
Product ratings with rating greater than 3
product_barcode | reviewer | rating |
---|---|---|
123456789 | John Doe | 5.0 |
123456789 | Jane Smith | 3.8 |
987654321 | Charlie Davis | 5.0 |
By using JSONB, you can also create indexes on specific fields within the JSON structure, just like with regular columns in a relational database. These indexes allow PostgreSQL to quickly locate records, potentially improving query performance for large datasets.
For example, you can create an index on the 'rating' field:
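PostgreSQL supports expression indexes, so the cast expression itself can be indexed. A sketch (the index name is illustrative; note the extra parentheses an expression index requires):

```sql
-- Index the numeric rating extracted from the JSONB column
CREATE INDEX idx_reviews_rating
ON Reviews (((api_response->>'rating')::numeric));
```

PostgreSQL can then consider this index for predicates such as (api_response->>'rating')::numeric > 3.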
We have outlined one approach above. While various approaches exist to parse and store semi-structured data within data pipelines, the challenge increases when dealing with large nested arrays and hierarchical structures. Modern data platforms have developed solutions to address this; for instance, DataForge recently introduced sub-sources, which allow the creation of virtual tables from nested complex arrays (ARRAY<STRUCT<..>>) without physical normalization.
#2 Advanced query patterns with Common Table Expressions (CTEs)
Common Table Expressions (CTEs) are temporary subqueries that exist in the context of the larger query. You can conceptualize them as “virtual tables” that can be referenced multiple times within a single SQL statement. They provide another level of abstraction to facilitate complex querying, reusability, and scalability.
For example, let’s say you want to analyze product performance based on their reviews. You can create a query that:
Calculates the average product rating
Determines above-average products and returns their data
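Such a query might be sketched as follows; the CTE and column names mirror the temporary-table version shown later for comparison:

```sql
WITH avg_product_ratings AS (
    -- Average rating per product, extracted from the JSONB response
    SELECT product_barcode,
           AVG((api_response->>'rating')::numeric) AS avg_rating
    FROM Reviews
    GROUP BY product_barcode
),
overall_average AS (
    -- Average of the per-product averages
    SELECT AVG(avg_rating) AS overall_avg
    FROM avg_product_ratings
)
SELECT p.name,
       p.price,
       apr.avg_rating,
       oa.overall_avg
FROM avg_product_ratings apr
JOIN Products p ON p.barcode = apr.product_barcode
CROSS JOIN overall_average oa
WHERE apr.avg_rating > oa.overall_avg;
```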
The first CTE transforms raw review data into a simple table containing each product's average rating. It extracts the rating from the JSONB structure using the same technique we discussed earlier.
The second CTE calculates the average rating across all products.
Combining both CTEs you get products with average rating greater than the overall average.
name | price | avg_rating | overall_avg |
---|---|---|---|
The Optimists' Calendar | 49.99 | 3.2666 | 2.7208 |
Without CTEs, you would need to either nest these queries, making them harder to read and maintain, or create temporary tables, introducing a wide array of extra work and potential conflicts with current tables.
For comparison purposes, here is the same query using temporary tables.
-- Create first temporary table
CREATE TEMPORARY TABLE avg_product_ratings AS
SELECT product_barcode,
AVG((api_response->>'rating')::numeric) as avg_rating
FROM Reviews
GROUP BY product_barcode;
-- Create second temporary table
CREATE TEMPORARY TABLE overall_average AS
SELECT AVG(avg_rating) as overall_avg
FROM avg_product_ratings;
-- Final query
SELECT [...]
-- Cleanup
DROP TABLE avg_product_ratings;
DROP TABLE overall_average;
CTEs simplify query logic and are popular. However, they can degrade the parent query's performance through suboptimal query plans, and a CTE can only be referenced within the single SQL statement in which it’s defined. For example, older PostgreSQL versions recalculate a CTE's results each time it is referenced in the main query, leading to performance bottlenecks. This often forces developers to use temporary tables when the same data transformation is needed across multiple stages or queries.
Materialized CTEs
Materialized CTEs offer a way to "save" the results of a CTE in memory. Instead of recalculating these results every time they're needed, PostgreSQL (version 12 onwards) refers back to this saved version. This improves query performance where:
The CTE is referenced multiple times in the query.
Each reference would be costly to compute.
However, materialized CTEs can also negatively impact performance if the CTE result set is large enough or it’s used sparsely throughout the query. Here’s an example creating a materialized CTE for average product ratings.
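A sketch of such a materialized CTE, using the same names as before (PostgreSQL 12+ syntax):

```sql
WITH avg_product_ratings AS MATERIALIZED (
    SELECT product_barcode,
           AVG((api_response->>'rating')::numeric) AS avg_rating
    FROM Reviews
    GROUP BY product_barcode
)
SELECT * FROM avg_product_ratings;
```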
The MATERIALIZED keyword tells PostgreSQL to compute the average ratings once and store the results in memory.
You can add another CTE to calculate the overall average rating and use them both just as before.
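Sketched out, the combined query might look like this:

```sql
WITH avg_product_ratings AS MATERIALIZED (
    SELECT product_barcode,
           AVG((api_response->>'rating')::numeric) AS avg_rating
    FROM Reviews
    GROUP BY product_barcode
),
overall_average AS (
    -- First reference to the materialized CTE
    SELECT AVG(avg_rating) AS overall_avg
    FROM avg_product_ratings
)
SELECT p.name,
       apr.avg_rating,
       oa.overall_avg
FROM avg_product_ratings apr  -- second reference, reusing the stored results
JOIN Products p ON p.barcode = apr.product_barcode
CROSS JOIN overall_average oa
WHERE apr.avg_rating > oa.overall_avg;
```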
The avg_product_ratings expression is materialized and used twice: once to calculate the overall average and again in the final SELECT statement. The materialized CTE ensures the potentially expensive operation of calculating average JSON data ratings is only performed once, even though we use the results multiple times.
Recursive CTEs
Recursive CTEs are relevant with hierarchical or self-referential data structures, allowing you to query deeply nested organizational structures, large category trees, etc. Like programmatic recursion, they consist of two main components: a base case (initial query) and a recursive part that references the CTE itself. The recursive expression is evaluated upon query execution, building up the result set until no more rows are produced.
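For illustration, here is a sketch that walks a category tree, assuming the Categories table had a self-referencing parent_category_id column (an assumption; our schema does not define one):

```sql
WITH RECURSIVE category_tree AS (
    -- Base case: top-level categories
    SELECT category_id, description, 1 AS depth
    FROM Categories
    WHERE parent_category_id IS NULL
    UNION ALL
    -- Recursive part: children of the rows found so far
    SELECT c.category_id, c.description, ct.depth + 1
    FROM Categories c
    JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT * FROM category_tree
ORDER BY depth;
```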
While recursive CTEs might sound like a feasible approach to querying nested structures, they are used sparingly due to their unexpected impact on query performance; poorly designed recursive calls can lead to infinite loops, and their computational cost often grows steeply with recursive depth, making them difficult to get right.
#3 Window functions
Window functions, introduced in the SQL:2003 standard, compute values over one or multiple rows, also referred to as a "window of rows." They return a value for each row without collapsing the rows involved, as an aggregation with GROUP BY would.
In essence, window functions operate on a data "window" that can be based on partitions, ordered sets, or even a specified number of rows before and after the current row. This capability enables complex operations without resorting to unnecessarily complex queries.
All window functions consist of two main components:
The window function itself, which can be an aggregate function such as SUM, AVG, or COUNT, or a ranking function such as RANK, DENSE_RANK, or ROW_NUMBER.
An OVER clause that specifies how to partition the rows (PARTITION BY) and how to order the rows (ORDER BY).
For example, let's say you want to calculate a running total of product sales and rank products within their categories. Create the “Orders” table and add some entries to it.
-- Create new table “Orders”
CREATE TABLE Orders (
order_id SERIAL PRIMARY KEY,
product_barcode VARCHAR(20) REFERENCES Products(barcode),
order_date DATE,
sales_amount DECIMAL(10, 2)
);
-- Insert 3 orders into the Orders table
INSERT INTO Orders (product_barcode, order_date, sales_amount)
VALUES
-- Order for 'The Optimists' Calendar'
('123456789', '2024-10-01', 49.99),
-- Order for 'Polystyrene Rosebud'
('987654321', '2024-10-05', 199.99),
-- Another optimist order for 'The Optimists' Calendar'
('123456789', '2024-10-07', 49.99);
Introduce window functions to perform the analysis. The query calculates a running sales total for each product, ordered by date, and then ranks orders by their sales amount.
-- Perform window function
SELECT
p.name AS product_name,
o.order_date,
o.sales_amount,
SUM(o.sales_amount) OVER (
PARTITION BY p.name
ORDER BY o.order_date
) AS product_running_total,
RANK() OVER (
ORDER BY o.sales_amount DESC
) AS sales_rank
FROM
Orders o
JOIN
Products p ON o.product_barcode = p.barcode
ORDER BY
p.name, o.order_date;
The above query uses two window functions.
The SUM() OVER() function creates a per-product running total: PARTITION BY p.name restarts the sum for each product, and ORDER BY o.order_date accumulates it by date.
The RANK() OVER() function ranks each order by its sales amount across all rows.
Although not window functions themselves, the PARTITION BY and ORDER BY clauses inside OVER define the partitions and the row ordering for these calculations.
You’ll get the following products with their sales and rankings.
product_name | order_date | sales_amount | product_running_total | sales_rank |
---|---|---|---|---|
Polystyrene Rosebud | 2024-10-05 | 199.99 | 199.99 | 1 |
The Optimists' Calendar | 2024-10-01 | 49.99 | 49.99 | 2 |
The Optimists' Calendar | 2024-10-07 | 49.99 | 99.98 | 2 |
Without window functions, you would need multiple subqueries and self-joins to achieve the same result.
-- Perform without window function
WITH sales_data AS (
SELECT
p.name AS product_name,
o.order_date,
o.sales_amount
FROM
Orders o
JOIN
Products p ON o.product_barcode = p.barcode
),
running_totals AS (
SELECT
sd1.product_name,
sd1.order_date,
sd1.sales_amount,
(SELECT SUM(sd2.sales_amount)
FROM sales_data sd2
WHERE sd2.product_name = sd1.product_name
AND sd2.order_date <= sd1.order_date) AS product_running_total
FROM
sales_data sd1
),
sales_ranks AS (
SELECT
product_name,
order_date,
sales_amount,
RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank
FROM
sales_data
)
SELECT
rt.product_name,
rt.order_date,
rt.sales_amount,
rt.product_running_total,
sr.sales_rank
FROM
running_totals rt
JOIN
sales_ranks sr ON rt.product_name = sr.product_name
AND rt.order_date = sr.order_date
AND rt.sales_amount = sr.sales_amount
ORDER BY
rt.product_name, rt.order_date;
The above query achieves the same result as our window function example, but it's significantly more complex. You need multiple CTEs to break down the logic and correlated subqueries that are less efficient for large datasets. The query is also much longer and harder to read, making it more error-prone and difficult to maintain.
#4 Subqueries and lateral joins
Subqueries and lateral joins are related SQL concepts that can be used standalone or in combination.
Subqueries
Subqueries, also known as inner queries or nested queries, are SQL queries embedded within a larger query. They are classified by the result type they return.
Scalar subqueries return a single value (one row with one column).
Row subqueries return a single row with multiple columns.
Column subqueries return a single column with multiple rows.
Table subqueries return multiple rows and multiple columns.
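As a quick illustration, the following sketch uses a scalar subquery, which is evaluated once and returns a single value alongside every row:

```sql
-- Scalar subquery: show each product's price next to the overall average
SELECT name,
       price,
       (SELECT AVG(price) FROM Products) AS avg_price
FROM Products;
```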
Correlated subqueries
In addition to the subquery types above, there is one more: correlated subqueries. They can return a scalar value, a single row, or multiple rows depending on how they are written. They reference one or more columns from the outer query, typically in the WHERE clause, and execute once for each row processed by the outer query.
Unlike other subquery types, correlated subqueries do not execute independently of the outer query; their results change for each row of the outer query, as they use the current row's values in their calculation.
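A minimal sketch against our schema: the inner query references p.barcode from the outer row, so it is re-evaluated for each product.

```sql
-- Products whose average rating exceeds 3
SELECT p.name
FROM Products p
WHERE (
    SELECT AVG((r.api_response->>'rating')::numeric)
    FROM Reviews r
    WHERE r.product_barcode = p.barcode  -- references the outer query's row
) > 3;
```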
For example, you can use subqueries to analyze product performance by comparing individual products to category averages and retrieving top customer feedback. A correlated subquery performs the row-specific comparison against category averages, while a lateral join fetches each product's top-rated review.
For this example, let’s first add a new category column to the “Products” table, more products with varying categories, and more reviews and orders for added variability.
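The schema change itself might look like the following sketch; the category values for the two existing products are taken from the results further below:

```sql
-- Add a category column and backfill the existing products
ALTER TABLE Products ADD COLUMN category VARCHAR(50);

UPDATE Products SET category = 'Stationery' WHERE barcode = '123456789';
UPDATE Products SET category = 'Decorative' WHERE barcode = '987654321';
```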
-- Insert more products with varying categories
INSERT INTO Products (barcode, name, description, price, category)
VALUES
('112233445', 'Inspirational Desk Quotes', 'A set of motivational quotes for your desk.', 19.99, 'Stationery'),
('998877665', 'Wooden Scented Candle Holder', 'Handmade candle holder that emits a natural aroma.', 39.99, 'Decorative'),
('443322110', 'Self-Care Journal', 'A journal designed for personal growth and self-care.', 29.99, 'Stationery'),
('554433221', 'Ceramic Flower Pot', 'A minimalist flower pot for indoor plants.', 59.99, 'Decorative');
-- Insert more reviews
INSERT INTO Reviews (product_barcode, api_response)
VALUES
-- Insert reviews for 'The Optimists' Calendar'
('123456789',
'{"reviewer": "Mia Kim",
"rating": 4.5,
"comment": "Good concept, but a tad expensive.",
"verified": true}'
),
('123456789',
'{"reviewer": "Ethan Taylor",
"rating": 2.0,
"comment": "I don''t really get it.",
"verified": false}'
),
-- Insert more reviews for 'Polystyrene Rosebud'
('987654321',
'{"reviewer": "Hannah Lee",
"rating": 4.0,
"comment": "Pretty good, but my dog ate it on the second day of use. He’s OK though.",
"verified": true}'
),
('987654321',
'{"reviewer": "James White",
"rating": 2.0,
"comment": "Not what I expected. Smells…funny.",
"verified": true}'
),
-- Insert reviews for new products
('112233445',
'{"reviewer": "Liam Carter",
"rating": 5.0,
"comment": "Love the daily inspiration!",
"verified": true}'
),
('998877665',
'{"reviewer": "Sophie Green",
"rating": 4.2,
"comment": "Beautiful craftsmanship.",
"verified": true}'
),
('443322110',
'{"reviewer": "Olivia Martin",
"rating": 3.9,
"comment": "Helps me stay organized, but could be better.",
"verified": true}'
),
('554433221',
'{"reviewer": "Jack Cooper",
"rating": 1.8,
"comment": "Too fragile, broke quickly.",
"verified": false}'
);
-- Insert more orders
INSERT INTO Orders (product_barcode, order_date, sales_amount)
VALUES
-- Orders for new products
('112233445', '2024-09-30', 19.99),
('998877665', '2024-10-02', 39.99),
('443322110', '2024-10-03', 29.99),
('554433221', '2024-10-05', 59.99),
-- Additional orders for existing products
('123456789', '2024-10-06', 49.99),
('987654321', '2024-10-08', 199.99);
We can then calculate the review count versus the category average as shown.
-- Return products with number of reviews greater than category averages
SELECT p.name,
p.category,
review_count,
r.api_response->>'reviewer' AS top_reviewer,
r.api_response->>'rating' AS rating,
r.api_response->>'comment' AS comment
FROM Products p
JOIN LATERAL (
SELECT COUNT(*) AS review_count
FROM Reviews r
WHERE r.product_barcode = p.barcode
) subq ON true
JOIN LATERAL (
SELECT r.api_response
FROM Reviews r
WHERE r.product_barcode = p.barcode
ORDER BY (r.api_response->>'rating')::numeric DESC
LIMIT 1
) r ON true
WHERE review_count > (
SELECT AVG(review_count)
FROM (
SELECT COUNT(*) AS review_count
FROM Reviews r2
JOIN Products p2 ON r2.product_barcode = p2.barcode
WHERE p2.category = p.category
GROUP BY r2.product_barcode
) AS category_avg
);
This query combines the two methods:
The correlated subquery identifies which products have above-average review counts for their category.
The lateral join - JOIN LATERAL() syntax - fetches the top review based on the rating for each product that the correlated subquery identified.
The key point is that this would not be possible with a regular join: a lateral join lets the subquery reference columns from the current outer row, increasing your calculation scope and allowing reasonably complex per-row operations.
The query returns the products whose review counts exceed their category average, along with each product's top review.
name | category | review_count | top_reviewer | rating | comment |
---|---|---|---|---|---|
The Optimists' Calendar | Stationery | 5 | John Doe | 5.0 | Great product! |
Polystyrene Rosebud | Decorative | 6 | Charlie Davis | 5.0 | Choking hazard! Otherwise good smells!! |
#5 Temporal data management in data warehouses
Temporal data management is the practice of storing, querying, and analyzing time-based data in databases or warehouses. You can use it to:
Maintain a history of data changes over defined periods for historical tracking and auditing.
Analyze a sequence of data points collected over time.
Monitor data quality over time.
Identify and resolve bugs by backtracking through historical data states.
Temporal data management is not a single method but a set of practices outlined below.
Method | Description | Use |
---|---|---|
Temporal tables | Tables that track the state of the other tables, registering the history of data changes. | Maintain an audit trail of data modifications over time. |
Time-based partitioning | Divide or partition larger tables into smaller chunks, delimited by time intervals. | Improve query performance in large datasets. |
Time travel queries | Access historical data as it existed at a given time. | Auditing, debugging, historical analysis, and comparing database states across different periods. |
Window functions | Use values from one or multiple rows and return a value for each row without collapsing the rows involved. | Define windows as time frames, allowing time-series analyses such as rolling averages, moving totals, and more. |
Different platforms offer distinct approaches to implementing temporal data management methods across their services. For instance, Databricks provides time travel features through services like Delta Lake to access historical data versions. You can use it to simplify auditing, facilitate data rollback in cases of accidental changes, and enhance the reproducibility of experiments and reports.
Snowflake offers similar capabilities, supporting tasks such as restoring deleted tables, duplicating data from specific points in time, and analyzing data usage changes.
For example, consider tracking and analyzing product prices and sales trends over time. You can implement it with a combination of temporal tables, window functions, and time travel queries.
Create a table for tracking price changes and populate it with data entries:
-- Create a temporal table for tracking product price changes
CREATE TABLE ProductPriceHistory (
barcode VARCHAR(20),
price DECIMAL(10, 2),
valid_from TIMESTAMP,
valid_to TIMESTAMP,
PRIMARY KEY (barcode, valid_from)
);
-- Insert initial product prices
INSERT INTO ProductPriceHistory (barcode, price, valid_from, valid_to)
SELECT barcode, price, '2024-01-01 00:00:00'::TIMESTAMP, '9999-12-31 23:59:59'::TIMESTAMP
FROM Products;
Simulate a price change for 'The Optimists' Calendar':
-- Simulate a price change for 'The Optimists' Calendar'
UPDATE ProductPriceHistory
SET valid_to = '2024-06-30 23:59:59'::TIMESTAMP
WHERE barcode = '123456789' AND valid_to = '9999-12-31 23:59:59'::TIMESTAMP;
INSERT INTO ProductPriceHistory (barcode, price, valid_from, valid_to)
VALUES ('123456789', 59.99, '2024-07-01 00:00:00'::TIMESTAMP, '9999-12-31 23:59:59'::TIMESTAMP);
You then create the current product prices view as shown.
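That view might be sketched as follows (the view name is illustrative):

```sql
-- Current price = the history row whose validity window covers now
CREATE VIEW CurrentProductPrices AS
SELECT p.barcode,
       p.name,
       pph.price
FROM Products p
JOIN ProductPriceHistory pph
  ON pph.barcode = p.barcode
WHERE CURRENT_TIMESTAMP BETWEEN pph.valid_from AND pph.valid_to;
```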
Use window functions to analyze sales trends.
-- Analyze sales trends using window functions
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', o.order_date) AS month,
p.name AS product_name,
SUM(o.sales_amount) AS total_sales,
COUNT(*) AS order_count
FROM Orders o
JOIN Products p ON o.product_barcode = p.barcode
GROUP BY DATE_TRUNC('month', o.order_date), p.name
)
SELECT
month,
product_name,
total_sales,
order_count,
AVG(total_sales) OVER (
PARTITION BY product_name
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS three_month_moving_avg,
SUM(total_sales) OVER (
PARTITION BY product_name
ORDER BY month
) AS cumulative_sales
FROM monthly_sales
ORDER BY product_name, month;
You get sales trends by product per month as shown.
month | product_name | total_sales | order_count | three_month_moving_avg | cumulative_sales |
---|---|---|---|---|---|
2024-10-01 00:00:00+00 | Ceramic Flower Pot | 59.99 | 1 | 59.99 | 59.99 |
2024-09-01 00:00:00+00 | Inspirational Desk Quotes | 19.99 | 1 | 19.99 | 19.99 |
2024-10-01 00:00:00+00 | Polystyrene Rosebud | 399.98 | 2 | 399.98 | 399.98 |
2024-10-01 00:00:00+00 | Self-Care Journal | 29.99 | 1 | 29.99 | 29.99 |
2024-10-01 00:00:00+00 | The Optimists' Calendar | 149.97 | 3 | 149.97 | 149.97 |
2024-10-01 00:00:00+00 | Wooden Scented Candle Holder | 39.99 | 1 | 39.99 | 39.99 |
You can also analyze the impact of price changes on sales as shown.
-- Analyze the impact of price changes on sales
SELECT
p.name AS product_name,
pph.price,
pph.valid_from,
pph.valid_to,
COUNT(o.order_id) AS orders_in_period,
SUM(o.sales_amount) AS sales_in_period,
AVG(o.sales_amount) AS avg_sale_in_period
FROM ProductPriceHistory pph
JOIN Products p ON pph.barcode = p.barcode
LEFT JOIN Orders o ON p.barcode = o.product_barcode
AND o.order_date >= pph.valid_from
AND o.order_date < COALESCE(pph.valid_to, '9999-12-31'::TIMESTAMP)
GROUP BY p.name, pph.price, pph.valid_from, pph.valid_to
ORDER BY p.name, pph.valid_from;
Sales vs. price changes results are as shown.
product_name | price | valid_from | valid_to | orders_in_period | sales_in_period | avg_sale_in_period |
---|---|---|---|---|---|---|
Ceramic Flower Pot | 59.99 | 2024-01-01 00:00:00 | 9999-12-31 23:59:59 | 1 | 59.99 | 59.99 |
Inspirational Desk Quotes | 19.99 | 2024-01-01 00:00:00 | 9999-12-31 23:59:59 | 1 | 19.99 | 19.99 |
Polystyrene Rosebud | 199.99 | 2024-01-01 00:00:00 | 9999-12-31 23:59:59 | 2 | 399.98 | 199.99 |
Self-Care Journal | 29.99 | 2024-01-01 00:00:00 | 9999-12-31 23:59:59 | 1 | 29.99 | 29.99 |
The Optimists' Calendar | 49.99 | 2024-01-01 00:00:00 | 2024-06-30 23:59:59 | 0 | | |
The Optimists' Calendar | 59.99 | 2024-07-01 00:00:00 | 9999-12-31 23:59:59 | 3 | 149.97 | 49.99 |
Wooden Scented Candle Holder | 39.99 | 2024-01-01 00:00:00 | 9999-12-31 23:59:59 | 1 | 39.99 | 39.99 |
Perform a time travel query to compare product prices at different points in time.
-- Time travel query to compare product prices at different points in time
SELECT
p.name AS product_name,
pph_past.price AS price_6_months_ago,
pph_current.price AS current_price,
(pph_current.price - pph_past.price) / pph_past.price * 100 AS price_change_percentage
FROM Products p
JOIN ProductPriceHistory pph_past ON p.barcode = pph_past.barcode
AND '2024-04-01'::TIMESTAMP BETWEEN pph_past.valid_from AND pph_past.valid_to
JOIN ProductPriceHistory pph_current ON p.barcode = pph_current.barcode
AND CURRENT_TIMESTAMP BETWEEN pph_current.valid_from AND pph_current.valid_to
WHERE pph_past.price != pph_current.price;
You get the price change percentage vs. 6 months ago.
product_name | price_6_months_ago | current_price | price_change_percentage |
---|---|---|---|
The Optimists' Calendar | 49.99 | 59.99 | 20.0040 |
You can see that "The Optimists' Calendar" is the only product that experienced a price change, with a 20% increase from $49.99 to $59.99 over the past six months.
Last thoughts
Advanced SQL concepts are a practical response to real challenges in modern data engineering. You can use them to handle complex data types, optimize performance for large datasets, and build more efficient data pipelines.
However, knowing about these techniques is just one part of the equation; implementing them efficiently at scale is still challenging. This is where tools like DataForge come into play. DataForge allows data engineers to create reusable transformation code blocks, encapsulating complex SQL logic into modular, maintainable units.
DataForge also automates the orchestration of these code blocks, addressing one of the key pain points in implementing complex SQL methods: managing dependencies and execution order. Automation frees data engineers to focus on transformation logic rather than execution mechanics, reliability, and maintainability.
As available data grows in volume, a more interdisciplinary approach is required; it's not just about writing complex SQL queries anymore but about architecting efficient, scalable data systems.