Advanced SQL Concepts
SQL Join with Subquery: Key Concepts
Multi-chapter guide | Chapter 7
Join with subquery in SQL provides a more advanced way to create multi-step data processing pipelines within a single SQL statement. Basic JOIN operations connect tables horizontally, but you can combine them with subqueries to perform vertical data operations where results from one query feed into another.
This article discusses implementation patterns and optimization techniques that combine the SQL JOIN clause with subqueries. We also provide hands-on examples of real-world scenarios.
Summary of key JOIN-with-subquery concepts in SQL
Concept | Description |
---|---|
An introduction to subquery types | A formal categorization of subqueries into scalar (single value), row (single row, multiple columns), and table (multiple rows/columns) types, with each serving specific analytical purposes in data processing. |
Basic JOIN-subquery patterns | Core techniques that combine JOINs with subqueries: derived table joins, correlated subquery joins, nested joins, and semi/anti joins. |
Nested joins with subqueries | Advanced technique chaining multiple JOIN operations with integrated subqueries. Useful for complex calculations like comparing product performance against historical data and category averages simultaneously. |
Correlated subquery joins | Specialized JOIN operation where subqueries reference outer query values, executing per-row calculations. Useful for sophisticated cross-table analysis with dynamic value comparisons. |
Semi/anti joins using subqueries | Filtering technique using EXISTS/NOT EXISTS or IN/NOT IN with subqueries to find matches or non-matches between datasets. Useful for checking existence rather than performing full table scans. |
Common table expressions | Modular approach to complex queries, breaking down JOIN-subquery combinations into named, reusable blocks. They improve maintainability and enable recursive operations. |
Window functions with JOIN-subqueries | Integration of window functions with JOINs for advanced analytics. Allows multi-level aggregations, moving averages, and rankings while maintaining row-level detail. |
Performance optimization techniques | Comprehensive approach to query efficiency through execution plan analysis, indexing strategies, and memory management. |
Integration patterns | Strategic combination of CTEs, window functions, and specialized SQL features to create sophisticated data processing pipelines while maintaining query performance and maintainability. |
An introduction to subqueries
A subquery in SQL is a query defined inside a SELECT, INSERT, UPDATE, or DELETE statement or inside another subquery. Its result can be as simple as a single value used for comparison in the outer query or as complex as a fully derived table.
Each RDBMS dictates its own theoretical and practical limits for subquery nesting. Theoretical limits range from 32 levels (SQL Server) to undefined (PostgreSQL); in practice, the limit is usually around 15-20 levels, because engines struggle to generate efficient execution plans for deeply nested queries.
Types of subqueries
Subqueries can be classified by their output structure and relationship to the outer query. Let's explore different patterns using a typical data warehouse scenario where we’re trying to represent a software and services company selling items like analytics dashboards and cloud storage plans.
We first create the table and populate it with values:
-- Create sample tables
CREATE TABLE sales_data (
order_id INT PRIMARY KEY,
product_id INT,
sale_date DATE,
quantity INT,
unit_price DECIMAL(10,2)
);
CREATE TABLE product_info (
product_id INT PRIMARY KEY,
name VARCHAR(100),
category VARCHAR(50),
base_price DECIMAL(10,2)
);
-- Insert sample data
INSERT INTO product_info VALUES
(1, 'Analytics Dashboard', 'Software', 1200.00),
(2, 'Data Pipeline Tool', 'Software', 2500.00),
(3, 'Cloud Storage Plan', 'Services', 500.00),
(4, 'ML Model Training', 'Services', 3000.00);
INSERT INTO sales_data VALUES
(1, 1, '2024-01-15', 2, 1150.00),
(2, 2, '2024-01-15', 1, 2400.00),
(3, 1, '2024-01-16', 3, 1100.00),
(4, 3, '2024-01-16', 5, 450.00),
(5, 4, '2024-01-17', 1, 2900.00);
Let us now introduce the main types of subqueries:
Scalar subqueries
Row subqueries
Table subqueries
Scalar subqueries
They return a single value (one row, one column). These are commonly used in comparisons or calculations within WHERE clauses or SELECT lists.
In the following example, we're using scalar subqueries to find products where the base price is higher than average, while also showing each product's average actual selling price for comparison:
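One way to write this against the sample tables (the avg_selling_price alias is illustrative) might be:

```sql
-- Products priced above the average base price, with each
-- product's average actual selling price for comparison
SELECT p.name,
       p.base_price,
       (SELECT AVG(s.unit_price)
        FROM sales_data s
        WHERE s.product_id = p.product_id) AS avg_selling_price
FROM product_info p
WHERE p.base_price > (SELECT AVG(base_price) FROM product_info);
```

Both subqueries are scalar: the one in the SELECT list is correlated and returns one value per product, while the one in the WHERE clause runs once for the whole query.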
Row subqueries
They return a single row with multiple columns, enabling complex row-level comparisons. Row subqueries are useful when you need to compare multiple columns simultaneously. Here, we're using a row subquery to find the most expensive product in each category by comparing both the category and price at once:
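Using the sample tables, this comparison can be sketched with a row constructor (supported in PostgreSQL and MySQL; SQL Server would need a different formulation):

```sql
-- Most expensive product in each category, matched as a
-- (category, price) row
SELECT p.name, p.category, p.base_price
FROM product_info p
WHERE (p.category, p.base_price) IN (
    SELECT category, MAX(base_price)
    FROM product_info
    GROUP BY category
);
```

Because the subquery returns one row per category, the comparison uses IN rather than =; with a single category, = would work as well.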
Table subqueries
They return multiple rows and columns, creating temporary result sets that can be treated as regular tables in the outer query and used in complex data transformations.
In the following example, we're comparing each product's total sales against its category average by joining the main product table with two subqueries:
One for individual product sales.
The other for category-level averages.
-- Compare product performance against category averages
SELECT p.category, p.name,
sales.total_sales,
cat_avg.avg_category_sales
FROM product_info p
JOIN (
SELECT product_id, SUM(quantity * unit_price) as total_sales
FROM sales_data
GROUP BY product_id
) sales ON p.product_id = sales.product_id
JOIN (
SELECT category, AVG(s.quantity * s.unit_price) as avg_category_sales
FROM product_info p
JOIN sales_data s ON p.product_id = s.product_id
GROUP BY category
) cat_avg ON p.category = cat_avg.category;
Basic JOIN-subquery patterns
JOIN operations with subqueries let us combine horizontal operations (connecting tables side by side, as basic JOINs do) with vertical ones, where one query's results feed into another. There are several common patterns we can use in our queries. Here we will tackle four core ones:
Derived table joins
Correlated subquery joins
Nested joins with subqueries
Semi-joins and anti-joins using subqueries
Let's explore these common patterns using the previous dataset with an additional data table, adding customer information:
-- Add customer dimension
CREATE TABLE customer_info (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
segment VARCHAR(20),
country VARCHAR(50)
);
-- Add customer reference to sales
ALTER TABLE sales_data ADD COLUMN customer_id INT;
-- Insert sample customers
INSERT INTO customer_info VALUES
(1, 'TechCorp', 'Enterprise', 'USA'),
(2, 'DataFlow Inc', 'Enterprise', 'UK'),
(3, 'SmartSystems', 'SMB', 'Germany'),
(4, 'AI Solutions', 'SMB', 'France');
-- Link sales to customers
UPDATE sales_data SET customer_id = 1 WHERE order_id IN (1, 3);
UPDATE sales_data SET customer_id = 2 WHERE order_id = 2;
UPDATE sales_data SET customer_id = 3 WHERE order_id = 4;
UPDATE sales_data SET customer_id = 4 WHERE order_id = 5;
Derived table joins
A derived table is a subquery that appears in the FROM clause and returns a complete result set. It is a temporary table that exists only for the duration of the query.
For example, a simple derived table using our software company data:
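A per-product aggregation over our sales data, for instance (the alias names are illustrative):

```sql
-- Revenue and number of sales per product
SELECT product_id,
       SUM(quantity * unit_price) AS revenue,
       COUNT(*) AS sale_count
FROM sales_data
GROUP BY product_id;
```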
This query becomes a derived table when used within another query's FROM clause.
You can use derived tables such as the one above in JOIN operations. The query looks as follows.
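A sketch of that query, using sales_metrics as the derived table's alias:

```sql
SELECT p.name,
       p.category,
       sales_metrics.revenue,
       sales_metrics.sale_count
FROM product_info p
JOIN (
    SELECT product_id,
           SUM(quantity * unit_price) AS revenue,
           COUNT(*) AS sale_count
    FROM sales_data
    GROUP BY product_id
) sales_metrics ON p.product_id = sales_metrics.product_id;
```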
In the query above, the derived table sales_metrics calculates revenue and sale count per product. We then JOIN this with the product_info table to add product details. Finally, the result combines base product data with calculated metrics.
Correlated subquery joins
A correlated subquery is a subquery in SQL that references values from the outer query in its own calculations. Unlike regular subqueries, a correlated subquery is executed once for each row processed by the outer query.
Let us look at a simple example, where we want to calculate the average selling price for our products.
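Such a query might be sketched as follows (the avg_selling_price alias is illustrative):

```sql
SELECT p.name,
       p.base_price,
       (SELECT AVG(s.unit_price)
        FROM sales_data s
        WHERE s.product_id = p.product_id) AS avg_selling_price
FROM product_info p;
```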
Here, the outer query selects from the product_info table. For each product, the subquery calculates its average selling price. Now, if combined with JOIN expressions, correlated queries can open up the possibility of doing more advanced operations.
Consider a situation where you want to find products whose actual selling price consistently differs from the base price. The pattern looks like this:
SELECT
p.name,
p.category,
p.base_price,
price_stats.avg_price,
price_stats.price_difference
FROM product_info p
JOIN (
SELECT
product_id,
AVG(unit_price) as avg_price,
ABS(AVG(unit_price) - (
SELECT base_price
FROM product_info p2
WHERE p2.product_id = s.product_id
)) as price_difference
FROM sales_data s
GROUP BY product_id
) price_stats ON p.product_id = price_stats.product_id
WHERE price_stats.price_difference > 5;
In the query above we’re leveraging a derived table for aggregation (price_stats) and a correlated subquery for cross-referencing base prices. We create a single-pass analysis that efficiently identifies price deviations that exceed our threshold, avoiding multiple table scans or complex window functions that would otherwise be needed for this type of variance detection.
Nested joins with subqueries
A nested join uses the output of one join as an input to the next. In simpler terms, a nested join chains multiple JOIN operations together in sequence, with each subsequent join building upon the results of the previous joins.
A nested join with subquery extends this concept by incorporating derived tables or correlated subqueries at different levels of the join chain. This enables complex calculations like comparing a product's performance against both its historical data and category averages in a single query flow.
Consider product performance analysis at multiple levels: you want to compare each product's revenue against both its historical performance and its category average. This requires chaining joins with aggregated sales data and category-level metrics:
SELECT
p.name,
p.category,
current_sales.revenue as current_revenue,
historical.avg_revenue as historical_avg,
cat_benchmarks.category_avg
FROM product_info p
JOIN (
SELECT
product_id,
SUM(quantity * unit_price) as revenue
FROM sales_data
WHERE sale_date >= '2024-01-16'
GROUP BY product_id
) current_sales ON p.product_id = current_sales.product_id
JOIN (
SELECT
product_id,
AVG(quantity * unit_price) as avg_revenue
FROM sales_data
WHERE sale_date < '2024-01-16'
GROUP BY product_id
) historical ON p.product_id = historical.product_id
JOIN (
SELECT
p2.category,
AVG(s.quantity * s.unit_price) as category_avg
FROM product_info p2
JOIN sales_data s ON p2.product_id = s.product_id
GROUP BY p2.category
) cat_benchmarks ON p.category = cat_benchmarks.category;
In the query above we’re chaining three derived tables:
One for current performance.
One for historical metrics.
One for category benchmarks.
The subqueries create these derived tables by pre-aggregating data at different levels: product-level current sales, product-level historical averages, and category-level benchmarks. Each subquery transforms raw transactional data into a specific metric before the join, which is more efficient than calculating these metrics after joining and enables us to maintain clean, readable query logic while handling complex comparative analysis.
Semi/anti joins with subqueries
A semi-join returns records from the first table where one or more matches exist in the second table. An anti-join returns records from the first table where no matches exist in the second table. In SQL, these are typically implemented using EXISTS/NOT EXISTS or IN/NOT IN with subqueries, rather than explicit JOIN syntax.
Semi/anti joins with subqueries extend this concept by using subqueries to create sophisticated matching conditions. This enables filtering based on complex criteria or aggregated results, rather than simple one-to-one matches between tables.
Consider product penetration analysis in the enterprise segment: you want to find products that have never been purchased by enterprise customers, and those that have been purchased by all enterprise customers.
This requires using both semi and anti-join patterns with subqueries:
-- Anti-join: Products never bought by enterprise customers
SELECT p.name, p.category, p.base_price
FROM product_info p
WHERE NOT EXISTS (
SELECT 1
FROM sales_data s
JOIN customer_info c ON s.customer_id = c.customer_id
WHERE s.product_id = p.product_id
AND c.segment = 'Enterprise'
)
UNION ALL
-- Relational division (nested anti-joins): Products bought by all enterprise customers
SELECT p.name, p.category, p.base_price
FROM product_info p
WHERE NOT EXISTS (
SELECT 1
FROM customer_info c
WHERE c.segment = 'Enterprise'
AND NOT EXISTS (
SELECT 1
FROM sales_data s
WHERE s.product_id = p.product_id
AND s.customer_id = c.customer_id
)
);
In the query above we're using two subquery-based patterns:
An anti-join to find products with no enterprise sales
A relational division, built from nested NOT EXISTS checks, to find products sold to all enterprise customers
The subqueries create nested existence checks that filter products based on their relationship with customer segments. This pattern is more efficient than full joins for existence checking, as it stops processing a product as soon as it finds a matching (or non-matching) record, rather than having to process all possible combinations.
Integrating JOIN-subquery patterns
Up until now, we have reviewed relatively simple query examples. In practice, you often have to combine different JOIN-subquery patterns for certain scenarios. Two powerful techniques for this integration are common table expressions (CTEs) and window functions.
Common table expressions
A common table expression (CTE) is the result set of a query that exists temporarily and for use only within the context of a larger query. Integrating CTEs in the context of JOIN subqueries can make the SQL code easier to understand and maintain by breaking down complicated joins into smaller, reusable pieces. Instead of nesting multiple subqueries within a join, we can use CTEs to define each subquery separately, then join them in the main query.
Let's transform our earlier product performance analysis query from nested subqueries into clearer, modular pieces using CTEs:
WITH product_revenue AS (
-- First, calculate revenue per product
SELECT
product_id,
SUM(quantity * unit_price) as total_revenue
FROM sales_data
GROUP BY product_id
),
category_averages AS (
-- Then, calculate category-level averages
SELECT
p.category,
AVG(pr.total_revenue) as avg_category_revenue
FROM product_info p
JOIN product_revenue pr ON p.product_id = pr.product_id
GROUP BY p.category
)
-- Finally, combine everything to compare products against category averages
SELECT
p.name,
p.category,
pr.total_revenue,
ca.avg_category_revenue
FROM product_info p
JOIN product_revenue pr ON p.product_id = pr.product_id
JOIN category_averages ca ON p.category = ca.category;
Instead of nesting multiple derived tables in the main query, we've broken the logic into three clear steps:
Calculating product revenue.
Deriving category averages.
Joining everything together.
Window functions with JOIN-subqueries
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
We can combine window functions with JOIN subqueries to extend the capabilities of both.
For example, the JOIN subquery can first prepare our dataset (filtering records, pre-calculating values), and then window functions can analyze this refined data from multiple angles without requiring additional joins or aggregations.
Let us think of an example where we want to analyze sales performance across different regions and time periods. We want to know the total sales for each region as well as how each region's sales compare to their historical performance and to other regions:
SELECT
    c.country as region_name,
    s.month,
    s.monthly_sales,
    -- Compare current sales to the region's average
    AVG(s.monthly_sales) OVER (
        PARTITION BY c.country
    ) as region_avg_sales,
    -- Track the sales trend with a 3-month moving average
    AVG(s.monthly_sales) OVER (
        PARTITION BY c.country
        ORDER BY s.month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as moving_avg_sales,
    -- Rank regions by sales within each month
    RANK() OVER (
        PARTITION BY s.month
        ORDER BY s.monthly_sales DESC
    ) as monthly_rank
FROM customer_info c
JOIN (
    -- JOIN subquery aggregates sales to the monthly level
    SELECT
        customer_id,
        to_char(sale_date, 'YYYY-MM') as month,
        SUM(quantity * unit_price) as monthly_sales
    FROM sales_data
    WHERE sale_date >= '2024-01-01'
    GROUP BY customer_id, to_char(sale_date, 'YYYY-MM')
) s ON c.customer_id = s.customer_id;
In the query above, we see the following:
The JOIN subquery handles the heavy lifting of aggregating daily sales into monthly totals
Window functions then provide multiple analytical perspectives on this prepared data:
Long-term regional performance comparison
Short-term trend analysis
Competitive ranking
All of this analysis happens in a single query, with each row showing both its individual sales figures and its broader context in the data.
Performance optimization for JOIN-subqueries
Query optimization becomes key when dealing with JOIN and subquery combinations. We have multiple tools at our disposal, depending on where and how we want to focus the optimization.
Execution plans
An execution plan shows how the database will actually run our query: which tables it'll read first, how it'll join them, and what operations it'll perform. In PostgreSQL, we can prefix the query with EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT p.category,
p.name,
sales.total_sales,
cat_avg.avg_category_sales
FROM product_info p
JOIN (
SELECT product_id,
SUM(quantity * unit_price) as total_sales
FROM sales_data
GROUP BY product_id
) sales ON p.product_id = sales.product_id
JOIN (
SELECT category,
AVG(s.quantity * s.unit_price) as avg_category_sales
FROM product_info p2
JOIN sales_data s ON p2.product_id = s.product_id
GROUP BY category
) cat_avg ON p.category = cat_avg.category;
Resulting in the following:
"Hash Join (cost=102.93..116.12 rows=200 width=400) (actual time=0.081..0.083 rows=4 loops=1)"
" Hash Cond: ((p.category)::text = (cat_avg.category)::text)"
" -> Hash Join (cost=42.40..55.06 rows=200 width=368) (actual time=0.056..0.057 rows=4 loops=1)"
" Hash Cond: (p.product_id = sales.product_id)"
" -> Seq Scan on product_info p (cost=0.00..12.10 rows=210 width=340) (actual time=0.012..0.013 rows=4 loops=1)"
" -> Hash (cost=39.90..39.90 rows=200 width=36) (actual time=0.022..0.022 rows=4 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 9kB"
" -> Subquery Scan on sales (cost=35.40..39.90 rows=200 width=36) (actual time=0.018..0.020 rows=4 loops=1)"
" -> HashAggregate (cost=35.40..37.90 rows=200 width=36) (actual time=0.018..0.019 rows=4 loops=1)"
" Group Key: sales_data.product_id"
" Batches: 1 Memory Usage: 40kB"
" -> Seq Scan on sales_data (cost=0.00..22.70 rows=1270 width=24) (actual time=0.004..0.005 rows=5 loops=1)"
" -> Hash (cost=58.03..58.03 rows=200 width=150) (actual time=0.020..0.020 rows=2 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 9kB"
" -> Subquery Scan on cat_avg (cost=53.53..58.03 rows=200 width=150) (actual time=0.018..0.019 rows=2 loops=1)"
" -> HashAggregate (cost=53.53..56.03 rows=200 width=150) (actual time=0.018..0.019 rows=2 loops=1)"
" Group Key: p2.category"
" Batches: 1 Memory Usage: 40kB"
" -> Hash Join (cost=14.72..40.83 rows=1270 width=138) (actual time=0.012..0.013 rows=5 loops=1)"
" Hash Cond: (s.product_id = p2.product_id)"
" -> Seq Scan on sales_data s (cost=0.00..22.70 rows=1270 width=24) (actual time=0.002..0.002 rows=5 loops=1)"
" -> Hash (cost=12.10..12.10 rows=210 width=122) (actual time=0.005..0.005 rows=4 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 9kB"
" -> Seq Scan on product_info p2 (cost=0.00..12.10 rows=210 width=122) (actual time=0.003..0.003 rows=4 loops=1)"
"Planning Time: 0.205 ms"
"Execution Time: 0.150 ms"
The execution plan shows us how PostgreSQL, in this case, tackles this JOIN-subquery pattern: it first processes each subquery (calculating per-product sales and category averages), materializes these results in memory, and then uses hash joins to combine them with the main product table.
Indexing strategies
An index is a separate data structure that helps the database find rows quickly without scanning entire tables. For JOIN-subquery combinations, we need to consider indexes that support both the join conditions and the subquery filters. The key is to identify which columns are used in subquery processing and final join conditions.
Let us consider this JOIN-subquery pattern that's common in our software company:
WITH product_stats AS (
SELECT
product_id,
SUM(quantity * unit_price) as revenue
FROM sales_data
GROUP BY product_id
),
category_stats AS (
SELECT
p2.category,
AVG(s.quantity * s.unit_price) as category_avg
FROM product_info p2
JOIN sales_data s ON p2.product_id = s.product_id
GROUP BY p2.category
)
SELECT
p.category,
p.name,
ps.revenue,
cs.category_avg
FROM product_info p
LEFT JOIN product_stats ps ON p.product_id = ps.product_id
LEFT JOIN category_stats cs ON p.category = cs.category;
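A reasonable set of indexes for this pattern (the index names are illustrative) might be:

```sql
-- Composite index covering the per-product grouping key and the
-- quantity * unit_price amount calculation
CREATE INDEX idx_sales_product_amounts
    ON sales_data (product_id, quantity, unit_price);

-- Index supporting the category-level grouping and the final join
CREATE INDEX idx_product_category
    ON product_info (category, product_id);
```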
These indexes support the subquery aggregations (product and category statistics) and the subsequent joins between the derived tables. The composite index on sales_data helps with both the grouping and the amount calculations.
Memory management
JOIN-subquery patterns can be memory-intensive because they often create temporary result sets for derived tables before joining them. The key is to structure queries so these intermediate results don't overwhelm our database's memory.
For example, this JOIN-subquery pattern can be memory-intensive:
SELECT t1.category,
t1.revenue,
t2.avg_revenue
FROM (
SELECT p.category,
SUM(s.quantity * s.unit_price) as revenue
FROM sales_data s
JOIN product_info p ON s.product_id = p.product_id
GROUP BY p.category
) t1
JOIN (
SELECT category,
AVG(monthly_revenue) as avg_revenue
FROM (
SELECT p.category,
DATE_TRUNC('month', s.sale_date) as month,
SUM(s.quantity * s.unit_price) as monthly_revenue
FROM sales_data s
JOIN product_info p ON s.product_id = p.product_id
GROUP BY p.category, DATE_TRUNC('month', s.sale_date)
) monthly_stats
GROUP BY category
) t2 ON t1.category = t2.category;
This CTE-based alternative computes the monthly aggregation once and builds on it, which can be more memory-efficient:
WITH monthly_sales AS (
SELECT p.category,
DATE_TRUNC('month', s.sale_date) as month,
SUM(s.quantity * s.unit_price) as revenue
FROM sales_data s
JOIN product_info p ON s.product_id = p.product_id
GROUP BY p.category, DATE_TRUNC('month', s.sale_date)
),
category_stats AS (
SELECT category,
SUM(revenue) as total_revenue,
AVG(revenue) as avg_monthly_revenue
FROM monthly_sales
GROUP BY category
)
SELECT * FROM category_stats;
The CTE approach breaks down complex JOIN-subquery operations into more manageable steps, making it easier for the database to manage memory and potentially reuse intermediate results.
Last thoughts
In this article, we've introduced the concept of subqueries and how JOIN operations can be combined with subquery operations. These techniques can be powerful, but performance implications become more difficult to identify as SQL techniques become more complex. Data engineers must always weigh query sophistication against system performance.
Three key principles address these concerns:
The strategic use of CTEs and modular design patterns can improve query maintainability without sacrificing performance.
Proper indexing strategies and execution plan analysis are key when implementing certain JOIN-subquery combinations.
Modern SQL features like window functions and JSON operations, when properly integrated with subqueries, offer capabilities that would otherwise be inaccessible.
Understanding not just joins with subqueries but how they can be used alongside other SQL features allows data engineers to take a broader approach to problem-solving: instead of thinking about isolated SQL operations, you can consider data processing systems in their entirety. However, this holistic architectural perspective requires tools and practices that can handle both the complexity of your queries and the scale of your operations.
DataForge addresses this need directly through its declarative data management platform, which transforms complex SQL patterns into reusable transformation blocks within a larger, maintainable system.
Additionally, the platform's functional code architecture brings software development best practices to data engineering, a critical aspect if you want to scale a system reliably. One example is auto-orchestration handling, where DataForge automates the sequencing of transformation code blocks and manages dependencies.