Advanced SQL Concepts
Complex SQL Queries Best Practices
Multi-chapter guide | Chapter 5
Structured Query Language (SQL) is a standard language for defining, modifying, and controlling data within relational databases. A basic SQL query typically involves selecting data from tables, applying filters, and sorting the results.
However, advanced SQL queries become necessary when data structure and business logic complexity increase. They enable intricate data transformations at scale in areas like ETL (Extract, Transform, Load), data warehousing, data analysis, and reporting.
Key techniques in complex SQL queries include subqueries, joins, unions, intersections, aggregate functions, window functions, common table expressions (CTEs), pivoting, recursive queries, string manipulation, date and time functions, and case statements. This article provides an overview of each technique, along with best practices to achieve precise and efficient data processing.
Summary of complex SQL Queries techniques
Concept | Description |
---|---|
Common Table Expression (CTE) | CTEs preserve query results, allowing queries to reference them. |
Recursive queries | Recursive queries process hierarchical or self-referencing data. They are typically built using CTEs. |
Window functions | Calculates values over a defined row window but related to the current row, enabling complex analytics. |
Union and intersection | Combines results from multiple queries (UNION for all, INTERSECT for common results). |
Pivoting and unpivoting | Transforms data from rows to columns, and vice versa, for comparative analysis. |
Case statements | Performs conditional logic within an SQL query |
String manipulation | Performs operations like concatenation, trimming, and substring extraction, including regular expressions |
Date and time functions | Handles date and time calculations, comparisons, and transformations. |
Recursive queries and CTEs
Recursive queries allow you to link one record with others in the same table, making them suitable for handling hierarchical data structures such as organizational charts or product dependencies. They are typically built using Common Table Expressions (CTEs).
CTEs simplify complex queries by breaking them into smaller, reusable subparts. It involves defining the subquery with a meaningful name so that it can be referenced multiple times within the main query. CTEs improve readability and allow you to build queries modularly, making debugging and testing more efficient.
Example
Consider an EMPLOYEE table with a column called ManagerID, which refers to that employee's manager. This allows for many levels of employees and managers.
EmployeeID | EmployeeName | ManagerID |
---|---|---|
1 | Sarah | NULL |
2 | John | 1 |
3 | Michelle | 1 |
4 | Frank | 2 |
5 | Jerry | 4 |
6 | Paula | 4 |
We can use a recursive query with CTE to retrieve the hierarchy of employees with their levels.
WITH RECURSIVE OrgHierarchy AS (
-- Anchor Member: Start with the top-level manager (no ManagerID)
SELECT
EmployeeID,
EmployeeName,
ManagerID,
1 AS Level
FROM Employee
WHERE ManagerID IS NULL
UNION ALL
-- Recursive Member: Fetch employees reporting to the current level
SELECT
e.EmployeeID,
e.EmployeeName,
e.ManagerID,
oh.Level + 1 AS Level
FROM Employee e
INNER JOIN OrgHierarchy oh
ON e.ManagerID = oh.EmployeeID
)
-- Final Query: Retrieve the hierarchical structure
SELECT
EmployeeID,
EmployeeName,
ManagerID,
Level
FROM OrgHierarchy
ORDER BY Level, EmployeeID;
The RECURSIVE keyword is not required if you're using Oracle or SQL Server. The recursion is implicit when the CTE references itself. If you're using MySQL or Postgres, you'll need to add the word RECURSIVE after the WITH keyword, as shown above.
The query results in the following:
EmployeeID | EmployeeName | ManagerID | Level |
---|---|---|---|
1 | Sarah | Null | 1 |
2 | John | 1 | 2 |
3 | Michelle | 1 | 2 |
4 | Frank | 2 | 3 |
5 | Jerry | 4 | 4 |
6 | Paula | 4 | 4 |
Best practices
Clearly define when recursion should stop. The recursion stops automatically in the above example when there are no more rows to match, ensuring finite results. You can also avoid infinite recursions by explicitly limiting the recursion depth using a WHERE condition or SQL options like MAXRECURSION(in SQL Server) or LEVEL(in Oracle).
Keep the recursive member lightweight by selecting only required fields and avoiding complex calculations. Ensure proper indexing (e.g., on ManagerID) for efficient joins in recursive members.
Window functions
Window functions perform calculations across a defined row window relative to the current row without collapsing the rows into a single result as grouped aggregations do. They are particularly useful for running totals, moving averages, rank calculations, and comparisons between rows.
Ranking functions
ROW_NUMBER()assigns a unique sequential number to each row within a partition, starting from 1.
RANK() assigns a rank to each row within a partition, with gaps in ranking if there are ties.
DENSE_RANK()is similar to RANK() but without gaps in ranking for ties.
We can see how these functions rank data using the below table ORDERS containing order details.
OrderID | OrderDate | CustomerID | OrderAmount | Region |
---|---|---|---|---|
101 | 2024-11-01 | 1 | 250.00 | North |
102 | 2024-11-01 | 2 | 300.00 | North |
103 | 2024-11-02 | 3 | 300.00 | North |
104 | 2024-11-02 | 4 | 200.00 | South |
105 | 2024-11-03 | 5 | 150.00 | South |
106 | 2024-11-03 | 6 | 200.00 | South |
The below query calculates the row number, RANK(), and DENSE_RANK() of an order based on the order amount within a region.
SELECT
Region,
OrderID,
CustomerID,
OrderAmount,
ROW_NUMBER() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS RowNumber,
RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Rank,
DENSE_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS DenseRank
FROM Orders
ORDER BY Region, OrderAmount DESC;
Region | OrderID | CustomerID | OrderAmount | RowNumber | Rank | DenseRank |
---|---|---|---|---|---|---|
North | 102 | 2 | 300.00 | 1 | 1 | 1 |
North | 103 | 3 | 300.00 | 2 | 1 | 1 |
North | 101 | 1 | 250.00 | 3 | 3 | 2 |
South | 104 | 4 | 200.00 | 1 | 1 | 1 |
South | 106 | 6 | 200.00 | 2 | 1 | 1 |
South | 105 | 5 | 150.00 | 3 | 3 | 2 |
Offset functions
LAG()fetches the value of a column from a preceding row within the same partition.
LEAD()fetches the value of a column from a subsequent row within the same partition.
Region | OrderID | CustomerID | OrderAmount | PreviousOrderAmount | NextOrderAmount |
---|---|---|---|---|---|
North | 102 | 2 | 300.00 | NULL | 300.00 |
North | 103 | 3 | 300.00 | 300.00 | 250.00 |
North | 101 | 1 | 250.00 | 300.00 | NULL |
South | 104 | 4 | 200.00 | NULL | 200.00 |
South | 106 | 6 | 200.00 | 200.00 | 150.00 |
South | 105 | 5 | 150.00 | 200.00 | NULL |
Percentile (bucket) function
NTILE(n)divides rows into “n” equal groups and assigns a bucket number to each row. It is useful for creating histograms, analyzing rankings, or splitting data to enable parallel processing
Region | OrderID | CustomerID | OrderAmount | Bucket |
---|---|---|---|---|
North | 102 | 2 | 300.00 | 1 |
North | 103 | 3 | 300.00 | 1 |
North | 101 | 1 | 250.00 | 2 |
South | 104 | 4 | 200.00 | 1 |
South | 106 | 6 | 200.00 | 1 |
South | 105 | 5 | 150.00 | 2 |
Aggregates
Standard aggregate functions like SUM(), AVG(), MIN(), and MAX() can be used as window functions, allowing calculations over a row window. The query below uses SUM() as a window function to calculate the cumulative or running sum.
Union and intersection
UNION, UNION ALL, and INTERSECT are set operators useful when combining or comparing datasets in various analytical and operational contexts.
The UNION operator combines results from two or more queries into a single result set, removing duplicates by default. The UNION ALL operator performs better when duplicate removal is unnecessary.
UNION vs. UNION ALL
Feature | UNION | UNION ALL |
---|---|---|
Duplicates | Removes duplicates | Keep all rows, including duplicates |
Performance | Slower (due to duplicate removal) | Faster |
Use case | When unique results are needed | When all results are needed |
Suppose customer data is stored in two different tables, OnlineCustomers and OfflineCustomers. Some customers may have purchased both online and offline, resulting in duplicates. Union provides a de-duplicated list of customers from both tables.
Similarly, there are scenarios in which all records are needed, even if they are repeated. Consider sales data stored in two tables: OnlineSales and OfflineSales. You need all sales records, even if some transactions (e.g., the same product purchased by the same customer online and offline) are repeated.
On the other hand, the INTERSECT operator returns only the rows that are common to all involved queries.
The INTERSECT operator is not supported in all SQL dialects (e.g., MySQL versions before 8.0 do not support it). If unavailable, you can achieve similar functionality using INNER JOIN or subqueries.
Best practices
Ensure compatibility—the number and order of columns must match across queries with compatible data types.
Use UNION ALL for performance if duplicate removal is unnecessary.
Filter early by applying WHERE conditions in individual queries to reduce data before combining.
Ensure indexed columns are used in INTERSECT and filtering operations for better performance.
Pivoting and unpivoting
Pivoting and unpivoting are complex SQL queries that transform data structures for better analysis and reporting. Pivoting converts rows into columns, which is useful for creating summary or comparative views. Here's an example of pivoting to display the total order amount for each customer by year.
CUSTOMERS
CustomerID | CustomerName |
---|---|
1 | Alice |
2 | Bob |
ORDERS
OrderID | CustomerID | OrderDate | OrderAmount |
---|---|---|---|
101 | 1 | 10-05-2019 | 200 |
102 | 1 | 15-07-2020 | 300 |
103 | 2 | 12-09-2019 | 150 |
104 | 2 | 08-03-2021 | 400 |
The inner query prepares the source data with customer names, years, and order amounts. The PIVOT operator transforms values of OrderYear (2019, 2020, 2021) into columns. The SUM(OrderAmount) aggregates order amounts for each customer name and year.
PIVOTEDDATA
CustomerName | 2019 | 2020 | 2021 |
---|---|---|---|
Alice | 200 | 300 | NULL |
Bob | 150 | NULL | 400 |
Unpivoting converts columns into rows, which is useful for normalizing data or restructuring for analytical purposes.
The below query unpivots the pivoted data to transform the year columns (2019, 2020, 2021) back into rows, making it easier to analyze the data by year and product.
Product | SalesYear | SalesAmount |
---|---|---|
Alice | 2019 | 200 |
Alice | 2020 | 300 |
Bob | 2019 | 150 |
Bob | 2021 | 400 |
However, it's important to note that these operations are dialect-specific. While SQL Server and Oracle natively support the PIVOT and UNPIVOT operators, databases like MySQL and PostgreSQL do not. In such cases, alternative methods like conditional aggregation (for pivoting) or JOIN/UNION operations (for unpivoting) achieve similar results.
For example, the below query achieves pivoting by using a combination of CASE statements and aggregation functions like SUM:
SELECT
c.CustomerName,
SUM(CASE WHEN YEAR(o.OrderDate) = 2019 THEN o.OrderAmount ELSE 0 END) AS [2019],
SUM(CASE WHEN YEAR(o.OrderDate) = 2020 THEN o.OrderAmount ELSE 0 END) AS [2020],
SUM(CASE WHEN YEAR(o.OrderDate) = 2021 THEN o.OrderAmount ELSE 0 END) AS [2021]
FROM
Customers c
INNER JOIN
Orders o ON c.CustomerID = o.CustomerID
GROUP BY
c.CustomerName;
The below query achieve unpivoting using a UNION ALL query, manually stacking columns into rows
Case statements
Case statements allow you to perform conditional logic within a SQL query. They are useful for tasks such as data categorization or creating custom calculations based on specific conditions.
The below query uses a case statement to implement bonus calculation logic. Bonuses are calculated based on employee sales, and bonus calculation rules are defined using different cases.
String manipulation
String manipulation functions enable text data modifications and analysis within queries, such as concatenation, trimming, and extracting substrings. These functions are especially useful in ETL processes, data cleaning, and reporting.
CONCAT combines multiple strings into one.
TRIM removes leading and/or trailing whitespace or specified characters.
SUBSTRING in MySQL (or SUBSTR in SQL Server) extracts a portion of a string.
INSTR in MySQL (or CHARINDEX in SQL Server) finds the position of a substring within a string.
domain | |
---|---|
john.doe@gmail.com | gmail.com |
alice.smith@xyz.com | xyz.com |
REPLACE substitutes occurrences of a substring with another substring.
phone_number | standardized_phone |
---|---|
123-456-7890 | 1234567890 |
987-654-3210 | 9876543210 |
Regular expressions
Regular expressions are used for pattern matching, text manipulation, and validations within SQL. Databases like MySQL and PostgreSQL support regex natively, while others, such as SQL Server, may require additional functions or external libraries.
Example—validating email addresses in MySQL.
Note: SQL syntax and function names for string manipulation (e.g., CHARINDEX in SQL Server vs. INSTR in MySQL) can vary across dialects. Always check the documentation for your specific SQL system to ensure compatibility.
FullName | EmailStatus | |
---|---|---|
John Doe | John.doe@gmail.com | Valid |
Emily Susan | Emily.susan@xyz | Invalid |
Date and time functions
Date and time functions facilitate temporal analysis and transformations, such as calculating intervals, converting date formats, and comparing time-related data. Refer to the documentation to find the best function for your requirements. We give some examples below.
DATEPART extracts a specific part of a date, such as a year, month, day, or hour.
DATEADD adds a specified interval to a date. For example, adding 7 days to the current date.
DATEDIFF calculates the difference between two dates.
FORMAT is useful for formatting dates.
EOMONTH/LAST_DAY is useful for finding the last day of the month
Recommendations for writing complex SQL queries
Keep it simple
Avoid using complex joins, recursive functions, and nested aggregations unless absolutely necessary. Overly complex queries can be difficult to debug and may affect performance.
Build query bottom-up
Start by testing the basic query requirements. Then, introduce complexity step-by-step. This approach ensures that each query stage functions correctly before adding more layers.
Use more readable syntax
Use proper formatting, indentation, and comments to ensure clarity. Where applicable, simplify type conversions and transformations with shorthand syntax.
Adopt a declarative approach
Use a declarative approach to build reusable code blocks and make your SQL logic more modular and understandable. Declarative configurations emphasize specifying what to do rather than how to do it. They tend to be easier to debug, reuse, and optimize.
Unfortunately, SQL statements are only declarative in isolation. Multi-statement SQL scripts dictate the exact order of persistence (e.g., using INSERT and UPDATE in a specific order) and tend to become rigid and monolithic.
DataForge is an open-source, next-generation data transformation tool that improves SQL’s declarative approach. It provides more flexibility and declarative power, especially for complex data transformations and processing tasks.
For example, we could define a “products” source using the below code with Dataforge :
source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
- name: "Round Average Cost"
expression: "ROUND([This].average_cost, 2)"
- name: "Truncate Product Name"
expression: "CASE
WHEN LENGTH([This].product_name) > 30
THEN CONCAT(LEFT([This].product_name, 30), '...')
ELSE [This].product_name
END"
In this declarative approach, DataForge allows users to define queries in terms of transformations and conditions rather than explicit step-by-step updates. This configuration creates reusable rules (e.g., rounding or truncating) and enforces modular, maintainable transformations.
Conclusion
SQL offers a powerful suite of tools for data manipulation and transformation. By mastering techniques like recursive queries, window functions, CTEs, pivoting, and date/time manipulation, engineers can perform intricate data analysis and report directly within SQL, streamlining processes and enhancing efficiency.