Advanced SQL Concepts

Complex SQL Queries Best Practices

Multi-chapter guide | Chapter 5

Table of Contents

  • Summary
  • Recursive queries and CTEs
  • Window functions
  • Union and intersection
  • Pivoting and unpivoting
  • Case statements
  • String manipulation
  • Date and time functions
  • Recommendations for writing complex SQL queries
  • Conclusion

Structured Query Language (SQL) is a standard language for defining, modifying, and controlling data within relational databases. A basic SQL query typically involves selecting data from tables, applying filters, and sorting the results.

However, advanced SQL queries become necessary when data structure and business logic complexity increase. They enable intricate data transformations at scale in areas like ETL (Extract, Transform, Load), data warehousing, data analysis, and reporting. 

Key techniques in complex SQL queries include subqueries, joins, unions, intersections, aggregate functions, window functions, common table expressions (CTEs), pivoting, recursive queries, string manipulation, date and time functions, and case statements. This article provides an overview of each technique, along with best practices to achieve precise and efficient data processing.

Summary of complex SQL Queries techniques

Concept Description
Common Table Expression (CTE) CTEs preserve query results, allowing queries to reference them.
Recursive queries Recursive queries process hierarchical or self-referencing data. They are typically built using CTEs.
Window functions Calculates values over a defined row window but related to the current row, enabling complex analytics.
Union and intersection Combines results from multiple queries (UNION for all, INTERSECT for common results).
Pivoting and unpivoting Transforms data from rows to columns, and vice versa, for comparative analysis.
Case statements Performs conditional logic within an SQL query
String manipulation Performs operations like concatenation, trimming, and substring extraction, including regular expressions
Date and time functions Handles date and time calculations, comparisons, and transformations.

Recursive queries and CTEs

Recursive queries allow you to link one record with others in the same table, making them suitable for handling hierarchical data structures such as organizational charts or product dependencies. They are typically built using Common Table Expressions (CTEs).

CTEs simplify complex queries by breaking them into smaller, reusable subparts. It involves defining the subquery with a meaningful name so that it can be referenced multiple times within the main query. CTEs improve readability and allow you to build queries modularly, making debugging and testing more efficient.

Example

Consider an EMPLOYEE table with a column called ManagerID, which refers to that employee's manager. This allows for many levels of employees and managers.

EmployeeID EmployeeName ManagerID
1 Sarah NULL
2 John 1
3 Michelle 1
4 Frank 2
5 Jerry 4
6 Paula 4

We can use a recursive query with CTE to retrieve the hierarchy of employees with their levels.

  WITH RECURSIVE OrgHierarchy AS (
-- Anchor Member: Start with the top-level manager (no ManagerID)
    SELECT 
        EmployeeID, 
        EmployeeName, 
        ManagerID, 
        1 AS Level
    FROM Employee
    WHERE ManagerID IS NULL

    UNION ALL

-- Recursive Member: Fetch employees reporting to the current level
    SELECT 
        e.EmployeeID, 
        e.EmployeeName, 
        e.ManagerID, 
        oh.Level + 1 AS Level
    FROM Employee e
    INNER JOIN OrgHierarchy oh 
        ON e.ManagerID = oh.EmployeeID
)
-- Final Query: Retrieve the hierarchical structure
SELECT 
    EmployeeID, 
    EmployeeName, 
    ManagerID, 
    Level
FROM OrgHierarchy
ORDER BY Level, EmployeeID;
  

The RECURSIVE keyword is not required if you're using Oracle or SQL Server. The recursion is implicit when the CTE references itself. If you're using MySQL or Postgres, you'll need to add the word RECURSIVE after the WITH keyword, as shown above.

The query results in the following:

EmployeeID EmployeeName ManagerID Level
1 Sarah Null 1
2 John 1 2
3 Michelle 1 2
4 Frank 2 3
5 Jerry 4 4
6 Paula 4 4

Best practices

Clearly define when recursion should stop. The recursion stops automatically in the above example when there are no more rows to match, ensuring finite results. You can also avoid infinite recursions by explicitly limiting the recursion depth using a WHERE condition or SQL options like MAXRECURSION(in SQL Server) or LEVEL(in Oracle).

Keep the recursive member lightweight by selecting only required fields and avoiding complex calculations. Ensure proper indexing (e.g., on ManagerID) for efficient joins in recursive members.

Window functions

Window functions perform calculations across a defined row window relative to the current row without collapsing the rows into a single result as grouped aggregations do. They are particularly useful for running totals, moving averages, rank calculations, and comparisons between rows.

Ranking functions

  • ROW_NUMBER()assigns a unique sequential number to each row within a partition, starting from 1.

  • RANK() assigns a rank to each row within a partition, with gaps in ranking if there are ties.

  • DENSE_RANK()is similar to RANK() but without gaps in ranking for ties.

We can see how these functions rank data using the below table ORDERS containing order details.

OrderID OrderDate CustomerID OrderAmount Region
101 2024-11-01 1 250.00 North
102 2024-11-01 2 300.00 North
103 2024-11-02 3 300.00 North
104 2024-11-02 4 200.00 South
105 2024-11-03 5 150.00 South
106 2024-11-03 6 200.00 South

The below query calculates the row number, RANK(), and DENSE_RANK() of an order based on the order amount within a region.

  SELECT 
    Region,
    OrderID,   
    CustomerID,
    OrderAmount,
    ROW_NUMBER() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS RowNumber,
    RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Rank,
    DENSE_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS DenseRank
FROM Orders 
ORDER BY Region, OrderAmount DESC;
  
Region OrderID CustomerID OrderAmount RowNumber Rank DenseRank
North 102 2 300.00 1 1 1
North 103 3 300.00 2 1 1
North 101 1 250.00 3 3 2
South 104 4 200.00 1 1 1
South 106 6 200.00 2 1 1
South 105 5 150.00 3 3 2

Offset functions

  • LAG()fetches the value of a column from a preceding row within the same partition.

  • LEAD()fetches the value of a column from a subsequent row within the same partition.

  SELECT 
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    LAG(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS PreviousOrderAmount,
    LEAD(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS NextOrderAmount
FROM Orders
ORDER BY Region, OrderAmount DESC;
  
Region OrderID CustomerID OrderAmount PreviousOrderAmount NextOrderAmount
North 102 2 300.00 NULL 300.00
North 103 3 300.00 300.00 250.00
North 101 1 250.00 300.00 NULL
South 104 4 200.00 NULL 200.00
South 106 6 200.00 200.00 150.00
South 105 5 150.00 200.00 NULL

Percentile (bucket) function

NTILE(n)divides rows into “n” equal groups and assigns a bucket number to each row.  It is useful for creating histograms, analyzing rankings, or splitting data to enable parallel processing

  SELECT
 Region, OrderID, CustomerID, OrderAmount, 
 NTILE(2) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Bucket 
FROM Orders ORDER BY Region, OrderAmount DESC;
  
Region OrderID CustomerID OrderAmount Bucket
North 102 2 300.00 1
North 103 3 300.00 1
North 101 1 250.00 2
South 104 4 200.00 1
South 106 6 200.00 1
South 105 5 150.00 2

Aggregates

Standard aggregate functions like SUM(), AVG(), MIN(), and MAX() can be used as window functions, allowing calculations over a row window. The query below uses SUM() as a window function to calculate the cumulative or running sum.

  SELECT 
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    SUM(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC ROWS UNBOUNDED PRECEDING) AS RunningSum
FROM Orders
ORDER BY Region, OrderAmount DESC;
  

Union and intersection

UNION, UNION ALL, and INTERSECT are set operators useful when combining or comparing datasets in various analytical and operational contexts.

The UNION operator combines results from two or more queries into a single result set, removing duplicates by default. The UNION ALL operator performs better when duplicate removal is unnecessary.

UNION vs. UNION ALL

Feature UNION UNION ALL
Duplicates Removes duplicates Keep all rows, including duplicates
Performance Slower (due to duplicate removal) Faster
Use case When unique results are needed When all results are needed

Suppose customer data is stored in two different tables, OnlineCustomers and OfflineCustomers. Some customers may have purchased both online and offline, resulting in duplicates. Union provides a de-duplicated list of customers from both tables.

  SELECT CustomerID, Name, Email FROM OnlineCustomers
UNION
SELECT CustomerID, Name, Email FROM OfflineCustomers;
  

Similarly, there are scenarios in which all records are needed, even if they are repeated. Consider sales data stored in two tables: OnlineSales and OfflineSales. You need all sales records, even if some transactions (e.g., the same product purchased by the same customer online and offline) are repeated.

  SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OnlineSales
UNION ALL
SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OfflineSales;
  

On the other hand, the INTERSECT operator returns only the rows that are common to all involved queries.

  SELECT CustomerID, Name, Email FROM OnlineCustomers 
INTERSECT 
SELECT CustomerID, Name, Email FROM OfflineCustomers;
  

The INTERSECT operator is not supported in all SQL dialects (e.g., MySQL versions before 8.0 do not support it). If unavailable, you can achieve similar functionality using INNER JOIN or subqueries.

Best practices

  1. Ensure compatibility—the number and order of columns must match across queries with compatible data types.

  2. Use UNION ALL for performance if duplicate removal is unnecessary.

  3. Filter early by applying WHERE conditions in individual queries to reduce data before combining.

  4. Ensure indexed columns are used in INTERSECT and filtering operations for better performance.

Pivoting and unpivoting

Pivoting and unpivoting are complex SQL queries that transform data structures for better analysis and reporting. Pivoting converts rows into columns, which is useful for creating summary or comparative views. Here's an example of pivoting to display the total order amount for each customer by year.

CUSTOMERS

CustomerID CustomerName
1 Alice
2 Bob

ORDERS

OrderID CustomerID OrderDate OrderAmount
101 1 10-05-2019 200
102 1 15-07-2020 300
103 2 12-09-2019 150
104 2 08-03-2021 400
  SELECT *
FROM (
    SELECT c.CustomerName, YEAR(o.OrderDate) AS OrderYear, o.OrderAmount
    FROM Customers c
    INNER JOIN Orders o ON c.CustomerID = o.CustomerID
) AS SourceData
PIVOT (
    SUM(OrderAmount)
    FOR OrderYear IN ([2019], [2020], [2021])
) AS PivotTable;
  

The inner query prepares the source data with customer names, years, and order amounts. The PIVOT operator transforms values of OrderYear (2019, 2020, 2021) into columns. The SUM(OrderAmount) aggregates order amounts for each customer name and year.

PIVOTEDDATA

CustomerName 2019 2020 2021
Alice 200 300 NULL
Bob 150 NULL 400

Unpivoting converts columns into rows, which is useful for normalizing data or restructuring for analytical purposes. 

The below query unpivots the pivoted data to transform the year columns (2019, 2020, 2021) back into rows, making it easier to analyze the data by year and product.

  SELECT CustomerName, SalesYear, SalesAmount
FROM (
    SELECT CustomerName, [2019], [2020], [2021]
    FROM PivotedData
) AS PivotedData
UNPIVOT (
    SalesAmount FOR SalesYear IN ([2019], [2020], [2021])
) AS UnpivotedData;
  
Product SalesYear SalesAmount
Alice 2019 200
Alice 2020 300
Bob 2019 150
Bob 2021 400

However, it's important to note that these operations are dialect-specific. While SQL Server and Oracle natively support the PIVOT and UNPIVOT operators, databases like MySQL and PostgreSQL do not. In such cases, alternative methods like conditional aggregation (for pivoting) or JOIN/UNION operations (for unpivoting) achieve similar results.

For example, the below query achieves pivoting by using a combination of CASE statements and aggregation functions like SUM:

  SELECT 
    c.CustomerName,
    SUM(CASE WHEN YEAR(o.OrderDate) = 2019 THEN o.OrderAmount ELSE 0 END) AS [2019],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2020 THEN o.OrderAmount ELSE 0 END) AS [2020],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2021 THEN o.OrderAmount ELSE 0 END) AS [2021]
FROM 
    Customers c
INNER JOIN 
    Orders o ON c.CustomerID = o.CustomerID
GROUP BY 
    c.CustomerName;
  

The below query achieve unpivoting using a UNION ALL query, manually stacking columns into rows

  SELECT CustomerName, '2019' AS SalesYear, [2019] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2020' AS SalesYear, [2020] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2021' AS SalesYear, [2021] AS SalesAmount
FROM PivotedData;
  

Case statements

Case statements allow you to perform conditional logic within a SQL query. They are useful for tasks such as data categorization or creating custom calculations based on specific conditions.

The below query uses a case statement to implement bonus calculation logic. Bonuses are calculated based on employee sales, and bonus calculation rules are defined using different cases.

  SELECT 
    Name,
    Sales,
    CASE 
        WHEN Sales >= 700 THEN Salary * 0.10
        WHEN Sales >= 500 THEN Salary * 0.05
        ELSE Sales * 0.02
    END AS Bonus
FROM Employees;
  

String manipulation

String manipulation functions enable text data modifications and analysis within queries, such as concatenation, trimming, and extracting substrings. These functions are especially useful in ETL processes, data cleaning, and reporting.

CONCAT combines multiple strings into one.

  SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;
  
  • TRIM removes leading and/or trailing whitespace or specified characters.

  • SUBSTRING in MySQL (or SUBSTR in SQL Server) extracts a portion of a string.

  • INSTR in MySQL (or CHARINDEX in SQL Server) finds the position of a substring within a string.

  SELECT 
    email,
    SUBSTRING(email, INSTR(email, '@') + 1) AS domain 
FROM users;
  
email domain
john.doe@gmail.com gmail.com
alice.smith@xyz.com xyz.com

REPLACE substitutes occurrences of a substring with another substring.

  SELECT 
    phone_number,
    REPLACE(phone_number, '-', '') AS standardized_phone
FROM contacts;
  
phone_number standardized_phone
123-456-7890 1234567890
987-654-3210 9876543210

Regular expressions

Regular expressions are used for pattern matching, text manipulation, and validations within SQL. Databases like MySQL and PostgreSQL support regex natively, while others, such as SQL Server, may require additional functions or external libraries.

Example—validating email addresses in MySQL.

  SELECT 
    FullName,
    CASE 
        WHEN Email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' THEN 'Valid'
        ELSE 'Invalid'
    END AS EmailStatus
FROM Users;
  

Note: SQL syntax and function names for string manipulation (e.g., CHARINDEX in SQL Server vs. INSTR in MySQL) can vary across dialects. Always check the documentation for your specific SQL system to ensure compatibility.

FullName Email EmailStatus
John Doe John.doe@gmail.com Valid
Emily Susan Emily.susan@xyz Invalid

Date and time functions

Date and time functions facilitate temporal analysis and transformations, such as calculating intervals, converting date formats, and comparing time-related data. Refer to the documentation to find the best function for your requirements. We give some examples below.

DATEPART extracts a specific part of a date, such as a year, month, day, or hour.

  -- SQL Server
SELECT 
    DATEPART(YEAR, GETDATE()) AS Year,
    DATEPART(MONTH, GETDATE()) AS Month,
    DATEPART(DAY, GETDATE()) AS Day;

-- PostgreSQL/Oracle
SELECT 
    EXTRACT(YEAR FROM CURRENT_DATE) AS Year,
    EXTRACT(MONTH FROM CURRENT_DATE) AS Month,
    EXTRACT(DAY FROM CURRENT_DATE) AS Day;
  

DATEADD adds a specified interval to a date. For example, adding 7 days to the current date.

  -- SQL Server
SELECT DATEADD(DAY, 7, GETDATE()) AS NewDate;

-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '7 days' AS NewDate;

-- Oracle
SELECT CURRENT_DATE + 7 AS NewDate;
  

DATEDIFF calculates the difference between two dates.

  -- SQL Server
SELECT DATEDIFF(DAY, '2023-11-01', GETDATE()) AS DaysDifference;

-- PostgreSQL
SELECT EXTRACT(DAY FROM (CURRENT_DATE - DATE '2023-11-01')) AS DaysDifference;

-- Oracle
SELECT TRUNC(CURRENT_DATE - DATE '2023-11-01') AS DaysDifference;
  

FORMAT is useful for formatting dates.

  -- SQL Server
SELECT FORMAT(GETDATE(), 'yyyy-MM-dd HH:mm:ss') AS FormattedDate;

-- PostgreSQL/Oracle
SELECT TO_CHAR(CURRENT_DATE, 'YYYY-MM-DD HH24:MI:SS') AS FormattedDate;
  

EOMONTH/LAST_DAY is useful for finding the last day of the month

  -- SQL Server
SELECT EOMONTH(GETDATE()) AS LastDayOfMonth;

-- Oracle
SELECT LAST_DAY(CURRENT_DATE) AS LastDayOfMonth;

--- PostgreSQL
SELECT (DATE_TRUNC('month', now()) + interval '1 month - 1 day');
  

Recommendations for writing complex SQL queries

Keep it simple

Avoid using complex joins, recursive functions, and nested aggregations unless absolutely necessary. Overly complex queries can be difficult to debug and may affect performance.

Build query bottom-up

Start by testing the basic query requirements. Then, introduce complexity step-by-step. This approach ensures that each query stage functions correctly before adding more layers.

Use more readable syntax

Use proper formatting, indentation, and comments to ensure clarity. Where applicable, simplify type conversions and transformations with shorthand syntax.

Adopt a declarative approach

Use a declarative approach to build reusable code blocks and make your SQL logic more modular and understandable. Declarative configurations emphasize specifying what to do rather than how to do it. They tend to be easier to debug, reuse, and optimize. 

Unfortunately, SQL statements are only declarative in isolation. Multi-statement SQL scripts dictate the exact order of persistence (e.g., using INSERT and UPDATE in a specific order) and tend to become rigid and monolithic. 

DataForge is an open-source, next-generation data transformation tool that improves SQL’s declarative approach. It provides more flexibility and declarative power, especially for complex data transformations and processing tasks. 

For example, we could define a “products” source using the below code with Dataforge :

  source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
 - name: "Round Average Cost"
   expression: "ROUND([This].average_cost, 2)"
 - name: "Truncate Product Name"
   expression: "CASE
     WHEN LENGTH([This].product_name) > 30
     THEN CONCAT(LEFT([This].product_name, 30), '...')
     ELSE [This].product_name
   END"
  

In this declarative approach, DataForge allows users to define queries in terms of transformations and conditions rather than explicit step-by-step updates. This configuration creates reusable rules (e.g., rounding or truncating) and enforces modular, maintainable transformations.

Conclusion

SQL offers a powerful suite of tools for data manipulation and transformation. By mastering techniques like recursive queries, window functions, CTEs, pivoting, and date/time manipulation, engineers can perform intricate data analysis and report directly within SQL, streamlining processes and enhancing efficiency.

Navigate Chapters: