Key techniques in complex SQL queries include subqueries, joins, unions, intersections, aggregate functions, window functions, common table expressions (CTEs), pivoting, recursive queries, string manipulation, date and time functions, and case statements. This article provides an overview of each technique, along with best practices to achieve precise and efficient data processing.

Summary of complex SQL Queries techniques

  
      Concept
      Description
    
      Common Table Expression (CTE)
      CTEs preserve query results, allowing queries to reference them.    
    
      Recursive queries
      Recursive queries process hierarchical or self-referencing data. They are typically built using CTEs.    
    
      Window functions
      Calculates values over a defined row window but related to the current row, enabling complex analytics.    
    
      Union and intersection
      Combines results from multiple queries (UNION for all, INTERSECT for common results).    
    
      Pivoting and unpivoting
      Transforms data from rows to columns, and vice versa, for comparative analysis.    
    
      Case statements
      Performs conditional logic within an SQL query    
    
      String manipulation
      Performs operations like concatenation, trimming, and substring extraction, including regular expressions    
    
      Date and time functions
      Handles date and time calculations, comparisons, and transformations.

Recursive queries and CTEs

Recursive queries allow you to link one record with others in the same table, making them suitable for handling hierarchical data structures such as organizational charts or product dependencies. They are typically built using Common Table Expressions (CTEs).

CTEs simplify complex queries by breaking them into smaller, reusable subparts. It involves defining the subquery with a meaningful name so that it can be referenced multiple times within the main query. CTEs improve readability and allow you to build queries modularly, making debugging and testing more efficient.

Example

Consider an EMPLOYEE table with a column called ManagerID, which refers to that employee's manager. This allows for many levels of employees and managers.

  
      EmployeeID
      EmployeeName
      ManagerID
    
      1
      Sarah
      NULL
    
      2
      John
      1
    
      3
      Michelle
      1
    
      4
      Frank
      2
    
      5
      Jerry
      4
    
      6
      Paula
      4

We can use a recursive query with CTE to retrieve the hierarchy of employees with their levels.


    WITH RECURSIVE OrgHierarchy AS (
-- Anchor Member: Start with the top-level manager (no ManagerID)
    SELECT 
        EmployeeID, 
        EmployeeName, 
        ManagerID, 
        1 AS Level
    FROM Employee
    WHERE ManagerID IS NULL

    UNION ALL

-- Recursive Member: Fetch employees reporting to the current level
    SELECT 
        e.EmployeeID, 
        e.EmployeeName, 
        e.ManagerID, 
        oh.Level + 1 AS Level
    FROM Employee e
    INNER JOIN OrgHierarchy oh 
        ON e.ManagerID = oh.EmployeeID
)
-- Final Query: Retrieve the hierarchical structure
SELECT 
    EmployeeID, 
    EmployeeName, 
    ManagerID, 
    Level
FROM OrgHierarchy
ORDER BY Level, EmployeeID;

The RECURSIVE keyword is not required if you're using Oracle or SQL Server. The recursion is implicit when the CTE references itself. If you're using MySQL or Postgres, you'll need to add the word RECURSIVE after the WITH keyword, as shown above.

The query results in the following:

  
      EmployeeID
      EmployeeName
      ManagerID
      Level
    

  
      1
      Sarah
      Null
      1
    

      2
      John
      1
      2
    

      3
      Michelle
      1
      2
    

      4
      Frank
      2
      3
    

      5
      Jerry
      4
      4
    

      6
      Paula
      4
      4
    

Best practices

Clearly define when recursion should stop. The recursion stops automatically in the above example when there are no more rows to match, ensuring finite results. You can also avoid infinite recursions by explicitly limiting the recursion depth using a WHERE condition or SQL options like MAXRECURSION(in SQL Server) or LEVEL(in Oracle).

Keep the recursive member lightweight by selecting only required fields and avoiding complex calculations. Ensure proper indexing (e.g., on ManagerID) for efficient joins in recursive members.

Window functions

Window functions perform calculations across a defined row window relative to the current row without collapsing the rows into a single result as grouped aggregations do. They are particularly useful for running totals, moving averages, rank calculations, and comparisons between rows.

Ranking functions

ROW_NUMBER()assigns a unique sequential number to each row within a partition, starting from 1.

RANK() assigns a rank to each row within a partition, with gaps in ranking if there are ties.
DENSE_RANK()is similar to RANK() but without gaps in ranking for ties.

We can see how these functions rank data using the below table ORDERS containing order details.

  
      OrderID
      OrderDate
      CustomerID
      OrderAmount
      Region
    

  
      101
      2024-11-01
      1 
      250.00
      North
    

      102
      2024-11-01
      2
      300.00
      North
    

      103
      2024-11-02
      3
      300.00
      North
    

      104
      2024-11-02
      4
      200.00
      South
    

      105
      2024-11-03
      5
      150.00
      South
    

      106
      2024-11-03
      6
      200.00
      South
    

The below query calculates the row number, RANK(), and DENSE_RANK() of an order based on the order amount within a region.


    SELECT 
    Region,
    OrderID,   
    CustomerID,
    OrderAmount,
    ROW_NUMBER() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS RowNumber,
    RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Rank,
    DENSE_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS DenseRank
FROM Orders 
ORDER BY Region, OrderAmount DESC;

  
      Region
      OrderID
      CustomerID
      OrderAmount
      RowNumber
      Rank
      DenseRank
    

  
      North
      102
      2
      300.00
      1
      1
      1
    

      North
      103
      3
      300.00
      2
      1
      1
    

      North
      101
      1
      250.00
      3
      3
      2
    

      South
      104
      4
      200.00
      1
      1
      1
    

      South
      106
      6
      200.00
      2
      1
      1
    

      South
      105
      5
      150.00
      3
      3
      2
    

Offset functions

LAG()fetches the value of a column from a preceding row within the same partition.
LEAD()fetches the value of a column from a subsequent row within the same partition.


    SELECT 
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    LAG(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS PreviousOrderAmount,
    LEAD(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS NextOrderAmount
FROM Orders
ORDER BY Region, OrderAmount DESC;

  
      Region
      OrderID
      CustomerID
      OrderAmount
      PreviousOrderAmount
      NextOrderAmount
    

  
      North
      102
      2
      300.00
      NULL
      300.00
    

      North
      103
      3
      300.00
      300.00
      250.00
    

      North
      101
      1
      250.00
      300.00
      NULL
    

      South
      104
      4
      200.00
      NULL
      200.00
    

      South
      106
      6
      200.00
      200.00
      150.00
    

      South
      105
      5
      150.00
      200.00
      NULL
    

Percentile (bucket) function

NTILE(n)divides rows into “n” equal groups and assigns a bucket number to each row. It is useful for creating histograms, analyzing rankings, or splitting data to enable parallel processing


    SELECT
 Region, OrderID, CustomerID, OrderAmount, 
 NTILE(2) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Bucket 
FROM Orders ORDER BY Region, OrderAmount DESC;

  
      Region
      OrderID
      CustomerID
      OrderAmount
      Bucket
    

  
      North
      102
      2
      300.00
      1
    

      North
      103
      3
      300.00
      1
    

      North
      101
      1
      250.00
      2
    

      South
      104
      4
      200.00
      1
    

      South
      106
      6
      200.00
      1
    

      South
      105
      5
      150.00
      2
    

Aggregates

Standard aggregate functions like SUM(), AVG(), MIN(), and MAX() can be used as window functions, allowing calculations over a row window. The query below uses SUM() as a window function to calculate the cumulative or running sum.


    SELECT 
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    SUM(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC ROWS UNBOUNDED PRECEDING) AS RunningSum
FROM Orders
ORDER BY Region, OrderAmount DESC;

Union and intersection

UNION, UNION ALL, and INTERSECT are set operators useful when combining or comparing datasets in various analytical and operational contexts.

The UNION operator combines results from two or more queries into a single result set, removing duplicates by default. The UNION ALL operator performs better when duplicate removal is unnecessary.

UNION vs. UNION ALL

  
      Feature 
      UNION
      UNION ALL
    
      Duplicates
      Removes duplicates
      Keep all rows, including duplicates
    
      Performance
      Slower (due to duplicate removal)
      Faster
    
      Use case
      When unique results are needed
      When all results are needed

Suppose customer data is stored in two different tables, OnlineCustomers and OfflineCustomers. Some customers may have purchased both online and offline, resulting in duplicates. Union provides a de-duplicated list of customers from both tables.


    SELECT CustomerID, Name, Email FROM OnlineCustomers
UNION
SELECT CustomerID, Name, Email FROM OfflineCustomers;

Similarly, there are scenarios in which all records are needed, even if they are repeated. Consider sales data stored in two tables: OnlineSales and OfflineSales. You need all sales records, even if some transactions (e.g., the same product purchased by the same customer online and offline) are repeated.


    SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OnlineSales
UNION ALL
SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OfflineSales;

On the other hand, the INTERSECT operator returns only the rows that are common to all involved queries.


    SELECT CustomerID, Name, Email FROM OnlineCustomers 
INTERSECT 
SELECT CustomerID, Name, Email FROM OfflineCustomers;

The INTERSECT operator is not supported in all SQL dialects (e.g., MySQL versions before 8.0 do not support it). If unavailable, you can achieve similar functionality using INNER JOIN or subqueries.

Best practices

Ensure compatibility—the number and order of columns must match across queries with compatible data types.
Use UNION ALL for performance if duplicate removal is unnecessary.
Filter early by applying WHERE conditions in individual queries to reduce data before combining.
Ensure indexed columns are used in INTERSECT and filtering operations for better performance.

Pivoting and unpivoting

Pivoting and unpivoting are complex SQL queries that transform data structures for better analysis and reporting. Pivoting converts rows into columns, which is useful for creating summary or comparative views. Here's an example of pivoting to display the total order amount for each customer by year.

CUSTOMERS

  
      CustomerID
      CustomerName
    
      1
      Alice
    
      2
      Bob

ORDERS

  
      OrderID
      CustomerID
      OrderDate
      OrderAmount
    

  
      101
      1
      10-05-2019
      200
    

      102
      1
      15-07-2020
      300
    

      103
      2
      12-09-2019
      150
    

      104
      2
      08-03-2021
      400
    


    SELECT *
FROM (
    SELECT c.CustomerName, YEAR(o.OrderDate) AS OrderYear, o.OrderAmount
    FROM Customers c
    INNER JOIN Orders o ON c.CustomerID = o.CustomerID
) AS SourceData
PIVOT (
    SUM(OrderAmount)
    FOR OrderYear IN ([2019], [2020], [2021])
) AS PivotTable;

The inner query prepares the source data with customer names, years, and order amounts. The PIVOT operator transforms values of OrderYear (2019, 2020, 2021) into columns. The SUM(OrderAmount) aggregates order amounts for each customer name and year.

PIVOTEDDATA

  
      CustomerName
      2019
      2020
      2021
    

  
      Alice
      200
      300
      NULL
    

      Bob
      150
      NULL
      400
    

Unpivoting converts columns into rows, which is useful for normalizing data or restructuring for analytical purposes.

The below query unpivots the pivoted data to transform the year columns (2019, 2020, 2021) back into rows, making it easier to analyze the data by year and product.


    SELECT CustomerName, SalesYear, SalesAmount
FROM (
    SELECT CustomerName, [2019], [2020], [2021]
    FROM PivotedData
) AS PivotedData
UNPIVOT (
    SalesAmount FOR SalesYear IN ([2019], [2020], [2021])
) AS UnpivotedData;

  
      Product
      SalesYear
      SalesAmount
    
      Alice
      2019
      200
    
      Alice
      2020
      300
    
      Bob
      2019
      150
    
      Bob
      2021
      400

However, it's important to note that these operations are dialect-specific. While SQL Server and Oracle natively support the PIVOT and UNPIVOT operators, databases like MySQL and PostgreSQL do not. In such cases, alternative methods like conditional aggregation (for pivoting) or JOIN/UNION operations (for unpivoting) achieve similar results.

For example, the below query achieves pivoting by using a combination of CASE statements and aggregation functions like SUM:


    SELECT 
    c.CustomerName,
    SUM(CASE WHEN YEAR(o.OrderDate) = 2019 THEN o.OrderAmount ELSE 0 END) AS [2019],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2020 THEN o.OrderAmount ELSE 0 END) AS [2020],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2021 THEN o.OrderAmount ELSE 0 END) AS [2021]
FROM 
    Customers c
INNER JOIN 
    Orders o ON c.CustomerID = o.CustomerID
GROUP BY 
    c.CustomerName;

The below query achieve unpivoting using a UNION ALL query, manually stacking columns into rows


    SELECT CustomerName, '2019' AS SalesYear, [2019] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2020' AS SalesYear, [2020] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2021' AS SalesYear, [2021] AS SalesAmount
FROM PivotedData;

Case statements

Case statements allow you to perform conditional logic within a SQL query. They are useful for tasks such as data categorization or creating custom calculations based on specific conditions.

The below query uses a case statement to implement bonus calculation logic. Bonuses are calculated based on employee sales, and bonus calculation rules are defined using different cases.


    SELECT 
    Name,
    Sales,
    CASE 
        WHEN Sales >= 700 THEN Salary * 0.10
        WHEN Sales >= 500 THEN Salary * 0.05
        ELSE Sales * 0.02
    END AS Bonus
FROM Employees;

String manipulation

String manipulation functions enable text data modifications and analysis within queries, such as concatenation, trimming, and extracting substrings. These functions are especially useful in ETL processes, data cleaning, and reporting.

CONCAT combines multiple strings into one.


    SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;

TRIM removes leading and/or trailing whitespace or specified characters.
SUBSTRING in MySQL (or SUBSTR in SQL Server) extracts a portion of a string.
INSTR in MySQL (or CHARINDEX in SQL Server) finds the position of a substring within a string.


    SELECT 
    email,
    SUBSTRING(email, INSTR(email, '@') + 1) AS domain 
FROM users;

  
      email
      domain
    
      john.doe@gmail.com
      gmail.com
    
      alice.smith@xyz.com
      xyz.com

REPLACE substitutes occurrences of a substring with another substring.


    SELECT 
    phone_number,
    REPLACE(phone_number, '-', '') AS standardized_phone
FROM contacts;

  
      phone_number
      standardized_phone
    
      123-456-7890
      1234567890
    
      987-654-3210
      9876543210

Regular expressions

Regular expressions are used for pattern matching, text manipulation, and validations within SQL. Databases like MySQL and PostgreSQL support regex natively, while others, such as SQL Server, may require additional functions or external libraries.

Example—validating email addresses in MySQL.


    SELECT 
    FullName,
    CASE 
        WHEN Email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' THEN 'Valid'
        ELSE 'Invalid'
    END AS EmailStatus
FROM Users;

Note: SQL syntax and function names for string manipulation (e.g., CHARINDEX in SQL Server vs. INSTR in MySQL) can vary across dialects. Always check the documentation for your specific SQL system to ensure compatibility.

  
      FullName
      Email
      EmailStatus
    
      John Doe
      John.doe@gmail.com
      Valid
    
      Emily Susan
      Emily.susan@xyz
      Invalid

Date and time functions

Date and time functions facilitate temporal analysis and transformations, such as calculating intervals, converting date formats, and comparing time-related data. Refer to the documentation to find the best function for your requirements. We give some examples below.

DATEPART extracts a specific part of a date, such as a year, month, day, or hour.


    -- SQL Server
SELECT 
    DATEPART(YEAR, GETDATE()) AS Year,
    DATEPART(MONTH, GETDATE()) AS Month,
    DATEPART(DAY, GETDATE()) AS Day;

-- PostgreSQL/Oracle
SELECT 
    EXTRACT(YEAR FROM CURRENT_DATE) AS Year,
    EXTRACT(MONTH FROM CURRENT_DATE) AS Month,
    EXTRACT(DAY FROM CURRENT_DATE) AS Day;

DATEADD adds a specified interval to a date. For example, adding 7 days to the current date.


    -- SQL Server
SELECT DATEADD(DAY, 7, GETDATE()) AS NewDate;

-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '7 days' AS NewDate;

-- Oracle
SELECT CURRENT_DATE + 7 AS NewDate;

DATEDIFF calculates the difference between two dates.


    -- SQL Server
SELECT DATEDIFF(DAY, '2023-11-01', GETDATE()) AS DaysDifference;

-- PostgreSQL
SELECT EXTRACT(DAY FROM (CURRENT_DATE - DATE '2023-11-01')) AS DaysDifference;

-- Oracle
SELECT TRUNC(CURRENT_DATE - DATE '2023-11-01') AS DaysDifference;

FORMAT is useful for formatting dates.


    -- SQL Server
SELECT FORMAT(GETDATE(), 'yyyy-MM-dd HH:mm:ss') AS FormattedDate;

-- PostgreSQL/Oracle
SELECT TO_CHAR(CURRENT_DATE, 'YYYY-MM-DD HH24:MI:SS') AS FormattedDate;

EOMONTH/LAST_DAY is useful for finding the last day of the month


    -- SQL Server
SELECT EOMONTH(GETDATE()) AS LastDayOfMonth;

-- Oracle
SELECT LAST_DAY(CURRENT_DATE) AS LastDayOfMonth;

--- PostgreSQL
SELECT (DATE_TRUNC('month', now()) + interval '1 month - 1 day');

Recommendations for writing complex SQL queries

Keep it simple

Avoid using complex joins, recursive functions, and nested aggregations unless absolutely necessary. Overly complex queries can be difficult to debug and may affect performance.

Build query bottom-up

Start by testing the basic query requirements. Then, introduce complexity step-by-step. This approach ensures that each query stage functions correctly before adding more layers.

Use more readable syntax

Use proper formatting, indentation, and comments to ensure clarity. Where applicable, simplify type conversions and transformations with shorthand syntax.

Adopt a declarative approach

Use a declarative approach to build reusable code blocks and make your SQL logic more modular and understandable. Declarative configurations emphasize specifying what to do rather than how to do it. They tend to be easier to debug, reuse, and optimize.

Unfortunately, SQL statements are only declarative in isolation. Multi-statement SQL scripts dictate the exact order of persistence (e.g., using INSERT and UPDATE in a specific order) and tend to become rigid and monolithic.

DataForge is an open-source, next-generation data transformation tool that improves SQL’s declarative approach. It provides more flexibility and declarative power, especially for complex data transformations and processing tasks.

For example, we could define a “products” source using the below code with Dataforge :


    source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
 - name: "Round Average Cost"
   expression: "ROUND([This].average_cost, 2)"
 - name: "Truncate Product Name"
   expression: "CASE
     WHEN LENGTH([This].product_name) > 30
     THEN CONCAT(LEFT([This].product_name, 30), '...')
     ELSE [This].product_name
   END"

In this declarative approach, DataForge allows users to define queries in terms of transformations and conditions rather than explicit step-by-step updates. This configuration creates reusable rules (e.g., rounding or truncating) and enforces modular, maintainable transformations.

Conclusion

SQL offers a powerful suite of tools for data manipulation and transformation. By mastering techniques like recursive queries, window functions, CTEs, pivoting, and date/time manipulation, engineers can perform intricate data analysis and report directly within SQL, streamlining processes and enhancing efficiency.

Navigate Chapters:

Complex SQL Queries Best Practices

Table of Contents

Summary of complex SQL Queries techniques

Recursive queries and CTEs

Example

Best practices

Window functions

Ranking functions

Offset functions

Percentile (bucket) function

Aggregates

Union and intersection

UNION vs. UNION ALL

Best practices

Pivoting and unpivoting

Case statements

String manipulation

Regular expressions

Date and time functions

Recommendations for writing complex SQL queries

Keep it simple

Build query bottom-up

Use more readable syntax

Adopt a declarative approach

Conclusion

Continue reading this series

Resources

Product

Legal

Follow