Advanced SQL Concepts

SQL Temporary Tables: Modern Iterations

Multi-chapter guide | Chapter 2

Temporary tables in SQL are intermediary tables that act as staging areas for data before transformation. For many years, temporary tables have been convenient tools for engineers to manage data and improve query execution performance without complicating the original schemas. They reduce the need for multiple subqueries or repeated joins.

Originally used mainly for reporting and warehousing, temporary tables have been extended by relational database management systems to become:

  • Global, meaning shareable across multiple sessions

  • Indexable, enabling much better performance

  • Replaceable by lightweight table variables when the data is less complicated

With the rise of big data technologies, these features became built-in capabilities, eliminating the need for engineers to implement temporary table mechanics by hand. For example:

  • Dataframes in Apache Spark allow users to create temporary views easily.

  • Data warehouses like Google’s BigQuery also provide temporary tables within sessions and multi-statement queries.

  • Common Table Expressions (CTEs) in SQL behave very similarly to temporary tables without creating physical tables.

While modern data technologies have simplified the usage of temporary tables, their principles and functionality are still widely present in the data engineering industry. The concept of temporary tables and their abilities is valuable knowledge for engineers who want to maximize the functionality of modern data tools.

This article compares tools and technologies based on temporary tables widely used in today’s data-driven technology landscape.

Summary of key SQL temporary table concepts

  • RDBMS temporary tables: Temporary tables that store intermediate results within a single session hosted by an RDBMS.

  • CTEs and non-materialized views: Temporary result sets that simplify and modularize complex queries by decomposing them into smaller parts.

  • Materialized views: Non-virtual query results that persist on disk for quick retrieval.

  • In-memory cached tables: Data structures that improve performance by storing table data in RAM for fast access.

  • Persisted views: Views that persist data on disk and automate its management in pipelines that constantly evolve and grow in volume.

#1 RDBMS temporary tables

Most relational database management systems (RDBMS) support creating and using temporary tables. The purpose of these tables is to store intermediary results that exist only within the scope of a session or a transaction. These tables are dropped automatically at the end of a session, meaning that the developer does not have to perform a manual cleanup. 

The most practical aspect of using temporary tables is that they improve session performance by holding result sets that can be reused when needed instead of being recalculated each time. Some examples of popular RDBMS implementations of temporary tables are as follows:

PostgreSQL

PostgreSQL temporary tables can be created using the CREATE TEMP TABLE (shorthand notation) or CREATE TEMPORARY TABLE keywords. These exist within a session and are automatically dropped afterward.

  CREATE TEMP TABLE People (ID SERIAL PRIMARY KEY, Name VARCHAR(50));
  

MySQL

MySQL temporary tables are created using the CREATE TEMPORARY TABLE keywords, and similarly to PostgreSQL, they only exist in the context of a session.

  CREATE TEMPORARY TABLE People (ID INT, Name VARCHAR(50));
  

SQL Server

SQL Server temporary tables are created a little differently: they use the same syntax as regular tables, but a single “#” added before the table name indicates a local temporary table.

  CREATE TABLE #People (ID INT, Name VARCHAR(50));
  

SQL Server also allows the creation of global temporary tables (that are accessible to all sessions) with a double “#” added before the table name:

  CREATE TABLE ##People (ID INT, Name VARCHAR(50));
  

Oracle

In Oracle, temporary tables are defined as “Global Temporary Tables.” The table definition persists in the database and is shared across sessions, while the table data exists only within the scope of a session. Different sessions can’t access each other’s temporary table data. Oracle temporary tables are created as shown below.

  CREATE GLOBAL TEMPORARY TABLE People (ID NUMBER, Name VARCHAR2(50)) ON COMMIT PRESERVE ROWS;
  

“ON COMMIT PRESERVE ROWS” preserves the data in the temporary table until the session ends. Changing it to “ON COMMIT DELETE ROWS” limits the data’s lifetime to a single transaction: rows are deleted automatically at each commit.
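For example, a transaction-scoped variant of the same table might be sketched as follows (the table name here is illustrative):

  CREATE GLOBAL TEMPORARY TABLE PeopleTxn (ID NUMBER, Name VARCHAR2(50))
  ON COMMIT DELETE ROWS;  -- rows are removed automatically at each commit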

#2 Common Table Expressions (CTEs) and non-materialized views

CTEs and non-materialized views are temporary, virtual result sets commonly used in SQL to simplify long queries. They produce results dynamically each time they are queried. Declarative SQL tends to become verbose and unreadable as queries grow, making CTEs and non-materialized views important for scalability and modularity.

Common Table Expressions

CTEs exist only within the scope of the query they are defined in and are discarded after execution. They are defined using the WITH keyword in SQL, for example:

  WITH RevenueCTE AS (
      SELECT ProductID, SUM(Amount) AS TotalRevenue
      FROM Sales
      GROUP BY ProductID
  )
  SELECT ProductID, TotalRevenue
  FROM RevenueCTE
  WHERE TotalRevenue > 100;
  

In the example above, the CTE queries all products with a combined revenue above 100, making the script far more concise than nested direct SQL queries. Typically, CTEs are not stored in memory, though the query optimizer may decide (or be instructed) to materialize them.
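In PostgreSQL 12 and later, for instance, you can request materialization explicitly with the MATERIALIZED keyword; a minimal sketch based on the example above:

  WITH RevenueCTE AS MATERIALIZED (
      SELECT ProductID, SUM(Amount) AS TotalRevenue
      FROM Sales
      GROUP BY ProductID
  )
  SELECT ProductID, TotalRevenue
  FROM RevenueCTE
  WHERE TotalRevenue > 100;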

Non-materialized views

Non-materialized views, or simply “views,” are database objects that do not physically store results. However, unlike CTEs, views can be reused across multiple queries and exist within the database schema instead of a single query. They act as saved queries or virtual tables. The previous example would look like this as a view:

  CREATE VIEW RevenueView AS
      SELECT ProductID, SUM(Amount) AS TotalRevenue
      FROM Sales
      GROUP BY ProductID;

  -- Querying the non-materialized view
  SELECT ProductID, TotalRevenue
  FROM RevenueView
  WHERE TotalRevenue > 100;
  

Here, RevenueView can be reused for different total revenue filters, making reusability one of its key strengths.
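For instance, the same view can serve a different threshold without redefining any aggregation logic (the value 500 below is an arbitrary illustration):

  SELECT ProductID, TotalRevenue
  FROM RevenueView
  WHERE TotalRevenue > 500;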

CTEs and non-materialized views vs. temporary tables

While temporary tables sound similar in syntax and functionality to CTEs and views, they persist differently. Temporary tables are physically stored for a session, making them quicker to access than virtual CTEs and views. As mentioned, CTEs can also be persisted, but only for the duration of a single SQL statement, while temporary tables persist for the duration of the session. However, this comes with the trade-off of storage overhead. For example, the revenue example could be represented as a temporary table:

  CREATE TEMPORARY TABLE TotalRevenueTemp AS
      SELECT ProductID, SUM(Amount) AS TotalRevenue
      FROM Sales
      GROUP BY ProductID;

  -- Querying the temporary table
  SELECT ProductID, TotalRevenue
  FROM TotalRevenueTemp
  WHERE TotalRevenue > 100;
  

You must consider performance, storage, and scalability factors when deciding which structure to use. Virtual CTEs and views are handy when reusing straightforward logic but can have performance implications if they are complex and meant to handle large data volumes. In such cases, temporary tables could be more practical as execution times are lower.
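One more point in temporary tables’ favor: because they are physical objects, most RDBMSs let you index them, as noted earlier. A PostgreSQL-style sketch, with an illustrative index name:

  CREATE INDEX idx_total_revenue ON TotalRevenueTemp (TotalRevenue);
  -- Subsequent filters on TotalRevenue can now use this index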

#3 Materialized views

Materialized views are SQL database objects that physically store result sets on disk. Unlike non-materialized views, materialized views do not actively calculate their result every time they are called, significantly improving performance.

Materialized views also come with a refresh mechanism since data in the underlying tables changes frequently. Refreshing can be manual or automatic, and it can be full or incremental, with incremental refreshes being less performance-intensive. Materialized views are best suited for data warehouse solutions like Snowflake and for critical systems that handle large data volumes and can’t afford to keep executing queries on the fly.

An example of creating a materialized view with Snowflake and SQL:

  CREATE MATERIALIZED VIEW RevenueMaterializedView AS
      SELECT ProductID, SUM(Amount) AS TotalRevenue
      FROM Sales
      GROUP BY ProductID;

  -- Querying the materialized view
  SELECT ProductID, TotalRevenue
  FROM RevenueMaterializedView
  WHERE TotalRevenue > 100;
  

Materialized views in Snowflake are quite efficient as they include automatic refreshing when the underlying data changes. There is no need to implement automatic or manual refresh functions, ensuring data integrity and reducing maintenance. 

However, in more traditional SQL systems like PostgreSQL, you must manually refresh your materialized views.

  CREATE MATERIALIZED VIEW RevenueMaterializedView AS
      SELECT ProductID, SUM(Amount) AS TotalRevenue
      FROM Sales
      GROUP BY ProductID;

  -- To refresh the materialized view
  REFRESH MATERIALIZED VIEW RevenueMaterializedView;
  

The REFRESH statement above can be called as necessary when table data changes.
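PostgreSQL can also refresh a materialized view without blocking concurrent reads, provided the view has at least one unique index. A sketch, with an illustrative index name:

  CREATE UNIQUE INDEX idx_revenue_mv ON RevenueMaterializedView (ProductID);
  REFRESH MATERIALIZED VIEW CONCURRENTLY RevenueMaterializedView;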

The close coupling of materialized views with their associated tables compromises modularity. Using a framework like DataForge, which takes a functional programming approach, improves code reusability.

source_name: "Sales"
raw_attributes:
  - ProductID int
  - Amount int

rules:
  - attribute_name: "Revenue"
    expression: "SUM([This].Amount) GROUP BY [This].ProductID"
  

DataForge YAML code

DataForge lets you introduce “rules” attached to a source table, eliminating the need to create views heavily coupled with the table’s source code. The rule doesn’t directly reference the underlying table, promoting reusability among other sources.

Materialized views vs. temporary tables

While materialized views and temporary tables are both objects stored on disk, materialized views offer significant performance advantages:

  • Materialized views can be refreshed when underlying data is modified. Temporary tables have no refreshing mechanism and must be populated manually.

  • Materialized views persist until explicitly dropped. Temporary tables only exist for the duration of a session, limiting reusability.

Materialized views benefit from the ability to remain pre-computed and be refreshed incrementally for cases involving complex aggregations and joins on large amounts of data.

#4 In-memory cached tables

In-memory cached data tables are objects stored in RAM (random access memory). Since RAM access is considerably faster than accessing disk storage, these cached tables provide significant performance benefits. They are suitable for critical processes where speed is an essential factor. 

For example, Online Transaction Processing (OLTP) systems commonly use in-memory cached tables due to their quick-moving and high-volume nature. In these systems, frequently accessed data is marked as in-memory and stored in RAM. This data typically includes product availability, user accounts, payment confirmations, etc. Once stored in RAM, systems have low latency data access to this data, ensuring web applications are responsive and up-to-date.
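As a concrete illustration, SQL Server’s In-Memory OLTP feature lets you declare a memory-optimized table. The sketch below assumes the database already has a memory-optimized filegroup configured; the table name and columns are illustrative:

  CREATE TABLE dbo.PaymentConfirmations (
      ID INT NOT NULL PRIMARY KEY NONCLUSTERED,
      Status VARCHAR(20)
  ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
  -- SCHEMA_AND_DATA keeps a durable copy on disk while reads are served from RAM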

The two most significant considerations to keep in mind when using in-memory cached tables are:

  • Memory size: Caching tables larger than the available RAM can cause out-of-memory errors. There should be a justifiable reason for storing data as in-memory cached tables.

  • Stale data: If caches are not invalidated in a timely manner, systems could act on stale data and produce incorrect results. An optimized cache invalidation strategy should be developed when dealing with in-memory cached tables.

In data processing systems like Spark, you can achieve in-memory table caching by using DataFrames, built-in data structures that organize data into columns. Spark also provides functions to cache data, making in-memory table storage rather convenient. 

The revenue example from earlier implemented in Spark and Python:

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  # Create a Spark session
  spark = SparkSession.builder \
      .appName("InMemoryTableCaching") \
      .getOrCreate()

  # Load the Sales data from a CSV file into a DataFrame
  sales_df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

  # Perform the aggregation to compute TotalRevenue
  revenue_df = sales_df.groupBy("ProductID") \
      .agg(F.sum("Amount").alias("TotalRevenue"))

  # Cache the resulting DataFrame in memory
  revenue_df.cache()

  # Trigger an action that computes and caches the DataFrame
  revenue_df.count()

  # Query the cached DataFrame for TotalRevenue > 100
  result_df = revenue_df.filter(revenue_df.TotalRevenue > 100)

  # Show the results
  result_df.show()

  # Stop the Spark session
  spark.stop()
  

A DataFrame is created to store the initial data, and another DataFrame stores the aggregated revenue. The call revenue_df.cache() then marks this DataFrame for caching in RAM, making it easily accessible. The action revenue_df.count() forces the cached DataFrame to be computed before it can be queried because transformations in Spark (such as .agg()) are lazily evaluated to improve performance.

Spark also lets you decide how to store your DataFrame. For example, instead of revenue_df.cache(), you could use one of the following (see the sketch after this list):

  • df.persist(StorageLevel.MEMORY_ONLY) to store the DataFrame in RAM only. Partitions that do not fit in memory are not cached and are recomputed on the fly when needed.

  • df.persist(StorageLevel.MEMORY_AND_DISK) to store as much data as possible in RAM and spill the rest onto the disk.

  • df.persist(StorageLevel.DISK_ONLY) to store the entire DataFrame on disk.
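A minimal sketch of the persist API, reusing the revenue_df DataFrame from the earlier example:

  from pyspark import StorageLevel

  # Keep as much of the DataFrame in RAM as possible and spill the rest to disk
  revenue_df.persist(StorageLevel.MEMORY_AND_DISK)
  revenue_df.count()  # an action that materializes the persisted data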

These fine-tuning capabilities provided by Spark make it a valuable choice for data engineers looking to optimize in-memory caching.

Ultimately, in-memory cached tables are performance-oriented tools for optimizing data processing systems. They build on temporary tables, storing frequently accessed intermediary results while maintaining quick access times. You can choose how to involve these cached tables to balance storage trade-offs with retrieval speeds.

#5 Persisted views

Persisted views are a modern, abstract adaptation of standard database views. They provide the functionality of traditional views, along with additional features like data management, performance improvement, and pipeline management. Like materialized views, persisted views are non-virtual and store data on disk. However, persisted views are more focused on real-time pipeline management.

“Delta Live Tables,” a framework developed by Databricks, is a popular tool for improving materialized view management in Apache Spark. Delta Live Tables focuses on creating and managing views declaratively: you only specify what each view should contain, without worrying about the execution logic. This is done by annotating the associated functions with @dlt.table.

Delta Live Tables’ declarative approach significantly simplifies complex logic, especially in ETL processes that tend to become verbose. Below is an example of Delta Live Tables with Python and Spark.

  import dlt
  from pyspark.sql import functions as F

  @dlt.table
  def raw_sales_data():
      # Load raw sales data in Delta format
      return spark.read.format("delta").load("/mnt/sales/")

  @dlt.table
  def filtered_sales_data():
      # Filter the raw sales data
      return (
          dlt.read("raw_sales_data")
          .filter(F.col("amount") > 0)
          .select("product_id", "amount")
      )

  @dlt.table
  def total_sales_by_product():
      # Aggregate total sales by product
      return (
          dlt.read("filtered_sales_data")
          .groupBy("product_id")
          .agg(F.sum("amount").alias("total_sales"))
      )

  @dlt.table
  def high_sales_products():
      # Create a view for high-value products
      return (
          dlt.read("total_sales_by_product")
          .filter(F.col("total_sales") > 1000)
      )
  

Like the previous examples, these views pre-compute the results needed for querying. Delta Live Tables declaratively loads, filters, aggregates, and customizes the necessary datasets, freeing you from the underlying execution logic. The most significant advantage of using Delta Live Tables is the ability to monitor and manage pipelines in the Databricks user interface once they have been defined. You can check whether each view performs its job accurately and efficiently. Common monitoring capabilities include:

  • Table/view status

  • Execution times

  • Data quality checks

  • Pipeline visualization

  • Version control

Delta Live Tables UI in Databricks

Last thoughts

Temporary tables in SQL are handy tools that help modularize and optimize scripts. They provide intermediary result sets, reducing the need for multiple identical calculations. Modern advancements based on temporary tables are widely used in the data engineering landscape today. Tools like views, in-memory cached tables, and persisted views further improve on the efficiency of temporary tables. Together with frameworks like DataForge, these tools can be used to modularize SQL queries and create a separation of concerns between data tables and queries.
