Advanced SQL Concepts
SQL Performance Tuning Best Practices
Multi-chapter guide | Chapter 4
In modern data-driven environments, efficient database performance is crucial to the smooth operation of applications, and the performance of your SQL queries directly shapes application efficiency. SQL performance tuning ensures that queries execute efficiently, use appropriate resources, and minimize overhead.
SQL performance tuning is a complex task requiring careful consideration of various factors such as query execution plans, indexing strategies, and memory management. Writing efficient SQL is much harder than writing functionally correct SQL.
This article outlines ten best practices you can follow to ensure your SQL queries are efficient and perform optimally.
Summary of SQL performance tuning best practices
| Best practice | Description |
|---|---|
| Utilize query execution plan | Use query execution plans (EXPLAIN) to identify inefficiencies and reduce costly operations like full table scans. |
| Tune key memory parameters | Adjust parameters like work_mem, temp_buffers, and shared_buffers for optimal memory usage. |
| Optimize query statements | To reduce memory and I/O load, avoid SELECT * and fetch only necessary columns. Simplify complex joins and reduce subqueries when possible. |
| Efficient indexing & sargability (Search ARGument ABLE) | Ensure proper indexing on frequently queried columns, and write queries that use indexes efficiently by avoiding non-sargable conditions. |
| Monitor logical/physical reads & IOPS | Monitor logical/physical reads and IOPS to detect inefficient disk operations. Enable query caching and optimize query performance and memory configurations to minimize logical and physical reads. |
| Avoid parameter sniffing | Avoid reusing execution plans built for specific parameter values, as they might be inefficient for different values. Use query hints or recompilation strategies. |
| Manage transactions | Use transactions judiciously to prevent locking and contention. |
| Partition large tables | Partition large datasets for efficient querying and maintenance. |
| Regularly update statistics | Keep database statistics updated for accurate query optimization. |
| Adopt a declarative approach and utilize platform-specific features | Adopt declarative data transformation approaches to avoid complex, hard-to-diagnose SQL and stored procedures. Be aware of platform-specific features and use them to avoid unnecessary reoptimization effort. |
These SQL performance tuning best practices cover fundamental aspects such as query optimization, indexing strategies, and memory management. However, more specialized solutions may be necessary based on your specific application and SQL platform.
For example, transactional systems mainly deal with "reading a few rows and updating a few rows," while data engineering pipelines involve "reading many rows and writing or updating many rows." While indexing is beneficial in transactional systems, it can be detrimental in data engineering contexts because indexes must be updated or rebuilt with each pipeline execution, leading to increased processing time. Additionally, write operations in modern data platforms are much more expensive than reads, so optimization efforts should focus on minimizing the number of writes.
Due to such differences, traditional OLTP best practices do not always apply, and specialized approaches are needed to optimize SQL performance for data engineering tasks. One way to achieve this is by adopting tools like DataForge. Unlike traditional data transformation processes that persist intermediate data stages, DataForge reduces unnecessary writes, speeds up subsequent reads, and improves overall performance.
Let’s discuss the best practices in light of this consideration.
#1. Utilize query execution plan
Every SQL statement has an execution plan. The cost-based optimizer creates the plan using collected DB statistics. The key to SQL performance tuning is understanding how the database retrieves data by analyzing this query execution plan.
The Explain Plan is a visualization of this execution plan. Most tools, including Toad® for Oracle, use a PLAN_TABLE to present the Explain Plan steps, and you can generate a graphical Explain Plan by clicking the Explain Plan button. Note that the PLAN_TABLE must physically exist in the database for these tools to work.
(Screenshot: Explain Plan in Toad® for Oracle, DBMS_XPLAN display)
You can view a detailed breakdown of a query's execution, including which indexes are used, how joins are performed, and whether scans are full or partial. By identifying operators that incur the highest costs, you can make adjustments to improve performance.
Also, look for any warnings or indexing suggestions to guide further optimization. Many database tools provide indexing suggestions (e.g., "Consider Index Suggestions") and warnings related to the execution plan; use them for effective optimization.
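For instance, in PostgreSQL you can prefix a statement with EXPLAIN to see the estimated plan, or EXPLAIN ANALYZE to execute it and see actual row counts and timings. A minimal sketch, assuming a hypothetical orders table:

```sql
-- Estimated plan only; the query is not executed
EXPLAIN
SELECT customer_id, SUM(total)
FROM orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id;

-- Executes the query and reports actual row counts and timings
EXPLAIN ANALYZE
SELECT customer_id, SUM(total)
FROM orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id;
```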
#2. Tune key memory parameters
The database server's configuration file contains memory parameters that control how queries utilize memory. Each RDBMS uses different names and settings for these memory management adjustments. For example, key parameters for PostgreSQL include work_mem, temp_buffers, shared_buffers, and effective_cache_size. When optimally configured, these settings improve query performance by reducing disk I/O and leveraging available RAM (a configuration sketch follows the parameter list):
work_mem
This parameter specifies the amount of memory allocated for complex operations such as sorting and hash joins within a query. Optimal work_mem is generally calculated as:
Total RAM * 0.25 / max_connections
The max_connections parameter specifies the maximum number of concurrent connections to the database server. Setting optimal values results in less disk swapping and much faster queries.
temp_buffers
This parameter allocates memory for temporary tables within a session. Increase temp_buffers when you expect to use temporary tables frequently within sessions.
shared_buffers
This parameter specifies the memory reserved for caching data blocks. It stores frequently accessed data and is typically set to around 25% of the system’s available RAM. For example, if your machine’s RAM is 32 GB, the recommended value for shared_buffers is 8 GB.
effective_cache_size
Set effective_cache_size to reflect the total memory available for disk caching, including shared_buffers and the operating system’s filesystem cache.
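As a rough sketch for a dedicated 32 GB PostgreSQL server (all values are illustrative assumptions to be tuned per workload):

```sql
ALTER SYSTEM SET shared_buffers = '8GB';         -- ~25% of RAM; requires a server restart
ALTER SYSTEM SET effective_cache_size = '24GB';  -- shared_buffers plus expected OS cache
ALTER SYSTEM SET work_mem = '64MB';              -- (32 GB * 0.25) / max_connections of 128
ALTER SYSTEM SET temp_buffers = '32MB';          -- per-session memory for temporary tables
SELECT pg_reload_conf();                         -- applies the reloadable settings
```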
#3. Optimize query statements
Optimize your query statements by ensuring the following.
Avoid SELECT *
Fetch only the necessary columns to reduce memory and I/O load.
Inefficient query:
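A minimal sketch, assuming a hypothetical employees table:

```sql
-- Reads every column, inflating memory use and I/O
SELECT *
FROM employees
WHERE department = 'Sales';
```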
Optimized query:
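The same query, trimmed to the columns the application actually needs:

```sql
-- Fetches only the required columns
SELECT employee_id, first_name, last_name
FROM employees
WHERE department = 'Sales';
```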
Avoid SELECT DISTINCT
Avoid using SELECT DISTINCT to remove duplicates, as it is resource-intensive. You can often save computing power and still deliver distinct results by applying more specific filters or by grouping fields instead of using SELECT DISTINCT.
In cases where DISTINCT is unavoidable, add indexes on the columns used in DISTINCT clauses to enhance performance. Indexing allows the database to retrieve unique values more efficiently, reducing the load on memory and CPU resources.
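A sketch of the grouping alternative, reusing the hypothetical employees table:

```sql
-- Instead of deduplicating the result set...
SELECT DISTINCT department FROM employees;

-- ...group on the column; with an index on department, the optimizer
-- can often satisfy this without a costly sort
SELECT department FROM employees GROUP BY department;
```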
Use joins instead of correlated subqueries
Use joins instead of correlated subqueries, as correlated subqueries run once for each row processed, leading to slower performance.
If a subquery is essential, consider converting it into a join or a Common Table Expression (CTE) for better readability and efficiency.
Inefficient query—uses a subquery inside the SELECT clause, which executes separately for each row, resulting in multiple scans of the bonus_jan table.
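A sketch of the correlated form (the employee_id and bonus column names are assumptions):

```sql
-- The subquery re-executes for every row of bonus_jan
SELECT b.employee_id,
       b.position,
       (SELECT AVG(b2.bonus)
        FROM bonus_jan b2
        WHERE b2.position = b.position) AS avg_position_bonus
FROM bonus_jan b;
```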
Optimized query—with a CTE that calculates the average bonus for each position once and joins this result with the bonus_jan table.
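The CTE rewrite under the same assumptions:

```sql
-- Each position's average is computed once, then joined back
WITH avg_bonus AS (
    SELECT position, AVG(bonus) AS avg_position_bonus
    FROM bonus_jan
    GROUP BY position
)
SELECT b.employee_id, b.position, a.avg_position_bonus
FROM bonus_jan b
JOIN avg_bonus a ON a.position = b.position;
```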
Prefer inner joins over Cartesian joins
Use inner joins instead of Cartesian joins to avoid producing unnecessarily large result sets. If your query requires many tables, consider breaking it down into smaller, simpler queries and using temporary tables or common table expressions (CTEs) to store intermediate results.
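A sketch with hypothetical customers and orders tables:

```sql
-- Missing join predicate: every customer is paired with every order
SELECT c.customer_name, o.total
FROM customers c, orders o;

-- An inner join returns only the matching rows
SELECT c.customer_name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;
```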
Avoid unnecessary sorting
Use ORDER BY operations only when absolutely necessary, and ensure the data is already indexed in a way that supports efficient sorting.
Match query and column data types
Always ensure data types in queries match the column data types, or use explicit conversions when necessary. Mismatched types can lead to implicit conversions, which slow down query execution and prevent indexes from being used effectively.
Inefficient query—specifies an integer filter for the EmployeeNumber column, which has a varchar data type. This causes an implicit conversion on the "table side": the optimizer must convert every stored value of that column to an INT, the data type of the filter predicate, scanning every row in the process.
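A sketch of the mismatch (the Employees table name is an assumption):

```sql
-- EmployeeNumber is VARCHAR; the integer literal forces an implicit
-- conversion of every stored value and defeats any index on the column
SELECT *
FROM Employees
WHERE EmployeeNumber = 12345;
```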
Optimized query
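The same filter with a matching data type:

```sql
-- A string literal matches the column's type, so an index
-- on EmployeeNumber can be used for a seek
SELECT *
FROM Employees
WHERE EmployeeNumber = '12345';
```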
The impact of implicit conversions on performance and index usage also varies between SQL platforms. In SQL Server, implicit conversions can degrade performance, as shown in the example above. However, PostgreSQL often handles type mismatches more efficiently and can still use indexes due to better support for type coercion during query execution.
#4. Efficient indexing & sargability
While indexes can speed up query performance, creating unnecessary or poorly planned indexes leads to significant performance issues, especially in data engineering workloads. Here are some essential practices to consider for efficient indexing:
Create indexes on frequently queried columns
Focus on columns often used in WHERE clauses, DISTINCT clauses, JOIN conditions, ORDER BY clauses, or GROUP BY statements.
Avoid indexing small tables or columns with low cardinality
Indexes on small tables or columns with limited unique values can cause unnecessary overhead in maintaining these indexes during data operations.
Monitor and manage indexes regularly
Indexes can become fragmented over time, affecting performance. Check for fragmentation regularly and rebuild or reorganize indexes as needed. Analyze the query execution plan to verify that indexes are being used as expected. Remove unused or underutilized indexes to avoid overhead.
Write sargable queries to leverage indexes
Ensure your queries are sargable (Search ARGument ABLE), i.e., written in a way that allows the database engine to use an index effectively.
Create an index on the 'order_date' column to speed up date-range queries
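A sketch, assuming an orders table:

```sql
CREATE INDEX idx_orders_order_date ON orders (order_date);
```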
Inefficient query
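Applying a function to the indexed column is not sargable:

```sql
-- YEAR() hides order_date from the index, forcing a full scan
SELECT * FROM orders
WHERE YEAR(order_date) = 2023;
```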
Optimized query
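The sargable rewrite of the same filter:

```sql
-- A range predicate on the bare column can seek idx_orders_order_date
SELECT * FROM orders
WHERE order_date >= '2023-01-01'
  AND order_date <  '2024-01-01';
```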
Non-sargable conditions like LIKE '%value%', NOT IN, <>, and != can force full table scans. Rewrite queries to be sargable.
Inefficient query
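A sketch with a hypothetical customers table:

```sql
-- The leading wildcard prevents an index seek on customer_name
SELECT * FROM customers
WHERE customer_name LIKE '%smith%';
```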
Optimized query
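If a prefix match satisfies the requirement:

```sql
-- A prefix search is sargable when customer_name is indexed
SELECT * FROM customers
WHERE customer_name LIKE 'smith%';
```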
Inefficient query
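Similarly, negative predicates tend to force scans:

```sql
-- <> cannot seek; every row must be examined
SELECT * FROM orders
WHERE status <> 'cancelled';
```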
Optimized query
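Enumerating the values you do want keeps the predicate sargable (the status values are illustrative):

```sql
-- A positive IN list can be resolved with index seeks
SELECT * FROM orders
WHERE status IN ('open', 'shipped', 'delivered');
```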
Using IN or UNION instead of OR for indexed columns can also help the database leverage indexes more effectively.
Inefficient query
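A sketch reusing the employees table:

```sql
-- OR across values may prevent a single efficient index operation
SELECT * FROM employees
WHERE department = 'Sales' OR department = 'Marketing';
```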
Optimized queries
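Two equivalent rewrites (UNION ALL is safe here because the two predicates cannot match the same row):

```sql
-- IN is typically rewritten by the optimizer into index seeks
SELECT * FROM employees
WHERE department IN ('Sales', 'Marketing');

-- A union of two seekable predicates is another option
SELECT * FROM employees WHERE department = 'Sales'
UNION ALL
SELECT * FROM employees WHERE department = 'Marketing';
```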
Minimize bookmark lookups
A bookmark lookup occurs when an index does not contain all the columns a query needs, requiring additional reads. Covering indexes that include all the columns referenced in a query can reduce these lookups. Many SQL platforms also offer the INCLUDE option in index creation, allowing you to add non-key columns directly into an index. Alternatively, consider rewriting the query to reduce additional column lookups.
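A SQL Server-style sketch of a covering index (the index and column names are illustrative):

```sql
-- customer_id is the key; order_date and total ride along so the
-- query below needs no bookmark lookup
CREATE INDEX idx_orders_customer
ON orders (customer_id)
INCLUDE (order_date, total);

SELECT order_date, total
FROM orders
WHERE customer_id = 42;
```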
Row vs. columnar storage
Use columnar storage for modern data engineering tasks, whether as a file format (e.g., Parquet) or through columnar platforms (e.g., Snowflake, BigQuery). It improves compression, speeds up aggregate calculations, and allows efficient data retrieval and parallel processing, making it ideal for processing large datasets quickly and for the unknown or unpredictable workloads typical of analytics and ad hoc reporting. Columnar storage also prevents over-indexing with regular indexes by providing the equivalent of indexed access on all columns of the dataset.
Adhering to these indexing and sargability best practices ensures your queries run more efficiently. However, always review the specific best practices for your target RDBMS platform, as different platforms might have unique behaviors or optimizations for index usage.
#5. Monitor logical/physical reads & IOPS
Monitoring logical and physical reads and Input/Output Operations Per Second (IOPS) is essential for identifying inefficient disk usage. High IOPS values often point to excessive physical reads, suggesting the need for query optimization or improved memory configurations. Effective management of IOPS can lead to better performance and reduced strain on system resources.
Here are some strategies to consider:
Optimize query performance
Reduce physical reads by streamlining query execution, as explained previously. Efficient queries lead to fewer physical reads and lower IOPS.
Enhance memory configurations
Proper memory settings (e.g., increasing buffer cache size) help retain more data in memory, reducing physical I/O operations.
Enable query caching
Caching the results of frequently executed queries in memory reduces the need for repeated logical and physical reads, leading to faster response times and reduced IOPS.
Use materialized views
Consider materialized views for complex or frequently accessed queries. These pre-computed tables store query results, minimizing the need for repeated reads and reducing the workload on the database.
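A minimal PostgreSQL sketch, assuming the orders table from earlier examples:

```sql
-- Precompute an expensive monthly aggregate once
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT date_trunc('month', order_date) AS month,
       SUM(total) AS total_sales
FROM orders
GROUP BY 1;

-- Queries now read the small precomputed table; refresh on a schedule
REFRESH MATERIALIZED VIEW monthly_sales;
```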
#6. Avoid parameter sniffing
Parameter sniffing occurs when a database reuses an execution plan initially built for specific parameter values, which may not be efficient for other values. This can lead to performance degradation when the original plan does not fit the new parameter inputs. To mitigate this, you can disable parameter sniffing or force the database to recompile the query. The issue is most often associated with SQL Server, but other platforms may exhibit similar behavior. For example, you can create SQL Server stored procedures using the WITH RECOMPILE option.
Use the SQL Server hints OPTION (RECOMPILE) or OPTION (OPTIMIZE FOR), for example:
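A sketch (the @customer_id parameter and orders table are illustrative, declared elsewhere in the batch or procedure):

```sql
-- Recompile each run so the plan matches the current value
SELECT * FROM orders WHERE customer_id = @customer_id
OPTION (RECOMPILE);

-- Or build the plan for the statistical "average" case rather than
-- the first value the optimizer sniffed
SELECT * FROM orders WHERE customer_id = @customer_id
OPTION (OPTIMIZE FOR (@customer_id UNKNOWN));
```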
You can also:
Use dummy variables in SQL Server stored procedures
Disable parameter sniffing at the SQL Server instance level
Disable parameter sniffing for a specific SQL Server query
#7. Manage transactions carefully
Transactions are essential for maintaining data integrity but can introduce performance bottlenecks if not appropriately handled. Long-running transactions lead to locking and contention, affecting the overall system performance.
To avoid this:
Structure transactions to complete quickly, minimizing the duration of locks and reducing system contention.
Perform bulk inserts, updates, or other resource-intensive operations during off-peak hours to alleviate the load on the database during high-traffic periods.
Adjust isolation levels to balance performance and data consistency; lower isolation levels may improve performance for read-heavy workloads (see the sketch below).
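A minimal PostgreSQL sketch of a short transaction with an explicit isolation level (the accounts table is illustrative):

```sql
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Keep the work inside the transaction minimal so locks are held briefly
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;
```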
#8. Partition large tables
Partitioning large tables reduces the amount of data scanned during queries and enables better parallel processing. It helps SQL operations run faster, especially on large datasets.
It is critical to select an appropriate partition key. Ensure the key distributes data evenly across partitions and aligns with the most frequent read and write patterns to maximize performance.
Too many partitions can cause unnecessary overhead—plan partitions based on data size and query requirements.
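As a sketch, PostgreSQL's declarative range partitioning on a date key looks like this (the events table is illustrative):

```sql
CREATE TABLE events (
    event_id   bigint,
    event_date date NOT NULL,
    payload    text
) PARTITION BY RANGE (event_date);

-- One partition per year; queries filtered on event_date scan
-- only the relevant partitions
CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```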
#9. Regularly update statistics
Database optimizers use up-to-date internal statistics to determine the best query execution plans. The optimizer may make poor decisions if these statistics are outdated, resulting in inefficient query execution.
Use database commands like ANALYZE in PostgreSQL or UPDATE STATISTICS in SQL Server to keep statistics current. Regular updates help the optimizer understand data distribution and select the most efficient plans.
For example, in SQL Server you can run CREATE STATISTICS for the Name and ProductNumber columns in the Product table and then execute UPDATE STATISTICS to refresh the table's statistics, as sketched below.
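A sketch (the statistics object name is illustrative):

```sql
-- SQL Server: create a multi-column statistics object
CREATE STATISTICS stats_name_number
ON Product (Name, ProductNumber);

-- Later, refresh all statistics on the table
UPDATE STATISTICS Product;
```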
Features such as SET STATISTICS TIME and SET STATISTICS IO (in SQL Server) can provide insights into query performance, revealing areas that may benefit from tuning or optimization, for example:
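```sql
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- Run the statement under investigation; CPU time, elapsed time, and
-- logical/physical read counts appear in the Messages output
SELECT COUNT(*) FROM Product;
```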
#10. Adopt a declarative approach and utilize platform-specific features
Modern data platforms streamline optimization by automating traditional best practices, allowing data engineers to focus on strategic design. For instance, platforms like Snowflake utilize automatic micro-partitioning to manage data clustering without manual intervention, and Databricks offers features such as ingestion time clustering and adaptive query execution. Such techniques streamline query performance, reducing the need for manual partitioning, indexing, and other time-consuming tasks.
However, it is essential to be aware of these platform-specific features and avoid re-optimizing aspects the platform already manages efficiently. For example, manual partitioning on Databricks can often be less effective than the platform's automatic clustering features, which dynamically manage data distribution and query performance.
In this context, adopting a declarative approach to data transformation can simplify SQL development and enhance overall project maintainability. Declarative configurations emphasize specifying what to do rather than how to do it, and they tend to be easier to debug, reuse, and optimize. Unfortunately, SQL statements are only declarative in isolation: a multi-statement SQL script dictates the exact order of persistence and tends to become monolithic.
DataForge is an open-source, next-generation data transformation tool that improves upon SQL’s declarative approach. Running on the Databricks platform, it leverages advanced optimizations to enforce standardized storage configurations, such as:
Medallion architecture
Optimal file sizing
Efficient table structures
Clustering
Partitioning
DataForge creates a separation of concerns between table structure and attribute operations, allowing each transformation to be truly modular and declarative. Below is an example of DataForge Core's ability to decompose a 'Products' table into modular functions:
source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
  - product_id int
  - product_name string
  - average_cost decimal
  - average_sell_price decimal
rules:
  - name: "Round Average Cost"
    expression: "ROUND([This].average_cost, 2)"
  - name: "Truncate Product Name"
    expression: "CASE
                   WHEN LENGTH([This].product_name) > 30
                   THEN CONCAT(LEFT([This].product_name, 30), '...')
                   ELSE [This].product_name
                 END"
Last thoughts
The majority of database performance problems stem from poorly optimized SQL queries. SQL performance tuning should not be seen as a one-time fix but as an integral, ongoing task. By consistently applying best practices, you can ensure that your SQL queries are efficient, leading to better application performance, reduced resource consumption, and lower operational costs. Platforms like DataForge can further simplify the process, helping you manage data engineering workloads with ease and efficiency.