Executive summary
In large data warehouse systems, it is critical to optimize query workloads to maximize system utilization and minimize processing times. This paper describes the optimization of data warehouse system query workloads. The scenarios focus on IBM Smart Analytics Systems, where DB2® software manages multiple database partitions in a cluster. The scenarios describe optimization methods that can help improve performance in a short time. The guidelines in this document might not apply to transactional applications.
The two main tasks for query optimization are as follows:
1. Provide the Db2 optimizer with rich statistical information. The Db2 optimizer uses statistical information to perform its analysis and find a good execution plan. This paper describes best practices for gathering statistics.
2. Determine the best query execution plan. If a query needs further tuning, analyze its explain plan. This paper provides a list of patterns to look for in explain plans. Each pattern is paired with recommended actions.
It is assumed that the reader of this document has a basic understanding of the design and administration of data warehouse databases with Db2 products.
Introduction
The workloads on data warehouse systems can range from complex analytical queries processing terabytes of data to many short tactical queries requiring sub-second response time. Updates to large volumes of data might have to be processed in parallel. Queries can be ad hoc, that is, the access pattern is not known in advance and might vary significantly from one query to the next. This paper examines some characteristics of data warehouse workloads in more detail and describes how the IBM Smart Analytics System design is optimized for these workloads.
The DB2 query compiler and its optimizer are a key component for executing each SQL statement efficiently. The best practices in this document help you get optimal execution plans.
This paper first outlines a summary of guidelines for defining the physical data model for the database and partitioning the data. However, the main focus of this document is on the collection of statistics and the analysis of query plans.
The DB2 compiler uses a cost-based optimizer. Cost estimates rely on statistics about the data. Therefore it is important to gather rich statistics. This paper provides guidelines for collecting statistics with the RUNSTATS command. These guidelines are complemented by recommendations for defining foreign keys and functional dependency constraints that help the optimizer better understand the data model.
If a query or data modification statement runs slower than expected, the execution plan that the compiler creates can help identify the operations that need further optimization. This paper describes how to extract and analyze execution plans that contain important information. This paper also provides a series of critical patterns that you might see in an execution plan. Each pattern has an associated action that helps in improving response times.
Optimization considerations for data warehouse systems
Use the computing power of database partitions
Transactional workloads typically comprise many short reads and updates, mostly with sub-second response times. Although the database might be large, a single transaction usually processes only a small amount of data. Data warehousing query workloads consist of queries that often process and aggregate large amounts of data. Queries tend to be more complex; for example, they can contain more joins. To use the computing power of the database partitions, follow these guidelines:
- Distribute data across database partitions to use parallelism for tasks that need a significant amount of resources.
- Collocate data that is joined together, that is, store it in the same database partition. Transmitting data between partitions can be expensive in terms of network and other resources.
Determine how to partition the database
In this section, partitioning refers to database partitioning, that is, distributing data across the database partitions using hashing. Practical advice is to keep small tables on a single database partition, for example, on the administration node of the IBM Smart Analytics System. The bigger tables should be partitioned across the data nodes.
Recommendation: Use the following guidelines as a starting point:
- Partition tables with more than 10,000,000 rows across the data nodes.
- Keep tables with 10,000 - 10,000,000 rows on the administration node, without partitioning them. If these tables are frequently used, replicate them across the partitions on the data nodes, for example, with a replicated MQT as sketched after this list.
- Keep tables with up to 10,000 rows on the administration node, without partitioning them. There is no need to replicate these tables because they can be sent quickly across the network at query run time.
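One way to replicate a frequently used small table is a replicated MQT, as in the following sketch. The schema, table, and table space names are placeholder assumptions, and the sketch assumes that the table space spans the data partitions:
-- Replicate a frequently joined dimension table across the data partitions
CREATE TABLE THESCHEMA.PRODUCT_DIM_R AS
( SELECT * FROM THESCHEMA.PRODUCT_DIM )
DATA INITIALLY DEFERRED REFRESH DEFERRED
DISTRIBUTE BY REPLICATION IN THE_DATA_TABLESPACE;
REFRESH TABLE THESCHEMA.PRODUCT_DIM_R;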
Use collocation as a major driver for choosing distribution keys
What is a good distribution key for partitioning? A distribution key should have the following properties:
- It should distribute data and workload evenly across all database partitions.
- It should collocate rows that are joined together.
The first property addresses the fact that the run time of a query depends on the slowest partition. This partition is usually the one that has to process more data than others. Even distribution of data ensures that no partition contains much more data than others.
The second property is important because DB2 partitioning uses a shared-nothing architecture to scale out. If possible, collocate joined data on the same partition.
Otherwise, a large amount of data might have to be sent across the network, and that might slow down the query and limit scalability. This property is true for virtually any database system that uses a shared-nothing architecture for scalability. Collocation is less critical for smaller databases where all partitions can be kept on the same server and where database communication does not need a network.
In larger data warehouse systems, collocating data from different tables can be more important for performance than finding the optimal even distribution of data within tables. To balance the requirements for even distribution and collocation, follow this procedure for defining distribution keys:
- Determine the biggest tables.
- Check the data model or queries and locate equijoins between the large tables. In a simple star schema, an equijoin might be the join between the fact table and a dimension table.
- Check whether some of the joined columns can be used for distribution, that is, whether they distribute data well. Keep the number of columns as small as possible. Using a single column is best in most cases. A primary key must include the distribution key, but there is no need to use all primary key columns for distribution.
- Optional: Distribute smaller tables if doing so enables collocation. For example, distribute a STORE table by using its primary key if you distribute fact tables by using the STORE key.
- Optional: Add redundant columns to fact tables. For example, add a STORE column to a table that contains itemized sales transactions even if the market basket identifier alone might be a primary key. Adding redundant columns would enable a collocated join to other tables that are distributed by using the STORE column.
- For remaining tables, choose a distribution key that minimizes the skew in the distribution of data. Primary keys or columns with many different values are good candidates. A column should not have many rows containing NULL or some other default value.
It is usually not possible to achieve collocation in all cases. In many cases, an inability to collocate data is not a problem. Focus on the biggest tables and the joins that are used most often or include many rows. For these tables, it is acceptable if collocation increases the distribution skew by 5 - 10%.
A collocated join requires tables to be in the same partitioning group, even if the tables are in different table spaces.
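For example, the following sketch places two table spaces in the same database partition group so that tables created in them can be collocated. The partition group name, table space names, and partition numbers are placeholder assumptions, not part of any specific IBM Smart Analytics System configuration:
-- Table spaces that share a database partition group
CREATE DATABASE PARTITION GROUP PDPG_DATA ON DBPARTITIONNUMS (1 TO 8);
CREATE TABLESPACE TS_FACT IN DATABASE PARTITION GROUP PDPG_DATA;
CREATE TABLESPACE TS_DIM IN DATABASE PARTITION GROUP PDPG_DATA;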
In the special case of a multi-dimension model such as a star schema with one fact table and multiple dimensions, the procedure for defining distribution keys is simple:
- Partition the fact table and the biggest dimension by specifying the dimension key as the distribution key, as shown in the sketch after this list. Choose another large dimension if the distribution is skewed.
- Replicate the other dimension tables.
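The following sketch illustrates this approach for a hypothetical star schema; all schema, table, and column names are placeholder assumptions:
-- Distribute the fact table and the biggest dimension by the same key
CREATE TABLE THESCHEMA.STORE_DIM (
STORE_ID INTEGER NOT NULL,
STORE_NAME VARCHAR(60),
REGION VARCHAR(20) )
DISTRIBUTE BY HASH ( STORE_ID );
CREATE TABLE THESCHEMA.SALES_FACT (
STORE_ID INTEGER NOT NULL,
PROD_ID INTEGER NOT NULL,
SALE_DATE DATE NOT NULL,
AMOUNT DECIMAL(15,2) )
DISTRIBUTE BY HASH ( STORE_ID );
-- Joins between SALES_FACT and STORE_DIM on STORE_ID are collocated;
-- the remaining dimensions can be replicated as described previously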
The previous procedures require tests to determine whether data is distributed evenly. There are two simple methods for checking this:
- Count the number of rows per partition, as shown in the following example:
-- Actual number of rows per partition
SELECT DBPARTITIONNUM(KEY), COUNT_BIG(*)
FROM THETABLE TABLESAMPLE SYSTEM(10)
GROUP BY ROLLUP(DBPARTITIONNUM(KEY))
ORDER BY 2;
- Check the space allocation by using the ADMINTABINFO view, as shown in the following example:
-- Space allocation per partition
SELECT DBPARTITIONNUM, SUM(DATA_OBJECT_L_SIZE) SIZE_KB
FROM SYSIBMADM.ADMINTABINFO
WHERE (TABSCHEMA, TABNAME) = ('THESCHEMA', 'THETABLE')
GROUP BY ROLLUP(DBPARTITIONNUM)
ORDER BY 2;
You can use the information that you collect to determine data distribution, as follows:
- The size of partitions, which is based on space allocation, is relevant because DB2 software fetches full pages or extents rather than single rows. The allocated space is an indicator of the I/O load on the partition, and the row count is an indicator of the CPU load.
- Compare the size of the biggest partition with the average size of all partitions. A table is balanced if the size of each partition is close to the average size. The largest partition, however, limits the performance. For example, if a partition is two times larger than the average, a table scan might take twice the time of a table scan on the same data in a perfectly balanced table. A partition with much less data than average does not cause performance problems.
- Partitions using 10 - 20% more space than average and partitions with 10 - 20% more rows than average are normal. The table samples should not be too small because the differences in row counts might become less reliable. For example, with a 5% sample, some partitions might report 10% more rows than other partitions even if the table is balanced.
You can test the distribution of data with a new partitioning key by using a procedure such as the following one:
-- Create a table with the new distribution key
CREATE TABLE TEMP.THETABLE_TESTKEY AS
( SELECT NEWDISTKEY FROM THETABLE TABLESAMPLE SYSTEM(10) )
DATA INITIALLY DEFERRED REFRESH DEFERRED
DISTRIBUTE BY HASH ( NEWDISTKEY ) IN THETABLESPACE;
COMMIT;
UPDATE COMMAND OPTIONS USING C OFF;
ALTER TABLE TEMP.THETABLE_TESTKEY ACTIVATE NOT LOGGED INITIALLY;
REFRESH TABLE TEMP.THETABLE_TESTKEY;
-- Then check row counts per partition
When the number of rows per partition is about the same, the amount of data that queries fetch per partition might still vary because the selectivity of filter conditions might vary by partition. You can check the number of rows that are processed per partition by creating query plans with section actuals. For details, see the "Explain plans with actual cardinalities" section.
Make the database queries faster by avoiding indexes
SQL statements in short queries often use filter conditions that are selective, that is, they return few rows. For such queries, the cost of full table scans can be prohibitive. Regular row-based indexes address this workload well.
In data warehouse environments, however, queries tend to process larger portions of tables. Regular indexes are useful in data warehouse databases also, but there are important caveats:
- Index maintenance can take a considerable amount of time in large tables.
- Indexes encourage the query compiler to use nested loop joins with index lookups. Without the index, the optimizer might choose a hash join that uses a table scan, which can be faster when many lookups are needed per query.
If queries filter on a certain column, run the queries without any changes to see whether the queries are fast enough. If they are not fast enough, follow these steps:
1. Consider using table partitioning, as shown in the sketch after this list. If table partitioning is applicable, create no more than 200 partitions per table.
2. Consider using multidimensional clustering (MDC). Do not make the dimensions too fine grained, because this can waste space in MDC blocks that are not full. If more than 20% of space is unused, make the dimensions more coarse grained by using generated columns.
3. If the previous two techniques are not appropriate or do not sufficiently improve query performance, create a regular index.
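The following sketch combines table partitioning and MDC for a hypothetical sales table; the schema, table, and column names and the date range are placeholder assumptions:
-- Range-partition by date and cluster by a coarser, generated month column
CREATE TABLE THESCHEMA.SALES_HISTORY (
SALE_DATE DATE NOT NULL,
STORE_ID INTEGER NOT NULL,
PROD_ID INTEGER NOT NULL,
AMOUNT DECIMAL(15,2),
SALE_MONTH INTEGER GENERATED ALWAYS AS ( INTEGER(SALE_DATE)/100 ) )
DISTRIBUTE BY HASH ( STORE_ID )
PARTITION BY RANGE ( SALE_DATE )
( STARTING '2011-01-01' ENDING '2011-12-31' EVERY 1 MONTH )
ORGANIZE BY DIMENSIONS ( SALE_MONTH );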
Database configurations
IBM Smart Analytics System software comes with many predefined configuration settings. This section provides information about some additional configurations that are generally useful when you are optimizing your system.
Recommendation: For simplified testing, make MQTs eligible for optimization without checking the time of their last REFRESH:
DB2 UPDATE DB CFG FOR THEDB USING DFT_REFRESH_AGE ANY
Subsequent sections describe how to use the explain tool and Design Advisor to assist in optimizing queries. These tools need several tables. Create the tables in the SYSTOOLSPACE table space by using the following stored procedure:
CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN','C','',CURRENT SCHEMA)
-- Uses SYSTOOLSPACE by default because the third parameter is not specified
The explain tool adds rows to the explain tables but does not delete old entries. You can clean up the tables with a simple DELETE statement, as shown in the following example:
DELETE FROM EXPLAIN_INSTANCE WHERE EXPLAIN_TIME < CURRENT DATE - 2 MONTHS;
-- Cleans up other explain tables automatically by using foreign keys
The compiler might take several seconds to optimize a query before the execution starts. The db2batch command with the -i complete parameter reports compile times, that is, prepare times. If they are too long, for example, if they make up a significant fraction of the total run time, consider reducing the effort spent by the optimizer by issuing the following command:
DB2SET DB2_REDUCED_OPTIMIZATION=YES
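For example, a hypothetical db2batch invocation that reports prepare and execution times for the statements in a file might look like the following command (the database name and file name are placeholders):
db2batch -d THEDB -f queries.sql -i complete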
There are situations where the compile time for complex queries might be very short, for example, less than 0.01 second, and this does not always mean that the optimizer reduced its optimization effort. This is the case, for example, if the query is already in the package cache.
If there are many SQL statements that are almost identical except for the values of literals, the statement concentrator might reduce overall compile time. This method can work well in transactional databases but it is not recommended for data warehousing, where distribution statistics are relevant. Queries with different constants might need different query plans.
Recommendation: Do not use the statement concentrator in a data warehouse system.
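To make sure that the statement concentrator stays disabled for a database, you can explicitly set the STMT_CONC database configuration parameter to OFF, which is also its default value. The database name in the following command is a placeholder:
DB2 UPDATE DB CFG FOR THEDB USING STMT_CONC OFF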
Physical data modeling
This section highlights some key recommendations that are important for physical modeling in data warehouse databases.
For more general information about physical data modeling, see Best Practices: Data Life Cycle Management: http://www.ibm.com/developerworks/data/bestpractices/lifecyclemanagement/
For more details on table partitioning and MDC, see Database Partitioning, Table Partitioning, and MDC: http://www.redbooks.ibm.com/abstracts/sg247467.html
For information about data compression, see Best Practices: Deep Compression: http://www.ibm.com/developerworks/data/bestpractices/deepcompression/.
Recommendations:
- Define keys of dimension tables as not null, even if you do not formally set the keys as primary keys. It is a good idea to also define all joined columns as not null. You can define columns as not null easily, as shown in the following example:
ALTER TABLE THETABLE ALTER COLUMN THECOL SET NOT NULL
If the query compiler recognizes that a column cannot have null values, more optimizations become possible.
- Use the same data type, length, and precision for pairs of columns that you use in equijoins. Otherwise, the compiler does not consider hash joins. (A catalog query for checking the column definitions follows this list.)
- Define primary keys for all dimension tables. Define primary keys for fact tables only if needed.
- Before defining a primary key, create a corresponding unique index with an explicitly chosen name. This avoids lengthy names for system-generated indexes.
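The following hypothetical catalog query compares the definitions of a pair of join columns; the schema, table, and column names are placeholder assumptions:
-- Compare the data type, length, scale, and nullability of two join columns
SELECT TABNAME, COLNAME, TYPENAME, LENGTH, SCALE, NULLS
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = 'THESCHEMA'
AND TABNAME IN ('SALES_FACT', 'PRODUCT_DIM')
AND COLNAME = 'PROD_ID';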
Foreign keys
A foreign key relationship is a relationship between two tables where a set of columns in one of the tables is a primary key and all values of the corresponding columns in the other table occur in the key columns. Such constraints that are defined by foreign keys are sometimes called referential constraints. In a data warehousing environment, foreign key relationships exist typically between the primary key columns of dimension tables and the corresponding foreign key columns of the fact tables. In a normalized data model, the foreign key relationships occur as references to lookup tables. They can also occur between fact tables. For example, a table with itemized sales data might have a foreign key referencing a market basket table that has one entry per transaction.
Recommendation: If the database applications ensure the consistency of foreign keys, declare the foreign keys as NOT ENFORCED ENABLE QUERY OPTIMIZATION.
The main purpose of foreign keys is to guarantee the integrity of a database. In the data warehouse context, foreign key relationships are generally ensured by the applications that write data into the database. Verifying foreign key relationships again in the database causes unnecessary performance overhead. If the process that writes the data guarantees that the foreign keys are valid, define the relationship in the database as NOT ENFORCED.
Foreign keys can give the DB2 optimizer valuable hints because cardinality estimates become more precise and redundant joins can be eliminated. Include the ENABLE QUERY OPTIMIZATION option in the SQL declaration of the foreign key. This option is the default.
You must define the referenced column as a primary key, or it must have a UNIQUE constraint. You must also define the column as NOT NULL, and it must have a unique index.
The following example shows the ALTER TABLE statement for creating a foreign key with the options described previously. The statement specifies that all values of the PROD_ID column of the SALES_FACT table occur in the PROD_ID column of the PRODUCT_DIM table.
CREATE UNIQUE INDEX PRODUCT_DIM_PROD_ID ON PRODUCT_DIM (PROD_ID);
ALTER TABLE PRODUCT_DIM ADD PRIMARY KEY (PROD_ID);
ALTER TABLE SALES_FACT
ADD CONSTRAINT FK_PROD_ID FOREIGN KEY (PROD_ID)
REFERENCES PRODUCT_DIM (PROD_ID)
NOT ENFORCED
ENABLE QUERY OPTIMIZATION;
Foreign keys give the DB2 optimizer valuable information for simplifying queries. Because there is no performance overhead if you specify them with the NOT ENFORCED option, you should define all foreign key relationships in a database. This assumes that the keys are consistent, that is, they have been checked before the database is updated. If the foreign keys are not consistent, queries might produce incorrect results.
Download the full report for more on data modeling, statistical views, tools, and using explain plans to identify potential problems.