The DB2 optimizer exploits the catalog information to choose an access path for a given SQL statement, and RUNSTATS is what populates the catalog. It’s important, therefore, to collect a good base of statistics to give the optimizer a fighting chance of choosing well-performing access paths for your SQL.
Here we will discuss some common mistakes that are made when running RUNSTATS.
Mistake One: Collecting Statistics on Indexes and Tablespaces at Different Times
It’s common to collect inline statistics as part of a REORG. When a tablespace is REORG’d, all indexes are also rebuilt and thus statistics collection will establish a consistent view of the tablespace and its associated indexes. However, index REORGs may also be scheduled separately to rebuild a disorganized index. For index REORGs, collecting inline statistics can result in inconsistencies if data volumes have changed since the last RUNSTATS on the tablespace or other indexes for that table. Such inconsistencies could impact the access path choice since the optimizer doesn’t attempt to resolve statistics conflicts between related objects.
It’s therefore recommended to only execute RUNSTATS (or inline stats) on the tablespace and associated indexes at the same time, and not to collect only on an individual index during REORG.
Mistake Two: Misunderstanding NUMCOLS Specification for Index RUNSTATS
I often see the following usage of NUMCOLS as part of RUNSTATS on an index:
INDEX(ALL) FREQVAL NUMCOLS 5 COUNT 10
This will collect the top 10 most frequently occurring values for the combination of the first five columns of an index. The mistake, however, is the assumption that RUNSTATS will cascade the collection of all combinations—such as frequencies on the first four columns, the first three columns, the first two columns and finally the first column. This is incorrect, as you only get what you ask for with this specification. You ask for five columns—you get five columns.
The solution is to specify NUMCOLS 4, then NUMCOLS 3, NUMCOLS 2 and NUMCOLS 1. However, this takes us to the next mistake–where such statistics are rarely used.
Mistake Three: Excessive NUMCOLS Collection as a Default
For those who understand the syntax for multicolumn frequencies, they often use the following default:
RUNSTATS TABLE (ALL) INDEX (ALL) KEYCARD
FREQVAL NUMCOLS 1 COUNT 10
FREQVAL NUMCOLS 2 COUNT 10
FREQVAL NUMCOLS 3 COUNT 10
FREQVAL NUMCOLS 4 COUNT 10
FREQVAL NUMCOLS 5 COUNT 10
This is itself a mistake–simply because multicolumn frequency statistics are rarely useful and thus default collection for all indexes is excessive. While these are efficient for RUNSTATS to collect, they add unnecessary rows to the catalog (SYSIBM.SYSCOLDIST) and increase bind/prepare time since the optimizer must read them in and process these.
For the optimizer to take advantage of multicolumn frequencies, the query needs equal predicates with literals on all columns in that group. So the index with COL1, COL2, COL3, COL4, COL5—you need five equal predicates, all with literals (no host variables or parameter markers). This isn’t common.
There will be some situations where multicolumn frequencies are beneficial, but it’s rarely worth collecting multicolumn frequencies by default.Instead, collect them if an individual query is identified where they will benefit. If you choose to stop collecting multicolumn frequencies, then read on to the next mistake.
Mistake Four: Failing to Clean up or Recollect 'One-Off' Statistics
When additional frequency, histogram or other colgroup statistics are collected, it’s common to cease collection if there was no observed benefit. The question then is: did you remove them? Unfortunately, the answer is often no.
When RUNSTATS executes, it only replaces existing statistics with new values; it doesn’t clear out other statistics collected on that tablespace or index. For example, if you collect NUMCOLS 2 COUNT 10 but the next RUNSTATS execution doesn’t specify this clause, then those statistics will remain from their prior collection. This leads to statistics becoming stale and that can lead to access path issues.
The simple way to determine if there’re older (potentially stale) statistics is to query SYSIBM.SYSCOLDIST looking for distinctly different STATSTIMEs for the same table. One exception: Since frequency statistics are stored as a percentage of the data, then these only become stale if the stored percentages are no longer representative. It’s feasible to collect single column frequencies once, and not recollect them regularly, so be careful cleaning up single column frequency entries (TYPE=’F’ and NUMCOLUMNS=1).
DB2 11 RUNSTATS delivers a new RUNSTATS option, RESET ACCESSPATH, to reset statistics on the objects to -1. This is an effective way to remove old statistics, but ensure you follow up by recollecting statistics with your standard RUNSTATS options. Alternatively, frequency statistics can be removed by specifying COUNT 0 (e.g., if you previously collected NUMCOLS 5 COUNT 10, you can remove those specific statistics by using NUMCOLS 5 COUNT 0). This approach doesn’t apply to histogram or other colgroup statistics. Another alternative is to simply issue a DELETE statement to remove the unnecessary rows from SYSIBM.SYSCOLDIST.
Tooling such as IBM InfoSphere Optim Query Workload Tuner can help identify stale or conflicting statistics, and can also identify additional statistics that would benefit your workload. DB2 11 for z/OS also helps identify statistics that are missing based upon your SQL and also statistics conflicts that exist. As mentioned in mistakes four and four, conflicts are often because of statistics being collected at different times, which points to an error in your procedures, or because statistics were collected and never recollected or deleted. But you shouldn’t wait until they introduce a statistical conflict before you address these.
Terry Purcell is the lead designer for the DB2 for z/OS Optimizer at IBM's Silicon Valley Lab and is recognized worldwide for his expertise in query optimization and performance with DB2 for z/OS.