I recently wrote about the most common mistakes that are made with respect to RUNSTATS collection in DB2 for z/OS
. The responses were positive, but looking into this further, I see that these mistakes are being made more frequently than first thought.
Part of my focus has been on “what NOT to collect,” whereas some customers have been asking for clarification on “what to collect.” Why am I focusing on the opposite of what customers are asking?
The primary reason is because DB2 11 for z/OS (available since October 2013) already provides recommendations in the catalog table SYSIBM.SYSTATFEEDBACK or explain table DSN_STAT_FEEDBACK on what statistics may benefit your workload. This is based upon query explains, static BIND/REBIND or dynamic SQL prepares, and is therefore specific to your applications. It’s not possible to determine a standard set of RUNSTATs options for all tables and all workloads without knowing what SQL statements are being executed. That’s the benefit of this DB2 11 for z/OS enhancement: it targets the recommendations based upon your SQL.
There’s some work in DB2 11 that’s required to convert the DB2 statistics recommendations into actual RUNSTATS commands. Although the longer goal is for DB2 to simplify this processing.
Therefore, DB2 already has an answer to the question of “what to collect.” Now I will answer “why” I care about “what NOT to collect.”
Identifying Unnecessary Statistics
There’s no process in DB2 today to identify existing statistics that are unnecessary for the optimizer. Additional statistics may increase RUNSTATS collection cost, add extra rows to the catalog and may also increase BIND/prepare time since the optimizer must read in all available statistics to determine what is and isn’t useful for each query.
There are two categories of statistics to call out:
- Multi-column FREQVAL: Commonly collected from an index using FREQVAL NUMCOLS n COUNT n
- Previously collected statistics that were never re-collected: Often a once-off collection of histogram or other COLGROUP statistics that were trialed to resolve an issue, but never recollected or removed.
Multi-column frequency statistics collected via an index are simple and cheap to collect as a default due to the simplicity of the syntax and efficiency of collection via the index. However, multi-column frequencies are only exploited by the optimizer for equal predicates containing literal values—e.g., if a multi-column frequency is collected on COL1, COL2, COL3, then this would only be utilized by the optimizer if there is a WHERE clause that has equals (‘=’) on all three columns. Given that the majority of static SQL utilize host variables, and dynamic SQL use parameter markers, then the optimizer is unlikely to take advantage of frequency statistics for those. While it’s possible that multi-column frequencies may provide value for some SQL, this likelihood decreases as the number of columns being collected increases, meaning that a two-column frequency is more likely to be exploited than a three-column frequency. Those frequencies on four, five or 10 columns are unlikely to have any SQL that will exploit these.
Regarding “old” statistics: these typically remain because they were initially collected as a trial or a mistake, and it’s misunderstood that they aren’t subsequently removed or overwritten. Maybe you heard that DB2 9 delivered histogram statistics and decided to collect them, but didn’t see any improvement and stopped collecting them. The risk is that these statistics become stale over time, and thus can negatively impact the optimizer by misrepresenting the data at the current point in time.
Since DB2 11 takes steps to automate and simplify what statistics to collect, it’s important to note that this is first built on a foundation of statistics that already exist. This enhancement integrates with the RUNSTATS profiles initially delivered in DB2 10, and the first step to exploitation is typically to execute RUNSTATS with SET PROFILE FROM EXISTING STATS. The problem, as implied above, is that this will inherit all of the unnecessary statistics that currently exist, whether you wanted those statistics or not.
How do you know if you are currently collecting multi-column frequency statistics unnecessarily? It’s easy to determine that you are collecting multi-column frequencies, but not as easy to determine if there are queries that will benefit. But once again, it’s not common that multi-column frequencies are exploited.
The following SQL will identify if you have multi-column frequencies, and also whether those statistics show that the data is skewed (identified by column TOTAL_SKEW) or evenly distributed (TOTAL_EVEN). Why is this interesting? There isn’t value to any frequency statistics that show the data to be evenly distributed, since that is what the optimizer will assume if those statistics were not collected. Therefore, if there are many multi-column frequencies being collected that count as “TOTAL_EVEN” in the below SQL, then you are certainly collecting these unnecessarily.
SELECT NUMCOLUMNS, SUM(TOTAL_EVEN) AS TOTAL_EVEN
,SUM(TOTAL_SKEW) AS TOTAL_SKEW
,SUM(CASE WHEN 1.1/CD2.CARDF >= CD1.FREQUENCYF THEN 1 ELSE 0 END)
,SUM(CASE WHEN 1.1/CD2.CARDF >= CD1.FREQUENCYF THEN 0 ELSE 1 END)
FROM SYSIBM.SYSCOLDIST CD1, SYSIBM.SYSCOLDIST CD2
WHERE CD1.NUMCOLUMNS > 1
AND CD1.TYPE = 'F'
AND CD2.TBOWNER = CD1.TBOWNER
AND CD2.TBNAME = CD1.TBNAME
AND CD2.TYPE = 'C'
AND CD2.NUMCOLUMNS = CD1.NUMCOLUMNS
AND CD2.COLGROUPCOLNO = CD1.COLGROUPCOLNO
GROUP BY CD1.NUMCOLUMNS, CD1.COLGROUPCOLNO) AS X
GROUP BY NUMCOLUMNS
ORDER BY NUMCOLUMNS
The second category I have identified are old statistics, and again there is an SQL below that will allow you to identify these. The target is any histogram, multi-column cardinality (COLGROUP) or multi-column frequencies. I don’t look for single column frequencies only because some customers control collection of these separately from regular statistics collection if the frequencies are not changing over time. The below SQL will find statistics (other than single column frequencies) that don’t correspond to the most recent tablespace or index RUNSTATS execution.
SELECT TYPE, NUMCOLUMNS, TBOWNER, TBNAME, NAME
, MIN(STATSTIME), COUNT(*)
FROM SYSIBM.SYSCOLDIST CD
WHERE STATSTIME < CURRENT TIMESTAMP - 1 MONTH
AND (TYPE IN ('C', 'H') OR NUMCOLUMNS > 1)
AND NOT EXISTS
FROM SYSIBM.SYSINDEXES I
WHERE I.TBCREATOR = CD.TBOWNER
AND I.TBNAME = CD.TBNAME
AND I.STATSTIME = CD.STATSTIME)
AND NOT EXISTS
FROM SYSIBM.SYSTABLES T
WHERE T.CREATOR = CD.TBOWNER
AND T.NAME = CD.TBNAME
AND T.STATSTIME = CD.STATSTIME)
GROUP BY TYPE, NUMCOLUMNS, TBOWNER, TBNAME, NAME
ORDER BY TYPE, NUMCOLUMNS, TBOWNER, TBNAME, NAME
Reduce the Unnecessary
Multi-column frequencies are often being collected by default, but typically provide minimal value to the optimizer. The first SQL above helps identify if your environment has multi-column frequencies, and whether those statistics show that the data is skewed or evenly distributed. If you find that these statistics exist, and this data is evenly distributed, then the decision is easy for stopping collection of these. If the data is skewed, then the decision is not as clear.
However, I would again argue that the optimizer exploits such statistics in minimal scenarios, and this decreases as the number of columns increase. If you’re unsure as to whether you can remove these, then I would suggest removing any frequencies for greater than three columns, and revisit this if any access path regressions surface. If you choose not to continue collection, then it’s important to remove the statistics that you will not recollect (e.g., to remove a five-column frequency, you can use NUMCOLS 5 COUNT 0 in the appropriate RUNSTATS job, or issue a DELETE FROM SYSIBM.SYSCOLDIST with an appropriate WHERE clause to ensure the correct statistics are removed).
Removing old/stale statistics is also critical, and while frequency statistics can be removed by specifying COUNT 0, histogram or COLGROUP (TYPE=’C’) rows must be deleted.
Remember, the goal here is to reduce unnecessary statistics. Otherwise DB2’s exploitation of RUNSTATS profiles will inherit these additional statistics that can increase RUNSTATS collection cost and also impact BIND/PREPARE performance.
Terry Purcell is the lead designer for the DB2 for z/OS Optimizer at IBM's Silicon Valley Lab and is recognized worldwide for his expertise in query optimization and performance with DB2 for z/OS.