Rajib Sarkar (rssarkar)
STSM Db2 Support
April 26, 2012.
I work for DB2 Support and work for the APD ( Advanced Problem Determination ) team. In my daily job I get to work on lot of real life performance issues encountered by customers using DB2 on LUW, do crash analysis I thought of sharing my experiences through this blog so that those who read this are aware of a support analyst's perspective of an issue and hopefully will be able takeaway something useful from it .. :-)
My first topic for the blog is "Threshold Data Capture for Performance issues" ..
Generally, what happens is that customer faces a performance issue and then calls up support to open a PMR and then the analyst provides some script or data capture instructions for the next recurrence. What if, we can get a lightweight script running on the system and when the issue happens is able to capture the data automatically ? This will help both the customer and DB2 support .. for the customer, they don't have to worry about being online to collect the information when the issue happens ( and it may so happen, that the window for data capture is lost by the time the DBA logs in ) and for DB2 support, it will be able to gather the right data at the right time to do quick analysis of the underlying cause and save time for every one.
In our latest DB2 versions, we are incorporating the "threshold data capture" via db2fodc options. These options were rolled in the DB2 Version 97 FP5. The threshold data capture is based either on CPU spikes or memory spikes or connection spikes. This greatly reduces the complexity of gathering the right data at the right time. The new option under db2fodc is -detect. The multiple sub-options are detailed below
-cpu ( triggers script db2cos_threshold and puts data in FODC_CPU_<timestamp>_<memberNumber> in DIAGPATH) -- Collect CPU specific performance data
-memory ( triggers script db2cos_threshold and puts data in FODC_Memory_<timestamp>_<memberNumber> in DIAGPATH) -- Collect memory info to diagnose problems like no free memory, high swap utilization, paging or memory leak
-connection ( triggers script db2cos_threshold and puts data in FODC_Connections_<timestamp>_<memberNumber> in DIAGPATH ) -- Collect connection info, diagnose problems like spike in number of applications in executing, compiling or new connections being denied etc.
There are two more options also available, which are useful to detect CLP or upgrade issues.
-clp ( triggers script db2cos_clp and puts data in FODC_Clp_<timestamp>_<memberNumber> in DIAGPATH ) -- Collect environment and configuration info, troubleshoot problems related to instance creation.
-preupgrade ( triggers script db2cos_preupgrade and puts data in FODC_Preupgrage_<timestamp>_<memberNumber> in DIAGPATH ) -- Collect performance related info before a critical upgrage ( such as catalog info etc. ) or upgrading an instance or updating to next fixpack. Helps troubleshooting problems which might occur after the upgrade or update.
An example usage of this would be, say a customer is encountering perf issues on a ISAS system with 32 nodes spread across 4 physical server and notices that its accompanied by CPU spikes say for example, the usage goes to 90%. The -cpu option can be used to trigger data capture say a little before it hits 90% so that DB2 support gets some good data on what was happening on the system when it was spiking up. We can deploy this script on all the physical servers by invoking it just from one physical server using the following command:
db2fodc -detect -cpu idle"<=85" triggercount="3" interval="1" iteration="1" -member all
This will now keep watching the system and moment the CPU usage goes above 85% for 3 consecutive seconds ( triggercount * interval option ) on any of the physical servers it'll call db2cos_threshold and capture the data.
Under the -cpu option, it can watch for runqueue as well and it can be combined with CPU usage to trigger data capture.
I would suggest exploring the different options and finding out how it can be used on your system to troubleshoot performance issues. The scripts it triggers are all optimized for the specific option its called under and will collect the most pertinent info right at the beginning so that we don't lose the window of opportunity to get good data.
What if the you are on a DB2 version which is less than v97fp5 ?? Well, yes, you cannot use the db2fodc option, but there's help .. I've written a Perl script called watchThreshold.pl which pretty much does the same thing as db2fodc -detect and can watch for CPU, pageouts, strings in any file, package cache growth and connections spike. This also provides a way to capture data continuously leading up to the trigger, which can help a support analyst to find out the tipping point which subsequently causes the issue.
The different options are:
-idle <value> ( Trigger when %idle reaches below this threshold )
-user <value> ( Trigger when %usr reaches above this threshold )
-sys <value> ( Trigger when %sys reaches above this threshold )
-qrun <value> ( Trigger when run-q value reaches above this threshold )
-qblock <value> ( Trigger when block-q value reaches above this threshold )
-qoperator <val> ( Override default comparison operation for Queue values )
-cpuoperator <value> ( Override default comparison operator for CPU values )
-condition <and|or> ( Can combine CPU and run-q output. ( Default: or ) )
-avm <value> ( This is applicable only for AIX platforms and will trigger when avm reaches above <value> )
-po <value> ( This is also applicable only for AIX platforms and will trigger when po reaches above <value> )
-watchFile <filename> ( File to watch )
-string <string1> [ -string <string2> ... -string <stringn> ]
-matchtype [1|2] ( 1 = match one string at a time ( or condition ), 2 = match all strings provided ( and condition ). Default:1 )
-connections <num> ( This option will check number of appls. in UOW-Executing status and trigger data collection based on that )
-connecttype <string> ( User can define the connection type for the threshold e.g. UOW-Executing, Compiling etc. default:UOW-Executing )
-numtimes <value> ( Number of consecutive seconds the trigger is breached before it will start data collection
This is to reduce the false positives as much as possible. This is applicable
for -vmstat and -pkgcache option only. ( Default: 5 seconds ) )
-command <commands> ( If don't want to run script, can use this option to trigger data collection )
-before <commands> ( Run command(s) before it starts to watch for trigger )
-scriptname <script> ( Trigger script )
-scriptargs <args> ( Args to be passed to the script )
-sleeptm <time> ( Sleep time in mins before waking up again to watch for trigger ( Default: 5 mins ) )
-iter <iterations> ( Number of times it should watch for trigger before exiting ( Default: 1 ) )
-datasleep <sleepTime> ( Sleep time of continuous data collection thread ( Default: 5 minutes ) )
-fork ( Fork the continuous data collection thread )
-fileos <osCmdFile> ( Run the commands in file for continual data collection )
-maxTime <value> ( Run this script for <value> HOURS and exit even if trigger is not breached
( Default: run indefinitely till trigger ) )
-killCallout <value> ( Kill the callout script if not completed within <value> mins( Default: 60 mins) )
-daemon ( Run the script as a true daemon )
-purge [<value>] ( If continual data collection is asked for, the data can be purged after <value> hours (Default: No purge) )
-email <emailid> ( Email after script has triggered and completed data collection )
( For multiple emails, use -email <emailid> -email <emailid> )
-h [ examples ]
The -h examples option shows a few usage examples of the script. This script is very lightweight and can be run without any issues on a production system for long period of times.
Another good usage of this script is that you can just rename the script ( since it checks whether its running on the system or not .. only 1 copy of it is allowed to run at any time ) and run it under user root to trigger OS data collection ( which can be done as root only e.g. AIX trace or iptrace etc. ) at the same time so that now, you can have the OS data and DB2 data which can be co-related. Of course, this script is generic and can be used for any other purpose as well for trigger based data gathering .. the only options which are specific to DB2 are the package cache and connection spike trigger.
An example usage would be:
Customer is complaining of very short CPU spikes on their Linux system running DB2 v95fp8 which slows down their apps for that period and then goes away.
As an analyst, I would like to see some historical data leading up to the issue and when it happens collect some more data so that I can compare before and after data and see what new stuff came to the database to cause the spike. I'll then deploy the script as:
watchThreshold.pl -vmstat -idle 85 -iter 1 -numtimes 3 -scriptname collect.ksh -fork -fileos contData -daemon -purge 2 -email email@example.com
The script will run as a true daemon ( no controlling terminal ), keep a watch on the CPU usage and collect data every 5 minutes and when the issue happens run the script collect.ksh, exit out and send an email as well. Also, for the continuous data collection keep only the last 2 hours worth of data.
The contents of contData ( which will collect data continuously -- every 5 mins ( this is the default. If you want to do more frequently, use the -datasleep option ) is:
vmstat -a 2 > vmstat.log &
iostat -xmt 1 5 > iostat.txt
top -b -d 5 -n 2 > top.txt
db2pd -edus -rep 2 2> db2pd_edus.txt
db2pd -age > db2pd_age.txt
db2pd -latches > db2pd_latches.txt
db2pd -alldbs -apinfo > db2pd_apinfo.txt
db2pd -alldbs -active > db2pd_active.txt
db2pd -alldbs -dyn > db2pd_dyn.txt
and collect.ksh ( when the actual CPU spike happens )
vmstat 1 60 > vmstat.trigger.$TSTAMP &
vmstat -a 1 60 > vmstat.a.trigger.$TSTAMP &
iostat -xmt 1 20 > iostat.trigger.$TSTAMP &
top -b -d 1 -n 4 > top.trigger.txt.$TSTAMP &
db2trc on -i 128M -t >> $LOGFILE &
db2pd -latches -rep 1 20 -file db2pd_latches.trigger.txt.$TSTAMP >> $LOGFILE &
db2pd -edus -rep 1 20 -file db2pd_edus.trigger.txt.$TSTAMP >> $LOGFILE &
db2pd -age -rep 1 20 -file db2pd_age.trigger.txt.$TSTAMP >> $LOGFILE &
db2pd -alldbs -apinfo all -rep 1 10 -file db2pd_apinfo.trigger.txt.$TSTAMP >> $LOGFILE &
db2pd -alldbs -active -rep 1 10 -file db2pd_active.trigger.txt.$TSTAMP >> $LOGFILE &
db2pd -stack all -rep 30 2 >> $LOGFILE &
db2trc stop >> $LOGFILE
db2trc dmp db2trc.dmp.$TSTAMP >> $LOGFILE
db2trc off >> $LOGFILE
db2pd -alldbs -dyn -file db2pd_dyn.trigger.txt.$TSTAMP >> $LOGFILE &
db2trc flw db2trc.dmp.$TSTAMP db2trc.flw.$TSTAMP -t
db2trc fmt db2trc.dmp.$TSTAMP db2trc.fmt.$TSTAMP
I've used this script for many customer issues with CPU spikes, pageouts, package cache usage spikes, connection spikes and have found it to be very useful to get the right data at the right time, which is the key to solving any perf issue.
Hope you find it useful as well .. :-)