Rajib Sarkar (rssarkar)
STSM Db2 Support
April 20, 2013.
I deal with PMR escalations in my daily job and I can't say how many times I've seen that the system has slowed down and a CPU spike was observed during the slowdown. In such cases, data collection at the right time i.e. when the CPU has spiked up is the most important thing. If you miss the data collection during the spike but still go ahead and collect, it's not going to help much as we would not know what drove the CPU up.
Now, this is where things get a bit complicated, the data collection depends on what "kind" of CPU spike is observed i.e. is it a %usr CPU spike or %sys CPU spike? The data collection is different in each of the cases .. When %usr CPU spikes up it means that system is spending time in executing application code which in most cases is DB2 but when %sys CPU spikes up it means that system is spending time in kernel code and not able to execute any application code.
I generally follow a simple principle .. i.e. when %sys CPU is high, I want OS internal data to see what call is driving up the CPU and also collect some DB2 information i.e. stacks ( db2pd -stack all ) and db2pd -latches ( to see if there's any latch contention ). For OS data, I generally collect:
AIX:
1) OS trace -- perfpmr.sh -x trace.sh 30 ( or separately start the trace command as user root ).
Solaris:
1) dtrace script to see which system call is being called the most ( you can get a lot of samples using Google )
Linux:
1) systemTap script to find out which system call is being called the most OR
2) strace -c of the db2sysc pid ( only if we are fairly certain if DB2 is driving it ).
If say, DB2 is driving the %sys CPU high 99% chances are that its latch contention within DB2 code. So, we have to find out what code path is driving that contention up. The stacks help pinpoint that. The db2pd -latches output also helps in that respect.
From OS trace ( AIX ), a curt report tells what kind of system call is driving up the %sys cpu and if its say latch contention then the splat report from the trace can show which address is having the highest contention on which can be matched with the db2pd -latches output.
The main thing is to get the "right data at the right time" otherwise the data collection is useless and won't help get to the root cause of the issue.
If %usr is high, generally I concentrate on what kind of queries were being run on the DB. With the new MON interface of v9.7 it becomes so easy now where you can use the monreport module to get the highest CPU consumers fairly quickly or you can write your own queries to query the table MON_GET_PKG_CACHE_STMT and other MON tables to find out the top CPU consumers. If you are not on v9.7 then it becomes a bit tougher, there's some sysibmadm routines such as top_running_sql which can be used but you can use application snapshot and dynamic sql snapshot to narrow down those queries and see if tuning them helps reduce the CPU consumption.
This was a quick guide on what to with CPU spikes .. I hope you found it useful!
#Db2