Db2 Tools for z/OS

OMEGAMON XE for Db2 PE - Adaptive Thresholds for Anomaly Detection by Paul Kenney

By CALENE JANACEK posted Wed June 09, 2021 02:42 PM


A new feature of IBM OMEGAMON for Db2 Performance Expert on z/OS, thread-level anomaly detection, was released in March 2021 as a Technical Preview. It is now available. PTFs UI74612 and UJ05166 must be applied before you can start using it. For details on how to set up, configure, and use the feature, please check out the library entry.

Performance thresholds are static and don’t take into account changes over time, and one size doesn’t fit all. Threshold checking is used to detect expensive Db2 threads, which often use too much GCP CPU, meaning that work isn't running as zIIP-eligible processing.

Expensive threads are tracked through metrics like CPU Time, Elapsed Time and Get Pages. Setting thresholds for these metrics is manual and more art than science. The user must guesstimate a value for each metric and then explicitly define and create different thread groups. In addition, static thresholds do not change (adapt) as workloads change and, as a result, can produce too many false positives.

The value of Adaptive Thresholds is ease of use. They remove the guesswork and reduce the manual effort of identifying and creating each thread group. Determination of “normal” is tailored for each thread grouping. You now have reliable problem detection that reduces false positives, so you are more likely to investigate threads identified as out of range.

Adaptive thresholds learn what is normal over time for different thread groups, and users no longer need to create each thread group manually. A new E3270 UI panel shows the threads that consume resources beyond normal. This new function is turned off by default. When you turn the feature on, the footprint it introduces is routed predominantly to the zIIP processor.

Note: This feature does not generate any alerts based on thresholds at this time. Anomalies are identified based on the adaptive thresholds and are then reported in the E3270 UI.

How do you know if a Db2 thread is using too many resources? 

How do you draw the line between normal and problematic? 

Consider a CICS® transaction. It makes sense to expect it to use less than 5 seconds of Db2 Elapsed Time and to set a threshold to detect larger values. Now consider a Batch Program. Odds are that the CPU Time, Elapsed Time and Get Pages of a Batch Program will be much larger, so a threshold of 5 seconds for elapsed time will not work.

The challenge becomes this: you want to trigger an exception for a CICS transaction that runs too long, but at the same time you don't want to trigger an exception for a Batch program, which under normal conditions you would expect to run for a long time.
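The problem with one static limit can be sketched in a few lines of Python. This is only an illustration: the sample values, the group names, and the 3600-second batch limit are made-up assumptions, not values taken from OMEGAMON.

```python
# Hypothetical samples: (thread group, Db2 elapsed time in seconds).
SAMPLES = [
    ("CICS", 0.8),     # typical CICS transaction
    ("CICS", 7.5),     # unusually slow CICS transaction
    ("BATCH", 900.0),  # normal for a long-running batch program
]

STATIC_LIMIT = 5.0  # one elapsed-time threshold applied to every thread

# Illustrative per-group limits; the batch value is an assumption.
GROUP_LIMITS = {"CICS": 5.0, "BATCH": 3600.0}


def static_exceptions(samples, limit=STATIC_LIMIT):
    """Flag every sample that exceeds the single static limit."""
    return [s for s in samples if s[1] > limit]


def grouped_exceptions(samples, limits=GROUP_LIMITS):
    """Flag a sample only if it exceeds the limit of its own group."""
    return [s for s in samples if s[1] > limits[s[0]]]
```

With the single limit, both the slow CICS transaction and the perfectly normal batch program are flagged. With per-group limits, only the slow CICS transaction is flagged, but someone still has to choose and maintain each limit by hand.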

This is where the second key, and very powerful, part of performance monitoring comes in: anomaly detection using adaptive thresholds via Machine Learning and Artificial Intelligence.

In OMEGAMON for Db2, anomaly detection starts with an initial learning period, during which the metrics of Db2 thread executions are recorded and grouped by specific Thread Identity fields. Once the system has learned the metrics over a configurable number of thread executions, Db2 threads that match the execution group are measured against the previously learned metrics to look for anomalies. An anomaly is a thread that falls outside the learned range by more than the tolerance value. Learning continues with each new value, provided the value is within a reasonable range based on the discard tolerance value.
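The learning flow described above can be illustrated with a small Python sketch. It is not the product's algorithm: the class, the parameter names (`learn_count`, `tolerance`, `discard_tolerance`), the group key, and the choice of "largest value seen so far" as the learned range are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    count: int = 0     # executions learned so far
    high: float = 0.0  # largest metric value learned so far


class AdaptiveThreshold:
    """Toy per-group detector; all names and defaults are hypothetical."""

    def __init__(self, learn_count=3, tolerance=0.5, discard_tolerance=2.0):
        self.learn_count = learn_count              # size of learning period
        self.tolerance = tolerance                  # anomaly margin (fraction)
        self.discard_tolerance = discard_tolerance  # learning margin (fraction)
        self.groups = {}

    def observe(self, group, value):
        """Record one thread execution; return True if it is an anomaly."""
        stats = self.groups.setdefault(group, GroupStats())
        if stats.count < self.learn_count:
            # Initial learning period: record the value, never flag it.
            stats.count += 1
            stats.high = max(stats.high, value)
            return False
        anomaly = value > stats.high * (1 + self.tolerance)
        # Keep learning only from values within the discard tolerance.
        if value <= stats.high * (1 + self.discard_tolerance):
            stats.count += 1
            stats.high = max(stats.high, value)
        return anomaly


detector = AdaptiveThreshold()
for elapsed in (1.0, 1.2, 0.9):               # learning period
    detector.observe("CICS/PAY1", elapsed)
ok = detector.observe("CICS/PAY1", 1.3)       # within the learned range
flagged = detector.observe("CICS/PAY1", 9.0)  # far beyond the learned range
```

Here `ok` is `False` and `flagged` is `True`. The 9.0-second execution is also excluded from learning, so a single runaway thread does not stretch what the group considers normal.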

Thread-level anomaly detection uses existing data collection mechanisms in OMEGAMON for Db2 and further leverages the collected data using machine learning algorithms. From a consumption perspective, the logic that continuously calculates the respective metrics and updates the thread group data is entirely zIIP-eligible and therefore does not contribute to general-purpose CPU consumption.

Thread-level anomaly detection using adaptive thresholds lets you set ‘smart’ thresholds based on what has been learned from experience. This reduces the number of ‘false positives’, where threads are flagged for exceeding a threshold but are not truly in error. This way you can concentrate your performance tuning efforts on the specific threads that are causing performance problems.





Fri July 29, 2022 11:21 AM

Can someone say, or maybe edit the blog to say, what happens after an anomaly is detected?  Is there some message in SYSLOG, some indication on TEP, an email?  In SMSz, can an automation fire in SA?  Great blog!!

Fri July 29, 2022 11:20 AM

I definitely agree with Mac's comments, infusing AI into our zAIOps products is a great advance for customers and IBM!

Fri August 06, 2021 09:42 AM

really good example of infusing AI into our software to drive AIOps