
Know your anomalies and be friends with them! Part 2



In part one of this series, we discussed the challenges of anomaly detection and how false positives can creep in very often, for very good reasons. Improving the algorithm and being selective about data and scope can reduce the false positive rate, but it will not eliminate them. The final count of false positives still depends on many unpredictable or uncontrollable factors. In this blog, we will discuss how false positives are perceived in anomaly detection and how to manage them.

Anomaly and False positive:  to trust or not to trust! 


"False Positive" is defined by Oxford Language Dictionary as "a test result which incorrectly indicates that a particular condition or attribute is present."   In the context of IT Operation, false positive commonly interpreted as "an anomaly is detected when there wasn't any concern".  

False positives impact the IT operations shop by taking time away from other tasks. The situation is especially difficult because many existing anomaly detection techniques neither provide enough of a trail to explain the anomaly calculation nor suggest a next step. Very often, the anomaly is left as an "unsolved mystery".

When a detected high anomaly isn't followed by a problem, the most common response from the anomaly detection product team is that "a high anomaly doesn't mean a problem". But many existing products use anomaly detection for the problem avoidance use case; isn't that contradicting the prior statement?

My view on the unsolved mystery and the contradicting statement comes down to two questions:

  • What should you expect from anomaly detection?
  • What's the process for managing anomaly detection results?

What to expect from anomaly detection?


When should you expect a high anomaly? When there is a problem that has already been fixed automatically by an automation process? When a user manually changed some configuration on an application? When a flash sale on a website triggered a surge of credit card transactions? It could be some or all of the above, depending on the characteristics and algorithms used during anomaly detection.

In part 1 of this series, we discussed different characteristics, examples and considerations for anomaly detection. Most of these characteristics or algorithms DO NOT directly indicate a problem. Instead, an anomaly detection algorithm uses those characteristics to model what a problem will look like, and reports a high anomaly when the incoming data correlates strongly with the characteristics in the model. These correlations are not "black and white"; they are measured in degrees. These degrees are often reported to the user as a "confidence level" and can be interpreted as the probability that the anomaly will be followed by a real concern. If a user finds that an 80% confidence level generates too many alerts, the user might raise the alert threshold to an 85% confidence level.
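
To make the confidence threshold concrete, here is a minimal Python sketch of confidence-based alerting. Everything in it is illustrative: the anomaly records, field names and threshold values are made up for this example and do not come from any specific product.

    # Illustrative sketch: the records, field names and thresholds below
    # are hypothetical, not taken from any specific product.
    ALERT_THRESHOLD = 0.85  # raised from 0.80 after too many alerts

    detected_anomalies = [
        {"metric": "transaction_rate", "confidence": 0.92},
        {"metric": "cpu_utilization", "confidence": 0.81},
        {"metric": "db_lock_waits", "confidence": 0.88},
    ]

    def alerts_to_raise(anomalies, threshold=ALERT_THRESHOLD):
        """Keep only anomalies whose confidence clears the alert threshold."""
        return [a for a in anomalies if a["confidence"] >= threshold]

    for alert in alerts_to_raise(detected_anomalies):
        print(f"ALERT: {alert['metric']} anomaly at "
              f"{alert['confidence']:.0%} confidence")

With the threshold at 85%, only the 92% and 88% anomalies raise alerts; lowering it back to 80% would let the 81% anomaly through as well.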

This behavior is very similar to traditional, non-machine-learning-based detection, which often uses a static condition or threshold. For example, an IT shop may want to be alerted when disk storage is 80% consumed, and that 80% is used as the static condition. When the user gets too many alerts from the 80% threshold that don't result in a problem, the user might start to ignore that specific alert or change the static threshold to 85%. Isn't that very similar to machine-learning-based anomaly detection?
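
For comparison, the static disk storage check might look like the following sketch. The 80% figure comes from the example above; the function name and byte values are hypothetical.

    DISK_ALERT_THRESHOLD = 0.80  # the static condition; raise to 0.85 if too noisy

    def disk_alert(used_bytes: int, total_bytes: int) -> bool:
        """Return True when disk consumption reaches the static threshold."""
        return used_bytes / total_bytes >= DISK_ALERT_THRESHOLD

    if disk_alert(used_bytes=850_000_000, total_bytes=1_000_000_000):
        print(f"ALERT: disk storage is at least {DISK_ALERT_THRESHOLD:.0%} consumed")

The only real difference is where the threshold comes from: a person picks 80% by hand, while a machine learning model derives its confidence levels from historical data.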

The Saga of "False Positive" and "False Negative"


False positives have a significant impact on the user because they take time away from other important work and eventually cause "alert fatigue", where the user starts to ignore alerts from the anomaly detection product. Because of this, minimizing false positives becomes one of the primary goals of many anomaly detection products.

There is a risk to this goal. Because anomaly detection is often probability based, a reduction in false positives can mean a higher chance of false negatives (incorrectly reporting that nothing is wrong when a real problem exists).
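
A tiny worked example illustrates the trade-off. Using a handful of made-up scored events, raising the alert threshold from 80% to 85% removes one false positive but also silences a real problem, turning it into a false negative.

    # Synthetic data: (confidence reported by the detector,
    #                  whether a real problem actually followed)
    scored_events = [
        (0.95, True),
        (0.88, False),
        (0.83, True),   # a real problem with a modest confidence score
        (0.81, False),
        (0.70, False),
    ]

    def count_errors(events, threshold):
        false_pos = sum(1 for conf, real in events if conf >= threshold and not real)
        false_neg = sum(1 for conf, real in events if conf < threshold and real)
        return false_pos, false_neg

    for threshold in (0.80, 0.85):
        fp, fn = count_errors(scored_events, threshold)
        print(f"threshold {threshold:.0%}: {fp} false positive(s), "
              f"{fn} false negative(s)")

At the 80% threshold this prints 2 false positives and 0 false negatives; at 85% it prints 1 false positive and 1 false negative, because the real problem at 83% confidence no longer raises an alert.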

False positives and false negatives are a double-edged sword that an anomaly detection product needs to manage. The good news is that carefully selected anomaly detection criteria, characteristics and algorithms can tilt the balance: the false positive rate can be lowered while the chance of missing a true positive does not increase dramatically.

What's next?


Now that we have talked about what to expect from an anomaly detection product and how to interpret false positives, we will discuss how to manage anomaly detection results next. By using the suite of software from IBM Z AIOps and following a tailored process, it's possible to take advantage of machine learning and anomaly detection while minimizing the impact of false information.