Leverage Anomaly Detection without worrying about noise

By Neil Boyette posted Tue July 26, 2022 10:36 AM

  
Authors: Neil Boyette & Ian Manning

Anomaly Detection has been around for a long time. But what is an Anomaly? More importantly, how can you adopt Anomaly Detection in your business and why can't we just get rid of false anomalies?
Anomaly Detection is the process of learning the normal pattern of behavior (using AI) and then indicating when that behavior has changed significantly or something unexpected has occurred. The indication is called an "anomaly", but really it is just an Event that tells you about a significant change or abnormal behavior.
An Anomaly can be an early indication of an emerging problem, but it can also indicate a condition that is simply different than what has been previously observed. For example: a server is using more memory than normal. This may not be problematic, but the question is: why is it using more memory than normal?  
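To make the idea of "learning normal and flagging a significant change" concrete, here is a minimal sketch in Python. It uses a simple z-score baseline, which is only an illustrative stand-in and not one of the algorithms the product actually runs; the function name, the threshold of 3 standard deviations, and the sample memory readings are all assumptions made for this example.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Return True when `value` is more than `threshold` standard
    deviations away from the mean of `history` (a plain z-score test).

    This is a hypothetical baseline for illustration only.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != baseline
    return abs(value - baseline) / spread > threshold


# Hypothetical memory usage readings (percent) learned as "normal".
memory_history = [41.0, 43.5, 42.2, 40.8, 44.1, 42.9, 41.7, 43.0]

print(is_anomalous(memory_history, 43.8))  # within the usual range
print(is_anomalous(memory_history, 78.0))  # far above the usual range
```

Note that the check only says the reading is unusual. Whether it is a problem (a leak) or a new normal (a re-purposed server) still has to be decided from context.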
Using many sophisticated approaches, and based on thousands of data points, Anomaly Detection can tell when a condition has not been seen before. In IBM CloudPak for Watson AIOps, we currently do Anomaly Detection on two types of data: Logs and Metrics. The platform automatically combines 7 different algorithms for metric anomalies and several for log anomalies, so that it can be confident in the anomalies it finds. In the memory example, it could be that the server has been re-purposed and the higher usage is expected behavior: the new normal. However, it could also be that someone is mining Bitcoin on it, or that there is a memory leak, and both possibilities are worth investigating. As such, a valid anomaly is simply a confirmed deviation from what is expected.
Valid anomalies, like high memory usage, are often dismissed as false positives. False positives occur when the algorithm creates a notification about a change that doesn't necessarily warrant attention: it is not interesting, or not significant enough to prompt action. This is a key point, as a valid anomaly can still be uninteresting and thus easily be considered noise. With too many false positives in this sense, anomaly detection becomes useless. As such, IBM CloudPak for Watson AIOps separates anomaly detection from noise reduction, so that the SRE gets the right information at the right time.
In IBM CloudPak for Watson AIOps, you can analyze all your logs and metrics using anomaly detection, but you also have control over the conditions under which you get notified. Anomalies and other events are first de-duplicated. This way, you don't have the constant drone of, for example, a warning that "memory has been found to be abnormally high", then 5 seconds later the same warning because it is still high, then 5 seconds later again, and so on. Instead, such notifications are combined into a single alert. These alerts then pass through a further series of noise reduction steps: correlating them in different ways (grouping by co-occurrence, topology, scope, etc.), augmenting them with topology and automations (which could auto-resolve the alert), and, finally, applying a set of policies that indicate importance.
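The de-duplication step described above can be sketched as follows. This is a simplified illustration, not the product's implementation: the `Alert` class, the `deduplicate` function, and the choice of (resource, condition) as the de-duplication key are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    key: tuple          # (resource, condition) identifies "the same" event
    first_seen: float   # timestamp of the first occurrence, in seconds
    last_seen: float    # timestamp of the most recent occurrence
    count: int = 1      # how many events were folded into this alert


def deduplicate(events):
    """Fold repeated events with the same key into a single alert,
    tracking an occurrence count and a last-occurrence timestamp."""
    alerts = {}
    for timestamp, resource, condition in events:
        key = (resource, condition)
        if key in alerts:
            alert = alerts[key]
            alert.count += 1
            alert.last_seen = max(alert.last_seen, timestamp)
        else:
            alerts[key] = Alert(key, timestamp, timestamp)
    return list(alerts.values())


# Hypothetical event stream: the same warning repeats every 5 seconds.
events = [
    (0.0, "server-1", "memory abnormally high"),
    (5.0, "server-1", "memory abnormally high"),
    (10.0, "server-1", "memory abnormally high"),
    (12.0, "server-2", "cpu abnormally high"),
]

alerts = deduplicate(events)
for alert in alerts:
    print(alert.key, "count:", alert.count, "last seen:", alert.last_seen)
```

Four incoming events become two alerts here, and the repeated memory warning is represented once with an updated count and last-occurrence time.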
The end result is that each (anomaly) event can have one of several different outcomes:
  • The event can be combined with previous occurrences during de-duplication. Here, the alert is updated with the fact that the event is still occurring through, for example, a count or a last occurrence time-stamp.
  • The event is associated with an alert, which in turn is available for introspection, but isn't deemed important enough to automate or interrupt the SRE.
  • The event is associated with an alert that in turn does have automations associated with it. These can auto-resolve the problem or, depending on where the client is on the adoption curve, can be manually approved and executed.
  • The event is associated with an alert that in turn isn't important enough to spawn its own story, but is related to an existing story and is attached to it as context. This allows the SRE to quickly understand the context without being interrupted.
  • The event is associated with an alert that in turn is important enough to spawn its own story; as such it is augmented with many different contexts and insights, allowing an SRE to understand and take action right away.
This range of outcomes allows each SRE to customize the behavior to suit their needs. Circling back to the abnormal memory usage example, IBM CloudPak for Watson AIOps would treat the single anomaly in many different ways, depending on whether there is a (related) problem occurring, whether it is part of an important application, whether the anomaly was already reported, etc. In short, looking at anomalies in context with other Alerts, Topology, and other useful information allows you to be notified only about business-critical Alerts, and yet see the full context, which helps you reduce Mean Time to Know (MTTK) and solve the issue quickly.
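As a rough illustration of how one anomaly can end up with different outcomes depending on context, the sketch below maps a few context flags onto the five outcomes listed above. Everything here is hypothetical: the `Outcome` and `EventContext` names, the flags, and especially the order in which the checks are applied are assumptions for the example, not the product's actual policy engine.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    DEDUPLICATED = "combined with previous occurrences"
    ALERT_ONLY = "alert available for inspection, no interruption"
    AUTOMATION = "alert handled by an automation"
    STORY_CONTEXT = "alert attached to an existing story as context"
    NEW_STORY = "alert spawns its own story"


@dataclass
class EventContext:
    already_reported: bool = False    # same anomaly was seen before
    has_automation: bool = False      # an automation is associated
    business_critical: bool = False   # part of an important application
    related_story_open: bool = False  # a related problem is in progress


def route(ctx):
    """Pick one outcome for an event. The precedence of the checks is
    an assumed ordering chosen only for this illustration."""
    if ctx.already_reported:
        return Outcome.DEDUPLICATED
    if ctx.has_automation:
        return Outcome.AUTOMATION
    if ctx.business_critical:
        return Outcome.NEW_STORY
    if ctx.related_story_open:
        return Outcome.STORY_CONTEXT
    return Outcome.ALERT_ONLY


# The same "memory abnormally high" anomaly in three different contexts.
print(route(EventContext()).name)
print(route(EventContext(related_story_open=True)).name)
print(route(EventContext(business_critical=True)).name)
```

In practice the deciding conditions would come from the policies each SRE team configures rather than from hard-coded flags.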
Learn more about IBM CloudPak for Watson AIOps.