
Know your anomalies and be friends with them! Part 1

  


Introduction

Anomaly detection can be a useful tool when an enterprise understands how it works and what its results mean. This series of blogs discusses what anomaly detection is, what its characteristics are, and how to apply it to different use cases.

Anomaly detection is a technique that learns normal behavior during the training process; anything different from that normal behavior during scoring (or inference) is considered abnormal. This technique is useful because "we don't know what we don't know." Let's consider an example:

  • The common way to detect a problem on a computer system is to match a known, narrowly defined problem pattern. For example, a message ID might indicate corruption on a storage device, or a 90% threshold might indicate that memory is over-used. Obviously, if you don't know that a specific problem can happen, you won't know to define a pattern for it. Furthermore, many problems don't have an easy-to-define problem pattern.

  • When using anomaly detection, the normal behavior of the computer system is fed into the anomaly detection algorithm, and anything different from that normal behavior is considered abnormal. With this approach, one doesn't need to know all the possible problem patterns in advance. A minimal sketch of the idea follows.
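
To make the contrast concrete, here is a minimal sketch in Python of the second approach: a normal range is learned from healthy training data, and scoring flags anything outside it. The metric values, the mean-plus-three-standard-deviations rule, and the function names are illustrative assumptions, not the algorithm used by any particular product.

```python
import statistics

def train(normal_samples, k=3.0):
    """Learn a normal range (mean +/- k standard deviations) from healthy-period data."""
    mean = statistics.fmean(normal_samples)
    stdev = statistics.stdev(normal_samples)
    return (mean - k * stdev, mean + k * stdev)

def is_anomaly(value, bounds):
    """During scoring, anything outside the learned range is considered abnormal."""
    low, high = bounds
    return value < low or value > high

# Train on memory utilization (%) observed during a healthy period (made-up data).
bounds = train([41.2, 43.5, 40.8, 44.1, 42.7, 43.0, 41.9])
print(is_anomaly(42.5, bounds))  # False: matches learned normal behavior
print(is_anomaly(78.0, bounds))  # True: abnormal, though a fixed 90% rule would miss it
```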

The anomaly detection technique can also be applied to malware detection, credit card fraud detection, intrusion detection, and many other use cases. In the rest of this blog, we will refer to computer software anomaly detection, but the general concepts apply to other domains as well.

Anomaly Detection Characteristics

Let's dig a little deeper into anomaly detection and normal behavior. The following list provides a few examples and characteristics of anomalies and normal behavior; from there, you can expand and imagine other situations that apply to your environment.

  • There are many pattern types that can represent normal behavior.
    • For a single metric, a pattern could be the upper/lower bounds of its value, its trend, or its rate of change. For multiple metrics, it could be the relationship between their values or between their trends. There are as many pattern types as you can imagine.
    • Patterns can also be observed in calculated metrics, such as the relationship of anomaly levels between multiple KPIs, or the rate of change of the anomaly level. For example, in a Java workload, whenever heap utilization is abnormally high, CPU utilization will also be abnormally high. A sketch of a few of these pattern types follows this bullet.
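
As a rough illustration of the single-metric and multi-metric pattern types named above, the following sketch computes bounds, a trend slope, a rate of change, and the correlation between two metrics. The sample data and function names are hypothetical.

```python
# A rough illustration of a few pattern types: bounds, trend, rate of change,
# and the relationship between two metrics. Data and names are hypothetical.
import statistics

def bounds_pattern(series):
    """Upper/lower bounds of a single metric's value."""
    return min(series), max(series)

def trend_pattern(series):
    """Trend of a single metric: least-squares slope per interval."""
    n = len(series)
    x_mean, y_mean = (n - 1) / 2, statistics.fmean(series)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def rate_of_change_pattern(series):
    """Typical step-to-step change of a single metric."""
    deltas = [b - a for a, b in zip(series, series[1:])]
    return statistics.fmean(deltas), statistics.stdev(deltas)

def relationship_pattern(series_a, series_b):
    """Relationship between two metrics: Pearson correlation, computed by hand."""
    ma, mb = statistics.fmean(series_a), statistics.fmean(series_b)
    cov = sum((a - ma) * (b - mb) for a, b in zip(series_a, series_b))
    var_a = sum((a - ma) ** 2 for a in series_a)
    var_b = sum((b - mb) ** 2 for b in series_b)
    return cov / (var_a * var_b) ** 0.5

heap = [55, 58, 61, 65, 70, 76, 83]  # heap utilization (%), hypothetical
cpu = [30, 32, 35, 38, 43, 49, 56]   # CPU utilization (%), hypothetical
print(bounds_pattern(heap), trend_pattern(heap), rate_of_change_pattern(heap))
print(relationship_pattern(heap, cpu))  # near 1.0: the two metrics rise together
```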

  • An anomaly detection algorithm will only learn the type of pattern(s) it was created to learn.
    • No anomaly detection algorithm can magically learn every possible pattern type.
    • For example, a boundary detection algorithm for a single metric will only learn the normal range of that metric. It will NOT learn the relationship between multiple metrics or the trend of a metric. If the goal is to learn both boundaries and trends, then one or more algorithms tailored for boundaries and trends will be needed (the sketch after this list combines both).
  • The normal patterns or KPIs to be learned should be based on the use case or specific goal.
    • There are many possible patterns and many KPIs available. Not every pattern or KPI is useful for a given use case, and not every pattern or KPI is as important as the others. Detecting an unnecessary pattern or KPI for an unrelated use case can trigger unnecessary false positives. Furthermore, each additional KPI or problem pattern to be detected means additional "data crunching," which takes more time and affects time-sensitive use cases. So although there may be hundreds of patterns and thousands of KPIs that could be learned, only a small subset is analyzed in practice.
    • For example, when the goal is to analyze trends of COVID-19, looking at a KPI such as the blood sugar level of the population wouldn't help. In this example, the blood sugar data isn't useful for the COVID-19 trend use case.
  • For a single metric, multiple pattern types might be needed for data from different time periods.
    • Computer systems and software go through different phases of processing, such as initialization, ramp-up, stabilization, and shutdown. Each phase might require a pattern with different parameters, or a different pattern type altogether. Unfortunately, it's fairly challenging to know the exact timing of these phases; different applications have different phases, and the phases can have different durations.
    • For example, during startup of a database, memory utilization might slowly increase as the database caches SQL statements based on the workload. An hour later, the caches stabilize as the database reaches a steady running state. During the first hour, a trend-based pattern measuring the rate of memory utilization increase might be more appropriate; after the first hour, a boundary check might be more appropriate. A sketch of this phase-based approach follows this bullet.
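
Here is a hedged sketch of the database example above: a trend check during an assumed one-hour ramp-up, and a boundary check once the caches have stabilized. The phase boundary, the learned limits, and the function name are invented for illustration, not taken from any product.

```python
# Hypothetical phase-aware check for the database-startup example above.
# The one-hour phase boundary and the learned limits are illustrative only.
STARTUP_MINUTES = 60

def score_memory(minute, value, prev_value):
    """Apply a trend check during ramp-up, a boundary check once stable."""
    if minute < STARTUP_MINUTES:
        # Ramp-up phase: the value itself may climb, but the per-minute
        # growth rate should stay inside the learned trend pattern.
        growth = value - prev_value
        return not (0.0 <= growth <= 0.8)   # assumed normal growth, %/minute
    # Steady state: the caches have stabilized, so a learned boundary
    # pattern on the value itself is more appropriate.
    return not (62.0 <= value <= 68.0)      # assumed normal range, %

print(score_memory(10, 35.0, 34.5))  # False: growing at a normal ramp-up rate
print(score_memory(90, 75.0, 74.8))  # True: outside the steady-state range
```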

  • The age and seasonality of data complicate the learning of normal behavior, and complex pattern-learning mechanisms very often involve tradeoffs.
    • Data from twelve months ago might not be a good representation of the current workload, yet certain characteristics of the current workload are best represented by a pattern from twelve months ago.
      • For example, suppose a website's business has been growing steadily and CPU consumption on the computer system has increased 200% in twelve months. Using the CPU consumption from twelve months ago to generate a boundary pattern will not work for today's workload; very often, the most recent one or two months is preferred when creating the boundary pattern. On the other hand, seasonal patterns such as Black Friday and Christmas can only be captured from twelve months ago.
    • In addition to annual patterns, there can be monthly tasks, weekly tasks, middle-of-the-month tasks, and other kinds of tasks. Patterns that stretch over a longer period require a longer period of training data.
      • For example, learning a monthly pattern from one month of data is impossible; it takes at least three to six months of data to reach a high-confidence pattern. But data from three to six months ago might not be a good representation of today's workload. In cases like this, a sub-optimal tradeoff might be needed, as the sketch below illustrates.
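
The following sketch illustrates one way to handle this tradeoff: train the everyday boundary pattern on the most recent two months, while reaching back twelve months only for a seasonal event. The window sizes, dates, and function name are assumptions for illustration.

```python
# Hypothetical training-window selection illustrating the tradeoff above:
# recent data for the everyday boundary pattern, plus the same calendar
# window from last year for a seasonal event such as Black Friday.
from datetime import date, timedelta

def training_windows(today, seasonal_event):
    # Everyday boundary pattern: the most recent two months only,
    # because year-old volumes no longer represent today's workload.
    recent = (today - timedelta(days=60), today)
    # Seasonal pattern: the event can only be learned from last year,
    # even though that data is otherwise stale.
    last_year = seasonal_event.replace(year=seasonal_event.year - 1)
    seasonal = (last_year - timedelta(days=3), last_year + timedelta(days=3))
    return recent, seasonal

recent, seasonal = training_windows(date(2021, 11, 2), date(2021, 11, 26))
print("boundary pattern trained on:", recent)
print("Black Friday pattern trained on:", seasonal)
```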

  • Computer systems running production workloads often experience unnatural interventions.
    • Business computer systems very often have maintenance, security updates, and new workloads. Furthermore, human users or automation software can change the system or workload very dynamically. These unnatural interventions can affect both training and scoring. Ideally, they should be reported to the anomaly detection software; unfortunately, capturing them can be very tedious, and omissions and mistakes happen very often in practice.
    • For example, during a flash sale on a website, resource consumption can surge. On a load-balanced infrastructure with automation support, additional systems might be provisioned on demand and workloads might be shifted between systems. The data from the flash sale is not a good pattern to represent normal behavior during training, and if it is analyzed during scoring, it will trigger high anomaly scores. One way to handle reported interventions is sketched below.
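
As a simple illustration, the sketch below excludes samples that fall inside reported intervention windows before training. The window list, dates, and function names are hypothetical; and as noted above, in practice such windows are often not reported at all.

```python
# Hypothetical filter that drops samples from known intervention windows
# (maintenance, flash sales) before training, assuming the windows were
# reported. The dates and function names are invented for illustration.
from datetime import datetime

intervention_windows = [
    (datetime(2021, 10, 15, 2, 0), datetime(2021, 10, 15, 4, 0)),    # maintenance
    (datetime(2021, 10, 20, 12, 0), datetime(2021, 10, 20, 14, 0)),  # flash sale
]

def is_intervention(ts):
    """True if the timestamp falls inside any reported intervention window."""
    return any(start <= ts < end for start, end in intervention_windows)

def filter_training_data(samples):
    """Keep only (timestamp, value) samples taken outside intervention windows."""
    return [(ts, v) for ts, v in samples if not is_intervention(ts)]

samples = [
    (datetime(2021, 10, 15, 3, 0), 95.0),  # during maintenance: excluded
    (datetime(2021, 10, 16, 3, 0), 42.0),  # normal operation: kept
]
print(filter_training_data(samples))
```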

What's next

With the complexity and scale of today's workloads and the amount of data they generate, machine learning is necessary to zip through vast quantities of data and keep IT operations healthy.

As you can see, it's almost impossible to learn a perfect "normal pattern," because the algorithm and the data fed into it cannot perfectly represent normal behavior. This causes false positives and false negatives: the algorithm might flag a normal pattern as abnormal, or incorrectly take an abnormal pattern as normal. When using the suite of software from IBM Z AIOps and following a tailored process, it's possible to take advantage of machine learning and anomaly detection while minimizing the impact of false information. Please stay tuned for Part 2 of this series, where I will share the details of how to mitigate false information.

Comments

Tue November 02, 2021 01:33 PM

Nice blog Patrick, a lot of ways to think about anomalies!!  Well done!!