Fostering Continuous AIOps Learning with Minimal User Efforts of Human-in-the-Loop

By Veeramani Nambi posted Fri June 04, 2021 05:22 AM

  

Blog Authors

      

@Xiaotong Liu, @Anbang Xu, @Pujitha Kara

Introduction
 

 

Any Artificial Intelligence (AI) lifecycle does not end when the first model is deployed. An AI model must continuously improve over time by learning from the mistakes it makes, and it evolves with each iteration of the feedback loop. In addition, when AI models are given the benefit of learning from user feedback on top of their initial training, they can predict outcomes much more accurately.

 

In this article, we present our practices for updating our Log Anomaly Detection models using feedback from Site Reliability Engineers (SREs) in Cloud Pak for Watson AIOps.

 

Anomaly detection from logs is a fundamental IT Operations management task. It aims to detect anomalous system behaviors and find signals that provide clues to the reasons for, and the anatomy of, a system’s failure. Unlike typical approaches that require feedback on individual model predictions, we design a computational approach that minimizes the user effort needed to gather sufficient feedback on a large amount of data. And unlike typical incremental training with user feedback, we can update our models in real time.

 

Overview of the Log Anomaly Detection Pipeline 

 

The Log Anomaly Detection Pipeline consists of two subsystems: Off-line Training and Runtime Inference. The Off-line Training subsystem focuses on log parsing, feature engineering, and anomaly detector training. The inputs to this subsystem are log data, a pre-trained language model, and a pre-defined error dictionary. We train a log parser to convert unstructured textual log messages into a structured format, and we extract count vectors of templates and error classes, as well as embedding vectors, from the log messages as feature vectors.
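To make the feature-engineering step concrete, here is a minimal Python sketch of turning a raw log line into a count-style feature vector. The template patterns, error dictionary, and function name are hypothetical stand-ins for what the trained log parser and the pre-defined error dictionary provide; the embedding vectors from the pre-trained language model are omitted for brevity.

import re
import numpy as np

# Illustrative only: a toy template set and error dictionary. The actual
# pipeline learns templates with a trained log parser and uses a
# pre-defined error dictionary.
TEMPLATES = [
    r"Connection to \S+ timed out",
    r"Request \S+ completed in \d+ ms",
    r"Disk usage at \d+%",
]
ERROR_CLASSES = {
    "timeout": ["timed out", "timeout"],
    "io_error": ["disk", "i/o error"],
}

def featurize(log_line):
    """Map one raw log line to a count-style feature vector:
    one slot per template plus one slot per error class."""
    template_hits = [1.0 if re.search(p, log_line) else 0.0 for p in TEMPLATES]
    text = log_line.lower()
    error_hits = [
        1.0 if any(k in text for k in keywords) else 0.0
        for keywords in ERROR_CLASSES.values()
    ]
    return np.array(template_hits + error_hits)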

 

To model the system behavior over time, we group log messages into 10-second time windows based on their logging timestamps. We then average the feature vectors of the logs within each window to form a representative feature vector for that window. We train unsupervised machine learning models on feature vectors from training data collected while the system was running in a normal condition. Finally, we build an ensemble of the models trained on count vectors and embedding vectors.
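Continuing the sketch, the snippet below groups per-log feature vectors into 10-second windows, averages them, and fits a very simple stand-in "detector" (a mean vector plus a threshold on prediction error). This is illustrative only and assumes the featurize output above; the actual pipeline trains unsupervised models on count vectors and embedding vectors and combines them into an ensemble.

import numpy as np

def window_features(parsed_logs, window_s=10):
    """Group (timestamp, feature_vector) pairs into 10-second windows and
    average the vectors in each window to get one representative vector."""
    windows = {}
    for ts, vec in parsed_logs:
        windows.setdefault(int(ts // window_s), []).append(vec)
    return {w: np.mean(vecs, axis=0) for w, vecs in windows.items()}

def train_detector(normal_window_vectors, percentile=95.0):
    """Fit a toy unsupervised model of 'normal' behavior: the mean feature
    vector plus a threshold on the prediction (distance) error."""
    X = np.stack(normal_window_vectors)
    center = X.mean(axis=0)
    errors = np.linalg.norm(X - center, axis=1)
    return {"center": center, "threshold": float(np.percentile(errors, percentile))}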

 

The output of this subsystem is a set of trained log anomaly detection models at multiple levels, which are saved to the Data Lake storage repository. The Runtime Inference subsystem checks, at multiple levels, whether an anomaly occurs in a given time window during runtime. The inputs to this subsystem are the trained models from the Off-line Training subsystem in the Data Lake, as well as new logs arriving in a streaming fashion through Kafka data streams.

 

Our system predicts an anomaly if the feature vector of a new time window is sufficiently different from the normal distributions learned during training. The output of this subsystem is the set of anomalies detected from the logs at the log-line level, the component level, and the application level. Detected anomalies are surfaced to users through ChatOps.
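Under the same toy model from the sketches above, runtime inference amounts to checking whether a window's prediction error exceeds the model threshold; the real pipeline performs this check at the log-line, component, and application levels.

import numpy as np

def predict(model, window_vector):
    """Flag a window as anomalous when its feature vector deviates from the
    learned normal behavior by more than the model's threshold."""
    error = float(np.linalg.norm(np.asarray(window_vector) - model["center"]))
    return {"is_anomaly": error > model["threshold"], "prediction_error": error}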

   

 

Continuous Learning with User Feedback 

 

Our system provides users with a Slack UI to examine the logs associated with a detected anomaly, along with an explanation of why an anomaly was predicted for a certain time window. For example, the Slack UI displays both the expected and the observed counts of templates in the time window. The UI also shows the raw logs associated with a specific template for a microservice component, so that users can gain insight into why the model made the prediction.
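As a hypothetical illustration of the kind of evidence behind such a view, the snippet below compares expected and observed template counts for a time window and keeps only the templates that deviate; the field names are illustrative, not the actual ChatOps message format.

def explain_window(expected_counts, observed_counts):
    """Compare expected vs. observed template counts for a time window and
    keep only the templates that deviate, as evidence for the prediction."""
    templates = set(expected_counts) | set(observed_counts)
    return [
        {
            "template_id": t,
            "expected": expected_counts.get(t, 0),
            "observed": observed_counts.get(t, 0),
        }
        for t in sorted(templates)
        if observed_counts.get(t, 0) != expected_counts.get(t, 0)
    ]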

 

Depending on the input data and the quality of the models trained, too many false alarms could be generated. The output could be flooded with anomalies of low severity or little importance, or the end user might be interested in finding anomalies only from particular components. The end user should have the flexibility to control all of these cases.

 

Unlike typical incremental training with user feedback, we aim to update our models in real time. However, the volume and variety of logs generated in real time pose significant challenges for collecting feedback on individual model predictions. To minimize the user effort needed to gather sufficient feedback on a large amount of data, we design a computational approach that converts users’ high-level feedback into updates of the models’ parameters. For example, the AI models require a prediction threshold to determine whether an anomaly should be reported. Rather than providing feedback on each individual model prediction, users can specify a time period that they consider normal, and our system automatically updates the prediction threshold using the data within that period to reduce false alarms and improve true detection rates. Users can also adjust the severity threshold of the detected anomalies so that non-severe anomalies are not reported, or disable a particular model if it was not properly trained.

 

 

 

To update the models at runtime without retraining, we implement the following in Cloud Pak for Watson AIOps:

 

Update Threshold: Given a time period considered normal by the SRE for a microservice component, we extract all the anomalies in that period, compute the 95th percentile of their prediction errors, and update the threshold of the corresponding model. Any new anomalies are then generated using the updated threshold.

 

Disable Model: For a given microservice component, we disable the model by raising its threshold to a very high value, so that no anomalies are generated for that particular service.

 

Update Severity: We provide a default severity threshold. All anomalies with a severity level below the threshold are filtered out, making the pipeline more conservative when reporting anomalies.
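The sketch below shows how these three operations could look against the toy model dictionary used in the earlier snippets; it is a simplified illustration under those assumptions, not the Cloud Pak for Watson AIOps implementation.

import numpy as np

def update_threshold(model, anomalies_in_normal_period, percentile=95.0):
    """SRE marks a period as normal: raise the threshold to the 95th
    percentile of the prediction errors observed in that period."""
    errors = [a["prediction_error"] for a in anomalies_in_normal_period]
    if errors:
        model["threshold"] = float(np.percentile(errors, percentile))
    return model

def disable_model(model):
    """Disable a component's model by pushing its threshold so high that
    no anomaly can exceed it."""
    model["threshold"] = float("inf")
    return model

def filter_by_severity(anomalies, min_severity=2):
    """Drop anomalies whose severity falls below the user-chosen threshold."""
    return [a for a in anomalies if a.get("severity", 0) >= min_severity]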

 

Example 

 

We evaluate the effectiveness of our feedback API for updating Log Anomaly Detection models using an open-source microservice application. The application is the user-facing part of an online shop that sells socks, and it contains many microservice components, including user account management, catalog, cart, orders, shipping, and so on. The training data was collected while the application was running in a normal condition with simulated user flows for 8 days, and the normal test data was collected similarly. The abnormal test data with ground-truth anomalies was obtained by injecting faults into specific components, resulting in interrupted and abnormal user flows.

 

For a given set of models and data in inference, the pipeline generated 144 anomalies without any tuning.  

 

  • Update Severity

After setting the severity threshold to 2, the anomalies generated dropped to 83, eliminating 61 lower-severity anomalies.

 

  • Update Threshold

After updating the threshold, the anomalies generated dropped to 117, removing 27 false alarms for the given service "carts".

 

  • Disable Model

After disabling the model, the anomalies generated dropped to 107, eliminating all anomalies from the given service "users".

 

Conclusion 

 

AI models are not perfect on Day 1: they need to improve continuously over time. It is important to keep users in the loop and leverage their feedback to improve the models. IBM Cloud Pak for Watson AIOps provides capabilities to update the Log Anomaly Detection models at runtime via feedback from SREs. We reduce the user effort needed to gather sufficient feedback on a large amount of data and apply user feedback in real time to improve the quality of the AI models.
