Authors: Lu An, Xiaotong Liu, Andy Tu
Is waiting a week to collect required training logs too long?
Are you fearful that the collected training logs might be contaminated by error logs?
Are the hours, or even days, needed to train the initial model causing great annoyance?
No need to worry about such problems any more: statistical baseline log anomaly detection has arrived in the IBM Cloud Pak® for Watson AIOps 3.2 release. In this article, we present our new algorithm and show how log anomaly detection (LAD) can immediately provide predictive insights without off-line training.
Why statistical baseline log anomaly detection?
Anomaly detection from logs is a fundamental Information Technology Operations management task. It aims to detect anomalous system behaviours and find signals that provide clues to the reasons for, and the anatomy of, a system's failure. One typical method is to collect enough training logs during the system's normal operation period, learn the log templates, and then detect anomalies based on the distributions among these templates.
After log training in Cloud Pak for Watson AIOps, the AI system learns the patterns of normal behaviour and is thus able to provide predictions on streaming logs. This method was available in the AI Manager LAD before the 3.2 release. However, its limitation was that it might take days or weeks before the first model was ready to be deployed, for several reasons:
- Log templates learning often requires customers to provide one week's worth of training logs without incidents.
- The log templates learning process takes hours or days to finish, depending on the size of the datasets.
- Sometimes, the customer may not know whether the training logs they provided are free of error logs.
To solve the above issues, we have introduced statistical baseline log anomaly detection in Cloud Pak for Watson AIOps. For WebSphere types of logs, it can immediately provide predictive insights without any off-line training data; for other (non-WebSphere) types of logs, it can provide predictions after 30 minutes. This algorithm greatly reduces the time-to-value of log anomaly detection in AIOps.
How does statistical baseline log anomaly detection work?
Entity Extraction:
The IBM WebSphere Application Server is a flexible, security-rich Java server runtime environment for enterprise applications. Each WebSphere log contains a designated message ID, a log level, or both. We have prior knowledge of which message IDs and log levels are indicators of abnormal system behaviour. During the log data preparation stage, if the logs are identified as coming from WebSphere, these message IDs and log levels are extracted and processed to build the statistical baseline log anomaly detection WebSphere model.
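To illustrate this extraction step, here is a minimal sketch of pulling WebSphere-style message IDs and their severity suffixes out of a log line. The regular expression and the sample log line are illustrative assumptions, not the product's actual parser:

```python
import re

# WebSphere-style message IDs: a component prefix, four digits, and a
# severity suffix (I = informational, W = warning, E = error). This pattern
# is an assumption for this sketch, not the product's parsing logic.
MSG_ID = re.compile(r"\b([A-Z]{4,5}\d{4})([IWE])\b")

def extract_websphere_entities(line):
    """Return (message_id, severity) pairs found in one log line."""
    return [(m.group(1) + m.group(2), m.group(2)) for m in MSG_ID.finditer(line)]

line = "[3/1/22 10:05:12:345 UTC] 0000004c ThreadMonitor W WSVR0605W: Thread hung"
print(extract_websphere_entities(line))  # [('WSVR0605W', 'W')]
```

Because the severity suffix alone already signals whether the message is an error, such entities can be scored immediately, with no prior training window.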
For all other (non-WebSphere) types of logs, during the log data preparation stage, we extract error codes and exception types, and identify whether a log message indicates erroneous system behaviour based on symptoms or negative dictionaries. These error entities are extracted to build the entity-based statistical baseline log anomaly detection model. Some examples of such error entities are shown below:
- error code: "404", "500", "503"
- exception: "java.lang.IllegalStateException"
- error log message: "Starting thread to transfer block blk_-1649334 to 10.251.71.16:50010. The process is dead after 5 tries"
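The entity extraction for general logs can be sketched as below. The regular expressions and the small negative-word dictionary are assumptions made for this example, not the product's actual rules:

```python
import re

# Illustrative extraction of error entities from free-form log lines.
ERROR_CODE = re.compile(r"\b(4\d{2}|5\d{2})\b")          # HTTP-style error codes
EXCEPTION = re.compile(r"\b((?:\w+\.)+\w*Exception)\b")  # dotted Java exception names
NEGATIVE_WORDS = {"dead", "failed", "error", "timeout", "refused"}  # toy dictionary

def extract_error_entities(line):
    """Return (entities, is_error) for one log line."""
    entities = ERROR_CODE.findall(line) + EXCEPTION.findall(line)
    is_error = bool(entities) or bool(NEGATIVE_WORDS & set(line.lower().split()))
    return entities, is_error

print(extract_error_entities("GET /api returned 503 after retry"))
print(extract_error_entities("java.lang.IllegalStateException: pool closed"))
```

A line with no extractable entity can still be flagged as erroneous when it contains words from the negative dictionary, as in the "process is dead" example above.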
Additionally, during the log data preparation stage, embedding vectors representing the log messages are extracted from the logs to build the embedding-based statistical baseline log anomaly detection model.
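The article does not specify the embedding technique, so as a placeholder, the sketch below uses a toy hashed bag-of-words vector purely to show the shape of the pipeline; the dimensionality and hashing scheme are assumptions:

```python
import hashlib

# Toy stand-in for a log-message embedding: a hashed bag-of-words vector.
# The actual embedding method used by the product is not described here.
DIM = 16

def embed(message, dim=DIM):
    vec = [0.0] * dim
    tokens = message.lower().split()
    for tok in tokens:
        # Hash each token into one of `dim` buckets and count it.
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    n = len(tokens) or 1
    return [v / n for v in vec]  # normalize by token count

v = embed("Starting thread to transfer block")
print(len(v))  # 16
```

In practice a learned embedding would capture semantic similarity between messages; the point here is only that each message becomes a fixed-length vector whose distribution can be tracked statistically.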
Statistical baseline log anomaly detection model update
After the customer connects streaming logs to AI Manager in Cloud Pak for Watson AIOps, LAD starts running with an empty statistical baseline log anomaly detection model. The first statistical baseline log anomaly detection model should be ready in 30 minutes, built from the historical data within the previous 30 minutes. The model contains the statistical metrics of all the entities and embedding vectors, which reflect what the statistical distribution should look like during a normal operation period. These models are updated periodically by the LAD model updater, and they are computed cumulatively for all applications, so that the latest model reflects the statistical distribution across all periods seen so far.
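A cumulative per-entity baseline of this kind can be sketched with Welford's online mean/variance update, folding in one window of entity counts at a time. The window granularity and the choice of mean/variance as the tracked metrics are assumptions for this example:

```python
from collections import defaultdict

class Baseline:
    """Running per-entity mean/variance over successive log windows."""

    def __init__(self):
        self.n = 0                        # number of windows folded in so far
        self.mean = defaultdict(float)    # running mean count per entity
        self.m2 = defaultdict(float)      # running sum of squared deviations

    def update(self, window_counts):
        """Fold one window's {entity: count} map into the baseline."""
        self.n += 1
        for entity in set(self.mean) | set(window_counts):
            x = window_counts.get(entity, 0)
            d = x - self.mean[entity]
            self.mean[entity] += d / self.n
            self.m2[entity] += d * (x - self.mean[entity])

    def stats(self, entity):
        """Return (mean, variance) for an entity."""
        var = self.m2[entity] / self.n if self.n else 0.0
        return self.mean[entity], var

b = Baseline()
b.update({"503": 2})
b.update({"503": 4})
print(b.stats("503"))  # (3.0, 1.0)
```

Because the update is cumulative, no raw historical logs need to be retained: the latest model summarizes every window it has ever seen.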
Of course, we would like the statistical baseline log anomaly detection models to remember the statistical distribution of normal periods only. Thus, an automatic skipping rule is introduced in the LAD model updater. If too many anomalies were detected during the last period, that period is tagged as an incident period, and none of its logs are used for statistical baseline log anomaly detection model updates. This mechanism avoids biased statistical baseline log anomaly detection models without manual intervention.
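The skipping rule can be sketched as a simple gate in front of the model update; the 20% anomaly-rate threshold is an assumed value for illustration, not the product's setting:

```python
# Assumed incident threshold: skip the update when more than 20% of the
# windows in the last period were flagged anomalous.
INCIDENT_RATE = 0.2

def should_update(num_anomalies, num_windows, threshold=INCIDENT_RATE):
    """Return False when the last period looks like an incident period."""
    if num_windows == 0:
        return True  # nothing observed, nothing to skip
    return num_anomalies / num_windows < threshold

print(should_update(1, 30))   # quiet period -> fold logs into the baseline
print(should_update(12, 30))  # incident period -> skip, keep model unbiased
```

Gating the update this way keeps a burst of incident logs from teaching the baseline that error behaviour is "normal".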
Statistical baseline log anomaly detection
For WebSphere types of logs, LAD is able to predict even with the empty statistical baseline log anomaly detection model, since it knows with high confidence which message IDs and log levels indicate an error. For general types of logs, LAD is able to provide predictions once the first statistical baseline log anomaly detection model is ready. Because the predictions for general logs are based on both entities and word embedding vectors, if either the entities' distribution or the embeddings' distribution for the inference data shows a significant difference from that of the latest model, LAD sends alerts for this data.
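One simple way to score such a "significant difference" is a z-score of the inference window's entity counts against the baseline's mean and variance. The 3-sigma threshold and the variance floor for unseen entities are assumptions for this sketch:

```python
import math

def is_anomalous(window_counts, baseline_stats, z_threshold=3.0):
    """Alert when any entity's count deviates beyond z_threshold sigmas.

    baseline_stats maps entity -> (mean, variance), as a cumulative
    baseline model might provide.
    """
    for entity, count in window_counts.items():
        mean, var = baseline_stats.get(entity, (0.0, 0.0))
        std = math.sqrt(var) if var > 0 else 1.0  # floor for unseen variance
        if abs(count - mean) / std > z_threshold:
            return True
    return False

baseline = {"503": (1.0, 0.25)}  # mean 1 per window, std 0.5
print(is_anomalous({"503": 1}, baseline))   # within baseline -> False
print(is_anomalous({"503": 10}, baseline))  # spike -> True, send an alert
```

The same comparison can be applied to the embedding vectors' statistics, so that an alert fires if either signal drifts from the latest model.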
Conclusion
Training data is not always available or sufficient, and identifying whether the training data is contaminated can require significant human effort. It is therefore important to introduce an online algorithm in LAD that can provide predictive insights without off-line training data and gradually learn a system's normal behaviour. At the same time, it is vital for the LAD system to be smart enough to automatically identify incident periods and avoid potentially biased models.
All of these key features are now available by enabling statistical baseline log anomaly detection in LAD for Cloud Pak for Watson AIOps 3.2.
#CloudPakforWatsonAIOps#3.2.0#ProductCapabilities#loganamoly#newfeatures#cp4aiops