Authors: Xiaotong Liu, Rama Akkiraju
Following the improvements to Log Anomaly Detection in the previous release of Cloud Pak for Watson AIOps (see details here), we have further enhanced this feature with the additional functionality outlined below.
Training Experience Enhancement
In IBM Cloud Pak® for Watson AIOps 3.3, users can now select appropriate training data and exclude data from incident periods with finer control than whole days. For example, a model can now be trained on data within a single day that contains one or more anomaly windows; on less than a day of data that spans two calendar days, with anomaly windows in each; or on data covering multiple days, starting and ending at specific times, with anomaly windows in several of those days. Normal and abnormal windows can be specified at the granularity of days, hours, and minutes.
We have also made training progress more transparent by providing status callbacks during pre-check, training, and post-check in the back end, so stalled jobs can be cancelled earlier and restarted with better criteria.
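To make the idea of excluding anomaly windows concrete, here is a minimal Python sketch of how a training range with minute-level anomaly windows could be turned into the list of "normal" intervals. It is an illustration only, not the product's implementation; the function name `training_windows` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def training_windows(start, end, anomaly_windows):
    """Return the parts of [start, end) not covered by any anomaly window.

    Hypothetical helper for illustration; each anomaly window is a
    (window_start, window_end) pair of datetime objects.
    """
    normal = []
    cursor = start
    for a_start, a_end in sorted(anomaly_windows):
        # Clip the anomaly window to the requested training range.
        a_start = max(a_start, start)
        a_end = min(a_end, end)
        if a_start >= a_end:
            continue  # window lies entirely outside the training range
        if a_start > cursor:
            normal.append((cursor, a_start))
        # Advance past the window; max() handles overlapping windows.
        cursor = max(cursor, a_end)
    if cursor < end:
        normal.append((cursor, end))
    return normal


# Less than a day of data spanning two days, with an anomaly window in each.
start = datetime(2022, 3, 1, 22, 0)
end = datetime(2022, 3, 2, 6, 0)
anomalies = [
    (datetime(2022, 3, 2, 2, 15), datetime(2022, 3, 2, 3, 0)),
    (datetime(2022, 3, 1, 23, 0), datetime(2022, 3, 1, 23, 30)),
]
for window_start, window_end in training_windows(start, end, anomalies):
    print(window_start, "->", window_end)
```

The sketch sorts the anomaly windows and sweeps a cursor across the range, so overlapping or out-of-range windows do not produce empty or duplicated training intervals.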
False Alarm Reduction
To avoid noise in ChatOps or the Web Console, we have improved the log anomaly pipelines in the following ways:
- Better Feature Learning for Natural Language Models. An optimized Drain template learning algorithm reduced the number of “catch-all templates” and produced more diverse templates. The number of unknown templates during log anomaly model training was also reduced by the improved fuzzy clustering algorithm with multi-regex matching.
- Better Log Processing. An enhanced log comprehension feature was added to parse special log formats with flexible options, such as flattening JSON objects in logs by removing opening and closing braces or extracting and filtering out the JSON objects in logs.
- Better Decision Making on Alerting. False alarms caused by boundary issues were identified and suppressed via an extended expected range within the log anomaly detector.
- Real-time Statistical Model Tuning. The confidence threshold was tuned to make the RSM-Embedding model less chatty than in the previous release, which helped reduce false alarms at high severity levels.
- Human in the Loop. Through user-defined policies, users can now control which alerts detected by the log anomaly pipeline are sent to an SRE as incidents.
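As an illustration of the two JSON-handling options mentioned under Better Log Processing, the following Python sketch shows one way a log line could either be flattened by removing the braces of embedded JSON objects, or have those objects extracted and filtered out. This is an assumption-laden sketch, not the product's parser; the function names `find_json_spans`, `flatten_json`, and `filter_json` and the sample log line are hypothetical.

```python
import json


def find_json_spans(line):
    """Return (start, end) index pairs of JSON objects embedded in a log line."""
    decoder = json.JSONDecoder()
    spans = []
    i = line.find("{")
    while i != -1:
        try:
            # raw_decode parses one JSON value starting at index i and
            # returns the index just past it.
            _, end = decoder.raw_decode(line, i)
        except json.JSONDecodeError:
            i = line.find("{", i + 1)  # not valid JSON here; keep scanning
            continue
        spans.append((i, end))
        i = line.find("{", end)
    return spans


def flatten_json(line):
    """Option 1: keep the JSON content but drop opening and closing braces."""
    out, pos = [], 0
    for start, end in find_json_spans(line):
        out.append(line[pos:start])
        # Simplification: braces inside string values are removed as well.
        out.append(line[start:end].replace("{", "").replace("}", ""))
        pos = end
    out.append(line[pos:])
    return "".join(out)


def filter_json(line):
    """Option 2: extract the JSON objects and remove them from the text."""
    spans = find_json_spans(line)
    objects = [json.loads(line[s:e]) for s, e in spans]
    out, pos = [], 0
    for start, end in spans:
        out.append(line[pos:start])
        pos = end
    out.append(line[pos:])
    # Collapse the whitespace left behind by the removed objects.
    return " ".join("".join(out).split()), objects


sample = '2022-03-01 12:00:00 ERROR request failed {"code": 500, "detail": {"retry": false}} on node-1'
print(flatten_json(sample))
print(filter_json(sample))
```

Flattening keeps the key-value text available for template learning, while filtering leaves only the free-text message and returns the parsed objects separately.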
One of the key differentiators in IBM Cloud Pak® for Watson AIOps is its AI capabilities. Here are a few directions we aim to double down on in the next iterations of IBM Cloud Pak® for Watson AIOps:
- Scale up the log anomaly inference pipeline to process customer logs of very large volumes.
- Make log anomaly alerts more explainable, understandable, and actionable.
- Enhance the user experience for out-of-the-box WebSphere logs.
- Optimize the time to value for Log Anomaly Detection models via automatic training data selection.
- Enable log anomaly detection for more languages.
- Optimize our pipeline for high availability, backup, and recovery.
- Provide users the ability to promote alerts to incidents based on the defined criticality of a resource group or application.