New Improvements to Log Anomaly Detection in Cloud Pak for Watson AIOps Release 3.2

By XIAOTONG LIU posted Wed January 19, 2022 02:05 PM


Authors: Xiaotong Liu, Rama Akkiraju

An Artificial Intelligence (AI) model’s lifecycle hardly ends when the initial version is first deployed. Each AI model must continuously improve over time by learning from its mistakes. In this article, we describe a few technical challenges in developing a Log Anomaly Detection (LAD) pipeline and how we address them in the IBM Cloud Pak® for Watson AIOps 3.2 release.

Reduced Time-to-value of Log Anomaly Detection

The sooner an AI model is ready to use, the better mean time to detect we can achieve. We are shifting from offline algorithms to online algorithms to minimize the data required for model training. Previously, we needed at least one week of log data, or one million log lines, to ensure the quality of the LAD models, and training could take 4 to 6 hours for over 100 million log lines. In the IBM Cloud Pak® for Watson AIOps 3.2 release, we introduce a new statistical baseline log anomaly detection offering that reduces the time-to-value of LAD from days and hours to as little as 30 minutes.
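To illustrate why an online algorithm needs so little data, here is a minimal sketch of a statistical baseline detector. This is illustrative only, not the product’s algorithm: it maintains a running mean and variance of a per-window statistic (say, error-log counts) with Welford’s update and flags windows whose z-score exceeds a threshold, so it can start alerting after a short warm-up instead of a multi-hour offline training run.

```python
import math

class OnlineBaseline:
    """Illustrative online statistical baseline: Welford running mean/variance
    plus a z-score check. A sketch, not the product's algorithm."""

    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0           # running sum of squared deviations
        self.threshold = threshold
        self.warmup = warmup    # windows to observe before alerting

    def update(self, value):
        """Ingest one per-window statistic (e.g. error-log count) and
        return True if it is anomalous versus the baseline so far."""
        anomalous = False
        if self.n >= self.warmup and self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                anomalous = True
        # Welford's update keeps the baseline current without storing history
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return anomalous
```

Because the state is just three numbers, the baseline adapts continuously as new windows arrive.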

Improved Processing of Complex Log Formats

Logs generated in the IT Operations domain differ from natural language texts, as the domain’s vocabulary is quite unique. For example, a log line can contain a mix of the date, the time, the pod ID, the logging level, the component where the system runs, and the content of the log message. Our LAD pipeline trains a log parser to convert the unstructured log message of a log line into a structured format, also known as a template. The count vector of such templates, aggregated within a fixed time interval, is one of the representative log features we use to train LAD models.
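The aggregation step above can be sketched in a few lines. This is a simplified illustration (the `template_id` strings here are hypothetical; in the pipeline they come from a trained log parser): timestamped template occurrences are bucketed into fixed windows, yielding one count vector per window.

```python
from collections import Counter

def template_count_vectors(events, window_seconds=60):
    """Aggregate (timestamp, template_id) pairs into one count vector per
    fixed time window -- the feature representation described above.
    Illustrative sketch; template_id would come from a trained log parser."""
    windows = {}
    for ts, template_id in events:
        bucket = int(ts // window_seconds)
        windows.setdefault(bucket, Counter())[template_id] += 1
    return windows

events = [
    (0,  "Connection from <*> port <*>"),
    (5,  "Connection from <*> port <*>"),
    (12, "Failed password for <*>"),
    (65, "Connection from <*> port <*>"),
]
vectors = template_count_vectors(events)
# window 0 holds two template types; window 1 holds one
```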

In practice, the quality of LAD models largely depends on how well log templates can be learned. A general-purpose log parsing algorithm may not be able to process log data in all formats. To meet our clients’ varying expectations on different data, we have introduced new capabilities to parse logs in any of the following special formats:

  1. XML format. If logs in XML format contain no whitespace, the whole log message was previously treated as a single token, resulting in poor template abstraction: each log line became its own template, which also added tremendous performance overhead when matching logs to templates. To address this, we added smart tokenization that handles the “<>” delimiters in XML so logs can be parsed more effectively and efficiently into informative templates. For example, see the XML snippet below, which has no spaces. Note also that it contains no unstructured data; in future releases we will address how to deal effectively with log messages that are fully structured.

"@message": "<context><Machine>NYTJX3092</Machine><Engine>43577</Engine><RunRequestId>5ndaf4098-4164-4233-93fd-8a9979b9a652</RunRequestId><JobId>88950c69-ea7a-4d62-af3e-062db2dbc207</JobId><PersistedJobId>98250c69-ea7a-4d62-af3e-062db2dbc207</PersistedJobId><Treename>QFlow</Treename><Workflowlevel>3</Workflowlevel><Metrics><Metric JobId=\"956450c69-ea7a-4d62-af3e-062db2dbc207\" source=\"Operator\" type=\"Timer\" name=\"Anonymous\" value=\"1734\" start=\"5249275442538947590\" end=\"5249688642556287590\" /></Metrics></context>"

  2. Nested JSON format. Log aggregators sometimes send the required “message” field buried as a JSON object within the main JSON (nested JSON). The general-purpose log parser might behave erroneously on such structured messages, ending up memorizing dates and hostnames as part of the templates, which it should not. To extract the actual message from nested JSON logs, we now let users customize a mapping that indicates the nested attribute name of the message, as opposed to assuming the original string message. The mapping also supports a mixed format, where the value of the message can be either a string or a JSON object. Below is an example of a nested JSON log in which an UnknownHostException is buried inside.

{ "_source": {
"_host": "orders-574bd5f458-8s67p",
"_logtype": "customapp",
"_tag": [ "k8s" ],
"_file": "/var/log/containers/orders-574bd5f458-8s67p_sockshop-fault_orders-556c2c2e18fae099c8485d0aaebdee61b7fa43bd29c63d2af115e20b23deb9d7.log",
"_line": "2021-09-23 22:22:35.535 WARN [orders,,,false] 7 --- [tion/x-thrift})] z.r.AsyncReporter$BoundedAsyncReporter : Dropped 20 spans due to UnknownHostException(zipkin)",
"_ts": 1632435755535,
"_app": "orders",
"pod": "orders-574bd5f458-8s67p",
"namespace": "sockshop-fault",
"containerid": "556c2c2e18fae099c8485d0aaebdee61b7fa43bd29c63d2af115e20b23deb9d7",
"node": "kube-c2r6nc1w0jbgqmgdk60g-aieffective-default-0000014a",
"_ip": "",
"_ipremote": "yy.yyy.yy.yyy",
"level": "WARN",
"message": "[orders,,,false] 7 --- [tion/x-thrift})] z.r.AsyncReporter$BoundedAsyncReporter : Dropped 20 spans due to UnknownHostException(zipkin)" }

  3. Various date and time formats. Inside the log message, date and time are often logged in different formats. The log parser could fail to recognize the whole datetime string as a parameter and instead keep partial dates in the templates, leading to date-time memorization and failures in matching templates to new logs. To avoid this issue, we introduced new natural language processing rules to extract date and time in various formats so they are represented correctly in the final templates. For example, one problem was that DateFormatter did not handle datetimes with six fractional-second digits, as it expected three. In 3.2, DateFormatter is more robust and ignores fractional-second digits beyond the first three.

"start_time='2021-09-18T22:35:16.576677Z'," vs. "date":"2021-09-18T22:48:43.982+0000"
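One way to make a date parser robust to both examples above is to truncate the fractional seconds to three digits before parsing. The sketch below is illustrative (the function name and format list are our own, not the product’s DateFormatter), but it shows the idea of accepting six-digit fractions alongside offset-style timestamps.

```python
import re
from datetime import datetime

def parse_timestamp(value):
    """Parse ISO-like timestamps whose fractional seconds may have more than
    three digits by truncating to milliseconds first -- a sketch of the
    DateFormatter hardening described above, not the product code."""
    # Keep at most three fractional-second digits.
    value = re.sub(r"(\.\d{3})\d+", r"\1", value)
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%S.%f%z"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value}")

parse_timestamp("2021-09-18T22:35:16.576677Z")   # six digits, truncated to .576
parse_timestamp("2021-09-18T22:48:43.982+0000")  # three digits with UTC offset
```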

Language Support for Log Anomaly Detection

The process of language enablement for an AI model is far more involved than globalization (translating the user interface). It covers everything from data gathering and data preparation to AI training, language-specific natural language processing and dictionaries, and a means of running language-specific instances of the same AI model. In addition to English, we added German and Spanish language support for LAD: log data in German or Spanish can be processed and used to train LAD models, which can then make anomaly predictions on new log data in the target language.


Noise Reduction: Reduced False Alarm Rate

Developing and delivering an AI model that is accurate, reliable, and unbiased is undeniably a challenge. The model needs to work across various test cases, datasets, and environments. To reduce the false-alarm rate while preserving an accurate true-anomaly detection rate, we took the following approaches:

  1. Fine-tuning individual models and prediction aggregation. The LAD pipeline now enables both natural language models and statistical baseline models by default and makes an ensemble decision. Each set of models can be enabled or disabled independently, and each individual model is fine-tuned to balance precision and recall (two complementary metrics of model quality) with respect to the ensemble decision.

  2. Optimizing alert severity rules and thresholds. Severity levels (SEV-k) measure the impact an incident has on the business. Typically, the lower the severity number, the more impactful the incident: a SEV-1 incident is a critical incident with very high impact; a SEV-2 incident is a major incident with significant impact; a SEV-3 incident is a minor incident with low impact; and a SEV-4 incident is usually an alert that irritates a customer but does not impact overall system function. We adjust the severity levels of detected anomalies based on the error information available in the logs and on which LAD models are triggered, so that only high-severity anomalies alert site reliability engineers.

  3. Handling user load variation. LAD is designed to catch statistically significant differences between expected and actual log representation vectors. We introduced vector normalization within each time window to account for differences in log volume due to user load variance, so changes in the number of users do not trigger false alarms.
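The normalization idea can be sketched in a few lines. This is a minimal illustration, not the product’s implementation: dividing each template count by the window total turns the count vector into relative frequencies, so a uniform increase in log volume (for example, ten times as many users producing the same mix of messages) leaves the representation unchanged.

```python
def normalize_counts(count_vector):
    """Convert a per-window template count vector into relative frequencies
    (L1 normalization) so a uniform change in log volume -- e.g. more
    users -- leaves the representation unchanged. Illustrative sketch."""
    total = sum(count_vector.values())
    if total == 0:
        return {k: 0.0 for k in count_vector}
    return {k: v / total for k, v in count_vector.items()}

quiet = {"login ok": 90, "cache miss": 10}
busy  = {"login ok": 900, "cache miss": 100}  # 10x user load, same mix
normalize_counts(quiet) == normalize_counts(busy)  # -> True: no false alarm
```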


Future Work

While we made significant progress in delivering a more robust and reliable LAD model in the recent release, the journey of continuous improvement of our LAD pipeline never stops. Here are a few things we are actively working on for future iterations.

  • Enrich the ensemble of models so that different LAD algorithms work hand in hand to further improve model accuracy.

  • Expose precision-recall tradeoff knobs to improve the transparency of LAD algorithms and build trust in the AI.

  • Improve log comprehension so better representation can be learned from a mixture of formats.

  • Seek and leverage user feedback to fine-tune individual LAD models and prediction aggregation.

  • Customize our models to handle seasonality of log volumes and maintenance windows better.

  • Differentiate anomalies, alerts, and incidents to tell a meaningful incident story.

  • Correlate alerts with golden signals, service level objectives, and error budgets to better separate alerts from incidents and to improve incident prediction.

Stay tuned for exciting additional improvements we are working on for the next iterations of IBM Cloud Pak® for Watson AIOps.