By
Rama Akkiraju, IBM Fellow, CTO AIOps,
Xiaotong Liu, Senior Data Scientist, Manager, AIOps
Collaborators
Mudhakar Srivatsa, Amitkumar Paradkar, Prateeti Mohapatra, Jae-Wook Ahn, Sarasi Lalithsena, Meenakshi Madugula, Neil Boyette, Jiayun Zhao, Gargi Dasgupta, Karan Karuppiah, and Rakesh Mohan.
Following our recent posts on Why Logs are Important [1] and Why Log Parsing and Processing are Hard [2], in this blog we offer strategies for parsing logs so that anomalies can be detected effectively during IT operations management.
Introduction
Information Technology (IT) logs are events generated by software systems as programs execute in production environments; they are used for problem detection and diagnosis in IT operations management. Logs contain information about errors, exceptions, warnings, informational events, and other diagnostic details. Logs are semi-structured, machine-generated data: they come in many formats, structures, and languages, and in large volumes. These multi-dimensional attributes pose many challenges for parsing and processing logs. However, because logs contain valuable diagnostic information, it is important to mine them for insights.
Until recently, IT operations administrators and Site Reliability Engineers (SREs) have searched logs manually for diagnostic information using text search strings. Log analysis tools and products help by aggregating logs, enabling search via proprietary query languages, and letting users write custom rules that trigger events when specific thresholds are exceeded; however, those rules tend to be static and require ongoing maintenance.
Log anomaly detection (LAD) aims to detect anomalous behavior in IT logs automatically using Machine Learning (ML) techniques. More and more tool vendors are starting to incorporate ML-based anomaly detection in their log analysis and AIOps products. However, log anomaly detection is a hard problem. It requires parsing logs of arbitrary formats, extracting meaningful information and entities from those logs, training ML models to learn normal log patterns so that deviations can be detected, and explaining the detected anomalies. While techniques such as Drain [3] have been widely used for log parsing and feature generation for downstream log anomaly detection, they do not take the variables in logs into account (constants and variables in logs are described later in this article). Ignoring the variables in logs misses important diagnostic information and thereby degrades the quality of downstream anomaly detection. For complete log comprehension, both the constants and the variables in logs must be harvested.
In this article, we present some state-of-the-art approaches to deriving insights from logs. We are actively exploring some of these techniques in our labs for potential inclusion in future releases of Cloud Pak for Watson AIOps. However, please note that this is not a product roadmap article. Our primary goal is to suggest future directions for log analysis that enable better anomaly detection in IT operations environments.
Log format recognition and parsing: Logs can be written in plain text, XML, JSON, plain old Java object (POJO) representations, or other formats. Sometimes logs have JSON embedded within XML and vice versa. This complicates log processing significantly, because the entities needed to prepare features get buried inside arbitrary levels of nesting. To address this, one must build parsers that recognize and parse standard formats such as XML, JSON, plain text, and POJO representations in logs so that the key entities can be extracted from them properly.
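To make this concrete, here is a minimal Python sketch of top-level format detection for a single logline. It is an illustration under simplifying assumptions (it does not unwrap JSON embedded inside XML or vice versa), and the function name detect_format is ours, not a product API.

```python
import json
import xml.etree.ElementTree as ET

def detect_format(line: str) -> str:
    """Guess whether a single logline is JSON, XML, or plain text."""
    stripped = line.strip()
    # Many structured loggers emit one JSON object or array per line.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Fall back to trying XML.
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            pass
    return "plain_text"

print(detect_format('{"level": "ERROR", "msg": "disk full"}'))   # json
print(detect_format('<event level="ERROR">disk full</event>'))   # xml
print(detect_format('2022-03-01 12:00:01 ERROR disk full'))      # plain_text
```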
Format recognition of well-known logs: Standard middleware products, operating systems, and infrastructure components have well-defined, published log schemas. For example, Apache, Syslog (including Linux and network-vendor variants), MongoDB, WebSphere, Redis, Elasticsearch, Db2, and others have well-defined log formats. When these formats are recognized, the known schema definitions can be followed to extract the specific entities of interest. In Cloud Pak for Watson AIOps, as of version 3.2, WebSphere logs are processed out of the box, and more formats are on the way.
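As a sketch of schema-driven extraction, the snippet below parses the widely documented Apache common log format with a single regular expression. A production parser would cover the full family of variants (the combined format, custom LogFormat directives, and so on); this is only an illustration.

```python
import re

# Apache "common log format": host, identity, user, timestamp, request, status, size.
APACHE_COMMON = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_apache_common(line: str):
    """Return a dict of named fields if the line matches the schema, else None."""
    match = APACHE_COMMON.match(line)
    return match.groupdict() if match else None

print(parse_apache_common(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'))
```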
Tokenization: Logs can be written in many natural languages, depending on who wrote the software program (e.g., English, German, Italian, Spanish, Japanese). This means that the Natural Language Processing (NLP) software used to process log messages must be able to detect the language and parse it accordingly, with suitable tokenizers and dependency parsers, to extract features.
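The sketch below shows one way this could look: detect the message language, then pick a tokenizer accordingly. It assumes the third-party langdetect package is available; a real pipeline would plug in language-specific tokenizers (for example, Japanese text cannot be split on whitespace).

```python
import re
from langdetect import detect  # third-party: pip install langdetect

def tokenize_message(message: str):
    """Detect the language of a log message, then apply a language-appropriate tokenizer."""
    lang = detect(message)  # e.g., 'en', 'de', 'ja'
    if lang in ("en", "de", "it", "es"):
        # A whitespace/punctuation split is a reasonable first cut for these languages.
        tokens = re.findall(r"[\w.\-:/]+", message)
    else:
        # Placeholder: languages without whitespace word boundaries need a dedicated tokenizer.
        tokens = list(message)
    return lang, tokens

print(tokenize_message("Connection to database refused after 3 retries"))
```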
Entity Recognition: One aspect of log parsing is identifying entities such as IP addresses, port numbers, date-time stamps, and UUIDs that occur frequently in logs. These can pose significant parsing challenges. Regular expressions, shallow semantic parsing, and dictionaries come in handy for identifying entities accurately. Depending on the format of the logs, applying a suitable entity recognition technique is critical to extracting entities correctly. Separate entity recognizers may have to be built for each kind of entity: the entity type is first identified or classified, and then a suitable extractor is applied. For example, in our prior article 'Why Log Parsing and Processing are hard' [2], we presented several examples of date-time stamp format variations; a date-time stamp recognizer must deal specifically with those variations. Similarly, one needs recognizers for IP addresses, port numbers, and message codes, as each is an entity in its own right. While some entities come in multiple formats, others tend to be more standardized. In either case, specialized entity recognizers enable accurate entity extraction for downstream processing.
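Here is a minimal, hedged sketch of regex-based recognizers for a few of the entity types mentioned above. The patterns are deliberately simple (one timestamp style, IPv4 only); real recognizers would need many more patterns, and more specific patterns should be matched before generic ones.

```python
import re

# Illustrative recognizers only: one timestamp style, IPv4 only, canonical UUIDs.
ENTITY_PATTERNS = {
    "ipv4":      re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "uuid":      re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                            r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"),
    "timestamp": re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b"),
}

def recognize_entities(logline: str):
    """Return (entity_type, value) pairs found in the logline."""
    found = []
    for entity_type, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(logline):
            found.append((entity_type, match.group()))
    return found

print(recognize_entities(
    "2022-03-01T12:00:01 request 9f1c2d3e-4b5a-6c7d-8e9f-0a1b2c3d4e5f from 10.0.0.12 took 873 ms"))
```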
Log Enrichment: Each logline can be enriched with metadata that is useful for downstream tasks such as anomaly detection. An example of enrichment is classifying a logline as error, exception, informational, latency-related, saturation-related, traffic-related, and so on. Supervised machine learning algorithms are often employed for this classification, which requires collecting enough labeled log data; a Site Reliability Engineer (SRE) acting as a subject matter expert (SME) typically labels the data. We consider such activities human-in-the-AI-loop activities. Rule-based approaches are an alternative: they do not need labeled data, but SMEs must specify which entities to look for and under what conditions a logline is classified as an error, exception, informational, and so on. Each approach comes with its own pros and cons, and the choice depends on the domain and on the availability of labeled data or SME time.
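A minimal rule-based enrichment sketch is shown below. The categories and keyword rules are illustrative assumptions, not a product rule set; in practice an SME would curate and maintain them (or a supervised classifier would replace them where labeled data exists).

```python
# Rules are checked in order; the first matching category wins.
ENRICHMENT_RULES = [
    ("exception",  ("exception", "traceback", "stack trace")),
    ("error",      ("error", "fatal", "failed", "refused")),
    ("latency",    ("timeout", "timed out", "slow", "latency")),
    ("saturation", ("out of memory", "disk full", "queue full", "too many open files")),
    ("warning",    ("warn",)),
]

def enrich(logline: str) -> str:
    """Classify a logline into a coarse category using keyword rules."""
    text = logline.lower()
    for category, keywords in ENRICHMENT_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "informational"

print(enrich("ERROR: connection to 10.0.0.12 refused"))   # error
print(enrich("GC pause completed in 12 ms"))              # informational
```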
Log Templatization: Log templatization is about clustering similar loglines together and assigning them a template/group ID. One popular algorithm for log parsing and templatization is the Drain algorithm [3], which employs a fixed-depth tree parsing approach. When Drain forms a log template, constants are retained and variables are ignored. For example, in the logline "received block blk_ID_2345987 of size 89456873 from 10.432.34.12", the block ID blk_ID_2345987 and the block size 89456873 are variables, while the phrases 'received block', 'of size', and 'from' are constants. When this logline gets templatized, it looks like "received block <*> of size <*> from <*>". A more sophisticated version of templatization would recognize the entity types of the variables and templatize it as "received block <ID> of size <NUM> from <IP Address>", paving the way for better explanations. The counts of these templates, known as count vectors, become features for downstream anomaly detection: when loglines of similar kinds arrive, they are grouped together, and the counts of the log templates can be used as vectors in time-series algorithms for anomaly detection.

While the Drain algorithm works well overall, it fails to consider the critical information that may be contained in the variables of logs. The variability in the variables carries useful diagnostic information. In the logline above, if the block size varies beyond its normal range, that could be an indication of an anomaly; ignoring the variable misses that signal. Therefore, other techniques have to be developed to derive structured features from unstructured loglines, wherein the pattern variations of the variables are also considered, in addition to the count-vector features that consider constants alone, for the downstream task of anomaly detection. In a full log parsing approach that considers both constants and variables, key-value pairs are extracted from the variables in loglines. For example, 'block-ID' would be a key and 'blk_ID_2345987' its value; similarly, 'block size' would be a key and '89456873' its value. These key-value pairs can then be stored in databases for efficient downstream anomaly detection, while the constant part of the log templates can be stored separately so that full loglines can be reconstructed when needed for audits or explanations.
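The sketch below illustrates the fuller parsing approach described above (it is not the Drain algorithm itself): recognized variables are replaced with typed placeholders to form the template, and their values are kept as key-value pairs instead of being discarded. The blk_ID naming convention and the placeholder labels are assumptions taken from the example logline.

```python
import re

# Patterns are applied in order, from most to least specific.
VARIABLE_PATTERNS = [
    ("<IP Address>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("<ID>",         re.compile(r"\bblk_ID_\d+\b")),   # assumed block-ID convention
    ("<NUM>",        re.compile(r"\b\d+\b")),
]

def templatize(logline: str):
    """Return (template, variables): typed-placeholder template plus extracted key-value pairs."""
    template, variables = logline, {}
    for placeholder, pattern in VARIABLE_PATTERNS:
        for i, value in enumerate(pattern.findall(template)):
            variables[f"{placeholder}_{i}"] = value
        template = pattern.sub(placeholder, template)
    return template, variables

template, variables = templatize("received block blk_ID_2345987 of size 89456873 from 10.432.34.12")
print(template)   # received block <ID> of size <NUM> from <IP Address>
print(variables)  # {'<IP Address>_0': '10.432.34.12', '<ID>_0': 'blk_ID_2345987', '<NUM>_0': '89456873'}
```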
Feature Extraction: Feature extraction is the process of deriving structured features from unstructured logs. Different kinds of features can be extracted from logs. Some of them are noted below.
- Word embeddings: Extract word embeddings for the natural language words in a logline. The sequences of these word embeddings form time-series data.
- Count vectors of log templates: The counts of the templates of each kind form the feature vectors. The templatization process is discussed in the 'Log Templatization' section above; a minimal sketch of building count vectors over time windows follows this list.
- The variables in loglines: The variables captured as values of the key-value pairs extracted using natural language processing techniques, as described in the 'Log Templatization' section of this article.
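As referenced above, here is a minimal sketch of count-vector features: counting how often each log template occurs in fixed time windows. The template IDs are assumed to come from a templatization step like the one sketched earlier, and the 60-second window size is an arbitrary assumption.

```python
from collections import Counter, defaultdict
from datetime import datetime

WINDOW_SECONDS = 60  # assumed window size

def count_vectors(parsed_logs):
    """parsed_logs: iterable of (timestamp: datetime, template_id: str) pairs."""
    windows = defaultdict(Counter)
    for ts, template_id in parsed_logs:
        window_start = int(ts.timestamp()) // WINDOW_SECONDS * WINDOW_SECONDS
        windows[window_start][template_id] += 1
    return windows

logs = [
    (datetime(2022, 3, 1, 12, 0, 5),  "received block <ID> of size <NUM> from <IP Address>"),
    (datetime(2022, 3, 1, 12, 0, 40), "received block <ID> of size <NUM> from <IP Address>"),
    (datetime(2022, 3, 1, 12, 1, 10), "connection to <IP Address> refused"),
]
for window, counts in sorted(count_vectors(logs).items()):
    print(window, dict(counts))  # one count vector per 60-second window
```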
Human inputs for data type identification in log parsing: How does the system know the correct data type of a variable so that the right algorithm can be applied to detect anomalies? Is the variable an IP address, a queue length, a date-time stamp, an HTTP status code, a byte count, or something else? Knowing the data type helps in applying appropriate algorithms for detecting variations in that variable. For example, one can apply metric-based anomaly detection using z-score variance for variables such as queue lengths and byte counts; for a counter-type variable, taking the first derivative followed by a z-score might be more appropriate; and for HTTP status codes and IP addresses, pre-canned regular expressions can be used for exact matches. Therefore, to apply the right kind of algorithm, one must know the data type of the variable. This can be achieved either automatically or with human input, and it needs to be done only once per log template: once the data type is identified, either automatically or with human guidance, this information can be stored and appropriate algorithms can be selected for anomaly detection. In our experience, automatically detecting the data type of every variable can be hard and may require too much log and system context. This is a case where asking a human for input is far more efficient than trying to guess every data type automatically. Therefore, it is critical to provide appropriate user interfaces for taking user inputs during log parsing.
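Below is a minimal sketch of data-type-driven dispatch. The type labels, thresholds, and the choice of a z-score check are assumptions for illustration; the point is only that knowing the type (gauge, counter, categorical code) determines which check runs.

```python
import statistics

# The type labels and the 3-sigma threshold below are illustrative assumptions;
# in practice the type of each variable is confirmed once per log template.
def zscore_anomalies(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold."""
    mean, stdev = statistics.mean(values), statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

def detect(variable_type, values):
    if variable_type in ("gauge", "queue_length", "byte_count"):
        return zscore_anomalies(values)                             # level-based check
    if variable_type == "counter":
        deltas = [b - a for a, b in zip(values, values[1:])]
        return zscore_anomalies(deltas)                             # rate-of-change check
    if variable_type == "http_status":
        return [i for i, v in enumerate(values) if int(v) >= 500]   # categorical/exact check
    raise ValueError(f"unknown variable type: {variable_type}")

print(detect("byte_count", [100, 98, 102, 101, 99, 100, 97, 103, 100, 99, 5000, 100]))  # [10]
print(detect("http_status", ["200", "200", "503", "200"]))                              # [2]
```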
Once the log features are identified, these features can now be used for anomaly detection.
Detecting anomalies from log features
Any variation of these identified features from the learned normal ranges is considered an anomaly. Since log data is real-time streaming data, time-series algorithms are typically used for anomaly detection. Anomaly detection can be applied to a single variable/feature (univariate) or to multiple variables/features at once (multi-variate). Below we list several algorithms that can be applied to the time-series features.
1. Univariate Time-series Algorithms: Apply algorithms such as Robust Bounds, Flat Line, Variant/Invariant, Granger, Finite Domain, Predominant Range, and Discrete Values. Several of these algorithms are already implemented in IBM's Cloud Pak for Watson AIOps; they are detailed further in [4].
2. Multi-variate Algorithms: Principal Component Analysis (PCA), the real-time statistical algorithm [5], or deep learning algorithms such as LSTMs can be applied to the derived features to detect anomalies. More details on how these algorithms are implemented in Cloud Pak for Watson AIOps are available in [6]. A minimal PCA-based sketch follows this list.
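As referenced above, here is a minimal PCA-based sketch (not the product implementation): fit PCA on count-vector feature windows that are assumed to be mostly normal, then flag windows whose reconstruction error far exceeds what was seen during fitting. It assumes NumPy and scikit-learn are available; the synthetic data and the 99th-percentile cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.poisson(lam=[20, 5, 3], size=(200, 3)).astype(float)  # 200 normal windows, 3 templates
anomalous = np.array([[20.0, 5.0, 60.0]])                          # the third template suddenly spikes

pca = PCA(n_components=2).fit(normal)

def reconstruction_error(X):
    """Distance between each window and its projection onto the learned subspace."""
    reconstructed = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - reconstructed, axis=1)

threshold = np.percentile(reconstruction_error(normal), 99)  # assumed cutoff
print(reconstruction_error(anomalous) > threshold)           # expected: [ True]
```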
Strategies for dealing with log volumes
IT logs are expensive to process and can impose significant infrastructure requirements if the volume of logs is large (e.g., terabytes of data per day). Given that log processing is infrastructure-intensive, we suggest that companies be judicious about which application stacks they monitor, how many, and how they start and expand their log-based insights. Below, we list some best practices for IT operations managers for leveraging logs in IT operations management in a way that makes economic and business sense.
1. Monitor IT logs of your critical customer-facing applications (tier-1) in real-time for proactive incident detection: Tier-1 applications are typically the most critical, so real-time log monitoring is recommended for them. This means allocating the infrastructure required to process their logs in real-time.
2. Monitor IT logs of your non-critical applications (tier-2) for asynchronous, near real-time proactive incident detection: For tier-2 applications, near real-time detection of anomalies might be good enough. That is, if an issue were to occur, it might be acceptable to be notified of it within a few minutes rather than within seconds.
3. Leverage logs for incident diagnosis and explanation for non-critical internal applications (tier-3): For tier-3 applications, instead of trying to proactively detect anomalies from terabytes of logs continuously, it might be advisable to use logs in a diagnosis use case. In the diagnosis use case, the log anomaly detector is invoked only for the subset of resources in the application stack identified as a probable cause of an already detected incident. The incident may have been detected by metric monitoring or application performance monitoring (APM) systems. In this setup, a select set of logs corresponding to a specific time window around the incident is analyzed to detect anomalous patterns. The log anomaly detector does not take on the burden of detecting anomalies and incidents in real-time on all resources in every application context; logs are primarily analyzed to diagnose the source of an already detected incident. This significantly reduces the infrastructure investment required to process logs across all application and infrastructure stacks while still allowing SREs and CIOs to derive insights from logs for better issue diagnosis and resolution.
Conclusions
IT logs are an important source of information in IT operations management. However, deriving insights from logs is a hard problem because logs are often not standardized, come in many formats, and are voluminous. As a follow-up to our prior article in which we discussed what makes log parsing hard [2], in this article we presented some approaches for parsing logs effectively and preparing structured features from them. These features then form the basis for downstream tasks such as anomaly detection, prediction, diagnosis, and incident explanation. We also discussed some best practices for IT operations managers for dealing with large volumes of logs so that they can harvest insights from logs in an economically viable manner.
References
1. [Ganti R. et al 2021] Why Logs are Important? https://community.ibm.com/community/user/aiops/blogs/raghu-kiran-ganti1/2021/11/30/why-logs-are-important?CommunityKey=6e6a9ff2-b532-4fde-8011-92c922b61214
2. [Akkiraju et al 2022] Why is log parsing and processing hard? https://medium.com/ibm-cloud/why-is-log-parsing-and-processing-hard-1e72bac55712
3. [He P. et al 2017] He P., Zhu J., Zheng Z., Lyu M. R. Drain: An Online Log Parsing Approach with Fixed Depth Tree. 2017 IEEE 24th International Conference on Web Services (ICWS). https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf
4. [IBM Operational Analytics Predictive Insights Documentation] Time-series algorithms in IBM’s Metric Anomaly Detection Component: https://www.ibm.com/docs/en/oapi/1.3.6?topic=concepts-algorithms
5. [Lu A. et al 2022] Lu An, An-Jie Tu, Xiaotong Liu, Rama Akkiraju. Real-time Statistical Log Anomaly Detection with Continuous AIOps Learning. In Proceedings of the International Conference on Cloud Computing and Services Science (CLOSER), 2022.
6. [Xiaotong et al 2022] AIOps Explained — Log Anomaly Detection: https://www.youtube.com/watch?v=DWkFMWi3GHY
#AIOps #CloudPakforWatsonAIOps #ITOperations #LogAnomaly #LogAnalysis #ingest