Why Log Parsing and Processing Are Hard
By
Rama Akkiraju, IBM Fellow, CTO AIOps
Xiaotong Liu, Manager, Senior Data Scientist, AIOps
Following our recent post, Why Are Logs Important? [1], this blog explores why logs are hard to parse and process.
Information technology (IT) logs are events written by software systems during the execution of a program. Logs contain information about errors, exceptions, warnings, and informational events, as well as other diagnostic details such as database query statements and the time at which an event occurred. Logs are useful for detecting and diagnosing problems with IT systems in IT operations environments.
Log anomaly detection (LAD) aims to detect anomalous behavior in the logs produced by IT systems. Log parsing extracts features from raw logs and typically serves as the first step toward downstream log analysis tasks such as log templatization, log clustering, and anomaly detection. However, log parsing and processing are not easy.
In this article, we illustrate various aspects of logs that make parsing hard. In the follow-on article, we present approaches for parsing and processing logs to derive useful insights in an economically viable manner.
Log parsing – why so hard?
IT application and system logs are semi-structured, machine-generated data. They come in many formats, structures, and languages, and in large volumes. These multi-dimensional attributes pose many challenges for parsing and processing. In addition, when many business applications and systems must be monitored in real time for anomalies and application performance, logs become expensive to process; at large volumes, they impose significant infrastructure requirements.
To better understand the complexities of log parsing, let us look at some of the challenges that arise from the structural and format variations in logs:
- Semi-structured: Logs typically have structured and unstructured portions. Structured portions may include timestamps, the name of the resource writing the log, hostnames, IP addresses, and error and warning codes. Unstructured portions may include the free-text log message and query statements. The formats of logs are often not well documented, making it necessary to apply natural language processing (NLP) techniques to detect key entities. Extracting specific entity mentions such as hostnames, IP addresses, resource names, timestamps, and error and exception codes, which often carry useful diagnostic information, requires a reasonable semantic understanding of log messages so that features can be derived for downstream tasks such as anomaly detection and incident explanation. The examples under the challenges below contain both structured and unstructured portions; a minimal entity-extraction sketch appears after this list.
- Varied formats: Logs can be written in plain text, XML, JSON, plain old Java object (POJO) representations, or any other format. Sometimes logs have JSON embedded within XML and vice versa. This complicates the processing of logs significantly, especially when XML and JSON structures are embedded inside each other: the entities needed to prepare features become harder to reach as they get buried inside arbitrary levels of nesting. (A sketch for unwrapping one such nested payload appears after this list.)
- Log as an XML: <XML><DATA>Connecting to outbound queueSOLQXYZ.beo11.LIVE2</DATA></XML>
- Log as a JSON: _line="{\"date\"=\"20210805T1503\", \"message\"=\"Processed the order in 23ms.\"}"
- Date-time stamp variations: Logs often do not follow standardized formats for printing date-time stamps. Some programmers choose to write dates and times in custom formats. Also, for software that is part of custom-written or proprietary legacy systems, date-time stamp formats can differ significantly from the standard formats used by packaged software (e.g., middleware products such as Db2, Oracle DB, and SAP software). Some date-time stamp format variations are listed below; a sketch that tries several candidate formats in turn appears after this list. These variations pose interesting challenges for NLP software that detects entities such as dates and times.
- "127.0.0.6 - - [15/Sep/2021:10:48:30 +0000]"
- "start_time='2021-09-18T22:35:16.576677Z',"
- "date":"2021-09-18T22:48:43.982+0000"
- Languages (e.g., English, German, Japanese): Logs can be written in many natural languages, depending on who wrote the software program; they can appear in as many languages as the business is conducted in. This means that the NLP software used for extracting log messages must be able to detect the language and parse it accordingly, with suitable tokenizers and dependency parsers, to extract features. Complicating the language-support problem further, some logs are written in a mix of languages; the second example below combines German and English. Tokenization gets trickier still when non-space-delimited languages are mixed with space-delimited ones: when English is mixed with Japanese or Chinese, parsing those logs gets even more complex than parsing logs written in space-delimited languages such as English, Spanish, or German. (A crude script-detection sketch appears after this list.)
- Double-byte languages (e.g., Japanese): ランタイム・プロビジョニング・フィーチャーが使用不可になっています。 すべてのコンポーネントが開始されます ("The runtime provisioning feature is disabled. All components will be started.")
- Mixed languages (e.g., German and English): Das gemeldete problem ist: error serializing java object. ("The reported problem is: error serializing java object.")
- Poorly written logs: Sometimes logs are written poorly and in a cryptic manner. For example, consider the log phrase 'POST /orders 500'. According to the subject matter expert, the number 500 in this log implicitly meant 'HTTP error code 500', a critical error that must be brought to an administrator's attention immediately. However, in the absence of the phrase 'HTTP error', it is virtually impossible to distinguish it from the ordinary number 500. We like to refer to such logs as 'read my mind' logs. (A heuristic sketch for this case appears after this list.)
- 'POST /orders 500' vs. 'POST /orders HTTP error 500'
- Confusing and conflicting logs: Human programmers who design log messages are susceptible to making mistakes. When log formats are not standardized, programmers can produce confusing and conflicting log messages. For example, in the log message below, from a kernel application, we noted three severity indicators at once: info, exception, and error. This makes it difficult to discern whether it is an information-oriented, error-oriented, or exception-oriented log message. (A sketch for flagging such conflicts appears after this list.)
- "php.INFO: User Deprecated: Since symfony/http-kernel 5.3: \"Symfony\\Component\\HttpKernel\\Event\\KernelEvent::isMasterRequest()\" is deprecated, use \"isMainRequest()\" instead. {\"exception\":\"[object] (ErrorException(code: 0): User Deprecated: Since symfony/http-kernel 5.3:["
- Log volumes: Apart from structure and format variations, log volumes pose parsing challenges as well. As the number of business applications and systems that must be monitored in real time for anomalies grows, the volume of IT logs that must be processed grows with it, increasing infrastructure requirements. Log volumes can run to several terabytes of data per day. This drives up the cost of the infrastructure needed to process these logs, which can make the total cost of ownership of a log-based anomaly detection solution economically unattractive.
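To make some of these challenges concrete, the sketches below illustrate possible approaches in Python. They are minimal illustrations under stated assumptions, not the parsing implementation used in any product; the function names, regular expressions, and sample log lines are our own. First, a sketch of pulling structured entities (timestamp, log level, IP address) out of a semi-structured log line with regular expressions:

```python
import re

# Illustrative patterns only; production parsers need far broader coverage.
PATTERNS = {
    "timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{4})?"),
    "ip": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "level": re.compile(r"\b(INFO|WARN|WARNING|ERROR|DEBUG|FATAL)\b"),
}

def extract_entities(line: str) -> dict:
    """Split a semi-structured log line into structured entities plus free text."""
    entities = {name: pat.findall(line) for name, pat in PATTERNS.items()}
    # Whatever remains after removing the structured parts approximates the message.
    message = line
    for pat in PATTERNS.values():
        message = pat.sub("", message)
    entities["message"] = message.strip(" -:[]")
    return entities

print(extract_entities("2021-09-18T22:48:43.982+0000 ERROR 127.0.0.6 Connection refused"))
```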
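Second, a sketch of unwrapping a JSON payload embedded inside an XML wrapper, as in the varied-formats challenge above. The tag and key names here are hypothetical, and real logs may nest arbitrarily deep, which calls for recursive detection rather than the single fixed step shown:

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical log line: a JSON payload buried inside an XML envelope.
raw = '<entry><data>{"date": "20210805T1503", "message": "Processed the order in 23ms."}</data></entry>'

root = ET.fromstring(raw)                     # parse the XML wrapper first
payload = json.loads(root.find("data").text)  # then the embedded JSON payload
print(payload["message"])                     # -> Processed the order in 23ms.
```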
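For the date-time stamp variations, one common tactic is to try a list of candidate formats until one succeeds. A minimal sketch follows; the format list is illustrative, not exhaustive:

```python
from datetime import datetime

# Candidate formats drawn from the examples above, plus a custom legacy one.
CANDIDATE_FORMATS = [
    "%d/%b/%Y:%H:%M:%S %z",    # 15/Sep/2021:10:48:30 +0000
    "%Y-%m-%dT%H:%M:%S.%f%z",  # 2021-09-18T22:48:43.982+0000
    "%Y-%m-%dT%H:%M:%S.%fZ",   # 2021-09-18T22:35:16.576677Z
    "%Y%m%dT%H%M",             # 20210805T1503
]

def parse_timestamp(text: str):
    """Return the first successful parse, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

print(parse_timestamp("15/Sep/2021:10:48:30 +0000"))
print(parse_timestamp("20210805T1503"))
```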
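For mixed-language logs, a lightweight first step is to detect which scripts appear in a message before choosing tokenizers. The sketch below uses Unicode code-point ranges to spot Japanese kana and CJK ideographs. It is a crude heuristic, not a language detector: it cannot, for instance, tell German from English, since both report as Latin, and a real pipeline would hand that case to a proper language-identification model:

```python
def scripts_in(text: str) -> set:
    """Crude script detection via Unicode code-point ranges."""
    scripts = set()
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:    # hiragana and katakana
            scripts.add("japanese_kana")
        elif 0x4E00 <= cp <= 0x9FFF:  # CJK unified ideographs
            scripts.add("cjk_ideographs")
        elif ch.isascii() and ch.isalpha():
            scripts.add("latin")
    return scripts

print(scripts_in("Das gemeldete problem ist: error serializing java object"))
# -> {'latin'}
print(scripts_in("ランタイムが使用不可になっています"))
# -> {'japanese_kana', 'cjk_ideographs'}
```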
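For 'read my mind' logs such as 'POST /orders 500', one can codify the expert's reading as a heuristic: a bare three-digit number after an HTTP method and path is probably a status code. The pattern below is our own illustration, and it shows the risk as much as the fix, since a genuine count of 500 orders would be misread the same way:

```python
import re

# Guess: "METHOD /path NNN" means NNN is an HTTP status code.
HTTP_LINE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(\S+)\s+(\d{3})\b")

def guess_status(line: str):
    """Interpret a trailing 3-digit number after METHOD /path as an HTTP status."""
    m = HTTP_LINE.search(line)
    if not m:
        return None
    method, path, status = m.groups()
    severity = "critical" if status.startswith("5") else "ok"
    return {"method": method, "path": path, "status": int(status), "severity": severity}

print(guess_status("POST /orders 500"))
# -> {'method': 'POST', 'path': '/orders', 'status': 500, 'severity': 'critical'}
```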
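Finally, a sketch for flagging messages that mention multiple, conflicting severity keywords, like the php.INFO example above. Rather than classifying such lines automatically, a pipeline can route them for human review:

```python
LEVEL_KEYWORDS = ("info", "warn", "error", "exception", "fatal", "debug")

def levels_mentioned(line: str) -> list:
    """Return every severity keyword that appears anywhere in a log line."""
    lowered = line.lower()
    return [kw for kw in LEVEL_KEYWORDS if kw in lowered]

line = 'php.INFO: User Deprecated: ... {"exception":"[object] (ErrorException(code: 0): ..."}'
found = levels_mentioned(line)
if len(found) > 1:
    print("conflicting severities:", found)  # -> ['info', 'error', 'exception']
```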
Conclusion
IT logs are an important source of information in IT operations management. However, deriving insights from logs is a hard problem because logs are often not standardized, come in many formats, and are voluminous. In this article, we discussed what makes log parsing hard. In the next article, we will present techniques for harvesting insights from logs in an economically viable manner.
References
- [1] Ganti R. et al. (2021). Why Logs are Important? https://community.ibm.com/community/user/aiops/blogs/raghu-kiran-ganti1/2021/11/30/why-logs-are-important?CommunityKey=6e6a9ff2-b532-4fde-8011-92c922b61214
#CloudPakforWatsonAIOps #AIOps #LogAnomaly #logging #LogAnalysis