By: Anindya Neogi
Originally Published On: June 25th, 2016
A “Log analysis” tool is incredibly useful in the hands of application development, support, and IT Ops users who want to make sense of logs to detect and diagnose an application or infrastructure problem. When we talk about log analysis, we actually mean “machine data” analysis: not just logs, but also traces, events, tickets, transaction records, and so on.
There are quite a few vendors building log analysis solutions on either proprietary or open source stacks. Clearly this is a growing market, especially in conjunction with other IT Service Management solutions.
The technology stack
The core stack is built around a set of technologies: (a) collecting any machine generated data, (b) parsing to structure the data into records and attributes, (c) indexing and storing the data for search and query, and (d) flexible data navigation and visualization. Over the last few years, the core stack has become increasingly commoditised through open source projects. The primary players are Apache Solr and ElasticSearch. Both are distributed search engines built on the same Lucene text indexing technology, so they share the same data query syntax but offer slightly different API abstractions for search and administration.
ElasticSearch quickly brought together the open source projects for machine data collection and parsing (LogStash) and for visualisation (Kibana) to create the stack called ELK. The pre-integrated stack has rapidly become very popular among the machine data analysis community, especially with Elastic providing support and services for the stack. In parallel, LucidWorks pulled together LogStash, Solr and Banana (a port of Kibana to the Solr APIs) to create the SLK stack. There is no clear winner between these stacks from a technology standpoint: both are widely used and have large followings, although ELK has the larger mindshare in the user community. Both are integrated with comparable collection, parsing, indexing, and visualisation technologies, and differ only in the search engine, its APIs, and its administration.
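To make the parse and index stages concrete, here is a minimal sketch in plain Python. It is not how LogStash, Solr, or ElasticSearch are implemented; the log layout, the `parse_line` and `build_index` helpers, and the sample lines are all hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>[\w.-]+):\s+(?P<message>.*)$"
)


def parse_line(line):
    """Parse one raw log line into a record (dict of attributes), or None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.fromisoformat(record["timestamp"])
    return record


def build_index(records):
    """Toy inverted index: lower-cased message token -> set of record positions."""
    index = {}
    for pos, record in enumerate(records):
        for token in re.findall(r"\w+", record["message"].lower()):
            index.setdefault(token, set()).add(pos)
    return index


# Made-up sample input standing in for collected machine data.
raw_lines = [
    "2016-06-25T10:15:30 ERROR db.pool: connection timeout after 30s",
    "2016-06-25T10:15:31 INFO web.api: request served in 12ms",
    "2016-06-25T10:15:32 ERROR web.api: upstream timeout",
]

records = [r for r in map(parse_line, raw_lines) if r is not None]
index = build_index(records)

# Search: which records mention "timeout"?
hits = sorted(index.get("timeout", set()))
for pos in hits:
    print(records[pos]["timestamp"], records[pos]["component"], records[pos]["message"])
```

A real search engine adds distribution, persistence, and relevance ranking on top of this idea, but the flow from raw line to structured record to searchable index is the same.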
Any open source machine data analysis stack faces a set of serious challenges in a production environment, whether Cloud or on-prem. These challenges are very different from those of typical enterprise search because the data and use cases are different, irrespective of the deployment model. For example:
- Machine documents are small log or event records (for example, 200-byte logs as opposed to web pages averaging 4KB), so the same data volume means many more documents and more stress on every stage of the data pipeline.
- Machine data arrives at a high rate, with thousands of log files streaming in real time, as opposed to large static text documents that are crawled periodically.
- Users expect low latency between the ingestion of a log or event record and its appearance in search results, because an operator needs the relevant data for analysis as soon as she gets the problem alert, or because the solution itself needs to generate an alert.
- Users expect searches to at least feel fast, with results updating dynamically in the UI to support quick diagnosis.
- Users expect various keywords to be extracted from the results with frequency counts computed in real-time. These are used for drill down navigation in the results to find the root cause of a problem.
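The keyword counts and drill-down navigation described in the last bullet can be sketched in a few lines of Python. This is only an illustration of the idea over an in-memory result set; the `facet_counts` and `drill_down` helpers and the sample records are hypothetical, and a real search engine computes such counts inside the index rather than in application code.

```python
from collections import Counter


def facet_counts(records, field, top_n=3):
    """Count how often each value of `field` occurs in the result set."""
    counts = Counter(r[field] for r in records if field in r)
    return counts.most_common(top_n)


def drill_down(records, field, value):
    """Narrow the result set to records where `field` equals `value`."""
    return [r for r in records if r.get(field) == value]


# Made-up search results for illustration.
results = [
    {"host": "web01", "level": "ERROR", "message": "upstream timeout"},
    {"host": "web02", "level": "ERROR", "message": "upstream timeout"},
    {"host": "web01", "level": "WARN", "message": "slow response"},
    {"host": "web01", "level": "ERROR", "message": "connection reset"},
]

host_facets = facet_counts(results, "host")    # which hosts dominate the results?
narrowed = drill_down(results, "host", "web01")  # user clicks the top host
level_facets = facet_counts(narrowed, "level")   # counts recomputed for the subset

print(host_facets)
print(level_facets)
```

Each click recomputes the counts over a smaller result set, which is why users expect this step to happen in real time.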
Both open source stacks provide the basic technologies to get users started with machine data analysis, but a lot more needs to be done to make a complete solution production ready. The IBM Operations Analytics – Log Analysis product picked the SLK stack to develop a machine data analysis solution. Our assumption was that any solution we build needs to be consumable in multiple delivery models (SaaS, PaaS, on-prem, or hybrid) without changing the core product. In addition, it needs to integrate with other Service Management solutions such as Service Desk, Event Management, and Application Performance Management.
There is a key set of capabilities we had to build on the open source stack to provide end user value:
- Domain insights: The open stack enables users to perform general data ingestion, query and visualization. We enable domain experts to create “insight packs” that add their core intellectual property to the solution as intelligence: for example, what data to ingest, how to format it, and which patterns to watch for in the data. Insight packs can also integrate Log Analysis with Service Management solutions to deliver end user value. Examples are the Netcool Operations Insight and Service Desk integrations, which ingest, search, and analyse events and tickets.
- Data source administration: In a large environment, there may be thousands of data sources of various types. We need to manage the collection process, identify gaps and errors, make sure the sources are mapped to the right data types, and so on. The analysis is only as good as the data we ingest.
- Data tiering: Machine data is highly temporal and may need to be stored for several months. We build time-based partitions, or tiers, on top of the basic indices so that we can better manage the data, the query workloads, and the machine resources. For example, recent data can be queried fast but consumes more resources, while long term archival data takes fewer resources but is susceptible to slower query response. This is also where we built an efficient integration between our data pipeline and a Hadoop cluster to store long term data for analysis.
- RBAC: Once different data sources and “insight packs” are configured, we need to provide role-based access control over the data and the artefacts that use the data. An RBAC system provides an interface to control users, roles, and permissions. In the long term this RBAC system can share state with other integrated solutions, such as Application Performance Management (APM), when they have to work together.
- Search scaling: Just as large data sets need to be tiered for resource efficiency, a large search across a huge volume of data also needs to be carefully controlled and managed so that users don’t blow up resource consumption. Imagine hundreds of users running queries over various time scales and drilling down into the results, coupled with a real-time UI.
- Alerts: As data is ingested, we need to watch for known problem patterns (rules) in real time and trigger alerts. These rules can be part of an imported domain “insight pack”. In a future post, I’ll discuss how we’ll leverage more open technologies, such as Spark, for real-time log analysis to create insights and alerts.
- API Abstraction: Even though we built the solution on a specific technology stack, we created an API abstraction to make the core indexing, search, administration layers technology agnostic. This enables us to switch to the best of breed open technology quickly.
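As an illustration of the data tiering capability in the list above, the following Python sketch routes records to time-based partitions and works out which partitions a time-bounded query has to touch. It is a simplified assumption of how such tiering could look, not the product's implementation; the 7-day and 90-day boundaries, the tier names, and the `tier_for`, `partition_name`, and `partitions_for_query` helpers are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical tier boundaries, chosen only for illustration.
HOT_WINDOW = timedelta(days=7)    # recent data: fast queries, more resources
WARM_WINDOW = timedelta(days=90)  # older data: slower queries, fewer resources


def tier_for(timestamp, now):
    """Pick a storage tier from the age of the data."""
    age = now - timestamp
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "archive"  # e.g. long term storage outside the search cluster


def partition_name(timestamp, now):
    """Name of the daily partition that holds data for this timestamp."""
    return f"{tier_for(timestamp, now)}-{timestamp:%Y.%m.%d}"


def partitions_for_query(start, end, now):
    """List the daily partitions a query over [start, end] must touch."""
    names = []
    day = start.date()
    while day <= end.date():
        midnight = datetime.combine(day, datetime.min.time())
        names.append(partition_name(midnight, now))
        day += timedelta(days=1)
    return names


now = datetime(2016, 6, 25, 12, 0)
recent = partition_name(datetime(2016, 6, 24, 9, 30), now)
older = partition_name(datetime(2016, 5, 1), now)
oldest = partition_name(datetime(2016, 3, 1), now)
touched = partitions_for_query(datetime(2016, 6, 23), datetime(2016, 6, 25), now)
print(recent, older, oldest, touched)
```

Because every query carries a time range, only the partitions inside that range are searched, which is also one simple way to keep a large search from consuming resources across the whole data set.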
None of these capabilities is available out of the box in an open source stack. Beyond these capabilities, what also matters is the invaluable operational expertise in tweaking and tuning an open source platform and in the best practices for managing it. In a nutshell, it is not easy to simply download an open source stack and turn it into an enterprise ready machine data analysis solution.
Given an extensible, robust and scalable machine data platform, our next focus is to build differentiators around more “analytics” on the data. In my next blog, I will talk about the Service Management use cases that need deep data analysis beyond search and visualization, and how we solve those problems on an Enterprise grade log analysis platform. I will also discuss how additional open technologies, such as Spark, will help with real-time analytics on machine data integrated with the Search platform.