Originally Published On: June 25th, 2016 By: Anindya Neogi
A “log analysis” tool is incredibly useful in the hands of application development, support, and IT Ops users who want to make sense of logs to detect and diagnose an application or infrastructure problem. When we talk about log analysis, we actually mean the analysis of “machine data”, which covers not just logs but also traces, events, tickets, transaction records, etc.
There are quite a few vendors building log analysis solutions on either proprietary or open source stacks. Clearly this is a growing market, especially in conjunction with other IT Service Management solutions.
The technology stack
The core stack is built around a set of technologies: (a) collecting any machine-generated data, (b) parsing to structure the data into records and attributes, (c) indexing and storing the data for search and query, and (d) flexible data navigation and visualization. Over the last few years, the core stack has become increasingly commoditised through open source projects. The primary players are Apache Solr and ElasticSearch. Both are distributed search engines built on the same Lucene text indexing technology, so they share the same data query syntax but offer slightly different API abstractions for search and administration.
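To make steps (a) to (c) concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the Solr, ElasticSearch, or LogStash API: the log layout, the `parse` function, and the `TinyIndex` class are all hypothetical names invented for illustration, and the in-memory inverted index only hints at what Lucene does at scale.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) log layout: "<timestamp> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<component>[\w.]+): (?P<message>.*)"
)


def parse(line):
    """Step (b): turn a raw line into a record of named attributes."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


class TinyIndex:
    """Step (c): a toy inverted index mapping each token to record ids."""

    def __init__(self):
        self.records = []
        self.postings = defaultdict(set)

    def add(self, record):
        doc_id = len(self.records)
        self.records.append(record)
        # Index every attribute value, lowercased and split into word tokens.
        for value in record.values():
            for token in re.findall(r"\w+", value.lower()):
                self.postings[token].add(doc_id)

    def search(self, term):
        """Return matching records in ingestion order."""
        doc_ids = sorted(self.postings.get(term.lower(), ()))
        return [self.records[i] for i in doc_ids]


# Step (a): collection, reduced here to a list of already-gathered lines.
raw_lines = [
    "2016-06-25T10:00:01Z ERROR db.pool: connection timeout after 30s",
    "2016-06-25T10:00:02Z INFO web.server: request served in 12ms",
    "2016-06-25T10:00:03Z ERROR web.server: upstream timeout",
]

index = TinyIndex()
for line in raw_lines:
    record = parse(line)
    if record is not None:  # lines that do not match the layout are skipped
        index.add(record)

hits = index.search("timeout")
for hit in hits:
    print(hit["timestamp"], hit["component"], hit["message"])
```

In a real deployment each of these pieces is a separate, distributed component; the sketch only shows how the stages hand data to one another.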
ElasticSearch quickly brought together the open source projects for machine data collection / parsing and for visualisation, LogStash and Kibana respectively, and created the stack called ELK. The pre-integrated stack has rapidly become very popular in the machine data analysis community, especially with Elastic providing support and services for the stack. In parallel, LucidWorks pulled together LogStash, Solr and Banana (a port of Kibana on Solr APIs) to create the SLK stack. There is no clear winner between these stacks from a technology standpoint: both are widely used and have large followings, though ELK has the larger mindshare in the user community. Both are built from the same collection, parsing, indexing, and visualisation technologies and differ only in the search engine, its APIs, and its administration.
Any open source machine data analysis stack faces a set of serious challenges in a production environment, whether cloud or on-prem. These challenges are very different from those of typical enterprise search because, irrespective of the deployment model, both the data and the use cases are different. For example:
- Machine documents are small log or event records, e.g. 200-byte logs as opposed to, say, web pages averaging 4 KB. For the same data volume, many more records have to be processed, which puts more stress on any data pipeline.
- Machine data arrives at a high rate, with thousands of log files streaming in real time, as opposed to large static text documents that are crawled periodically.
- Users expect low latency between the ingestion of a log / event record and its appearance in search results, because an operator needs the relevant data for analysis as soon as she gets the problem alert, or the solution itself needs to generate an alert.
- Users expect searches to at least feel fast, with results updating dynamically in the UI to support fast diagnosis.
- Users expect various keywords to be extracted from the results with frequency counts computed in real-time. These are used for drill down navigation in the results to find the root cause of a problem.
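The last point, keyword extraction with frequency counts used for drill-down, is what search engines usually call faceting. The following is a minimal sketch of the idea in plain Python, assuming the records have already been parsed into attributes. The sample records and the `facet_counts` and `drill_down` helpers are hypothetical; a production stack would compute these counts inside the search engine rather than in client code.

```python
from collections import Counter

# Hypothetical parsed records returned by a search such as "timeout".
results = [
    {"level": "ERROR", "component": "db.pool", "host": "app01"},
    {"level": "ERROR", "component": "web.server", "host": "app02"},
    {"level": "WARN", "component": "db.pool", "host": "app01"},
    {"level": "ERROR", "component": "db.pool", "host": "app01"},
]


def facet_counts(records, fields):
    """Count how often each value of each field occurs in the result set."""
    return {
        field: Counter(r[field] for r in records if field in r)
        for field in fields
    }


def drill_down(records, field, value):
    """Narrow the result set to records where the field equals the value."""
    return [r for r in records if r.get(field) == value]


facets = facet_counts(results, ["level", "component", "host"])
print(facets["component"].most_common())

# Clicking the "db.pool" facet narrows the results; counts are then recomputed.
narrowed = drill_down(results, "component", "db.pool")
print(facet_counts(narrowed, ["level"]))
```

The operator sees which values dominate the result set, picks one, and repeats on the narrowed set, which is the drill-down navigation described above.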
Both open source stacks provide the basic technologies to get users started with machine data analysis, but a lot more needs to be done to make a complete solution production-ready. For the IBM Operations Analytics – Log Analysis product, we picked the SLK stack to develop a machine data analysis solution. Our assumption was that any solution we build needs to be consumable in multiple delivery models (SaaS, PaaS, on-prem, or hybrid) without changing the core product. In addition, it needs to be consumable for integration with other Service Management solutions, e.g. Service Desk, Event Management, and Application Performance Management.