Co-Authors: Ragu Kattinakere, Varsha GS, Patrick Butler
Site reliability engineers (SREs) spend a lot of their time going through logs to find the exact cause of an anomaly or an incident. Your application's logs may include several anomalies created in a short span of, say, 10 minutes. With each anomaly resulting in an event, it becomes difficult to pinpoint the exact cause.
With the release of Cloud Pak for Watson AIOps 3.3, IBM expands the concept of "stories" in AIOps. Stories represent the context around an issue which is currently severely impacting operations. This includes all alerts that are related to the issue and information about how the affected resources are related. The creation and evolution of stories is informed by alerts.
Grouping events into stories makes debugging easier and quicker by reducing noise. With the 3.3 release, IBM Cloud Pak for Watson AIOps uses three grouping methods to group events into one story. A default policy governs how all three methods interact to produce the best results possible for SREs.
Illustrative example:
An application with three services has multiple events within 30 minutes. The diagram below depicts the events of the three services in Green, Orange and Red respectively.
Let's look at how each grouping method helps in grouping these into logical groups or stories.
Scope based grouping
"Creates a new story for all events related to the same resource occurring in every pre-configured time window."
Consider an anomaly that has affected your application for the past half an hour. If the events were not grouped, there would be 100+ events for each service of your application. Going through all these events can take a lot of time, which also increases the downtime of your application.
Scope based grouping groups the events for each service within a pre-configured time window, thus significantly reducing the large number of events the SRE would have to go through, while also highlighting only the affected services.
With scope-based grouping, our previous example now appears more organised:
If two resources are impacted in a pre-configured time window, this grouping technique means SREs will see only two stories instead of numerous events from those resources. As we will see next, this method is combined with other grouping methods to drastically reduce the number of stories SREs have to look at over time.
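The idea can be sketched in a few lines of Python. This is only an illustration of bucketing events by resource and time window, not the product's implementation: the `Event` class, the `group_by_scope` function, the sample events, and the use of fixed, non-overlapping windows are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, chosen only for this sketch.
    resource: str
    timestamp: int  # seconds since an arbitrary reference point
    message: str


def group_by_scope(events, window_seconds):
    """Bucket events into stories keyed by (resource, window index).

    Assumes fixed, non-overlapping windows; a real system may use a
    different windowing strategy.
    """
    stories = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        window_index = event.timestamp // window_seconds
        stories[(event.resource, window_index)].append(event)
    return dict(stories)


# Five events from two resources, all inside one 10-minute window.
scope_events = [
    Event("orders", 10, "timeout"),
    Event("orders", 95, "timeout"),
    Event("shipping", 120, "queue full"),
    Event("orders", 400, "HTTP 500"),
    Event("shipping", 580, "queue full"),
]

scope_stories = group_by_scope(scope_events, window_seconds=600)
for (resource, window_index), grouped in scope_stories.items():
    print(resource, window_index, len(grouped))
```

Here the five events collapse into two stories, one per affected resource, which mirrors the two-resource case described above.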
Temporal based grouping
"Group an event to a story based on if it had occurred together with other events in a pre-configured time window in the past."
Most of the time, an anomaly in a service can be identified by looking at the events of other services. Each anomaly follows a pattern, which the SRE uses to figure out the root cause. While this pattern can be identified by going through a lot of anomalies over time, real-time logging usually does not leave the time needed to understand the pattern of an anomaly.
Take, for example, an application with two services, Orders and Shipping, whose events depend on each other. If the Orders service is affected and it stops logging instead of logging errors, the SRE might see an anomaly in another dependent service such as Shipping. With temporal grouping, SREs need not spend time manually investigating these patterns.
This is why temporal based grouping can be so useful.
With about three months of logging data, IBM Cloud Pak for Watson AIOps can detect very complex patterns and group the events of multiple services that are related by these complex, hard-to-detect patterns.
The first time an anomaly occurs, it results in multiple events from different services. Temporal grouping observes these events over a pre-configured time window. The next time the anomaly appears, the pattern is matched, and if the same events occur again in the time window, a policy is dynamically created. This policy helps the temporal grouping method group events that have the same source and are caused by the same anomaly.
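A toy version of this learn-then-match flow might look like the sketch below. It is not the algorithm the product uses: the pairwise co-occurrence counting, the `min_count` threshold, the function names, and the event type names are all assumptions chosen to keep the example small.

```python
from collections import Counter
from itertools import combinations


def learn_patterns(history, window_seconds, min_count=2):
    """Return pairs of event types seen together in at least min_count windows.

    history is a list of (event_type, timestamp) tuples. Both the pairwise
    counting and the threshold are simplifications for illustration.
    """
    windows = {}
    for event_type, timestamp in history:
        windows.setdefault(timestamp // window_seconds, set()).add(event_type)

    counts = Counter()
    for types in windows.values():
        for pair in combinations(sorted(types), 2):
            counts[frozenset(pair)] += 1
    return {pair for pair, seen in counts.items() if seen >= min_count}


def group_temporally(event_types, patterns):
    """Merge distinct event types linked by a learned pattern into one story."""
    stories = []
    for event_type in event_types:
        linked = [
            story for story in stories
            if any(frozenset((event_type, other)) in patterns for other in story)
        ]
        merged = {event_type}
        for story in linked:
            merged |= story
            stories.remove(story)
        stories.append(merged)
    return stories


# Hypothetical history: the green and orange events co-occur in two windows.
history = [
    ("green_latency", 5), ("orange_errors", 20),
    ("green_latency", 700), ("orange_errors", 730),
    ("red_disk", 1300),
]
patterns = learn_patterns(history, window_seconds=600)

temporal_stories = group_temporally(
    ["green_latency", "orange_errors", "red_disk"], patterns
)
print(temporal_stories)
```

In this sketch the green and orange event types end up in one story because they were observed together twice, while the red event type, which never co-occurred with the others, stays in a story of its own.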
Temporal based grouping helps eradicate noise and refines the events to provide only one story, which points to the root cause of the anomaly or often the anomaly itself.
If we look at the previous illustration, the events of the Green and Orange services appear together. The temporal method finds this pattern and tries to improve the grouping so SREs can look at fewer stories, each containing more related events.
With these types of patterns identified, the events are now grouped in a more systematic manner to help the SREs zero in on the cause of the issues.
In the above illustration, recognising the pattern, temporal grouping gives the SRE only three stories to look at, within a 30 minute time window.
Topological based grouping
"All the events, related to a user-specified set of resources or services are grouped to a single story."
If we clearly know the dependencies of a service, it is easy for an SRE to find the service causing the incident. But with a large number of services in a single application, it often becomes difficult to tell independent services from dependent ones.
Topological based grouping helps with these scenarios especially when the dependencies are very complex in a large enterprise IT system.
Referring to the earlier example of two services, Orders and Shipping, let's add another service, Carts, to it. Carts, Orders and Shipping are dependent on each other. A customer first adds an item to the cart and orders it, and the order is eventually shipped. Keeping this in mind, the service dependencies can be noted as follows: Shipping is dependent on Orders, and Orders is dependent on Carts. If an anomaly occurs in Orders, it can be deduced that the request made from Carts could have been invalid and that Carts might be responsible for the anomaly in the Orders service.
AIOps allows SREs to identify dependent services and create a template that groups them together. The Topological grouping technique then makes use of this resource grouping to group all the events of all the services in this group together.
This way, an anomaly in dependent services generates only one story, while independent services continue to generate individual stories as required. This greatly reduces the number of stories an SRE has to investigate.
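As a rough illustration of how such a template could drive grouping, consider the sketch below. The dictionary-based template format, the `group_by_template` function, the template name `storefront`, and the independent `billing` service are all hypothetical; they do not reflect how templates are actually defined in the product.

```python
from collections import defaultdict


def group_by_template(events, templates):
    """Group events into stories using user-defined resource templates.

    events is a list of (resource, message) tuples. templates maps a
    template name to the set of resources it covers. Resources not covered
    by any template get a story of their own. Template names are assumed
    not to clash with resource names.
    """
    resource_to_template = {
        resource: name
        for name, members in templates.items()
        for resource in members
    }
    stories = defaultdict(list)
    for resource, message in events:
        story_key = resource_to_template.get(resource, resource)
        stories[story_key].append((resource, message))
    return dict(stories)


# Hypothetical template covering the three dependent services.
templates = {"storefront": {"carts", "orders", "shipping"}}

topology_events = [
    ("carts", "invalid request"),
    ("orders", "HTTP 500"),
    ("shipping", "no orders received"),
    ("billing", "card declined"),  # independent service, not in the template
]

topology_stories = group_by_template(topology_events, templates)
for story_key, grouped in topology_stories.items():
    print(story_key, len(grouped))
```

The three dependent services produce a single story, and the independent service still gets its own, so four events become two stories in this example.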
Topological based grouping also allows an SRE to define groups irrespective of the topology level (pod/service/deployment), making the job of finding affected services and finding effective solutions easier.
In our above example, the SRE decides to create a template, based on topologically connected services.
The defined template treats all three services as one single application.
With this new information, topological grouping helps create a single story, which can combine all the service events as the SRE has defined.
IBM Cloud Pak for Watson AIOps 3.3 can be customised to suit your enterprise by enabling, disabling, or configuring grouping algorithms to generate fewer stories. This lets SREs focus their efforts on the real issues and greatly reduces the time required to resolve them.
For more information about grouping in IBM Cloud Pak for Watson AIOps, go here.