Why the Watson AIOps Event Manager is a game changer

The new Event Manager, based on Netcool technology, is one of the three pillars of the Watson AIOps solution. It takes Operations Management and AIOps to the next level, from the all-new, completely redesigned UX to a dramatic advance in event correlation that combines machine learning, local knowledge, and topology into a single, unified correlation engine. So, what’s “wow” about the new Event Manager? Let’s take a closer look.

ALL-NEW ANALYTICS & CORRELATION ENGINE

There are many different ways to skin the proverbial cat, and the same is true of event correlation. Netcool Operations Insight has previously centred around two main approaches: analytics-based and scope-based. Analytics-based event correlation (also referred to as temporal correlation) uses machine-learning algorithms to analyse the event history, looking for events that always suspiciously occur together. It then automatically groups these events together if they occur again in future. Scope-based grouping groups events that occur “at the same place, at the same time”. It allows the user to define what the scope should be. For example, the scope might be the geographical location the events have come from, or it might be a logical grouping, such as a line-of-business or application group.
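As a rough illustration of the scope-based idea ("same place, same time"), the Python sketch below groups events that share a scope value and arrive within a fixed window of the previous event in that scope. The field names (`scope`, `ts`), the sample events, and the 300-second window are assumptions made for the example, not Event Manager's actual schema or defaults.

```python
from collections import defaultdict


def group_by_scope(events, window_seconds=300):
    """Group events that share a scope and arrive close together in time.

    An event joins the current group for its scope if it arrives within
    window_seconds of the previous event in that scope; otherwise it
    starts a new group. Field names are hypothetical.
    """
    by_scope = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        by_scope[event["scope"]].append(event)

    groups = []
    for scope, scoped_events in by_scope.items():
        current = [scoped_events[0]]
        for event in scoped_events[1:]:
            if event["ts"] - current[-1]["ts"] <= window_seconds:
                current.append(event)
            else:
                groups.append((scope, current))
                current = [event]
        groups.append((scope, current))
    return groups


# Hypothetical events: two London events close together, one much later.
events = [
    {"id": "e1", "scope": "London", "ts": 0},
    {"id": "e2", "scope": "London", "ts": 120},
    {"id": "e3", "scope": "Paris", "ts": 130},
    {"id": "e4", "scope": "London", "ts": 4000},
]
scope_groups = [(s, [e["id"] for e in g]) for s, g in group_by_scope(events)]
```

Here `e1` and `e2` end up in one London group, while `e4` arrives too late to join them and forms a group of its own.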

The Event Analytics engine has been completely rebuilt for Netcool Operations Insight 1.6 as a cloud-native application and is updated for the Event Manager release. Although there were techniques available that allowed you to combine event-grouping capabilities, Event Manager now does this automatically, and creates “super groups” for you. A key difference is that an event can now be a member of different types of groups simultaneously. Events that are members of more than one group determine how super-grouping is done: groups that have overlapping members are automatically combined.
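The merge rule described here (groups with overlapping members are combined) behaves like a connected-components problem. The following is a minimal Python sketch of that idea using union-find; the group names and event IDs are invented, and this is an illustration of the principle rather than Event Manager's actual implementation.

```python
def build_super_groups(groups):
    """Merge groups that share at least one event.

    groups: dict mapping a (hypothetical) group name to a set of event IDs.
    Returns a list of sets of event IDs, one set per super group.
    """
    parent = {}

    def find(x):
        # Register unseen events as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Every event in a group is linked to that group's first event.
    for members in groups.values():
        ordered = sorted(members)
        for event in ordered:
            union(ordered[0], event)

    # Collect events by their final root.
    merged = {}
    for event in list(parent):
        merged.setdefault(find(event), set()).add(event)
    return sorted(merged.values(), key=sorted)


groups = {
    "temporal": {"e1", "e2", "e3"},
    "scope": {"e3", "e4"},        # overlaps with temporal via e3
    "topology": {"e4", "e5"},     # overlaps with scope via e4
    "other_scope": {"e8", "e9"},  # no overlap, stays separate
}
super_groups = build_super_groups(groups)
```

In this example the temporal, scope, and topology groups chain together through `e3` and `e4` into one super group of five events, while the unrelated scope group remains on its own.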

This is a big deal. Although there are many ways that event correlation can be done, experience has shown that no one method will work for every scenario. Analytics and machine learning techniques consume data in an attempt to detect patterns and thereby predict future eventualities. But what if a given scenario has not been seen before? Similarly, scope-based correlation is excellent at automatically grouping together events that occur at the same place. But what if the incident affects multiple places, or happens between places? How do you link these groups of events together?

Watson AIOps Event Manager leverages multiple event correlation techniques and uses them collaboratively, which lets it correlate events more completely and into more comprehensive groups. More complete and accurate correlation not only reduces the number of event rows presented to operators; it has also been proven in the field to significantly reduce Mean-Time-To-Repair (MTTR) as well as the number of trouble tickets created. A large North American cable provider successfully used these techniques to reduce ticket counts by 75%. They also calculated that their average MTTR dropped by around 63% as a result of having events correctly correlated together.

In the following image, you can see a group of nine events in the Events view. In the grouping columns, four of the events are marked with a clock icon, indicating they have been grouped via a temporal (or analytics-based) correlation. Three of the events are marked with a Venn diagram icon, indicating they have been grouped by a scope-based correlation. Seven of the events are marked with a topology icon, indicating they have been grouped by a topological correlation. Five of the events span multiple groups, hence those groups have been merged together into this group of nine events, forming a so-called "super group".

[Image: Super group]

This feature is big news: now you don’t have to choose which correlation technique to use; you can instead use all of them together.

ALL-NEW UX

The Event Manager solution retains the legacy Netcool Operations Insight WebGUI for backwards compatibility, but adds an all-new User Experience. The new UX has a fresh new look, and is based on the latest web UI technologies. It has a cleaner, more intuitive feel, and seamlessly brings together all the various correlation, analytical, and functional capabilities.

From here, the user can see extensive information about each grouping of events. First, the correlation columns on the right show how each event has come to be included in the group. Some events are members of multiple groups, and clicking on the grouping buttons shows more detail. Second, the Topology column shows which of the resources represented in the group of events are found in the topology. Third, the Seasonal column shows whether any of the events exhibit seasonal characteristics (appearing at a predictable time of the day or day of the month). Fourth, the Runbook column shows whether any of the events have Runbooks available for execution. Runbook Automation is a new capability in Event Manager that lets SMEs author Runbooks for commonly occurring issues that have well-defined resolutions. Finally, the Probable Cause column shows each event's probable cause score. The higher the score, the more likely the event is to be the cause of the current problem. The event or events with the highest score are marked in blue. In this example, the excessive CPU usage on the hypervisor is the probable cause of the current issue, and is in turn causing the issues on the underlying dependent systems.

ANALYTICS PRESENTED IN-CONTEXT

Event Manager makes the Event Analytics transparent by providing contextual detail of the analysis. The screenshot above shows the summary information presented to users when any of the temporal grouping buttons are clicked. By clicking on "More information", the user can then inspect the full history of previous occurrences of the grouping, as shown in the example below.

Being able to see the context behind machine learning and AI methodology is essential for gaining trust in the actions it takes. This is why Event Manager provides extensive visibility of such screens to give so-called "explainable AI".

Similarly, if the user clicks on the Seasonal button for the "Weekly backup" event, they can inspect the nature of its Seasonality.
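Seasonality in the sense used here (an event appearing at a predictable time) can be illustrated with a simple histogram over time slots. The Python sketch below finds the most common weekday-and-hour slot for an event's past occurrences and the share of occurrences that fall in it. The slot definition and the sample "Weekly backup" timestamps are assumptions for the example; the analysis Event Manager actually performs is not described in this article.

```python
from collections import Counter
from datetime import datetime, timedelta


def dominant_slot(timestamps):
    """Return the most common (weekday, hour) slot and the share of
    occurrences that fall in it. Monday is weekday 0, Sunday is 6."""
    slots = Counter((t.weekday(), t.hour) for t in timestamps)
    slot, count = slots.most_common(1)[0]
    return slot, count / len(timestamps)


# Hypothetical history: a backup event seen every Sunday at 02:00.
first = datetime(2021, 1, 3, 2, 0)
backups = [first + timedelta(weeks=i) for i in range(8)]
slot, share = dominant_slot(backups)
```

A share close to 1.0 for a single slot suggests a strongly seasonal event; a share spread thinly over many slots suggests the opposite.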

NO RELIANCE ON THE REPORTER DATABASE

In Netcool Operations Insight 1.5 and earlier, the Event Analytics engine worked off the contents of the REPORTER database to do its analysis. This meant that when an analytics configuration was run, the Event Analytics engine would pull the selected dataset from the target database, do its processing, and then tabulate the results. In practice, this caused issues for some clients. First, poorly performing databases, or ones that contained enormous amounts of event data, took inordinate amounts of time to analyse. This also resulted in time-out issues in the UI when users queried the system for historic event occurrences. Second, many REPORTER databases are not deployed in a best-practice manner. In some cases, columns required by the analytics were missing or used for other purposes. This meant the analytics could not run, or yielded no meaningful results.

The new Event Analytics engine in Event Manager, and also in Netcool Operations Insight 1.6, has no reliance on the REPORTER database. It takes a direct feed from the ObjectServer and stores internally everything it needs to do analytics and drive the UI. In practice, this makes it all work much faster, and dramatically improves the user experience. Note that it is possible to prime the new Event Analytics engine with historic event data, if you have it. Event Manager and Netcool Operations Insight 1.6 both come with a utility that can be used to ingest data from your REPORTER database, to get off to a fast start. The REPORTER database is no longer a required component at deployment time, however, and the engine does not rely on it subsequently.

DEPLOY FIRST OR REVIEW FIRST

The default temporal grouping deployment model for Event Manager is to activate Event Analytics and automatically group events that it has learned historically always occur together. This mode of operation is called “deploy-first”, where groupings are automatically deployed as they are discovered by the Event Analytics engine. Alternatively, the system can be set up so that all discovered groupings must first be validated before they are allowed to perform automatic event grouping. This alternative mode of operation is called “review-first”. This can be configured in the Analytics configuration.

We saw earlier that one of the new features of Event Manager is to provide transparency of the analytics to the users. Users can see why events are grouped together, as well as drill down into the historic occurrences of the grouping. In both deploy-first and review-first modes of operation, users can also access the found groupings via the More Information link in the Events view. Under Policies, an administrator can review the Live groupings and the Suggested ones. In deploy-first mode, all groupings are automatically made “live” unless a grouping is rejected by an administrator. In review-first mode, all groupings are first suggested and must be validated before they are made live. As before, these groupings can be assessed, the previous historic occurrences examined, and then the grouping approved or rejected. This allows anyone looking at an occurrence of the group of events in the incident view later to see that this grouping had been specifically approved and validated.
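The difference between the two modes comes down to the status a newly discovered grouping starts in. A minimal Python sketch of that lifecycle follows; the enum and function names are hypothetical and do not correspond to any Event Manager API.

```python
from enum import Enum


class Mode(Enum):
    DEPLOY_FIRST = "deploy-first"
    REVIEW_FIRST = "review-first"


class Status(Enum):
    SUGGESTED = "suggested"
    LIVE = "live"
    REJECTED = "rejected"


def initial_status(mode):
    """Status given to a grouping when the analytics first discovers it."""
    if mode is Mode.DEPLOY_FIRST:
        return Status.LIVE       # groups events immediately
    return Status.SUGGESTED      # waits for an administrator


def review(approved):
    """Status after an administrator approves or rejects a grouping."""
    return Status.LIVE if approved else Status.REJECTED


def groups_events(status):
    """Only live groupings perform automatic event grouping."""
    return status is Status.LIVE
```

In deploy-first mode a grouping is live until someone rejects it; in review-first mode it stays suggested, and inactive, until someone approves it.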

If the system is running in review-first mode, the system helps a user with the review process by automatically ranking the groupings in the Suggested view. Factors that affect a grouping’s ranking include: when the grouping was last seen, what the maximum Severity of the events was, the number of events in the group, and the number of times the grouping has been seen. This is very helpful in leading the user to the groupings that would bring the highest value to the business. If approved, groupings move from the Suggested box to the Live box.
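To make the ranking idea concrete, here is a small Python sketch that scores suggested groupings on the four factors named above. The weights, the recency formula, and the field names are all assumptions chosen for illustration; the article does not state how Event Manager actually combines these factors.

```python
SECONDS_PER_DAY = 86400


def rank_suggested(groupings, now):
    """Sort suggested groupings so the most valuable come first.

    Each grouping is a dict with hypothetical fields: last_seen (epoch
    seconds), max_severity, event_count, and times_seen.
    """
    def score(g):
        days_since_seen = (now - g["last_seen"]) / SECONDS_PER_DAY
        recency = 1.0 / (1.0 + days_since_seen)  # 1.0 when seen just now
        return (
            2.0 * recency
            + 1.0 * g["max_severity"]
            + 0.5 * g["event_count"]
            + 0.5 * g["times_seen"]
        )

    return sorted(groupings, key=score, reverse=True)


now = 10 * SECONDS_PER_DAY
suggested = [
    {"name": "A", "last_seen": now, "max_severity": 5,
     "event_count": 4, "times_seen": 10},
    {"name": "B", "last_seen": now - 9 * SECONDS_PER_DAY, "max_severity": 3,
     "event_count": 2, "times_seen": 2},
    {"name": "C", "last_seen": now - 1 * SECONDS_PER_DAY, "max_severity": 5,
     "event_count": 6, "times_seen": 20},
]
ranked_names = [g["name"] for g in rank_suggested(suggested, now)]
```

With these example weights, grouping C ranks first because it has been seen most often and contains the most events, even though A was seen more recently.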

AWARD-WINNING TOPOLOGY VIEWER

One of the jewels in the crown of Event Manager is its award-winning Topology Viewer.

With the ever-increasing trend towards more dynamic, cloud, and multi-cloud environments, being able to visualise how your environment is connected in a single pane of glass has become a vital element in being able to support and manage it. The Topology Viewer in Event Manager has a number of key capabilities that make it essential for the management of dynamic environments.

OBSERVERS

Just as traditional Netcool Probes collect events, Observers collect topology data. The library of topology ingestion Observers includes off-the-shelf ones designed to connect to specific types of topology source, such as Kubernetes or VMware, and generic ones that can be used to ingest custom topology data, such as from a file or via the REST API.

DYNAMIC

One of the principal design elements of the Topology Viewer was that it needed to be able to consume and depict topology data in real-time. Many of the Observers plug directly into dynamic orchestration systems, consume the topology changes published by the target system, and update the topology in Netcool on-the-fly. This feature is essential in allowing an operations team to effectively manage a highly dynamic environment. Not only does an operator need to be able to see how things are connected now, they need to be able to “go back in time” to see how things were connected at the time the events occurred.

TIMELINE AND DELTA

The Netcool Topology Viewer stores received topology data, so that a user can “go back in time” and view how the environment was connected at a previous point in time. By switching on the DELTA view, a user can also see what has changed. This is an essential capability both for troubleshooting a current issue and for doing a debrief after a major outage. With DELTA mode enabled, pins can be positioned on two different points in the timeline to show what changed between those two points in time.
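Conceptually, a delta between two pinned points in time is a set difference between two topology snapshots. The Python sketch below models each snapshot as a set of (source, target) edges; the resource names are invented and this representation is an assumption for illustration, not the Topology Viewer's internal data model.

```python
def topology_delta(before, after):
    """Compare two topology snapshots, each a set of (source, target) edges."""
    return {
        "added": after - before,
        "removed": before - after,
        "unchanged": before & after,
    }


# Hypothetical snapshots taken at two pins on the timeline.
before = {("lb", "app"), ("app", "db"), ("app", "cache")}
after = {("lb", "app"), ("app", "db"), ("lb", "app-2"), ("app-2", "db")}
delta = topology_delta(before, after)
```

In this example the delta shows a second application instance being connected and the cache link disappearing between the two pins.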

MERGED TOPOLOGIES

Another feature of the topology viewer is the ability to stitch or merge together topology parts that have originated from different sources. An example of this might be a hybrid environment whereby some parts of the managed environment reside in an on-premise containerised environment and other parts in a public cloud environment. The various parts may collectively make up a service and have mutual dependencies, hence being able to visualise all the parts simultaneously as well as the connectedness is vital to troubleshooting any potential problems. Topology parts from multiple different sources can all be stitched together in a similar manner to provide visualisation of the entire estate, on a single pane of glass.

TOPOLOGY-BASED EVENT CORRELATION

Correlating events coming from highly dynamic environments, where resources are created and destroyed on-the-fly in a more-or-less random fashion, is very difficult, if not impossible, without a view of the topology at the time the events occurred. In such environments, trying to deduce event relationships based on an analytical assessment of the historic event data is of limited value. After all, it is very difficult to predict how events should be correlated together in future, by looking at how events occurred together in the past, if those events came from topology that no longer exists!

One of the Event Manager event correlation capabilities is its ability to perform topology-based event correlation. This feature allows users to define topology templates, which can then be used to define event correlation boundaries. For example, a topology template may be defined to have a specific collection of resources or a method for defining a collection of resources. If event correlation is enabled for the template via the toggle, events occurring within the defined topology will automatically be correlated together.
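One way to picture a template that uses "a method for defining a collection of resources" is a rule such as "everything within N hops of a given resource". The Python sketch below applies that rule with a breadth-first search and then groups the events whose resource falls inside the boundary. The hop-based rule, the adjacency-list topology, and all names are assumptions for illustration; real topology templates are defined through the product's own UI.

```python
from collections import deque


def resources_in_template(topology, seed, max_hops):
    """Collect resources reachable from seed within max_hops (BFS).

    topology: dict mapping a resource to a list of its neighbours.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # boundary reached, do not expand further
        for neighbour in topology.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen


def correlate_by_template(events, template_resources):
    """Return IDs of events raised by resources inside the template."""
    return [e["id"] for e in events if e["resource"] in template_resources]


# Hypothetical topology: a link joining two routers, each with a host.
topology = {
    "link-1": ["router-a", "router-b"],
    "router-a": ["link-1", "host-1"],
    "router-b": ["link-1", "host-2"],
    "host-1": ["router-a"],
    "host-2": ["router-b"],
    "host-3": [],
}
events = [
    {"id": "e1", "resource": "router-a"},
    {"id": "e2", "resource": "router-b"},
    {"id": "e3", "resource": "host-3"},
    {"id": "e4", "resource": "link-1"},
]
boundary = resources_in_template(topology, "link-1", max_hops=1)
correlated = correlate_by_template(events, boundary)
```

Events on the link and both routers are correlated together, while the event on the unconnected `host-3` is left out.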

The topology-based event correlation works in conjunction with the scope-based and analytics-based correlation capabilities in a collaborative fashion, to create super groups. This capability enables clients to finally close the loop on many of the edge case correlation scenarios that may not have been possible to elegantly solve before.

Consider the following diagram:

In this scenario, the outage is caused by a link going down between two parts of the environment, causing events to be generated in multiple places. The analytics-based correlation has grouped together the three events circled in orange, since this is a known grouping based on previously observed events coming from the same application nodes. Similarly, the scope-based correlation has created two groupings of events based on the events’ respective geographic locations. With these two correlation mechanisms alone, two incidents would be created.

Enter: topology-based event correlation. Due to a predefined topology template that defines the links in the environment, the topology-based event correlation has correctly correlated the four events coming from the four resources circled in green. Since there is overlap between this topology-based grouping and both of the scope-based event groups, the event grouping engine will automatically merge the events covered by all these groups into one super group, thereby making it easy to create just the one ticket for the link down failure.

INDUSTRY LEADING EVENT CORRELATION

As we know, scope-based event correlation will correlate events within the same scope; make the scope too large, however, and you risk correlating events together incorrectly. Analytics-based event correlation leverages machine learning to learn which events have historically occurred together, but in many cases a given event scenario has not been seen often enough for validation. Topology-based correlation allows us to define connectivity templates that define how alarms from specific types of connected things can be correlated.

Experience in the field has taught us that, while all three approaches are very powerful, no one approach alone will fulfill every possible use-case. Watson AIOps Event Manager is unique in that it leverages multiple powerful event correlation techniques, simultaneously and collaboratively, allowing far greater and more accurate event correlation than any single approach can achieve alone.

OTHER NEW CAPABILITIES

Netcool Operations Insight 1.6 comes with a number of other new capabilities to make managing your Hybrid Cloud environment easier.

INBOUND EVENT INTEGRATIONS

Event Manager adds inbound event integrations that can be set up quickly, with just a few clicks. This set of off-the-shelf integrations is being built on and expanded continuously.

The new inbound event integrations include:

RUNBOOK AUTOMATION

A runbook is essentially a set of instructions or steps that can be followed to resolve a problem. A “manual” runbook is a set of instructions that an operator carries out themselves, with no automation involved. For example, it might involve copying and pasting commands that would resolve an issue, such as resetting a link on a switch. A “semi-automated” runbook is one that provides a set of steps that are initiated by the operator, but that automates the execution of each step. For example, the operator might be presented with a button that, when clicked, connects to the switch and resets the link. In either case, runbooks are context-sensitive to the event they were launched from, and the runbook instructions and steps are populated based on the selected event. An example of a semi-automated runbook is shown below:

At the end of each runbook is the option for the operator to give feedback on the runbook – for example, did it work or not? Did any of the steps fail? If a runbook author sees that their runbook was used 100 times in the past month, and was always successful, they might look to make the runbook fully automatic. This means the runbook will run without any user intervention and introduce self-healing elements to the environment. The beauty of this approach is that a fully-automated resolution can be tried and tested organically in production by real users, before it is let loose on the environment. Human interaction is needed only if the runbook fails to resolve the problem. Hence a great deal of resource can be saved by automating many of the mundane repetitive corrective tasks that use up a lot of operators’ time. In this way, clients can gradually evolve their environments to an increasingly self-healing state.
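The promotion decision described above can be expressed as a simple check over operator feedback. The Python sketch below uses the article's example figures (100 runs, always successful) as default thresholds; the function name and the feedback representation are hypothetical, and in practice the decision rests with the runbook author rather than with a fixed rule.

```python
def ready_for_full_automation(feedback, min_runs=100, min_success_rate=1.0):
    """Decide whether a runbook is a candidate for full automation.

    feedback: list of booleans, True when the operator reported that
    the runbook resolved the problem.
    """
    if len(feedback) < min_runs:
        return False  # not enough evidence yet
    success_rate = sum(feedback) / len(feedback)
    return success_rate >= min_success_rate


always_worked = [True] * 100
too_few_runs = [True] * 99
one_failure = [True] * 99 + [False]

candidate = ready_for_full_automation(always_worked)
```

Lowering `min_success_rate` would admit runbooks that occasionally fail, which may be acceptable where a failed run simply falls back to a human operator.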

DEPLOY ON CLOUD

Watson AIOps can deploy onto OpenShift in its entirety, or in a so-called "hybrid" manner, where some of the components run on OpenShift Container Platform (OCP) and others run on traditional virtual machines. In both cases, the VMs and the OCP deployment can be deployed into your own datacenter or be hosted on the cloud.

SUMMARY

Watson AIOps Event Manager brings a wealth of new capabilities and new technology that make it more intelligent and more powerful than ever. Its all-new user experience provides a more intuitive incident-centric way of working, that makes problem determination quicker and easier, and that gives transparency to the analytics done under the covers. Its ground-breaking new correlation capabilities – scope-based, analytics-based, and topology-based – work collaboratively to enable event correlation in ways not possible before.

All this helps to drive down ticket counts, Mean-Time-To-Know, and Mean-Time-To-Repair. It includes the new Topology Viewer, which is essential for visualising and managing highly dynamic and multi-cloud environments. It provides simple UI-driven event integrations into third-party hybrid-cloud applications. It now includes Runbook Automation, which allows for the organic development of self-healing environments.

Watson AIOps raises the bar yet again, and takes AIOps to the next level.

By Zane Bray posted Fri April 09, 2021 09:21 AM