Large Netcool deployments around the world typically have multiple Aggregation ObjectServer pairs to hold the current event set, because a single pair is not sufficient. The best-practice upper limit for standing events in an Aggregation pair is 100,000, and large customers, particularly telcos, have many more events than this. This presents a problem for customers with more than one Aggregation ObjectServer pair (each pair being an "OMNIbus datasource") who want to upgrade their environment to Watson for AIOps Event Manager to take advantage of newer capabilities such as topology, advanced event correlation, and Runbook Automation, because Event Manager connects to a single Aggregation ObjectServer pair. This blog outlines a solution to overcome this problem.
In a nutshell, events are fed from their source datasources up into a COMBINED Aggregation ObjectServer pair, so that the converged event set can be enriched with Runbook data, analysed by event analytics, processed by topology, and grouped by the advanced event correlation engine - all at a single point. This enrichment data is then propagated back to the originating event, while any events created by Event Manager in the COMBINED datasource are propagated to a designated bottom-level datasource. The events are retained on the combined system for a relatively short time only, to ensure that the total number of resident events doesn't get too large. This is a key reason this solution is viable, since the total number of events still shouldn't ideally exceed 100,000 on this combined ObjectServer pair, to avoid performance issues.
For many years now, Netcool/WebGUI has supported cross-datasource rendering of parent-to-child relationships. This is where a parent event may reside in one datasource, and its child events reside in other datasources. So long as the `ParentIdentifier` is set correctly, and the Relationship is added to the current View, the Event Viewer will automatically render parent-to-child relationships correctly in the UI. Cross-datasource relationship rendering is enabled by editing the Relationship (e.g. the default "IBM Cloud Analytics" View), and checking the box labelled: "Enable Multiple Datasource Aggregations". This solution leverages this capability, since the synthetic parent events generated by Event Manager will all reside in a designated datasource, while the underlying events could reside on any of the datasources.
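To make the cross-datasource relationship concrete, here is a minimal Python sketch of the grouping idea. It is not how WebGUI is implemented: it assumes that a child's `ParentIdentifier` holds the parent's `Identifier`, and the event rows, the `datasource` key, and the names `AGG_1` and `AGG_2` are made up for illustration.

```python
from collections import defaultdict


def group_by_parent(events):
    """Group child events under their parent, regardless of datasource.

    Assumption: a child's ParentIdentifier equals the parent's Identifier.
    """
    # Events with an empty ParentIdentifier are candidate parents (or standalone).
    parents = {e["Identifier"]: e for e in events if not e.get("ParentIdentifier")}
    children = defaultdict(list)
    for e in events:
        pid = e.get("ParentIdentifier")
        if pid:
            children[pid].append(e)
    # Only identifiers that actually have children form a group.
    return {pid: (parents.get(pid), kids) for pid, kids in children.items()}


# Hypothetical, heavily simplified rows; real alerts.status rows carry many more fields.
events = [
    {"Identifier": "synthetic-1", "ParentIdentifier": "", "datasource": "AGG_1"},
    {"Identifier": "link-down-a", "ParentIdentifier": "synthetic-1", "datasource": "AGG_1"},
    {"Identifier": "link-down-b", "ParentIdentifier": "synthetic-1", "datasource": "AGG_2"},
    {"Identifier": "standalone", "ParentIdentifier": "", "datasource": "AGG_2"},
]

groups = group_by_parent(events)
```

In this toy data the synthetic parent lives in one datasource and its two children are split across two, which is the situation the "Enable Multiple Datasource Aggregations" option is meant to handle.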
Since the groups of events are potentially spread out across the datasources, the parent event management functions and housekeeping are handled by Netcool/Impact, because it can reach into all of the datasources. The WebGUI server will include the plug-ins that enable it to talk to Event Manager and, since the enrichment of the child data is propagated back to the source events, actions like running Runbooks and viewing temporal grouping details will work in the same way as if everything were happening on a single datasource.
As mentioned above, Watson for AIOps Event Manager works by connecting to a single Aggregation ObjectServer pair. The Aggregation ObjectServer pair is either contained within the OpenShift deployment (so-called "Cloud" deployment) or resides outside of OpenShift (so-called "Hybrid" deployment). This solution is based on a hybrid deployment of Event Manager, which is the most common large-scale deployment for Event Manager at the time of writing.
Reference documentation link: Watson for AIOps Event Manager (a.k.a. Netcool Operations Insight) hybrid deployment architecture
NOTES AND CAVEATS
There are a few caveats to the solution, as follows:
- The Cloud Native Event Analytics (CNEA) engine has a maximum event processing rate, which may or may not be able to keep up with the combined rate of events coming from all the sources. You can, however, throttle the events that you feed into the COMBINED system by modifying the filter on the Gateways that feed it.
- The solution adds up to 32 fields to each of the existing lower-level Aggregation ObjectServer datasources. The maximum number of fields an ObjectServer can have in any given table is 128, including the events table (`alerts.status`). Hence, if your existing system already has more than 96 fields in the source ObjectServer events table, you will have to drop some unused custom fields before implementing this solution.
- The solution retains the events at the combined level for a specified amount of time only (by default, 1 hour). When these events are cleaned from the COMBINED system, the topology will clear any associated event status against any matching resources. You should therefore tune the retention period (set the property `HKPurgeEventsThreshold` in the `master.cea_properties` table) to be as long as possible while keeping the total number of standing events below 100,000, so that the topology resources show the event status for as long as possible.
- You will need to create a separate Netcool/Impact Datasource definition for each ObjectServer where events may originate, including each individual primary and each individual backup Aggregation ObjectServer, named after that ObjectServer's `ServerName`. For example, you may adopt the suggested naming convention: AGG_B_2, etc. This is important because the Netcool/Impact policies automatically look for each child event in the Datasource whose name matches the value stored in the `ServerName` field of that event. The combined Aggregation ObjectServer pair should be your "defaultobjectserver" Datasource.
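Two of the caveats above lend themselves to a quick pre-flight check before you start. The following Python sketch is illustrative only and is not part of the solution package: the function names are made up, the field counts and name lists would have to come from your own environment, and `AGG_P_2` is a hypothetical name used alongside the `AGG_B_2` example.

```python
MAX_FIELDS = 128      # per-table field limit quoted in the caveat above
SOLUTION_FIELDS = 32  # upper bound on the fields the solution adds


def fields_to_drop(current_field_count):
    """Return how many existing fields must go before the solution's fields fit."""
    return max(0, current_field_count + SOLUTION_FIELDS - MAX_FIELDS)


def missing_datasources(server_names, impact_datasources):
    """Return ServerName values that have no identically named Impact Datasource."""
    return sorted(set(server_names) - set(impact_datasources))


# Example: a table with 101 fields is 5 over the 96-field headroom.
to_drop = fields_to_drop(101)

# Example: the backup ObjectServer has no matching Datasource yet (names are illustrative).
missing = missing_datasources(
    ["AGG_P_2", "AGG_B_2"],
    ["AGG_P_2", "defaultobjectserver"],
)
```

Here `to_drop` is 5 and `missing` contains only `AGG_B_2`, so in this made-up environment you would drop five unused custom fields and add one more Datasource definition.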
The solution configuration package is available via an IBM Tech Note.