The event grouping capability of the Netcool Event Manager in Watson AIOps is the zenith of event correlation and adds enormous business value to any operations management environment. Many organisations around the world leverage the event grouping capabilities of Netcool to dramatically reduce ticket numbers, thereby cutting operational costs significantly and reducing Mean Time To Repair (MTTR).
The clearest application of event grouping to cost reduction is in the area of ticketing: rather than raising one ticket per event, raise a single ticket for a group of events. This means all the information relating to a single incident is kept together and consolidated into a single ticket, which makes triage and resolution easier. The goal is: one incident equals one ticket.
Three main types of event grouping
There are three main types of event grouping offered in Netcool: scope-based grouping, analytics-based grouping (also known as temporal grouping), and most recently, topology-based event grouping. These capabilities work together to group events that relate to the same problem. Events can be members of more than one group and, where multiple groups share common events, the groups are automatically combined to form so-called "super groups".
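One way to picture how groups that share events are combined into super groups is as a connected-components problem. The following Python sketch models that idea only. It is not how Event Manager implements it, and the function name `merge_into_super_groups` and the event identifiers are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set


def merge_into_super_groups(groups: Iterable[Set[Hashable]]) -> List[Set[Hashable]]:
    """Merge any groups that share at least one event (illustrative model only)."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Walk up to the representative, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: Hashable, b: Hashable) -> None:
        parent[find(a)] = find(b)

    for group in groups:
        members = list(group)
        if members:
            find(members[0])  # register the first member, even for a group of one
        for other in members[1:]:
            union(members[0], other)

    merged: Dict[Hashable, Set[Hashable]] = {}
    for event in parent:
        merged.setdefault(find(event), set()).add(event)
    return list(merged.values())


# Hypothetical event IDs: the scope group and the temporal group share "e2",
# so they collapse into one super group; the topology group stays separate.
scope_group = {"e1", "e2"}
temporal_group = {"e2", "e3"}
topology_group = {"e7", "e8"}
super_groups = merge_into_super_groups([scope_group, temporal_group, topology_group])
```

In this model, any chain of overlapping groups ends up in the same super group, however the overlaps are arranged.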
Scope-based grouping leverages local and domain knowledge and centres around defining what scope means in your environment, and then grouping events based on a combination of that scope in conjunction with a defined time window. Analytics-based grouping looks at the event history to determine which events historically occur together, and then leverages these insights to group the events together if they occur in future. Topology-based event grouping lets the user define collections of resources in the topology - either specific items or patterns - and groups events that occur within those groups of resources within the same time window.
This blog focuses on scope-based event grouping, how to define scope, and how to tune and customise the resulting groups.
Scope-based event grouping
Scope-based event grouping works on the principle that if I get a collection of events from the same place at the same time, then I should group those events together. In this context, “same place” would be my defined scope, and “same time” would be a time window that defines how long I receive events for any given problem in my managed environment. These two elements combine in the scope-based grouping mechanism as the basis for doing the grouping.
What does scope mean?
The goal of event grouping is to correlate and group events together that relate to the same problem. Hence the scope of any problem is like a boundary that encircles all the events that relate to the problem. This might be a geographic boundary, or it might be a logical one, or it might be based on a combination of elements, such as Node and Service. Once there is an understanding of what is meant by scope, defining scope in various scenarios quickly becomes clear.
Geographic boundary example
Widgetcom is a wireless provider and has cell sites dotted all over the country. If there is an issue at a cell site, for example, a power failure, they might see events from multiple different systems at that site. For example, some events may relate to the building management systems like generators and air conditioning systems, and some might be from equipment housed in the building, such as telco switches. If there is such an issue, they will typically see a burst of events from the various different systems over a 10 minute period. In this scenario, the “scope” of the incident is the physical cell site location, hence using the geographic location as the scope for grouping makes sense, along with a time window setting of 10 minutes.
Logical grouping example
Businesscorp is a large-scale enterprise and their Watson AIOps solution supports many lines of business within the organisation. Each business unit owns a set of applications that run on a number of dedicated physical and virtual machines. Both the applications and the servers are heavily instrumented in terms of monitoring and tend to generate a lot of events. If there is any sort of issue on one of the machines, this usually manifests itself as alerts coming from either the applications, or the servers, or both, and these alerts tend to flow into Event Manager in bursts. Typically all events for a given issue will come in within a five-minute window. In this scenario, the “scope” of the problem would be the line of business, hence using the line-of-business ID as the scope for grouping makes sense, along with a time window setting of 5 minutes.
How does the time window work?
Scope-based grouping offers two time window options: a fixed time window, or a dynamic one. When the first event for a given scope occurs, the clock starts ticking. If the time window is fixed, then the group will be closed after the defined number of seconds has elapsed, regardless of whether more events are continuously streaming in for that group. Once a group is deemed closed, no further events will be added to it. If more events are received by Netcool after the group has closed, a new group will be created. If, however, the time window is dynamic, the expiry time for the group is extended each time a new event is received for the scope, effectively keeping the group open. In most cases, a dynamic time window is preferred over a fixed one, as it makes the system more flexible in its receipt of events.
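The difference between the two window types is easiest to see with a small simulation. The sketch below is a simplified Python model for a single scope, not the product's grouping engine. The names `Group` and `group_events` are hypothetical, and treating an event that arrives exactly at the expiry time as still inside the group is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    expiry: float                                   # time at which the group closes
    events: List[float] = field(default_factory=list)


def group_events(arrivals: List[float], window: float, dynamic: bool) -> List[Group]:
    """Group arrival times (in seconds) for ONE scope. Illustrative model only."""
    groups: List[Group] = []
    for t in sorted(arrivals):
        current = groups[-1] if groups else None
        if current is None or t > current.expiry:
            # No open group: this event creates one and starts the clock.
            current = Group(expiry=t + window)
            groups.append(current)
        elif dynamic:
            # Dynamic window: every new event pushes the expiry time out.
            current.expiry = t + window
        # Fixed window: the expiry set at creation is never extended.
        current.events.append(t)
    return groups


arrivals = [0, 200, 400, 650, 2000]   # hypothetical arrival times in seconds
fixed = group_events(arrivals, window=300, dynamic=False)
dynamic = group_events(arrivals, window=300, dynamic=True)
```

With a 300-second window, the fixed variant splits the first four events into two groups because the first group closes at 300 seconds, whereas the dynamic variant keeps them together because each arrival extends the expiry.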
When defining a scope in any scenario, it is important to keep in mind the significance of the time window. In the geographic boundary example above, not every alarm that ever comes from the same cell site relates to the same problem, of course, but all the alarms from the same cell site in the same time window probably do. Hence the scope does not necessarily have to be as granular as one might think. Sometimes a bigger “net” with a smaller time window can work best. Depending on the scenario, a little bit of testing will quickly help you arrive at a suitable scope and time window to use.
Which fields do I set?
Scope-based event grouping is most easily set up using the portlet in the WebGUI (see below). If you want to set it up in your Probe rules instead, you can set the fields described below.
The field where the scope is defined is called: ScopeID VARCHAR(255). Event grouping will occur automatically if the field ScopeID is set to a non-null value. If ScopeID is not set for an event, then the grouping automation will not act on that event, and that event will remain un-grouped by scope-based event grouping. Note that it still may be grouped by temporal or topology-based grouping (or both).
The field where the time window is defined is called: QuietPeriod INTEGER. The time window is called QuietPeriod because it primarily defines the time window in terms of the number of seconds that pass, after which no further events are received; that is, the period after which “it all goes quiet”. This is the dynamic time window scenario. Note that when the fixed time window is used, the same field is used, and the expiry time for the group is also defined as: group creation time + QuietPeriod. Unlike the dynamic time window, the expiry time for the group is never extended for a fixed time window, and represents a hard cut-off.
How does it work?
When an event is received into Event Manager, if the ScopeID field is set, the grouping engine will note its arrival and register the creation of a new group forming for that scope. If the ScopeID field is not set, it may subsequently be set via one of the scope-based event grouping policies defined via the UI, discussed later.
In Event Manager, the scope-based event grouping works in conjunction with the temporal and topology-based event grouping capabilities. The grouping engine will only create a synthetic parent event when any grouping has more than one event. This ensures the system will not create groups of one. Note that this is different to previous versions of Netcool Operations Insight scope-based event grouping where groups were created immediately on insertion. When a synthetic parent event is created, any child events are linked to this parent via the ParentIdentifier field.
The expiry time for the group will be set as: group creation time + the QuietPeriod of the incoming event. If the QuietPeriod is not set in the incoming event (i.e. is zero), then the grouping automation will use the global default property instead: CEAQuietPeriod from the master.cea_properties table. If the ScopeID value starts with the string: “FX:”, then a fixed time window is used. In any other case, a dynamic time window is assumed.
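The rules above for choosing the window type and the effective QuietPeriod can be summarised in a few lines. This is a Python sketch of the decision logic as described, not the actual automation. The names `WindowSettings`, `resolve_window` and `expiry_time` are hypothetical, and the default of 900 seconds is a placeholder rather than the product's out-of-the-box CEAQuietPeriod value.

```python
from typing import NamedTuple

# Hypothetical stand-in for the CEAQuietPeriod value held in master.cea_properties.
DEFAULT_QUIET_PERIOD = 900


class WindowSettings(NamedTuple):
    fixed: bool          # True when ScopeID starts with "FX:"
    quiet_period: int    # effective QuietPeriod in seconds


def resolve_window(scope_id: str, quiet_period: int,
                   default_quiet_period: int = DEFAULT_QUIET_PERIOD) -> WindowSettings:
    """Work out the window type and the QuietPeriod that applies to an event."""
    fixed = scope_id.startswith("FX:")
    # A QuietPeriod of zero means "not set", so fall back to the global default.
    effective = default_quiet_period if quiet_period == 0 else quiet_period
    return WindowSettings(fixed=fixed, quiet_period=effective)


def expiry_time(group_created_at: int, settings: WindowSettings) -> int:
    """Initial expiry: group creation time + QuietPeriod."""
    return group_created_at + settings.quiet_period


# Hypothetical events: one with a fixed-window scope and no QuietPeriod,
# one with a dynamic-window scope and an explicit five-minute QuietPeriod.
fixed_default = resolve_window("FX:London", 0)
dynamic_custom = resolve_window("London", 300)
```

For a dynamic window the expiry computed here is only the starting point, since it is extended as further events arrive for the scope.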
How can I try it out?
You can activate scope-based grouping by simply setting the ScopeID field to a non-null value in your Probe rules file or via a pre-insert ObjectServer trigger. Alternatively, you can set the scope via the scope-based grouping portlet in the UI. The Scope Based Grouping portlet can be accessed via the Insights menu in the WebGUI.
Once you have opened the portlet, you can create policies that define scope for different sets of events.
OLDER VERSIONS
If you have an older version of Netcool/OMNIbus or Netcool Operations Insight and do not have this portlet in your WebGUI setup, you can follow these steps:
- Install scope-based event grouping from the OMNIbus extensions directory ($OMNIHOME/extensions/eventgrouping)
- Open the Netcool Administrator tool and create a new insert database trigger on the alerts.status table
- Add an if-elseif construct in the trigger to set ScopeID, depending on the event type
NOTE: Installation instructions for scope-based event grouping can be found in the IBM Knowledge Center.
NOTE: Even if you have the portlet, you may still need to install scope-based event grouping from the OMNIbus extensions directory, if you have not already done so.
The following is an example ObjectServer database trigger that sets up ScopeID for incoming events - not required if using the UI above:
    CREATE OR REPLACE TRIGGER widgetcom_set_scopeid
    GROUP widgetcom_triggers
    PRIORITY 1
    COMMENT 'Sets ScopeID on incoming events'
    BEFORE INSERT ON alerts.status
    FOR EACH ROW
    WHEN get_prop_value('ActingPrimary') %= 'TRUE' and new.ScopeID = ''
    begin
        -- SET ScopeID BASED ON Location FOR Class 100 EVENTS
        if (new.Class = 100) then
            set new.ScopeID = new.Location;
        -- SET ScopeID BASED ON AlertGroup FOR Class 200 EVENTS
        -- ALSO SET QuietPeriod TO 5 MINUTES (300 SECONDS) FOR THESE EVENTS,
        -- OVERRIDING THE GLOBAL DEFAULT
        elseif (new.Class = 200) then
            set new.ScopeID = new.AlertGroup;
            set new.QuietPeriod = 300;
        -- ELSE SET ScopeID BASED ON Node
        else
            set new.ScopeID = new.Node;
        end if;
    end;
    go
NOTE: The above code can be placed into a file and ingested into your Netcool/OMNIbus ObjectServer via the nco_sql command:
    $OMNIHOME/bin/nco_sql -server AGG_P -user root -password netcool < set_scopeid.sql
Where should I set my ScopeID?
The ScopeID field can be set anywhere - but the most common places are:
- The Scope Event Grouping portlet - found in the Insights menu in WebGUI (preferred)
- ObjectServer trigger (as per the above example) - if the ScopeID value is contained within the incoming event data
- Probe rules file - again, if the ScopeID value is contained within the incoming event data
- Netcool/Impact policy - if the ScopeID needs to be looked up in an external system, like a CMDB
Note: if the ScopeID is set after insertion into Event Manager, it is essential that events are processed in the order they were received, e.g. ordered by Serial ascending.
Identifying and leveraging priority child event information
Event Manager event grouping includes a capability to identify the priority child event in a grouping, and then propagate elements of that child event up to the parent event. There are four built-in options for selecting the priority child event:
- Choose the event with the highest CauseWeight
- Choose the event with the highest ImpactWeight
- Choose the event with the first FirstOccurrence (i.e. the first event entering the group)
- Choose the event with the last LastOccurrence (i.e. the last event entering the group)
NOTES:
- The priority option is global and only one can be in-use at a time.
- The details of the highest priority event are remembered, even if that child event clears and is deleted. The stored details are only replaced if a child event enters the group that has a higher priority.
- If CauseWeight and ImpactWeight are set in any child events, the highest value in each case will automatically propagate to the respective fields in the parent event.
- Using the highest CauseWeight to identify the priority child event tends to be the most popular choice amongst Netcool practitioners worldwide. It requires the additional step of defining cause weights for the incoming events in order to be of any benefit.
- More information on event weighting and standard templates can be found in the IBM Knowledge Center.
CUSTOMTEXT FIELD
One of the fields each event in Event Manager has is the CustomText field. For each “real” event, this field should be populated with any data that needs to be propagated up to the parent event, if this event were to be identified as the priority child event. For example, CustomText contents might include the concatenation of certain key fields or custom fields from the child event.
For example, I might augment my insert trigger further to set the CauseWeight and CustomText, based on a couple of custom fields:
    -- SET ScopeID BASED ON Location FOR Class 100 EVENTS
    if (new.Class = 100) then
        set new.ScopeID = new.Location;
        set new.CauseWeight = 1000;
        set new.CustomText = new.WidgetCharField + ':' + to_char(new.WidgetIntField);
    ...
Once the priority child event has been identified based on one of the four criteria, the CustomText field from the priority child is automatically copied to the CustomText field of the parent event. The CustomText field in the parent event can then be included in the parent event’s Summary field (enable the property SEGUseScopeIDCustomText; see notes below), or sent over to a ticketing system to provide additional detail around the probable cause of an incident.
Customising scope-based event grouping
In OCP-based deployments of Netcool Operations Insight or Event Manager, scope-based event grouping is customised by modifying properties in the master.cea_properties table. Each record in the master.cea_properties table has the fields: CharValue VARCHAR(255) and IntValue INTEGER. Only one of the two fields will be used in each case, depending on what the property is for. For example, the CEAQuietPeriod property or a boolean-type property will make use of the IntValue value, whereas a property specifying a prefix label for the synthetic event’s Summary field will make use of the CharValue value.
The various properties come pre-set with out-of-the-box values and are documented in the IBM Knowledge Center.
PRIORITY CHILD EVENT SELECTION
The following properties define how the priority child event of a group is selected:
CEAPropagateTextToScopeIDParentCause: set IntValue to 1 (default is 0) to specify that the priority child is based on highest CauseWeight value
CEAPropagateTextToScopeIDParentImpact: set IntValue to 1 (default is 0) to specify that the priority child is based on highest ImpactWeight value
CEAPropagateTextToScopeIDParentFirst: set IntValue to 1 (default is 0) to specify that the priority child is based on first FirstOccurrence value
CEAPropagateTextToScopeIDParentLast: set IntValue to 1 (default is 0) to specify that the priority child is based on last LastOccurrence value
NOTE: If more than one of the above options is selected, the grouping automation will apply the order of precedence listed above, with CauseWeight taking the highest precedence.
If any of these options are selected, the CustomText field of the priority child event will be automatically propagated to the CustomText field of the parent event. The CustomText field value of the parent will not change unless a new child event with a higher priority value enters the group. Additionally, even if the child event with the highest priority subsequently clears and is deleted from Netcool/OMNIbus, the CustomText and priority information about that child event will still be retained in the parent event, and only updated if a new child event with higher priority subsequently enters the group.
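The selection and retention behaviour described above can be modelled in a few lines of Python. This is an illustrative sketch only, not the product's automation. The classes `Child` and `Parent` and the functions `active_option` and `priority_score` are hypothetical names, and the sample weights and CustomText values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Order of precedence described above, highest first.
PRECEDENCE = [
    "CEAPropagateTextToScopeIDParentCause",
    "CEAPropagateTextToScopeIDParentImpact",
    "CEAPropagateTextToScopeIDParentFirst",
    "CEAPropagateTextToScopeIDParentLast",
]


@dataclass
class Child:
    cause_weight: int
    impact_weight: int
    first_occurrence: int
    last_occurrence: int
    custom_text: str


def active_option(props: Dict[str, int]) -> Optional[str]:
    """Return the enabled option with the highest precedence, if any."""
    for name in PRECEDENCE:
        if props.get(name, 0) == 1:
            return name
    return None


def priority_score(child: Child, option: str) -> int:
    """Higher score means higher priority under the given option."""
    if option.endswith("Cause"):
        return child.cause_weight
    if option.endswith("Impact"):
        return child.impact_weight
    if option.endswith("First"):
        return -child.first_occurrence   # earliest FirstOccurrence wins
    return child.last_occurrence         # latest LastOccurrence wins


class Parent:
    def __init__(self) -> None:
        self.custom_text = ""
        self.best_score: Optional[int] = None

    def child_entered(self, child: Child, option: str) -> None:
        # Only a strictly higher-priority child replaces the stored details,
        # so they survive even after the original child is cleared and deleted.
        score = priority_score(child, option)
        if self.best_score is None or score > self.best_score:
            self.best_score = score
            self.custom_text = child.custom_text


# Two options are enabled, so the cause-weight option takes precedence.
props = {PRECEDENCE[0]: 1, PRECEDENCE[1]: 1}
option = active_option(props)
parent = Parent()
for child in [Child(100, 0, 10, 10, "fan:3"),
              Child(1000, 0, 20, 20, "power:17"),
              Child(500, 0, 30, 30, "link:9")]:
    parent.child_entered(child, option)
```

Because the model has no step that removes a child's details from the parent, the stored CustomText only ever changes when a higher-scoring child arrives, which mirrors the retention behaviour described above.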
ACTIVATE JOURNALING OF CHILD EVENTS
A very useful capability of scope-based event grouping is the automatic journaling of child event details to the journal of the parent event. This feature captures the forensic history of the events that have passed through the group, which is particularly valuable if the underlying events are transient or flapping. The resulting listing of child events can be viewed both from the Event Viewer in WebGUI and from the ticket work log. Automatic journal propagation from a ticketed event is an out-of-the-box feature of Netcool Gateways.
NOTE: The automatic journaling of child events is disabled by default since it creates journals in the ObjectServer and therefore adds load to the system. It is up to the customer to enable this feature and to carry out due-diligence load testing on a non-production system prior to use.
The following properties relate to the creation of journals in the parent events that contain details about the child events:
CEAJournalToScopeIDParent: set IntValue to 1 (default is 0) to activate journaling of child events to parent event
CEAJournalServerNameServerSerial: set IntValue to 1 (default is 1) to include each child event’s ServerName and ServerSerial fields in the journal detail
CEAJournalNode: set IntValue to 1 (default is 1) to include each child event’s Node field in the journal detail
CEAJournalSummary: set IntValue to 1 (default is 1) to include each child event’s Summary field in the journal detail
CEAJournalAlertKey: set IntValue to 1 (default is 1) to include each child event’s AlertKey field in the journal detail
CEAJournalCustomText: set IntValue to 1 (default is 1) to include each child event’s CustomText field in the journal detail
Can I do sub-grouping?
Scope-based event grouping occurs for those events where a non-null value is set in the ScopeID field. It is possible to cause child events to be sub-grouped under the ScopeID group by setting a non-null value in the SiteName field.
NOTE: The sub-grouping field is called “SiteName” due to legacy reasons. For practicality, it is better to think of it as “SubGroup” instead.
If sub-grouping is required, the SiteName field must be set either at the same time as or before ScopeID is set. Because grouping occurs the moment the ScopeID field is set (as a database trigger), SiteName will only be taken into account at the time the grouping is done. If it is not set at the time the grouping automation fires, the automation will assume sub-grouping is not required for this child event. Note that a ScopeID group may contain both direct child events and sub-groups, hence some child events in a ScopeID grouping might have SiteName set and sit under a sub-grouping, and some might not and be direct child events of the ScopeID parent event.
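The resulting two-level structure can be pictured with a short Python sketch. It only models how ScopeID and SiteName determine where a child event sits. It ignores the time window entirely, the function name `build_hierarchy` is hypothetical, and the sample field values are made up.

```python
from collections import defaultdict
from typing import Dict, List

Event = Dict[str, str]


def build_hierarchy(events: List[Event]) -> Dict[str, Dict[str, List[str]]]:
    """Map ScopeID -> sub-group (SiteName) -> event identifiers.

    The empty-string key holds direct children of the ScopeID parent.
    Illustrative model only; the time window is not considered.
    """
    tree: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for event in events:
        scope = event.get("ScopeID", "")
        if not scope:
            continue  # no ScopeID: not handled by scope-based grouping
        tree[scope][event.get("SiteName", "")].append(event["Identifier"])
    return {scope: dict(subs) for scope, subs in tree.items()}


# Hypothetical events: two in a sub-group, one direct child, one un-scoped.
events = [
    {"Identifier": "a", "ScopeID": "LOB-7", "SiteName": "web-tier"},
    {"Identifier": "b", "ScopeID": "LOB-7", "SiteName": "web-tier"},
    {"Identifier": "c", "ScopeID": "LOB-7", "SiteName": ""},
    {"Identifier": "d", "ScopeID": "", "SiteName": "web-tier"},
]
hierarchy = build_hierarchy(events)
```

Event "d" is left out because SiteName on its own does nothing; sub-grouping only applies to events that already have a ScopeID.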
There are a number of properties in the master.cea_properties table that relate to handling sub-groups. They follow along the same lines as the properties described above and are documented along with them in the same place on IBM Knowledge Center.
Delay ticketing for event groups
Customers typically want to leverage the priority child propagation feature of scope-based event grouping, so that the parent event reflects elements of the highest priority child event. Often however, it can take some time for the priority child event to arrive into Netcool. Many customers therefore delay the ticket creation off the parent event to allow time for the priority event to arrive, so that they can set up the parent event suitably, prior to ticketing.
If this approach is taken, the ticketing integration can be configured to only act on parent events that are of at least a certain age. Since the parent event FirstOccurrence and LastOccurrence fields always reflect the first and last occurrences respectively of the underlying children instead of the first and last occurrence of the parent event itself, it is not ideal to use either of these fields in the ticketing filter, in order to delay ticketing of the parent event. A more suitable field to use for this purpose instead is: InternalLast. Since the parent event is only ever inserted once and never deduplicated, the InternalLast field reflects the true first and last occurrence of the parent event.
An example ticketing filter that only tickets parent events that are older than 5 minutes therefore is:
    AlertGroup = 'CEACorrelationKeyParent' and InternalLast < getdate - 300
NOTE: Synthetic parent events in Event Manager can be identified by: AlertGroup = 'CEACorrelationKeyParent'.
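If the ticketing decision is made outside the ObjectServer, for example in a custom integration script, the same condition can be expressed as a simple predicate. The following Python sketch assumes events are available as dictionaries keyed by ObjectServer field name with InternalLast as a Unix timestamp. The function name `ready_for_ticket` and its parameters are hypothetical.

```python
import time
from typing import Any, Mapping, Optional

PARENT_ALERT_GROUP = "CEACorrelationKeyParent"


def ready_for_ticket(event: Mapping[str, Any], min_age_seconds: int = 300,
                     now: Optional[float] = None) -> bool:
    """True for synthetic parent events older than min_age_seconds.

    Mirrors the filter: AlertGroup = 'CEACorrelationKeyParent'
    and InternalLast < getdate - 300
    """
    now = time.time() if now is None else now
    return (event["AlertGroup"] == PARENT_ALERT_GROUP
            and event["InternalLast"] < now - min_age_seconds)


# Hypothetical parent events checked against a fixed "current time" of 10000.
old_parent = {"AlertGroup": PARENT_ALERT_GROUP, "InternalLast": 9000}
new_parent = {"AlertGroup": PARENT_ALERT_GROUP, "InternalLast": 9900}
old_is_ready = ready_for_ticket(old_parent, now=10000)
new_is_ready = ready_for_ticket(new_parent, now=10000)
```

Passing `now` explicitly keeps the check easy to test; in normal use it can be left out so the current time is used.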
Other notes
ON-PREMISE VERSUS OCP-BASED SYSTEMS
If your system is not OCP-based, that is, it is not using the new Cloud Native Event Analytics engine to do event grouping, then the properties table mentioned above will be called master.properties instead of master.cea_properties. Similarly, the properties mentioned above will be prefixed by SEG instead of CEA.
UPDATE YOUR REPORTER HISTORIC EVENT ARCHIVE SCHEMA
If your Netcool system has been in-place for a long time, it is possible that the new fields added by scope-based event grouping and IBM Netcool Operations Insight capabilities are not present in your Netcool REPORTER historic event archive database. If this is the case, they should be added to both the database schema as well as the Gateway mapping. This will be important later when the Event Analytics capabilities are applied to the groupings done by scope-based event grouping. Use the following steps to add these columns to your REPORTER database and Gateway mapping:
Summary
On the face of it, scope-based event grouping may seem like a lot of work to implement. In practice, the reality is the opposite, and the potential returns are high. Ultimately, scope-based event grouping is enabled simply by setting ScopeID, and this can be done most easily by creating one or more policies in the scope-based event grouping portlet. Many customers around the world, once they understand how the grouping mechanism works and what scope means, have preliminary and meaningful grouping working in under an hour. The rest of the configuration, such as event weighting and priority child identification, consists of optional tuning and customisation tasks. The many properties and tuning options exist simply to allow customers to tune the resulting grouping and the look and feel of the resulting parent event. Each property and option in fact exists because of specific feature requests from customers.
And the relatively small amount of work to set up scope-based grouping is worth it. Most customers see upwards of a 70% row reduction in their Event Viewers, and enjoy the tremendous benefits of having events organised by problem - bringing order to the chaos. From a financial standpoint, the savings are clear. One large North American communications provider recently reduced their ticket numbers by 75% by leveraging scope-based event grouping. This was achieved by applying scope-based event grouping, and then only auto-ticketing off groups, rather than off individual events. This allowed them to have one ticket per incident, and have all of the related event detail recorded in that same ticket, courtesy of the journaling feature. Another wireless telco customer cited a saving of just under 1M USD per annum due to savings in ticket creation, and over 3M USD due to reduced MTTR, by leveraging scope-based event grouping. The case for its use and application therefore is clear.