Watson for AIOps Event Manager customisable probable-cause analysis

By Zane Bray posted Mon September 26, 2022 04:40 AM

Watson for AIOps Event Manager comes with a built-in probable-cause analysis engine. It works on child events that have been grouped via topological correlation by analysing corresponding resource relationships within the topology as well as doing an analysis of the event itself. While probable-cause analysis in Event Manager is fully automated and requires no configuration, it is often desirable to be able to tune probable-cause analysis or add additional criteria.

A new feature in Event Manager (also known as Netcool Operations Insight version 1.6.6) is the ability to customise probable-cause analysis. It is packaged as two Netcool/Impact policies, each with its own Policy Activator Service, and is disabled by default. Users can simply start these Services with the default settings, or tune the policies before use. Note that it does not replace the internal probable-cause analysis engine; rather, it augments it and runs in addition to it.

The two Policies and Services can be found in Netcool/Impact under the project: EventManagementAnalytics

The customisable probable-cause analysis engine carries out additional probable-cause analysis on all child events, and takes into account the following criteria:

  • the Severity of the event
  • whether the event is the first one chronologically amongst its peers
  • whether Network Manager has marked it as a Root Cause event
  • whether the Summary field contains certain keywords - users can also add their own keywords to the list
  • whether scope-based event grouping CauseWeight should be taken into account

Probable-cause analysis is deferred for each event until it is at least 60 seconds old (configurable). This allows the internal probable-cause analysis engine to process the event first, and gives Network Manager time to do a Root Cause Analysis (RCA) of the event. This Impact-based probable-cause engine then augments the internal one by adding probable-cause "booster points" to the CEAEventScore field, based on the criteria listed above. The more criteria an event meets, the more booster points it receives, and the more likely it is to bubble to the top as the most likely probable-cause event in the group. Where two events end up with the same probable-cause score, Event Manager simply marks both events as potential probable causes (i.e. with blue coloured bars). In this case, the event whose details get propagated to the synthetic parent event is indeterminate. This is acceptable, since either one could legitimately be the probable cause, and so choosing either one to headline the group is logical.
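The deferral can be pictured with a small sketch. This is not the Impact policy itself, only an illustration in Python under some assumptions: that an event's age is measured from a first-occurrence timestamp, that "order received" can be approximated by sorting on that timestamp, and that a non-zero Grade value means "already processed" (the post describes the Grade flag further down). The Event class and select_batch function are hypothetical names.

```python
from dataclasses import dataclass

PROCESSING_DELAY = 60   # seconds, mirrors the ProcessingDelay default
BATCH_SIZE = 1000       # mirrors the BatchSize default


@dataclass
class Event:
    serial: int            # hypothetical unique identifier
    first_occurrence: int  # epoch seconds when the event was first seen (assumed age basis)
    grade: int = 0         # non-zero means "already processed"


def select_batch(events, now, delay=PROCESSING_DELAY, batch_size=BATCH_SIZE):
    """Return unprocessed events that are at least `delay` seconds old, oldest first."""
    eligible = [
        e for e in events
        if e.grade == 0 and now - e.first_occurrence >= delay
    ]
    # Assumption: arrival order is approximated by first_occurrence, then serial.
    eligible.sort(key=lambda e: (e.first_occurrence, e.serial))
    return eligible[:batch_size]
```

Events younger than the delay are simply left for a later run, which is what gives the internal engine and Network Manager time to finish first.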

Probable-cause is represented in the Event View as a percentage bar, relative to the other events in the group. Hence it doesn't matter what the actual integer value of CEAEventScore is, only what it is relative to the other events in the group. As long as a consistent approach is taken to all events being processed, probable-cause analysis will be done correctly.
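To illustrate why only the relative values matter, here is a minimal Python sketch. The exact formula Event Manager uses to draw the percentage bar is not given in this post, so scaling each score against the highest score in the group is purely an assumption, and both function names are hypothetical.

```python
def relative_percentages(scores):
    """Map each event id to its score as a percentage of the group's top score."""
    top = max(scores.values(), default=0)
    if top <= 0:
        return {event_id: 0.0 for event_id in scores}
    return {event_id: 100.0 * score / top for event_id, score in scores.items()}


def probable_causes(scores):
    """Return every event id that shares the highest score (ties are all kept)."""
    top = max(scores.values(), default=None)
    return sorted(event_id for event_id, score in scores.items() if score == top)
```

Under this assumption, multiplying every score in a group by ten leaves the percentages unchanged, and two events with the same top score are both returned as potential probable causes.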

The customisable probable-cause analysis engine policy (called ProbableCauseAnalysis) has a list of user-modifiable parameters at the top. Here you can modify the number of booster points that are added for each criterion, if the default values are not suited to your needs. For example, the default score boosters based on initial Severity range from 10 to 60 across the parameters Sev1 to Sev6.

A key configurable section is the keywords section, which is used to analyse the Summary field. For each keyword in the list that is present in the Summary field, the engine adds the corresponding booster points for that keyword. It comes with a default list of keywords out-of-the-box, each one with a corresponding booster score. Words that relate to hardware failures are weighted higher than words that relate to performance slow-down, for example. All of these score boosters can be modified as the user sees fit. A particular convenience of this analysis dimension is that users can add their own keywords to the list, or remove ones that they don't need. The user then decides the importance of each keyword by specifying a relative point booster for each one:

The format of the table entries is: {"keyword", case-sensitivity[0|1], score-boost}

For example, I might add the following three keywords to the bottom of the list, together with each one's case-sensitivity specification, and booster score:

{"CORE-ROUTER", 1, 50},
{"GigabitEthernet0/0", 0, 50},
{"crashed", 0, 60}

NOTE: the second parameter is whether the keyword is case-sensitive (0 = no, 1 = yes).
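As a rough illustration of how such a table could be applied, the following Python sketch uses the three example entries above. It is not the policy code: plain substring matching is an assumption about how a keyword is judged to be "present", and keyword_boost is a hypothetical name. The cumulative addition of points matches the behaviour described for the SummaryKeywords parameter.

```python
# Entries follow the same shape as the policy table:
# (keyword, case-sensitivity [0|1], score-boost)
SUMMARY_KEYWORDS = [
    ("CORE-ROUTER", 1, 50),
    ("GigabitEthernet0/0", 0, 50),
    ("crashed", 0, 60),
]


def keyword_boost(summary, keywords=SUMMARY_KEYWORDS):
    """Sum the boosts of every keyword found in the Summary text."""
    total = 0
    for keyword, case_sensitive, boost in keywords:
        if case_sensitive:
            found = keyword in summary
        else:
            # Case-insensitive entries are compared in lower case.
            found = keyword.lower() in summary.lower()
        if found:
            total += boost  # points accumulate across matching keywords
    return total
```

With these entries, a Summary of "core-router CR01 has CRASHED" would only pick up the "crashed" boost, because "CORE-ROUTER" is flagged as case-sensitive.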

A summary of the user-modifiable parameters is as follows. Each entry gives the parameter name, its default value, and a description:

  • DatasourceName (default: "defaultobjectserver"): This parameter refers to the Netcool/Impact name of the datasource this instance of the policy is to process events from. Under most circumstances, this parameter does not need to be changed.
  • ProbableCauseField (default: "CEAEventScore"): This parameter refers to the name of the field that the probable-cause score booster points will be added to. Under most circumstances, this parameter does not need to be changed. This policy can be used to carry out probable-cause analysis in on-premise deployments where CauseWeight is being used to store probable-cause weighting. In this case, simply change the parameter value to "CauseWeight". This parameter is used in conjunction with the next one: ParentAlertGroup.
  • ParentAlertGroup (default: "CEACorrelationKeyParent"): This parameter refers to the AlertGroup used by synthetic parent events. Under most circumstances, this parameter does not need to be changed. This policy can be used to carry out probable-cause analysis in on-premise deployments where parent events use the AlertGroup "ScopeIDParent" instead. This parameter is used in conjunction with the previous one: ProbableCauseField.
  • ProcessingDelay (default: 60): This is the processing delay, in seconds, introduced to allow internal probable-cause analysis to happen first, as well as Network Manager RCA. Under most circumstances, 60 seconds should be sufficient; however, it may be extended in scenarios where CNEA or Network Manager event processing is taking longer to occur.
  • BatchSize (default: 1000): The probable-cause engine runs under a Policy Activator and by default processes up to 1000 events at a time, in the order they were received by the Event Manager. Under most circumstances, this parameter does not need to be changed.
  • Sev* (default: 10 to 60): The Severity score boosters (Sev1 to Sev6) boost each event's probable-cause score based on its Severity. Note that a Severity of 6 is present to cater to Fatal events that may be received from ITM.
  • FirstEventBoost (default: 100): This is the score booster added to the first event in an event group. The chronologically first event in a set of events has a high probability of being a probable cause, hence the high weighting of this criterion.
  • ITNMRootCauseBoost (default: 110): This is the score booster added if Network Manager has determined that this event is a root-cause event. Root cause events have a high probability of being a probable cause, hence the high weighting of this criterion.
  • SummaryKeywords (default: various): This is an array that contains a list of the keywords along with corresponding score boosters that would increase an event's likelihood of being a probable-cause event if any of the words are present in an event's Summary field. Note that the corresponding points for each keyword are added cumulatively, for each keyword in the list that is present in the Summary field.
  • AddSBEGCauseWeight (default: 1): This parameter is either 0 (disabled) or 1 (enabled) and simply tells the automation to add any values found in the CauseWeight field to the overall score. These values are significant where customers have invested in defining CauseWeight values in their environments. If not set, the default value of the CauseWeight field is zero (0), so it won't skew the probable-cause score in events where CauseWeight is not set. If this Policy is being run against an on-premise system, and ProbableCauseField is set to CauseWeight, then this property should be disabled, otherwise any CauseWeight values may inadvertently be doubled.
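Putting the parameters together, the overall boost for one child event might be computed along the following lines. This is a hedged Python sketch, not the shipped policy: the intermediate Severity values (even steps of 10 between Sev1 and Sev6) are an assumption, since only the 10 to 60 range is stated, and the ChildEvent class, its fields, and score_boost are hypothetical names. The keyword boost is passed in as a precomputed number.

```python
from dataclasses import dataclass

# Assumption: even steps of 10; the post only states the range 10 to 60.
SEVERITY_BOOST = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50, 6: 60}
FIRST_EVENT_BOOST = 100       # FirstEventBoost default
ITNM_ROOT_CAUSE_BOOST = 110   # ITNMRootCauseBoost default
ADD_SBEG_CAUSE_WEIGHT = 1     # AddSBEGCauseWeight default (1 = enabled)


@dataclass
class ChildEvent:
    severity: int
    is_first_in_group: bool   # chronologically first amongst its peers
    itnm_root_cause: bool     # Network Manager marked it as root cause
    keyword_boost: int = 0    # sum of matching Summary keyword boosts
    cause_weight: int = 0     # scope-based event grouping CauseWeight


def score_boost(event, add_cause_weight=ADD_SBEG_CAUSE_WEIGHT):
    """Return the booster points to add to the event's probable-cause score."""
    boost = SEVERITY_BOOST.get(event.severity, 0)
    if event.is_first_in_group:
        boost += FIRST_EVENT_BOOST
    if event.itnm_root_cause:
        boost += ITNM_ROOT_CAUSE_BOOST
    boost += event.keyword_boost
    if add_cause_weight:
        boost += event.cause_weight
    return boost
```

Setting add_cause_weight to 0 corresponds to the on-premise case described above, where the score is already being written to CauseWeight and adding it again would double it.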

The default behaviour of the Event Manager grouping engine is to create synthetic parent events and automatically copy into each parent's Summary field the Summary value from the child event with the highest probable-cause score. By default, it sets the Node field of the parent event to the unique group identifier for the group.

The second of the two Impact policies included in this project (called SetNodeOnProbableCause) modifies this behaviour by also synchronising the Node field value from the child event with the highest probable-cause score up to the Node field in the synthetic parent event. Some users find this more useful than having the internal group identifier in the Node field of the synthetic parent event.

It does this by comparing the Summary of the parent event to that of the child events. Where it finds a match, it copies the Node field from the child event up to the parent event, if they're not aligned already. It assumes that the Summary field has already been copied up to the parent event by the grouping engine, so the child event with the matching Summary field is the one to copy from. Where more than one of the child events has the same Summary field value, it copies the Node field from the child event with the highest probable-cause score, since this is the one the grouping engine will have used to copy the Summary field value from.
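The matching logic described above can be sketched as follows. This is an illustrative Python version rather than the Impact policy, with events modelled as plain dictionaries keyed by the field names used in this post; the two function names are hypothetical.

```python
def node_for_parent(parent, children):
    """Pick the Node of the best-scoring child whose Summary matches the parent."""
    matches = [c for c in children if c["Summary"] == parent["Summary"]]
    if not matches:
        return None
    # Several children may share the Summary; prefer the highest score.
    best = max(matches, key=lambda c: c["CEAEventScore"])
    return best["Node"]


def sync_parent_node(parent, children):
    """Copy the Node up to the parent if needed; return True when it changed."""
    node = node_for_parent(parent, children)
    if node is not None and parent["Node"] != node:
        parent["Node"] = node
        return True
    return False
```

A second call on the same group returns False, reflecting the "if they're not aligned already" check.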

The only two user-modifiable properties in this Policy are:

DatasourceName = "defaultobjectserver";
ProcessingDelay = 60;

The datasource name can be changed where the policy is being used against an ObjectServer pair other than the default pair. The processing delay is introduced to allow time for the probable-cause analysis to happen first. Under most circumstances, neither of these parameters needs to be changed.

Probable-cause analysis marks events as processed by setting the Grade field to a non-zero value. If this field is being used for some other purpose, it will likely interfere with this probable-cause function. In this case, you can do a search-and-replace for the Grade field in the policy ProbableCauseAnalysis and replace Grade with one of the following alternatives (if unused in your environment):

  • Poll
  • PhysicalSlot
  • PhysicalPort
  • X733EventType
  • X733ProbableCause

If all of the above fields are already used, you can alternatively create a new, dedicated flag field via the $OMNIHOME/bin/nco_sql utility for this purpose - for example:

ALTER TABLE alerts.status ADD COLUMN ProbableCause INTEGER;
go

To add a new custom field in an OCP-based environment:

  • Log into your OCP console
  • Navigate to: Workloads > Secrets, search for omni-secret, copy OMNIBUS_ROOT_PASSWORD to your clipboard
  • Navigate to: Workloads > Pods, search for ncoprimary-0, select the pod, and choose the Terminal tab
  • Run: /opt/IBM/tivoli/netcool/omnibus/bin/nco_sql -server AGG_P -user root -password <OMNIBUS_ROOT_PASSWORD>
  • Enter the above SQL into the command prompt to create the field
  • Do a search-and-replace in the policy ProbableCauseAnalysis, substituting Grade with your new field name and Save it

You will need to repeat this process in the backup ObjectServer AGG_B located in the ncobackup-0 pod, and add the new field to the bidirectional Aggregation Gateway mapping stored in the objserv-agg-backup-config ConfigMap so that this field gets replicated between the primary and backup Aggregation ObjectServers. See the product documentation for further information on how to replace the default bidirectional Aggregation Gateway mapping with a new one.

This new functionality allows a user to extend and customise the process of probable-cause analysis and generally tune its prioritisation logic. The provided Policies can also be used out-of-the-box without any modification to enable additional criteria to be considered in probable-cause analysis.

If these two functions are to be used long-term, it is recommended to edit the Services and check the box: "Starts automatically when server starts". This will ensure these Services restart automatically, in the event of a server restart.
1 comment



Mon September 26, 2022 05:03 AM

Great write up, let's make a video of this too.