Watson for AIOps Event Manager comes with a built-in probable-cause analysis engine. It works on child events that have been grouped via topological correlation by analysing corresponding resource relationships within the topology as well as doing an analysis of the event itself. While probable-cause analysis in Event Manager is fully automated and requires no configuration, it is often desirable to be able to tune probable-cause analysis or add additional criteria.
A new feature in Event Manager (also known as Netcool Operations Insight version 1.6.6) is the ability to customise probable-cause analysis. It is packaged as two Netcool/Impact policies, each with its own Policy Activator Service, and is disabled by default. Users can simply start these Services using the default settings, or tune the policies before use. Note that this feature does not replace the internal probable-cause analysis engine; rather, it augments it and runs in addition to the built-in engine.
The two Policies and Services can be found in Netcool/Impact under the project:
EventManagementAnalytics
PROBABLE CAUSE ANALYSIS
The customisable probable-cause analysis engine carries out additional probable-cause analysis on all child events, and takes into account the following criteria:
- the Severity of the event
- if the event is the first one chronologically amongst its peers
- if Network Manager has marked it as a Root Cause event
- whether the Summary field contains any of a list of keywords (users can also add their own keywords to the list)
- whether the scope-based event grouping CauseWeight should be taken into account
Probable-cause analysis is deferred for each event until it is at least 60 seconds old (configurable). This is to allow the internal probable-cause analysis engine to process the event first, and also for Network Manager to do a Root Cause Analysis (RCA) of the event. This Impact-based probable-cause engine then augments the internal one by adding probable-cause "boost points" (to the CEAEventScore field) based on the criteria listed above. The more criteria an event meets, the more boost points it receives, increasing the likelihood that it will bubble to the top and become the most likely probable-cause event in the group. Where two events end up with the same probable-cause score, Event Manager simply marks both events as potential probable causes (i.e. with blue coloured bars). In this case, which event's details get propagated to the synthetic parent event is indeterminate. This is acceptable, since either one could legitimately be the probable cause, and so choosing either one to headline the group is logical.
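The additive nature of the scoring can be illustrated with a minimal Python sketch. This is not the shipped Netcool/Impact policy: the `Event` class and `boost_points` function are hypothetical names, the first-event and Network Manager boosts use the default values from the parameter table later in this article, and the individual per-severity steps are an assumption (the article only gives the range 10 to 60). Keyword boosts are left out for brevity.

```python
from dataclasses import dataclass

# FirstEventBoost and ITNMRootCauseBoost defaults are taken from the article.
# The per-severity steps are an ASSUMPTION: only the range 10 to 60 is documented.
SEVERITY_BOOST = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50, 6: 60}
FIRST_EVENT_BOOST = 100
ITNM_ROOT_CAUSE_BOOST = 110


@dataclass
class Event:
    """Hypothetical stand-in for a child event row."""
    severity: int
    is_first_in_group: bool = False
    is_itnm_root_cause: bool = False
    cause_weight: int = 0


def boost_points(event, add_cause_weight=True):
    """Sum the boost points for every criterion the event meets."""
    points = SEVERITY_BOOST.get(event.severity, 0)
    if event.is_first_in_group:
        points += FIRST_EVENT_BOOST
    if event.is_itnm_root_cause:
        points += ITNM_ROOT_CAUSE_BOOST
    if add_cause_weight:
        # Mirrors the AddSBEGCauseWeight switch: add any CauseWeight value.
        points += event.cause_weight
    return points


critical_first = Event(severity=5, is_first_in_group=True)
print(boost_points(critical_first))
```

Each criterion contributes independently, so an event that is both first in its group and flagged by Network Manager accumulates both boosts on top of its severity boost.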
Probable-cause is represented in the Event View as a percentage bar, relative to the other events in the group. Hence it doesn't matter what the actual integer value of CEAEventScore is, only what it is relative to the other events in the group. As long as a consistent approach is taken to all events being processed, probable-cause analysis will be done correctly.
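The point about relative values can be shown with a small Python sketch. It does not reproduce how Event Manager draws its percentage bars; it only demonstrates that the top-ranked event, including ties, is unchanged when every score in the group is scaled consistently. The function name `probable_causes` and the sample event names are hypothetical.

```python
def probable_causes(scores):
    """Return the names of every event sharing the highest score in the group."""
    if not scores:
        return []
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


group = {"link-down": 260, "bgp-flap": 150, "high-latency": 40}
# Scaling every score by the same factor leaves the ranking untouched.
scaled = {name: score * 10 for name, score in group.items()}

print(probable_causes(group))
print(probable_causes(scaled))
# Two events with the same top score are both reported as candidates.
print(probable_causes({"a": 200, "b": 200, "c": 90}))
```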
TUNABLE PROPERTIES
The customisable probable-cause analysis engine policy (called ProbableCauseAnalysis) has a list of user-modifiable parameters at the top. Here you can modify the number of booster points that are added for each criterion, if the default values are not suited to your needs. For example, the default score boosters based on initial Severity range from 10 to 60 (Sev1 to Sev6).
A key configurable section is the keywords section, which is used to analyse the Summary field. For each keyword in the list that is present in the Summary field, the engine adds the corresponding booster points for that keyword. The policy comes with a default list of keywords out-of-the-box, each one with a corresponding booster score. Words that relate to hardware failures are weighted higher than words that relate to performance slow-down, for example. All of these score boosters can be modified as the user sees fit. A particular convenience of this analysis dimension is that users can add their own keywords to the list, or remove ones that they don't need. The user then decides the importance of each keyword by specifying a relative point booster for each one.
The format of the table entries is:
{"keyword", case-sensitivity[0|1], score-boost}
For example, I might add the following three keywords to the bottom of the list, together with each one's case-sensitivity specification, and booster score:
{"CORE-ROUTER", 1, 50},
{"GigabitEthernet0/0", 0, 50},
{"crashed", 0, 60}
NOTE: the second parameter is whether the keyword is case-sensitive (0 = no, 1 = yes).
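To make the entry format concrete, here is a minimal Python sketch of cumulative keyword scoring using the three example entries above. It is an illustration rather than the policy's actual code: the function name `keyword_boost` is hypothetical, and plain substring matching is an assumption, since the article does not say how the policy matches keywords against the Summary field.

```python
# (keyword, case_sensitive, score_boost), mirroring the entry format above.
SUMMARY_KEYWORDS = [
    ("CORE-ROUTER", 1, 50),
    ("GigabitEthernet0/0", 0, 50),
    ("crashed", 0, 60),
]


def keyword_boost(summary, keywords=SUMMARY_KEYWORDS):
    """Add up the boost of every listed keyword found in the summary.

    ASSUMPTION: matching is a simple substring test.
    """
    total = 0
    lowered = summary.lower()
    for keyword, case_sensitive, boost in keywords:
        if case_sensitive:
            found = keyword in summary
        else:
            found = keyword.lower() in lowered
        if found:
            total += boost
    return total


print(keyword_boost("CORE-ROUTER rtr1 crashed on gigabitethernet0/0"))
print(keyword_boost("core-router rtr1 Crashed"))
```

In the second call, the lower-case "core-router" does not match the case-sensitive entry, so only the boost for "crashed" is counted.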
A summary of the user-modifiable parameters is as follows:
| Parameter | Description | Default value |
| --- | --- | --- |
| DatasourceName | This parameter refers to the Netcool/Impact name of the datasource this instance of the policy is to process events from. Under most circumstances, this parameter does not need to be changed. | "defaultobjectserver" |
| ProbableCauseField | This parameter refers to the name of the field that the probable-cause score booster points will be added to. Under most circumstances, this parameter does not need to be changed. This policy can be used to carry out probable-cause analysis in on-premise deployments where CauseWeight is being used to store probable-cause weighting. In this case, simply change the parameter value to "CauseWeight". This parameter is used in conjunction with the next one: ParentAlertGroup. | "CEAEventScore" |
| ParentAlertGroup | The parameter refers to the AlertGroup used by synthetic parent events. Under most circumstances, this parameter does not need to be changed. This policy can be used to carry out probable-cause analysis in on-premise deployments where parent events use the AlertGroup "ScopeIDParent" instead. This parameter is used in conjunction with the previous one: ProbableCauseField. | "CEACorrelationKeyParent" |
| ProcessingDelay | This is the processing delay introduced to allow internal probable-cause analysis to happen first, as well as Network Manager RCA. Under most circumstances, 60 seconds should be sufficient, however may be extended in scenarios where CNEA or Network Manager event processing is taking longer to occur. | 60 |
| BatchSize | The probable-cause engine runs under a Policy Activator and by default processes up to 1000 events at a time, in the order they were received by the Event Manager. Under most circumstances, this parameter does not need to be changed. | 1000 |
| Sev* | The Severity score boosters (Sev1 to Sev6) boost each event's probable-cause score based on its Severity. Note that a Severity of 6 is present to cater to Fatal events that may be received from ITM. | 10 to 60 |
| FirstEventBoost | This is the score booster added to the first event in an event group. The chronologically first event in a set of events has a high probability of being a probable cause, hence the high weighting of this criterion. | 100 |
| ITNMRootCauseBoost | This is the score booster added if Network Manager has determined that this event is a root-cause event. Root cause events have a high probability of being a probable cause, hence the high weighting of this criterion. | 110 |
| SummaryKeywords | This is an array that contains a list of the keywords along with corresponding score boosters that would increase an event's likelihood of being a probable-cause event if any of the words are present in an event's Summary field. Note that the corresponding points for each keyword are added cumulatively, for each keyword in the list that is present in the Summary field. | Various |
| AddSBEGCauseWeight | This parameter is either 0 (disabled) or 1 (enabled) and simply tells the automation to add on any values found in the CauseWeight field to the overall score. These values are significant where customers have invested in defining CauseWeight values in their environments. If not set, the default value of the CauseWeight field is zero (0), hence won't skew the probable-cause score in events where CauseWeight is not set. If this Policy is being run against an on-premise system, and ProbableCauseField is set to CauseWeight, then this property should be disabled, otherwise any CauseWeight values may inadvertently be doubled. | 1 |
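How ProcessingDelay and BatchSize interact can be sketched in Python as follows. This is a simplified model, not the policy's actual query: the function name `select_batch` is hypothetical, treating a zero Grade as "not yet processed" follows the flag-field description later in this article, and using FirstOccurrence both to measure event age and to order the batch is an assumption.

```python
def select_batch(events, now, processing_delay=60, batch_size=1000):
    """Pick unprocessed events that are old enough, oldest first.

    ASSUMPTIONS: Grade == 0 means unprocessed, and FirstOccurrence
    (epoch seconds) is the field used for both age and ordering.
    """
    eligible = [
        e for e in events
        if e["Grade"] == 0 and now - e["FirstOccurrence"] >= processing_delay
    ]
    eligible.sort(key=lambda e: e["FirstOccurrence"])
    return eligible[:batch_size]


events = [
    {"Serial": 1, "Grade": 0, "FirstOccurrence": 100},
    {"Serial": 2, "Grade": 1, "FirstOccurrence": 90},   # already processed
    {"Serial": 3, "Grade": 0, "FirstOccurrence": 170},  # only 30 seconds old
    {"Serial": 4, "Grade": 0, "FirstOccurrence": 50},
]
print([e["Serial"] for e in select_batch(events, now=200)])
```

Events that are skipped because they are too young remain unprocessed and become eligible on a later run of the Policy Activator.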
SET NODE ON PROBABLE CAUSE
The default behaviour of the Event Manager grouping engine is to create synthetic parent events and automatically include, in the parent's Summary field, the Summary field value from the child event with the highest probable-cause score. By default, it sets the Node field of the parent event to the unique group identifier for the group.
The second of the two Impact policies included in this project (called SetNodeOnProbableCause) modifies this behaviour by also synchronising the Node field value from the underlying child event with the highest probable-cause score up to the Node field in the synthetic parent event. Some users find this more useful than using the internal group identifier in the Node field in the synthetic parent event.
It does this by comparing the Summary of the parent event to that of the child events. Where it finds a match, it copies the Node field from the child event up to the parent event, if they're not aligned already. It assumes that the Summary field has already been copied up to the parent event by the grouping engine, hence the child event with the matching Summary field will be the one to copy from. Where more than one of the child events have the same Summary field value, it copies the Node field from the child event with the highest probable-cause score, since this is the one the grouping engine uses to copy the Summary field value from.
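The matching logic described above can be sketched in Python. This is an illustration of the described behaviour, not the SetNodeOnProbableCause policy itself: the function name `node_for_parent` is hypothetical, events are modelled as plain dictionaries, and exact string equality on Summary is an assumption.

```python
def node_for_parent(parent, children, score_field="CEAEventScore"):
    """Return the Node value the parent should carry.

    Children whose Summary matches the parent's are candidates; the one
    with the highest probable-cause score wins. With no match, the
    parent's existing Node is kept.
    """
    matches = [c for c in children if c["Summary"] == parent["Summary"]]
    if not matches:
        return parent["Node"]
    best = max(matches, key=lambda c: c[score_field])
    return best["Node"]


parent = {"Summary": "Link down on rtr1", "Node": "GROUP-123"}
children = [
    {"Summary": "Link down on rtr1", "Node": "rtr1", "CEAEventScore": 150},
    {"Summary": "Link down on rtr1", "Node": "rtr1-standby", "CEAEventScore": 260},
    {"Summary": "BGP peer lost", "Node": "rtr2", "CEAEventScore": 300},
]
print(node_for_parent(parent, children))
```

Note that the third child has the highest score overall but is ignored, because its Summary does not match the parent's.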
The only two user-modifiable properties in this Policy are:
DatasourceName = "defaultobjectserver";
ProcessingDelay = 60;
The datasource name can be changed where the policy is being used against an ObjectServer pair other than the default pair. The processing delay is introduced to allow time for the probable-cause analysis to take place first. Under most circumstances, neither of these parameters needs to be changed.
PROBABLE CAUSE PROCESSED FLAG FIELD
Probable-cause analysis marks events as processed by setting the Grade field to a non-zero value. If this field is being used for some other purpose, it will likely interfere with this probable-cause function. In this case, you can do a search-and-replace for the Grade field in the policy ProbableCauseAnalysis and replace Grade with one of the following alternatives (if unused in your environment):
Poll
PhysicalSlot
PhysicalPort
X733EventType
X733ProbableCause
If all of the above fields are already used, you can alternatively create a new, dedicated flag field via the $OMNIHOME/bin/nco_sql utility for this purpose - for example:
ALTER TABLE alerts.status ADD COLUMN ProbableCause INTEGER;
go
To add a new custom field in an OCP-based environment:
- Log into your OCP console
- Navigate to: Workloads > Secrets, search for omni-secret, and copy OMNIBUS_ROOT_PASSWORD to your clipboard
- Navigate to: Workloads > Pods, search for ncoprimary-0, select the pod, and choose the Terminal tab
- Run:
/opt/IBM/tivoli/netcool/omnibus/bin/nco_sql -server AGG_P -user root -password <OMNIBUS_ROOT_PASSWORD>
- Enter the above SQL into the command prompt to create the field
- Do a search-and-replace in the policy ProbableCauseAnalysis, substituting Grade with your new field name, and save it
NOTE: You will need to repeat this process in the backup ObjectServer AGG_B located in the ncobackup-0 pod, and add the new field to the bidirectional Aggregation Gateway mapping stored in the objserv-agg-backup-config ConfigMap so that this field gets replicated between the primary and backup Aggregation ObjectServers. See the product documentation for further information on how to replace the default bidirectional Aggregation Gateway mapping with a new one.
SUMMARY
This new functionality allows a user to extend and customise the process of probable-cause analysis and generally tune its prioritisation logic. The provided Policies can also be used out-of-the-box without any modification to enable additional criteria to be considered in probable-cause analysis.
If these two functions are to be used long-term, it is recommended to edit the Services and check the box "Starts automatically when server starts". This will ensure these Services restart automatically in the event of a server restart.