Watson for AIOps AI Manager customisable probable-cause analysis

By Jonathan Settle posted Mon November 27, 2023 04:56 AM


IBM AIOps has a powerful probable-cause system. The idea is to rank alerts by how important they are to resolving the incident as quickly as possible, so that users can focus on the alerts that matter without searching for key information in a sea of alerts. AIOps combines several features that let clients control the ranking of their alerts easily and deterministically. This blog shows you how to customize probable causes so that you get the most out of them.

Image: the top three alerts shown for an incident in the AIOps incident viewer.

The Topology tab of the incident is a really useful feature: if the alerts have topological resources, it shows the paths between the affected resources. This removes the need to know the relationships yourself, and it also reduces clutter by leaving out relationships that are not needed.

Let's jump into how probable causes can be computed. There are five methods.

1) Topological walking and natural language processing (NLP) alert classification

· When alerts are correlated, the AI walks the topology to find the shortest path (in the soon-to-be-released version this is changing to lowest cost) between the resources that have alerts on them. Within the topology, each edge has a type and a direction, and each is given a weight. For example, if a container “runs on” a server and there are correlated alerts on both, then the server is more likely to be the root cause than the container, i.e. the weight will be higher going toward the server.

· After the path weight is calculated, the context of the alert is taken into account. This is done by classifying the alert into one of seven categories, using natural language classification on the summary field of the alert.

· The alert classification and path score are then combined to give an overall score for the alert (i.e. its relative importance).

· To allow modelling of any environment, edge type weights and NLP classifications can be changed and added to as required to meet the needs of the client environment (see below for details).

2) Word based

· The word-based approach is very useful when the environment does not have topological data, but it also works in partnership with topological probable cause when topological data is available. It lets users define keywords, each of which is given an importance score. If a keyword is in the summary field, the alert is given that score (the sum over all matching words). While this is not “AI”, it is a highly effective method that guarantees operational control (known behaviour).

· As all customer environments are different, APIs exist to add and remove words. This is walked through later in this post.

3) Severity Boost

· With the severity boost, higher-severity alerts get a larger score. It can be argued that a higher severity does not always mean the alert is the root cause, but higher-severity alerts generally give the operator more context about the impact. Ranking them higher within a large group is therefore important, so that the operator can get context quickly.

4) First to occur

· Boost the score for the alert with the lowest first occurrence time.

5) Externally set score (via IBM Tivoli Network Manager (ITNM) or an existing system)

· Clients with existing ITNM installations can integrate the root causes that ITNM calculates.

· Customers with their own scoring mechanism can simply write a score value into an alert attribute to have it added to the AIOps score, or just use that score.

How to modify probable cause

This section takes you step by step through accessing the API and through some example modifications for the different approaches. There is full Swagger API documentation in the API section of the official docs.

Log in to your OpenShift cluster. In this demo, AIOps is installed in the aiops namespace.

ROUTE=$(oc get route cpd -n aiops --no-headers | awk '{print $2}')

PASS=$(oc get secret admin-user-details -o jsonpath='{.data.initial_admin_password}' -n aiops | base64 -d)

TOKEN=$(curl -s -k -X POST "https://$ROUTE/icp4d-api/v1/authorize" -H 'Content-Type: application/json' -d '{"username": "admin","password": "'"$PASS"'"}' | jq -r .token)

Each part of the probable-cause system can be switched on and off depending on the environment's requirements.

To get the configuration, use /customisation/scoring/config

For example:

curl -k -X GET -H "Accept: application/json" -H "Authorization: Bearer ${TOKEN}" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" https://$ROUTE/aiops/api/issue-resolution/mime/v1/customisation/scoring/config | jq . > out.json

This lists which attributes are turned on and off.

To turn off pathCaculation, alter the required value in out.json (i.e. set it to false) and then send the file back:

curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer ${TOKEN}" https://$ROUTE/aiops/api/issue-resolution/mime/v1/customisation/scoring/config -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" -d @out.json

PathCaculationEnabled

This turns on or off the probable cause based on the topological data and the text classification of the summary. The client must first make sure the topological resource groups have correlatable set. This allows alerts to be correlated by topological correlation.

When a problem occurs, a path is calculated between all the alerts in the group. This is currently the path with the fewest hops. This creates the path score described above.

To get a list of all the edges (I am using the jq command to make the output readable):

curl -k -X GET -H "Accept: application/json" -H "Authorization: Bearer ${TOKEN}" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" https://$ROUTE/aiops/api/issue-resolution/mime/v1/customisation/edge_type | jq .

Most edges have a different weight depending on direction, e.g. an Application that DependsOn a Host. This means “IN” has a higher weight than “OUT”. Please note that if the system encounters an unknown edge type, it will use the average of the weights in the edge label group. Users should try to avoid this if possible.

To add or modify an edge type, all you need to do is POST a request in the format below.

curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer ${TOKEN}" https://$ROUTE/aiops/api/issue-resolution/mime/v1/customisation/edge_type -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" -d '{"edge_type_list":[{ "edge": "newEdgeName", "label": "association", "outdegree": 20,"indegree": 5}]}'

This allows you to add your own edge types, so you can model your environment however you see fit.

After the path score is calculated, the summary is classified into one of the seven labels.

The labels can be modified via /customisation/label_weight: use GET to retrieve the list and POST to modify it. Warning: before removing a label from the list, please remove all of its training examples from the classifier. The classifier looks at the words in the summary and attempts to put the alert into one of these classes. In the future I will write more about the classifier.

The WordEnabled property in the configuration turns on keyword searching in the summary of the alert. That is, if the summary contains a word from the list returned by GET on /customisation/words, the alert is given the sum of the weights of the listed words it contains.

curl -k -X GET -H "Accept: application/json" -H "Authorization: Bearer ${TOKEN}" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" https://$ROUTE/aiops/api/issue-resolution/mime/v1/customisation/words | jq .

The list can be modified (words added or removed) and then sent with POST to /customisation/words to update the system.

For example:

{ "words":[
{"word":"error","caseSenstive": false, "weight":100},
{"word":"ECC","caseSenstive": true, "weight":10}
]}
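Sending this list could look roughly like the sketch below. It reuses the ROUTE and TOKEN variables and the X-TenantID value from the earlier commands. The file name words.json is just an illustrative choice, and the key names are copied from the example as written, so check them against the Swagger doc. I am also assuming the POST takes the whole list, as the "modify and POST" flow suggests.

```shell
#!/bin/sh
# Write the example word list to a file (words.json is a hypothetical name).
cat > words.json <<'EOF'
{ "words": [
  {"word": "error", "caseSenstive": false, "weight": 100},
  {"word": "ECC", "caseSenstive": true, "weight": 10}
]}
EOF

# Only send the request when ROUTE and TOKEN were set by the login steps.
if [ -n "${ROUTE:-}" ] && [ -n "${TOKEN:-}" ]; then
    curl -k -X POST -H "Content-Type: application/json" \
         -H "Authorization: Bearer ${TOKEN}" \
         -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" \
         "https://$ROUTE/aiops/api/issue-resolution/mime/v1/customisation/words" \
         -d @words.json
else
    echo "ROUTE or TOKEN not set; wrote words.json only" >&2
fi
```

Keeping the list in a file makes it easy to GET the current words, edit them, and POST the result back.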

If an alert summary contained “error”, “ERROR” or “eRroR”, a score of 100 would be added to the total score of the alert (the highest score is ranked first). In the case of ECC, the summary would have to contain “ECC” exactly.

Where both words are in the summary, e.g. "error an ECC has been found", the alert would have a score of 110.
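To make the arithmetic concrete, here is a small local sketch of the word scoring. It is not the AIOps implementation: the function name word_score is made up, and I am assuming plain substring matching (so "errors" would also match "error"), which the product may or may not do.

```shell
#!/bin/sh
# Word list from the example above, one entry per line: word|caseSensitive|weight
WORDS='error|false|100
ECC|true|10'

# word_score SUMMARY: print the sum of the weights of all listed words found in SUMMARY.
# Assumption: substring matching; the real matcher may tokenise differently.
word_score() {
    summary=$1
    lower=$(printf '%s' "$summary" | tr '[:upper:]' '[:lower:]')
    total=0
    while IFS='|' read -r word case_sensitive weight; do
        if [ "$case_sensitive" = "true" ]; then
            haystack=$summary
            needle=$word
        else
            # Case-insensitive: compare lower-cased copies of both strings.
            haystack=$lower
            needle=$(printf '%s' "$word" | tr '[:upper:]' '[:lower:]')
        fi
        case $haystack in
            *"$needle"*) total=$((total + weight)) ;;
        esac
    done <<EOF
$WORDS
EOF
    echo "$total"
}

word_score "error an ECC has been found"   # prints 110
word_score "eRroR on disk"                 # prints 100
word_score "ecc corrected"                 # prints 0 (ECC is case-sensitive)
```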

SeverityEnabled

The weight for the alert's severity is added to the score.

Severity	Weight
1	10
2	20
3	30
4	40
5	50
6	60

Therefore, if the word score was 110 (“error” and “ECC”) and the severity of the alert was Critical, which is 6, then 60 would be added to the score, making it 170.
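The same sum, written out as a small sketch using the weights from the table above. The function name severity_weight is hypothetical, and returning 0 for a severity outside 1 to 6 is my assumption, not documented behaviour.

```shell
#!/bin/sh
# severity_weight SEVERITY: print the weight from the table above.
# Assumption: a severity outside 1-6 contributes nothing.
severity_weight() {
    case $1 in
        1) echo 10 ;;
        2) echo 20 ;;
        3) echo 30 ;;
        4) echo 40 ;;
        5) echo 50 ;;
        6) echo 60 ;;
        *) echo 0 ;;
    esac
}

word_score=110   # "error" (100) + "ECC" (10) from the word example
severity=6       # Critical

echo $((word_score + $(severity_weight "$severity")))   # prints 170
```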

To get the list or modify the weight of each severity, use GET/POST on /customisation/severity

CauseScoreEnabled

If this field in the configuration is set to true and the details section of the alert has a property called “CauseWeight”, AIOps will pick that score up and add it to the value. The idea is that the environment might already be using existing NOI scope-based automation, and you wish to carry it over. If the property is not there, there will be no effect. This field cannot be changed.

Itnmproccessing and itnmcausewieght

ITNM already does its own processing. This setting allows AIM to pick up that result and reflect it in the score: if the details section of the alert has a property called “NmosCauseType” with the value 1, then itnmcausewieght is added to the score.
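The two external inputs (CauseWeight and NmosCauseType) can be pictured with the sketch below. It only illustrates the rules as described in this post. The variable names, the function external_score and the weight of 50 are all made up for the example, and the real values come from /customisation/scoring/config.

```shell
#!/bin/sh
# Hypothetical settings standing in for the scoring configuration.
CAUSE_SCORE_ENABLED=true
ITNM_PROCESSING=true
ITNM_CAUSE_WEIGHT=50    # assumed value, not a documented default

# external_score CAUSE_WEIGHT NMOS_CAUSE_TYPE
# Pass an empty string for a property that is missing from the alert details.
external_score() {
    cause_weight=$1
    nmos_cause_type=$2
    extra=0
    # CauseWeight is added only when the feature is on and the property exists.
    if [ "$CAUSE_SCORE_ENABLED" = "true" ] && [ -n "$cause_weight" ]; then
        extra=$((extra + cause_weight))
    fi
    # The ITNM weight is added only when NmosCauseType is exactly 1.
    if [ "$ITNM_PROCESSING" = "true" ] && [ "$nmos_cause_type" = "1" ]; then
        extra=$((extra + ITNM_CAUSE_WEIGHT))
    fi
    echo "$extra"
}

external_score 30 1     # prints 80
external_score "" 1     # prints 50
external_score "" ""    # prints 0
```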

FirstBoostWeight and firstBoost

If firstBoost is set to true, the amount specified in firstBoostWeight is added to the alert that has the lowest firstoccurrence time.
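As a rough illustration of the first-occurrence boost, the sketch below takes lines of "alert id, current score, first occurrence as epoch seconds" and adds the boost to the earliest alert. The input format, the function name apply_first_boost and the weight of 25 are assumptions for the example. How AIOps breaks a tie is not covered here; this sketch boosts the first one it sees.

```shell
#!/bin/sh
FIRST_BOOST_WEIGHT=25   # hypothetical value for firstBoostWeight

# apply_first_boost: read "id score first_occurrence_epoch" lines on stdin and
# print "id score" with the boost added to the alert that occurred first.
apply_first_boost() {
    awk -v boost="$FIRST_BOOST_WEIGHT" '
        {
            id[NR] = $1; score[NR] = $2
            # Track the row with the lowest first-occurrence time.
            if (NR == 1 || $3 + 0 < min) { min = $3 + 0; first = NR }
        }
        END {
            for (i = 1; i <= NR; i++) {
                if (i == first) score[i] += boost
                print id[i], score[i]
            }
        }'
}

printf '%s\n' "alert-a 170 1700000300" \
              "alert-b 60 1700000100" \
              "alert-c 110 1700000200" | apply_first_boost
# prints:
# alert-a 170
# alert-b 85
# alert-c 110
```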
