How to monitor your AIOps system with AIOps!

By PATRICK O'NEILL posted Mon April 08, 2024 01:51 PM

  

Cloud Pak for AIOps can ingest events from many sources and analyse them for you, reducing the noise so that you can focus on the events that matter most in your network.

However, what if AIOps itself, or the infrastructure it runs on, has a looming problem that could cost you your event management system? For example, one or more PVCs may be filling up.

This blog details some best practices and easy-to-configure integrations that can be used to implement additional self-monitoring.

Since Cloud Pak for AIOps runs on Red Hat OpenShift, we can utilise the inbuilt Alert Manager feature of OpenShift to do the monitoring, and configure it to send alerts to AIOps when it sees a problem with any of the underlying infrastructure, for example pod restarts, low storage or latency issues.

NOTE: Alert Manager will usually send an initial Warning alert and then escalate it to Critical as the issue becomes more serious. For example, when storage usage passes a threshold you will first get a Warning alert; as usage climbs to 80% or 90%, the alert becomes Critical.

The high-level steps to configure it are as follows:

  1. Create a Generic Webhook integration.
  2. Copy the webhook URL to the OpenShift Alert Manager user interface.
  3. Configure Alert Manager to send the alerts you are interested in.

Create a Generic Webhook integration

  • From the AIOps main menu, select Define-->Integrations-->Add Integration.
  • Search for "Generic Webhook" and click that tile to set up a new webhook ingestion endpoint.
  • Give the integration a name, select "None" as the "Authentication type" and click "Next".
  • Click "Load sample mapping" and select "Prometheus Alert Manager" to configure the integration with the correct schema mapping from Prometheus to AIOps alerts.
  • Click "Done".

Copy the webhook URL to the OpenShift Alert Manager user interface

From the new Generic Webhook integration you created, copy the full webhook URL and switch to the OpenShift console. Example webhook URL:

https://whconn-6492ea38-cf72-4f3d-b7d3-9fbbc2b1065c-aiops.apps.example.com/webhook-connector/ezca4fgo2tj
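Before pointing Alert Manager at the webhook, it can be useful to post a hand-made alert to it and check that it shows up in AIOps. The following is a minimal sketch, assuming the "Prometheus Alert Manager" sample mapping expects the standard Alertmanager webhook JSON body. The function names, the alert name "SelfMonitoringTest" and the label values are illustrative, not part of the product.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(alertname: str, severity: str = "warning") -> dict:
    """Build a body shaped like an Alertmanager webhook notification (version 4)."""
    # Illustrative labels; real OpenShift alerts carry many more.
    labels = {"alertname": alertname, "severity": severity, "namespace": "example"}
    now = datetime.now(timezone.utc).isoformat()
    return {
        "version": "4",
        "status": "firing",
        "receiver": "aiops-test",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "externalURL": "",
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": {"summary": "Hand-made test alert"},
                "startsAt": now,
                "endsAt": "0001-01-01T00:00:00Z",
                "generatorURL": "",
            }
        ],
    }


def post_payload(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Example URL from this post; replace it with the URL of your own integration.
    url = (
        "https://whconn-6492ea38-cf72-4f3d-b7d3-9fbbc2b1065c-aiops"
        ".apps.example.com/webhook-connector/ezca4fgo2tj"
    )
    payload = build_test_payload("SelfMonitoringTest")
    print(json.dumps(payload, indent=2))
    # Uncomment to actually send the test alert:
    # print(post_payload(url, payload))
```

If the endpoint uses a certificate that the machine running the script does not trust, `urlopen` raises an error rather than sending the alert. This is the same class of problem described in the IMPORTANT note at the end of this post.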

On the OpenShift console, navigate to Administration-->Cluster Settings-->Configuration-->Alertmanager.

Edit the "Receivers" section for "Critical", "Default" etc.

Select "Webhook" for "Receiver Type" and paste the full URL into the "URL" field.

The "Routing labels" section allows you to further filter which alerts get sent to AIOps. A value of ".*" sends all alerts.

When an OpenShift alert is generated, it has "Labels" attached to it, such as "alertname".

In the "Routing labels" section of the Webhook receiver, you can use "alertname" with a value of ".*" to allow all alerts through, or you can filter, for instance by using "alertname" with a value of "SystemMemoryExceedsReservation" or any other regular expression that gives you the set of alerts you want to manage in AIOps.

Save the Alertmanager configuration. Shortly afterwards, any existing alerts that are active should appear in your AIOps alert list.
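To get a feel for how a routing-label regular expression selects alerts, here is a small sketch. It assumes that the pattern must match the whole label value (anchored matching) and that a missing label is treated as an empty value. Alertmanager uses Go's regular expression syntax, so Python's `re` is only an approximation for simple patterns. The alert name "ExampleDiskAlert" and the helper `matches_route` are made up for illustration.

```python
import re


def matches_route(labels: dict, label_name: str, pattern: str) -> bool:
    """Return True if the whole value of the label matches the pattern."""
    # Assumption: a label that is absent behaves like an empty string.
    value = labels.get(label_name, "")
    # fullmatch mimics an anchored match: the pattern must cover the entire value.
    return re.fullmatch(pattern, value) is not None


# Illustrative label sets, not real cluster output.
alerts = [
    {"alertname": "SystemMemoryExceedsReservation", "severity": "warning"},
    {"alertname": "ExampleDiskAlert", "severity": "critical"},
]

# ".*" lets every alert through.
all_alerts = [a["alertname"] for a in alerts if matches_route(a, "alertname", ".*")]

# A narrower expression keeps only the alerts you want in AIOps.
memory_only = [
    a["alertname"] for a in alerts if matches_route(a, "alertname", "SystemMemory.*")
]

print(all_alerts)
print(memory_only)
```

Because the match is anchored, a value such as "SystemMemory" on its own would not select "SystemMemoryExceedsReservation". You need the full name or a trailing ".*".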

Optionally, you can also create an AIOps policy to scope group these Prometheus-based alerts if you would like them to be grouped together in one AIOps incident (shown in the example as "GROUP(17 alerts)..."). All the other AIOps capabilities can also be utilised (ChatOps, Seasonality etc.), as these alerts are now just like any other alerts in the system and will be processed accordingly.

A simple example policy, like the following, would provide a scope-based grouping:

IMPORTANT: OpenShift Alert Manager requires a trusted webhook endpoint. If your AIOps deployment is not using certificates that OpenShift itself trusts, Alert Manager will not send alerts to AIOps. You need to use trusted certificates or add the AIOps CA signing certificate to the OpenShift CA bundle.
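A quick way to see whether the webhook endpoint presents a certificate that a given trust store accepts is to attempt a verified TLS handshake. The sketch below is only a rough check. It uses the trust store of the machine running the script (or a CA file you pass in), which is an assumption and not necessarily the same as the OpenShift CA bundle. The function names are illustrative.

```python
import socket
import ssl
from typing import Optional
from urllib.parse import urlparse


def make_verifying_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a context that verifies the server certificate and host name."""
    # With ca_file=None the system trust store is used;
    # pass a PEM file to test against a specific CA certificate.
    return ssl.create_default_context(cafile=ca_file)


def endpoint_is_trusted(
    webhook_url: str, ca_file: Optional[str] = None, timeout: float = 5.0
) -> bool:
    """Return True if a verified TLS handshake with the endpoint succeeds."""
    parsed = urlparse(webhook_url)
    host = parsed.hostname
    port = parsed.port or 443
    context = make_verifying_context(ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        # The certificate chain or host name was rejected.
        return False
```

Connection problems such as DNS failures or timeouts are deliberately not caught, so they are not mistaken for a certificate issue. Calling `endpoint_is_trusted(url, ca_file="aiops-ca.pem")` with a hypothetical PEM file holding the AIOps CA signing certificate shows whether adding that CA would make the endpoint trusted.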


Comments

24 days ago

FYI, you can configure the Prometheus Alert Manager to accept a self-signed cert by setting insecure_skip_verify: true on the webhook, as in the sample below from the Alertmanager config YAML. To find this config, log in to OCP and, in the left navigation panel, go to Administration > Cluster Settings > Configuration tab > Alertmanager, then update the webhook receiver that you added in the previous step. Note that the YAML below is just a sample and its values don't match the config posted in this blog; please update them according to your environment:

    receivers:
      - name: aiops-webhook-connector
        webhook_configs:
          - url: >-
              https://whconn-71f7306d-64e9-4b3c-8f3d-75a422e88fa3-aiops.apps.mire.cp.fyre.ibm.com/webhook-connector/k69kg3wgc3f
            http_config:
              basic_auth:
                username: test
                password: test
              tls_config:
                insecure_skip_verify: true