Viewing and interpreting events in IBM Spectrum Fusion

By Byron Williams posted Tue October 26, 2021 11:26 AM

  

Introduction

The Event Manager is responsible for collecting and processing all alerts generated by IBM Spectrum Fusion components. These alerts are initially converted into Kubernetes events and placed on the event queue, so they can be accessed like normal events. However, additional information is added to the events so that other components in IBM Spectrum Fusion can provide additional processing, such as opening Call Home tickets for designated critical events, and visualization through the Event Manager UI page, where the events can be filtered, searched, and downloaded.

Basic Function

Event Manager has 3 basic functions:

  1. Receive Alerts - Any component can POST an alert in AlertManager v4 format to the following URL: https://eventmanager-ibm-spectrum-fusion-ns.apps.cps-r81-9-46-123-89.rtp.raleigh.ibm.com/api/v1/eventmanager/alerts
  2. Convert the information in the alert into the labels and annotations of an OCP event, and place that event onto the OCP event queue. In general, alert and event are synonymous, but for this document Event Manager receives alerts and converts them into events.
  3. If the ISF severity is CRITICAL, and the alertname of the alert is listed in the isf-serviceability-operator-allow-tickets config map, send information to the Call Home Client to open a ticket and automatically upload a default set of files.
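
As an illustration of function 1, here is a minimal sketch that builds an alert and prepares the POST. It assumes "AlertManager v4 format" means the standard Alertmanager webhook payload with version "4". The label and annotation names are borrowed from the example event later in this post; which fields Event Manager actually requires is not stated here. The host name is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Placeholder host (assumption); substitute the eventmanager route of your cluster.
EVENT_MANAGER_URL = "https://eventmanager.example.invalid/api/v1/eventmanager/alerts"


def build_alert_payload(alertname, severity, annotations=None, extra_labels=None):
    """Build an Alertmanager-style webhook payload carrying a single alert."""
    labels = {"alertname": alertname, "severity": severity}
    labels.update(extra_labels or {})
    alert = {
        "status": "firing",
        "labels": labels,
        "annotations": dict(annotations or {}),
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    return {"version": "4", "status": "firing", "alerts": [alert]}


def build_request(payload, url=EVENT_MANAGER_URL):
    """Prepare (but do not send) the HTTP POST for the payload."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_alert_payload(
    "BMYLC1000",
    "CRITICAL",
    annotations={"description": "A filesystem was forced to unmount by SpectrumScale"},
    extra_labels={"component": "filesystem"},
)
request = build_request(payload)
# urllib.request.urlopen(request) would send it; authentication and TLS
# settings depend on the cluster and are not covered by this sketch.
```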

Alert Sources

Event Manager will receive alerts from (potentially) 4 different sources:

  1. ISF Components, like the Compute or Networking operators
  2. Spectrum Scale
  3. ISF hardware SNMPv3 traps
  4. Prometheus (has not been tested yet)

Event Severities

Events have 3 severities:

  1. INFO - just for information; the user does not need to take any action; will typically live 3-4 hours
  2. WARNING - a condition has been identified that should be examined by the user within the next few days; will typically live 7 days
  3. CRITICAL - a condition has been identified that should be examined by the user immediately; if the name of this alert is in the isf-serviceability-operator-allow-tickets configmap, Event Manager will also open a Call Home ticket (if enabled) and automatically upload logs (if enabled); will typically live 14 days
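
The ticket rule for CRITICAL events can be sketched as a simple predicate. This is only an illustration of the rule as described above: the function name is hypothetical, the allow-list contents are made up, and the layout of the isf-serviceability-operator-allow-tickets config map is not described in this post.

```python
def should_open_ticket(severity, alertname, allowed_alertnames, call_home_enabled=True):
    """Return True when an alert qualifies for a Call Home ticket."""
    return (
        call_home_enabled
        and severity == "CRITICAL"
        and alertname in allowed_alertnames
    )


# Hypothetical allow list, standing in for the config map contents.
allow_list = {"BMYLC1000"}

opens_ticket = should_open_ticket("CRITICAL", "BMYLC1000", allow_list)
warning_only = should_open_ticket("WARNING", "BMYLC1000", allow_list)
not_listed = should_open_ticket("CRITICAL", "BMYNW0201", allow_list)
```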

Event Fixed Status

Each ISF event in the OCP event queue has a field in its annotations named isf_fixed. It can be set to true or false. INFO events are created with isf_fixed=true, while WARNING and CRITICAL events are set to isf_fixed=false.

The intent is to give the user a way to mark that a particular event has been investigated and does not need any more attention.

Typically, the user will see a new WARNING event on the Events page that describes a condition (for example, disk XXX is > 80% full) and is created with isf_fixed=false. The user can then follow up on the condition and, once it has been dealt with, go back to the UI and set the fixed status to true to indicate that the condition does not need any more attention. (Note: For the initial release, only CRITICAL events can be changed, and only from fixed=false to fixed=true.)
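
The fixed-status rules can be summarized in a few lines. This is a sketch of the behavior described in this section, not Event Manager's code; the function names are hypothetical, and the annotation value is kept as the string "true" or "false", as in the example event.

```python
def initial_fixed(severity):
    """INFO events start as fixed; WARNING and CRITICAL start as not fixed."""
    return "true" if severity == "INFO" else "false"


def can_mark_fixed(severity, annotations):
    """Initial-release rule: only CRITICAL events, and only false -> true."""
    return severity == "CRITICAL" and annotations.get("isf_fixed") == "false"


def mark_fixed(severity, annotations):
    """Return a copy of the annotations with isf_fixed set to "true"."""
    if not can_mark_fixed(severity, annotations):
        raise ValueError("this event cannot be marked as fixed")
    updated = dict(annotations)
    updated["isf_fixed"] = "true"
    return updated


critical = {"isf_fixed": initial_fixed("CRITICAL")}
after_fix = mark_fixed("CRITICAL", critical)
```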

Alert Processing

Event Manager checks every received alert for duplication, based on its labels and fixed status. For each alert, it performs these steps:

  • Determine the labels of the incoming alert.
  • Search for an existing event with the same set of labels.
  • Compare the fixed status of the incoming alert and the found event.

If an event is found that matches the set of labels of the incoming alert, but the event has fixed=true and the alert has fixed=false, a new event will be created. Otherwise, the new alert will be considered a duplicate of the found event.

There are 3 fields in each event that help keep track of duplication:

  1. isf_first_seen - the date and time the event was received
  2. isf_last_seen - the date and time the most recent duplicate was received
  3. isf_times_seen - the number of times a duplicate has been found for this event
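
The duplicate check and the three tracking fields can be sketched as follows. This is a simplified model under stated assumptions: an in-memory list stands in for the OCP event queue, the most recent event with the same labels is the one compared, and isf_times_seen starts at 1 (the example event later in this post shows "1" with identical first-seen and last-seen times). The function name is hypothetical.

```python
from datetime import datetime, timezone


def process_alert(events, labels, fixed, now):
    """Record an incoming alert as a duplicate or as a new event."""
    # Look at the most recent event with the same label set, if any.
    match = next((e for e in reversed(events) if e["labels"] == labels), None)

    # A fixed event followed by a not-fixed alert means the condition came
    # back, so it gets a new event; every other match is a duplicate.
    if match is not None and not (match["isf_fixed"] and not fixed):
        match["isf_last_seen"] = now
        match["isf_times_seen"] += 1
        return match

    event = {
        "labels": dict(labels),
        "isf_fixed": fixed,
        "isf_first_seen": now,
        "isf_last_seen": now,
        "isf_times_seen": 1,  # assumption: the first sighting counts as 1
    }
    events.append(event)
    return event


queue = []
alert_labels = {"alertname": "BMYLC1000", "severity": "CRITICAL"}
t0 = datetime(2021, 9, 9, 16, 24, 36, tzinfo=timezone.utc)
t1 = datetime(2021, 9, 9, 17, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2021, 9, 10, 9, 0, 0, tzinfo=timezone.utc)

first = process_alert(queue, alert_labels, False, t0)
duplicate = process_alert(queue, alert_labels, False, t1)  # same event, counted again
first["isf_fixed"] = True                                  # user marks it as fixed
recurrence = process_alert(queue, alert_labels, False, t2)  # new event
```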

Event Maintenance

To keep events around longer than the lifetime of an OCP event, Event Manager will update a field in each WARNING and CRITICAL event every hour. This prevents OCP from deleting the event.
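
A rough sketch of the hourly refresh, to make the selection rule concrete. The post does not name the field that is updated, so last_refreshed below is a hypothetical stand-in, and the events are plain dictionaries rather than real OCP events.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(hours=1)
KEPT_SEVERITIES = {"WARNING", "CRITICAL"}


def refresh_due(events, now):
    """Stamp every WARNING/CRITICAL event whose last refresh is at least an hour old."""
    refreshed = []
    for event in events:
        if event["severity"] not in KEPT_SEVERITIES:
            continue  # INFO events are left to expire normally
        if now - event["last_refreshed"] >= REFRESH_INTERVAL:
            event["last_refreshed"] = now  # hypothetical field name
            refreshed.append(event)
    return refreshed


start = datetime(2021, 9, 10, 16, 0, 0, tzinfo=timezone.utc)
sample_events = [
    {"severity": "INFO", "last_refreshed": start},
    {"severity": "WARNING", "last_refreshed": start},
    {"severity": "CRITICAL", "last_refreshed": start + timedelta(minutes=30)},
]
touched = refresh_due(sample_events, start + timedelta(hours=1))
```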

Command Line Tools

To see a list of the current ISF events from the OCP command line, type:

oc get event --field-selector reason=ISFEventManager

LAST SEEN   TYPE      REASON            OBJECT                    MESSAGE
168m        Warning   ISFEventManager   deployment/eventmanager   BMYLC1000-Test Event for logcollector - 968
151m        Warning   ISFEventManager   deployment/eventmanager   BMYLC1000-Test Event for logcollector, longer collection - 225


Note that OCP events have a TYPE field, which can be either Normal or Warning. All ISF events with severity=INFO have Type=Normal. ISF events with severity=WARNING or severity=CRITICAL have Type=Warning.
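
For scripting, the same query can be requested as JSON by adding -o json to the command above. The sketch below parses such output and shows the severity-to-Type mapping; it assumes the severity is carried in the event's labels, as in the example event in this post. The function name and the sample data are made up for illustration.

```python
import json

# The command from above with JSON output requested.
OC_COMMAND = ["oc", "get", "event", "--field-selector", "reason=ISFEventManager", "-o", "json"]

# ISF severity -> OCP event Type, as described in the note above.
SEVERITY_TO_TYPE = {"INFO": "Normal", "WARNING": "Warning", "CRITICAL": "Warning"}


def parse_event_list(raw_json):
    """Return (type, severity, message) for each event in `oc ... -o json` output."""
    rows = []
    for item in json.loads(raw_json).get("items", []):
        severity = item.get("metadata", {}).get("labels", {}).get("severity", "")
        rows.append((item.get("type", ""), severity, item.get("message", "")))
    return rows


# On a cluster the text would come from something like:
#   subprocess.run(OC_COMMAND, capture_output=True, text=True, check=True).stdout
sample_output = json.dumps({
    "items": [
        {
            "type": "Warning",
            "message": "BMYLC1000-Test Event for logcollector - 968",
            "metadata": {"labels": {"severity": "CRITICAL"}},
        }
    ]
})
rows = parse_event_list(sample_output)
```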

Example Event

apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2021-09-10T16:36:03Z"
involvedObject:
  apiVersion: apps/v1
  kind: Deployment
  name: eventmanager
  namespace: ibm-spectrum-fusion-ns
  resourceVersion: "88029876"
  uid: 24877f13-3ed5-4062-98cd-fb233385050b
kind: Event
lastTimestamp: "2021-09-10T16:36:03Z"
message: BMYLC1000-Test Event for logcollector - 968
metadata:
  annotations:
    cause: A situation like a SGPanic or a quorum loss could initiate the unmount
    container_restart: "false"
    container_unready: "false"
    description: A filesystem was forced to unmount by SpectrumScale
    ftdc_scope: ""
    identifier: gpfs02
    internalComponent: fsmount
    is_resolvable: "false"
    isf_first_seen: "2021-09-09T16:24:36Z"
    isf_fixed: "false"
    isf_last_seen: "2021-09-09T16:24:36Z"
    isf_ticket_id: TS003759703
    isf_ticket_id_confirmed: "true"
    isf_ticket_id_logsuploaded: "true"
    isf_ticket_requested_at: "2021-09-09T16:24:36Z"
    isf_times_seen: "1"
    logk8s00: route.openshift.io:v1:routes:ibm-spectrum-fusion-ns
    loglel00: isf-collection-sets:must-gather
    message: BMYLC1000-Test Event for logcollector
    priority: ""
    remedy: ""
    requireUnique: "true"
    scope: NODE
    time: "2021-04-14T10:19:12-07:00"
    tzone: PDT
    user_action: 'Check error messages and the error log for further details. Also see the topic File system forced unmount in the IBM Spectrum Scale documentation: Troubleshooting. File system issues'
  creationTimestamp: "2021-09-10T16:36:03Z"
  labels:
    alertname: BMYLC1000
    component: filesystem
    controllerinstance: logcollector-6446f68445-gs775
    controllername: logcollector
    entity_name: gpfs02
    entity_type: FILESYSTEM
    isf_node: master-7b50.cps-r81-9-46-123-89.rtp.raleigh.ibm.com
    isf_uid: 41b8dae3-0233-4733-9163-e5f27b551fe6
    node: "12"
    severity: CRITICAL
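
Because annotation values are strings, a script reading such an event usually converts them first. The sketch below pulls a few ISF fields out of an event that has already been loaded into a dictionary; the field names come from the example above, while the function name and the choice of fields are only illustrative.

```python
def summarize_event(event):
    """Extract a few ISF fields from an event dictionary, converting string values."""
    metadata = event.get("metadata", {})
    annotations = metadata.get("annotations", {})
    labels = metadata.get("labels", {})
    return {
        "alertname": labels.get("alertname"),
        "severity": labels.get("severity"),
        "fixed": annotations.get("isf_fixed") == "true",
        "times_seen": int(annotations.get("isf_times_seen", "0")),
        "ticket_id": annotations.get("isf_ticket_id") or None,
        "logs_uploaded": annotations.get("isf_ticket_id_logsuploaded") == "true",
    }


# A trimmed copy of the example event above.
example_event = {
    "metadata": {
        "annotations": {
            "isf_fixed": "false",
            "isf_ticket_id": "TS003759703",
            "isf_ticket_id_logsuploaded": "true",
            "isf_times_seen": "1",
        },
        "labels": {"alertname": "BMYLC1000", "severity": "CRITICAL"},
    }
}
summary = summarize_event(example_event)
```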

Event UI

On the Spectrum Fusion home page, there is a list of the most recent events in the right-most panel:

Fusion Home Page


Clicking the arrow in the upper-right corner will bring up the Events page.

At the top of the page is a bar chart showing the distribution of the current events:

Bar Chart of Existing Events


The list of events can be filtered by any combination of:

  • severity
    Filter by Severity
  • category
    Filter by Category
  • contents of the description field
    Filter by Description Text

For example, entering BMYNW into the search field will show all the events whose descriptions include that string.
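
The same three filters can be approximated in a script working on event dictionaries. This sketch makes several assumptions that the post does not confirm: the category is taken from the isf_category label, the description search is run against the event's message field, and the match is case-insensitive. The function name is hypothetical.

```python
def filter_events(events, severity=None, category=None, text=None):
    """Keep the events that match every filter that was supplied."""
    selected = []
    for event in events:
        labels = event.get("metadata", {}).get("labels", {})
        if severity and labels.get("severity") != severity:
            continue
        if category and labels.get("isf_category") != category:
            continue
        if text and text.lower() not in event.get("message", "").lower():
            continue
        selected.append(event)
    return selected


# Made-up sample data using label names that appear in this post.
sample = [
    {
        "message": "BMYNW0201-Sample network event",
        "metadata": {"labels": {"severity": "CRITICAL", "isf_category": "Network"}},
    },
    {
        "message": "BMYLC1000-Test Event for logcollector",
        "metadata": {"labels": {"severity": "CRITICAL"}},
    },
]
network_hits = filter_events(sample, text="BMYNW")
```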

CRITICAL events that have opened a ticket can be manually marked as Fixed from the menu at the end of each line. This allows the user to keep track of which events have been dealt with and which still need attention.

Fixed Status


After an event has been marked as fixed, its icon changes to a green checkmark.

Fixed Icon Changes


The list can also be sorted by Timestamp to examine the newest or oldest events still in the list.

Sort by Timestamp


The current contents of all events on a page can be downloaded as JSON using the download button.

Download Events
{
  "events": [
    {
      "metadata": {
        "name": "eventmanager.16a8b59eb206ee07",
        "namespace": "ibm-spectrum-fusion-ns",
        "selfLink": "/api/v1/namespaces/ibm-spectrum-fusion-ns/events/eventmanager.16a8b59eb206ee07",
        "uid": "aff71675-ad53-4bea-95eb-0bbd1b8ec911",
        "resourceVersion": "14437521",
        "creationTimestamp": "2021-09-27T14:55:22Z",
        "labels": {
          "alertname": "BMYNW0201",
          "controllerinstance": "isf-network-operator-controller-manager-6bbd95445d-6mdlp",
          "controllername": "isf-network-operator-controller-manager",
          "identifier": "5555",
          "isf_category": "Network",
          "isf_node": "control-0.rackd.mydomain.com",
          "isf_uid": "234f3f55-353d-4af8-b2fa-b114cc3b0dd3",
          "severity": "CRITICAL"
        },
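
A downloaded file can be post-processed with a few lines of Python, for example to reproduce the per-severity counts of the bar chart or to sort by timestamp. The sketch relies only on the parts of the structure visible in the excerpt above (a top-level "events" list whose items carry metadata.labels and metadata.creationTimestamp); the function names and the sample data are made up.

```python
import json
from collections import Counter


def severity_counts(download_text):
    """Count the events in a downloaded file by severity label."""
    events = json.loads(download_text)["events"]
    return Counter(e["metadata"]["labels"].get("severity", "UNKNOWN") for e in events)


def newest_first(download_text):
    """Return event names ordered from newest to oldest creationTimestamp."""
    events = json.loads(download_text)["events"]
    # UTC timestamps in this ISO 8601 form sort correctly as plain strings.
    ordered = sorted(events, key=lambda e: e["metadata"]["creationTimestamp"], reverse=True)
    return [e["metadata"]["name"] for e in ordered]


sample_download = json.dumps({
    "events": [
        {"metadata": {"name": "eventmanager.a", "creationTimestamp": "2021-09-27T14:55:22Z",
                      "labels": {"severity": "CRITICAL"}}},
        {"metadata": {"name": "eventmanager.b", "creationTimestamp": "2021-09-28T08:00:00Z",
                      "labels": {"severity": "WARNING"}}},
        {"metadata": {"name": "eventmanager.c", "creationTimestamp": "2021-09-26T10:30:00Z",
                      "labels": {"severity": "CRITICAL"}}},
    ]
})
counts = severity_counts(sample_download)
names = newest_first(sample_download)
```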