Viewing and interpreting events in IBM Spectrum Fusion
Introduction
The Event Manager is responsible for collecting and processing all alerts generated by IBM Spectrum Fusion components. These alerts are initially converted into Kubernetes events and placed on the event queue, so they can be accessed like normal events. However, additional information is added to the events so that other components in IBM Spectrum Fusion can provide additional processing, such as opening Call Home tickets for designated critical events, and visualization through the Event Manager UI page, where the events can be filtered, searched, and downloaded.
Basic Function
Event Manager has three basic functions:
- Receive alerts - Any component can POST an alert in AlertManager v4 format to the following URL: https://eventmanager-ibm-spectrum-fusion-ns.apps.cps-r81-9-46-123-89.rtp.raleigh.ibm.com/api/v1/eventmanager/alerts
- Convert alerts to events - The information in the alert is converted into the labels and annotations of an OCP event, and that event is placed on the OCP event queue. In general, alert and event are synonymous, but in this document Event Manager receives alerts and converts them into events.
- Open tickets - If the ISF severity is CRITICAL and the alertname of the alert is listed in the isf-serviceability-operator-allow-tickets config map, Event Manager sends information to the Call Home Client to open a ticket and automatically upload a default set of files.
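As a rough illustration of the first function, the sketch below builds an alert and wraps it in a POST request. It assumes that "AlertManager v4 format" means the Prometheus Alertmanager webhook payload with version "4". The alertname, severity, and message are taken from the example event in this document; the helper names build_alert_payload and build_request, the receiver name, and the placeholder host are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical placeholder host; substitute the Event Manager route of your cluster.
EVENT_MANAGER_URL = "https://eventmanager.example.invalid/api/v1/eventmanager/alerts"


def build_alert_payload(alertname, severity, message, extra_labels=None):
    """Return a dict shaped like an Alertmanager webhook (version 4) payload."""
    labels = {"alertname": alertname, "severity": severity}
    labels.update(extra_labels or {})
    alert = {
        "status": "firing",
        "labels": labels,
        "annotations": {"message": message},
        "startsAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return {
        "version": "4",
        "status": "firing",
        "receiver": "eventmanager",  # assumed receiver name
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": alert["annotations"],
        "alerts": [alert],
    }


def build_request(payload, url=EVENT_MANAGER_URL):
    """Wrap the payload in a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_alert_payload(
    "BMYLC1000", "CRITICAL", "BMYLC1000-Test Event for logcollector"
)
request = build_request(payload)
# urllib.request.urlopen(request) would send it; authentication and TLS
# trust settings are outside the scope of this sketch.
```

Which labels and annotations Event Manager requires beyond alertname and severity is not stated here, so treat the field set above as a starting point only.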
Alert Sources
Event Manager can receive alerts from up to four different sources:
- ISF Components, like the Compute or Networking operators
- Spectrum Scale
- ISF hardware SNMPv3 traps
- Prometheus (has not been tested yet)
Event Severities
Events have three severities:
- INFO - just for information; the user does not need to take any action; will typically live 3-4 hours
- WARNING - a condition has been identified that should be examined by the user within the next few days; will typically live 7 days
- CRITICAL - a condition has been identified that should be examined by the user immediately; if the name of this alert is in the isf-serviceability-operator-allow-tickets config map, Event Manager will also open a Call Home ticket (if enabled) and automatically upload logs (if enabled); will typically live 14 days
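The ticket rule for CRITICAL events can be sketched as a small predicate. This is an illustration of the rule as described above, not Event Manager's actual code: the function name should_open_ticket, the call_home_enabled flag, and the contents of the allow list are assumptions.

```python
# Typical lifetimes quoted in the severity list above.
TYPICAL_LIFETIME = {"INFO": "3-4 hours", "WARNING": "7 days", "CRITICAL": "14 days"}


def should_open_ticket(severity, alertname, allowed_alertnames, call_home_enabled=True):
    """Return True only for CRITICAL alerts whose name is on the allow list."""
    return bool(
        call_home_enabled
        and severity == "CRITICAL"
        and alertname in allowed_alertnames
    )


# Hypothetical contents of the isf-serviceability-operator-allow-tickets config map.
allowed = {"BMYLC1000"}

print(should_open_ticket("CRITICAL", "BMYLC1000", allowed))  # True
print(should_open_ticket("WARNING", "BMYLC1000", allowed))   # False
```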
Event Fixed Status
Each ISF event in the OCP event queue has a field in its annotations named isf_fixed. It can be set to true or false. INFO events are created with isf_fixed=true, while WARNING and CRITICAL events are set to isf_fixed=false.
The intent is to give the user a way to mark that a particular event has been investigated and does not need any more attention.
Typically, the user will see a new WARNING event on the Events page that specifies a condition (for example, disk XXX is more than 80% full) and is created with isf_fixed=false. The user can then follow up on the condition and, once it has been dealt with, go back to the UI and set the fixed status to true to indicate that the condition does not need any more attention. (Note: For the initial release, only CRITICAL events can be changed, and only from fixed=false to fixed=true.)
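The initial-release restriction in the note can be expressed as a guard on an in-memory event. The sketch assumes the event is a dict with the same metadata layout as the example event in this document (severity in the labels, isf_fixed stored as the string "true" or "false" in the annotations). The function name mark_fixed is hypothetical, and how the UI persists the change is not described here.

```python
def mark_fixed(event):
    """Set isf_fixed to "true" if the initial-release rules allow it."""
    labels = event["metadata"]["labels"]
    annotations = event["metadata"]["annotations"]
    if labels.get("severity") != "CRITICAL":
        raise ValueError("only CRITICAL events can be changed")
    if annotations.get("isf_fixed") != "false":
        raise ValueError("only a change from false to true is allowed")
    annotations["isf_fixed"] = "true"
    return event


critical = {
    "metadata": {
        "labels": {"alertname": "BMYLC1000", "severity": "CRITICAL"},
        "annotations": {"isf_fixed": "false"},
    }
}
mark_fixed(critical)
print(critical["metadata"]["annotations"]["isf_fixed"])  # true
```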
Alert Processing
Event Manager checks every received alert for duplication, based on its labels and fixed status:
- Determine the labels of the incoming alert.
- Search for an existing event with the same set of labels.
- Compare the fixed status of the incoming alert with that of the found event.
If an event is found that matches the set of labels of the incoming alert, but the event has fixed=true and the alert has fixed=false, a new event will be created. Otherwise, the new alert will be considered a duplicate of the found event.
There are three fields in each event that help keep track of duplication:
- isf_first_seen - the date and time the event was first received
- isf_last_seen - the date and time the most recent duplicate was received
- isf_times_seen - the number of times a duplicate has been found for this event
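The duplicate check and the three tracking fields can be sketched together as follows. This is a minimal in-memory model, not the actual implementation: the Event class and process_alert function are hypothetical, the counter starts at 1 because the example event in this document shows isf_times_seen: "1" with identical first-seen and last-seen times, and choosing the most recently created match when several events share the same labels is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Event:
    labels: dict
    fixed: bool
    first_seen: str
    last_seen: str
    times_seen: int = 1  # assumed to start at 1, as in the example event


def process_alert(events, labels, fixed, now):
    """Return the event that represents the alert, creating one if needed."""
    # Assumption: prefer the most recently created event with the same labels.
    match = next((e for e in reversed(events) if e.labels == labels), None)
    reopened = match is not None and match.fixed and not fixed
    if match is not None and not reopened:
        # Duplicate: update the tracking fields on the existing event.
        match.last_seen = now
        match.times_seen += 1
        return match
    # No match, or the match was already fixed and the alert is not: new event.
    event = Event(labels=dict(labels), fixed=fixed, first_seen=now, last_seen=now)
    events.append(event)
    return event


events = []
labels = {"alertname": "BMYLC1000", "severity": "CRITICAL"}
first = process_alert(events, labels, False, "2021-09-09T16:24:36Z")
second = process_alert(events, labels, False, "2021-09-09T17:24:36Z")
first.fixed = True  # the user marks the event as fixed
third = process_alert(events, labels, False, "2021-09-10T16:36:03Z")
print(len(events), first.times_seen, third.times_seen)  # 2 2 1
```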
Event Maintenance
To keep events around longer than the lifetime of an OCP event, Event Manager will update a field in each WARNING and CRITICAL event every hour. This prevents OCP from deleting the event.
Command Line Tools
To see a list of the current ISF events from the OCP command line, type:
oc get event --field-selector reason=ISFEventManager
LAST SEEN   TYPE      REASON            OBJECT                    MESSAGE
168m        Warning   ISFEventManager   deployment/eventmanager   BMYLC1000-Test Event for logcollector - 968
151m        Warning   ISFEventManager   deployment/eventmanager   BMYLC1000-Test Event for logcollector, longer collection - 225
Note that OCP events have a TYPE field, which can be either Normal or Warning. All ISF events with severity=INFO have Type=Normal. ISF events with severity=WARNING or severity=CRITICAL have Type=Warning.
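For scripted use, the same command can be run with JSON output and the result checked against the severity-to-type rule above. The sketch assumes a logged-in oc client and the standard Kubernetes event list layout (an items array whose entries carry type, message, and metadata.labels). The function names fetch_events_json, summarize, and type_matches_severity are hypothetical, and the sample data is made up from the output shown above.

```python
import json
import subprocess

# Severity-to-type rule stated in the note above.
EXPECTED_TYPE = {"INFO": "Normal", "WARNING": "Warning", "CRITICAL": "Warning"}


def fetch_events_json():
    """Run the oc command from this section with JSON output."""
    result = subprocess.run(
        ["oc", "get", "event", "--field-selector", "reason=ISFEventManager", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def summarize(events_json):
    """Return (severity, type, message) tuples for each event in the list."""
    rows = []
    for item in json.loads(events_json).get("items", []):
        severity = item.get("metadata", {}).get("labels", {}).get("severity", "")
        rows.append((severity, item.get("type", ""), item.get("message", "")))
    return rows


def type_matches_severity(severity, ocp_type):
    """Check an event's OCP type against its ISF severity."""
    return EXPECTED_TYPE.get(severity) == ocp_type


# Made-up sample in place of live output from fetch_events_json().
sample = json.dumps({
    "items": [
        {
            "type": "Warning",
            "message": "BMYLC1000-Test Event for logcollector - 968",
            "metadata": {"labels": {"alertname": "BMYLC1000", "severity": "CRITICAL"}},
        }
    ]
})
for severity, ocp_type, message in summarize(sample):
    print(severity, ocp_type, type_matches_severity(severity, ocp_type), message)
```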
Example Event
apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2021-09-10T16:36:03Z"
involvedObject:
  apiVersion: apps/v1
  kind: Deployment
  name: eventmanager
  namespace: ibm-spectrum-fusion-ns
  resourceVersion: "88029876"
  uid: 24877f13-3ed5-4062-98cd-fb233385050b
kind: Event
lastTimestamp: "2021-09-10T16:36:03Z"
message: BMYLC1000-Test Event for logcollector - 968
metadata:
  annotations:
    cause: A situation like a SGPanic or a quorum loss could initiate the unmount
    container_restart: "false"
    container_unready: "false"
    description: A filesystem was forced to unmount by SpectrumScale
    ftdc_scope: ""
    identifier: gpfs02
    internalComponent: fsmount
    is_resolvable: "false"
    isf_first_seen: "2021-09-09T16:24:36Z"
    isf_fixed: "false"
    isf_last_seen: "2021-09-09T16:24:36Z"
    isf_ticket_id: TS003759703
    isf_ticket_id_confirmed: "true"
    isf_ticket_id_logsuploaded: "true"
    isf_ticket_requested_at: "2021-09-09T16:24:36Z"
    isf_times_seen: "1"
    logk8s00: route.openshift.io:v1:routes:ibm-spectrum-fusion-ns
    loglel00: isf-collection-sets:must-gather
    message: BMYLC1000-Test Event for logcollector
    priority: ""
    remedy: ""
    requireUnique: "true"
    scope: NODE
    time: "2021-04-14T10:19:12-07:00"
    tzone: PDT
    user_action: 'Check error messages and the error log for further details. Also see the topic File system forced unmount in the IBM Spectrum Scale documentation: Troubleshooting. File system issues'
  creationTimestamp: "2021-09-10T16:36:03Z"
  labels:
    alertname: BMYLC1000
    component: filesystem
    controllerinstance: logcollector-6446f68445-gs775
    controllername: logcollector
    entity_name: gpfs02
    entity_type: FILESYSTEM
    isf_node: master-7b50.cps-r81-9-46-123-89.rtp.raleigh.ibm.com
    isf_uid: 41b8dae3-0233-4733-9163-e5f27b551fe6
    node: "12"
    severity: CRITICAL
Event UI
On the Spectrum Fusion home page, the right-most panel lists the most recent events.
Clicking the arrow in the upper-right corner of that panel opens the Events page.
At the top of the Events page is a bar chart showing the distribution of the current events.
The list of events can be filtered by any combination of:
- severity
- category
- contents of the description field
For example, entering BMYNW into the search field shows all the events whose descriptions include that string.
CRITICAL events that have opened a ticket can be manually marked as Fixed from the menu at the end of each line. This allows the user to keep track of which events have been dealt with and which still need attention.
After an event has been marked as fixed, its icon changes to a green checkmark.
The list can also be sorted by Timestamp to examine the newest or oldest events still in the list.
The current contents of all events on a page can be downloaded as JSON using the download button. For example:
{
  "events": [
    {
      "metadata": {
        "name": "eventmanager.16a8b59eb206ee07",
        "namespace": "ibm-spectrum-fusion-ns",
        "selfLink": "/api/v1/namespaces/ibm-spectrum-fusion-ns/events/eventmanager.16a8b59eb206ee07",
        "uid": "aff71675-ad53-4bea-95eb-0bbd1b8ec911",
        "resourceVersion": "14437521",
        "creationTimestamp": "2021-09-27T14:55:22Z",
        "labels": {
          "alertname": "BMYNW0201",
          "controllerinstance": "isf-network-operator-controller-manager-6bbd95445d-6mdlp",
          "controllername": "isf-network-operator-controller-manager",
          "identifier": "5555",
          "isf_category": "Network",
          "isf_node": "control-0.rackd.mydomain.com",
          "isf_uid": "234f3f55-353d-4af8-b2fa-b114cc3b0dd3",
          "severity": "CRITICAL"
        },