
Alerting notification frequencies in IBM Storage Insights

By TIBERIU HAJAS posted 18 days ago

The latest update of IBM Storage Insights (Q1 2021) improves the way alerts are displayed by consolidating them. Instead of displaying every single alert that is triggered and creating a lot of noise (for example, when a volume status changes from Normal to Error), repeated alerts are coalesced under a single alert with a reference to the number of occurrences:

[Image: IBM Storage Insights alerting frequency notification]

While the consolidation is great, there is also a notification frequency setting that further refines how alerts appear on the page. It is briefly described in the Documentation page. I would like to take a deep dive into some use cases for these frequencies and see where each one is useful.

Evidently, these notification frequencies are closely tied to the way data collection is performed on the infrastructure. Roughly, there are 3 methods by which the data (the metadata) is collected:
- probes (typically daily, with the possibility of a manual override)
- performance monitors (frequent, typically every 5 minutes and no more than 60 minutes apart)
- events (undetermined, but they can be frequent, several times a minute depending on the features activated in the infrastructure, e.g. DRP pools or tiering for SVC)

When a storage administrator defines a notification frequency, keeping these 3 data collection methods in mind is very useful.

Now let's dissect when and why a specific frequency is likely to be used. The first option is "Send every time condition is violated," which is fairly self-explanatory, but it is a double-edged sword and needs to be used cautiously!

Let's bring in a visual of an alert definition: 
[Image: IBM Storage Insights alert definition]

Each section has up to 3 categories:
- General
- Capacity 
- Performance

Setting the Probe Status to "Send every time condition is violated" should not cause any clutter (in terms of alerts): probes are typically scheduled to run daily, and even with some manual overriding they will only run a handful of times. However, setting Status to this notification frequency might cause an alert storm when the device has a faulty component (this status represents the events sent by the hardware).

Setting a Capacity or Performance metric to this notification frequency is definitely not a good idea, since the performance of a device is often sampled every 5 minutes and a capacity value can change due to events. For example, a volume's available space shrinks as applications write to it. So we need to keep in mind the 3 aspects from earlier: probes, performance monitors and events. The probe won't have a great effect on Capacity (since it's daily) and has no effect on the performance metrics in the alert definition, but events and the performance monitor are evaluated frequently. We also have to keep in mind that the alert consolidation described at the beginning of this blog only consolidates alerts from the same category when they happen within a close interval (or in the same evaluation interval). So if an event reports that a volume's available space has shrunk to 30%, and 5 minutes later another reports 28%, and so on, these are not going to be consolidated if the frequency is "Send every time condition is violated."
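To put a rough number on the clutter, here is a minimal sketch of the worst case under "Send every time condition is violated." It assumes one evaluation per collected sample and a condition that stays violated all day; the function name `evaluations_per_day` is made up for illustration and is not part of IBM Storage Insights.

```python
MINUTES_PER_DAY = 24 * 60

def evaluations_per_day(interval_minutes: int) -> int:
    """Number of evaluations in one day for a given collection interval."""
    return MINUTES_PER_DAY // interval_minutes

# Assumption: every evaluation of a continuously violated condition sends one notification.
for label, interval in [
    ("daily probe", 24 * 60),
    ("60-minute performance samples", 60),
    ("5-minute performance samples", 5),
]:
    print(f"{label}: up to {evaluations_per_day(interval)} notifications per day")
```

With 5-minute sampling, a metric that stays above its threshold could generate up to 288 notifications a day, compared with 1 for a daily probe. Events are not modeled here because their frequency is undetermined.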

Another good area where this frequency can be used is the General definition for entities like Disks, Ports and Host Connections, where ideally we want to monitor each and every action, such as "Removed Port."

Moving on to the next notification frequency: "Send once until problem clears." This is one of the most useful notification frequencies; however, it can be tricky to fully grasp how to apply it in various alert definitions. The most obvious case is a Performance-related alert definition. Here is another screen capture illustrating the scenario:

[Image: IBM Storage Insights performance alert]

For example, take the Performance metric Total IO Rate, set to trigger at 1000 ops/s, a value we determined as a baseline:
sample 1: value 1200, Triggers
sample 2: value 1500, No Trigger
sample 3: value 800, No Trigger (Cleared)
sample 4: value 1100, Triggers

If we look at the samples above, we can identify when the problem clears: sample 3 falls back below the threshold, so sample 4 triggers again. If the samples are taken every 5 minutes (which for IBM Storage Insights is the finest granularity), then across this 20-minute interval we end up with 2 triggered alerts. This notification frequency is definitely the one to use for Performance metrics; in fact, a newly created alert definition defaults to this option.
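The sample sequence above can be reproduced with a minimal sketch of the "Send once until problem clears" logic. The function `send_once_until_clear` is a hypothetical helper written for this post, not product code, and it assumes the condition is "value strictly greater than the threshold."

```python
def send_once_until_clear(values, threshold):
    """Return one flag per sample: True where a notification would be sent."""
    flags = []
    active = False  # True while the problem has been reported and has not cleared
    for value in values:
        violated = value > threshold  # assumption: strict "greater than" comparison
        flags.append(violated and not active)
        active = violated  # a sample at or below the threshold clears the problem
    return flags

samples = [1200, 1500, 800, 1100]  # Total IO Rate samples from the example (ops/s)
print(send_once_until_clear(samples, threshold=1000))
# [True, False, False, True]
```

Only the first violating sample after a clear sends a notification, which gives the 2 alerts counted above.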

This notification frequency is not a good candidate for General or Capacity alerts, because those values are either updated daily or unidirectional (growing or shrinking).
For example, with a probe status or capacity value:
day 1: value failed, Triggers
day 2: value failed, No Trigger
day 2: manual probe, value success (Cleared)
day 3: value success, No Trigger
day 4: value failed, Triggers

The third option is "Send every ... value ... unit." This is fairly versatile if the triggering condition has a periodicity. For example, a backup runs every 5 hours and causes a known IOPS surge, or some workloads cause a temporary burst in space usage, after which the space is reclaimed. So if we look at the same performance example as earlier:

For example, take the Performance metric Total IO Rate, set to trigger at 1000 ops/s with "Send every 5 hours":
sample 1: value 1200, Triggers
sample 2: value 1500, No Trigger
sample 3 (1h later): value 1500, No Trigger
sample 45 (4h later): value 1500, No Trigger
sample 65 (5h later): value 800, No Trigger
sample 75 (6h later): value 1100, Triggers

So the next trigger happens only when both the time has passed and the threshold is violated. If the nature of the workload is known, this can almost be looked at as a blackout time.
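The "blackout" behavior can be sketched in a few lines. This is an illustration under assumptions: the interval is measured from the last notification that was actually sent, the comparison is "strictly greater than," and the helper `send_every` is hypothetical. Sample times are given in minutes from the first sample.

```python
def send_every(samples, threshold, interval_minutes):
    """samples: list of (minute, value). Returns one send/no-send flag per sample."""
    flags = []
    last_sent = None  # minute of the most recent notification, if any
    for minute, value in samples:
        violated = value > threshold
        # Assumption: the blackout window starts at the last sent notification.
        due = last_sent is None or minute - last_sent >= interval_minutes
        send = violated and due
        if send:
            last_sent = minute
        flags.append(send)
    return flags

# The six samples from the example: 0 min, 5 min, 1h, 4h, 5h and 6h.
samples = [(0, 1200), (5, 1500), (60, 1500), (240, 1500), (300, 800), (360, 1100)]
print(send_every(samples, threshold=1000, interval_minutes=5 * 60))
# [True, False, False, False, False, True]
```

The sample at 5 hours sends nothing because the value is below the threshold, and the sample at 6 hours sends because both conditions hold.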

Now to make things even more complex, there is an additional option to suppress notifications even further: "Only send notifications after the condition is violated for ... value ... unit." This option is additive to the frequency set earlier. For example, "Send every time condition is violated" can be combined with "Only send notifications after the condition is violated for 15 minutes" for a performance alert on more than 400 IOPS, whose chart is the following:
[Image: IBM Storage Insights performance chart]

Most likely the equivalent can be achieved with "Send every 15 minutes." This option can be looked at as further fine-tuning for certain custom alerts where too many violations are seen due to unrefined periodicity.
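As a rough sketch of how the "only send after the condition is violated for 15 minutes" option could combine with "Send every time condition is violated," consider the following. The exact evaluation rules of the product are not spelled out here, so this assumes that the violation must be continuous, that a single sample at or below the threshold resets the timer, and that samples arrive every 5 minutes. The helper `send_after_sustained` and the sample values are made up for illustration.

```python
def send_after_sustained(samples, threshold, hold_minutes):
    """samples: list of (minute, value). Returns one send/no-send flag per sample."""
    flags = []
    violation_start = None  # minute at which the current unbroken violation began
    for minute, value in samples:
        if value > threshold:
            if violation_start is None:
                violation_start = minute
            # Send only once the violation has lasted at least hold_minutes.
            flags.append(minute - violation_start >= hold_minutes)
        else:
            violation_start = None  # assumption: any clear resets the timer
            flags.append(False)
    return flags

# Hypothetical IOPS samples at 5-minute intervals, threshold 400 IOPS.
samples = [(0, 450), (5, 500), (10, 300), (15, 420),
           (20, 430), (25, 460), (30, 480), (35, 490)]
print(send_after_sustained(samples, threshold=400, hold_minutes=15))
# [False, False, False, False, False, False, True, True]
```

The short spike at the start never reaches 15 minutes, so it is suppressed entirely; only the sustained violation from minute 15 onward produces notifications.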

To recap, there is a plethora of options when it comes to alert frequencies, and combined with the consolidation of occurrences they reduce the "alert noise" quite significantly. Still, when we define a new alert, we need to keep asking ourselves: Is this metric going to be triggered by a probe? Is it going to be triggered by an event? Or is it going to be triggered by the performance monitor, which is typically always on (it is fundamentally important to turn on performance monitoring; it's the bread and butter of IBM Storage Insights)? Once these questions are answered, the frequency can be set appropriately.

Custom alerts deserve special consideration, because there is a combined aspect to take into account. You can define a custom alert with 1 performance component and 1 general status; the notification frequency then needs to be the "lowest common denominator" in terms of frequency, which is likely to be dictated by the performance component, and accordingly "Send once until problem clears" is likely the best option.