Health Monitoring of IBM Spectrum Scale Cluster via External Monitoring Framework.

By SANDEEP PATIL posted Tue January 22, 2019 10:55 AM

  
Ack: Andreas Koeninger for his review and inputs.

Administrators and system integrators who manage data center infrastructure often have their own tools or frameworks for monitoring systems. When a storage system is deployed in a data center, its health monitoring therefore needs to be integrated into that existing framework. With IBM Spectrum Scale, this is possible through the IBM Spectrum Scale management API (REST API) and the IBM Spectrum Scale Performance Monitoring Bridge.

To start such an integration project, first get familiar with the REST API support in IBM Spectrum Scale, which is described here:
https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adm_restapi_main.htm

Once you understand the semantics, the next logical question is which API endpoints are supported and can be used to integrate with an external data center management tool or framework. They are listed here:
https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adm_listofapicommands.htm


Typically the goal is to monitor the health events of the Spectrum Scale cluster. The Spectrum Scale management API provides a REST GET request that fetches these events from the cluster, as documented here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adm_apiv2getnodehealthevents.htm

You can also filter the results on any of the event fields, such as component, severity, or state. For example, the following REST API call retrieves only "WARNING" events:

[root@os-11 ~]# curl -X GET -k -u admin:admin001 "https://localhost:443/scalemgmt/v2/nodes/os-11/health/events?filter=severity=WARNING"
{
"events" : [ {
"activeSince" : "2019-01-17 11:08:15,300",
"component" : "FILESYSTEM",
"description" : "An internally mounted or a declared but not mounted filesystem was detected",
"entityName" : "gpfs0",
"entityType" : "FILESYSTEM",
"message" : "The filesystem gpfs0 is probably needed, but not mounted",
"name" : "unmounted_fs_check",
"oid" : 3837,
"reportingNode" : "os-11.novalocal",
"severity" : "WARNING",
"state" : "DEGRADED",
"type" : "STATE_CHANGE",
"userAction" : "Run mmlsmount all_local to verify that all expected filesystems are mounted"
} ],
"status" : {
"code" : 200,
"message" : "The request finished successfully."
}
}
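
An external monitoring tool would typically parse this response rather than read it by eye. The following is a minimal Python sketch of that step, not an official client. It works on an abbreviated copy of the response above. The helper names parse_events and format_alert, and the one-line alert format, are hypothetical and chosen only for illustration. Treating an embedded status code other than 200 as a failure is an assumption based on the sample output.

```python
import json

# Abbreviated copy of the response shown above.
SAMPLE_RESPONSE = """
{
  "events": [ {
    "component": "FILESYSTEM",
    "entityName": "gpfs0",
    "message": "The filesystem gpfs0 is probably needed, but not mounted",
    "name": "unmounted_fs_check",
    "oid": 3837,
    "reportingNode": "os-11.novalocal",
    "severity": "WARNING",
    "state": "DEGRADED"
  } ],
  "status": { "code": 200, "message": "The request finished successfully." }
}
"""


def parse_events(body):
    """Return the list of events from a health/events response body.

    Raises ValueError if the embedded status code is not 200
    (assumption: 200 is the only success code of interest here).
    """
    doc = json.loads(body)
    status = doc.get("status", {})
    if status.get("code") != 200:
        raise ValueError("request failed: %s" % status.get("message"))
    return doc.get("events", [])


def format_alert(event):
    """Render one event as a single line (hypothetical format)."""
    return "{severity} {component}/{entityName} on {reportingNode}: {message}".format(**event)


alerts = [format_alert(e) for e in parse_events(SAMPLE_RESPONSE)]
for line in alerts:
    print(line)
```

The resulting lines can then be handed to whatever ingestion mechanism the external framework offers.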

So far so good, but the next question is: how do you retrieve events continuously and incrementally using this REST API?

Pulling events that appeared after a specific time is not directly possible in Spectrum Scale 5.0.2. It can be achieved, however, by using the oid associated with each event. The oid is the internal unique ID of the event. The external monitoring application that pulls the events needs to remember the oid of the latest event it has retrieved and then query the REST API with a filter on oid, e.g.:

[root@os-11 ~]# curl -k -u admin:admin001 -XGET -H content-type:application/json "https://localhost:443/scalemgmt/v2/nodes/:all:/health/events?filter=oid>1574"
{
"events" : [ {
"activeSince" : "2019-01-16 04:07:02,583",
"component" : "THRESHOLD",
"description" : "The thresholds value reached a normal level.",
"entityName" : "os-11.novalocal",
"entityType" : "NODE",
"message" : "The value of defined in mem_memfree for component MemFree_Rule/c671m2vm4 reached a normal level.",
"name" : "thresholds_normal",
"oid" : 1610,
"reportingNode" : "os-11.novalocal",
"severity" : "INFO",
"state" : "HEALTHY",
"type" : "STATE_CHANGE",
"userAction" : "N/A"
}, {
"activeSince" : "2019-01-16 05:06:53,297",
"component" : "THRESHOLD",
"description" : "The thresholds value reached a normal level.",
"entityName" : "os-12.novalocal",
"entityType" : "NODE",
"message" : "The value of defined in mem_memfree for component MemFree_Rule/c671m2vm3 reached a normal level.",
"name" : "thresholds_normal",
"oid" : 1612,
"reportingNode" : "os-11.novalocal",
"severity" : "INFO",
"state" : "HEALTHY",
"type" : "STATE_CHANGE",
"userAction" : "N/A"
} ],
"status" : {
"code" : 200,
"message" : "The request finished successfully."
}
}
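
The bookkeeping described above (remember the latest oid, then ask only for newer events) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions. The function poll_once and the in-memory fake_fetch are hypothetical names. fake_fetch stands in for the real REST call with filter=oid>N, and the code assumes, as the approach above implies, that newer events carry larger oid values.

```python
def poll_once(fetch_events, last_oid):
    """Fetch events newer than last_oid.

    fetch_events is a hypothetical callable taking the last seen oid and
    returning a list of event dicts (for example, the result of a REST call
    with filter=oid>last_oid). Returns (new_events, updated_last_oid).
    """
    # Filter again on the client side in case the callable returns extras.
    events = [e for e in fetch_events(last_oid) if e["oid"] > last_oid]
    events.sort(key=lambda e: e["oid"])
    # Advance the high-water mark only if something new arrived.
    new_last = events[-1]["oid"] if events else last_oid
    return events, new_last


# In-memory stand-in for the cluster, using the oids from the output above.
_STORE = [
    {"oid": 1610, "name": "thresholds_normal"},
    {"oid": 1612, "name": "thresholds_normal"},
]


def fake_fetch(last_oid):
    """Stand-in for the REST query; returns stored events with oid > last_oid."""
    return [e for e in _STORE if e["oid"] > last_oid]


first, mark = poll_once(fake_fetch, 1574)
second, mark_again = poll_once(fake_fetch, mark)
print([e["oid"] for e in first], mark)
print(second, mark_again)
```

In a real integration the high-water mark would likely be persisted (for example in a small state file) so that a restart of the monitoring application does not replay events it has already processed.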


You can also reduce the amount of data returned by limiting the output to the fields you are interested in, e.g.:
[root@os-11 ~]# curl -k -u admin:admin001 -XGET -H content-type:application/json "https://localhost:443/scalemgmt/v2/nodes/:all:/health/events?filter=oid>154&fields=activeSince,message,name,severity"
{
"events" : [ {
"activeSince" : "2019-01-16 04:07:02,583",
"message" : "The value of defined in mem_memfree for component MemFree_Rule/os-11.novalocal reached a normal level.",
"name" : "thresholds_normal",
"oid" : 1610,
"severity" : "INFO"
}, {
"activeSince" : "2019-01-16 05:06:53,297",
"message" : "The value of defined in mem_memfree for component MemFree_Rule/os-12.novalocal reached a normal level.",
"name" : "thresholds_normal",
"oid" : 1612,
"severity" : "INFO"
} ],
"status" : {
"code" : 200,
"message" : "The request finished successfully."
}
}
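
When the query is issued from a program instead of curl, the filter and fields parameters shown above can be composed by a small helper. The Python sketch below uses only the standard library. The names events_url and fetch_events are hypothetical. Percent-encoding the ">" in the filter is an assumption (it relies on the server decoding the query string in the standard way, whereas the curl examples send the character as-is). Disabling certificate verification mirrors curl -k and is only appropriate for test systems.

```python
import base64
import json
import ssl
from urllib.parse import quote
from urllib.request import Request, urlopen


def events_url(host, node=":all:", filter_expr=None, fields=None, port=443):
    """Build a health/events URL with optional filter and fields parameters."""
    params = []
    if filter_expr:
        # Keep '=' and ',' readable; other reserved characters such as '>'
        # are percent-encoded (assumption: the server decodes them).
        params.append("filter=" + quote(filter_expr, safe="=,"))
    if fields:
        params.append("fields=" + ",".join(fields))
    url = "https://%s:%d/scalemgmt/v2/nodes/%s/health/events" % (host, port, node)
    return url + ("?" + "&".join(params) if params else "")


def fetch_events(url, user, password, verify_tls=True, timeout=30):
    """GET the URL with HTTP basic authentication and return the event list."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req = Request(url, headers={"Authorization": "Basic " + token,
                                "Accept": "application/json"})
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Equivalent of curl -k: skip certificate checks (test systems only).
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urlopen(req, timeout=timeout, context=ctx) as resp:
        return json.loads(resp.read().decode("utf-8")).get("events", [])


url = events_url("localhost", filter_expr="oid>154",
                 fields=["activeSince", "message", "name", "severity"])
print(url)
```

A call such as fetch_events(url, "admin", "admin001", verify_tls=False) would then correspond to the curl command above. It is not executed here because it needs a reachable cluster.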

In this way, the health monitoring events of a Spectrum Scale cluster can be integrated into an external monitoring framework using the Spectrum Scale REST API. For monitoring cluster performance externally (with tools like Grafana), the following blog post takes a deeper dive:

https://developer.ibm.com/storage/2018/12/18/ibm-elastic-storage-server-performance-graphical-visualization-with-grafana/
