Monitoring IBM Spectrum Scale with Icinga2 / Nagios

By Archive User posted Fri January 29, 2016 09:11 AM

  

Originally posted by: AlexanderSaupp


IBM Spectrum Scale is a clustered scale-out file system, nowadays mostly running on the Linux platform. To assure proper operation of large Spectrum Scale environments, it is critical to keep track of hardware, software and environmental situations that require attention. There are potentially many components and vendors involved (server, operating system, network, storage), which introduces complex requirements for the monitoring capabilities.

In general, Spectrum Scale does not impose a specific monitoring solution, but provides flexibility for integration into diverse operational procedures. The solution a customer chooses will be influenced by the monitoring environment already in place, as most data centers have established operational guidelines and tool chains.

This article summarizes best practices for monitoring Spectrum Scale environments with the Icinga 2 tool, which is commonly seen as the successor to the popular Nagios 3. However, many concepts described herein can easily be adapted to other monitoring solutions - most parts would work with Nagios in the same manner. The samples given in this document are simplified and most probably incomplete, as monitoring large environments is not the authors' daily business. Comments and extensions are very welcome!

After an introduction to the logical configuration of the Icinga 2 tool, this article provides general recommendations and best practices for monitoring Spectrum Scale environments. The configuration of a test setup in an IBM lab is given at the end, including downloadable configuration samples. This article does not focus on performance monitoring, which is a separate topic of its own and is addressed, e.g., by mmperfmon and the Spectrum Scale GUI.

 

Authored by Achim Christ & Alexander Saupp

 

Introducing Icinga 2

Base Concept

Both Nagios and Icinga provide datacenter monitoring capabilities by periodically running scripts (named 'checks') and evaluating their return code (0 = success, 1 = warning, 2 = critical, 3 = unknown). The output (stdout) returned by such checks provides further details on test results in human-readable format. The new Icinga 2 API is not in scope of this article.
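
The interface between the monitoring server and a check is thus deliberately simple: print a one-line status message to stdout and exit with the corresponding return code. The following minimal Bash sketch illustrates this convention (script name and threshold are made up for illustration, it is not one of the shipped plugins):

#!/bin/bash
# check_users_example.sh - minimal sketch of a Nagios/Icinga check plugin
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
USERS=$(who | wc -l)
if [ "$USERS" -gt 50 ]; then
  echo "CRITICAL - $USERS users logged in"
  exit 2
elif [ "$USERS" -gt 20 ]; then
  echo "WARNING - $USERS users logged in"
  exit 1
else
  echo "OK - $USERS users logged in"
  exit 0
fi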

Nagios and Icinga are well documented; icinga.org provides a good overview of the available features.

Icinga already comes with a significant number of pre-defined service checks for common datacenter equipment, for example for checking CPU, RAM and disk utilization of servers. On top of that, Icinga provides a plugin architecture which allows for defining custom service checks specific to a certain configuration. Web portals such as Nagios Exchange or Icinga Exchange are available for users to share and collaborate on such plugins.

The optional Icinga Web package provides a graphical representation of the overall datacenter status. Typically, such a dashboard would represent the state of an entire datacenter including all servers, as well as storage and networking equipment.

Icinga Web 2 Host Groups

Icinga Web 2 Service Problems

 

Configuration Concept

Icinga 2 configuration is similar to, but more advanced than, Nagios definitions. Icinga 2 has a concept of 'Host objects', 'Service definitions' and 'Check commands'. The following is a minimal example of a valid Icinga 2 configuration file:

object Host "server1" {
  address = "192.168.0.1"
  check_command = "hostalive"
  vars.os = "Linux"
}

object Service "ping4" {
  host_name = "server1"
  check_command = "ping4"
}

This example creates a Host object named 'server1' and defines a service 'ping4' which is periodically checked for that specific host. The address defined for the host is passed as a parameter to the ping4 check command. A wide range of check commands is readily available with Icinga, such as check_ssh, check_http or check_smtp. Host objects can also have a check command associated directly, which determines whether the host itself is alive.

Once Icinga has executed a service check at least once, it knows the status of that service. The status of all services and hosts is shown in the web frontend, but it is also possible to configure additional reporting and alerting for certain situations. When hosts or services fail, Icinga can send notifications via SMTP (email) or SNMP. However, this part of the configuration is not covered in this article.

In addition to the (IP) address, hosts can have further attributes associated with them. Examples of such attributes (variables) include the operating system, as shown in the example above, but also service-specific parameters such as the names of file systems that are supposed to be mounted. This allows for using rather generic check commands which are then executed with very specific settings in different situations.
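
The following sketch illustrates the pattern with a hypothetical check command (the name check_fs_mounted, its plugin and the mount point are made up; only the mechanism of passing a host variable to a generic command is the point here):

object CheckCommand "check_fs_mounted" {
  import "plugin-check-command"
  command = [ PluginDir + "/check_fs_mounted" ]
  arguments = {
    "-m" = "$mountpoint$"    // resolved from the host's vars.mountpoint at runtime
  }
}

object Host "server1" {
  address = "192.168.0.1"
  check_command = "hostalive"
  vars.os = "Linux"
  vars.mountpoint = "/gpfs/group1fs"    // host-specific parameter consumed by the generic check
}

object Service "fs_mounted" {
  host_name = "server1"
  check_command = "check_fs_mounted"
}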

There are numerous ways to associate services with hosts. In the example above one may assign specific service definitions to all Linux servers:

apply Service "ssh" {
  check_command = "ssh"
  assign where host.vars.os == "Linux"
}

Icinga allows for flexible rulesets when assigning services to hosts. These rules might be based on complex regular expressions using variables defined for the host. In most deployments, however, traditional 'Host groups' provide sufficient flexibility for efficiently managing large numbers of hosts and services:

object HostGroup "linux-servers" {
  display_name = "Linux Servers"
}

object Host "server1" {
  address = "192.168.0.11"
  groups += [ "linux-servers" ]
}

object Host "server2" {
  address = "192.168.0.12"
  groups += [ "linux-servers" ]
}

apply Service "ssh" {
  check_command = "ssh"
  assign where "linux-servers" in host.groups
}

This example creates a group of Linux servers and adds 'server1' and 'server2' to it. All services associated with the host group will automatically be executed for all hosts within that group. As a host can be a member of numerous groups, this allows for a flexible configuration which is easily extensible in the future.

 

Monitoring Remote Systems

In general, check commands are executed locally on the monitoring server. Agentless checks verify the availability of remote services accessible through the network, as shown in the previous examples. Icinga 2 offers multiple concepts for agent-based checks which allow for executing plugins on remote machines:

  • Icinga 2 Client
  • SSH
  • SNMP
  • NRPE (Nagios Remote Plugin Executor)
  • NSClient++

This article is based on NRPE - even though some favor the Icinga 2 client for security reasons. IBM Spectrum Scale typically runs within secured datacenter environments with limited exposure to security risks. Using NRPE ensures compatibility with the large number of environments that run Nagios or variants thereof.

 

NRPE Details (Nagios Remote Plugin Executor)

An excerpt from the official documentation explains the NRPE architecture:

check_nrpe is a plugin executed by the local Icinga server like any other plugin. It calls the NRPE process which is running as a daemon on the remote machine. The daemon itself executes the plugin on the same machine and transmits the information gathered back to the check_nrpe plugin which in turn delivers it to Icinga.

Nagios Remote Plugin Executor Architecture

NRPE provides a simple yet powerful mechanism to check the availability of remote services not accessible through the network directly. An NRPE daemon needs to be installed and running on the remote machine, accepting service check requests defined on the Icinga server. The NRPE daemon is available for a variety of platforms and is already included in the standard package repositories of most Linux distributions.

For security reasons, each check command available through NRPE has to be defined locally on the remote machine (/etc/nagios/nrpe.cfg). NRPE has a parameter 'dont_blame_nrpe' which specifies whether or not arguments are allowed to be sent by the monitoring server when calling such checks.

# COMMAND ARGUMENT PROCESSING
# This option determines whether or not the NRPE daemon will allow clients
# to specify arguments to commands that are executed.  This option only works
# if the daemon was configured with the --enable-command-args configure script
# option.  
#
# *** ENABLING THIS OPTION IS A SECURITY RISK! ***
# Read the SECURITY file for information on some of the security implications
# of enabling this variable.
#
# Values: 0=do not allow arguments, 1=allow command arguments

dont_blame_nrpe=1

This leaves administrators with two alternative ways of checking, for example, the mount state of file systems on remote machines:

  • Hard-code the file system name(s) in the /etc/nagios/nrpe.cfg file on each client (see the example after this list). If different clients have different file systems to be checked, the nrpe.cfg files have to be adapted per host.
  • Enable 'dont_blame_nrpe=1' within the nrpe.cfg client configuration file. Create a variable, e.g. in the host object definition that contains the file system name, and pass this parameter to the check. In this manner, the nrpe.cfg file is identical on each client.
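
For illustration, the first (hard-coded) alternative might look like this in /etc/nagios/nrpe.cfg, using the file system name from the lab setup described below:

  command[check_spectrumscale_capacity]=/usr/bin/sudo /usr/lib64/nagios/plugins/check_spectrumscale.sh -m group1fs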

 

NRPE client configuration for the second (argument-passing) alternative, /etc/nagios/nrpe.cfg:

  command[check_spectrumscale_capacity]=/usr/bin/sudo /usr/lib64/nagios/plugins/check_spectrumscale.sh -m$ARG1$
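
Before wiring this into Icinga, the NRPE path can be verified end to end by invoking check_nrpe manually on the Icinga server - a hypothetical test invocation, using the address and names from the lab setup described below:

  /usr/local/nagios/libexec/check_nrpe -H 192.168.0.112 -c check_spectrumscale_capacity -a group1fs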

 

Icinga server configuration: /etc/icinga2/conf.d/IBM.conf:

  object Host "g1_node1" {
    address = "192.168.0.112"
    check_command = "hostalive"
    groups += [ "IBMSpectrumScale", "IBMSpectrumScaleNSDClient" ]
    vars.os = "Linux"
    vars.fs = "group1fs"
  }

  object CheckCommand "check_spectrumscale_capacity" {
    import "plugin-check-command"
    command = [ "/usr/local/nagios/libexec/check_nrpe" ]
    arguments = {
      "-H" = "$address$"
      "-c" = "check_spectrumscale_capacity"
      "-a" = "$fs$"
    }
  }
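
What is still missing is a service definition tying the check command to the monitored hosts. A minimal sketch, based on the host group used in the host object above (the $fs$ macro in the check command resolves against vars.fs defined on each host, so every host can carry its own file system name):

  apply Service "spectrumscale_capacity" {
    check_command = "check_spectrumscale_capacity"
    assign where "IBMSpectrumScaleNSDClient" in host.groups
  }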


Sample Implementation

Monitoring Design

The monitoring scenario outlined in this section contains a single Icinga 2 instance, which also runs an Icinga Web 2 frontend for visualization. A couple of IBM Spectrum Scale servers with an active file system are monitored via NRPE.

Two types of checks will be defined:

  • Base OS checks (CPU, RAM, network...)
  • Spectrum Scale specific checks (GPFS state, file system mount state...)

All Spectrum Scale nodes are contained within a 'spectrumscale' host group object, which is easily extensible once more servers are added to the configuration, and which allows for adding additional service checks as required. Since user-defined groups are also available in Spectrum Scale (there they are referred to as 'node classes'), the node grouping concepts can be applied equally. In large configurations it is advisable to define more granular groups of Spectrum Scale hosts, such as NSD clients and NSD servers.
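
The corresponding host group objects might look as follows (the first two group names appear in the host object shown earlier; the NSD server group name is an assumption):

  object HostGroup "IBMSpectrumScale" {
    display_name = "IBM Spectrum Scale Nodes"
  }

  object HostGroup "IBMSpectrumScaleNSDClient" {
    display_name = "IBM Spectrum Scale NSD Clients"
  }

  object HostGroup "IBMSpectrumScaleNSDServer" {
    display_name = "IBM Spectrum Scale NSD Servers"
  }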

Lab Setup

Icinga Spectrum Scale Demo Setup

Service Checks

Base Linux OS Service Checks

The following list of standard checks can be used to get an overview of the general system health of each node. Depending on the type of server, the list can be expanded with hardware events as well as application-specific service checks. If, for example, the server acts as a web server, an additional service check using check_http might be assigned. A sketch of matching NRPE command definitions follows the lists below.
 

For all nodes:

  • check_ping - Use ping to check connection statistics for a remote host.
  • check_linux_bonding - Checks bonding interfaces on Linux.
  • check_load - Tests the current system load average.
  • check_disk - Checks the amount of used disk space on a local file system.

For NSD servers:

  • check_multipath - Checks multipath connections to SAN storage on Linux. It also has an option to specify a required redundancy level.
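
A sketch of the corresponding command definitions in /etc/nagios/nrpe.cfg on the monitored nodes (plugin paths and thresholds are illustrative and will differ per distribution and environment; check_ping runs agentless from the Icinga server and needs no NRPE entry):

  command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
  command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
  command[check_linux_bonding]=/usr/lib64/nagios/plugins/check_linux_bonding
  command[check_multipath]=/usr/lib64/nagios/plugins/check_multipath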

IBM Spectrum Scale Service Checks

It is also common practice to develop custom, specific service checks for non-standard applications. Such custom check commands can be written in any language that can be executed on the monitored server - simple Bash scripts introduce minimal dependencies and typically come with a very small footprint.

For servers running Spectrum Scale, the following list of custom checks can be used to get more specific information on cluster file systems. A minimal sketch of such a check follows the lists.

For all nodes:

  • GPFS status, as reported by mmgetstate

For NSD clients:

  • File system mount state, as reported by mmlsmount
  • Capacity per pool and inode monitoring per file set, as reported by mmlsfileset and mmdf.
    Attention: mmdf is I/O intensive, so this check should be run infrequently, e.g. once per day
  • Quota monitoring per user, group and file set, as reported by mmlsquota

For NSD servers running IBM Spectrum Scale RAID:

  • Physical disk state, as reported by mmlspdisk
  • Overall system state, as reported by gnrhealthcheck
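
To illustrate the basic structure of such a check, here is a minimal, hypothetical Bash sketch of a GPFS state check based on mmgetstate. It is not the implementation from the repository linked below, and it parses the human-readable output for brevity; a production check would rather use the -Y output described in the recommendations further down.

  #!/bin/bash
  # Hypothetical sketch of a GPFS state check (not the actual check_spectrumscale.sh)
  # mmgetstate without arguments reports the state of the local node
  STATE=$(/usr/lpp/mmfs/bin/mmgetstate 2>/dev/null | awk '$1 ~ /^[0-9]+$/ {print $3}')
  case "$STATE" in
    active)      echo "OK - GPFS state is active";                exit 0 ;;
    arbitrating) echo "WARNING - GPFS state is $STATE";           exit 1 ;;
    "")          echo "UNKNOWN - could not determine GPFS state"; exit 3 ;;
    *)           echo "CRITICAL - GPFS state is $STATE";          exit 2 ;;
  esac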

A sample check implementation along with sample configuration files for Icinga 2 can be found in the following Git repository:

https://gitlab.com/itsmee/icinga/tree/master

Further Ideas

In addition to monitoring the state of the file system and supporting components, one may choose to monitor additional services running on the cluster nodes. This may include underlying file protocol services such as Samba (SMB protocol), Ganesha (NFS protocol), or OpenStack Swift (Object protocol). These components are available with the optional Spectrum Scale protocol support package, but monitoring such services is beyond the scope of this article.

Recommendations for Monitoring Spectrum Scale

Once a monitoring solution is in place, and remote monitoring is established among the nodes of a Spectrum Scale cluster, the configuration can easily be extended to incorporate additional checks. Using host groups for NSD clients and NSD servers allows administrators to easily and consistently apply specific service checks to a large number of nodes. When developing further checks, keep the following recommendations in mind:

  • Each check command adds load to the system. Carefully evaluate the load (and potential performance impact) of the Spectrum Scale commands used, and consider starting with lower check frequencies such as hourly or daily.
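
  For instance, the capacity service sketched earlier could be throttled to a daily schedule via the standard check_interval and retry_interval service attributes:

  apply Service "spectrumscale_capacity" {
    check_command = "check_spectrumscale_capacity"
    check_interval = 1d    // the mmdf-based capacity check is costly, once per day is enough
    retry_interval = 1h    // re-check sooner once a problem has been detected
    assign where "IBMSpectrumScaleNSDClient" in host.groups
  }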

  • The NRPE plugin has a timeout value which limits how long a running check may take before it is given up on. Certain Spectrum Scale commands may run for an extended period of time - consider raising the timeout value (-t) as in the following example:

  object CheckCommand "check_nrpe_long" {
    import "plugin-check-command"
    command = [ "/usr/local/nagios/libexec/check_nrpe" ]
    arguments = {
      "-H" = "$address$"
      "-t" = "60"
      ...
    }
  }
  • Most Spectrum Scale commands (mm...) have a -Y parameter to generate machine-parsable output: colon (':') separated values which can be processed easily, as in the following example.
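
  A hypothetical illustration of parsing -Y output: instead of hard-coding field positions (which vary between commands and releases), each value is paired with its column name from the HEADER line:

  /usr/lpp/mmfs/bin/mmgetstate -Y | awk -F: '
    /HEADER/ { for (i = 1; i <= NF; i++) name[i] = $i; next }
             { for (i = 1; i <= NF; i++) print name[i] "=" $i }'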

  • Spectrum Scale commands need to be run as the root user, while the NRPE daemon typically runs as a dedicated, unprivileged user (e.g. 'nrpe'). NRPE allows for automatically adding a command prefix to all check commands, which can be used to elevate the permissions of the NRPE user when running service checks.

NRPE configuration /etc/nagios/nrpe.cfg:

  command_prefix=/usr/bin/sudo

Sudo configuration /etc/sudoers:

  nrpe ALL=(ALL) NOPASSWD: /usr/lib64/nagios/plugins/
  Defaults:nrpe !requiretty
  • While some argue that NRPE command argument processing gives attackers more options, it also allows for a centralized, well-organized configuration. In large Spectrum Scale clusters such efficient management capabilities are considered mandatory. As storage clusters typically run within secured datacenter environments, the security concerns resulting from the use of NRPE checks may be acceptable for most deployments. In cases where strict security is required, other approaches can be considered.

  • The following (incomplete) list of important configuration files is meant as a quick start for newcomers.

Icinga server configuration:

/etc/icinga2/conf.d/*.conf - Icinga configuration files

/usr/lib64/nagios/plugins/check_* - available check commands

 

NRPE client configuration:

/etc/nagios/nrpe.cfg - NRPE (client) definitions

/usr/lib64/nagios/plugins/check_* - available check commands

/etc/sudoers - be sure to add nrpe user when using sudo with check commands

Summary

IBM Spectrum Scale integrates nicely into the concepts and architecture found in monitoring solutions such as Nagios or Icinga. The above article outlines a sample implementation based on the Icinga 2 tool, but most recommendations can easily be adapted to other monitoring solutions as well. A sample check implementation, as outlined above, along with sample definition files can be found in the following Git repository:

 

https://gitlab.com/itsmee/icinga/tree/master


Comments

Tue April 04, 2017 02:44 AM

Originally posted by: AlexanderSaupp


check_spectrumscale is now integrated with 'mmhealth node', and performs the inode checks on the cluster manager only to reduce system load.

Thu April 28, 2016 04:03 AM

Originally posted by: AlexanderSaupp


Had an interesting discussion lately: somebody proposed to use the 'cluster manager role' to get an HA way of doing costly checks only once. E.g.: implement an inode check that is applied to all cluster nodes, but the check itself terminates without generating load if the node is not the cluster manager. Interesting - will consider that for an update, along with some other checks I was pointed to.