Monitoring IBM Spectrum LSF with the TIG stack

By Gábor Samu posted Sun February 26, 2023 05:25 PM

  

I previously posted a blog which discussed visualizing Spectrum LSF data with Grafana. In that blog, I discussed using Grafana to pull data from Elasticsearch, which was logged by Spectrum LSF Suites using the included data collector (also included with LSF Explorer).

In keeping with the theme of monitoring, here is a different take on creating a dashboard for LSF using Telegraf, InfluxDB and Grafana, also known as the TIG stack. Before we begin, it's important to note that there are powerful monitoring and reporting tools available from IBM as add-ons to LSF: IBM Spectrum LSF RTM and IBM Spectrum LSF Explorer. You can find more details about the add-on capabilities for LSF here.

LSF provides many hooks which enable organizations to extend and customize its functionality. In this example, we'll demonstrate the feasibility of monitoring an LSF cluster with the TIG stack by scraping data from various LSF user commands, using their JSON-formatted output where possible. This blog is not meant as a detailed guide to deploying the TIG stack, so basic knowledge of the TIG stack is of benefit here.

Out of the box, Telegraf has the ability to monitor numerous system metrics. Furthermore, there exist literally hundreds of plugins for Telegraf to monitor a wide variety of devices, services and software. A search, however, didn't reveal any existing plugin to monitor LSF. A bit of research revealed that InfluxDB supports what is known as "line protocol", a well-defined text-based format for writing data to InfluxDB; I used the following reference on line protocol to guide me. Using line protocol, it would ultimately be possible to write a plugin for Telegraf to effectively scrape information from LSF and output it in line protocol format for writing to InfluxDB.
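
As a rough illustration, a single line protocol record consists of a measurement name, optional comma-separated tags, one or more fields and an optional nanosecond timestamp. The measurement and field names in the short Python snippet below are invented purely for illustration:

    # Minimal illustration of InfluxDB line protocol:
    # <measurement>[,<tag_key>=<tag_value>...] <field_key>=<field_value>[,...] [timestamp]
    import time

    measurement = "lsf_servers"                    # hypothetical measurement name
    fields = {"total": 4, "ok": 3, "closed": 1}    # hypothetical field values
    field_str = ",".join(f"{k}={v}i" for k, v in fields.items())  # "i" marks integer fields
    timestamp = int(time.time() * 1e9)             # line protocol timestamps are nanoseconds

    print(f"{measurement} {field_str} {timestamp}")
    # -> lsf_servers total=4i,ok=3i,closed=1i 1674150831000000000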

Before I could begin writing the plugin, the key was to determine what information from Spectrum LSF would be useful to display in the dashboard, and how that information could be extracted. For this example, I've kept things as simple as possible. The key metrics I decided to report on were servers, queues, jobs and process information for the LSF scheduler daemons. Refer to the following table for details:

Metric(s) | Command
LSF scheduler performance metrics | badmin perfmon view -json
LSF available servers, CPUs, cores, slots | badmin showstatus
LSF servers by status (total number Ok, closed, unreachable, unavailable) | badmin showstatus
LSF job statistics (total number running, suspended, pending) | badmin showstatus
LSF queue statistics (per queue, total number of jobs running, suspended, pending) | bqueues -json -o "queue_name:12 njobs pend run susp rsv ususp ssusp"
LSF mbatchd process metrics | (Telegraf - inputs.procstat)
LSF mbschd process metrics | (Telegraf - inputs.procstat)
LSF management lim process metrics | (Telegraf - inputs.procstat)

Using the above set of metrics, the next step was to create an example plugin script for Telegraf to capture the output from the commands noted above and emit it in the required line protocol format for logging to InfluxDB. It should be noted that bqueues and badmin perfmon view support output in JSON format when the appropriate flags are specified. However, badmin showstatus does not support JSON output, so for badmin showstatus it was necessary to scrape data assuming hard-coded field positions in the output.

NOTE: The example Telegraf plugin script for Spectrum LSF is provided below. This is just an example and is provided "as is", for testing purposes.

Example lsf_telegraf_agent.py script
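
For orientation, here is a heavily simplified sketch of what such a script could look like, limited to per-queue statistics. The measurement name lsf_queues, the tag and field names, and the JSON keys it expects from bqueues -json output (RECORDS, QUEUE_NAME, NJOBS, PEND, RUN, SUSP) are assumptions and should be verified against your LSF release:

    #!/usr/bin/env python3
    # Simplified sketch of an LSF collector for Telegraf's inputs.exec plugin.
    # It shells out to LSF commands and prints InfluxDB line protocol on stdout.
    import json
    import subprocess
    import time

    def run(cmd):
        """Run an LSF command and return its stdout as text."""
        return subprocess.run(cmd, shell=True, check=True,
                              capture_output=True, text=True).stdout

    def queue_metrics(ts):
        """Print one line protocol record per queue from bqueues JSON output."""
        out = run('bqueues -json -o "queue_name njobs pend run susp"')
        # The top-level RECORDS key and upper-case field names are assumptions
        # about the JSON layout; check them against your LSF version.
        for rec in json.loads(out).get("RECORDS", []):
            fields = ",".join(f"{k.lower()}={int(rec[k])}i"
                              for k in ("NJOBS", "PEND", "RUN", "SUSP"))
            # The queue name is used as a tag; escaping of special characters
            # in tag values is omitted for brevity.
            print(f'lsf_queues,queue={rec["QUEUE_NAME"]} {fields} {ts}')

    if __name__ == "__main__":
        # A single nanosecond timestamp so all records from one scrape align.
        now = int(time.time() * 1e9)
        queue_metrics(now)
        # badmin showstatus has no JSON output; the full script scrapes its
        # plain-text layout (server and job counts) in a similar fashion.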

For completeness, below are the details of the environment configuration. It should be noted that this simple test environment consists of a single server running IBM Spectrum LSF Suite for HPC and a separate server which runs the InfluxDB instance.

Hostname | Component | Version
kilenc | OS (LSF mgmt server) | CentOS Stream release 8 (ppc64le)
kilenc | Spectrum LSF Suite for HPC | v10.2.0.13
adatbazis | OS (InfluxDB server) | Fedora release 36 (aarch64)
adatbazis | InfluxDB | v1.8.10
kilenc | Telegraf | v1.24.3
kilenc | Grafana | v9.1.6

Configuration

The following steps assume that IBM Spectrum LSF Suite for HPC, InfluxDB and Telegraf have been installed.

  1. Start InfluxDB on the host adatbazis (systemctl start influxdb)
  2. On the LSF management server kilenc, configure Telegraf to connect to the InfluxDB instance on host adatbazis. Edit the configuration /etc/telegraf/telegraf.conf and specify the correct URL in the outputs.influxdb section as follows:

    # # Configuration for sending metrics to InfluxDB
    [[outputs.influxdb]]
    #   ## The full HTTP or UDP URL for your InfluxDB instance.
    #   ##
    #   ## Multiple URLs can be specified for a single cluster, only ONE of the
    #   ## urls will be written to each interval.
    #   # urls = ["unix:///var/run/influxdb.sock"]
    #   # urls = ["udp://127.0.0.1:8089"]
    #   # urls = ["http://127.0.0.1:8086"]
    # Added gsamu Jan 04 2023
    urls = ["http://adatbazis:8086"]
  3. On the LSF management server kilenc, configure Telegraf with the custom plugin script lsf_telegraf_agent.py to collect and log metrics from IBM Spectrum LSF Suite for HPC. Edit the configuration /etc/telegraf/telegraf.conf and specify the correct command path in the inputs.exec section. Additionally, set data_format to "influx". Note that the script lsf_telegraf_agent.py was copied to the directory /etc/telegraf/telegraf.d/scripts with permissions octal 755 and owner set to user telegraf. Note: the user telegraf was automatically created during the installation of Telegraf.
    # ## Gather LSF metrics
    [[inputs.exec]]
      ## Commands array
       commands = [  "/etc/telegraf/telegraf.d/scripts/lsf_telegraf_agent.py" ]
       timeout = "30s"
       interval = "30s"
       data_format = "influx"
     # ## End LSF metrics
  4. Telegraf provides the ability to collect metrics on processes. Here we'll use the Telegraf procstat facility to monitor the LSF mbatchd, mbschd and management lim processes; mbatchd and mbschd are the key daemons involved in handling query requests and making scheduling decisions for jobs in the environment. Edit the configuration /etc/telegraf/telegraf.conf and configure the following three inputs.procstat sections.
    # ## Monitor CPU and memory utilization for LSF processes
    # ## mbatchd, mbschd, lim (manager)
    [[inputs.procstat]]
    exe = "lim"
    pattern = "lim"
    pid_finder = "pgrep"
    
    [[inputs.procstat]]
    exe = "mbschd"
    pattern = "mbschd"
    pid_finder = "pgrep"
    
    [[inputs.procstat]]
    exe = "mbatchd"
    pattern = "mbatchd"
    pid_finder = "pgrep"
  5. With the configuration of Telegraf complete, it's now time to test whether the configuration and the custom LSF agent are functioning as expected. Note that the following operation is performed on the LSF management server kilenc and assumes that the LSF daemons are up and running. This is achieved by running the command shown below. Note: any errors in the configuration file /etc/telegraf/telegraf.conf will result in errors in the output.
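     The invocation looks like this; the exact output varies, but it should include line protocol records produced by the inputs.exec and inputs.procstat sections:
     [root@kilenc telegraf]# telegraf --config /etc/telegraf/telegraf.conf --test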
  6. Assuming there were no errors in the previous step, proceed to start the Telegraf service via systemd.
    [root@kilenc telegraf]# systemctl start telegraf
    [root@kilenc telegraf]# systemctl status telegraf
    ● telegraf.service - Telegraf
       Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
       Active: active (running) since Thu 2023-01-19 14:13:51 EST; 1 day 1h ago
         Docs: https://github.com/influxdata/telegraf
     Main PID: 3225959 (telegraf)
        Tasks: 35 (limit: 190169)
       Memory: 192.6M
       CGroup: /system.slice/telegraf.service
               └─3225959 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/tele>
    
    Jan 19 14:13:51 kilenc systemd[1]: Starting Telegraf...
    Jan 19 14:13:51 kilenc systemd[1]: Started Telegraf.
  7. On the host running the database instance, adatbazis, perform queries to check that the telegraf database exists and that LSF-related data is being logged; one way to do this is with the influx CLI, as sketched below.
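     A minimal sketch of such a check, assuming the InfluxDB 1.x influx CLI on adatbazis and the hypothetical lsf_queues measurement name from the plugin sketch above:
     $ influx -execute 'SHOW DATABASES'
     $ influx -database telegraf -execute 'SHOW MEASUREMENTS'
     $ influx -database telegraf -execute 'SELECT * FROM lsf_queues ORDER BY time DESC LIMIT 5'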
  8. With Telegraf successfully logging data to the InfluxDB instance, it is now possible to create a data source in Grafana in order to build a dashboard containing LSF metrics. As noted at the outset, this article is not meant to be an extensive guide to creating dashboards in Grafana. In the Grafana navigation, select Configuration > Data sources.
  9. Select the Add data source button, followed by InfluxDB, which is listed under Time series databases. On the settings page, specify the following values:
     Variable   | Value
     URL        | http://adatbazis:8086
     Database   | telegraf
     Basic Auth | (enable)
     User       | <influxdb_username>
     Password   | <influxdb_password>
     Next, click on Save & test. If all variables and settings were properly specified, the message "datasource is working. 17 measurements found" is displayed.
  10. With the data source configured in Grafana, the final step is to create a dashboard. Creating a dashboard involves creating panels which display data pulled from the configured data source using targeted queries. With some effort, I was able to piece together the following dashboard, which includes both metrics from LSF as well as metrics from the Telegraf inputs.procstat plugin for the LSF processes mbatchd, mbschd and the management lim.
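      As one example, assuming the hypothetical lsf_queues measurement from the plugin sketch earlier, a panel graphing running jobs per queue could use a raw InfluxQL query along these lines:
      SELECT mean("run") FROM "lsf_queues" WHERE $timeFilter GROUP BY time($__interval), "queue" fill(null)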

As you can see, with a short plugin script to collect information from LSF, it’s possible to monitor your LSF cluster using the TIG stack. If you have any questions, please feel free to reach out or comment. 
