Centralized system and LSF logging on a Turing Pi system

By Gábor Samu posted Fri April 05, 2024 02:12 PM


Logs are one of those indispensable things in IT when things go wrong. Having worked in technical support for software products in a past life, I’ve likely looked at hundreds (or more) logs over the years, helping to identify issues. So, I really appreciate the importance of logs, but I can honestly say that I never really thought about a logging strategy for the systems on my home network - primarily those running Linux.

One of my longtime friends, Peter Czanik, who also works in IT, happens to be a logging guru as well as an IBM Champion for Power Systems (yeah!). So it’s only natural that we got to talking about logging. He often complains that even at IT security conferences people are unaware of the importance of central logging. So, why is it so important? For security it’s obvious: logs are stored independently of the compromised system, so they cannot be modified or deleted by an attacker. But central logging is beneficial for the HPC operator as well. First, there is availability: you can read the logs even if one of your nodes becomes unreachable. Instead of trying to breathe life into the failed node, you can just take a look at the logs and see a broken hard drive, or a similarly fatal problem. Second, there is convenience, as all logs are available in a single location. Logging into each node of a 3-node cluster to check locally saved logs is inconvenient but doable. On a 10-node cluster it takes a long time. On a 100-node cluster it could take a couple of working days. If your logs are collected in a central location, the same task might be a single grep command, or a search in Kibana or a similar web interface.
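
To make the convenience point concrete, here is a minimal sketch contrasting the two approaches, using the worker hostnames and log paths from the cluster described below (the error string is only an example):

# Without central logging: query every node in turn (slow, and it fails if a node is down)
for host in kemeny neumann szilard teller vonkarman wigner; do
  ssh "$host" "grep -i 'I/O error' /var/log/syslog"
done

# With central logging: one command on the log server
grep -i 'I/O error' /var/log/fromnet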

I've lately been tinkering with LSF on a Turing Pi V1 system. For me, the Turing Pi has always been a cluster in a box. My Turing Pi is fully populated with 7 compute modules, and I’ve designated Node 1 as the NFS server and LSF manager for the cluster. Naturally I turned to Peter for guidance on this, and the result is this blog. Peter recommended syslog-ng for log aggregation and also helped me through my first steps with it. The goal was to aggregate both the system (syslog) and LSF logs on Node 1. TL;DR: it was easy to get it all working. But I encourage you to read on to better understand the nuances and the configuration needed for both syslog-ng and LSF.

The environment

The following software has been deployed on the Turing Pi:

  • Raspberry Pi OS (2023-02-21-raspios-bullseye-arm64-lite.img)
  • syslog-ng 3.28.1 (as supplied with Raspberry Pi OS)
  • IBM LSF Standard Edition V10.1.0.13

The Turing Pi system is configured as follows:

Node 1 (turingpi) is the manager node of this cluster in a box and has by far the most storage. Naturally we want to use that as the centralized logging server. 

Node  Hostname   Hardware  Notes
1     turingpi   CM3+      LSF manager, NFS server, 128GB SD card
2     kemeny     CM3       4GB eMMC flash
3     neumann    CM3+      8GB SD card
4     szilard    CM3+      8GB SD card
5     teller     CM3+      8GB SD card
6     vonkarman  CM3+      8GB SD card
7     wigner     CM3+      8GB SD card

Syslog-ng & LSF setup

1. Raspberry Pi OS configures rsyslog out of the box. The first step is to install syslog-ng on Node 1 in the environment. Note that installing syslog-ng automatically removes rsyslog from the node, as the two packages conflict.

Truncated output of apt update; apt-get install syslog-ng -> 

root@turingpi:~# apt update; apt-get install syslog-ng -y 
Hit:1 http://security.debian.org/debian-security bullseye-security InRelease
Hit:2 http://deb.debian.org/debian bullseye InRelease                                                        
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease                                                
Hit:4 https://repos.influxdata.com/debian stable InRelease                                                   
Hit:5 https://repos.influxdata.com/debian bullseye InRelease                                                 
Hit:6 http://archive.raspberrypi.org/debian bullseye InRelease                                  
Hit:7 https://packagecloud.io/ookla/speedtest-cli/debian bullseye InRelease                     
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
....
....
Running kernel seems to be up-to-date.
Failed to check for processor microcode upgrades.
No services need to be restarted.
No containers need to be restarted.
No user sessions are running outdated binaries.

2. With syslog-ng installed, it’s now time to build its configuration. A new configuration file, fromnet.conf, is shown below. It creates a syslog-ng destination that aggregates logs from the Turing Pi nodes in /var/log/fromnet in plain text format; additionally, the logs are written in JSON format to /var/log/fromnet.json.

root@turingpi:~# cat /etc/syslog-ng/fromnet.conf 
# source
source s_fromnet {
  syslog(port(601));
};
# destination 
destination d_fromnet {
  file("/var/log/fromnet");
  file("/var/log/fromnet.json" template("$(format-json --scope rfc5424 --scope dot-nv-pairs
        --rekey .* --shift 1 --scope nv-pairs)\n") );
};
# log path
log {
  source(s_fromnet);
  destination(d_fromnet);
}; 
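
One caveat worth noting: on Debian-based systems, the stock syslog-ng.conf typically only includes snippets from /etc/syslog-ng/conf.d/, so a file placed directly in /etc/syslog-ng/ may need an explicit @include line (an assumption based on the default Debian layout). It also never hurts to validate the syntax before restarting:

# Add to /etc/syslog-ng/syslog-ng.conf if fromnet.conf is not under conf.d/ (assumption, see above):
# @include "/etc/syslog-ng/fromnet.conf"
root@turingpi:~# syslog-ng --syntax-only && echo "configuration OK"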

3. By default, the collected logs would identify senders only by their source IP addresses. To record the hostnames from which the log messages originated, update the syslog-ng configuration file /etc/syslog-ng/syslog-ng.conf by adding the keep_hostname(yes) parameter (which keeps the hostname carried in the message instead of resolving the sender's address) to the options section as follows:

....
....
# First, set some global options. 
options { chain_hostnames(off); flush_lines(0); use_dns(no); use_fqdn(no);
        keep_hostname(yes); dns_cache(no); owner("root"); group("adm"); perm(0640);
        stats_freq(0); bad_hostname("^gconfd$");
};
....
....

4. Next, the IBM LSF configuration is updated to prevent the creation of local logfiles for the LSF daemons. This is done by commenting out the LSF_LOGDIR option in the configuration file $LSF_ENVDIR/lsf.conf; with LSF_LOGDIR undefined, the LSF daemons send their log messages to syslog instead. At the same time, we also set LSF_LOG_MASK=LOG_DEBUG for testing purposes to enable verbose logging for the LSF daemons.

....
....
# Daemon log messages
# LSF_LOGDIR=/opt/ibm/lsf/log
LSF_LOG_MASK=LOG_DEBUG
....
....
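
Since $LSF_ENVDIR lives on the shared /opt NFS export in this setup, a single edit covers the whole cluster. A minimal non-interactive sketch of the same change, assuming the lsf.conf path from this article's install prefix and that an LSF_LOG_MASK line is already present in the file:

root@turingpi:~# sed -i -e 's|^LSF_LOGDIR=|# LSF_LOGDIR=|' \
    -e 's|^LSF_LOG_MASK=.*|LSF_LOG_MASK=LOG_DEBUG|' /opt/ibm/lsf/conf/lsf.conf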

5. Finally, to make the changes take effect, both syslog-ng and LSF are restarted.

root@turingpi:~# systemctl restart syslog-ng 
root@turingpi:~# . /opt/ibm/lsf/conf/profile.lsf  
root@turingpi:~# lsf_daemons restart 
Stopping the LSF subsystem 
Starting the LSF subsystem

6. With the configuration ready on the centralized logging server, host turingpi, we now turn our attention to Nodes 2-7 in the cluster. Here we’ll use the parallel-ssh tool to streamline operations, starting with the installation of syslog-ng across Nodes 2-7.
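
For reference, the host file consumed by parallel-ssh is simply a list of the worker hostnames, one per line; its assumed contents here follow the node table above:

root@turingpi:~# cat /opt/workers
kemeny
neumann
szilard
teller
vonkarman
wigner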

Truncated output of parallel-ssh -h /opt/workers -i “apt-get install syslog-ng -y” -> 

root@turingpi:~# parallel-ssh -h /opt/workers -i "apt-get install syslog-ng -y" 
[1] 13:57:07 [SUCCESS] kemeny
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libbson-1.0-0 libdbi1 libesmtp6 libhiredis0.14 libivykis0 libmaxminddb0
  libmongoc-1.0-0 libmongocrypt0 libnet1 libprotobuf-c1 librabbitmq4
  librdkafka1 libriemann-client0 libsensors-config libsensors5 libsnappy1v5
  libsnmp-base libsnmp40 syslog-ng-core syslog-ng-mod-add-contextual-data
  syslog-ng-mod-amqp syslog-ng-mod-examples syslog-ng-mod-extra
  syslog-ng-mod-geoip2 syslog-ng-mod-getent syslog-ng-mod-graphite
  syslog-ng-mod-http syslog-ng-mod-map-value-pairs syslog-ng-mod-mongodb
  syslog-ng-mod-python syslog-ng-mod-rdkafka syslog-ng-mod-redis
  syslog-ng-mod-riemann syslog-ng-mod-slog syslog-ng-mod-smtp
  syslog-ng-mod-snmp syslog-ng-mod-sql syslog-ng-mod-stardate
  syslog-ng-mod-stomp syslog-ng-mod-xml-parser
Suggested packages:
  mmdb-bin lm-sensors snmp-mibs-downloader rabbitmq-server graphite-web
  mongodb-server libdbd-mysql libdbd-pgsql libdbd-sqlite3 activemq
The following packages will be REMOVED:
  rsyslog
....
....
Setting up syslog-ng (3.28.1-2+deb11u1) ...
Processing triggers for man-db (2.9.4-2) ...
Processing triggers for libc-bin (2.31-13+rpt2+rpi1+deb11u8) ...
Stderr: debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
....
....

7. Following the installation of syslog-ng across Nodes 2-7, we verify that it was successful by checking the syslog-ng service status.

Truncated output of parallel-ssh -h /opt/workers -i “systemctl status syslog-ng” -> 

root@turingpi:~# parallel-ssh -h /opt/workers -i "systemctl status syslog-ng" 
[1] 14:03:46 [SUCCESS] kemeny
 syslog-ng.service - System Logger Daemon
     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-03-28 13:57:01 EDT; 6min ago
       Docs: man:syslog-ng(8)
   Main PID: 28694 (syslog-ng)
      Tasks: 2 (limit: 779)
        CPU: 40.228s
     CGroup: /system.slice/syslog-ng.service
             └─28694 /usr/sbin/syslog-ng -F

Mar 28 13:57:00 kemeny systemd[1]: Starting System Logger Daemon...
Mar 28 13:57:01 kemeny syslog-ng[28694]: DIGEST-MD5 common mech free
Mar 28 13:57:01 kemeny systemd[1]: Started System Logger Daemon.
....
....

8. Create the configuration file send.conf in /opt on host turingpi. Note that /opt is an NFS export on turingpi and is NFS mounted by all of the compute nodes. This file sets the HOST field of outgoing log messages to the local hostname: in the subsequent steps, the string “placeholder” is replaced with each node's hostname using sed. Additionally, a source s_hpc is defined which scans /opt/ibm/lsf/log for LSF daemon logfiles, and the log path also references s_src, the default system log source defined in the stock Debian syslog-ng.conf.

root@turingpi:/# cat /opt/send.conf
rewrite r_host { set("placeholder", value("HOST")); };

destination d_net {
  syslog("turingpi" port(601));
};
source s_hpc {
  wildcard-file(
      base-dir("/opt/ibm/lsf/log")
      filename-pattern("*.log.*")
      recursive(no)
      follow-freq(1)
  );
};
log {
  source(s_src);
  source(s_hpc);
  rewrite(r_host); 
  destination(d_net);
};

9. On Nodes 2-7, copy the file /opt/send.conf to /etc/syslog-ng/conf.d/send.conf.

root@turingpi:/# parallel-ssh -h /opt/workers -i "cp /opt/send.conf /etc/syslog-ng/conf.d" 
[1] 14:19:29 [SUCCESS] kemeny
[2] 14:19:30 [SUCCESS] vonkarman
[3] 14:19:30 [SUCCESS] wigner
[4] 14:19:30 [SUCCESS] szilard
[5] 14:19:30 [SUCCESS] teller
[6] 14:19:31 [SUCCESS] neumann

10. Using sed, replace the “placeholder” string in /etc/syslog-ng/conf.d/send.conf with the local hostname. We then double-check that the change was made correctly.

root@turingpi:/# parallel-ssh -h /opt/workers -i 'HOST=`hostname`; sed -i "s/placeholder/$HOST/g" /etc/syslog-ng/conf.d/send.conf' 
[1] 14:38:09 [SUCCESS] kemeny
[2] 14:38:09 [SUCCESS] teller
[3] 14:38:09 [SUCCESS] vonkarman
[4] 14:38:09 [SUCCESS] wigner
[5] 14:38:09 [SUCCESS] neumann
[6] 14:38:09 [SUCCESS] szilard

Output of parallel-ssh -h /opt/workers -i “cat /etc/syslog-ng/conf.d/send.conf” -> 

root@turingpi:/# parallel-ssh -h /opt/workers -i "cat /etc/syslog-ng/conf.d/send.conf" 
[1] 14:38:33 [SUCCESS] kemeny
rewrite r_host { set("kemeny", value("HOST")); };

destination d_net {
  syslog("turingpi" port(601));
};
source s_hpc {
  wildcard-file(
      base-dir("/opt/ibm/lsf/log")
      filename-pattern("*.log.*")
      recursive(no)
      follow-freq(1)
  );
};
log {
  source(s_src);
  source(s_hpc);
  rewrite(r_host);
  destination(d_net);
};
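
Before restarting the service, you can optionally confirm from each worker that the collector port on turingpi is reachable. A hedged check (the syslog() destination defaults to TCP, and this assumes nc from the netcat package is installed on the nodes):

root@turingpi:/# parallel-ssh -h /opt/workers -i "nc -zv turingpi 601"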

11. Finally, syslog-ng is restarted on Nodes 2-7 and the status of the service is checked to ensure that there are no errors.

root@turingpi:/opt# parallel-ssh -h /opt/workers -i "systemctl restart syslog-ng" 
[1] 14:49:03 [SUCCESS] kemeny
[2] 14:49:05 [SUCCESS] szilard
[3] 14:49:06 [SUCCESS] vonkarman
[4] 14:49:06 [SUCCESS] neumann
[5] 14:49:06 [SUCCESS] teller
[6] 14:49:07 [SUCCESS] wigner

Truncated output of parallel-ssh -h /opt/workers -i “systemctl status syslog-ng” -> 

root@turingpi:/opt# parallel-ssh -h /opt/workers -i "systemctl status syslog-ng" 
[1] 14:49:31 [SUCCESS] kemeny
 syslog-ng.service - System Logger Daemon
     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-03-28 14:49:03 EDT; 28s ago
       Docs: man:syslog-ng(8)
   Main PID: 34982 (syslog-ng)
      Tasks: 2 (limit: 779)
        CPU: 398ms
     CGroup: /system.slice/syslog-ng.service
             └─34982 /usr/sbin/syslog-ng -F

Mar 28 14:49:02 kemeny systemd[1]: Starting System Logger Daemon...
Mar 28 14:49:02 kemeny syslog-ng[34982]: DIGEST-MD5 common mech free
Mar 28 14:49:03 kemeny systemd[1]: Started System Logger Daemon.

Does it work?

The answer to this question is an emphatic YES!

Let’s begin with a simple test running the logger command on all of the compute nodes, while monitoring /var/log/fromnet on host turingpi.

root@turingpi:/home/lsfadmin# date; parallel-ssh -h /opt/workers -i 'HOST=`hostname`; logger This is a test from node $HOST. Do not panic!' 
Wed  3 Apr 21:41:45 EDT 2024 
[1] 21:41:46 [SUCCESS] teller 
[2] 21:41:46 [SUCCESS] neumann 
[3] 21:41:46 [SUCCESS] wigner 
[4] 21:41:46 [SUCCESS] kemeny 
[5] 21:41:46 [SUCCESS] szilard 
[6] 21:41:46 [SUCCESS] vonkarman

root@turingpi:/var/log# tail -f fromnet |grep panic 
Apr  3 21:41:46 szilard root[10918]: This is a test from node szilard. Do not panic! 
Apr  3 21:41:46 wigner root[11011]: This is a test from node wigner. Do not panic! 
Apr  3 21:41:46 neumann root[11121]: This is a test from node neumann. Do not panic! 
Apr  3 21:41:46 kemeny root[11029]: This is a test from node kemeny. Do not panic! 
Apr  3 21:41:46 teller root[10875]: This is a test from node teller. Do not panic! 
Apr  3 21:41:46 vonkarman root[10805]: This is a test from node vonkarman. Do not panic!
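
Since step 2 also writes the aggregated logs as JSON, the same test can be filtered structurally. A small sketch using jq (assuming jq is installed, and field names as produced by syslog-ng's rfc5424 scope):

root@turingpi:/var/log# jq -r 'select(.MESSAGE | test("panic")) | "\(.HOST): \(.MESSAGE)"' fromnet.json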

Next, let’s check whether the LSF logging is also captured. Here we simply restart the LSF daemons on Nodes 2-7 and monitor the /var/log/fromnet file. The output can be viewed below.

Truncated output of tail -f /var/log/fromnet -> 

root@turingpi:/var/log# tail -f fromnet
Apr  3 21:44:59 kemeny res[691]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 kemeny lim[688]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 kemeny sbatchd[693]: Daemon on host <kemeny> received signal <15>; exiting 
Apr  3 21:44:59 kemeny lsf_daemons[11434]: Stopping the LSF subsystem 
Apr  3 21:44:59 kemeny systemd[1]: lsfd.service: Succeeded. 
Apr  3 21:44:59 kemeny systemd[1]: lsfd.service: Consumed 11min 56.744s CPU time. 
Apr  3 21:44:59 szilard lim[685]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 szilard res[687]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 szilard sbatchd[689]: Daemon on host <szilard> received signal <15>; exiting 
Apr  3 21:44:59 vonkarman lim[686]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 vonkarman sbatchd[690]: Daemon on host <vonkarman> received signal <15>; exiting 
Apr  3 21:44:59 vonkarman res[688]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 teller lim[683]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 teller res[689]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 teller sbatchd[691]: Daemon on host <teller> received signal <15>; exiting 
Apr  3 21:44:59 teller lsf_daemons[11294]: Stopping the LSF subsystem 
Apr  3 21:44:59 wigner lim[719]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 wigner res[722]: term_handler: Received signal 15, exiting 
Apr  3 21:44:59 wigner sbatchd[724]: Daemon on host <wigner> received signal <15>; exiting 
Apr  3 21:44:59 wigner lsf_daemons[11438]: Stopping the LSF subsystem 

Conclusion

What started out as a chat about logging grew into the idea for a blog, and I am thankful to Peter for the collaboration. We’ve illustrated an example here of how to set up centralized logging on a Turing Pi system with syslog-ng to collect system and LSF logs.

Of course, collecting log messages centrally is just the start of the journey. It is an important step, as it allows for significantly easier debugging and troubleshooting. You can store logs in databases for easier searching. And once you better understand which log messages are important, you can parse them and generate alerts or dashboards from them. All of this helps you make sure that your HPC system runs smoothly and with minimal downtime.
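
As a pointer for that next step, here is a minimal, untested sketch of forwarding the aggregated logs from turingpi to an Elasticsearch instance using syslog-ng's elasticsearch-http() destination; the URL and index name are placeholders, and s_fromnet is the source defined in step 2:

destination d_elastic {
  elasticsearch-http(
    url("http://elastic.example.com:9200/_bulk")
    index("turingpi-logs")
    type("")
  );
};
log {
  source(s_fromnet);
  destination(d_elastic);
};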


Comments

Tue April 09, 2024 11:45 AM

Thanks for sharing this. I think a similar approach should be recommended for scaling LSF clusters up and down in the cloud through the LSF Resource Connector. To help with monitoring LSF status and troubleshooting issues on auto-provisioned and reclaimed instances, this approach keeps log messages available for later analysis even after an instance is reclaimed.