Logs are indispensable in IT when things go wrong. Having worked in technical support for software products in a past life, I’ve likely looked at hundreds (or more) of logs over the years to help identify issues. So I really appreciate the importance of logs, but I can honestly say that I never gave much thought to a logging strategy for the systems on my home network, primarily those running Linux.
One of my longtime friends, Peter Czanik, who also works in IT, happens to be a logging guru as well as an IBM Champion for Power Systems (yeah!). So it’s only natural that we got to talking about logging. He often complains that even at IT security conferences, people are unaware of the importance of central logging. So, why is it so important? For security it’s obvious: logs are stored independently of the compromised system, so the attacker cannot modify or delete them. But central logging benefits the HPC operator as well. First, there is availability: you can read the logs even if one of your nodes becomes unreachable. Instead of trying to breathe life into the failed node, you can just look at the logs and see a broken hard drive or a similarly fatal problem. Second, there is convenience, as all logs are available in a single location. Logging into each node of a 3-node cluster to check locally saved logs is inconvenient but doable. On a 10-node cluster it takes a long time. On a 100-node cluster it takes a couple of working days. If your logs are collected in a central location, it may take just a single grep command, or a search in Kibana or a similar web interface.
I’ve lately been tinkering with LSF on a Turing Pi V1 system. For me, the Turing Pi has always been a cluster in a box. My Turing Pi is fully populated with 7 compute modules. I’ve designated Node 1 as the NFS server and LSF manager for the cluster. Naturally I turned to Peter for his guidance on this, and the result is this blog. Peter recommended that I use syslog-ng for log aggregation and also helped me through some of my first steps with syslog-ng. The goal was to aggregate both the system (syslog) logs and the LSF logs on Node 1. TL;DR: it was easy to get it all working. But I encourage you to read on to better understand the nuances and the configuration that both syslog-ng and LSF needed.
The environment
The following software has been deployed on the Turing Pi:
- Raspberry Pi OS (2023-02-21-raspios-bullseye-arm64-lite.img)
- syslog-ng 3 – (3.28.1 as supplied with Raspberry Pi OS)
- IBM LSF Standard Edition V10.1.0.13
The Turing Pi system is configured as follows:
Node 1 (turingpi) is the manager node of this cluster in a box and has by far the most storage. Naturally we want to use that as the centralized logging server.
| Node | Hostname | Hardware | Notes |
| 1 | turingpi | CM3+ | LSF manager, NFS server, 128GB SDcard |
| 2 | kemeny | CM3 | 4GB eMMC flash |
| 3 | neumann | CM3+ | 8GB SDcard |
| 4 | szilard | CM3+ | 8GB SDcard |
| 5 | teller | CM3+ | 8GB SDcard |
| 6 | vonkarman | CM3+ | 8GB SDcard |
| 7 | wigner | CM3+ | 8GB SDcard |
Syslog-ng & LSF setup
1. Raspberry Pi OS configures rsyslog out of the box. The first step is to install syslog-ng on Node 1 in the environment. Note that installing syslog-ng automatically removes the rsyslog package from the node.
Truncated output of apt update; apt-get install syslog-ng ->
root@turingpi:~# apt update; apt-get install syslog-ng -y
Hit:1 http://security.debian.org/debian-security bullseye-security InRelease
Hit:2 http://deb.debian.org/debian bullseye InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Hit:4 https://repos.influxdata.com/debian stable InRelease
Hit:5 https://repos.influxdata.com/debian bullseye InRelease
Hit:6 http://archive.raspberrypi.org/debian bullseye InRelease
Hit:7 https://packagecloud.io/ookla/speedtest-cli/debian bullseye InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
....
....
Running kernel seems to be up-to-date.
Failed to check for processor microcode upgrades.
No services need to be restarted.
No containers need to be restarted.
No user sessions are running outdated binaries.
2. With syslog-ng installed, it’s now time to build the configuration for it. A new configuration file fromnet.conf is shown below, in which a syslog-ng destination is created which will aggregate logs from the Turing Pi nodes in /var/log/fromnet in plain text format. Additionally, the logs will be written in JSON format to the file /var/log/fromnet.json.
root@turingpi:~# cat /etc/syslog-ng/fromnet.conf
# source
source s_fromnet {
    syslog(port(601));
};
# destination
destination d_fromnet {
    file("/var/log/fromnet");
    file("/var/log/fromnet.json" template("$(format-json --scope rfc5424 --scope dot-nv-pairs
        --rekey .* --shift 1 --scope nv-pairs)\n") );
};
# log path
log {
    source(s_fromnet);
    destination(d_fromnet);
};
3. Unless we only want to see source IP addresses in the collected logs, it’s necessary to update the syslog-ng configuration file /etc/syslog-ng/syslog-ng.conf to record the hostnames from which the log messages have originated. This is done by adding the keep_hostname(yes) parameter to the options section as follows:
....
....
# First, set some global options.
options { chain_hostnames(off); flush_lines(0); use_dns(no); use_fqdn(no);
keep_hostname(yes);dns_cache(no); owner("root"); group("adm"); perm(0640);
stats_freq(0); bad_hostname("^gconfd$");
};
....
....
4. Next, the IBM LSF configuration is updated to prevent the creation of local logfiles for the LSF daemons. This is done by commenting out the LSF_LOGDIR option in the configuration file $LSF_ENVDIR/lsf.conf; with LSF_LOGDIR undefined, the LSF daemons log through syslog instead. At the same time, we also set LSF_LOG_MASK=LOG_DEBUG to enable verbose logging for the LSF daemons for testing purposes.
....
....
# Daemon log messages
# LSF_LOGDIR=/opt/ibm/lsf/log
LSF_LOG_MASK=LOG_DEBUG
....
....
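The two lsf.conf changes in step 4 could also be scripted. The following is only a sketch: patch_lsf_conf is a hypothetical helper name, and it assumes lsf.conf uses plain KEY=value lines as in the excerpt above. It is worth trying on a copy of the file first.

```shell
#!/bin/sh
# patch_lsf_conf FILE  (hypothetical helper, not part of LSF)
# Comments out an active LSF_LOGDIR line and sets LSF_LOG_MASK=LOG_DEBUG.
patch_lsf_conf() {
    conf="$1"
    tmp="$(mktemp)" || return 1
    # Comment out LSF_LOGDIR and drop any existing LSF_LOG_MASK setting.
    sed -e 's/^[[:space:]]*LSF_LOGDIR=/# LSF_LOGDIR=/' \
        -e '/^[[:space:]]*LSF_LOG_MASK=/d' "$conf" > "$tmp" || return 1
    # Append the verbose log mask used for testing in this post.
    echo 'LSF_LOG_MASK=LOG_DEBUG' >> "$tmp"
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Example (path taken from the restart step in this post):
#   patch_lsf_conf /opt/ibm/lsf/conf/lsf.conf
```

Because LOG_DEBUG is very verbose, the same helper could be adapted later to set a quieter mask once testing is done.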
5. Finally, to make the changes take effect, both syslog-ng and LSF are restarted.
root@turingpi:~# systemctl restart syslog-ng
root@turingpi:~# . /opt/ibm/lsf/conf/profile.lsf
root@turingpi:~# lsf_daemons restart
Stopping the LSF subsystem
Starting the LSF subsystem
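Before a restart like the one in step 5, it can be useful to let syslog-ng parse the configuration without starting the daemon. The sketch below wraps the syslog-ng --syntax-only option in a hypothetical helper named check_syslog_ng_config. Note that the check only covers fromnet.conf if the main configuration file actually includes it, which is an assumption here since the include line is not shown in this post.

```shell
#!/bin/sh
# check_syslog_ng_config [FILE]  (hypothetical helper)
# Returns 0 if syslog-ng accepts the configuration, 2 if syslog-ng is missing,
# and otherwise whatever non-zero status syslog-ng itself reports.
check_syslog_ng_config() {
    cfg="${1:-/etc/syslog-ng/syslog-ng.conf}"
    if ! command -v syslog-ng >/dev/null 2>&1; then
        echo "syslog-ng not found in PATH" >&2
        return 2
    fi
    # --syntax-only parses the configuration and exits without starting the daemon.
    syslog-ng --syntax-only --cfgfile="$cfg"
}

# Example: restart only when the configuration parses cleanly.
#   check_syslog_ng_config && systemctl restart syslog-ng
```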
6. With the configuration ready on the centralized logging server, host turingpi, we now turn our attention to Nodes 2-7 in the cluster. Here we’ll use the parallel-ssh tool to streamline some operations. We start with the installation of syslog-ng across Nodes 2-7. Note that the output of the installation of syslog-ng across the compute nodes has been truncated.
Truncated output of parallel-ssh -h /opt/workers -i "apt-get install syslog-ng -y" ->
root@turingpi:~# parallel-ssh -h /opt/workers -i "apt-get install syslog-ng -y"
[1] 13:57:07 [SUCCESS] kemeny
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
libbson-1.0-0 libdbi1 libesmtp6 libhiredis0.14 libivykis0 libmaxminddb0
libmongoc-1.0-0 libmongocrypt0 libnet1 libprotobuf-c1 librabbitmq4
librdkafka1 libriemann-client0 libsensors-config libsensors5 libsnappy1v5
libsnmp-base libsnmp40 syslog-ng-core syslog-ng-mod-add-contextual-data
syslog-ng-mod-amqp syslog-ng-mod-examples syslog-ng-mod-extra
syslog-ng-mod-geoip2 syslog-ng-mod-getent syslog-ng-mod-graphite
syslog-ng-mod-http syslog-ng-mod-map-value-pairs syslog-ng-mod-mongodb
syslog-ng-mod-python syslog-ng-mod-rdkafka syslog-ng-mod-redis
syslog-ng-mod-riemann syslog-ng-mod-slog syslog-ng-mod-smtp
syslog-ng-mod-snmp syslog-ng-mod-sql syslog-ng-mod-stardate
syslog-ng-mod-stomp syslog-ng-mod-xml-parser
Suggested packages:
mmdb-bin lm-sensors snmp-mibs-downloader rabbitmq-server graphite-web
mongodb-server libdbd-mysql libdbd-pgsql libdbd-sqlite3 activemq
The following packages will be REMOVED:
rsyslog
....
....
Setting up syslog-ng (3.28.1-2+deb11u1) ...
Processing triggers for man-db (2.9.4-2) ...
Processing triggers for libc-bin (2.31-13+rpt2+rpi1+deb11u8) ...
Stderr: debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
....
....
7. Following the installation of syslog-ng across Nodes 2-7, we verify that it was successful by checking the syslog-ng service status.
Truncated output of parallel-ssh -h /opt/workers -i "systemctl status syslog-ng" ->
root@turingpi:~# parallel-ssh -h /opt/workers -i "systemctl status syslog-ng"
[1] 14:03:46 [SUCCESS] kemeny
● syslog-ng.service - System Logger Daemon
Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-03-28 13:57:01 EDT; 6min ago
Docs: man:syslog-ng(8)
Main PID: 28694 (syslog-ng)
Tasks: 2 (limit: 779)
CPU: 40.228s
CGroup: /system.slice/syslog-ng.service
└─28694 /usr/sbin/syslog-ng -F
Mar 28 13:57:00 kemeny systemd[1]: Starting System Logger Daemon...
Mar 28 13:57:01 kemeny syslog-ng[28694]: DIGEST-MD5 common mech free
Mar 28 13:57:01 kemeny systemd[1]: Started System Logger Daemon.
....
....
8. Create the configuration file send.conf in /opt on host turingpi. Note that /opt is an NFS export on turingpi and is NFS mounted by all of the compute nodes. This file sets the HOST field of outgoing log messages to the local hostname. The file initially contains the string “placeholder”, which is replaced with the local hostname by a sed operation in a subsequent step. Additionally, a data source s_hpc is defined, which scans /opt/ibm/lsf/log for LSF daemon logfiles.
root@turingpi:/# cat /opt/send.conf
rewrite r_host { set("placeholder", value("HOST")); };
destination d_net {
    syslog("turingpi" port(601));
};
source s_hpc {
    wildcard-file(
        base-dir("/opt/ibm/lsf/log")
        filename-pattern("*.log.*")
        recursive(no)
        follow-freq(1)
    );
};
log {
    source(s_src);
    source(s_hpc);
    rewrite(r_host);
    destination(d_net);
};
9. On Nodes 2-7, copy the file /opt/send.conf to /etc/syslog-ng/conf.d/send.conf.
root@turingpi:/# parallel-ssh -h /opt/workers -i "cp /opt/send.conf /etc/syslog-ng/conf.d"
[1] 14:19:29 [SUCCESS] kemeny
[2] 14:19:30 [SUCCESS] vonkarman
[3] 14:19:30 [SUCCESS] wigner
[4] 14:19:30 [SUCCESS] szilard
[5] 14:19:30 [SUCCESS] teller
[6] 14:19:31 [SUCCESS] neumann
10. Using sed, replace the “placeholder” string in /etc/syslog-ng/conf.d/send.conf with the local hostname, then double-check that the change was made correctly.
root@turingpi:/# parallel-ssh -h /opt/workers -i 'HOST=`hostname`; sed -i "s/placeholder/$HOST/g" /etc/syslog-ng/conf.d/send.conf'
[1] 14:38:09 [SUCCESS] kemeny
[2] 14:38:09 [SUCCESS] teller
[3] 14:38:09 [SUCCESS] vonkarman
[4] 14:38:09 [SUCCESS] wigner
[5] 14:38:09 [SUCCESS] neumann
[6] 14:38:09 [SUCCESS] szilard
Output of parallel-ssh -h /opt/workers -i "cat /etc/syslog-ng/conf.d/send.conf" ->
root@turingpi:/# parallel-ssh -h /opt/workers -i "cat /etc/syslog-ng/conf.d/send.conf"
[1] 14:38:33 [SUCCESS] kemeny
rewrite r_host { set("kemeny", value("HOST")); };
destination d_net {
    syslog("turingpi" port(601));
};
source s_hpc {
    wildcard-file(
        base-dir("/opt/ibm/lsf/log")
        filename-pattern("*.log.*")
        recursive(no)
        follow-freq(1)
    );
};
log {
    source(s_src);
    source(s_hpc);
    rewrite(r_host);
    destination(d_net);
};
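The copy and sed steps can also be folded into a single rendering step that leaves the shared template untouched. This is a sketch under assumptions: render_send_conf is a hypothetical helper, and it assumes hostnames contain only letters, digits, dots and hyphens, so they are safe to splice into the sed expression.

```shell
#!/bin/sh
# render_send_conf TEMPLATE HOSTNAME  (hypothetical helper)
# Prints TEMPLATE with every "placeholder" replaced by HOSTNAME.
render_send_conf() {
    template="$1"
    host="$2"
    # Refuse empty names and anything that could break the sed expression.
    case "$host" in
        ''|*[!A-Za-z0-9.-]*)
            echo "unexpected hostname: $host" >&2
            return 1
            ;;
    esac
    sed "s/placeholder/$host/g" "$template"
}

# Example, run on each compute node:
#   render_send_conf /opt/send.conf "$(hostname)" > /etc/syslog-ng/conf.d/send.conf
```

Since the template in /opt keeps its placeholder, the rendering can be repeated at any time, for example after the template changes.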
11. Finally, syslog-ng is restarted on Nodes 2-7 and the status of the service is checked to ensure that there are no errors.
root@turingpi:/opt# parallel-ssh -h /opt/workers -i "systemctl restart syslog-ng"
[1] 14:49:03 [SUCCESS] kemeny
[2] 14:49:05 [SUCCESS] szilard
[3] 14:49:06 [SUCCESS] vonkarman
[4] 14:49:06 [SUCCESS] neumann
[5] 14:49:06 [SUCCESS] teller
[6] 14:49:07 [SUCCESS] wigner
Truncated output of parallel-ssh -h /opt/workers -i "systemctl status syslog-ng" ->
root@turingpi:/opt# parallel-ssh -h /opt/workers -i "systemctl status syslog-ng"
[1] 14:49:31 [SUCCESS] kemeny
● syslog-ng.service - System Logger Daemon
Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-03-28 14:49:03 EDT; 28s ago
Docs: man:syslog-ng(8)
Main PID: 34982 (syslog-ng)
Tasks: 2 (limit: 779)
CPU: 398ms
CGroup: /system.slice/syslog-ng.service
└─34982 /usr/sbin/syslog-ng -F
Mar 28 14:49:02 kemeny systemd[1]: Starting System Logger Daemon...
Mar 28 14:49:02 kemeny syslog-ng[34982]: DIGEST-MD5 common mech free
Mar 28 14:49:03 kemeny systemd[1]: Started System Logger Daemon.
Does it work?
The answer to this question is an emphatic YES!
Let’s begin with a simple test running the logger command on all of the compute nodes, while monitoring /var/log/fromnet on host turingpi.
root@turingpi:/home/lsfadmin# date; parallel-ssh -h /opt/workers -i 'HOST=`hostname`; logger This is a test from node $HOST. Do not panic!'
Wed 3 Apr 21:41:45 EDT 2024
[1] 21:41:46 [SUCCESS] teller
[2] 21:41:46 [SUCCESS] neumann
[3] 21:41:46 [SUCCESS] wigner
[4] 21:41:46 [SUCCESS] kemeny
[5] 21:41:46 [SUCCESS] szilard
[6] 21:41:46 [SUCCESS] vonkarman
root@turingpi:/var/log# tail -f fromnet |grep panic
Apr 3 21:41:46 szilard root[10918]: This is a test from node szilard. Do not panic!
Apr 3 21:41:46 wigner root[11011]: This is a test from node wigner. Do not panic!
Apr 3 21:41:46 neumann root[11121]: This is a test from node neumann. Do not panic!
Apr 3 21:41:46 kemeny root[11029]: This is a test from node kemeny. Do not panic!
Apr 3 21:41:46 teller root[10875]: This is a test from node teller. Do not panic!
Apr 3 21:41:46 vonkarman root[10805]: This is a test from node vonkarman. Do not panic!
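With six nodes it is easy to eyeball the result, but the same check can be automated. The sketch below uses a hypothetical helper, missing_hosts, and assumes that the workers file lists one plain hostname per line and that the central log uses the classic syslog line layout shown above, where the hostname is the fourth whitespace-separated field.

```shell
#!/bin/sh
# missing_hosts WORKERS_FILE LOG_FILE MARKER  (hypothetical helper)
# Prints every host from WORKERS_FILE that has no line containing MARKER in LOG_FILE.
missing_hosts() {
    workers="$1"
    log="$2"
    marker="$3"
    while IFS= read -r host; do
        # Skip blank lines and comment lines in the workers file.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # Field 4 is the hostname; index() does a literal substring match.
        if ! awk -v h="$host" -v m="$marker" \
            '$4 == h && index($0, m) { found = 1 } END { exit !found }' "$log"; then
            echo "$host"
        fi
    done < "$workers"
}

# Example: list the nodes whose test message did not arrive.
#   missing_hosts /opt/workers /var/log/fromnet 'Do not panic!'
```

An empty result means every node in the workers file delivered its test message.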
Next, let’s look at whether the LSF logging is also captured. Here we simply restart the LSF daemons on Nodes 2-7 and monitor the /var/log/fromnet file. The output can be viewed below.
Truncated output of tail -f /var/log/fromnet ->
root@turingpi:/var/log# tail -f fromnet
Apr 3 21:44:59 kemeny res[691]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 kemeny lim[688]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 kemeny sbatchd[693]: Daemon on host <kemeny> received signal <15>; exiting
Apr 3 21:44:59 kemeny lsf_daemons[11434]: Stopping the LSF subsystem
Apr 3 21:44:59 kemeny systemd[1]: lsfd.service: Succeeded.
Apr 3 21:44:59 kemeny systemd[1]: lsfd.service: Consumed 11min 56.744s CPU time.
Apr 3 21:44:59 szilard lim[685]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 szilard res[687]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 szilard sbatchd[689]: Daemon on host <szilard> received signal <15>; exiting
Apr 3 21:44:59 vonkarman lim[686]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 vonkarman sbatchd[690]: Daemon on host <vonkarman> received signal <15>; exiting
Apr 3 21:44:59 vonkarman res[688]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 teller lim[683]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 teller res[689]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 teller sbatchd[691]: Daemon on host <teller> received signal <15>; exiting
Apr 3 21:44:59 teller lsf_daemons[11294]: Stopping the LSF subsystem
Apr 3 21:44:59 wigner lim[719]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 wigner res[722]: term_handler: Received signal 15, exiting
Apr 3 21:44:59 wigner sbatchd[724]: Daemon on host <wigner> received signal <15>; exiting
Apr 3 21:44:59 wigner lsf_daemons[11438]: Stopping the LSF subsystem
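Once everything lands in one file, a short awk summary gives a quick overview of which daemons are logging on which nodes. This is a sketch only: count_by_host_program is a hypothetical helper, and it assumes the line layout shown above, with the hostname in field 4 and the program name (optionally followed by a PID in brackets) in field 5.

```shell
#!/bin/sh
# count_by_host_program LOG_FILE  (hypothetical helper)
# Prints "COUNT HOST PROGRAM" lines, sorted by host and then by program.
count_by_host_program() {
    awk '{
        prog = $5
        sub(/\[.*/, "", prog)   # strip a "[pid]:" suffix
        sub(/:$/, "", prog)     # strip a trailing colon when there is no pid
        count[$4 " " prog]++
    }
    END {
        for (key in count) print count[key], key
    }' "$1" | sort -k2,2 -k3,3
}

# Example:
#   count_by_host_program /var/log/fromnet
```

A node that suddenly logs far more lim or sbatchd messages than its neighbours stands out quickly in such a summary.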
Conclusion
What started out as a chat about logging grew into the idea for a blog post, and I am thankful to Peter for his collaboration. We’ve illustrated here how to set up centralized logging on a Turing Pi system with syslog-ng to collect system and LSF logs.
Of course, collecting log messages centrally is just the start of the journey. It is an important step, as it allows for significantly easier debugging and troubleshooting. You can store logs in databases for easier searching. And once you better understand which log messages are important, you can even parse those and generate alerts or dashboards from them. All of this helps you make sure that your HPC system runs smoothly and with minimal downtime.