High Performance Computing Group


 IBM Platform LSF 10.1 multinode firewall issue

Colin Rudakiewicz posted Thu October 17, 2024 03:00 PM
Hi all,

We have MSC Actran running fine across two nodes when firewalld is disabled.

We use the bundled mpiexec.hydra (export I_MPI_HYDRA_BOOTSTRAP=lsf) to
integrate with LSF, so that blaunch is used instead of the default ssh.

I don’t know the exact sequence, but when firewalld is enabled, blaunch starts
hydra_bstrap_proxy on node 2 and nothing is started on node 1. A nios process
also starts on node 2 and listens on two ephemeral (random) TCP ports. res
on node 2 establishes a connection to one of these, but res on node 1 is
unable to get past the SYN of the three-way TCP handshake. strace of the res
process on node 1 shows:

18102 09:55:36 connect(3, {sa_family=AF_INET, sin_port=htons(46100),
sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EHOSTUNREACH (No route to host)

tcpdump on node 2 shows an ICMP "host prohibited" reject generated by firewalld.

The res log on node 1 shows: resRexecPjob: resPjobCallbackNIOS(46100) failed.
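
For anyone wanting to reproduce the symptom without strace, the following is a minimal sketch of a TCP probe that separates a firewall reject (EHOSTUNREACH, as in the strace output above) from a plain closed port (ECONNREFUSED). The function name probe_tcp and the result labels are hypothetical and only for illustration; this is not an LSF tool.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connect and classify the outcome (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No reply at all, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Typically an ICMP unreachable/prohibited reply, e.g. from a firewall.
            return "rejected"
        if exc.errno == errno.ECONNREFUSED:
            # Host reachable, but nothing is listening on that port.
            return "refused"
        return "error: %s" % exc
```

Run from node 1 against the port nios is listening on (for example probe_tcp("node2", 46100), where "node2" is a placeholder hostname), a "rejected" result would be consistent with the EHOSTUNREACH seen by res.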

In our lsf.conf we have explicitly set:
LSF_NIOS_PORT_RANGE=47000-48000

But for whatever reason, nios listens on a random ephemeral TCP port number
outside this range. We configured LSF_NIOS_PORT_RANGE many months ago
because we were experiencing firewall problems with bsub -K, and that
continues to work normally.
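
As a quick sanity check, a small sketch like the one below can confirm whether an observed port falls inside the configured range. It assumes the range value is written as "low-high", as in the lsf.conf line above; the helper names parse_port_range and port_in_range are hypothetical.

```python
def parse_port_range(value):
    """Parse a 'low-high' string such as '47000-48000' into (low, high)."""
    low_text, high_text = value.strip().split("-", 1)
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError("range start is greater than range end: %r" % value)
    return low, high


def port_in_range(port, port_range):
    """Return True if port lies within the inclusive (low, high) range."""
    low, high = port_range
    return low <= port <= high


# Values taken from this thread: the configured range and the port
# reported in the res log and strace output.
configured = parse_port_range("47000-48000")
observed_port = 46100
print(port_in_range(observed_port, configured))
```

With the values from this thread the check prints False, which matches what we see: port 46100 is outside 47000-48000, so a firewall rule that opens only the configured range would not cover it.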

Any ideas please why LSF_NIOS_PORT_RANGE is ignored?

Best Regards - Colin

YI SUN
Colin Rudakiewicz

Hello Yi, the patch details appear to be an identical match to the problem we are seeing. We will test the install.

Many Thanks,