Greetings,
We are running LSF Suite for HPC 10.2.0.6. We observe that resizing the terminal window of an interactive LSF job will "detach" from the job, leaving the job running, but losing the ability to interact with it.
This appears to be similar to (but not the same as) this old LSF 8 bug:
P100757: This fix prevents an interactive parallel job from exiting with the SIGPROF signal when resizing its terminal window.
In that case SIGWINCH was interpreted as SIGPROF, which is not what we see. In that issue the wrong signal was actually delivered to the process, and since the process was not trapping SIGPROF, it exited. In our problem, the "res" process checks whether the job submitter is in the "docker" group. I can't imagine why it cares about that group when the signal arrives: the job is already running, and besides, with an EXEC_DRIVER the job runs as the lsfadmin user, not the submitting user.
First, some details about our configuration:
Begin Application
NAME = docker1
CONTAINER = docker[image($LSB_CONTAINER_IMAGE) options(--rm)]
EXEC_DRIVER = starter[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker1_starter.py] controller[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-control.py] monitor[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-monitor.py]
End Application
We are using our own python job wrapper to build the "docker run" command. It began life as the IBM wrapper but now does more. We hope, but are not confident, that this is not the source of the problem.
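For context, here is a minimal sketch of the kind of thing our wrapper does; the function and option names are illustrative only, not our production script:

```python
# Hypothetical sketch of a starter wrapper that assembles a "docker run"
# command line. Names and options are illustrative, not our real script.
import os
import shlex
import subprocess


def build_docker_run(image, user_cmd, extra_opts=None):
    """Assemble a 'docker run' argument list from the LSF-provided
    container image and the user's command string."""
    cmd = ["docker", "run", "--rm"]
    if extra_opts:
        cmd.extend(extra_opts)          # e.g. bind mounts, env vars
    cmd.append(image)
    cmd.extend(shlex.split(user_cmd))   # the job's actual command
    return cmd


if __name__ == "__main__":
    # LSF exports the image chosen at submit time in $LSB_CONTAINER_IMAGE.
    image = os.environ.get("LSB_CONTAINER_IMAGE", "ubuntu:22.04")
    subprocess.run(build_docker_run(image, "hostname"))
```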
Here is what we observe:
In trying to fix the message:
checkDriverAuthfork: user <mcallawa> is not in docker group
we attempt to use PAM to grant the "docker" group to members of the group we use to indicate access to a "compute enabled" feature: an Active Directory group called "compute". So we enable PAM:
USE_PAM_CREDS = Y
Then add the following to /etc/pam.d/lsf:
auth required pam_group.so
auth required pam_localuser.so
account required pam_unix.so
session required pam_limits.so
And add this line to /etc/security/group.conf:
*;*;%compute;Al0000-2400;docker
In other words: "docker" is a local group on each host with GID 490, "compute" is an AD group, and we expect pam_group to add members of "compute" to the "docker" group. (The group.conf fields are services;ttys;users;times;groups, so the line reads: any service, any tty, members of "compute", all days from 00:00 to 24:00, grant the "docker" group.)
Empirically we see that "res" PIDs *do* obtain the docker GID of 490, but once the docker containers are launched, the container processes do NOT have the docker GID.
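A small diagnostic along these lines (run once under "res" and once inside the container) is how one can compare the supplementary groups in the two contexts; the "docker" group name reflects our site setup:

```python
# Diagnostic: report whether a named group's GID appears in this
# process's supplementary group list. Run it both in the job (under
# res) and inside the container to compare what pam_group granted.
import grp
import os


def has_group(name):
    """Return True if the named group's GID is among this process's
    supplementary groups; False if absent or the group is unknown."""
    try:
        gid = grp.getgrnam(name).gr_gid
    except KeyError:
        return False
    return gid in os.getgroups()


if __name__ == "__main__":
    print("supplementary GIDs:", os.getgroups())
    print("docker group present:", has_group("docker"))
```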
We observe this problem with the EXEC_DRIVER implementation in lsb.applications, but not with the old JOB_LAUNCHER interface.
It *seems* as though something in the "res" process is doing a group check where there does not need to be one. Or, if it does need to do this, it's ignoring PAM.
------------------------------
Matt Callaway
------------------------------
#SpectrumComputingGroup