High Performance Computing Group


Terminal resize disconnects from job, or, problems with signal handling in the NIOS process

  • 1.  Terminal resize disconnects from job, or, problems with signal handling in the NIOS process

    Posted Tue April 07, 2020 09:22 AM
    Greetings,

    We are running LSF Suite for HPC 10.2.0.6. We observe that resizing the terminal window of an interactive LSF job will "detach" from the job, leaving the job running, but losing the ability to interact with it.

    This appears to be similar to (but not the same as) this old LSF 8 bug:

    P100757: This fix prevents an interactive parallel job from exiting with the SIGPROF signal when resizing its terminal window.
    LSF 8.0 Linux


    In that case SIGWINCH was interpreted as SIGPROF, which is not what we see. In that issue, the wrong signal was actually delivered to the process, and since the process was not trapping SIGPROF, it exited. In our problem, the "res" process is checking whether the job submitter is in the "docker" group. I can't imagine why it cares about that group membership at the moment a signal arrives; the job is already running. Besides, with EXEC_DRIVER, the job runs as the lsfadmin user, not the submitting user.

    First some details about our configuration:

    Begin Application
    NAME = docker1
    CONTAINER = docker[image($LSB_CONTAINER_IMAGE) options(--rm)]
    EXEC_DRIVER = starter[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker1_starter.py] controller[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-control.py] monitor[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-monitor.py]
    End Application

    We are using our own python job wrapper to build the "docker run" command. It began life as the IBM wrapper but now does more. We hope, but are not confident, that this is not the source of the problem.
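    For context, here is a minimal sketch of what a starter-side wrapper of this kind does: assemble a "docker run" argv from the image passed in via $LSB_CONTAINER_IMAGE. The helper name and structure are ours for illustration, not IBM's shipped wrapper or our production script.

    ```python
    #!/usr/bin/env python3
    # Hypothetical sketch of an EXEC_DRIVER starter script that builds a
    # "docker run" command. LSB_CONTAINER_IMAGE is a real LSF variable;
    # build_docker_cmd is an illustrative helper, not IBM's code.
    import os
    import shlex

    def build_docker_cmd(image, user_cmd, extra_opts=("--rm",)):
        """Assemble a docker run invocation as an argv list."""
        cmd = ["docker", "run"]
        cmd.extend(extra_opts)       # e.g. the options(--rm) from the profile
        cmd.append(image)
        cmd.extend(shlex.split(user_cmd))
        return cmd

    if __name__ == "__main__":
        image = os.environ.get("LSB_CONTAINER_IMAGE", "ubuntu:20.04")
        print(" ".join(build_docker_cmd(image, "echo hello")))
    ```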

    Here is what we observe:

    We performed an exhaustive search of all the signals to see what happens with them.  This is from sending the signal to the nios process on the compute client, the process that acts as the user's intermediary for an interactive job.  It relays signals to the processes running on the exec host.  The signals fall into 4 categories based on what happens.
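    The probe itself is simple; a rough sketch of what we ran is below. Finding the nios PID and classifying the outcomes was done by hand; this only automates delivering the signal and checking whether the process survived.

    ```python
    #!/usr/bin/env python3
    # Sketch of the signal probe: deliver a named signal to a PID, wait
    # briefly, then use kill(pid, 0) as an existence check.
    import os
    import signal
    import time

    def send_and_check(pid, signame):
        """Deliver SIG<signame> to pid; return True if the process survives."""
        sig = getattr(signal, "SIG" + signame)
        try:
            os.kill(pid, sig)
            time.sleep(1)       # give the target a moment to react
            os.kill(pid, 0)     # signal 0 checks existence without delivering
            return True
        except ProcessLookupError:
            return False
    ```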

    1) The bug this issue is about.  The interactive connection is terminated, the container continues to run on the exec host, and the res process on the exec host logs the messages below.  Notably, these are all signals that can be generated from the terminal layer.

    checkDriverAuthfork: user <mcallawa> is not in docker group
    sigJobLevelDriver: Job failed in lsfExecDriverfork(). 
    • INT
    • QUIT
    • TERM
    • WINCH

    2) Signals that seem to work as intended.  They're delivered to the processes in the container, res does not log anything, and the appropriate thing happens:

    • HUP - the job is killed, nothing is logged and the container is killed
    • STOP - the shell stops the job, fg works fine afterward
    • TSTP - The processes stop responding to input, "docker ps" on the exec host shows the container as "Paused"

    3) Signals where nothing seems to happen at all.  I believe these are signals the nios process traps for its own use:

    • USR2
    • PIPE
    • CHLD
    • CONT
    • TTIN
    • TTOU
    • URG
    • TMIN

    4) Signals where the interactive session is terminated, res does NOT log a message, and the container keeps running.  I believe these are simply signals that nios does not trap, so it exits on receipt, which is the default action for untrapped signals.  I wouldn't categorize these as a bug:

    • ILL
    • TRAP
    • ABRT
    • BUS
    • FPE
    • KILL
    • USR1
    • SEGV
    • ALRM
    • STKFLT
    • XCPU
    • XFSZ
    • VTALRM
    • PROF
    • IO
    • PWR
    • SYS

    In trying to fix the message:

    checkDriverAuthfork: user <mcallawa> is not in docker group

    we attempt to use PAM to extend "docker" group membership to the group we use to indicate membership in a "compute enabled" feature. This is an Active Directory group called "compute". So we enable PAM:

    USE_PAM_CREDS = Y

    Then add the following to /etc/pam.d/lsf:

    auth required pam_group.so
    auth required pam_localuser.so
    account required pam_unix.so
    session required pam_limits.so

    And to /etc/security/group.conf:

    *;*;%compute;Al0000-2400;docker

    In other words, "docker" is a group local to each host with GID 490, "compute" is an AD group, and we expect pam_group to add members of "compute" to the docker group for any service, on any tty, at any time of day (that is what the *;*;%compute;Al0000-2400;docker line requests).

    Empirically we see that "res" PIDs *do* obtain the docker GID of 490, but once the docker containers are launched, the container processes do NOT have the docker GID.
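    This is how we checked: read the "Groups:" line from /proc/<pid>/status for the res PID and for the containerized job processes, and see whether GID 490 appears. A minimal sketch (Linux-only; the GID 490 value is specific to our hosts):

    ```python
    #!/usr/bin/env python3
    # Verify whether a running process holds a given supplementary GID
    # by parsing the "Groups:" line of /proc/<pid>/status.

    def supplementary_gids(pid):
        """Return the set of supplementary GIDs for pid."""
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("Groups:"):
                    return {int(g) for g in line.split()[1:]}
        return set()

    def has_gid(pid, gid):
        return gid in supplementary_gids(pid)
    ```

    Running has_gid(res_pid, 490) returns True, but the same check against the processes inside the container returns False.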

    We observe this problem with the EXEC_DRIVER implementation in lsb.applications, but not with the old JOB_LAUNCHER interface.

    It *seems* as though something in the "res" process is doing a group check where none is needed. Or, if the check is genuinely necessary, it is ignoring the PAM-granted groups.


    ------------------------------
    Matt Callaway
    ------------------------------

    #SpectrumComputingGroup


  • 2.  RE: Terminal resize disconnects from job, or, problems with signal handling in the NIOS process
    Best Answer

    Posted Tue April 07, 2020 12:11 PM
    Hi Matt,

    There is a fix in 10.1.0.9 that seems like a possible match for this description: "This fix prevents an interactive Docker job from losing its connection when resizing the terminal".

    Here is a link to 10.1 fix pack 9:

    https://www.ibm.com/support/pages/ibm-spectrum-lsf-101-fix-pack-9-10109

    Please contact the LSF Support team for additional help.

    ------------------------------
    John Welch
    ------------------------------