This should be in a best practices guide somewhere. JOB_INCLUDE_POSTPROC=N lets long-running post-processing (for example, copying results from local scratch to NFS) run without blocking incoming jobs, but it also risks problems when the next job tries to use the GPU during that overlap period, especially when the GPU compute mode needs to be switched, as I alluded to in the previous post.
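To make the tradeoff concrete, here is a minimal sketch of the two pieces involved (JOB_INCLUDE_POSTPROC is the standard lsb.params parameter; the queue name and post-exec script path are just illustrations of the "copy scratch to NFS" case):
# lsb.params -- the default (N) lets post-processing overlap the next job's start;
# Y makes the next dispatch wait until post-processing has finished, which is the
# safer choice when GPUs are handed out in exclusive mode
JOB_INCLUDE_POSTPROC=Y
# lsb.queues -- the kind of long-running post-exec that motivates the overlap in the first place
Begin Queue
QUEUE_NAME = gpu_normal
POST_EXEC  = /shared/scripts/copy_scratch_to_nfs.sh   # hypothetical script
End Queue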
Original Message:
Sent: Sat February 03, 2024 11:27 AM
From: Ray Rose
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
I added the "JOB_INCLUDE_POSTPROC=Y" statement, and then, for good measure, restarted the LSF daemons on all the LSF servers (which I had already done twice already), and then the problem went away. I guess I'll never know whether this statement fixed it, or if it was restarting the servers. But I can't see any scenario where this statement could do any harm, so I plan to leave it there.
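For anyone who finds this thread later, a sketch of the commands I mean by "restarted the LSF daemons" (standard LSF admin commands; strictly speaking, a reconfig of mbatchd should be enough to pick up an lsb.params change):
badmin reconfig       # re-read lsb.* configuration on the master
badmin mbdrestart     # or restart mbatchd outright
badmin hrestart all   # restart sbatchd on all server hosts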
------------------------------
Ray Rose
Original Message:
Sent: Fri January 19, 2024 02:17 AM
From: Bernd Dammann
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
We saw similar things some years ago, and since then we have had this in our lsb.params file:
# this is needed to make sure that cgroups etc. are cleared before a new
# job gets started (suggested by LSF support, added Dec 2016)
# important for releasing GPUs after usage!
JOB_INCLUDE_POSTPROC=Y
What we observed was that the first GPU job ended and the next one was dispatched, but the GPU was still marked as "in use", so the new job failed because it could not get exclusive access to the GPU!
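A quick way to confirm the parameter is actually active after a reconfig (output format may differ slightly between versions):
bparams -a | grep -i POSTPROC
# expected to show something like: JOB_INCLUDE_POSTPROC = Y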
Maybe it helps!
------------------------------
Bernd Dammann
Original Message:
Sent: Thu January 18, 2024 10:17 AM
From: Ray Rose
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
"bjobs -l" for a typical job shows:
RESOURCE REQUIREMENT DETAILS:
Combined: select[((type == X86_64 &&standardnode) && (ngpus>0)) && (type == any)] order[r15s:pg] rusage[mem=45056.00:ngpus_physical=1.00] span[ptile=16] affinity[core(1)*1]
Effective: select[((type == X86_64 &&standardnode) && (type == any))] order[r15s:pg] rusage[mem=45056.00] span[ptile=16] affinity[core(1)*1]
GPU REQUIREMENT DETAILS:
Combined: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
Effective: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
Note that the GPU requirement (the ngpus>0 select and the ngpus_physical rusage) is missing from "Effective."
There is no "EXTERNAL MESSAGES:" section in the reply, and a "bread" command returns "No message available."
When the job starts, $CUDA_VISIBLE_DEVICES is not set, and no devices show up in the nvidia-smi response.
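For completeness, here are the checks described above in command form (1275097 is the failing job from the list further down; the last two lines run inside the job itself):
bjobs -l 1275097    # resource and GPU requirement details for the job
bread 1275097       # external messages posted to the job, if any
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"   # run inside the job
nvidia-smi -L                                                # GPUs visible inside the job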
Nothing has been written to the sbatchd log since shortly after the master was restarted yesterday.
This does not affect all jobs or all nodes or all of any particular user's jobs. For example, when I submit jobs, they almost never get this problem.
Consider the following three jobs:
1275012 jkweber DONE x86_2h 4*cccxc501.pok.ibm.c Jan 16 12:36 Jan 16 12:36 Jan 16 12:49 L
1275097 jkweber EXIT x86_6h 4*cccxc501.pok.ibm.c Jan 16 12:49 Jan 16 12:49 Jan 16 12:50 L
1275139 jkweber DONE x86_2h 4*cccxc501.pok.ibm.c Jan 16 13:01 Jan 16 13:01 Jan 16 13:10 L
The timestamps are for submit time, start time, finish time, respectively.
The first job succeeded. The second job failed with no GPUs allocated. The third job succeeded. The only noteworthy difference is the queue. I created the x86_2h queue to troubleshoot this problem. The only significant differences between x86_2h and x86_6h are the maximum runtime (2 hours vs. 6 hours) and the "order" setting: "order[r15s:pg]" for x86_6h and "order[-slots]" for x86_2h. Moving jobs to the new queue appeared to solve the problem, but once more users started using x86_2h, the problem started showing up there too.
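For reference, the relevant parts of the two queue definitions look roughly like this (a simplified sketch of the lsb.queues stanzas, trimmed to the two differences mentioned above):
Begin Queue
QUEUE_NAME = x86_6h
RUNLIMIT   = 6:00                # 6-hour run limit
RES_REQ    = order[r15s:pg]
End Queue
Begin Queue
QUEUE_NAME = x86_2h
RUNLIMIT   = 2:00                # 2-hour run limit
RES_REQ    = order[-slots]
End Queue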
I'll try restarting some of the nodes. Is there any need to restart the masters?
The fix you mention appears to be for LSF 10.1, not Suite for Enterprise. There is no patchinstall command in S4E. Also, the README describes it as "Fix to support NVIDIA H100 GPU for an LSF environment." Our cluster has a mix of V100s, A100s, and H100s. All have been working well for many months. Since this problem first showed up about a week ago, it has affected only V100s and A100s; the H100s are doing fine. My current hunch is that one or more users may be doing something that puts compute nodes into a "bad" state, after which all subsequent jobs that request GPUs hit this problem. But somehow the nodes eventually recover and start allocating GPUs properly. I suspect the H100s are not seeing this failure because only a small subset of the user community is authorized to use them, and that subset does not include the users who are triggering the problem.
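To test that hunch, a quick spot-check one can run on a suspect node right after a job exits (plain nvidia-smi queries; nothing LSF-specific is assumed here):
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# a GPU left in exclusive mode and/or holding a leftover compute process would explain
# why the next exclusive_process job cannot be placed on it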
------------------------------
Ray Rose
Original Message:
Sent: Tue January 16, 2024 02:04 PM
From: YI SUN
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
When the issue happens, does it correct itself later without any change on the same GPU node? Does bjobs -l show GPU allocation info? And do lshosts/lsload/bhosts show GPU info correctly?
You may also check the LSF sbatchd log to see if there are any error messages. If the job spans nodes, check the sbatchd log on the job's first execution node in particular.
You may use the nvidia-smi reset option to reset the GPU and see if that helps.
One suggestion is to apply the post-FP14 patch http://www.ibm.com/support/fixcentral/swg/selectFixes?product=ibm/Other+software/IBM+Spectrum+LSF&release=All&platform=All&function=fixId&fixids=lsf-10.1-build601754&includeSupersedes=0 and then check the result again.
Since GPU management at the job execution level involves the LSF sbatchd/res/lim services, the way LSF services are started on a node can sometimes affect LSF behaviour. One thing you can try is to stop the LSF services on a node, start them with lsadmin/badmin or bctrld, then submit a GPU job to that node and see if the issue recurs.
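A sketch of those checks in command form (host name and GPU index are placeholders; the reset and the service restart need root and an idle GPU/node):
lshosts -gpu <hostname>                  # static GPU info as LIM sees it
bhosts -gpu <hostname>                   # GPU allocation state as mbatchd/sbatchd see it
lsload -gpu <hostname>                   # dynamic GPU load indices
grep -i error <LSF_LOGDIR>/sbatchd.log.<hostname> | tail   # LSF_LOGDIR is set in lsf.conf
nvidia-smi --gpu-reset -i <gpu_index>    # only when no jobs are running on that GPU
bctrld stop sbd <hostname> && bctrld start sbd <hostname>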
Or create a support case to request more investigation.
------------------------------
YI SUN
Original Message:
Sent: Tue January 16, 2024 12:03 PM
From: Ray Rose
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
We are running LSF Suite for Enterprise 10.2.0.14 on a medium-sized GPU cluster: 200 compute nodes, 1600 GPUs.
About a week ago, we started seeing jobs that requested GPUs but didn't get them. No GPUs show up in the nvidia-smi response, and no message is passed to the compute node (null response from the bread command). The condition is intermittent and appears random: on any compute node, for any user, in any queue, sometimes all dispatched jobs fail to get their GPUs for a while, then they do, then they don't, and so on. Whenever I change a setting, things seem to get better for a while, then the problem comes back. There were no changes to cluster settings before this started. Any suggestions?
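In case it helps, this is the kind of minimal test job we use to check whether a given node hands out its GPUs (queue and host names are placeholders; the -gpu string matches the exclusive-process requirement shown earlier in the thread):
bsub -q <gpu_queue> -m <gpu_host> -o gpu_check.%J.out \
     -gpu "num=1:mode=exclusive_process:j_exclusive=yes" \
     "echo CUDA_VISIBLE_DEVICES=\$CUDA_VISIBLE_DEVICES; nvidia-smi -L"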
------------------------------
Ray Rose
------------------------------