This should be in a best practices guide somewhere. JOB_INCLUDE_POSTPROC=N lets long-running post-processing (for example, copying results from local scratch to NFS) run without blocking incoming jobs, but it also risks problems when the next job tries to use the GPU during that overlap period, especially when the GPU compute mode needs to be switched, as I alluded to in the previous post.
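To make the tradeoff concrete, here is a minimal sketch of the two pieces involved (JOB_INCLUDE_POSTPROC is the standard lsb.params parameter; the queue name and post-exec script path are just illustrations of the "copy scratch to NFS" case):
# lsb.params -- the default (N) lets post-processing overlap the next job's start;
# Y makes the next dispatch wait until post-processing has finished, which is the
# safer choice when GPUs are handed out in exclusive mode
JOB_INCLUDE_POSTPROC=Y
# lsb.queues -- the kind of long-running post-exec that motivates the overlap in the first place
Begin Queue
QUEUE_NAME = gpu_normal
POST_EXEC  = /shared/scripts/copy_scratch_to_nfs.sh   # hypothetical script
End Queue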
Original Message:
Sent: Sat February 03, 2024 11:27 AM
From: Ray Rose
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
I added the "JOB_INCLUDE_POSTPROC=Y" statement, and then, for good measure, restarted the LSF daemons on all the LSF servers (which I had already done twice already), and then the problem went away. I guess I'll never know whether this statement fixed it, or if it was restarting the servers. But I can't see any scenario where this statement could do any harm, so I plan to leave it there.
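For anyone who finds this thread later, a sketch of the commands I mean by "restarted the LSF daemons" (standard LSF admin commands; strictly speaking, a reconfig of mbatchd should be enough to pick up an lsb.params change):
badmin reconfig       # re-read lsb.* configuration on the master
badmin mbdrestart     # or restart mbatchd outright
badmin hrestart all   # restart sbatchd on all server hosts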
------------------------------
Ray Rose
Original Message:
Sent: Fri January 19, 2024 02:17 AM
From: Bernd Dammann
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
We saw similar things some years ago, and since then we have had this in our lsb.params file:
# this is needed to make sure that cgroups etc. are cleared before a new
# job gets started (suggested by LSF support, added Dec 2016)
# important for releasing GPUs after usage!
JOB_INCLUDE_POSTPROC=Y
What we observed was that the first GPU job ended and the next one was dispatched, but the GPU was still marked as "in use", so the new job failed because it could not get exclusive access to the GPU!
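A quick way to confirm the parameter is actually active after a reconfig (output format may differ slightly between versions):
bparams -a | grep -i POSTPROC
# expected to show something like: JOB_INCLUDE_POSTPROC = Y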
Maybe it helps!
------------------------------
Bernd Dammann
Original Message:
Sent: Thu January 18, 2024 10:17 AM
From: Ray Rose
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
"bjobs -l" for a typical job shows:
RESOURCE REQUIREMENT DETAILS:
Combined: select[((type == X86_64 &&standardnode) && (ngpus>0)) && (type == any)] order[r15s:pg] rusage[mem=45056.00:ngpus_physical=1.00] span[ptile=16] affinity[core(1)*1]
Effective: select[((type == X86_64 &&standardnode) && (type == any))] order[r15s:pg] rusage[mem=45056.00] span[ptile=16] affinity[core(1)*1]
GPU REQUIREMENT DETAILS:
Combined: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
Effective: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
Note that the GPU requirement (the ngpus>0 select and the ngpus_physical rusage) is missing from "Effective."
There is no "EXTERNAL MESSAGES:" section in the reply, and a "bread" command returns "No message available."
When the job starts, $CUDA_VISIBLE_DEVICES is not set, and no devices show up in the nvidia-smi response.
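For completeness, here are the checks described above in command form (1275097 is the failing job from the list further down; the last two lines run inside the job itself):
bjobs -l 1275097    # resource and GPU requirement details for the job
bread 1275097       # external messages posted to the job, if any
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"   # run inside the job
nvidia-smi -L                                                # GPUs visible inside the job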
Nothing has been written to the sbatchd log since shortly after the master was restarted yesterday.
This does not affect all jobs or all nodes or all of any particular user's jobs. For example, when I submit jobs, they almost never get this problem.
Consider the following three jobs:
1275012 jkweber DONE x86_2h 4*cccxc501.pok.ibm.c Jan 16 12:36 Jan 16 12:36 Jan 16 12:49 L
1275097 jkweber EXIT x86_6h 4*cccxc501.pok.ibm.c Jan 16 12:49 Jan 16 12:49 Jan 16 12:50 L
1275139 jkweber DONE x86_2h 4*cccxc501.pok.ibm.c Jan 16 13:01 Jan 16 13:01 Jan 16 13:10 L
The timestamps are for submit time, start time, finish time, respectively.
The first job succeeded. The second job failed with no GPUs allocated. The third job succeeded. The only noteworthy difference is the queue. I created the x86_2h queue to troubleshoot this problem. The only significant differences between x86_2h and x86_6h are the maximum runtime (2 hours vs. 6 hours) and the "order" setting: "order[r15s:pg]" for x86_6h and "order[-slots]" for x86_2h. Moving jobs to the new queue appeared to solve the problem, but once more users started using x86_2h, the problem started showing up there too.
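For reference, the relevant parts of the two queue definitions look roughly like this (a simplified sketch of the lsb.queues stanzas, trimmed to the two differences mentioned above):
Begin Queue
QUEUE_NAME = x86_6h
RUNLIMIT   = 6:00                # 6-hour run limit
RES_REQ    = order[r15s:pg]
End Queue
Begin Queue
QUEUE_NAME = x86_2h
RUNLIMIT   = 2:00                # 2-hour run limit
RES_REQ    = order[-slots]
End Queue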
I'll try restarting some of the nodes. Is there any need to restart the masters?
The fix you mention appears to be for LSF 10.1, not Suite for Enterprise. There is no patchinstall command in S4E. Also, the README describes it as "Fix to support NVIDIA H100 GPU for an LSF environment." Our cluster has a mix of V100s, A100s, and H100s. All have been working well for many months. Since this problem first showed up about a week ago, it has affected only V100s and A100s; the H100s are doing fine. My current hunch is that one or more users may be doing something that puts compute nodes into a "bad" state, after which all subsequent jobs that request GPUs hit this problem. But somehow the nodes eventually recover and start allocating GPUs properly. I suspect the H100s are not seeing this failure because only a small subset of the user community is authorized to use them, and that subset does not include the users who are triggering the problem.
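To test that hunch, a quick spot-check one can run on a suspect node right after a job exits (plain nvidia-smi queries; nothing LSF-specific is assumed here):
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# a GPU left in exclusive mode and/or holding a leftover compute process would explain
# why the next exclusive_process job cannot be placed on it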
------------------------------
Ray Rose
Original Message:
Sent: Tue January 16, 2024 02:04 PM
From: YI SUN
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
When the issue happens, does it correct itself later without any change on the same GPU node? Does bjobs -l show GPU allocation info? And do lshosts/lsload/bhosts show GPU info correctly?
You may also check the LSF sbatchd log to see if there are any error messages. If the job spans nodes, check the sbatchd log on the job's first execution node in particular.
You may use the nvidia-smi reset option to reset the GPU and see if that helps.
One suggestion is to apply the post-FP14 patch http://www.ibm.com/support/fixcentral/swg/selectFixes?product=ibm/Other+software/IBM+Spectrum+LSF&release=All&platform=All&function=fixId&fixids=lsf-10.1-build601754&includeSupersedes=0 and then check the result again.
Since GPU management at the job execution level involves the LSF sbatchd/res/lim services, the way LSF services are started on a node can sometimes affect LSF behaviour. One thing you can try is to stop the LSF services on a node, start them with lsadmin/badmin or bctrld, then submit a GPU job to that node and see if the issue recurs.
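A sketch of those checks in command form (host name and GPU index are placeholders; the reset and the service restart need root and an idle GPU/node):
lshosts -gpu <hostname>                  # static GPU info as LIM sees it
bhosts -gpu <hostname>                   # GPU allocation state as mbatchd/sbatchd see it
lsload -gpu <hostname>                   # dynamic GPU load indices
grep -i error <LSF_LOGDIR>/sbatchd.log.<hostname> | tail   # LSF_LOGDIR is set in lsf.conf
nvidia-smi --gpu-reset -i <gpu_index>    # only when no jobs are running on that GPU
bctrld stop sbd <hostname> && bctrld start sbd <hostname>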
Or create a support case to request more investigation.
------------------------------
YI SUN
Original Message:
Sent: Tue January 16, 2024 12:03 PM
From: Ray Rose
Subject: Problem with GPU allocation in Spectrum LSF Suite for Enterprise
We are running LSF Suite for Enterprise 10.2.0.14 on a medium-sized GPU cluster: 200 compute nodes, 1600 GPUs.
About a week ago, we started seeing jobs that requested GPUs but didn't get them. No GPUs show up in the nvidia-smi response, and no message is passed to the compute node (null response from the bread command). The condition is intermittent and appears random: on any compute node, for any user, in any queue, sometimes all dispatched jobs fail to get their GPUs for a while, then they do, then they don't, and so on. Whenever I change a setting, things seem to get better for a while, then the problem comes back. There were no changes to cluster settings before this started. Any suggestions?
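In case it helps, this is the kind of minimal test job we use to check whether a given node hands out its GPUs (queue and host names are placeholders; the -gpu string matches the exclusive-process requirement shown earlier in the thread):
bsub -q <gpu_queue> -m <gpu_host> -o gpu_check.%J.out \
     -gpu "num=1:mode=exclusive_process:j_exclusive=yes" \
     "echo CUDA_VISIBLE_DEVICES=\$CUDA_VISIBLE_DEVICES; nvidia-smi -L"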
------------------------------
Ray Rose
------------------------------