High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

  • 1.  Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Tue January 16, 2024 12:03 PM

    Running LSF Suite For Enterprise 10.2.0.14 on a medium-sized GPU cluster: 200 compute nodes, 1600 GPUs.
    About a week ago, we started seeing some jobs that requested GPUs but didn't get them. No GPUs showed up in the nvidia-smi output, and no message was passed to the compute node (null response from the bread command). The condition is intermittent and appears to be random: on any compute node, for any user, in any queue, dispatched jobs sometimes fail to get their GPUs for a while, then they get them, then they don't, and so on. Whenever I change a setting, things seem to get better for a while, then the problem comes back. There were no changes to cluster settings before this started. Any suggestions?



    ------------------------------
    Ray Rose
    ------------------------------


  • 2.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Tue January 16, 2024 02:04 PM

    When the issue happens, does it correct itself later without any change on the same GPU node? Does bjobs -l show GPU allocation info? And does lshosts/lsload/bhosts show GPU info correctly? 

    You may also check the LSF sbatchd log to see if there are any error messages. If the job spans multiple nodes, check the sbatchd log on the job's first execution node in particular.

    You may use the nvidia-smi reset option to reset the GPU and see if it helps.

    One suggestion is to apply the post-FP14 patch http://www.ibm.com/support/fixcentral/swg/selectFixes?product=ibm/Other+software/IBM+Spectrum+LSF&release=All&platform=All&function=fixId&fixids=lsf-10.1-build601754&includeSupersedes=0 and then check the result again.

    Because GPU management at the job execution level involves the LSF sbatchd/res/lim services, the way LSF services are started on the node can sometimes affect LSF behaviour. One thing you can try is to stop LSF services on a node, use lsadmin/badmin or bctrld to start them again, then submit a GPU job to that node and see whether the issue recurs.

    Or create a support case to request more investigation.



    ------------------------------
    YI SUN
    ------------------------------



  • 3.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Thu January 18, 2024 10:17 AM

    "bjobs -l" for a typical job shows:

    RESOURCE REQUIREMENT DETAILS:
     Combined: select[((type == X86_64 &&standardnode) && (ngpus>0)) && (type == an
                         y)] order[r15s:pg] rusage[mem=45056.00:ngpus_physical=1.00
                         ] span[ptile=16] affinity[core(1)*1]
     Effective: select[((type == X86_64 &&standardnode) && (type == any))] order[r1
                         5s:pg] rusage[mem=45056.00] span[ptile=16] affinity[core(1
                         )*1] 

     GPU REQUIREMENT DETAILS:
     Combined: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
     Effective: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia

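    Note that the GPU requirement ("ngpus>0" in select and "ngpus_physical" in rusage) is missing from the "Effective" resource requirement, even though it is present in "Combined."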
    There is no "EXTERNAL MESSAGES:" section in the reply, and a "bread" command returns "No message available."
    When the job starts, $CUDA_VISIBLE_DEVICES is not set, and no devices show up in the nvidia-smi response.
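
    To look for this condition across many jobs, the Combined/Effective comparison can be scripted. The following is only a minimal sketch, not an LSF tool: the function names are made up for illustration, the sample text is the "bjobs -l" excerpt above, and it assumes that a healthy job carries "ngpus_physical" in the Effective rusage string just as it does in Combined. Because "bjobs -l" wraps long lines, the continuation lines are joined before searching.

```python
import re

# Excerpt of the "bjobs -l" output quoted in this post (wrapped as printed).
SAMPLE = """
RESOURCE REQUIREMENT DETAILS:
 Combined: select[((type == X86_64 &&standardnode) && (ngpus>0)) && (type == an
                     y)] order[r15s:pg] rusage[mem=45056.00:ngpus_physical=1.00
                     ] span[ptile=16] affinity[core(1)*1]
 Effective: select[((type == X86_64 &&standardnode) && (type == any))] order[r1
                     5s:pg] rusage[mem=45056.00] span[ptile=16] affinity[core(1
                     )*1]

GPU REQUIREMENT DETAILS:
 Combined: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
 Effective: num=1:mode=exclusive_process:mps=no:j_exclusive=yes:gvendor=nvidia
"""


def unwrap_section(text, header):
    """Return {"Combined": ..., "Effective": ...} for the section under `header`.

    Wrapped continuation lines are appended to the entry they belong to.
    The section is assumed to end at the first blank line.
    """
    entries = {}
    in_section = False
    label = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == header:
            in_section = True
            continue
        if not in_section:
            continue
        if not stripped:
            break
        match = re.match(r"(Combined|Effective):\s*(.*)", stripped)
        if match:
            label = match.group(1)
            entries[label] = match.group(2)
        elif label is not None:
            entries[label] += stripped
    return entries


def gpu_dropped(text):
    """True if Combined requests ngpus_physical but Effective does not."""
    req = unwrap_section(text, "RESOURCE REQUIREMENT DETAILS:")
    combined = req.get("Combined", "")
    effective = req.get("Effective", "")
    return "ngpus_physical" in combined and "ngpus_physical" not in effective


if __name__ == "__main__":
    print(gpu_dropped(SAMPLE))
```

    In practice the text would come from running "bjobs -l" on each job ID of interest rather than from a pasted sample.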

    Nothing has been written to the sbatchd log since shortly after the master was restarted yesterday.

    This does not affect all jobs or all nodes or all of any particular user's jobs. For example, when I submit jobs, they almost never get this problem.
    Consider the following three jobs:
    1275012   jkweber      DONE         x86_2h    4*cccxc501.pok.ibm.c Jan 16 12:36 Jan 16 12:36 Jan 16 12:49 L
    1275097   jkweber      EXIT         x86_6h    4*cccxc501.pok.ibm.c Jan 16 12:49 Jan 16 12:49 Jan 16 12:50 L 
    1275139   jkweber      DONE         x86_2h    4*cccxc501.pok.ibm.c Jan 16 13:01 Jan 16 13:01 Jan 16 13:10 L 

    The timestamps are for submit time, start time, finish time, respectively.

    The first job succeeded. The second job failed with no GPUs allocated. The third job succeeded. The only noteworthy difference is the queue. I created the x86_2h queue to troubleshoot this problem. The only significant differences between x86_2h and x86_6h are the maximum runtime (2 hours vs 6 hours) and the "order" setting: "order[-slots]" for x86_2h and "order[r15s:pg]" for x86_6h. The new queue appeared to solve the problem, but when more users started using x86_2h, the problem started showing up there too.

    I'll try restarting some of the nodes. Is there any need to restart the masters?

    The fix you mention appears to be for LSF 10.1, not Suite for Enterprise. There is no patchinstall command in S4E. Also, the README describes it as "Fix to support NVIDIA H100 GPU for an LSF environment." Our cluster has a mix of V100s, A100s, and H100s. All have been working well for many months. Since this problem first showed up about a week ago, it has affected only V100s and A100s. H100s are doing fine. My current hunch is that one or more users may be doing something that causes compute nodes to go into a "bad" state, and then all subsequent jobs that request GPUs get this problem. But somehow, nodes eventually recover and start allocating GPUs properly. I suspect that the H100s are not getting this failure because only a small subset of the user community is authorized to use the H100s, and that subset does not include the users who are triggering the problem.



    ------------------------------
    Ray Rose
    ------------------------------



  • 4.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Thu January 18, 2024 09:15 PM

    It seems that the merge of the job's resource requirements sometimes doesn't work correctly. Better to open a case with Support.



    ------------------------------
    YI SUN
    ------------------------------



  • 5.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Fri January 19, 2024 02:18 AM

    We have seen similar things some years ago, and since then we have this in our lsb.params file:

    # this is needed to make sure, that cgroups etc are cleared, before a new
    # job gets started (suggested by LSF support, added Dec 2016)
    # important for releasing GPUs after usage!
    JOB_INCLUDE_POSTPROC=Y

    What we observed was that the first GPU job ended and the next one was dispatched, but the GPU was still marked as "in use", so the new job failed because it could not get exclusive access to the GPU!

    Maybe it helps!
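
    For anyone who wants to confirm whether the setting is already active in their own configuration, here is a small sketch that reads a parameter out of lsb.params text. It is an illustration only: the function name and the sample file content are hypothetical, and it assumes the usual "Begin Parameters ... End Parameters" layout with "#" comments. The location of lsb.params is site-specific, so the sketch works on text that has already been read in.

```python
# Hypothetical lsb.params content; only JOB_INCLUDE_POSTPROC comes from this thread.
SAMPLE_LSB_PARAMS = """
Begin Parameters
DEFAULT_QUEUE = normal   # placeholder entry for illustration
# important for releasing GPUs after usage!
JOB_INCLUDE_POSTPROC=Y
End Parameters
"""


def read_lsb_param(text, name):
    """Return the value of `name` inside the Parameters block, or None.

    Comments starting with '#' are ignored, so a commented-out setting
    counts as not set. If the name appears twice, the last value wins.
    """
    in_block = False
    value = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        lowered = line.lower()
        if lowered == "begin parameters":
            in_block = True
        elif lowered == "end parameters":
            in_block = False
        elif in_block and "=" in line:
            key, val = line.split("=", 1)
            if key.strip() == name:
                value = val.strip()
    return value


if __name__ == "__main__":
    print(read_lsb_param(SAMPLE_LSB_PARAMS, "JOB_INCLUDE_POSTPROC"))
```

    A return value of None means the parameter is not set explicitly in the file, in which case LSF falls back to its default.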



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 6.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Fri January 19, 2024 09:10 AM

    In cases where the GPU is still being used in post-exec, that setting would be critical. This would be especially true if the GPU is in any of the exclusive modes.



    ------------------------------
    Larry Adams
    ------------------------------



  • 7.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Fri January 19, 2024 01:38 PM

    I'll give the JOB_INCLUDE_POSTPROC=Y option a try. But what's strange here is that this problem just started spontaneously about a week ago. I didn't change anything in LSF or Linux. Also, it only affects some jobs, and I can't find anything that could be a trigger. My current hunch, for various reasons, is that one or more users must be doing something to cause it. In some queues with limited membership, it never happens. This suggests that the culprit(s) do not have access to those queues, so it only happens in the queues that are open to all users.



    ------------------------------
    Ray Rose
    ------------------------------



  • 8.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Sat February 03, 2024 11:28 AM

    I added the "JOB_INCLUDE_POSTPROC=Y" statement and then, for good measure, restarted the LSF daemons on all the LSF servers (which I had already done twice), and the problem went away. I guess I'll never know whether this statement fixed it or whether it was restarting the servers. But I can't see any scenario where this statement could do any harm, so I plan to leave it there.



    ------------------------------
    Ray Rose
    ------------------------------



  • 9.  RE: Problem with GPU allocation in Spectrum LSF Suite for Enterprise

    Posted Mon February 05, 2024 10:24 AM

    This should be in a best practices guide somewhere. Setting JOB_INCLUDE_POSTPROC to N lets long-running post-processing (for example, copying results from local scratch to NFS) proceed without blocking incoming jobs, but it also risks negative outcomes when the next job tries to use the GPU during this overlap period, especially when the mode needs to be switched, as I alluded to in my previous post.



    ------------------------------
    Larry Adams
    ------------------------------