High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

  • 1.  LSF and DCGM support matrix?

    Posted 28 days ago

    Hi,

    Back in 2021/22 there was a document that served as a kind of LSF/DCGM support matrix. Is there an updated version?

    We run the latest LSF Service Pack 15 and DCGM 4.2.3, but now we see messages like this:

    May 25 05:35:58 2025 87712 3 10.1 checkGPUStatus: Fail to get GPU healthy status: API version mismatch 

    and this (which might not be related, though):

    May 25 05:35:59 2025 87712 3 10.1 controlDevicesForJobCgroup: The bpf processing for job <1234567> doesn't exist any more. 

    In the bjobs -l output for the job, we also get this one:

    PENDING REASONS:
     Failed to send fan-out information to other SBDs; 

    Note: the job is a multi-node GPU job! 

    Any hints?  Thanks!



    ------------------------------
    Bernd Dammann
    ------------------------------


  • 2.  RE: LSF and DCGM support matrix?

    Posted 25 days ago

    As far as I remember, DCGM v3.18 was previously certified. The bpf error is probably related to DCGM v4 support.



    ------------------------------
    YI SUN
    ------------------------------



  • 3.  RE: LSF and DCGM support matrix?

    Posted 25 days ago

    BTW, what is your CUDA driver version? 
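
If it helps when reporting back, here is a minimal sketch for collecting the relevant versions on one of the GPU nodes. It assumes `nvidia-smi`, `dcgmi`, and `lsid` are on the PATH of the node where it runs; the function name `collect_versions` and the dictionary keys are only illustrative, and any tool that is missing is reported rather than treated as an error.

```python
import shutil
import subprocess

# Assumption: these tools are installed and on PATH on the GPU node.
# Adjust the commands to match your site's installation.
VERSION_COMMANDS = {
    "nvidia_driver": ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    "dcgm": ["dcgmi", "--version"],
    "lsf": ["lsid", "-V"],
}


def collect_versions(commands=VERSION_COMMANDS, timeout=10):
    """Run each version command and return a dict of name -> output text."""
    report = {}
    for name, argv in commands.items():
        # Skip tools that are not installed instead of raising.
        if shutil.which(argv[0]) is None:
            report[name] = "not found on PATH"
            continue
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[name] = f"failed: {exc}"
            continue
        # Some tools print their version to stderr, so keep both streams.
        output = (proc.stdout + proc.stderr).strip()
        report[name] = output or f"no output (exit code {proc.returncode})"
    return report


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running it on both an affected and an unaffected node (if there is one) and posting the output side by side would make it easier to see whether the "API version mismatch" lines up with a particular driver or DCGM version.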



    ------------------------------
    YI SUN
    ------------------------------