High Performance Computing Group

  • 1.  LSF and AMD GPUs

    Posted Fri August 25, 2023 08:55 AM

    Hi,

    We added a new host with two AMD GPUs to our cluster, but LSF does not detect them. Does LSF support AMD GPUs at all? We can find a lot of references to AMD GPUs in the documentation for bsub, etc., but nothing on how to configure LSF to support or detect them. We use

    LSB_GPU_NEW_SYNTAX=extend
    LSF_GPU_AUTOCONFIG=Y

    in lsf.conf. 

    Any hints on what we are missing?



    ------------------------------
    Bernd Dammann
    ------------------------------


  • 2.  RE: LSF and AMD GPUs

    Posted Sun August 27, 2023 12:17 PM

    What version of LSF is in use? E.g. run lim -V and sbatchd -V.



    ------------------------------
    YI SUN
    ------------------------------



  • 3.  RE: LSF and AMD GPUs

    Posted Sun August 27, 2023 12:24 PM

    We run the latest fix pack (14):

    $ lim -V
    EGO 3.4.0 build 601547, April 20 2023

    $ sbatchd -V
    IBM Spectrum LSF 10.1.0.0 build 601547, April 20 2023

    binary type: linux3.10-glibc2.17-x86_64



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 4.  RE: LSF and AMD GPUs

    Posted Mon August 28, 2023 05:18 PM

    Add LSF_GPU_RESOURCE_IGNORE=Y to lsf.conf, then restart the cluster. Also make sure the ROCm SMI library is installed on the AMD GPU node.



    ------------------------------
    YI SUN
    ------------------------------
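
    A quick pre-flight check on the GPU host can rule out the obvious before touching the LSF configuration. The sketch below only tests whether a command is on the PATH; check_tool is a hypothetical helper name, and finding the rocm-smi CLI is only a hint that the ROCm SMI library is installed, not proof that LSF can load it.

```shell
#!/bin/sh
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper, not part of LSF or ROCm.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf 'found: %s (%s)\n' "$1" "$(command -v "$1")"
    else
        printf 'missing: %s\n' "$1"
        return 1
    fi
}

# Example, to be run on the AMD GPU host after sourcing the LSF profile:
# check_tool rocm-smi && rocm-smi     # does ROCm itself see both GPUs?
# check_tool lshosts && lshosts -gpu  # does LSF report them?
```

    If rocm-smi lists the GPUs but lshosts -gpu does not, the problem is more likely on the LSF side than in the ROCm installation.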



  • 5.  RE: LSF and AMD GPUs

    Posted Tue August 29, 2023 02:45 AM

    Thanks! We will try adding this to lsf.conf. The ROCm library is installed.

    BTW, the documentation for LSF_GPU_RESOURCE_IGNORE is not very clear: it says "Default: Y", which made us believe the parameter is set to 'Y' when it is absent from lsf.conf, which is obviously not true!



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 6.  RE: LSF and AMD GPUs

    Posted Tue August 29, 2023 02:52 AM

    BTW, this 'default' behavior is also mentioned in the release notes of FixPack 13:

    ---

    Starting in IBM Spectrum LSF Version 10.1 Fix Pack 13, the default values of the following three parameters are changed to:

    LSF_GPU_AUTOCONFIG=Y
    LSB_GPU_NEW_SYNTAX=extend
    LSF_GPU_RESOURCE_IGNORE=Y

    If you have fix pack 13 installed, no further action is needed to set these parameters. 

    ---

    Is there a way to check which values of those parameters are actually active?



    ------------------------------
    Bernd Dammann
    ------------------------------
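
    One way to see what is explicitly set is to read lsf.conf directly. The sketch below assumes plain KEY=VALUE lines; lsf_conf_get is a hypothetical helper name. It only shows what the file contains, not what the daemons have loaded. For the running configuration, LSF also ships lsadmin showconf lim and badmin showconf mbd, which should be the more reliable source.

```shell
#!/bin/sh
# Print the value of one parameter from an lsf.conf-style file.
# Usage: lsf_conf_get FILE PARAM
# Prints "(not set)" when the parameter is absent or commented out.
# lsf_conf_get is a hypothetical helper, not an LSF command.
lsf_conf_get() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        index($0, "=") > 0 {
            name = substr($0, 1, index($0, "=") - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the parameter name
            if (name == key) {
                val = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val) # trim the value
                found = 1                        # last match wins
            }
        }
        END { print (found ? val : "(not set)") }
    ' "$1"
}

# Example; assumes LSF_ENVDIR points at the directory holding lsf.conf:
# for p in LSF_GPU_AUTOCONFIG LSB_GPU_NEW_SYNTAX LSF_GPU_RESOURCE_IGNORE; do
#     printf '%s=%s\n' "$p" "$(lsf_conf_get "$LSF_ENVDIR/lsf.conf" "$p")"
# done
```

    A parameter reported as "(not set)" here falls back to whatever default the installed fix pack applies, so comparing this output with the showconf output shows which values come from the file and which are defaults.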



  • 7.  RE: LSF and AMD GPUs

    Posted Tue August 29, 2023 09:13 AM

    To be on the safe side, we added the parameter to lsf.conf and restarted the cluster. Still no success!

    Is there any guide or documentation on what one should be aware of when using AMD GPUs? How does LSF find them, and where does it look for them?



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 8.  RE: LSF and AMD GPUs

    Posted Tue August 29, 2023 10:55 AM

    You are right, those parameters are now enabled by default. So I guess lshosts -gpu doesn't show anything about the AMD GPUs in your cluster. You should probably contact LSF support to debug the LSF lim service. Is there any indication in the lim log?



    ------------------------------
    YI SUN
    ------------------------------
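
    To look for hints in the lim log, a simple keyword scan is a reasonable first pass. This is only a sketch: scan_gpu_log is a hypothetical helper name, and the keywords are a guess, not a list of messages LIM is known to emit.

```shell
#!/bin/sh
# Print GPU-related lines (with line numbers) from a log file.
# scan_gpu_log is a hypothetical helper; the keyword pattern is an assumption.
scan_gpu_log() {
    if [ ! -r "$1" ]; then
        printf 'cannot read %s\n' "$1" >&2
        return 2
    fi
    grep -n -i -E 'gpu|rocm' "$1" || printf 'no GPU-related lines in %s\n' "$1"
}

# Example; the path is a placeholder. The log directory is the one named by
# LSF_LOGDIR in lsf.conf, and LIM logs are usually named lim.log.<host>:
# scan_gpu_log "/path/to/lsf/log/lim.log.$(hostname)"
```

    If nothing shows up at the default log level, LSF support may ask for a higher debug level on the lim service before the scan is repeated.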