We added a new host with two AMD GPUs to our cluster, but LSF does not detect them. Does LSF support AMD GPUs at all? The documentation for bsub and other commands refers to AMD GPUs in many places, but we found nothing on how to configure LSF to support or detect them. We use
Any hints on what we are missing?
What version of LSF is in use? E.g. run lim -V and sbatchd -V.
We run the latest fix pack (14):
$ lim -V
EGO 3.4.0 build 601547, April 20 2023
$ sbatchd -V
IBM Spectrum LSF 10.1.0.0 build 601547, April 20 2023
binary type: linux3.10-glibc2.17-x86_64
Add LSF_GPU_RESOURCE_IGNORE=Y to lsf.conf, then restart the cluster. Also make sure the ROCm SMI library is installed on the AMD GPU node.
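A quick way to confirm the ROCm SMI part on the GPU node might look like the sketch below. It assumes ROCm is installed under /opt/rocm (the usual default) and that the library is named librocm_smi64.so; check_rocm_smi is a hypothetical helper, not an LSF or ROCm command. The thread does not say where LSF itself searches for the library, so this only shows that the files exist.

```shell
#!/bin/sh
# Sketch: report whether the ROCm SMI CLI and shared library look present.
# Assumption: ROCm root is passed as $1 (typically /opt/rocm).
check_rocm_smi() {
    root="$1"
    status=0

    # The rocm-smi CLI, either on PATH or under the ROCm tree.
    if command -v rocm-smi >/dev/null 2>&1 || [ -x "$root/bin/rocm-smi" ]; then
        echo "rocm-smi CLI: found"
    else
        echo "rocm-smi CLI: not found"
        status=1
    fi

    # The shared library, under lib or lib64 of the ROCm tree.
    if ls "$root"/lib*/librocm_smi64.so* >/dev/null 2>&1; then
        echo "librocm_smi64: found"
    else
        echo "librocm_smi64: not found under $root"
        status=1
    fi

    # Zero only when both the CLI and the library were found.
    return "$status"
}
```

Run it as `check_rocm_smi /opt/rocm` on the AMD GPU host. If both are found, running rocm-smi by hand should also list the two GPUs; if it does not, the problem is below LSF.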
Thanks! We will try adding this to lsf.conf. The ROCm library is installed.
BTW, the documentation for LSF_GPU_RESOURCE_IGNORE is not very clear: it says "Default: Y", which led us to believe the parameter is treated as 'Y' when it is absent from lsf.conf, which is obviously not true!
This 'default' behavior is also mentioned in the release notes of Fix Pack 13:
Starting in IBM Spectrum LSF Version 10.1 Fix Pack 13, the default values of the following three parameters are changed to:
If you have fix pack 13 installed, no further action is needed to set these parameters.
Is there a way to check those active parameters?
To be on the safe side, we added the parameter to lsf.conf and restarted the cluster. Still no success!
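On checking the parameters: `lsadmin showconf lim` is the usual way to see what the running LIM has loaded, though it is worth verifying whether it also lists values that come from built-in defaults. To see what is explicitly set in the file, a small sketch like the following can help. lsf_conf_value is a hypothetical helper, and taking the last assignment when a parameter appears more than once is an assumption about how duplicates are handled.

```shell
#!/bin/sh
# Sketch: print the explicit value of one parameter in an lsf.conf file.
# Prints "<unset>" when there is no uncommented assignment, in which case
# whatever built-in default LSF uses would apply.
# Assumption: if the parameter appears more than once, the last one counts.
lsf_conf_value() {
    param="$1"
    conf="$2"

    # Match "PARAM = value", ignore commented lines, trim surrounding spaces.
    value=$(sed -n "s/^[[:space:]]*${param}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$conf" | tail -n 1)

    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "<unset>"
    fi
}
```

For example, `lsf_conf_value LSF_GPU_RESOURCE_IGNORE "$LSF_ENVDIR/lsf.conf"` prints the explicit setting or `<unset>`.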
Is there any guide or documentation on what to be aware of when using AMD GPUs? How does LSF find them, and where does it look for them?
You are right, those parameters are now enabled by default. So I guess lshosts -gpu doesn't show anything about the AMD GPUs in your cluster. You should probably contact LSF support to debug the LSF lim service. Is there any indication in the lim log?
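To look through the lim log for hints, something like this sketch could be a starting point. It assumes the log is at $LSF_LOGDIR/lim.log.<hostname>, the usual naming, and the search keywords are guesses, since the thread does not show what lim actually logs about GPU detection. lim_gpu_lines is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: show the most recent GPU-related lines from a LIM log file.
# $1 = path to the log, $2 = how many lines to show (default 20).
# Assumption: relevant messages mention "gpu" or "rocm" (case-insensitive).
lim_gpu_lines() {
    log="$1"
    grep -i -E 'gpu|rocm' "$log" | tail -n "${2:-20}"
}
```

On the GPU host this would be called as `lim_gpu_lines "$LSF_LOGDIR/lim.log.$(hostname)"`. Empty output means no line in the log contains either keyword.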