You can also try resetting the GPU (nvidia-smi --gpu-reset) and then restarting the LSF services to see if that helps.
Or reboot the problem nodes.
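The reset-and-restart suggestion above could be sketched roughly as follows. This is only an illustration: the helper names (run, reset_gpu_and_restart_lsf) and the DRY_RUN switch are made up for this example, and reading "restart LSF services" as lsadmin limrestart, lsadmin resrestart and badmin hrestart on the local host is an assumption. A GPU reset normally needs root and a GPU with no running processes, and the LSF commands may prompt for confirmation, so the sketch only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. Helper names and the DRY_RUN variable are hypothetical.

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# $1 = GPU index as shown by nvidia-smi (defaults to 0).
reset_gpu_and_restart_lsf() {
    gpu_index="${1:-0}"
    # Reset one GPU; it must be idle and the caller usually must be root.
    run nvidia-smi --gpu-reset -i "$gpu_index"
    # Assumed meaning of "restart LSF services" on this host:
    run lsadmin limrestart   # restart LIM
    run lsadmin resrestart   # restart RES
    run badmin hrestart      # restart sbatchd
}
```

For example, running reset_gpu_and_restart_lsf 0 on a problem node prints the four commands for review; set DRY_RUN=0 as an LSF administrator to actually execute them.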
One more thing: is DCGM running on the problem nodes? If so, which version of DCGM?
------------------------------
YI SUN
------------------------------
Original Message:
Sent: Thu October 29, 2020 11:09 AM
From: YI SUN
Subject: LSF fails to detect GPUs on some IBM Power nodes
This may not be related to the installation. Do the problem nodes have the same OS and GPU setup as the working nodes?
I suggest contacting Support for further investigation. It would help to have the following data ready from both the working and the problem nodes:
- nvidia-smi
- nvidia-smi topo -m
- lstopo-no-graphics
- LSF lim debug log, lsadmin limdebug -c "LC_TRACE LC2_TOPOLOGY" -l 0
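A small helper along these lines could gather the first three items into one directory per node, so that working and problem nodes can be compared side by side. It is a sketch under assumptions: the function names, the default directory name, and the output file names are invented for illustration. The lsadmin limdebug step is left out on purpose because it changes the running LIM's debug settings and is better run by hand.

```shell
#!/bin/sh
# Sketch only. Function, directory, and file names are hypothetical.

# Run a command and save its output; note it if the command is missing.
# $1 = output file, remaining arguments = command to run.
collect() {
    out="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$out" 2>&1 || echo "EXIT STATUS: $?" >>"$out"
    else
        echo "SKIPPED: $1 not found in PATH" >"$out"
    fi
}

# $1 = target directory (defaults to ./lsf_gpu_diag_<hostname>).
collect_gpu_diag() {
    dir="${1:-./lsf_gpu_diag_$(hostname)}"
    mkdir -p "$dir" || return 1
    collect "$dir/nvidia-smi.txt"      nvidia-smi
    collect "$dir/nvidia-smi-topo.txt" nvidia-smi topo -m
    collect "$dir/lstopo.txt"          lstopo-no-graphics
    echo "$dir"   # print where the files were written
}
```

Run collect_gpu_diag on one working node and one problem node, then diff the two directories before sending them to Support.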
As I recall, LSF ships with a GPU-related elim program; you can run that elim to see whether it detects the GPUs properly.
Yi Sun
------------------------------
YI SUN
Original Message:
Sent: Wed October 28, 2020 11:51 AM
From: Xinhuai Zhang
Subject: LSF fails to detect GPUs on some IBM Power nodes
Dear All
IBM Spectrum LSF 10.1.0.9 was installed on a few IBM Power servers and Dell servers, some with GPUs and some without. Of the servers with GPUs, four are IBM 8335-GTH and five are Dell C4140. After the installation completed successfully, "lsload -gpu" and "bhosts -gpu" detected and displayed the GPUs on seven servers, but failed to do so on two IBM Power nodes. To rule out an installation error, I uninstalled LSF, changed the master nodes, and ran the installation again; the results were the same. Is there anything I need to check or install on those two IBM Power nodes so that LSF can detect the GPUs installed there? On those two nodes, nvidia-smi shows the GPU information correctly. Thanks!
Regards
Xinhuai
------------------------------
Xinhuai Zhang
------------------------------
#SpectrumComputingGroup