IBM Spectrum Computing Group

LSF fails to detect GPUs on some IBM Power nodes

  • 1.  LSF fails to detect GPUs on some IBM Power nodes

    Posted Thu October 29, 2020 10:37 AM
    Dear All
    Spectrum LSF 10.1.0.9 was installed on several IBM Power and Dell servers, some with GPUs and some without. Of the servers with GPUs, four are IBM 8335-GTH and five are Dell C4140. After the installation completed successfully, "lsload -gpu" and "bhosts -gpu" detected and displayed the GPUs on seven servers but not on two of the IBM Power nodes. To rule out an installation error, I uninstalled LSF, changed the master nodes, and ran the installation again; the results were the same. Is there anything I need to check or install on those two IBM Power nodes so that LSF can detect their GPUs? On both nodes, nvidia-smi shows the GPU information correctly. Thanks!

    Regards
    Xinhuai


    ------------------------------
    Xinhuai Zhang
    ------------------------------


  • 2.  RE: LSF fails to detect GPUs on some IBM Power nodes

    Posted Thu October 29, 2020 11:10 AM
    This may not be related to the installation. Do the problem nodes have the same OS and GPU setup as the working nodes?

    I suggest contacting Support for further investigation. It helps to have the following data ready for Support, collected on both the working and the problem nodes:
    • nvidia-smi
    • nvidia-smi topo -m
    • lstopo-no-graphics
    • LSF LIM debug log: lsadmin limdebug -c "LC_TRACE LC2_TOPOLOGY" -l 0
    As I recall, a GPU-related elim program ships with LSF; you can run that elim to see whether it detects the GPUs properly.
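
The data listed above could be gathered with a small script along these lines, run once on a working node and once on a problem node. This is only a sketch: the OUT_DIR variable, the collect helper, and the output file names are illustrative and not part of LSF. The lsload and bhosts calls are added for convenience, and the LIM debug command is left out because it changes daemon logging.

```shell
#!/bin/sh
# Collect GPU diagnostics on one node so that working and problem nodes
# can be compared. OUT_DIR and the file names are illustrative choices.
OUT_DIR="${OUT_DIR:-./gpu-diag-$(uname -n)}"
mkdir -p "$OUT_DIR"

# Run a command if it is installed and save its output; otherwise record
# that the command is missing. The first argument is the output file stem.
collect() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$OUT_DIR/$name.txt" 2>&1 || echo "exit status: $?" >>"$OUT_DIR/$name.txt"
    else
        echo "$1: command not found" >"$OUT_DIR/$name.txt"
    fi
}

collect nvidia-smi      nvidia-smi
collect nvidia-smi-topo nvidia-smi topo -m
collect lstopo          lstopo-no-graphics
collect lsload-gpu      lsload -gpu
collect bhosts-gpu      bhosts -gpu

echo "Diagnostics written to $OUT_DIR"
```

Comparing the two resulting directories (for example with diff -r) may show where the problem node differs from a working one.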

    Yi Sun

    ------------------------------
    YI SUN
    ------------------------------



  • 3.  RE: LSF fails to detect GPUs on some IBM Power nodes

    Posted Thu October 29, 2020 03:10 PM
    You may also try resetting the GPUs and then restarting the LSF services to see if that helps: nvidia-smi --gpu-reset.

    Or reboot the problem nodes.

    One more thing: is DCGM running on the problem nodes? If so, which version of DCGM is it?
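
One way to answer the DCGM question on a node is sketched below. It assumes that the DCGM daemon runs as a process named nv-hostengine and that DCGM was installed from an RPM package named datacenter-gpu-manager; both may differ depending on how DCGM was installed. The DCGM_STATE variable is just a name used in this sketch.

```shell
#!/bin/sh
# Report whether the DCGM host engine is running on this node.
# Assumption: the DCGM daemon process is named nv-hostengine.
if pgrep -x nv-hostengine >/dev/null 2>&1; then
    DCGM_STATE="running"
else
    DCGM_STATE="not running"
fi
echo "DCGM host engine: $DCGM_STATE"

# Version lookup. Assumption: an RPM-based distribution and the package
# name datacenter-gpu-manager; adjust for other install methods.
if command -v rpm >/dev/null 2>&1; then
    rpm -q datacenter-gpu-manager || true
fi
```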

    ------------------------------
    YI SUN
    ------------------------------



  • 4.  RE: LSF fails to detect GPUs on some IBM Power nodes

    Posted Thu October 29, 2020 11:28 PM
    Thanks a lot, Yi Sun! 
    The GPU reset solved the problem. After running "nvidia-smi --gpu-reset" and lsfrestart, the GPUs are detected and displayed.
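
For reference, the sequence that resolved the issue can be sketched as follows. The DRY_RUN switch and the run helper are hypothetical additions for safety in this sketch: by default the commands are only printed. A GPU reset generally requires root and no processes using the GPUs, and lsfrestart restarts the LSF daemons and may prompt for confirmation, so review the commands before setting DRY_RUN=0.

```shell
#!/bin/sh
# Reset the GPUs, restart LSF, then check GPU detection again.
# DRY_RUN is a hypothetical safety switch for this sketch: with the
# default value 1 the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run nvidia-smi --gpu-reset   # reset the GPUs on this node
run lsfrestart               # restart the LSF daemons
run lsload -gpu              # the node should now report its GPUs
run bhosts -gpu
```

It can take a short while after the restart before the GPU information shows up in lsload and bhosts.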

    Regards
    Xinhuai

    ------------------------------
    Xinhuai Zhang
    ------------------------------