High Performance Computing Group

  • 1.  How to run pytorch ddp job on multi-nodes

    Posted Thu June 01, 2023 08:55 AM

    Hi,
       I am a user of IBM Spectrum LSF 10.1. When I try to submit a pytorch ddp job with 16 GPUs on 2 nodes,  I find it doesn't work.  LSF will allocate 2 nodes with 16 GPUs to the job, but the job doesn't run correctly. It will block until I run the command "torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py" on the other node. It seems lsf will allocate the resources but run torchrun command only on one node, so the job blocks. Is there any good solution to run pytorch ddp job on multi-nodes with multi-GPUs?  

    The script is shown as follows:

    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q zhangml
    #BSUB  -gpu "mode=exclusive_process:aff=yes"
    #BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"

    module load anaconda3
    source activate py1.10

    torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py



    ------------------------------
    zheng fa
    ------------------------------


  • 2.  RE: How to run pytorch ddp job on multi-nodes
    Best Answer

    Posted Fri June 02, 2023 03:17 AM

    torchrun needs to be executed on all hosts (as you found out), so we use something like 

    blaunch -z "$HOSTLIST" torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./elastic_ddp.py

    where $HOSTLIST contains the names of the nodes involved and can be constructed from LSB_AFFINITY_HOSTFILE.  $HOSTNAME will be set to the master node, since the script executes there, so that is sufficient to make the rendezvous work.  To make it fully flexible, you can also extract the values for '--nnodes' and '--nproc_per_node' from LSB_AFFINITY_HOSTFILE.
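    One possible way to derive those values, assuming the usual LSB_AFFINITY_HOSTFILE layout of one line per task slot with the host name in the first field (the function name get_ddp_layout is just for illustration):

    ```shell
    # Derive torchrun's arguments from LSB_AFFINITY_HOSTFILE, which LSF writes
    # with one line per task slot; the first field on each line is the host name.
    get_ddp_layout() {
        # unique host names, space separated
        HOSTLIST=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort -u | tr '\n' ' ')
        # number of unique hosts
        NNODES=$(echo $HOSTLIST | wc -w)
        # slots on the first host = processes per node (assumes a homogeneous allocation)
        NPROC_PER_NODE=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort | uniq -c | awk 'NR==1{print $1}')
    }

    # In the job script, after the #BSUB lines, something like:
    #   get_ddp_layout
    #   blaunch -z "$HOSTLIST" torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE \
    #       --rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./elastic_ddp.py
    ```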



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 3.  RE: How to run pytorch ddp job on multi-nodes

    Posted Mon June 12, 2023 10:28 PM

    Thanks, this has been a great help!



    ------------------------------
    zheng fa
    ------------------------------



  • 4.  RE: How to run pytorch ddp job on multi-nodes

    Posted Tue July 25, 2023 01:26 PM

    Hi. Could you give me a whole demo of this script?



    ------------------------------
    kaijie shi
    ------------------------------



  • 5.  RE: How to run pytorch ddp job on multi-nodes

    Posted Tue July 25, 2023 01:28 PM

    Hi, how can I get $HOSTLIST and $HOSTNAME?



    ------------------------------
    kaijie shi
    ------------------------------



  • 6.  RE: How to run pytorch ddp job on multi-nodes

    Posted Fri June 02, 2023 12:35 PM

    Hi Zheng Fa,

    >When I try to submit a pytorch ddp job with 16 GPUs on 2 nodes,  I find it doesn't work. 

    #BSUB  -gpu "mode=exclusive_process:aff=yes"
    #BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"

    Replace the two lines above with the lines below: the first requests 16 slots (-n 16) with 8 slots per host (span[ptile=8]), and the second requests 8 GPUs per host.

    #BSUB  -n 16 -R "span[ptile=8]" 
    #BSUB -gpu  "num=8:mode=exclusive_process:aff=yes"

    You will need LSF 10.1 Fix Pack 6 or later, with LSB_GPU_NEW_SYNTAX=extend set in lsf.conf.   Additionally, check out the URL below if you want to request the number of GPUs per task (for example, -gpu "num=1/task") instead of per host:

    https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=o-gpu
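    For those asking for a complete demo: combining this resource request with the blaunch/torchrun approach from the earlier reply, a full job script might look like the sketch below. The queue name, conda environment, and elastic_ddp.py come from the original post; the hostfile parsing is one way to do it, not the only one, and it assumes one hostfile line per task slot.

    ```shell
    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q zhangml
    #BSUB -n 16 -R "span[ptile=8]"
    #BSUB -gpu "num=8:mode=exclusive_process:aff=yes"

    module load anaconda3
    source activate py1.10

    # One line per task slot in LSB_AFFINITY_HOSTFILE; field 1 is the host name.
    HOSTLIST=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort -u | tr '\n' ' ')
    NNODES=$(echo $HOSTLIST | wc -w)
    NPROC_PER_NODE=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort | uniq -c | awk 'NR==1{print $1}')

    # blaunch starts torchrun on every allocated host; the rendezvous endpoint
    # is the first execution host, where this script itself runs.
    blaunch -z "$HOSTLIST" torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE \
        --rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./elastic_ddp.py
    ```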



    ------------------------------
    John Welch
    ------------------------------



  • 7.  RE: How to run pytorch ddp job on multi-nodes

    Posted Mon June 12, 2023 10:29 PM

    Thanks, I will have a try.



    ------------------------------
    zheng fa
    ------------------------------