How to run a PyTorch DDP job on multiple nodes

    Posted Thu June 01, 2023 08:56 AM

    Hi,

    I am a user of IBM Spectrum LSF 10.1. When I submit a PyTorch DDP job that needs 16 GPUs across 2 nodes, it does not work. LSF allocates 2 nodes with 16 GPUs to the job, but the job does not run correctly: it blocks until I manually run the command "torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py" on the other node. It seems that LSF allocates the resources but runs the torchrun command on only one node, so the job blocks waiting for the second node to join. Is there a good way to run a PyTorch DDP job on multiple nodes with multiple GPUs?

    The script is shown as follows:

    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q zhangml
    #BSUB -gpu "mode=exclusive_process:aff=yes"
    #BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"

    module load anaconda3
    source activate py1.10

    torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py
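One direction that might address the blocking, sketched here under assumptions that are not confirmed in this thread: the job script itself runs only on the first execution host, so torchrun would need to be started on every allocated host, for example with LSF's blaunch. The sketch assumes that LSF exports LSB_DJOB_HOSTFILE (one line per allocated slot) inside the job, that blaunch is on PATH, and that the conda environment is also usable on the remote host. The helper names unique_hosts and torchrun_cmd are hypothetical and exist only for this illustration.

```shell
#!/bin/bash
# Sketch only, not a verified solution for this cluster.
# Assumptions: LSB_DJOB_HOSTFILE is set by LSF, blaunch is available,
# and the Python environment is reachable on every allocated host.

# Print each allocated host once, in first-seen order, on a single line.
unique_hosts() {
    awk '!seen[$0]++' "$1" | paste -sd' ' -
}

# Build the torchrun command line for a rendezvous host and node count.
# The remaining flags are copied from the original job script.
torchrun_cmd() {
    local master="$1" nnodes="$2"
    echo "torchrun --nnodes=${nnodes} --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${master}:29400 ./elastic_ddp.py"
}

# Launch only when running inside an LSF job with blaunch present.
if [ -n "${LSB_DJOB_HOSTFILE:-}" ] && command -v blaunch >/dev/null 2>&1; then
    hosts=$(unique_hosts "$LSB_DJOB_HOSTFILE")
    master=${hosts%% *}                               # first host acts as rendezvous endpoint
    nnodes=$(echo "$hosts" | wc -w | tr -d ' ')       # number of distinct hosts
    # blaunch -z takes a space-separated host list and runs the command on each host.
    # The command substitution is left unquoted on purpose so it splits into arguments.
    blaunch -z "$hosts" $(torchrun_cmd "$master" "$nnodes")
fi
```

Deriving the rendezvous host from the allocation, instead of hardcoding 172.22.4.33, is a design choice of this sketch: it keeps the endpoint valid when LSF places the job on different nodes.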



    ------------------------------
    zheng fa
    ------------------------------