Hi,
I am a user of IBM Spectrum LSF 10.1. When I submit a PyTorch DDP job that needs 16 GPUs across 2 nodes, it does not run correctly. LSF allocates the 2 nodes with 16 GPUs, but the job blocks until I manually run the command "torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py" on the other node. It seems LSF allocates the resources but executes the torchrun command only on the first execution host, so the c10d rendezvous never completes and the job blocks. Is there a good way to run a PyTorch DDP job on multiple nodes with multiple GPUs?
The submission script is as follows:
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q zhangml
#BSUB -gpu "mode=exclusive_process:aff=yes"
#BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"
module load anaconda3
source activate py1.10
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py
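One possible approach (a sketch, not an official recipe; details depend on your cluster setup) is to let LSF itself start torchrun on every allocated host via blaunch, instead of running it only on the first execution host. The host list can be derived from the LSB_MCPU_HOSTS environment variable ("hostA 32 hostB 32" format), and the first host can serve as the rendezvous endpoint instead of a hard-coded IP. The hostname parsing and the use of blaunch -z here are assumptions you should verify against your LSF version:

```shell
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q zhangml
#BSUB -gpu "mode=exclusive_process:aff=yes"
#BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"
module load anaconda3
source activate py1.10

# LSB_MCPU_HOSTS looks like "hostA 32 hostB 32"; keep every other field
# to get the unique host names (assumption: one entry per host).
HOSTS=$(echo "$LSB_MCPU_HOSTS" | awk '{for (i = 1; i <= NF; i += 2) printf "%s ", $i}')

# Use the first allocated host as the rendezvous endpoint instead of a
# hard-coded IP, and the job ID as the rendezvous ID.
HEAD_NODE=$(echo "$HOSTS" | awk '{print $1}')

# blaunch -z runs one copy of the command on each listed host, so torchrun
# starts on both nodes and the c10d rendezvous can complete.
blaunch -z "$HOSTS" torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id="$LSB_JOBID" --rdzv_backend=c10d \
    --rdzv_endpoint="${HEAD_NODE}:29400" ./elastic_ddp.py
```

Without -z, blaunch may start one task per slot rather than one per host, which would launch far too many torchrun instances, so the per-host host list matters here.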
------------------------------
zheng fa
------------------------------