Hi,
I am a user of IBM Spectrum LSF 10.1. When I submit a PyTorch DDP job that needs 16 GPUs across 2 nodes, it does not run correctly. LSF allocates the 2 nodes with 16 GPUs, but the job blocks until I manually run the command "torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py" on the other node. It seems LSF allocates the resources but executes the torchrun command only on the first execution host, so the c10d rendezvous never completes and the job blocks. Is there a good way to run a PyTorch DDP job on multiple nodes with multiple GPUs?
The submission script is as follows:
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q zhangml
#BSUB -gpu "mode=exclusive_process:aff=yes"
#BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"
module load anaconda3
source activate py1.10
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py
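One possible approach (a sketch, not an official recipe; details depend on your cluster setup) is to let LSF itself start torchrun on every allocated host via blaunch, instead of running it only on the first execution host. The host list can be derived from the LSB_MCPU_HOSTS environment variable ("hostA 32 hostB 32" format), and the first host can serve as the rendezvous endpoint instead of a hard-coded IP. The hostname parsing and the use of blaunch -z here are assumptions you should verify against your LSF version:

```shell
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q zhangml
#BSUB -gpu "mode=exclusive_process:aff=yes"
#BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"
module load anaconda3
source activate py1.10

# LSB_MCPU_HOSTS looks like "hostA 32 hostB 32"; keep every other field
# to get the unique host names (assumption: one entry per host).
HOSTS=$(echo "$LSB_MCPU_HOSTS" | awk '{for (i = 1; i <= NF; i += 2) printf "%s ", $i}')

# Use the first allocated host as the rendezvous endpoint instead of a
# hard-coded IP, and the job ID as the rendezvous ID.
HEAD_NODE=$(echo "$HOSTS" | awk '{print $1}')

# blaunch -z runs one copy of the command on each listed host, so torchrun
# starts on both nodes and the c10d rendezvous can complete.
blaunch -z "$HOSTS" torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id="$LSB_JOBID" --rdzv_backend=c10d \
    --rdzv_endpoint="${HEAD_NODE}:29400" ./elastic_ddp.py
```

Without -z, blaunch may start one task per slot rather than one per host, which would launch far too many torchrun instances, so the per-host host list matters here.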
------------------------------
zheng fa
------------------------------