Thanks, I will give it a try.
Original Message:
Sent: Fri June 02, 2023 12:34 PM
From: John Welch
Subject: How to run pytorch ddp job on multi-nodes
Hi Zheng Fa,
>When I try to submit a pytorch ddp job with 16 GPUs on 2 nodes, I find it doesn't work.
#BSUB -gpu "mode=exclusive_process:aff=yes"
#BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"
Replace the two lines above with the two lines below: the first requests 16 slots (-n 16) with 8 slots per host (span[ptile=8]), and the second requests 8 GPUs per host.
#BSUB -n 16 -R "span[ptile=8]"
#BSUB -gpu "num=8:mode=exclusive_process:aff=yes"
You will need LSF 10.1 Fix Pack 6 or later, with LSB_GPU_NEW_SYNTAX=extend set in lsf.conf. Additionally, see the URL below if you want to request GPUs per task (for example, -gpu "num=1/task") instead of per host:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=o-gpu
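For reference, below is a minimal sketch of how the full submission script could look with these directives. One thing the directives alone don't solve is that torchrun still has to be started on every allocated host; a common way to do that under LSF is blaunch, which runs a command on the hosts of the allocation. The sketch derives the rendezvous endpoint from the first host in LSB_HOSTS instead of hard-coding an IP. The blaunch-based launch and the host deduplication are my assumptions, not something from this thread, so adjust them to your site's setup:

#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q zhangml
#BSUB -n 16 -R "span[ptile=8]"
#BSUB -gpu "num=8:mode=exclusive_process:aff=yes"

module load anaconda3
source activate py1.10

# Use the first allocated host as the rendezvous endpoint instead of a
# hard-coded IP. LSB_HOSTS lists one entry per slot, so the first field
# is the first execution host.
MASTER_HOST=$(echo $LSB_HOSTS | awk '{print $1}')

# LSB_HOSTS repeats each host once per slot; deduplicate it so blaunch
# starts exactly one torchrun per host (assumption: 2 hosts, 8 slots each).
UNIQUE_HOSTS=$(echo $LSB_HOSTS | tr ' ' '\n' | sort -u | tr '\n' ' ')

# blaunch runs the given command on each host listed after -z.
blaunch -z "$UNIQUE_HOSTS" torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=100 --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_HOST}:29400 ./elastic_ddp.py

With the c10d rendezvous backend, the two torchrun instances should then find each other through MASTER_HOST:29400 regardless of which hosts LSF happens to allocate.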
------------------------------
John Welch
Original Message:
Sent: Thu June 01, 2023 03:47 AM
From: zheng fa
Subject: How to run pytorch ddp job on multi-nodes
Hi,
I am a user of IBM Spectrum LSF 10.1. When I try to submit a PyTorch DDP job with 16 GPUs on 2 nodes, it doesn't work. LSF allocates 2 nodes with 16 GPUs to the job, but the job doesn't run correctly: it blocks until I manually run the command "torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py" on the other node. It seems LSF allocates the resources but runs the torchrun command on only one node, so the job blocks. Is there a good way to run a PyTorch DDP job on multiple nodes with multiple GPUs?
The script is shown as follows:
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q zhangml
#BSUB -gpu "mode=exclusive_process:aff=yes"
#BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"
module load anaconda3
source activate py1.10
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py
------------------------------
zheng fa
------------------------------