High Performance Computing Group

  • 1.  How to run pytorch ddp job on multi-nodes

    Posted Thu June 01, 2023 08:55 AM

    Hi,
       I am a user of IBM Spectrum LSF 10.1. When I try to submit a pytorch ddp job with 16 GPUs on 2 nodes,  I find it doesn't work.  LSF will allocate 2 nodes with 16 GPUs to the job, but the job doesn't run correctly. It will block until I run the command "torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py" on the other node. It seems lsf will allocate the resources but run torchrun command only on one node, so the job blocks. Is there any good solution to run pytorch ddp job on multi-nodes with multi-GPUs?  

    The script is shown as follows:

    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q zhangml
    #BSUB  -gpu "mode=exclusive_process:aff=yes"
    #BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"

    module load anaconda3
    source activate py1.10

    torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=172.22.4.33:29400 ./elastic_ddp.py



    ------------------------------
    zheng fa
    ------------------------------


  • 2.  RE: How to run pytorch ddp job on multi-nodes
    Best Answer

    Posted Fri June 02, 2023 03:17 AM

    torchrun needs to be executed on all hosts (as you found out), so we use something like 

    blaunch -z "$HOSTLIST" torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./elastic_ddp.py

    where $HOSTLIST contains the names of the nodes involved and can be constructed from LSB_AFFINITY_HOSTFILE.  $HOSTNAME will be set to the master node, since the script executes there, so that is sufficient to make the rendezvous work.  To make it fully flexible, you can also extract the values for '--nnodes' and '--nproc_per_node' from LSB_AFFINITY_HOSTFILE.
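    One possible way to derive those values, assuming the usual LSB_AFFINITY_HOSTFILE layout of one line per task slot with the host name in the first field (the function name get_ddp_layout is just for illustration):

    ```shell
    # Derive torchrun's arguments from LSB_AFFINITY_HOSTFILE, which LSF writes
    # with one line per task slot; the first field on each line is the host name.
    get_ddp_layout() {
        # unique host names, space separated
        HOSTLIST=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort -u | tr '\n' ' ')
        # number of unique hosts
        NNODES=$(echo $HOSTLIST | wc -w)
        # slots on the first host = processes per node (assumes a homogeneous allocation)
        NPROC_PER_NODE=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort | uniq -c | awk 'NR==1{print $1}')
    }

    # In the job script, after the #BSUB lines, something like:
    #   get_ddp_layout
    #   blaunch -z "$HOSTLIST" torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE \
    #       --rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./elastic_ddp.py
    ```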



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 3.  RE: How to run pytorch ddp job on multi-nodes

    Posted Mon June 12, 2023 10:28 PM

    Thanks, this has been a great help!



    ------------------------------
    zheng fa
    ------------------------------



  • 4.  RE: How to run pytorch ddp job on multi-nodes

    Posted Tue July 25, 2023 01:26 PM

    Hi. Could you give me a whole demo of this script?



    ------------------------------
    kaijie shi
    ------------------------------



  • 5.  RE: How to run pytorch ddp job on multi-nodes

    Posted Tue July 25, 2023 01:28 PM

    Hi, how can I get $HOSTLIST and $HOSTNAME?



    ------------------------------
    kaijie shi
    ------------------------------



  • 6.  RE: How to run pytorch ddp job on multi-nodes

    Posted Fri June 02, 2023 12:35 PM

    Hi Zheng Fa,

    >When I try to submit a pytorch ddp job with 16 GPUs on 2 nodes,  I find it doesn't work. 

    #BSUB  -gpu "mode=exclusive_process:aff=yes"
    #BSUB -R "32*{rusage[ngpus_physical=8]}+32*{rusage[ngpus_physical=8]}"

    Replace the two lines above with the lines below: the first requests 16 slots (-n 16) with 8 slots per host (span[ptile=8]), and the second requests 8 GPUs per host.

    #BSUB  -n 16 -R "span[ptile=8]" 
    #BSUB -gpu  "num=8:mode=exclusive_process:aff=yes"

    You will need LSF 10.1 Fix Pack 6 or later, with LSB_GPU_NEW_SYNTAX=extend set in lsf.conf.   Additionally, check out the URL below if you want to request the number of GPUs per task (for example, -gpu "num=1/task") instead of per host:

    https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=o-gpu
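    For those asking for a complete demo: combining this resource request with the blaunch/torchrun approach from the earlier reply, a full job script might look like the sketch below. The queue name, conda environment, and elastic_ddp.py come from the original post; the hostfile parsing is one way to do it, not the only one, and it assumes one hostfile line per task slot.

    ```shell
    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q zhangml
    #BSUB -n 16 -R "span[ptile=8]"
    #BSUB -gpu "num=8:mode=exclusive_process:aff=yes"

    module load anaconda3
    source activate py1.10

    # One line per task slot in LSB_AFFINITY_HOSTFILE; field 1 is the host name.
    HOSTLIST=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort -u | tr '\n' ' ')
    NNODES=$(echo $HOSTLIST | wc -w)
    NPROC_PER_NODE=$(awk '{print $1}' "$LSB_AFFINITY_HOSTFILE" | sort | uniq -c | awk 'NR==1{print $1}')

    # blaunch starts torchrun on every allocated host; the rendezvous endpoint
    # is the first execution host, where this script itself runs.
    blaunch -z "$HOSTLIST" torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE \
        --rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./elastic_ddp.py
    ```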



    ------------------------------
    John Welch
    ------------------------------



  • 7.  RE: How to run pytorch ddp job on multi-nodes

    Posted Mon June 12, 2023 10:29 PM

    Thanks, I will have a try.



    ------------------------------
    zheng fa
    ------------------------------