High Performance Computing Group

IBM Spectrum LSF support for PyTorch DDP on multiple nodes

Danial Maleki posted Mon August 19, 2024 05:25 PM

Hi everyone,

I'm trying to run a simple multi-node PyTorch DDP script under the LSF scheduler, using the following job script that I borrowed from this thread: https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes

bash
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q long
#BSUB -n 2 -R "span[ptile=1]"
#BSUB -gpu "num=2:mode=exclusive_process:aff=yes"

ml load Miniforge3/24.1.2-0
conda activate pytorch

HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd,)

blaunch -z $HOSTLIST --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10

However, I'm encountering the following error: "10.1 lsb_launch(): Bad host name". Could you please guide me on how to resolve this issue?
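
In case it helps with diagnosis, here is a small standalone sketch of what the HOSTLIST line produces. The host names node01 and node02 are made-up sample values (inside a real job LSF sets LSB_MCPU_HOSTS itself, as alternating host names and slot counts), and the space-separated variant is only there for comparison, since I am not sure which separator blaunch -z expects.

```bash
#!/bin/bash
# Hypothetical sample value; in a real job LSF sets LSB_MCPU_HOSTS itself,
# formatted as "host1 slots1 host2 slots2 ...".
LSB_MCPU_HOSTS="node01 1 node02 1"

# Same pipeline as in the job script (with an explicit "-" so paste reads
# stdin on non-GNU systems too): keep every odd field, i.e. the host names,
# and join them with commas.
HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd, -)
echo "comma-separated: $HOSTLIST"

# Comparison variant: the same host names joined with spaces instead.
HOSTLIST_SPACES=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd' ' -)
echo "space-separated: $HOSTLIST_SPACES"
```

With the sample value above, the first variable holds a single comma-joined string and the second holds the same names separated by a space, so whichever form is passed to blaunch -z would need to be quoted accordingly.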

Thank you!