Hi everyone,
I'm trying to run a simple multi-node PyTorch script under the LSF scheduler using the following script, which I borrowed from: https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes#bmcfa6563e-098c-4ad0-88f7-a0615a97de40
    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q long
    #BSUB -n 2 -R "span[ptile=1]"
    #BSUB -gpu "num=2:mode=exclusive_process:aff=yes"

    ml load Miniforge3/24.1.2-0
    conda activate pytorch

    HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd,)

    blaunch -z $HOSTLIST --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10
However, I'm encountering the following error: "10.1 lsb_launch(): Bad host name". Could you please guide me on how to resolve this issue?
Thank you!
------------------------------Danial Maleki------------------------------
The following are not blaunch command options; they are torchrun options:
--nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400
There was a typo in the original post - the command 'torchrun' is missing after $HOSTLIST!
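For anyone landing here later, a minimal sketch of the launch part of the script with 'torchrun' put back after $HOSTLIST. The fallback value for LSB_MCPU_HOSTS (nodeA, nodeB) and the dry-run branch are assumptions added only so the snippet can be tried outside an LSF job; they are not part of the original script.

```shell
#!/bin/bash
# Outside an LSF job LSB_MCPU_HOSTS is unset; the fallback below is a
# made-up two-host value (assumption) so the snippet can be tried as a dry run.
: "${LSB_MCPU_HOSTS:=nodeA 1 nodeB 1}"

# LSB_MCPU_HOSTS looks like "host1 ncpus1 host2 ncpus2 ..."; keep only the host names.
HOSTLIST=$(echo "$LSB_MCPU_HOSTS" | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd, -)

# torchrun now sits between the blaunch arguments and the torchrun options.
set -- blaunch -z "$HOSTLIST" \
    torchrun --nnodes=2 --nproc_per_node=1 \
    --rdzv_id="$RANDOM" --rdzv_backend=c10d \
    --rdzv_endpoint="$HOSTNAME:29400" \
    ./multinode.py 10 10

if command -v blaunch >/dev/null 2>&1; then
    "$@"                  # blaunch is available: run the command
else
    echo "dry run: $*"    # blaunch not found: only print the command
fi
```

With torchrun in place, blaunch sees only -z and the host list as its own arguments, and everything from torchrun onward is the task it starts on each host. If the same error persists, the separator used in HOSTLIST may be worth checking against the blaunch documentation for -z.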
Thank you, that was really helpful.