The following are not blaunch command options:
--nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400
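These flags look like torchrun options, so one likely fix is to put torchrun between the blaunch arguments and the flags. A minimal sketch follows, under two assumptions that have not been verified on a cluster: that `blaunch -z` expects a space-separated host list in quotes rather than a comma-separated one, and that torchrun is available in the activated environment on every host. The helper name `lsf_hosts` and the sample hosts `hostA`/`hostB` are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: take an LSB_MCPU_HOSTS-style string
# ("host1 ncpus1 host2 ncpus2 ...") and print only the host names,
# separated by single spaces.
lsf_hosts() {
    echo "$1" | awk '{
        for (i = 1; i <= NF; i += 2)
            printf "%s%s", (i > 1 ? " " : ""), $i
        print ""
    }'
}

# Inside a real job LSF sets LSB_MCPU_HOSTS; the fallback value here is
# an invented example so the snippet can be tried outside LSF.
HOSTLIST=$(lsf_hosts "${LSB_MCPU_HOSTS:-hostA 1 hostB 1}")
echo "$HOSTLIST"

# Assumed launch line (left commented out because it needs an LSF job):
# blaunch gets only the host list and the command to run, and torchrun
# gets the rendezvous options.
# blaunch -z "$HOSTLIST" torchrun --nnodes=2 --nproc_per_node=1 \
#     --rdzv_id=$RANDOM --rdzv_backend=c10d \
#     --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10
```

Because `$RANDOM` and `$HOSTNAME` are expanded once by the job script before blaunch starts, every host should receive the same rendezvous id and endpoint.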
------------------------------
YI SUN
------------------------------
Original Message:
Sent: Mon August 19, 2024 06:06 PM
From: Danial Maleki
Subject: DDP job on multi node
Hi everyone,
I'm trying to run a simple multi-node PyTorch script on an LSF scheduler using the following script, which I borrowed from: https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes#bmcfa6563e-098c-4ad0-88f7-a0615a97de40
#!/bin/bash
ml load Miniforge3/24.1.2-0
conda activate pytorch
HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd,)
blaunch -z $HOSTLIST --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10
However, I'm encountering the following error: "10.1 lsb_launch(): Bad host name". Could you please guide me on how to resolve this issue?
Thank you!
------------------------------
Danial Maleki
------------------------------