Hi everyone,
I'm trying to run a simple multi-node PyTorch script under the LSF scheduler, using the following job script, which I borrowed from https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes
#!/bin/bash
#BSUB -J pytorch_ddp
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q long
#BSUB -n 2 -R "span[ptile=1]"
#BSUB -gpu "num=2:mode=exclusive_process:aff=yes"

ml load Miniforge3/24.1.2-0
conda activate pytorch

HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd,)

blaunch -z $HOSTLIST --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10
However, I'm encountering the following error: "10.1 lsb_launch(): Bad host name". Could you please guide me on how to resolve this issue?
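In case it helps with diagnosis, here is a minimal sketch of what the HOSTLIST line in the script evaluates to. The value of LSB_MCPU_HOSTS below is made up (hostA and hostB are hypothetical host names; in a real job LSF sets this variable itself). I have not verified which separator `blaunch -z` expects between host names, so this may or may not be related to the error.

```shell
#!/bin/bash
# Hypothetical value for illustration only; LSF normally sets this.
# The format is "host1 slots1 host2 slots2 ...".
LSB_MCPU_HOSTS="hostA 1 hostB 1"

# Same pipeline as in the job script: keep the odd-numbered fields
# (the host names) and join them into one comma-separated string.
HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd,)

# Show exactly what would be passed to blaunch -z.
echo "$HOSTLIST"
```

With the sample value above, the host names end up joined by commas into a single word, and that single word is what gets passed to `blaunch -z`.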
Thank you!