High Performance Computing Group

  • 1.  DDP job on multi node

    Posted Tue August 20, 2024 05:35 AM
    Edited by Danial Maleki Tue August 20, 2024 07:41 PM

    Hi everyone,

    I'm trying to run a simple multi-node PyTorch DDP script under the LSF scheduler, using the following job script, which I borrowed from: https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes#bmcfa6563e-098c-4ad0-88f7-a0615a97de40

    bash
    #!/bin/bash
    #BSUB -J pytorch_ddp
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -q long
    #BSUB -n 2 -R "span[ptile=1]"
    #BSUB -gpu "num=2:mode=exclusive_process:aff=yes"
    ml load Miniforge3/24.1.2-0
    conda activate pytorch
    HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd,)
    blaunch -z $HOSTLIST --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10

    However, I'm encountering the following error: "10.1 lsb_launch(): Bad host name". Could you please guide me on how to resolve this issue?

    Thank you!



    ------------------------------
    Danial Maleki
    ------------------------------



  • 2.  RE: DDP job on multi node

    Posted Tue August 20, 2024 07:59 PM

    The following are not blaunch command options:

    --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400


    ------------------------------
    YI SUN
    ------------------------------



  • 3.  RE: DDP job on multi node
    Best Answer

    Posted Wed August 21, 2024 03:33 AM

    There was a typo in the original post - the command 'torchrun' is missing after $HOSTLIST!
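
    For anyone landing here later, here is a minimal sketch of what the launch line might look like with 'torchrun' put back in. It reuses the host-list extraction and all options from the script in the question unchanged. The host names node01 and node02 are made-up sample values so the snippet can run outside an LSF job, and the command is only printed, not executed. I have not tested this on a real cluster, so treat it as an illustration rather than a verified job script.

```shell
#!/bin/bash
# Sample value (an assumption) so the sketch runs outside LSF.
# Inside a job, LSF sets LSB_MCPU_HOSTS itself as "host1 slots1 host2 slots2 ...".
: "${LSB_MCPU_HOSTS:=node01 1 node02 1}"

# Keep every other field (the host names) and join them with commas,
# exactly as the script in the question does.
HOSTLIST=$(echo $LSB_MCPU_HOSTS | awk '{for (i=1; i<=NF; i+=2) print $i}' | paste -sd, -)

# 'torchrun' now sits between the host list and the --nnodes/--rdzv_* options,
# so those options are passed to torchrun instead of being parsed by blaunch.
CMD="blaunch -z $HOSTLIST torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=$RANDOM --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:29400 ./multinode.py 10 10"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

    In a real job script the final line would run the command directly instead of echoing it.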



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 4.  RE: DDP job on multi node

    Posted Wed August 21, 2024 12:38 PM

    Thank you, that was really helpful.



    ------------------------------
    Danial Maleki
    ------------------------------