High Performance Computing Group

Accessing multiple GPUs on different hosts using LSF

  • 1.  Accessing multiple GPUs on different hosts using LSF

    Posted Thu February 22, 2024 07:53 AM

    I am using an HPC cluster with the LSF resource manager. The task is to train a TensorFlow model. The graphics queue gq (max jobs 96) has two hosts with 4 GPUs each. I want to use all 8 GPUs together, but the task is allotted to only one host.

    The command I used to schedule the job is as follows:

     bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

    The bjobs -l output is as follows:

     bjobs -l 229990
    
    Job <229990>, User <abc>, Project <default>, Status <RUN>, Queue <gq>, Co
                         mmand <module load cuda; ./maya2.sh>, Share group charged
                         </GPU>
    Wed Feb 21 13:43:13: Submitted from host <host1>, CWD <$HOME>, Output File <
                         output2.log>, Error File <error2.log>, 96 Task(s), Request
                         ed GPU <num=4:mode=shared:j_exclusive=no>;
    Wed Feb 21 13:43:13: Started 96 Task(s) on Host(s) <48*host1> <48*host2>,
                          Allocated 96 Slot(s) on Host(s) <48*host1> <48*host2
                         k>, Execution Home </users/sys/abc>, Execution
                         CWD </users/sys/abc>;
    Wed Feb 21 13:44:24: Resource usage collected.
                         The CPU time used is 18 seconds.
                         MEM: 1 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 78
    
     RUNLIMIT
     360.0 min
    
     MEMORY USAGE:
     MAX MEM: 1 Gbytes;  AVG MEM: 480 Mbytes; MEM Efficiency: 0.00%
    
     CPU USAGE:
     CPU PEAK: 0.25 ;  CPU Efficiency: 0.26%
    
     SCHEDULING PARAMETERS:
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -
     loadStop    -     -     -     -       -     -    -     -     -      -      -
    
     EXTERNAL MESSAGES:
     MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
     0      e40070822  Feb 21 13:43   host1:gpus=1,2,3,0;host2:gpus     N
    
     RESOURCE REQUIREMENT DETAILS:
     Combined: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phys
                         ical=4.00]
     Effective: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phy
                         sical=4.00]
    
     GPU REQUIREMENT DETAILS:
     Combined: num=4:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
     Effective: num=4:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
    

    No GPUs from host2 are allocated to the task. I also checked this with the nvidia-smi command. The task runs only on host1. What should I do to use all 8 GPUs across both hosts? Is there a mistake in the bsub command I am using?



    ------------------------------
    Subin Pillai
    ------------------------------


  • 2.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Thu February 22, 2024 11:53 AM

    Try the following to see whether a one-node job can get a GPU allocation on host2.

    • On host1, bsub -I -gpu "num=4" -m host2 nvidia-smi

    If the above test succeeds, try the following to see whether it works on two nodes.

    • On host1, bsub -n 2 -gpu "num=4/host" -R "type==any span[ptile=1]" blaunch nvidia-smi

    Also make sure that lsload -gpu and bhosts -gpu report correct GPU info for host2.



    ------------------------------
    YI SUN
    ------------------------------



  • 3.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Sun March 03, 2024 01:06 PM

    Hi Yi Sun, 

    A one-node job can get a GPU allocation on host2. The lsload output is below:

    lsload -gpu
    HOST_NAME   status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical
    host1     ok     4                 0%                0%              4
    host2     ok     4                 0%                0%              4

    When I tried

    • On host1, bsub -n 2 -gpu "num=4/host" -R "type==any span[ptile=1]" blaunch nvidia-smi

    I got "no suitable hosts", as shown below:

    Job <232873>, User <abc>, Project <default>, Status <PEND>, Queue <cq>, C
                         ommand <module load cuda; ./maya2.sh>
    Sun Mar  3 23:32:59: Submitted from host <host1>, CWD <$HOME>, 2 Task(s), Re
                         quested Resources <type==any span[ptile=1]>, Requested GPU
                          <num=4/host>;
     PENDING REASONS:
     There are no suitable hosts for the job;
    
    
     SCHEDULING PARAMETERS:
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -
     loadStop    -     -     -     -       -     -    -     -     -      -      -
    
     RESOURCE REQUIREMENT DETAILS:
     Combined: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phys
                         ical=4.00/host] span[ptile=1]
     Effective: -
    
     GPU REQUIREMENT DETAILS:
     Combined: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
     Effective: -
    





    ------------------------------
    Subin Pillai
    ------------------------------



  • 4.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Sun March 03, 2024 04:00 PM

    It seems you should submit the test job to queue "gq" rather than "cq".



    ------------------------------
    YI SUN
    ------------------------------



  • 5.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Sun March 03, 2024 11:03 PM

    Sorry, my bad. The output is below. It still uses GPUs from only one host. Also, see the external message:

    EXTERNAL MESSAGES:
     MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
     0      e40070822  Mar  4 09:26   host1:gpus=0,1,2,3;host2:gpus     N

    As you can see, the GPUs listed here are not the ones in use. All 4 GPUs on host2 are running the task, but none on host1.

    $ bjobs -l 232878
    
    Job <232878>, User <abcd>, Project <default>, Status <RUN>, Queue <gq>, Co
                         mmand <module load cuda; ./demo.sh>, Share group charged
                         </GPU>
    Mon Mar  4 09:25:36: Submitted from host <host1>, CWD <$HOME>, 2 Task(s), Re
                         quested Resources <type==any span[ptile=1]>, Requested GPU
                          <num=4/host>;
    Mon Mar  4 09:25:57: Started 2 Task(s) on Host(s) <1*host2> <1*host1>, Al
                         located 2 Slot(s) on Host(s) <1*host2> <1*host1>, Ex
                         ecution Home </users/analysis/e40070822>, Execution CWD </
                         users/analysis/e40070822>;
    Mon Mar  4 09:28:03: Resource usage collected.
                         The CPU time used is 73 seconds.
                         MEM: 12.3 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 197
    
    
     MEMORY USAGE:
     MAX MEM: 12.3 Gbytes;  AVG MEM: 7.5 Gbytes; MEM Efficiency: 0.00%
    
     CPU USAGE:
     CPU PEAK: 0.98 ;  CPU Efficiency: 49.21%
    
     SCHEDULING PARAMETERS:
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -
     loadStop    -     -     -     -       -     -    -     -     -      -      -
    
     EXTERNAL MESSAGES:
     MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
     0      e40070822  Mar  4 09:26   host1:gpus=0,1,2,3;host2:gpus     N
    
     RESOURCE REQUIREMENT DETAILS:
     Combined: select[((type == any ) && (type == any )) && (ngpus>0)] order[r15s:p
                         g] rusage[ngpus_physical=4.00/host] span[ptile=1]
     Effective: select[((type == any ) && (type == any )) && (ngpus>0)] order[r15s:
                         pg] rusage[ngpus_physical=4.00/host] span[ptile=1]
    
     GPU REQUIREMENT DETAILS:
     Combined: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
     Effective: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
    


    ------------------------------
    Subin Pillai
    ------------------------------



  • 6.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Mon March 04, 2024 04:19 PM

    How did you launch the 2nd task on host1?



    ------------------------------
    YI SUN
    ------------------------------



  • 7.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Mon March 04, 2024 04:24 PM

    You may check this link.

    https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes#bmcfa6563e-098c-4ad0-88f7-a0615a97de40



    ------------------------------
    YI SUN
    ------------------------------



  • 8.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Tue March 05, 2024 01:13 AM
    Edited by Subin Pillai Tue March 05, 2024 04:29 AM

    This really helped. I didn't use blaunch; that was the missing piece of the puzzle. This is the updated command I used:


    bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log blaunch -z "host1 host2" "module load cuda; ./file.sh"

    But the problem now is that it runs the program twice. That is, I get the output twice instead of the work running in parallel.
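
    One way to confirm that the two copies really run on different hosts is to have each copy print its host name, task ID, and visible GPUs. The following is a minimal sketch, not a tested recipe: it assumes that blaunch sets LSF_PM_TASKID for each launched task and that LSF exports CUDA_VISIBLE_DEVICES for jobs submitted with -gpu. The function name task_report is a hypothetical helper.

```python
import os
import socket


def task_report(env=None):
    """Describe where this copy of the program runs and which GPUs it sees."""
    env = os.environ if env is None else env
    return {
        "host": socket.gethostname(),
        # Assumption: blaunch sets LSF_PM_TASKID for each launched task.
        "task_id": env.get("LSF_PM_TASKID", "unset"),
        # Assumption: LSF exports CUDA_VISIBLE_DEVICES for jobs submitted with -gpu.
        "gpus": env.get("CUDA_VISIBLE_DEVICES", "unset"),
    }


if __name__ == "__main__":
    report = task_report()
    print(f"host={report['host']} task={report['task_id']} gpus={report['gpus']}")
```

    If each copy reports a different host, the duplicated output comes from two independent processes, each running the full program on its own node.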


    ------------------------------
    Subin Pillai
    ------------------------------



  • 9.  RE: Accessing multiple GPUs on different hosts using LSF

    Posted Wed March 06, 2024 10:49 AM

    Yes, blaunch executes the command you give it on each node. You will need to set the TensorFlow environment variable TF_CONFIG so that TensorFlow recognizes this as a "cluster". In your file.sh, add something like this:

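    # Shift the blaunch task ID down by one to get the TensorFlow worker index
    MYTASKID=$((LSF_PM_TASKID-1))
    
    # export is needed so that the Python process inherits TF_CONFIG
    export TF_CONFIG='{"cluster": {"worker": ["host1:12345", "host2:12345"]}, "task": {"index": '${MYTASKID}', "type": "worker"}}'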

    This uses the task ID of each blaunch task to create the task index for TensorFlow. You might also need to choose a port number other than 12345.

    In your Python code, you have to choose the corresponding strategy:

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    

    This will then pick up the TF_CONFIG settings to distribute the work.

    I haven't tested this myself, but there are tutorials on the net that show how to set up TF_CONFIG.
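
    The TF_CONFIG value described above can also be assembled in Python at the top of the training script, which avoids shell quoting. This is a minimal, untested sketch: it assumes that blaunch sets LSF_PM_TASKID with IDs starting at 1, as the shell snippet above implies. The helper name build_tf_config, the host list, and port 12345 are illustrative placeholders.

```python
import json
import os


def build_tf_config(hosts, task_id, port=12345):
    """Return a TF_CONFIG JSON string for one worker.

    hosts   -- worker host names, in the same order on every task
    task_id -- 1-based task ID (e.g. LSF_PM_TASKID under blaunch)
    port    -- placeholder port; pick any free port on the hosts
    """
    index = int(task_id) - 1  # TensorFlow worker indices start at 0
    if not 0 <= index < len(hosts):
        raise ValueError(f"task_id {task_id} out of range for {len(hosts)} hosts")
    config = {
        "cluster": {"worker": [f"{host}:{port}" for host in hosts]},
        "task": {"type": "worker", "index": index},
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Assumption: LSF_PM_TASKID is set by blaunch; default to 1 for a local dry run.
    task_id = os.environ.get("LSF_PM_TASKID", "1")
    os.environ["TF_CONFIG"] = build_tf_config(["host1", "host2"], task_id)
    print(os.environ["TF_CONFIG"])
```

    TF_CONFIG has to be set before tf.distribute.MultiWorkerMirroredStrategy() is created, because the strategy reads the variable when it is constructed.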



    ------------------------------
    Bernd Dammann
    ------------------------------



  • 10.  RE: Accessing multiple GPUs on different hosts using LSF