Accessing multiple GPUs on different hosts using LSF

5. RE: Accessing multiple GPUs on different hosts using LSF

Subin Pillai

Posted Sun March 03, 2024 11:03 PM

Sorry, my bad. This is the output. But somehow it still uses GPUs from only one host. Also, note the external message:

EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 e40070822 Mar 4 09:26 host1:gpus=0,1,2,3;host2:gpus N

As you can see, the GPUs listed in this message are the ones that are not being used: all 4 GPUs on host2 are running the task, but none on host1.

$ bjobs -l 232878

Job <232878>, User <abcd>, Project <default>, Status <RUN>, Queue <gq>, Co
                     mmand <module load cuda; ./demo.sh>, Share group charged
                     </GPU>
Mon Mar  4 09:25:36: Submitted from host <host1>, CWD <$HOME>, 2 Task(s), Re
                     quested Resources <type==any span[ptile=1]>, Requested GPU
                      <num=4/host>;
Mon Mar  4 09:25:57: Started 2 Task(s) on Host(s) <1*host2> <1*host1>, Al
                     located 2 Slot(s) on Host(s) <1*host2> <1*host1>, Ex
                     ecution Home </users/analysis/e40070822>, Execution CWD </
                     users/analysis/e40070822>;
Mon Mar  4 09:28:03: Resource usage collected.
                     The CPU time used is 73 seconds.
                     MEM: 12.3 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 197


 MEMORY USAGE:
 MAX MEM: 12.3 Gbytes;  AVG MEM: 7.5 Gbytes; MEM Efficiency: 0.00%

 CPU USAGE:
 CPU PEAK: 0.98 ;  CPU Efficiency: 49.21%

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
 0      e40070822  Mar  4 09:26   host1:gpus=0,1,2,3;host2:gpus     N

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[((type == any ) && (type == any )) && (ngpus>0)] order[r15s:p
                     g] rusage[ngpus_physical=4.00/host] span[ptile=1]
 Effective: select[((type == any ) && (type == any )) && (ngpus>0)] order[r15s:
                     pg] rusage[ngpus_physical=4.00/host] span[ptile=1]

 GPU REQUIREMENT DETAILS:
 Combined: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
 Effective: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia

------------------------------
Subin Pillai
------------------------------

Original Message:
Sent: Sun March 03, 2024 03:59 PM
From: YI SUN
Subject: Accessing multiple GPUs on different hosts using LSF

It seems you should submit the test job to queue "gq" rather than "cq".

------------------------------
YI SUN

Original Message:
Sent: Sun March 03, 2024 01:05 PM
From: Subin Pillai
Subject: Accessing multiple GPUs on different hosts using LSF

Hi Yi Sun,

A one-node job can get a GPU allocation on host2. The lsload output is below:

lsload -gpu
HOST_NAME  status  ngpus  gpu_shared_avg_mut  gpu_shared_avg_ut  ngpus_physical
host1      ok      4      0%                  0%                 4
host2      ok      4      0%                  0%                 4

When I tried

On host1, bsub -n 2 -gpu "num=4/host" -R "type==any span[ptile=1]" blaunch nvidia-smi

I get "There are no suitable hosts for the job", as shown below:

Job <232873>, User <abc>, Project <default>, Status <PEND>, Queue <cq>, C
                     ommand <module load cuda; ./maya2.sh>
Sun Mar  3 23:32:59: Submitted from host <host1>, CWD <$HOME>, 2 Task(s), Re
                     quested Resources <type==any span[ptile=1]>, Requested GPU
                      <num=4/host>;

 PENDING REASONS:
 There are no suitable hosts for the job;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phys
                     ical=4.00/host] span[ptile=1]
 Effective: -

 GPU REQUIREMENT DETAILS:
 Combined: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
 Effective: -

------------------------------
Subin Pillai

Original Message:
Sent: Thu February 22, 2024 11:53 AM
From: YI SUN
Subject: Accessing multiple GPUs on different hosts using LSF

Try the following to see whether a one-node job can get a GPU allocation on host2.

On host1, bsub -I -gpu "num=4" -m host2 nvidia-smi

If the above test succeeds, try the following to see whether it works on two nodes.

On host1, bsub -n 2 -gpu "num=4/host" -R "type==any span[ptile=1]" blaunch nvidia-smi

Also make sure that lsload -gpu and bhosts -gpu report correct GPU info for host2.
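
The GPU check can also be scripted. What follows is a minimal sketch, assuming the whitespace-separated column layout of the lsload -gpu output quoted in this thread. The function names and the sample text are illustrative only; a real script would pass in the captured output of the command, and hosts in other states may print differently.

```python
def parse_lsload_gpu(text):
    """Parse `lsload -gpu` style output into {host: {column: value}}."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # The first column is the host name; the rest are keyed by header name.
    return {row[0]: dict(zip(header[1:], row[1:])) for row in rows}


def hosts_with_gpus(text, wanted):
    """Return hosts whose status is 'ok' and that report at least `wanted` GPUs."""
    report = parse_lsload_gpu(text)
    return sorted(
        host for host, cols in report.items()
        if cols["status"] == "ok" and int(cols["ngpus"]) >= wanted
    )


# Sample copied from the lsload output posted in this thread.
SAMPLE = """\
HOST_NAME status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical
host1 ok 4 0% 0% 4
host2 ok 4 0% 0% 4
"""

print(hosts_with_gpus(SAMPLE, 4))
```

If both hosts appear in the result, the scheduler at least sees the GPUs on each host, so a remaining problem is more likely in how the job is launched than in GPU detection.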

------------------------------
YI SUN

Original Message:
Sent: Wed February 21, 2024 03:25 AM
From: Subin Pillai
Subject: Accessing multiple GPUs on different hosts using LSF

I am using an HPC cluster with the LSF resource manager. The task is to train a TensorFlow model. The graphics queue gq (max jobs 96) has two hosts with 4 GPUs each. I want to use all 8 GPUs together, but the task is allotted to only one host.

The command I used to submit the job is as follows:

 bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

The bjobs -l output is as follows:

 bjobs -l 229990

Job <229990>, User <abc>, Project <default>, Status <RUN>, Queue <gq>, Co
                     mmand <module load cuda; ./maya2.sh>, Share group charged
                     </GPU>
Wed Feb 21 13:43:13: Submitted from host <host1>, CWD <$HOME>, Output File <
                     output2.log>, Error File <error2.log>, 96 Task(s), Request
                     ed GPU <num=4:mode=shared:j_exclusive=no>;
Wed Feb 21 13:43:13: Started 96 Task(s) on Host(s) <48*host1> <48*host2>,
                     Allocated 96 Slot(s) on Host(s) <48*host1> <48*host2
                     k>, Execution Home </users/sys/abc>, Execution
                     CWD </users/sys/abc>;
Wed Feb 21 13:44:24: Resource usage collected.
                     The CPU time used is 18 seconds.
                     MEM: 1 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 78

 RUNLIMIT
 360.0 min

 MEMORY USAGE:
 MAX MEM: 1 Gbytes;  AVG MEM: 480 Mbytes; MEM Efficiency: 0.00%

 CPU USAGE:
 CPU PEAK: 0.25 ;  CPU Efficiency: 0.26%

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
 0      e40070822  Feb 21 13:43   host1:gpus=1,2,3,0;host2:gpus     N

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phys
                     ical=4.00]
 Effective: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phy
                     sical=4.00]

 GPU REQUIREMENT DETAILS:
 Combined: num=4:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
 Effective: num=4:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia

No GPUs from host2 are allocated to the task. I confirmed this with the nvidia-smi command as well. The task runs only on host1. What should I do so that I can use all 8 GPUs across both hosts? Is there a mistake in the bsub command I am using?

------------------------------
Subin Pillai
------------------------------

6. RE: Accessing multiple GPUs on different hosts using LSF

YI SUN

Posted Mon March 04, 2024 04:19 PM

How did you launch the 2nd task on host1?

------------------------------
YI SUN
------------------------------


7. RE: Accessing multiple GPUs on different hosts using LSF

YI SUN

Posted Mon March 04, 2024 04:24 PM

You may check this link.

https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes#bmcfa6563e-098c-4ad0-88f7-a0615a97de40

------------------------------
YI SUN
------------------------------


8. RE: Accessing multiple GPUs on different hosts using LSF

Subin Pillai

Posted Tue March 05, 2024 01:13 AM
Edited by Subin Pillai Tue March 05, 2024 04:29 AM

This really helped. I wasn't using blaunch; that was the missing piece of the puzzle. This is the updated command I used:

bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log blaunch -z "host1 host2" "module load cuda; ./file.sh"

But the problem now is that it runs the program twice. That is, I get the output twice, instead of the program running in parallel.

------------------------------
Subin Pillai
------------------------------


9. RE: Accessing multiple GPUs on different hosts using LSF

Bernd Dammann

Posted Wed March 06, 2024 10:49 AM

Yes, blaunch executes the command you give it on each node. You will need to set the TensorFlow environment variable TF_CONFIG so that TensorFlow recognizes this as a "cluster". In your file.sh, add something like this:

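# blaunch task IDs are assumed to start at 1; TensorFlow worker indices start at 0
MYTASKID=$((LSF_PM_TASKID-1))

# export so that the Python process started later in file.sh inherits the variable
export TF_CONFIG='{"cluster": {"worker": ["host1:12345", "host2:12345"]}, "task": {"index": '${MYTASKID}', "type": "worker"}}'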

This uses the task ID of each blaunch task to create the task index for TensorFlow. You might also need to choose a port number other than 12345.
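
The same setting can be built at the top of the Python script instead of hard-coding host names in file.sh. This is an untested sketch under several assumptions: that LSB_MCPU_HOSTS lists the allocated hosts as "host1 48 host2 48", that blaunch task N runs on the Nth host in that list (the hard-coded version makes the same assumption), and that offsetting the port by the job ID is an acceptable way to avoid clashes. The function name build_tf_config and the port arithmetic are illustrative, not part of LSF or TensorFlow.

```python
import json
import os


def build_tf_config(env, base_port=12345):
    """Build a TF_CONFIG JSON string from LSF-provided environment variables."""
    # LSB_MCPU_HOSTS alternates host names and slot counts,
    # so every second token is a host name.
    hosts = env["LSB_MCPU_HOSTS"].split()[0::2]
    # blaunch task IDs are assumed to start at 1; TensorFlow indices start at 0.
    index = int(env["LSF_PM_TASKID"]) - 1
    # Every task of a job sees the same LSB_JOBID, so all workers agree
    # on the port (hypothetical scheme, not an LSF feature).
    port = base_port + int(env.get("LSB_JOBID", "0")) % 1000
    workers = [f"{host}:{port}" for host in hosts]
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })


# Illustrative values modelled on the job shown earlier in this thread.
example_env = {
    "LSB_MCPU_HOSTS": "host1 48 host2 48",
    "LSF_PM_TASKID": "2",
    "LSB_JOBID": "229990",
}
print(build_tf_config(example_env))

# Inside a real blaunch task, publish the value before creating the strategy.
if "LSF_PM_TASKID" in os.environ and "LSB_MCPU_HOSTS" in os.environ:
    os.environ["TF_CONFIG"] = build_tf_config(os.environ)
```

It is worth printing the generated value on each host once to confirm that the index really matches the host the task landed on.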

In your Python code, you have to choose the corresponding strategy:

strategy = tf.distribute.MultiWorkerMirroredStrategy()

This will then pick up the TF_CONFIG settings to distribute the work.

I haven't tested this myself, but there are tutorials online that show how to set up TF_CONFIG.
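For orientation, here is a rough, untested sketch of how the training script could check TF_CONFIG and then build a model under the strategy. The function names and the tiny placeholder model are assumptions for illustration only. The real model and training loop would go where the placeholder is.

```python
import json
import os


def describe_task(tf_config):
    """Return (task_type, task_index, num_workers) from a TF_CONFIG string."""
    cfg = json.loads(tf_config)
    task = cfg["task"]
    return task["type"], task["index"], len(cfg["cluster"]["worker"])


def build_model():
    """Create the strategy and a placeholder model inside its scope."""
    # Imported here so that TF_CONFIG is already set when TensorFlow loads.
    import tensorflow as tf

    task_type, index, num_workers = describe_task(os.environ["TF_CONFIG"])
    print(f"{task_type} {index} of {num_workers} starting")

    # The strategy reads TF_CONFIG when it is constructed.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Placeholder model; replace with the real one.
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="sgd", loss="mse")
    return strategy, model
```

Each blaunch task would run the same script and call build_model(). The printed line gives a quick check that the two hosts report different worker indices.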

------------------------------
Bernd Dammann
------------------------------

Original Message:
Sent: Tue March 05, 2024 01:12 AM
From: Subin Pillai
Subject: Accessing multiple GPUs on different hosts using LSF

This really helped. I wasn't using blaunch. That was the missing piece of the puzzle. This is the updated command I used.

bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log blaunch -z "host1 host2" "module load cuda; ./file.sh"

But the problem now is that it runs the program twice. That is, I get the output twice instead of the work running in parallel.

------------------------------
Subin Pillai

Original Message:
Sent: Mon March 04, 2024 04:24 PM
From: YI SUN
Subject: Accessing multiple GPUs on different hosts using LSF

You may check this link.

https://community.ibm.com/community/user/cloud/discussion/how-to-run-pytorch-ddp-job-on-multi-nodes#bmcfa6563e-098c-4ad0-88f7-a0615a97de40

------------------------------
YI SUN

Original Message:
Sent: Mon March 04, 2024 04:19 PM
From: YI SUN
Subject: Accessing multiple GPUs on different hosts using LSF

How did you launch the 2nd task on host1?

------------------------------
YI SUN

Original Message:
Sent: Sun March 03, 2024 11:03 PM
From: Subin Pillai
Subject: Accessing multiple GPUs on different hosts using LSF

Sorry my bad. This is the output. But somehow it still uses GPUs only from one host. Also

EXTERNAL MESSAGES:
MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
0      e40070822  Mar  4 09:26   host1:gpus=0,1,2,3;host2:gpus     N

If you see these are GPUs that are not allocated. All 4 gpus from host2 are running the task but none from host1

$ bjobs -l 232878

Job <232878>, User <abcd>, Project <default>, Status <RUN>, Queue <gq>, Co
                     mmand <module load cuda; ./demo.sh>, Share group charged
                     </GPU>
Mon Mar  4 09:25:36: Submitted from host <host1>, CWD <$HOME>, 2 Task(s), Re
                     quested Resources <type==any span[ptile=1]>, Requested GPU
                      <num=4/host>;
Mon Mar  4 09:25:57: Started 2 Task(s) on Host(s) <1*host2> <1*host1>, Al
                     located 2 Slot(s) on Host(s) <1*host2> <1*host1>, Ex
                     ecution Home </users/analysis/e40070822>, Execution CWD </
                     users/analysis/e40070822>;
Mon Mar  4 09:28:03: Resource usage collected.
                     The CPU time used is 73 seconds.
                     MEM: 12.3 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 197


 MEMORY USAGE:
 MAX MEM: 12.3 Gbytes;  AVG MEM: 7.5 Gbytes; MEM Efficiency: 0.00%

 CPU USAGE:
 CPU PEAK: 0.98 ;  CPU Efficiency: 49.21%

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
 0      e40070822  Mar  4 09:26   host1:gpus=0,1,2,3;host2:gpus     N

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[((type == any ) && (type == any )) && (ngpus>0)] order[r15s:p
                     g] rusage[ngpus_physical=4.00/host] span[ptile=1]
 Effective: select[((type == any ) && (type == any )) && (ngpus>0)] order[r15s:
                     pg] rusage[ngpus_physical=4.00/host] span[ptile=1]

 GPU REQUIREMENT DETAILS:
 Combined: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
 Effective: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia

------------------------------
Subin Pillai

Original Message:
Sent: Sun March 03, 2024 03:59 PM
From: YI SUN
Subject: Accessing multiple GPUs on different hosts using LSF

It seems you should submit the test job to queue "gq" rather than "cq".

------------------------------
YI SUN

Original Message:
Sent: Sun March 03, 2024 01:05 PM
From: Subin Pillai
Subject: Accessing multiple GPUs on different hosts using LSF

Hi Yi Sun,

A one-node job can get a GPU allocation on host2. The lsload output is as follows:

lsload -gpu
HOST_NAME  status  ngpus  gpu_shared_avg_mut  gpu_shared_avg_ut  ngpus_physical
host1      ok      4      0%                  0%                 4
host2      ok      4      0%                  0%                 4

When I tried

On host1, bsub -n 2 -gpu "num=4/host" -R "type==any span[ptile=1]" blaunch nvidia-smi

the job stayed pending with "no suitable hosts", as shown below:

Job <232873>, User <abc>, Project <default>, Status <PEND>, Queue <cq>, C
                     ommand <module load cuda; ./maya2.sh>
Sun Mar  3 23:32:59: Submitted from host <host1>, CWD <$HOME>, 2 Task(s), Re
                     quested Resources <type==any span[ptile=1]>, Requested GPU
                      <num=4/host>;

 PENDING REASONS:
 There are no suitable hosts for the job;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phys
                     ical=4.00/host] span[ptile=1]
 Effective: -

 GPU REQUIREMENT DETAILS:
 Combined: num=4/host:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
 Effective: -

------------------------------
Subin Pillai

Original Message:
Sent: Thu February 22, 2024 11:53 AM
From: YI SUN
Subject: Accessing multiple GPUs on different hosts using LSF

Try the following to see whether a one-node job can get a GPU allocation on host2.

On host1, bsub -I -gpu "num=4" -m host2 nvidia-smi

If that test succeeds, try the following to see whether it works on two nodes.

On host1, bsub -n 2 -gpu "num=4/host" -R "type==any span[ptile=1]" blaunch nvidia-smi

Also make sure that lsload -gpu and bhosts -gpu report the correct GPU information for host2.

------------------------------
YI SUN

Original Message:
Sent: Wed February 21, 2024 03:25 AM
From: Subin Pillai
Subject: Accessing multiple GPUs on different hosts using LSF

I am using an HPC cluster with the LSF resource manager. The task is to train a TensorFlow model. The graphics queue gq (max jobs 96) has two hosts with 4 GPUs each. I want to use all 8 GPUs together, but the task is allotted to only one host.

The command I used to schedule the job is as follows :-

 bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

The bjobs -l output is as follows :-

 bjobs -l 229990

Job <229990>, User <abc>, Project <default>, Status <RUN>, Queue <gq>, Co
                     mmand <module load cuda; ./maya2.sh>, Share group charged
                     </GPU>
Wed Feb 21 13:43:13: Submitted from host <host1>, CWD <$HOME>, Output File <
                     output2.log>, Error File <error2.log>, 96 Task(s), Request
                     ed GPU <num=4:mode=shared:j_exclusive=no>;
Wed Feb 21 13:43:13: Started 96 Task(s) on Host(s) <48*host1> <48*host2>,
                     Allocated 96 Slot(s) on Host(s) <48*host1> <48*host2
                     k>, Execution Home </users/sys/abc>, Execution
                     CWD </users/sys/abc>;
Wed Feb 21 13:44:24: Resource usage collected.
                     The CPU time used is 18 seconds.
                     MEM: 1 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 78

 RUNLIMIT
 360.0 min

 MEMORY USAGE:
 MAX MEM: 1 Gbytes;  AVG MEM: 480 Mbytes; MEM Efficiency: 0.00%

 CPU USAGE:
 CPU PEAK: 0.25 ;  CPU Efficiency: 0.26%

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
 0      e40070822  Feb 21 13:43   host1:gpus=1,2,3,0;host2:gpus     N

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phys
                     ical=4.00]
 Effective: select[(type == any ) && (ngpus>0)] order[r15s:pg] rusage[ngpus_phy
                     sical=4.00]

 GPU REQUIREMENT DETAILS:
 Combined: num=4:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia
 Effective: num=4:mode=shared:mps=no:j_exclusive=no:gvendor=nvidia

No GPUs from host2 are allocated to the task. I checked this with the nvidia-smi command as well. The task runs only on host1. What should I do so that I can use all 8 GPUs across both hosts? Is there a mistake in the bsub command I am using?

------------------------------
Subin Pillai
------------------------------
