IBM Spectrum Computing Group

DynaMIG management of NVIDIA DGX A100 with IBM Spectrum LSF

By Bill McMillan posted Mon January 04, 2021 08:59 AM

  

When I first saw the specs of the NVIDIA A100 GPU, my initial thought was: how do you use that many cores and that much GPU memory efficiently? The 40GB and newer 80GB versions allow you to run very large models and applications, but there are many smaller workloads that can't utilize the full compute capacity and memory of the A100.

With so much compute power available, it is more important than ever to ensure that its use is effectively aligned with business priorities. For decades we've worked with NVIDIA, enhancing IBM Spectrum LSF to ensure that our users get the most out of these GPUs with LSF.

Using MIG

As GPUs become ever more powerful, it can be challenging to keep them busy with workloads that do not saturate the compute capacity of the GPU, such as low-batch inference or some HPC workloads. CUDA provides mechanisms for achieving concurrency, such as CUDA streams or CUDA MPS, but these technologies can have limitations for running parallel work.

With the launch of the A100, a new capability known as Multi-Instance GPU (MIG) was introduced. This allows a single A100 to be subdivided into up to 7 GPU instances, so on a DGX A100 we can create up to 56 GPU instances. MIG offers the flexibility of creating GPU instances of different sizes, allowing Administrators to configure the A100 to better match the workload and thus increase overall utilization. Each instance is fully isolated, with its own high-bandwidth memory, cache, and compute cores.

MIG relies on the Administrator to manually define and change the configuration if required. If you have a relatively static/homogeneous workload then there is probably an optimal set of predefined MIG configurations that will work for you.  LSF will detect the existing MIG configuration, and schedule work to the pre-defined instances. However, if you have a highly variable workload, the Administrator is going to be very busy manually changing the configuration to meet workload requirements.

Let LSF automate that for you

For NVIDIA datacenter GPUs, LSF automatically detects GPUs, and it will dynamically change the compute mode of the GPU based on job requirements. You can read more about those capabilities here.

For the A100 we have extended this capability to have LSF dynamically reconfigure MIG based on workload requirements. For example, if you have a 40GB A100 and a job needs the full 40GB, LSF will configure it as a single 40GB GPU. But if the job only needs 20GB, MIG will be configured to create an instance with half the GPU memory, and so on.

And of course, if you have a DGX A100, this can be dynamically reconfigured between 8 GPUs and 56 GPUs, or anywhere in between, based on the workload requirements, all without any Administrator intervention.

This new capability allows the A100 configuration to be automatically right-sized to match the requirements of the workload, delivering a greater GPU ROI.
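To make the right-sizing idea concrete, here is a minimal shell sketch that picks the smallest MIG profile on a 40GB A100 that can hold a job's requested GPU memory. The profile names are the real A100 instance profiles (they also appear later in the "lshosts -gpu -mig" output); the smallest-fit selection policy itself is an illustrative assumption, not LSF's actual internal algorithm.

```shell
# Illustrative sketch only: choose the smallest 40GB A100 MIG profile whose
# memory can hold a job's gmem request. Profile sizes are the documented
# A100 instance profiles; the selection policy is an assumption.
pick_profile() {
    gmem_gb=$1
    for entry in "1g.5gb:5" "2g.10gb:10" "3g.20gb:20" "4g.20gb:20" "7g.40gb:40"; do
        name=${entry%:*}    # profile name, e.g. 3g.20gb
        size=${entry#*:}    # instance memory in GB
        if [ "$gmem_gb" -le "$size" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "no-fit"           # request exceeds a single A100
    return 1
}

pick_profile 4    # 1g.5gb
pick_profile 20   # 3g.20gb
pick_profile 30   # 7g.40gb
```

Note how a 30GB request, as in the example above, can only be satisfied by the full 7g.40gb instance.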

In the rest of this blog, we will examine this new capability in more detail.

A look under the covers

Before looking at what we are doing for A100, let's take a quick recap of the existing support for NVIDIA datacenter GPUs:

  • Auto-detection/configuration: LSF automatically detects the presence of GPUs and auto-configures to support them; there is no need to manually create resources, cgroups, or resource maps, as it is all done automatically. The GPU is ready to be used as soon as LSF starts up.
  • Workload driven mode-selection: Jobs can specify whether they need a GPU in shared or exclusive mode, and the mode will be automatically set before the job starts.
  • Isolation: The CPUs and GPUs assigned to a job are contained within a control group. This ensures that jobs cannot use GPUs they were not allocated, and it also means that if a job did not request a GPU, it cannot use one.
  • Accounting: Full accounting of how the GPU is used, including support for NVIDIA DCGM.

These are just four of the many capabilities LSF provides; if you are interested in a full account, please see the “LSF on DGX guide”.

Static MIG Configuration

Let’s look at how this works on a DGX A100. First, we will consider the case where the Administrator has statically configured MIG, and LSF will use the predefined instances. From the “nvidia-smi” output, we can see that the Administrator has manually configured 2 MIG instances on GPU 0, and one each on GPUs 1, 2, 3 and 4.

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |                      | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0   11   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   0  |      7MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
+------------------+----------------------+-----------+-----------------------+
|  2    2   0   0  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
+------------------+----------------------+-----------+-----------------------+
|  3    1   0   0  |     14MiB / 20096MiB | 56      0 |  4   0    2    0    0 |
+------------------+----------------------+-----------+-----------------------+
|  4    0   0   0  |      0MiB / 40537MiB | 98      0 |  7   0    5    1    1 |
+------------------+----------------------+-----------+-----------------------+

 

Using LSF’s “lshosts” command we can see the hardware configuration, and the “-gpu” flag shows the details of the GPUs that have been detected:

$ lshosts -gpu
HOST_NAME   gpu_id       gpu_model   gpu_driver   gpu_factor      numa_id       vendor          mig
dgxa      8      0 TeslaA100_SXM4_    450.51.06          8.0            3       Nvidia            Y
                 1 TeslaA100_SXM4_    450.51.06          8.0            3       Nvidia            Y
                 2 TeslaA100_SXM4_    450.51.06          8.0            1       Nvidia            Y
                 3 TeslaA100_SXM4_    450.51.06          8.0            1       Nvidia            Y
                 4 TeslaA100_SXM4_    450.51.06          8.0            7       Nvidia            Y
                 5 TeslaA100_SXM4_    450.51.06          8.0            7       Nvidia            Y
                 6 TeslaA100_SXM4_    450.51.06          8.0            5       Nvidia            Y
                 7 TeslaA100_SXM4_    450.51.06          8.0            5       Nvidia            Y

 
As expected, there are 8 physical A100 GPUs in the system, and MIG has been enabled for all of them. We can view the MIG configuration with the “-mig” flag:

$ lshosts -gpu -mig
HOST_NAME gpu_id  gpu_model        gpu_driver  gpu_factor  numa_id  vendor  devid  gid  cid  inst_name
dgxa      0       TeslaA100_SXM4_  450.51.06   8.0         3        Nvidia  0      11   0    1g.5gb
          0       TeslaA100_SXM4_  450.51.06   8.0         3        Nvidia  0      13   0    1g.5gb
          1       TeslaA100_SXM4_  450.51.06   8.0         3        Nvidia  0      3    0    2g.10gb
          2       TeslaA100_SXM4_  450.51.06   8.0         1        Nvidia  0      2    0    3g.20gb
          3       TeslaA100_SXM4_  450.51.06   8.0         1        Nvidia  0      1    0    4g.20gb
          4       TeslaA100_SXM4_  450.51.06   8.0         7        Nvidia  0      0    0    7g.40gb
          5       TeslaA100_SXM4_  450.51.06   8.0         7        Nvidia  -      -    -    - 
          6       TeslaA100_SXM4_  450.51.06   8.0         5        Nvidia  -      -    -    -
          7       TeslaA100_SXM4_  450.51.06   8.0         5        Nvidia  -      -    -    -


This matches the nvidia-smi output, and we can also clearly see that GPUs 5, 6 and 7 have no instances defined.

 
We can submit jobs and request the amount of GPU memory required, or the specific MIG configuration/slices desired.  First, let’s submit a job requesting a single MIG slice: 

$ bsub -gpu "num=1:mig=1/1" ./e06-gpu

Job <416> is submitted to default queue <normal>.
 
$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
416     rladmin RUN   normal     dgxa        dgxa        e06-gpu    Nov 25 06:19

  

We can view the job detail to see what has been allocated:

 $ bjobs -l -gpu 416
Job <416>, User <rladmin>, Project <default>, Status <RUN>, Queue <normal>, Command <./e06-gpu>, Share group charged </rladmin>
Wed Nov 25 06:19:01: Submitted from host <dgxa-c18-u19-enp226s0>, CWD <$HOME>
                     Requested GPU <num=1:mig=1/1>;
Wed Nov 25 06:19:01: Started 1 Task(s) on Host(s) <dgxa>, Allocated 1 Slot(s)
                     on Host(s) <dgxa>,
                     Execution Home </home/rladmin>,
                     Execution CWD </home/rladmin>;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 RESOURCE REQUIREMENT DETAILS:
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_physical=1.00:mig=1/1]
Effective: select[((ngpus>0)) && (type == local)] order[r15s:pg] rusage[ngpus_physical=1.00:mig=1/1]

 GPU REQUIREMENT DETAILS:
 Combined: num=1:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia:mig=1/1
 Effective: num=1:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia:mig=1/1

 GPU_ALLOCATION:
 HOST TASK GPU_ID  GI_ID/SIZE    CI_ID/SIZE    MODEL        MTOTAL  FACTOR MRSV    SOCKET NVLINK/XGMI
dgxa 0    0       4/1           4/1           TeslaA100_SX 39.5G   8.0    0M      3      -


We can also use LSF’s “bhosts” command to get detailed information about how jobs have been allocated to GPUs:

$ bhosts -gpu -l
HOST: dgxa
NGPUS NGPUS_SHARED_AVAIL NGPUS_EXCLUSIVE_AVAIL
8     8                  8

STATIC ATTRIBUTES
GPU_ID MODEL                MTOTAL    FACTOR   SOCKET VENDOR   MIG    NVLINK/XGMI
0      TeslaA100_SXM4_40GB  39.5G     8.0      3      Nvidia   Y      -/N/N/N/N/N/N/N
1      TeslaA100_SXM4_40GB  39.5G     8.0      3      Nvidia   Y      N/-/N/N/N/N/N/N
2      TeslaA100_SXM4_40GB  39.5G     8.0      1      Nvidia   Y      N/N/-/N/N/N/N/N
3      TeslaA100_SXM4_40GB  39.5G     8.0      1      Nvidia   Y      N/N/N/-/N/N/N/N
4      TeslaA100_SXM4_40GB  39.5G     8.0      7      Nvidia   Y      N/N/N/N/-/N/N/N
5      TeslaA100_SXM4_40GB  39.5G     8.0      7      Nvidia   Y      N/N/N/N/N/-/N/N
6      TeslaA100_SXM4_40GB  39.5G     8.0      5      Nvidia   Y      N/N/N/N/N/N/-/N
7      TeslaA100_SXM4_40GB  39.5G     8.0      5      Nvidia   Y      N/N/N/N/N/N/N/-

DYNAMIC ATTRIBUTES
GPU_ID MODE               MUSED     MRSV      TEMP   ECC    UT     MUT    PSTATE STATUS   ERROR
0      SHARED             87M       0M        31C    0      0%      0%   0      ok       -
1      SHARED             7M        0M        27C    0      0%      0%   0      ok       -
2      SHARED             11M       0M        28C    0      0%      0%   0      ok       -
3      SHARED             14M       0M        28C    0      0%      0%   0      ok       -
4      SHARED             0M        0M        31C    0      0%      0%   0      ok       -
5      SHARED             0M        0M        30C    0      0%      0%   0      ok       -
6      SHARED             0M        0M        30C    0      0%      0%   0      ok       -
7      SHARED             0M        0M        30C    0      0%      0%   0      ok       -

GPU JOB INFORMATION
GPU_ID JEXCL  RUNJOBIDS          SUSPJOBIDS         RSVJOBIDS          GI_ID/SIZE   CI_ID/SIZE
0      Y      416                -                  -                  4/1          4/1
1      -      -                  -                  -                  -            -
2      -      -                  -                  -                  -            -
3      -      -                  -                  -                  -            -
4      -      -                  -                  -                  -            -
5      -      -                  -                  -                  -            -
6      -      -                  -                  -                  -            -
7      -      -                  -                  -                  -            -

 

It is worth noting that the output shows no NVLINK connectivity between the GPUs; NVLINK is not currently available when the GPU is in MIG mode.

Now let’s submit two more jobs, specifying the desired GPU memory rather than slices:

$ bsub -gpu "num=1:gmem=30G" ./e06-gpu
Job <417> is submitted to default queue <normal>.

$ bsub -gpu "num=1:gmem=30G" ./e06-gpu
Job <418> is submitted to default queue <normal>.

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
417     rladmin RUN   normal     dgxa        dgxa        e06-gpu    Nov 25 06:25
418     rladmin PEND  normal     dgxa                    e06-gpu    Nov 25 06:25

 

Job 417 is dispatched, but job 418 cannot be dispatched due to insufficient GPU resources.

$ bjobs -l 418
Job <418>, User <rladmin>, Project <default>, Status <PEND>, Queue <normal>, Command <./e06-gpu>
Wed Nov 25 06:25:15: Submitted from host <dgxa>, CWD <$HOME>, Requested GPU <num=1:gmem=30000.00>;

 PENDING REASONS:
 Host's available GPU resources cannot meet the job's requirements: dgxa;

  

Not enough memory? But we’re only running one 30GB job, and there are 7 GPUs sitting idle! However, there are no other predefined MIG instances that can accommodate job 418; it will have to wait until job 417 completes, or you can get your friendly admin to change the MIG config for you.


Or we can let LSF manage MIG creation/sizing.

Dynamic MIG

To enable LSF’s dynamic management of MIG, the Administrator needs to set LSF_MANAGE_MIG=Y in lsf.conf and then reconfigure LSF. Note that this change should be made when no applications are running on the GPUs.
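For reference, the lsf.conf change is a single line; a sketch of the fragment is shown below (the surrounding settings in your lsf.conf will differ):

```shell
# lsf.conf fragment: allow LSF to create and resize MIG instances on demand
LSF_MANAGE_MIG=Y
```

After saving the change (with nothing running on the GPUs), reconfigure LSF; `lsadmin reconfig` followed by `badmin mbdrestart` is the usual sequence after an lsf.conf change, but check your site's procedure.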


And it’s that simple!

Now MIG instances will be created on demand to meet each job’s GPU requirements. For example, consider this workload submitted to a first-come, first-served (FCFS) queue:

  • 8 jobs each requiring most of the memory on a GPU (32GB)
  • 56 jobs each requiring 4GB
  • And another 8 each requiring 26GB, which still needs a full GPU
$ repeat  8 bsub -gpu "num=1:gmem=32G" -J "large1" ./e06-gpu
$ repeat 56 bsub -gpu "num=1:gmem=4G" -J "small" ./e06-gpu
$ repeat  8 bsub -gpu "num=1:gmem=26G" -J "large2" ./e06-gpu

 

The GPUs are automatically reconfigured to have a single MIG instance per GPU, and the first 8 jobs run:

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2001    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2002    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2003    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2004    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2005    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2006    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2007    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2008    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2009    rladmin PEND  normal     dgxa        -           small      Nov 25 18:50
2010    rladmin PEND  normal     dgxa        -           small      Nov 25 18:50

2064    rladmin PEND  normal     dgxa        -           small      Nov 25 18:50
2065    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2066    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2067    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2068    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2069    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2070    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2071    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2072    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50

$ bhosts -gpu
HOST_NAME              ID           MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV
dgxa                    0 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        1 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        2 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        3 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        4 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        5 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        6 TeslaA100_SXM4_       16M       32G      1      1      0      0
                        7 TeslaA100_SXM4_       16M       32G      1      1      0      0

 

As they finish, LSF reconfigures MIG to meet the requirements of the small jobs; in this case each job fits in a single MIG slice, so seven MIG instances can be created per GPU.

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2009    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2010    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2011    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50
2012    rladmin RUN   normal     dgxa        dgxa        large1     Nov 25 18:50

2064    rladmin RUN   normal     dgxa        dgxa        small      Nov 25 18:50
2065    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2066    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2067    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2068    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2069    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2070    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2071    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50
2072    rladmin PEND  normal     dgxa        -           large2     Nov 25 18:50

$ bhosts -gpu
HOST_NAME              ID           MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV
dgxa                    0 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        1 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        2 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        3 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        4 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        5 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        6 TeslaA100_SXM4_      106M       28G      7      7      0      0
                        7 TeslaA100_SXM4_      106M       28G      7      7      0      0
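As a quick sanity check on this output: each 4GB job fits in a single 1g.5gb slice, each A100 can host up to seven such slices, and the DGX A100 has eight GPUs, so all 56 “small” jobs can run concurrently:

```shell
# 7 MIG slices per A100 x 8 GPUs in a DGX A100
slices_per_gpu=7
gpus=8
concurrent_small_jobs=$((slices_per_gpu * gpus))
echo "$concurrent_small_jobs"   # 56
```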

 

And as those finish, the GPUs are once again automatically reconfigured, to allow the “large2” jobs to run:

$ bjobs

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2065    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2066    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2067    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2068    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2069    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2070    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2071    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50
2072    rladmin RUN   normal     dgxa        dgxa        large2     Nov 25 18:50

$ bhosts -gpu
HOST_NAME              ID           MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV
dgxa                    0 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        1 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        2 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        3 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        4 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        5 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        6 TeslaA100_SXM4_       16M       26G      1      1      0      0
                        7 TeslaA100_SXM4_       16M       26G      1      1      0      0

 

By enabling LSF to manage the MIG configuration, the capabilities of the A100 card and the DGX A100 system can be automatically right-sized to fit the incoming workload, enabling greater utilization and throughput.

Acknowledgements

I’d like to thank Hongzhong Luan of IBM for implementing this capability, and I’d also like to thank NVIDIA for providing us access to A100 and DGX A100 systems.

The patch to enable this capability can be downloaded from IBM Fix Central.
