When I first saw the specs of the NVIDIA A100 GPU, my initial thought was: how do you use that many cores and that much GPU memory efficiently? The 40GB and the newer 80GB versions allow you to run very large models and applications, but many smaller workloads can't utilize the full compute capacity and memory of the A100.
With so much compute power available, it is more important than ever to ensure that its use is aligned with business priorities. For decades we've worked with NVIDIA, enhancing IBM Spectrum LSF to ensure that our users get the most out of their GPUs.
Using MIG
As GPUs become ever more powerful, it can be challenging to keep them busy with workloads that do not saturate the compute capacity of the GPU, such as low-batch inferencing or some HPC workloads. CUDA provides mechanisms for achieving concurrency, such as CUDA streams and CUDA MPS, but these technologies have limitations for running parallel work.
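For reference, CUDA MPS is enabled at the host level by starting its control daemon before launching work; the lines below are a minimal sketch of the usual start/stop sequence (paths and environment settings will vary by site):
$ export CUDA_VISIBLE_DEVICES=0          # limit MPS to the GPU(s) it should manage
$ nvidia-cuda-mps-control -d             # start the MPS control daemon
$ echo quit | nvidia-cuda-mps-control    # shut the daemon down when finished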
With the launch of the A100, a new capability known as Multi-Instance GPU (MIG) was introduced. MIG allows a single A100 to be subdivided into up to 7 GPU instances, so on a DGX A100 we can create up to 56 GPU instances. MIG instances can be created in different sizes, allowing Administrators to configure the A100 to better match the workload and thereby increase overall utilization. Each instance is fully isolated, with its own high-bandwidth memory, cache, and compute cores.
MIG relies on the Administrator to manually define the configuration and change it when required. If you have a relatively static, homogeneous workload, there is probably a set of predefined MIG configurations that will work for you; LSF will detect the existing MIG configuration and schedule work to the predefined instances. However, if you have a highly variable workload, the Administrator is going to be very busy manually changing the configuration to keep up with it.
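To give a sense of the manual effort involved, the commands below sketch how an Administrator might enable MIG on one A100 and carve it into instances with nvidia-smi; the profile mix shown is just an example, and the available profiles depend on the GPU and driver version:
$ sudo nvidia-smi -i 0 -mig 1                               # enable MIG mode on GPU 0
$ sudo nvidia-smi mig -lgip                                 # list the available GPU instance profiles
$ sudo nvidia-smi mig -i 0 -cgi 3g.20gb,2g.10gb,1g.5gb -C   # create GPU instances and their compute instances
$ nvidia-smi mig -lgi                                       # confirm what was created
Repeating this every time the workload mix changes quickly becomes a chore, which is exactly the problem the rest of this post addresses.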
Let LSF automate that for you
For NVIDIA datacenter GPUs, LSF automatically detects the GPUs and dynamically changes the compute mode of each GPU based on job requirements. You can read more about those capabilities here.
For the A100 we have extended this capability so that LSF dynamically reconfigures MIG based on workload requirements. For example, if you have a 40GB A100 and a job needs the full 40GB, LSF will configure it as a single 40GB GPU instance. But if the job only needs 20GB, MIG will be configured to create an instance with half the GPU memory, and so on.
And of course, if you have a DGX A100, it can be dynamically reconfigured between 8 GPUs and 56 GPU instances, or anywhere in between, based on the workload requirements, all without any Administrator intervention.
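As a quick preview of the submission syntax (shown in detail later in this post), a job simply states the GPU memory it needs and LSF takes care of the MIG sizing; ./my_app below is just a placeholder application:
$ bsub -gpu "num=1:gmem=20G" ./my_app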
This new capability allows the A100 configuration to be automatically right-sized to match the requirements of the workload, delivering greater GPU ROI.
In the rest of this blog, we will examine this new capability in more detail.
A look under the covers
Before looking at what we are doing for the A100, let's quickly recap the existing support for NVIDIA datacenter GPUs:
- Auto-detection/configuration: LSF automatically detects the presence of GPUs and configures itself to support them. There is no need to manually create resources, cgroups, or resource maps; it is all done automatically, and the GPUs are ready to be used as soon as LSF starts up.
- Workload-driven mode selection: Jobs can specify whether they need a GPU in shared or exclusive mode, and the mode will be automatically set before the job starts (see the example after this list).
- Isolation: The CPUs and GPUs assigned to a job are contained within a control group. This ensures that jobs cannot use GPUs they were not allocated, and that a job which did not request a GPU cannot use one.
- Accounting: Full accounting of how the GPU is used, including support for NVIDIA DCGM.
These are just four of the many capabilities LSF provides; if you are interested in a full account, please see the “LSF on DGX guide”.
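As an illustration of the workload-driven mode selection mentioned above, the GPU mode is simply part of the job's GPU requirement string; ./my_app is again a placeholder application:
$ bsub -gpu "num=1:mode=shared" ./my_app
$ bsub -gpu "num=1:mode=exclusive_process" ./my_app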
Static MIG Configuration
Let’s look at how this works on a DGX A100. First, we will consider the case where the Administrator has statically configured MIG, and LSF will use the predefined instances. From the “nvidia-smi” output below, we can see that the Administrator has manually configured two MIG instances on GPU 0, and one each on GPUs 1, 2, 3, and 4.
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 11 0 0 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 1 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
+------------------+----------------------+-----------+-----------------------+
| 1 3 0 0 | 7MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
+------------------+----------------------+-----------+-----------------------+
| 2 2 0 0 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
+------------------+----------------------+-----------+-----------------------+
| 3 1 0 0 | 14MiB / 20096MiB | 56 0 | 4 0 2 0 0 |
+------------------+----------------------+-----------+-----------------------+
| 4 0 0 0 | 0MiB / 40537MiB | 98 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
Using LSF’s “lshosts” command we can see the hardware configuration; the “-gpu” flag shows the details of the GPUs that have been detected:
$ lshosts -gpu
HOST_NAME gpu_id gpu_model gpu_driver gpu_factor numa_id vendor mig
Dgxa 8 0 TeslaA100_SXM4_ 450.51.06 8.0 3 Nvidia Y
1 TeslaA100_SXM4_ 450.51.06 8.0 3 Nvidia Y
2 TeslaA100_SXM4_ 450.51.06 8.0 1 Nvidia Y
3 TeslaA100_SXM4_ 450.51.06 8.0 1 Nvidia Y
4 TeslaA100_SXM4_ 450.51.06 8.0 7 Nvidia Y
5 TeslaA100_SXM4_ 450.51.06 8.0 7 Nvidia Y
6 TeslaA100_SXM4_ 450.51.06 8.0 5 Nvidia Y
7 TeslaA100_SXM4_ 450.51.06 8.0 5 Nvidia Y
As expected, there are 8 physical A100 GPUs in the system, and MIG has been enabled on all of them. We can view the MIG configuration with the “-mig” flag:
$ lshosts -gpu -mig
HOST_NAME gpu_id gpu_model gpu_driver gpu_factor numa_id vendor devid gid cid inst_name
dgxa 0 TeslaA100_SXM4_ 450.51.06 8.0 3 Nvidia 0 11 0 1g.5gb
0 TeslaA100_SXM4_ 450.51.06 8.0 3 Nvidia 0 13 0 1g.5gb
1 TeslaA100_SXM4_ 450.51.06 8.0 3 Nvidia 0 3 0 2g.10gb
2 TeslaA100_SXM4_ 450.51.06 8.0 1 Nvidia 0 2 0 3g.20gb
3 TeslaA100_SXM4_ 450.51.06 8.0 1 Nvidia 0 1 0 4g.20gb
4 TeslaA100_SXM4_ 450.51.06 8.0 7 Nvidia 0 0 0 7g.40gb
5 TeslaA100_SXM4_ 450.51.06 8.0 7 Nvidia - - - -
6 TeslaA100_SXM4_ 450.51.06 8.0 5 Nvidia - - - -
7 TeslaA100_SXM4_ 450.51.06 8.0 5 Nvidia - - - -
This matches the nvidia-smi output, and we can also clearly see that GPUs 5, 6, and 7 have no instances defined.
We can submit jobs and request the amount of GPU memory required, or the specific MIG configuration/slices desired. First, let’s submit a job requesting a single MIG slice:
$ bsub -gpu "num=1:mig=1/1" ./e06-gpu
Job <416> is submitted to default queue <normal>.
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
416 rladmin RUN normal dgxa dgxa e06-gpu Nov 25 06:19
We can view the job detail to see what has been allocated:
$ bjobs -l -gpu 416
Job <416>, User <rladmin>, Project <default>, Status <RUN>, Queue <normal>, Command <./e06-gpu>, Share group charged </rladmin>
Wed Nov 25 06:19:01: Submitted from host <dgxa-c18-u19-enp226s0>, CWD <$HOME>
Requested GPU <num=1:mig=1/1>;
Wed Nov 25 06:19:01: Started 1 Task(s) on Host(s) <dgxa>, Allocated 1 Slot(s)
on Host(s) <dgxa>,
Execution Home </home/rladmin>,
Execution CWD </home/rladmin>;
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
RESOURCE REQUIREMENT DETAILS:
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_physical=1.00:mig=1/1]
Effective: select[((ngpus>0)) && (type == local)] order[r15s:pg] rusage[ngpus_physical=1.00:mig=1/1]
GPU REQUIREMENT DETAILS:
Combined: num=1:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia:mig=1/1
Effective: num=1:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia:mig=1/1
GPU_ALLOCATION:
HOST TASK GPU_ID GI_ID/SIZE CI_ID/SIZE MODEL MTOTAL FACTOR MRSV SOCKET NVLINK/XGMI
dgxa 0 0 4/1 4/1 TeslaA100_SX 39.5G 8.0 0M 3 -
We can also use LSF’s “bhosts” command to get detailed information about how jobs have been allocated to GPUs:
$ bhosts -gpu -l
HOST: dgxa
NGPUS NGPUS_SHARED_AVAIL NGPUS_EXCLUSIVE_AVAIL
8 8 8
STATIC ATTRIBUTES
GPU_ID MODEL MTOTAL FACTOR SOCKET VENDOR MIG NVLINK/XGMI
0 TeslaA100_SXM4_40GB 39.5G 8.0 3 Nvidia Y -/N/N/N/N/N/N/N
1 TeslaA100_SXM4_40GB 39.5G 8.0 3 Nvidia Y N/-/N/N/N/N/N/N
2 TeslaA100_SXM4_40GB 39.5G 8.0 1 Nvidia Y N/N/-/N/N/N/N/N
3 TeslaA100_SXM4_40GB 39.5G 8.0 1 Nvidia Y N/N/N/-/N/N/N/N
4 TeslaA100_SXM4_40GB 39.5G 8.0 7 Nvidia Y N/N/N/N/-/N/N/N
5 TeslaA100_SXM4_40GB 39.5G 8.0 7 Nvidia Y N/N/N/N/N/-/N/N
6 TeslaA100_SXM4_40GB 39.5G 8.0 5 Nvidia Y N/N/N/N/N/N/-/N
7 TeslaA100_SXM4_40GB 39.5G 8.0 5 Nvidia Y N/N/N/N/N/N/N/-
DYNAMIC ATTRIBUTES
GPU_ID MODE MUSED MRSV TEMP ECC UT MUT PSTATE STATUS ERROR
0 SHARED 87M 0M 31C 0 0% 0% 0 ok -
1 SHARED 7M 0M 27C 0 0% 0% 0 ok -
2 SHARED 11M 0M 28C 0 0% 0% 0 ok -
3 SHARED 14M 0M 28C 0 0% 0% 0 ok -
4 SHARED 0M 0M 31C 0 0% 0% 0 ok -
5 SHARED 0M 0M 30C 0 0% 0% 0 ok -
6 SHARED 0M 0M 30C 0 0% 0% 0 ok -
7 SHARED 0M 0M 30C 0 0% 0% 0 ok -
GPU JOB INFORMATION
GPU_ID JEXCL RUNJOBIDS SUSPJOBIDS RSVJOBIDS GI_ID/SIZE CI_ID/SIZE
0 Y 416 - - 4/1 4/1
1 - - - - - -
2 - - - - - -
3 - - - - - -
4 - - - - - -
5 - - - - - -
6 - - - - - -
7 - - - - - -
It is worth noting that the output shows no NVLink connectivity between the GPUs; NVLink is not currently available when a GPU is in MIG mode.
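If you want to check the interconnect topology directly on the host, nvidia-smi can print its topology matrix:
$ nvidia-smi topo -m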
Now let’s submit two more jobs, specifying the desired GPU memory rather than slices:
$ bsub -gpu "num=1:gmem=30G" ./e06-gpu
Job <417> is submitted to default queue <normal>.
$ bsub -gpu "num=1:gmem=30G" ./e06-gpu
Job <418> is submitted to default queue <normal>.
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
417 rladmin RUN normal dgxa dgxa e06-gpu Nov 25 06:25
418 rladmin PEND normal dgxa e06-gpu Nov 25 06:25
Job 417 is dispatched, but job 418 cannot be dispatched due to insufficient GPU resources.
$ bjobs -l 418
Job <418>, User <rladmin>, Project <default>, Status <PEND>, Queue <normal>, Command <./e06-gpu>
Wed Nov 25 06:25:15: Submitted from host <dgxa>, CWD <$HOME>, Requested GPU <num=1:gmem=30000.00>;
PENDING REASONS:
Host's available GPU resources cannot meet the job's requirements: dgxa;
Not enough memory? But we’re only running one 30GB job, and there are 7 GPUs sitting idle! The problem is that none of the remaining predefined MIG instances can accommodate job 418, so it will have to wait until 417 completes, or you can ask your friendly admin to change the MIG configuration for you.
Or we can let LSF manage MIG creation/sizing.
Dynamic MIG
To enable LSF’s dynamic management of MIG, the Administrator needs to set LSF_MANAGE_MIG=Y in lsf.conf and then reconfigure LSF. Note that this change should be made when no applications are running on the GPUs.
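For reference, the change amounts to one line in lsf.conf plus a reconfiguration; the sequence below is a typical sketch (the exact reconfiguration steps can vary by LSF version, so check the documentation for your release):
$ grep LSF_MANAGE_MIG $LSF_ENVDIR/lsf.conf
LSF_MANAGE_MIG=Y
$ lsadmin reconfig       # reconfigure LIM
$ badmin mbdrestart      # restart mbatchd so the scheduler picks up the change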
And it’s that simple!
Now MIG instances will be created on demand to meet each job’s GPU requirements. For example, consider this workload submitted to a first-come, first-served (FCFS) queue:
- 8 jobs each requiring most of the memory on a GPU
- 56 jobs each requiring 4GB
- And another 8 jobs each requiring 26GB, which effectively needs a full GPU
$ repeat 8 bsub -gpu "num=1:gmem=32G" -J "large1" ./e06-gpu
$ repeat 56 bsub -gpu "num=1:gmem=4G" -J "small" ./e06-gpu
$ repeat 8 bsub -gpu "num=1:gmem=26G" -J "large2" ./e06-gpu
The GPUs are automatically reconfigured to have a single MIG instance per GPU, and the first 8 jobs run:
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2001 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2002 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2003 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2004 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2005 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2006 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2007 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2008 rladmin RUN normal dgxa dgxa large1 Nov 25 18:50
2009 rladmin PEND normal dgxa - small Nov 25 18:50
2010 rladmin PEND normal dgxa - small Nov 25 18:50
…
2064 rladmin PEND normal dgxa - small Nov 25 18:50
2065 rladmin PEND normal dgxa - large2 Nov 25 18:50
2066 rladmin PEND normal dgxa - large2 Nov 25 18:50
2067 rladmin PEND normal dgxa - large2 Nov 25 18:50
2068 rladmin PEND normal dgxa - large2 Nov 25 18:50
2069 rladmin PEND normal dgxa - large2 Nov 25 18:50
2070 rladmin PEND normal dgxa - large2 Nov 25 18:50
2071 rladmin PEND normal dgxa - large2 Nov 25 18:50
2072 rladmin PEND normal dgxa - large2 Nov 25 18:50
$ bhosts -gpu
HOST_NAME ID MODEL MUSED MRSV NJOBS RUN SUSP RSV
dgxa 0 TeslaA100_SXM4_ 16M 32G 1 1 0 0
1 TeslaA100_SXM4_ 16M 32G 1 1 0 0
2 TeslaA100_SXM4_ 16M 32G 1 1 0 0
3 TeslaA100_SXM4_ 16M 32G 1 1 0 0
4 TeslaA100_SXM4_ 16M 32G 1 1 0 0
5 TeslaA100_SXM4_ 16M 32G 1 1 0 0
6 TeslaA100_SXM4_ 16M 32G 1 1 0 0
7 TeslaA100_SXM4_ 16M 32G 1 1 0 0
As they finish, LSF reconfigures MIG to meet the requirements of the small jobs. In this case each job fits in a single MIG slice, so LSF creates 7 MIG instances per GPU.
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2009 rladmin RUN normal dgxa dgxa small Nov 25 18:50
2010 rladmin RUN normal dgxa dgxa small Nov 25 18:50
2011 rladmin RUN normal dgxa dgxa small Nov 25 18:50
2012 rladmin RUN normal dgxa dgxa small Nov 25 18:50
…
2064 rladmin RUN normal dgxa dgxa small Nov 25 18:50
2065 rladmin PEND normal dgxa - large2 Nov 25 18:50
2066 rladmin PEND normal dgxa - large2 Nov 25 18:50
2067 rladmin PEND normal dgxa - large2 Nov 25 18:50
2068 rladmin PEND normal dgxa - large2 Nov 25 18:50
2069 rladmin PEND normal dgxa - large2 Nov 25 18:50
2070 rladmin PEND normal dgxa - large2 Nov 25 18:50
2071 rladmin PEND normal dgxa - large2 Nov 25 18:50
2072 rladmin PEND normal dgxa - large2 Nov 25 18:50
$ bhosts -gpu
HOST_NAME ID MODEL MUSED MRSV NJOBS RUN SUSP RSV
dgxa 0 TeslaA100_SXM4_ 106M 28G 7 7 0 0
1 TeslaA100_SXM4_ 106M 28G 7 7 0 0
2 TeslaA100_SXM4_ 106M 28G 7 7 0 0
3 TeslaA100_SXM4_ 106M 28G 7 7 0 0
4 TeslaA100_SXM4_ 106M 28G 7 7 0 0
5 TeslaA100_SXM4_ 106M 28G 7 7 0 0
6 TeslaA100_SXM4_ 106M 28G 7 7 0 0
7 TeslaA100_SXM4_ 106M 28G 7 7 0 0
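If you are curious about the underlying MIG layout at this point, the same inspection commands used earlier will show the seven small instances LSF has created on each GPU (output omitted here):
$ lshosts -gpu -mig
$ nvidia-smi mig -lgi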
And as those finish, the GPUs are once again automatically reconfigured to allow the “large2” jobs to run:
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2065 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2066 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2067 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2068 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2069 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2070 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2071 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
2072 rladmin RUN normal dgxa dgxa large2 Nov 25 18:50
$ bhosts -gpu
HOST_NAME ID MODEL MUSED MRSV NJOBS RUN SUSP RSV
dgxa 0 TeslaA100_SXM4_ 16M 26G 1 1 0 0
1 TeslaA100_SXM4_ 16M 26G 1 1 0 0
2 TeslaA100_SXM4_ 16M 26G 1 1 0 0
3 TeslaA100_SXM4_ 16M 26G 1 1 0 0
4 TeslaA100_SXM4_ 16M 26G 1 1 0 0
5 TeslaA100_SXM4_ 16M 26G 1 1 0 0
6 TeslaA100_SXM4_ 16M 26G 1 1 0 0
7 TeslaA100_SXM4_ 16M 26G 1 1 0 0
By enabling LSF to manage the MIG configuration, the A100 card and DGX A100 system can be automatically right-sized to fit the incoming workload, delivering greater utilization and throughput.
Acknowledgements
I’d like to thank Hongzhong Luan of IBM for implementing this capability, and I’d also like to thank NVIDIA for providing us access to the A100 and DGX A100.
The patch to enable this capability can be downloaded from IBM Fix Central.
#SpectrumComputingGroup