High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

  • 1.  Potential Issues When Assigning GPUs in LSF with Mixed MIG and Non-MIG GPUs

    Posted 2 days ago
    Edited by Chulmin KIM 22 hours ago

    We've encountered some issues when assigning GPUs using LSF in environments where MIG-enabled and regular GPUs coexist. The system under test is equipped with 8 NVIDIA A100-SXM4-80GB GPUs - 4 with MIG enabled, and 4 in standard mode.

    [root@gpgn14 ~]# nvidia-smi -L
    GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-3ca34270-8679-0d13-b6bc-1fbde96f46d6)
    GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-59865de3-ab00-1089-568b-a5e2e10ff965)
    GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-8cd4fcbf-2b89-44cc-b07b-177499452fcf)
    GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4f852944-3f9c-e74e-bf54-db2cb6561c2c)
      MIG 3g.40gb     Device  0: (UUID: MIG-7d5c6e34-6183-5389-9015-3e0160617d58)
    GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-11d7cddf-37d1-4d0f-6b47-0d0d31d8077c)
    GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-401115e9-18f5-ef7d-3e42-4a492f351a8d)
    GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-983d1edb-e5ff-8f92-7ea7-0a43da2eb385)
    GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-09735f83-5d36-1f19-7411-4c69f6f7fe17)
    [root@gpgn14 ~]# nvidia-smi
    Tue Jul 15 15:19:01 2025
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                   On |
    | N/A   31C    P0              52W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                   On |
    | N/A   30C    P0              47W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM4-80GB          On  | 00000000:48:00.0 Off |                   On |
    | N/A   28C    P0              48W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM4-80GB          On  | 00000000:4C:00.0 Off |                   On |
    | N/A   30C    P0              46W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM4-80GB          On  | 00000000:88:00.0 Off |                    0 |
    | N/A   30C    P0              58W / 400W |      4MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM4-80GB          On  | 00000000:8B:00.0 Off |                    0 |
    | N/A   33C    P0              60W / 400W |      4MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM4-80GB          On  | 00000000:C8:00.0 Off |                    0 |
    | N/A   33C    P0              64W / 400W |   1517MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM4-80GB          On  | 00000000:CB:00.0 Off |                    0 |
    | N/A   33C    P0              66W / 400W |  22419MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                          |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
    |                  |               0MiB / 65535MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+

    When using the following LSF directive:

    #BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes"
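
    For context, a minimal job script using this directive might look like the sketch below; the queue name and output file name are placeholders for whatever your site uses, and gpu_test.py is just our small CUDA test program:

    #!/bin/bash
    #BSUB -q gpu_queue                # placeholder queue name
    #BSUB -o gpu_job.%J.out           # job output file
    #BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes"
    # any CUDA workload would do here
    python3 gpu_test.py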
    

    LSF internally sets the following environment variables:

    declare -x CUDA_VISIBLE_DEVICES="4" declare -x CUDA_VISIBLE_DEVICES1="4" declare -x CUDA_VISIBLE_DEVICES_ORIG="4"

    The problem here is that the value 4 does not identify a specific physical GPU. It is an index into the list of devices enumerated by the CUDA runtime among those currently available, so it does not, by itself, guarantee which physical GPU or MIG instance will be assigned.
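
    One way (just a sketch) to see what a given index resolves to on a particular node is to compare the LSF-provided value against the driver's own enumeration from inside the job:

    # inside the job, after LSF has exported the variables
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    # driver-side view: physical GPUs, their UUIDs, and MIG mode
    nvidia-smi --query-gpu=index,uuid,mig.mode.current --format=csv
    # full listing, including MIG instances under their parent GPUs
    nvidia-smi -L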

    However, when we specify MIG explicitly, like so: 

    #BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes:mig=3/3"
    

    LSF correctly targets a MIG device, and the environment variable is set in the form:

    CUDA_VISIBLE_DEVICES=MIG-<GPU-UUID>/<GPU instance ID>/<Compute instance ID>
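
    For example, for the 3g.40gb instance shown in the nvidia-smi output above (GPU 3, GI 2, CI 0), the value would look roughly like this:

    CUDA_VISIBLE_DEVICES=MIG-GPU-4f852944-3f9c-e74e-bf54-db2cb6561c2c/2/0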
    

    Now, here's where things get problematic. If you have a mix of MIG-enabled and standard GPUs, and you do not use the mig keyword, LSF may still assign a MIG instance - potentially leading to confusion or unexpected behavior.

    In the above situation, the job runs on a MIG instance.

    
    

    [test@gpgn14 singularity]$ export | grep CUDA
    declare -x CUDA_VISIBLE_DEVICES="4"
    declare -x CUDA_VISIBLE_DEVICES1="4"
    declare -x CUDA_VISIBLE_DEVICES_ORIG="4"
    [test@gpgn14 singularity]$ python3 gpu_test.py &
    [1] 18213
    CUDA initialized. Sleeping for 10 seconds...
    [test@gpgn14 singularity]$ nvidia-smi
    Thu Jul 17 15:23:21 2025

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0    1    0      17883      C   python3                                     174MiB |

    Users would reasonably expect that each job will occupy one full physical GPU. However, upon tracking the CUDA_VISIBLE_DEVICES values handed out to jobs, we observed the following (a sketch of one way to reproduce this mapping follows the list):

    • CUDA_VISIBLE_DEVICES=0, 1, 2, and 3 correspond to GPUs #4, #5, #6, and #7 - all of which are standard (non-MIG) GPUs.

    • CUDA_VISIBLE_DEVICES=4 corresponds to the MIG instance.

    • The first four values make sense, as GPUs #0 to #3 are MIG-enabled and are likely skipped in the normal enumeration.
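
    One rough way to reproduce this mapping is to submit a handful of single-GPU jobs that only report which device they were given (a sketch only; the queue name is a placeholder):

    for i in $(seq 1 5); do
        bsub -q gpu_queue -gpu "num=1:mode=exclusive_process:j_exclusive=yes" \
             -o gpu_map.%J.out 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'
    done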

    The issue arises when a job is assigned CUDA_VISIBLE_DEVICES=4, which unexpectedly maps to a MIG instance (MIG 3g.40gb). Despite not using the mig keyword, the job ends up running on a MIG device.

    We verified that setting CUDA_VISIBLE_DEVICES=4 on this system does indeed run the job on a MIG device. We were unable to find any NVIDIA documentation stating that CUDA device index values (like 4) can resolve to MIG instances. NVIDIA officially recommends specifying MIG devices using their full identifiers (e.g., MIG-<UUID> or MIG-<GPU-UUID>/<GI-ID>/<CI-ID>).
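
    A minimal sketch of that manual check, run directly on the node outside LSF (gpu_test.py just initializes CUDA on the first visible device and sleeps):

    export CUDA_VISIBLE_DEVICES=4
    python3 gpu_test.py &      # holds a CUDA context for a few seconds
    sleep 2
    nvidia-smi                 # the process shows up under a MIG GI/CI pair, not a bare GPU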

    This indicates that the CUDA_VISIBLE_DEVICES=<index> format may unintentionally target MIG devices - not just physical GPUs. This can lead to unexpected job behavior when MIG and non-MIG GPUs coexist on the same host.

    To be clear: when the job explicitly requests a MIG device using the -gpu "num=1:mig=..." option, LSF correctly isolates and assigns a MIG instance. No issue arises in such cases.

    Summary:

    • When submitting jobs to a server with both MIG-enabled and standard GPUs, and not using the mig keyword, your job might still end up running on a MIG instance.

    • It's unclear whether this is intended behavior from LSF, but from an end-user perspective, it can be very confusing.

    • Most users would assume that omitting the mig keyword ensures the job runs on a full physical GPU - which is not always the case.

    All of the above findings are based on our internal testing, and there may be inaccuracies. If anything I've described is incorrect, I would greatly appreciate your corrections. Also, if there is a documented or recommended method to clearly separate MIG and non-MIG devices in such mixed environments - without having to manually override environment variables - please let me know.

    We are trying to avoid scripting custom wrappers to alter the environment, as it would require significant changes across our job submission infrastructure.



    ------------------------------
    Chulmin KIM
    ------------------------------



  • 2.  RE: Potential Issues When Assigning GPUs in LSF with Mixed MIG and Non-MIG GPUs

    Posted 2 days ago

    You may try setting LSB_ENFORCE_GPU_MIG_JOB=Y in the lsb.params file. I got this from LSF patch build601927. According to its README, this parameter ensures that MIG-enabled GPUs are not allocated to GPU jobs submitted without the mig keyword.
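
    For anyone who wants to try this, a rough sketch of where the parameter would go (the exact lsb.params path depends on your cluster layout, and the parameter itself comes from the patch README mentioned above):

    # $LSB_CONFDIR/<cluster_name>/configdir/lsb.params
    Begin Parameters
    # ... existing parameters ...
    LSB_ENFORCE_GPU_MIG_JOB=Y
    End Parameters

    # then reconfigure so the change takes effect
    badmin reconfig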



    ------------------------------
    YI SUN
    ------------------------------