We've encountered some issues when assigning GPUs through LSF in environments where MIG-enabled and standard GPUs coexist. The system under test is equipped with 8 NVIDIA A100-SXM4-80GB GPUs: 4 with MIG enabled and 4 in standard mode.
[root@gpgn14 ~]# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-3ca34270-8679-0d13-b6bc-1fbde96f46d6)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-59865de3-ab00-1089-568b-a5e2e10ff965)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-8cd4fcbf-2b89-44cc-b07b-177499452fcf)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4f852944-3f9c-e74e-bf54-db2cb6561c2c)
  MIG 3g.40gb    Device  0: (UUID: MIG-7d5c6e34-6183-5389-9015-3e0160617d58)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-11d7cddf-37d1-4d0f-6b47-0d0d31d8077c)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-401115e9-18f5-ef7d-3e42-4a492f351a8d)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-983d1edb-e5ff-8f92-7ea7-0a43da2eb385)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-09735f83-5d36-1f19-7411-4c69f6f7fe17)
[root@gpgn14 ~]# nvidia-smi
Tue Jul 15 15:19:01 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | On |
| N/A 31C P0 52W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0B:00.0 Off | On |
| N/A 30C P0 47W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:48:00.0 Off | On |
| N/A 28C P0 48W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4C:00.0 Off | On |
| N/A 30C P0 46W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:88:00.0 Off | 0 |
| N/A 30C P0 58W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:8B:00.0 Off | 0 |
| N/A 33C P0 60W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C8:00.0 Off | 0 |
| N/A 33C P0 64W / 400W | 1517MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:CB:00.0 Off | 0 |
| N/A 33C P0 66W / 400W | 22419MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 3 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
When using the following LSF directive:
#BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes"
LSF then sets the CUDA_VISIBLE_DEVICES-related environment variables (the exact values from our test session are shown further below). The problem is that a value such as 4 does not refer to a specific physical GPU. It is simply an index into the list of devices the CUDA runtime enumerates among those currently available, so it does not guarantee which physical GPU - or MIG instance - will be assigned.
However, when we specify MIG explicitly, like so:
#BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes:mig=3/3"
LSF correctly targets a MIG device, and the environment variable is set in the form:
CUDA_VISIBLE_DEVICES=MIG-<GPU-UUID>/<GPU instance ID>/<Compute instance ID>
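For reference, that identifier can be assembled from (or split back into) its parts like so. The helper names are mine, not part of any LSF or NVIDIA API, and the example values are GPU 3's UUID together with the GI/CI IDs from the MIG table above:

```python
def mig_device_string(gpu_uuid: str, gi_id: int, ci_id: int) -> str:
    """Build the MIG identifier in the
    MIG-<GPU-UUID>/<GPU instance ID>/<Compute instance ID> form."""
    return f"MIG-{gpu_uuid}/{gi_id}/{ci_id}"

def split_mig_device_string(s: str):
    """Inverse: recover (gpu_uuid, gi_id, ci_id) from the identifier."""
    body = s[len("MIG-"):] if s.startswith("MIG-") else s
    gpu_uuid, gi, ci = body.rsplit("/", 2)  # UUID itself contains no "/"
    return gpu_uuid, int(gi), int(ci)
```

For GPU 3 above this yields `MIG-GPU-4f852944-3f9c-e74e-bf54-db2cb6561c2c/2/0`.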
Now, here's where things get problematic. If you have a mix of MIG-enabled and standard GPUs and you do not use the mig keyword, LSF may still assign a MIG instance - potentially leading to confusion or unexpected behavior. In the situation above, the job runs on a MIG instance:
[test@gpgn14 singularity]$ export | grep CUDA
declare -x CUDA_VISIBLE_DEVICES="4"
declare -x CUDA_VISIBLE_DEVICES1="4"
declare -x CUDA_VISIBLE_DEVICES_ORIG="4"
[test@gpgn14 singularity]$ python3 gpu_test.py &
[1] 18213
CUDA initialized. Sleeping for 10 seconds...
[test@gpgn14 singularity]$ nvidia-smi
Thu Jul 17 15:23:21 2025
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 1 0 17883 C python3 174MiB |
Users would reasonably expect each job to occupy one full physical GPU. However, tracking the CUDA_VISIBLE_DEVICES values assigned across jobs, we observed the following:
- CUDA_VISIBLE_DEVICES=0, 1, 2, 3 correspond to GPUs #4, #5, #6, and #7 - all of which are standard (non-MIG) GPUs.
- CUDA_VISIBLE_DEVICES=4 corresponds to the MIG instance.
This makes sense, as GPUs #0 to #3 are MIG-enabled and are apparently skipped when the full GPUs are enumerated.
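The mapping we observed can be written down explicitly. To be clear, this ordering is an empirical finding on this one host, not documented NVIDIA or LSF behavior, and the helper name is mine:

```python
# Model of the enumeration we OBSERVED on this host; this ordering is an
# empirical finding, not documented NVIDIA or LSF behavior.
NON_MIG_GPUS = [4, 5, 6, 7]      # physical indices of the non-MIG GPUs
MIG_INSTANCES = [(3, 2, 0)]      # (physical GPU, GI ID, CI ID) per MIG device

def resolve(cuda_index: int):
    """Map a bare CUDA_VISIBLE_DEVICES index to what it lands on:
    non-MIG GPUs come first, then MIG instances."""
    if cuda_index < len(NON_MIG_GPUS):
        return ("full-gpu", NON_MIG_GPUS[cuda_index])
    return ("mig", MIG_INSTANCES[cuda_index - len(NON_MIG_GPUS)])
```

Under this model, index 3 resolves to physical GPU #7 while index 4 falls through to the lone MIG instance on GPU #3 - exactly what our jobs experienced.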
The issue arises when a job is assigned CUDA_VISIBLE_DEVICES=4, which unexpectedly maps to a MIG instance (MIG 3g.40gb). Despite not using the mig keyword, the job ends up utilizing a MIG device.
We verified that setting CUDA_VISIBLE_DEVICES=4 on this system does indeed run the job on a MIG device. We were unable to find any NVIDIA documentation stating that CUDA device index values (like 4) can resolve to MIG instances; NVIDIA officially recommends specifying MIG devices by their full identifiers (e.g., MIG-<UUID> or MIG-<GPU-UUID>/<GI-ID>/<CI-ID>).
This indicates that the CUDA_VISIBLE_DEVICES=<index> format may unintentionally target MIG devices - not just physical GPUs - which can lead to unexpected job behavior when MIG and non-MIG GPUs coexist on the same host.
To be clear: when a job explicitly requests a MIG device using the -gpu "num=1:mig=..." option, LSF correctly isolates and assigns a MIG instance. No issue arises in such cases.
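This is also why a naive environment-variable check cannot rescue a wrapper script: the conventional way to recognize a MIG assignment is the MIG- prefix, and a bare index like 4 sails right past it even though, as shown above, it can still resolve to a MIG instance. A sketch (helper names are mine):

```python
import os

def visible_entries(env=None):
    """Split CUDA_VISIBLE_DEVICES into its comma-separated entries."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [e.strip() for e in raw.split(",") if e.strip()]

def looks_like_mig(entry: str) -> bool:
    # MIG targets are normally written MIG-<UUID> or
    # MIG-<GPU-UUID>/<GI>/<CI>. A bare index carries no such prefix,
    # yet on a mixed MIG/non-MIG host it can still resolve to a MIG
    # instance - so this check gives a false negative for "4".
    return entry.startswith("MIG-")
```

The false negative for a bare index is precisely the gap we are reporting.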
Summary:
- When submitting jobs to a server with both MIG-enabled and standard GPUs, and not using the mig keyword, your job might still end up running on a MIG instance.
- It's unclear whether this is intended LSF behavior, but from an end-user perspective it can be very confusing.
- Most users would assume that omitting the mig keyword ensures the job runs on a full physical GPU - which is not always the case.
All of the above findings are based on our internal testing, and there may be inaccuracies. If anything I've described is incorrect, I would greatly appreciate your corrections. Also, if there is a documented or recommended method to clearly separate MIG and non-MIG devices in such mixed environments - without having to manually override environment variables - please let me know.
We are trying to avoid scripting custom wrappers to alter the environment, as it would require significant changes across our job submission infrastructure.
------------------------------
Chulmin KIM
------------------------------