High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

  • 1.  Potential Issues When Assigning GPUs in LSF with Mixed MIG and Non-MIG GPUs

    Posted 2 days ago
    Edited by Chulmin KIM 22 hours ago

    We've encountered some issues when assigning GPUs using LSF in environments where MIG-enabled and regular GPUs coexist. The system under test is equipped with 8 NVIDIA A100-SXM4-80GB GPUs - 4 with MIG enabled, and 4 in standard mode.

    [root@gpgn14 ~]# nvidia-smi -L
    GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-3ca34270-8679-0d13-b6bc-1fbde96f46d6)
    GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-59865de3-ab00-1089-568b-a5e2e10ff965)
    GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-8cd4fcbf-2b89-44cc-b07b-177499452fcf)
    GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4f852944-3f9c-e74e-bf54-db2cb6561c2c)
      MIG 3g.40gb     Device  0: (UUID: MIG-7d5c6e34-6183-5389-9015-3e0160617d58)
    GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-11d7cddf-37d1-4d0f-6b47-0d0d31d8077c)
    GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-401115e9-18f5-ef7d-3e42-4a492f351a8d)
    GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-983d1edb-e5ff-8f92-7ea7-0a43da2eb385)
    GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-09735f83-5d36-1f19-7411-4c69f6f7fe17)
    [root@gpgn14 ~]# nvidia-smi
    Tue Jul 15 15:19:01 2025
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                   On |
    | N/A   31C    P0              52W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                   On |
    | N/A   30C    P0              47W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM4-80GB          On  | 00000000:48:00.0 Off |                   On |
    | N/A   28C    P0              48W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM4-80GB          On  | 00000000:4C:00.0 Off |                   On |
    | N/A   30C    P0              46W / 400W |     87MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM4-80GB          On  | 00000000:88:00.0 Off |                    0 |
    | N/A   30C    P0              58W / 400W |      4MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM4-80GB          On  | 00000000:8B:00.0 Off |                    0 |
    | N/A   33C    P0              60W / 400W |      4MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM4-80GB          On  | 00000000:C8:00.0 Off |                    0 |
    | N/A   33C    P0              64W / 400W |   1517MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM4-80GB          On  | 00000000:CB:00.0 Off |                    0 |
    | N/A   33C    P0              66W / 400W |  22419MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                          |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
    |                  |               0MiB / 65535MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+

    When using the following LSF directive:

    #BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes"
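
    For context, a minimal job script using this directive might look like the sketch below; the queue name and output file name are placeholders for whatever your site uses, and gpu_test.py is just our small CUDA test program:

    #!/bin/bash
    #BSUB -q gpu_queue                # placeholder queue name
    #BSUB -o gpu_job.%J.out           # job output file
    #BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes"
    # any CUDA workload would do here
    python3 gpu_test.py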
    

    LSF internally sets the following environment variables:

    declare -x CUDA_VISIBLE_DEVICES="4" declare -x CUDA_VISIBLE_DEVICES1="4" declare -x CUDA_VISIBLE_DEVICES_ORIG="4"

    The problem here is that the value 4 does not identify a specific physical GPU. It is an index into the list of devices enumerated by the CUDA runtime among those currently available, so it does not, by itself, guarantee which physical GPU or MIG instance will be assigned.
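
    One way (just a sketch) to see what a given index resolves to on a particular node is to compare the LSF-provided value against the driver's own enumeration from inside the job:

    # inside the job, after LSF has exported the variables
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    # driver-side view: physical GPUs, their UUIDs, and MIG mode
    nvidia-smi --query-gpu=index,uuid,mig.mode.current --format=csv
    # full listing, including MIG instances under their parent GPUs
    nvidia-smi -L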

    However, when we specify MIG explicitly, like so: 

    #BSUB -gpu "num=1:mode=exclusive_process:j_exclusive=yes:mig=3/3"
    

    LSF correctly targets a MIG device, and the environment variable is set in the form:

    CUDA_VISIBLE_DEVICES=MIG-<GPU-UUID>/<GPU instance ID>/<Compute instance ID>
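
    For example, for the 3g.40gb instance shown in the nvidia-smi output above (GPU 3, GI 2, CI 0), the value would look roughly like this:

    CUDA_VISIBLE_DEVICES=MIG-GPU-4f852944-3f9c-e74e-bf54-db2cb6561c2c/2/0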
    

    Now, here's where things get problematic. If you have a mix of MIG-enabled and standard GPUs, and you do not use the mig keyword, LSF may still assign a MIG instance - potentially leading to confusion or unexpected behavior.

    In the above situation, the job runs on a MIG instance.

    
    

    [test@gpgn14 singularity]$ export | grep CUDA
    declare -x CUDA_VISIBLE_DEVICES="4"
    declare -x CUDA_VISIBLE_DEVICES1="4"
    declare -x CUDA_VISIBLE_DEVICES_ORIG="4"
    [test@gpgn14 singularity]$ python3 gpu_test.py &
    [1] 18213
    CUDA initialized. Sleeping for 10 seconds...
    [test@gpgn14 singularity]$ nvidia-smi
    Thu Jul 17 15:23:21 2025

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0    1    0      17883      C   python3                                     174MiB |

    Users would reasonably expect that each job will occupy one full physical GPU. However, upon tracking the CUDA_VISIBLE_DEVICES values handed out to jobs, we observed the following (a sketch of one way to reproduce this mapping follows the list):

    • CUDA_VISIBLE_DEVICES=0, 1, 2, and 3 correspond to GPUs #4, #5, #6, and #7 - all of which are standard (non-MIG) GPUs.

    • CUDA_VISIBLE_DEVICES=4 corresponds to the MIG instance.

    • The first four values make sense, as GPUs #0 to #3 are MIG-enabled and are likely skipped in the normal enumeration.
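
    One rough way to reproduce this mapping is to submit a handful of single-GPU jobs that only report which device they were given (a sketch only; the queue name is a placeholder):

    for i in $(seq 1 5); do
        bsub -q gpu_queue -gpu "num=1:mode=exclusive_process:j_exclusive=yes" \
             -o gpu_map.%J.out 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'
    done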

    The issue arises when a job is assigned CUDA_VISIBLE_DEVICES=4, which unexpectedly maps to a MIG instance (MIG 3g.40gb). Despite not using the mig keyword, the job ends up running on a MIG device.

    We verified that setting CUDA_VISIBLE_DEVICES=4 on this system does indeed run the job on a MIG device. We were unable to find any NVIDIA documentation stating that CUDA device index values (like 4) can resolve to MIG instances. NVIDIA officially recommends specifying MIG devices using their full identifiers (e.g., MIG-<UUID> or MIG-<GPU-UUID>/<GI-ID>/<CI-ID>).
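
    A minimal sketch of that manual check, run directly on the node outside LSF (gpu_test.py just initializes CUDA on the first visible device and sleeps):

    export CUDA_VISIBLE_DEVICES=4
    python3 gpu_test.py &      # holds a CUDA context for a few seconds
    sleep 2
    nvidia-smi                 # the process shows up under a MIG GI/CI pair, not a bare GPU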

    This indicates that the CUDA_VISIBLE_DEVICES=<index> format may unintentionally target MIG devices - not just physical GPUs. This can lead to unexpected job behavior when MIG and non-MIG GPUs coexist on the same host.

    To be clear: when the job explicitly requests a MIG device using the -gpu "num=1:mig=..." option, LSF correctly isolates and assigns a MIG instance. No issue arises in such cases.

    Summary:

    • When submitting jobs to a server with both MIG-enabled and standard GPUs, and not using the mig keyword, your job might still end up running on a MIG instance.

    • It's unclear whether this is intended behavior from LSF, but from an end-user perspective, it can be very confusing.

    • Most users would assume that omitting the mig keyword ensures the job runs on a full physical GPU - which is not always the case.

    All of the above findings are based on our internal testing, and there may be inaccuracies. If anything I've described is incorrect, I would greatly appreciate your corrections. Also, if there is a documented or recommended method to clearly separate MIG and non-MIG devices in such mixed environments - without having to manually override environment variables - please let me know.

    We are trying to avoid scripting custom wrappers to alter the environment, as it would require significant changes across our job submission infrastructure.



    ------------------------------
    Chulmin KIM
    ------------------------------



  • 2.  RE: Potential Issues When Assigning GPUs in LSF with Mixed MIG and Non-MIG GPUs

    Posted 2 days ago

    You may try setting LSB_ENFORCE_GPU_MIG_JOB=Y in the lsb.params file. I got this from LSF patch build601927. According to its README, this parameter ensures that MIG-enabled GPUs are not allocated to GPU jobs submitted without the mig keyword.
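
    For anyone who wants to try this, a rough sketch of where the parameter would go (the exact lsb.params path depends on your cluster layout, and the parameter itself comes from the patch README mentioned above):

    # $LSB_CONFDIR/<cluster_name>/configdir/lsb.params
    Begin Parameters
    # ... existing parameters ...
    LSB_ENFORCE_GPU_MIG_JOB=Y
    End Parameters

    # then reconfigure so the change takes effect
    badmin reconfig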



    ------------------------------
    YI SUN
    ------------------------------