High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

  • 1.  Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted Mon March 31, 2025 10:15 AM

    Hi,

    For a host with 8 GPUs, we want to tell LSF not to use two specific ones.

    Is there an option to configure this?

    Thanks,

    -- Shali --



    ------------------------------
    Shali Boharon
    ------------------------------


  • 2.  RE: Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted 14 days ago

    I'd like to share a response we previously received from IBM.

    There is an LSF configuration to restrict the maximum number of GPUs that can be used by LSF, but it is not possible to specify which individual GPUs should not be used. For example, if there are 8 GPUs on a host, you can configure LSF to use only 7 of them, but you cannot configure it to exclude a specific GPU, such as GPU #4.



    ------------------------------
    Chulmin KIM
    ------------------------------



  • 3.  RE: Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted 14 days ago

    Let's unpack the question and the earlier answer to clarify the situation and the possible solutions.


    Question Summary:

    • You have a host with multiple GPUs (e.g., 8).

    • You want to restrict LSF (IBM Spectrum LSF) from using specific GPUs (e.g., GPU #4 or GPU #6 and #7).

    • LSF can restrict the number of GPUs used on a host but not which specific ones.

    • So, can you configure LSF to exclude specific GPUs, not just limit the total number?


    Key Facts about LSF and CUDA GPU Handling:

    1. LSF GPU Scheduling Basics:

      • LSF can schedule GPUs using the resource management system.

      • You can control GPU usage through CUDA_VISIBLE_DEVICES, which LSF sets automatically when managing GPU resources.

      • LSF doesn't inherently know about GPU topology or health - it sees GPU slots as abstract resources (e.g., gpu).

    2. Limiting Total GPUs on Host:

      • You can configure the number of GPUs available per host using lsf.conf or host-level resource settings.

      • Example:

        text
        Begin Resource
        RESOURCENAME  TYPE     INTERVAL  DURATION  DESCRIPTION
        gpu           Boolean  60        60        (gpu)
        End Resource

        Begin Host
        HOST_NAME  MXJ  ncpus  ngpus
        hostA      8    16     6
        End Host
      • This restricts the total number of GPUs available for jobs but not specific ones.


    Answer Analysis (Rephrased):

    "You can configure LSF to use only 7 of 8 GPUs, but not configure it to exclude a specific GPU like GPU #4."

    This is accurate: LSF itself cannot directly specify which GPU indexes (e.g., 0–7) are to be used or excluded. It just counts GPUs as generic resources.


    How to Exclude Specific GPUs: Workarounds

    To exclude specific GPUs (e.g., GPU #4), you can use host-level workarounds in combination with LSF's environment variable handling:

    ✅ Option 1: Use cgroups and NVIDIA GPU visibility restrictions

    • You can configure the system or job environment to hide specific GPUs from the driver using:

      bash
      export CUDA_VISIBLE_DEVICES=0,1,2,3,5,6,7 # Exclude GPU #4 (which is index 4)
    • You could script this into LSF's job prolog or enforce it via job submission wrappers.
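
    As a minimal sketch of such a wrapper (the excluded index, the 8-GPU default list, and the script itself are illustrative assumptions, not an LSF-provided interface), the idea is to strip GPU 4 from whatever device list is already set before launching the real job:

```shell
#!/bin/sh
# Hypothetical job-wrapper sketch: drop GPU index 4 from the device list.
# Assumes an 8-GPU host; the default list is only for illustration.
EXCLUDED=4
CUDA_VISIBLE_DEVICES=$(echo "${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}" \
    | tr ',' '\n' | grep -v "^${EXCLUDED}$" | paste -sd, -)
export CUDA_VISIBLE_DEVICES
# exec "$@"   # then hand off to the real job command
echo "$CUDA_VISIBLE_DEVICES"
```

    Installed as a prolog or wrapper, this runs after LSF sets CUDA_VISIBLE_DEVICES, so the job only ever sees the filtered list.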

    ✅ Option 2: Configure a Node as Having Fewer GPUs

    • Modify the number of GPUs seen by LSF:

      • Create a wrapper for LSF's GPU detection (e.g., intercept /usr/bin/nvidia-smi so that excluded devices are not reported).

      • Create virtual GPU pools on the host using nvidia-container-toolkit or nvidia-smi with MIG (on supported GPUs).

      • These methods can mask specific devices from the system or container environment.

    ✅ Option 3: Custom CUDA_VISIBLE_DEVICES via bsub

    You can override what LSF sets using a submission command:

    bash
    bsub -gpu "num=6" -env "CUDA_VISIBLE_DEVICES=0,1,2,3,5,6" my_script.sh

    This tells your job which GPUs to use explicitly.

    ⚠️ Important: LSF will still think it's using all 6 GPUs, but your job will only "see" those specific ones.


    To the Question: "Is there an option to configure it?"

    Short Answer:
    LSF does not natively support excluding specific GPUs by ID/index, but you can work around it by masking GPUs using:

    • CUDA_VISIBLE_DEVICES in job submission or prolog

    • cgroups or device-level visibility settings

    • node-level configuration (setting ngpus lower, or using MIG on Ampere+ GPUs)


    Back to the original request - "we want to tell LSF not to use two specific GPUs" - is it possible?

    Directly, no. Indirectly, yes, using the same workarounds.


    Final Summary

    Feature                                  | Native LSF Support?  | Workaround Available?
    Limit total number of GPUs per host      | ✅ Yes               | -
    Exclude specific GPU IDs (e.g., GPU #4)  | ❌ No                | ✅ Yes (via env vars, masking, MIG)
    Use specific GPUs in job                 | ❌ No direct support | ✅ Yes, via CUDA_VISIBLE_DEVICES in job env
    Best,

    ------------------------------
    Vaia Tausiani
    IT, SLP, M.Sc.
    ------------------------------





  • 4.  RE: Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted 14 days ago

    What you can do:

    You can configure LSF to recognize and allocate up to N GPUs per host by modifying:

    • lsf.conf

    • lsf.shared and lsf.cluster.<clustername>

    • GPU resource definition (e.g., ngpus, RES_REQ, etc.)

    Example:

    text
    Begin Host
    HOST_NAME  ngpus
    host123    7
    End Host

    This tells LSF: "only 7 GPUs are available on this host."
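
    After editing the configuration files, LSF has to re-read them before the change takes effect. A sketch of the usual reload sequence (standard LSF administration commands, run as the cluster administrator; exact steps can vary by LSF version):

```shell
# Re-read the LIM configuration, restart mbatchd so the new ngpus
# value is picked up, then inspect what LSF now reports for the host.
lsadmin reconfig
badmin mbdrestart
bhosts -l host123
```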


    What you cannot do:

    You cannot specify which GPU indexes (e.g., GPU 4 or GPU 7) to exclude in LSF configuration files.

    LSF treats GPUs as anonymous resources. It doesn't have a native mechanism to bind jobs to specific device IDs (like GPU 0, 1, 2...).


    Why LSF Cannot Exclude Specific GPUs

    • LSF's GPU support is resource-based (gpu, ngpus), not device-ID aware.

    • GPU selection is delegated to the application via CUDA_VISIBLE_DEVICES, which LSF sets automatically based on GPU allocation.

    • LSF doesn't interrogate or act on:

      • GPU health (e.g., failed or degraded GPUs),

      • topology (e.g., NVLink connections),

      • or specific GPU indexes.


    Workarounds to Exclude Specific GPUs (e.g., GPU #4)

    If you really need to exclude individual GPU(s) (e.g., faulty or reserved), you can use the following methods:

    1. Mask GPUs via CUDA_VISIBLE_DEVICES (Recommended)

    Use a prolog script or job wrapper to exclude GPU 4:

    bash
    # Show only GPUs 0-3 and 5-7
    export CUDA_VISIBLE_DEVICES=0,1,2,3,5,6,7

    You can inject this into every job using:

    • LSB_JOB_ENVFILE in a job prolog (see earlier script),

    • or user-defined job wrappers.


    2. Use cgroups or Docker/NVIDIA runtime restrictions

    If you're using cgroups or containers, restrict visible GPUs:

    • Use NVIDIA Docker runtime with --gpus '"device=0,1,2,3,5,6,7"'

    • Or restrict /dev/nvidia4 from container mount points.
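
    A small sketch of how that device list might be assembled from an exclude list (the ALL/EXCLUDE values, image name, and command are illustrative placeholders; assumes Docker with the NVIDIA container runtime):

```shell
#!/bin/sh
# Build a Docker --gpus device list that skips GPU 4.
ALL="0 1 2 3 4 5 6 7"
EXCLUDE="4"
KEEP=""
for i in $ALL; do
    # Append each index that is not in the exclude list.
    echo "$EXCLUDE" | grep -qw "$i" || KEEP="${KEEP}${KEEP:+,}$i"
done
echo "$KEEP"   # 0,1,2,3,5,6,7
# docker run --rm --gpus "\"device=${KEEP}\"" my-image my-app   # illustrative
```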


    3. Disable a GPU at the OS Level

    If GPU #4 is faulty or reserved:

    • Use nvidia-smi to set GPU 4's compute mode to PROHIBITED, so no compute processes can run on it.

    • Or physically disable it via BIOS or kernel boot args (advanced).
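
    For the compute-mode approach, a sketch (requires root; GPU index 4 is the example device, and the setting does not persist across reboots unless re-applied):

```shell
# Disallow new compute processes on GPU 4 by setting its compute mode
# to PROHIBITED (mode 2 in nvidia-smi's --compute-mode options).
nvidia-smi -i 4 -c PROHIBITED
# Confirm the setting:
nvidia-smi -i 4 --query-gpu=compute_mode --format=csv,noheader
```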


    Summary Table

    Feature                                    | Supported by LSF? | Workaround Needed?
    Limit total GPUs per host                  | ✅ Yes            | No
    Exclude specific GPU index (e.g., GPU #4)  | ❌ No             | ✅ Yes
    Force job to use only certain GPUs         | ❌ No (natively)  | ✅ Yes (env/cgroup)