High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

  • 1.  Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted Mon March 31, 2025 10:15 AM

    Hi,

    For a host with 8 GPUs, we want to tell LSF not to use two specific ones.

    Is there an option to configure this?

    Thanks,

    -- Shali --



    ------------------------------
    Shali Boharon
    ------------------------------


  • 2.  RE: Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted 14 days ago

    I'd like to share a response we previously received from IBM.

    There is an LSF configuration to restrict the maximum number of GPUs that can be used by LSF, but it is not possible to specify which individual GPUs should not be used. For example, if there are 8 GPUs on a host, you can configure LSF to use only 7 of them, but you cannot configure it to exclude a specific GPU, such as GPU #4.



    ------------------------------
    Chulmin KIM
    ------------------------------



  • 3.  RE: Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted 14 days ago

    Let's unpack the question and the earlier answer to clarify the situation and the possible solutions.


    Question Summary:

    • You have a host with multiple GPUs (e.g., 8).

    • You want to restrict LSF (IBM Spectrum LSF) from using specific GPUs (e.g., GPU #4 or GPU #6 and #7).

    • LSF can restrict the number of GPUs used on a host but not which specific ones.

    • So, can you configure LSF to exclude specific GPUs, not just limit the total number?


    Key Facts about LSF and CUDA GPU Handling:

    1. LSF GPU Scheduling Basics:

      • LSF can schedule GPUs using the resource management system.

      • You can control GPU usage through CUDA_VISIBLE_DEVICES, which LSF sets automatically when managing GPU resources.

      • LSF doesn't inherently know about GPU topology or health - it sees GPU slots as abstract resources (e.g., gpu).

    2. Limiting Total GPUs on Host:

      • You can configure the number of GPUs available per host using lsf.conf or host-level resource settings.

      • Example:

        text
        Begin Resource
        RESOURCENAME  TYPE     INTERVAL  DURATION  DESCRIPTION
        gpu           Boolean  60        60        (gpu)
        End Resource

        Begin Host
        HOST_NAME  MXJ  ncpus  ngpus
        hostA      8    16     6
        End Host
      • This restricts the total number of GPUs available for jobs but not specific ones.


    Answer Analysis (Rephrased):

    "You can configure LSF to use only 7 of 8 GPUs, but not configure it to exclude a specific GPU like GPU #4."

    This is accurate: LSF itself cannot directly specify which GPU indexes (e.g., 0–7) are to be used or excluded. It just counts GPUs as generic resources.


    How to Exclude Specific GPUs: Workarounds

    To exclude specific GPUs (e.g., GPU #4), you can use host-level workarounds in combination with LSF's environment variable handling:

    ✅ Option 1: Use cgroups and NVIDIA GPU visibility restrictions

    • You can configure the system or job environment to hide specific GPUs from the driver using:

      bash
      export CUDA_VISIBLE_DEVICES=0,1,2,3,5,6,7 # Exclude GPU #4 (which is index 4)
    • You could script this into LSF's job prolog or enforce it via job submission wrappers.
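
    As a minimal sketch of such a wrapper (the excluded index, the 8-GPU default list, and the script itself are illustrative assumptions, not an LSF-provided interface), the idea is to strip GPU 4 from whatever device list is already set before launching the real job:

```shell
#!/bin/sh
# Hypothetical job-wrapper sketch: drop GPU index 4 from the device list.
# Assumes an 8-GPU host; the default list is only for illustration.
EXCLUDED=4
CUDA_VISIBLE_DEVICES=$(echo "${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}" \
    | tr ',' '\n' | grep -v "^${EXCLUDED}$" | paste -sd, -)
export CUDA_VISIBLE_DEVICES
# exec "$@"   # then hand off to the real job command
echo "$CUDA_VISIBLE_DEVICES"
```

    Installed as a prolog or wrapper, this runs after LSF sets CUDA_VISIBLE_DEVICES, so the job only ever sees the filtered list.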

    ✅ Option 2: Configure a Node as Having Fewer GPUs

    • Modify the number of GPUs seen by LSF:

      • Create a wrapper for LSF's GPU detection (e.g., intercept /usr/bin/nvidia-smi so that excluded devices are not reported).

      • Create virtual GPU pools on the host using nvidia-container-toolkit or nvidia-smi with MIG (on supported GPUs).

      • These methods can mask specific devices from the system or container environment.

    ✅ Option 3: Custom CUDA_VISIBLE_DEVICES via bsub

    You can override what LSF sets using a submission command:

    bash
    bsub -gpu "num=6" -env "CUDA_VISIBLE_DEVICES=0,1,2,3,5,6" my_script.sh

    This tells your job which GPUs to use explicitly.

    ⚠️ Important: LSF will still think it's using all 6 GPUs, but your job will only "see" those specific ones.


    To the Question: "Is there an option to configure it?"

    Short Answer:
    LSF does not natively support excluding specific GPUs by ID/index, but you can work around it by masking GPUs using:

    • CUDA_VISIBLE_DEVICES in job submission or prolog

    • cgroups or device-level visibility settings

    • node-level configuration (setting ngpus lower, or using MIG on Ampere+ GPUs)


    Back to the original request - "we want to tell LSF not to use two specific GPUs" - is it possible?

    Directly, no. Indirectly, yes, using the same workarounds.


    Final Summary

    Feature                                  | Native LSF Support?  | Workaround Available?
    Limit total number of GPUs per host      | ✅ Yes               | -
    Exclude specific GPU IDs (e.g., GPU #4)  | ❌ No                | ✅ Yes (via env vars, masking, MIG)
    Use specific GPUs in job                 | ❌ No direct support | ✅ Yes, via CUDA_VISIBLE_DEVICES in job env
    Best,

    ------------------------------
    Vaia Tausiani
    IT, SLP, M.Sc.
    ------------------------------





  • 4.  RE: Exclude certain GPU from available GPUs for CUDA_VISIBLE_DEVICES

    Posted 14 days ago

    What you can do:

    You can configure LSF to recognize and allocate up to N GPUs per host by modifying:

    • lsf.conf

    • lsf.shared and lsf.cluster.<clustername>

    • GPU resource definition (e.g., ngpus, RES_REQ, etc.)

    Example:

    text
    Begin Host
    HOST_NAME  ngpus
    host123    7
    End Host

    This tells LSF: "only 7 GPUs are available on this host."
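
    After editing the configuration files, LSF has to re-read them before the change takes effect. A sketch of the usual reload sequence (standard LSF administration commands, run as the cluster administrator; exact steps can vary by LSF version):

```shell
# Re-read the LIM configuration, restart mbatchd so the new ngpus
# value is picked up, then inspect what LSF now reports for the host.
lsadmin reconfig
badmin mbdrestart
bhosts -l host123
```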


    What you cannot do:

    You cannot specify which GPU indexes (e.g., GPU 4 or GPU 7) to exclude in LSF configuration files.

    LSF treats GPUs as anonymous resources. It doesn't have a native mechanism to bind jobs to specific device IDs (like GPU 0, 1, 2...).


    Why LSF Cannot Exclude Specific GPUs

    • LSF's GPU support is resource-based (gpu, ngpus), not device-ID aware.

    • GPU selection is delegated to the application via CUDA_VISIBLE_DEVICES, which LSF sets automatically based on GPU allocation.

    • LSF doesn't interrogate or act on:

      • GPU health (e.g., failed or degraded GPUs),

      • topology (e.g., NVLink connections),

      • or specific GPU indexes.


    Workarounds to Exclude Specific GPUs (e.g., GPU #4)

    If you really need to exclude individual GPU(s) (e.g., faulty or reserved), you can use the following methods:

    1. Mask GPUs via CUDA_VISIBLE_DEVICES (Recommended)

    Use a prolog script or job wrapper to exclude GPU 4:

    bash
    # Show only GPUs 0-3 and 5-7
    export CUDA_VISIBLE_DEVICES=0,1,2,3,5,6,7

    You can inject this into every job using:

    • LSB_JOB_ENVFILE in a job prolog (see earlier script),

    • or user-defined job wrappers.


    2. Use cgroups or Docker/NVIDIA runtime restrictions

    If you're using cgroups or containers, restrict visible GPUs:

    • Use NVIDIA Docker runtime with --gpus '"device=0,1,2,3,5,6,7"'

    • Or restrict /dev/nvidia4 from container mount points.
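
    A small sketch of how that device list might be assembled from an exclude list (the ALL/EXCLUDE values, image name, and command are illustrative placeholders; assumes Docker with the NVIDIA container runtime):

```shell
#!/bin/sh
# Build a Docker --gpus device list that skips GPU 4.
ALL="0 1 2 3 4 5 6 7"
EXCLUDE="4"
KEEP=""
for i in $ALL; do
    # Append each index that is not in the exclude list.
    echo "$EXCLUDE" | grep -qw "$i" || KEEP="${KEEP}${KEEP:+,}$i"
done
echo "$KEEP"   # 0,1,2,3,5,6,7
# docker run --rm --gpus "\"device=${KEEP}\"" my-image my-app   # illustrative
```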


    3. Disable a GPU at the OS Level

    If GPU #4 is faulty or reserved:

    • Use nvidia-smi to set GPU 4's compute mode to PROHIBITED, so no compute processes can run on it.

    • Or physically disable it via BIOS or kernel boot args (advanced).
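
    For the compute-mode approach, a sketch (requires root; GPU index 4 is the example device, and the setting does not persist across reboots unless re-applied):

```shell
# Disallow new compute processes on GPU 4 by setting its compute mode
# to PROHIBITED (mode 2 in nvidia-smi's --compute-mode options).
nvidia-smi -i 4 -c PROHIBITED
# Confirm the setting:
nvidia-smi -i 4 --query-gpu=compute_mode --format=csv,noheader
```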


    Summary Table

    Feature                                    | Supported by LSF? | Workaround Needed?
    Limit total GPUs per host                  | ✅ Yes            | No
    Exclude specific GPU index (e.g., GPU #4)  | ❌ No             | ✅ Yes
    Force job to use only certain GPUs         | ❌ No (natively)  | ✅ Yes (env/cgroup)