High Performance Computing

High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

View Only

Back to Blog List

Fine tuning AI models with InstructLab on IBM LSF

By Gábor Samu posted Mon January 06, 2025 11:41 AM

All the best for 2025! This blog looks back on a demo which I created for SC24 last November to demonstrate InstructLab workflows running on an IBM LSF cluster. Let’s begin with a bit of background. I’d like to thank Michael Spriggs, STSM, IBM LSF for his contributions to this blog.

Released by IBM and Red Hat in May 2024, InstructLab is an open-source project which provides the ability to fine-tune LLMs by adding skills and knowledge, without having to retrain the model from scratch. InstructLab can run on resource-constrained systems such as laptops, but also supports GPUs. Much has been written about InstructLab and this blog is not intended to provide an in-depth look at InstructLab. Rather, the objective here is to demonstrate how InstructLab workloads can be distributed and managed in a high-performance computing cluster with GPUs using the IBM LSF workload scheduler. Recently, IBM published a paper describing the infrastructure used to train the Granite family of AI foundation models. The paper describes the Vela and Blue Vela environments in detail. In particular, the Blue Vela environment is built on a software stack using Red Hat Enterprise Linux, IBM LSF and Storage Scale. Learn more in the detailed paper here.

The demo workflow consists of two LSF jobs. The first job generates synthetic data, which is used to teach the LLM new skills or knowledge. The second job, which depends upon the successful completion of the first, is the training job, where the new skills or knowledge are incorporated into an existing base model. A simple LSF job dependency may be used to ensure the training job only runs after the successful completion of the synthetic data generation step.

The environment used is equipped with Nvidia GPUs. InstructLab jobs will be run with the options for GPU support, and the jobs will be submitted to LSF with the appropriate GPU scheduling directives. Furthermore, it is assumed that the users' $HOME directory is available on all hosts in the cluster. Note that I require neither root access, nor a user account that is an LSF administrator, to install and use InstructLab on the LSF cluster.

Configuration

The HPC cluster is configured as follows:

Intel® Xeon® Platinum 8468 based compute nodes equipped with 8 x Nvidia H100 GPUs
Red Hat Enterprise Linux v8.8
IBM LSF v10.0.1.15
InstructLab v0.19.4
Miniforge v3 (24.9.0-0)
Nvidia CUDA v12.6
Nvidia DCGM v3.3.9

Install InstructLab

1. Log in to a compute node in the LSF cluster equipped with GPUs. If ssh access is disabled to compute nodes, then submit an interactive LSF batch job. This job requests 8 GPUs on a single system and will set them to exclusive execution mode.

$ bsub -Is -R "span[hosts=1]" -gpu "num=8:j_exclusive=yes" bash

2. Install and set up a Conda environment. This will enable you to install a self-contained Conda environment for your user account with the necessary Python version needed for InstructLab. Miniforge is installed in the default location and the option to update the users shell profile to start the Conda environment are selected. We assume here a shared $HOME directory.

$ cd $HOME

$ curl -L -O "https://github.com/conda-

forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"

$ bash Miniforge3-$(uname)-$(uname -m).sh

3. Before proceeding, you must logout and log back in to activate the environment. Next, a Conda environment is created with name my_env. Here we’ll specify Python v3.11, which is a requirement for InstructLab.

conda create --name my_env -c anaconda python=3.11

conda activate my_env

4. Next, install InstructLab. Here, version 0.19.4 of InstructLab is specified. This was the version of InstructLab available in the timeframe preceding the SC24 event. Follow the installation steps in the official InstructLab documentation here.

$ pip install instructlab==0.19.4

5. Next, perform the installation of InstructLab with Nvidia CUDA support. This is required for InstructLab to utilize the GPUs. Without this step, InstructLab will run on the CPUs. Note that CUDA v12.6 is installed on the system and the variables set below reflect this.

$ export CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.6 -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.6/lib64"

$ export PATH=/usr/local/cuda-12.6/bin:$PATH

$ pip cache remove llama_cpp_python

$ CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_NATIVE=off" pip install 'instructlab[cuda]'

$ pip install vllm@git+https://github.com/opendatahub-io/vllm@v0.6.2

Configure InstructLab

1. With the installation of InstructLab complete, the next step is to run the initialization. This will setup paths to models, taxonomy repo as well as the GPU configuration.

$ ilab config init

2. By default InstructLab stores models, training checkpoints and other files within ~/.cache and ~/.local/share/instructlab. If you have limited storage capacity available in $HOME, then you may opt to disable training checkpoint files. This can be done by setting the following option in ~/.config/instructlab/config.yaml as follows.

train:

checkpoint_at_epoch: false

3. Next, we download the required models. The ilab model list command can be used to list the models which are available. Note that a HuggingFace token is required to download certain models. Please set HF_TOKEN in the environment with the appropriate token.

$ export HF_TOKEN=<HuggingFace token>

$ ilab model download

$ ilab model download --repository=instructlab/granite-7b-lab

$ ilab model list

+--------------------------------------+---------------------+---------+

| Model Name | Last Modified | Size |

+--------------------------------------+---------------------+---------+

| instructlab/granite-7b-lab | 2024-12-27 20:37:29 | 12.6 GB |

| mistral-7b-instruct-v0.2.Q4_K_M.gguf | 2024-12-27 16:55:46 | 4.1 GB |

| merlinite-7b-lab-Q4_K_M.gguf | 2024-12-27 16:48:39 | 4.1 GB |

+--------------------------------------+---------------------+---------+

Generate synthetic data & model training

Next, is the synthetic data generation step, which will be executed on GPUs. This step is a prerequisite to teaching the LLM new skills/knowledge via training.

1. Here we use example knowledge from the InstructLab github about Taylor Swift fans, who are known as “Swifties”. This is timely because Taylor Swift recently wrapped up 6 concerts in Toronto, Canada, where I happen to be based. Copy attribution.txt and qna.yaml from the following location.

2. By default, the InstructLab taxonomy is found in ~/.local/share/instructlab/taxonomy. Here we create the directories fandom/swifties under ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom and copy the files from step 1 into this location.

$ mkdir -p ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties

$ cp <path_to>/attribution.txt ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties

$ cp <path_to>/qna.yaml ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties

3. With the Swifties taxonomy in place, check for any syntax errors with the command ilab taxonomy diff. It should report that the taxonomy is valid if there are no syntax errors.

$ ilab taxonomy diff

knowledge/arts/fandom/swifties/qna.yaml

Taxonomy in /u/gsamu/.local/share/instructlab/taxonomy is valid :)

4. With the taxonomy in place and having confirmed that the syntax is valid, it’s now time to run the synthetic data generation job through LSF. Here we will request 8 GPUs on a single server in exclusive execution mode. For the InstructLab ilab command, specify the --gpus 8 and --pipeline full options. Standard output is written to the $HOME/job-output with filename specification <LSF_JOBID>.out. The $HOME/job-output directory must already exist.

$ mkdir -p $HOME/job-output

$ bsub -o $HOME/job-output/%J.out -R "span[hosts=1]" -gpu "num=8:j_exclusive=yes" ilab data generate --pipeline full --gpus 8

Job <1131> is submitted to default queue <normal>.

5. During job execution, the LSF bpeek command can be used to monitor the job standard output.

$ bpeek -f 1131

<< output from stdout >>

INFO 2025-01-02 09:51:29,503 numexpr.utils:146: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.

INFO 2025-01-02 09:51:29,504 numexpr.utils:149: Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.

INFO 2025-01-02 09:51:29,504 numexpr.utils:162: NumExpr defaulting to 16 threads.

INFO 2025-01-02 09:51:30,038 datasets:59: PyTorch version 2.3.1 available.

INFO 2025-01-02 09:51:31,226 instructlab.model.backends.llama_cpp:100: Trying to connect to model server at http://127.0.0.1:8000/v1

WARNING 2025-01-02 09:51:56,356 instructlab.data.generate:270: Disabling SDG batching - unsupported with llama.cpp serving

Generating synthetic data using 'full' pipeline, '/u/gsamu/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/u/gsamu/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:55779/v1 server

INFO 2025-01-02 09:51:56,861 instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.

INFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:153: Running pipeline single-threaded

INFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:197: Running block: duplicate_document_col

INFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:198: Dataset({

features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],

num_rows: 35

})

INFO 2025-01-02 09:51:58,286 instructlab.sdg.llmblock:51: LLM server supports batched inputs: False

INFO 2025-01-02 09:51:58,286 instructlab.sdg.pipeline:197: Running block: gen_spellcheck

INFO 2025-01-02 09:51:58,286 instructlab.sdg.pipeline:198: Dataset({

features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document'],

num_rows: 35

})

/u/gsamu/miniforge3/envs/my_env/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...

warnings.warn(

INFO 2025-01-02 09:57:42,264 instructlab.sdg.pipeline:197: Running block: flatten_auxiliary_columns

INFO 2025-01-02 09:57:42,264 instructlab.sdg.pipeline:198: Dataset({

features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document', 'spellcheck'],

num_rows: 35

})

INFO 2025-01-02 09:57:42,279 instructlab.sdg.pipeline:197: Running block: rename_to_document_column

INFO 2025-01-02 09:57:42,279 instructlab.sdg.pipeline:198: Dataset({

features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'dataset_type', 'corrected_document'],

num_rows: 70

})

INFO 2025-01-02 09:57:42,282 instructlab.sdg.pipeline:197: Running block: gen_knowledge

INFO 2025-01-02 09:57:42,282 instructlab.sdg.pipeline:198: Dataset({

features: ['icl_document', 'raw_document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'dataset_type', 'document'],

num_rows: 70

})

…

6. During the runtime of the job, it’s possible to view GPU related metrics using the LSF lsload and bhosts commands. First, we need to identify the host where the job has been dispatched to using the LSF bjobs command. In this case the job was dispatched to host p1-r01-n4. Note that details GPU accounting metrics are available once the job runs to completion.

$ bjobs -w

JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME

1131 gsamu RUN normal rmf-login-1 p1-r01-n4 ilab data generate --pipeline full --gpus 8 Jan 2 14:51

$ lsload -w -gpu p1-r01-n4

HOST_NAME status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical

p1-r01-n4 ok 8 2% 7% 8

$ bhosts -w -gpu p1-r01-n4

HOST_NAME GPU_ID MODEL MUSED MRSV NJOBS RUN SUSP RSV

p1-r01-n4 0 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

1 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

2 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

3 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

4 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

5 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

6 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

7 NVIDIAH10080GBHBM3 2G 0G 1 1 0 0

7. After job completion, it’s possible to view details about the job including GPU utilization which LSF collects by leveraging NVIDIA DCGM. These metrics are available upon job completion using both the LSF bhist and bjobs commands.

$ bhist -l -gpu 1131

Job <1131>, User <gsamu>, Project <default>, Command <ilab data generate --pipe

line full --gpus 8>

Thu Jan 2 14:51:23 2025: Submitted from host <rmf-login-1>, to Queue <normal>,

CWD <$HOME>, Output File </u/gsamu/job-output/%J.out

>, Requested Resources <span[hosts=1]>, Requested GPU

<num=8:j_exclusive=yes>;

Thu Jan 2 14:51:24 2025: Dispatched 1 Task(s) on Host(s) <p1-r01-n4>, Allocate

d 1 Slot(s) on Host(s) <p1-r01-n4>, Effective RES_REQ

<select[((ngpus>0)) && (type == local)] order[r15s:p

g] rusage[ngpus_physical=8.00] span[hosts=1] >;

Thu Jan 2 14:51:25 2025: Starting (Pid 3095851);

Thu Jan 2 14:51:25 2025: External Message "p1-r01-n4:gpus=0,1,2,3,4,5,6,7;EFFE

CTIVE GPU REQ: num=8:mode=shared:mps=no:j_exclusive=y

es:gvendor=nvidia;" was posted from "gsamu" to messag

e box 0;

Thu Jan 2 14:51:26 2025: Running with execution home </u/gsamu>, Execution CWD

</u/gsamu>, Execution Pid <3095851>;

Thu Jan 2 16:08:05 2025: Done successfully. The CPU time used is 4624.0 second

HOST: p1-r01-n4; CPU_TIME: 4624 seconds

GPU ID: 0

Total Execution Time: 4597 seconds

Energy Consumed: 579704 Joules

SM Utilization (%): Avg 9, Max 15, Min 0

Memory Utilization (%): Avg 2, Max 100, Min 0

Max GPU Memory Used: 1956642816 bytes

GPU ID: 1

Total Execution Time: 4597 seconds

Energy Consumed: 503956 Joules

SM Utilization (%): Avg 7, Max 11, Min 0

Memory Utilization (%): Avg 2, Max 5, Min 0

Max GPU Memory Used: 1767899136 bytes

GPU ID: 2

Total Execution Time: 4597 seconds

Energy Consumed: 501754 Joules

SM Utilization (%): Avg 7, Max 11, Min 0

Memory Utilization (%): Avg 2, Max 5, Min 0

Max GPU Memory Used: 1784676352 bytes

GPU ID: 3

Total Execution Time: 4597 seconds

Energy Consumed: 525195 Joules

SM Utilization (%): Avg 7, Max 11, Min 0

Memory Utilization (%): Avg 2, Max 54, Min 0

Max GPU Memory Used: 1767899136 bytes

GPU ID: 4

Total Execution Time: 4597 seconds

Energy Consumed: 525331 Joules

SM Utilization (%): Avg 7, Max 12, Min 0

Memory Utilization (%): Avg 2, Max 5, Min 0

Max GPU Memory Used: 1767899136 bytes

GPU ID: 5

Total Execution Time: 4597 seconds

Energy Consumed: 502416 Joules

SM Utilization (%): Avg 7, Max 11, Min 0

Memory Utilization (%): Avg 2, Max 5, Min 0

Max GPU Memory Used: 1784676352 bytes

GPU ID: 6

Total Execution Time: 4597 seconds

Energy Consumed: 508720 Joules

SM Utilization (%): Avg 7, Max 12, Min 0

Memory Utilization (%): Avg 2, Max 5, Min 0

Max GPU Memory Used: 1784676352 bytes

GPU ID: 7

Total Execution Time: 4597 seconds

Energy Consumed: 491041 Joules

SM Utilization (%): Avg 6, Max 12, Min 0

Memory Utilization (%): Avg 2, Max 4, Min 0

Max GPU Memory Used: 1933574144 bytes

GPU Energy Consumed: 4138117.000000 Joules

Thu Jan 2 16:08:05 2025: Post job process done successfully;

GPU_ALLOCATION:

HOST TASK GPU_ID GI_PLACEMENT/SIZE CI_PLACEMENT/SIZE MODEL MTOTAL FACTOR MRSV SOCKET NVLINK/XGMI

p1-r01-n4 0 0 - - NVIDIAH10080 80G 9.0 0G 0 -

0 1 - - NVIDIAH10080 80G 9.0 0G 0 -

0 2 - - NVIDIAH10080 80G 9.0 0G 0 -

0 3 - - NVIDIAH10080 80G 9.0 0G 0 -

0 4 - - NVIDIAH10080 80G 9.0 0G 1 -

0 5 - - NVIDIAH10080 80G 9.0 0G 1 -

0 6 - - NVIDIAH10080 80G 9.0 0G 1 -

0 7 - - NVIDIAH10080 80G 9.0 0G 1 -

MEMORY USAGE:

MAX MEM: 2 Gbytes; AVG MEM: 1 Gbytes; MEM Efficiency: 0.00%

CPU USAGE:

CPU PEAK: 1.69 ; CPU PEAK DURATION: 52 second(s)

CPU AVERAGE EFFICIENCY: 100.69% ; CPU PEAK EFFICIENCY: 169.23%

Summary of time in seconds spent in various states by Thu Jan 2 16:08:05 2025

PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL

1 0 4601 0 0 0 4602

8. When the synthetic data generation job completes, it’s output can be viewed at ~/job-output/<jobID>.out. The synthetic data sets will comprise files in the directory ~/.local/share/instructlab/datasets. These files will be named skills_train_msgs_*.jsonl and knowledge_train_msgs_*.jsonl.

9. With the synthetic data generation step complete, it’s now time to run the training. We first set 2 environment variables to point to the following files: ~/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl and ~./.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl.

Afterward, we submit the training job to LSF requesting 8 GPUs and with ilab options –-pipeline accelerated, –-gpus 8, --device cuda and -–data-path pointing to the two above data files that were produced in the synthetic data generation step.

$ export SKILLS_PATH=/u/gsamu/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl

$ export KNOWLEDGE_PATH=/u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl

$ bsub -o $HOME/job-output/%J.out -R "span[hosts=1]" -gpu "num=8:j_exclusive=yes" ilab model train --pipeline accelerated --data-path $SKILLS_PATH --data-path $KNOWLEDGE_PATH --device cuda --gpus 8

Job <1135> is submitted to default queue <normal>.

10. During job execution, the LSF bpeek command can be used to monitor the job standard output.

$ bpeek -f 1135

<< output from stdout >>

LoRA is disabled (rank=0), ignoring all additional LoRA args

[2025-01-02 12:52:04,359] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

INFO 2025-01-02 12:52:09,061 numexpr.utils:146: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.

INFO 2025-01-02 12:52:09,061 numexpr.utils:149: Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.

INFO 2025-01-02 12:52:09,061 numexpr.utils:162: NumExpr defaulting to 16 threads.

INFO 2025-01-02 12:52:09,304 datasets:59: PyTorch version 2.3.1 available.

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.

INFO 2025-01-02 12:52:09,653 root:617: Special tokens: eos: [32000], pad: [32001], bos: [32005], system: [32004], user: [32002], assistant: [32003]

INFO 2025-01-02 12:52:09,923 root:617: number of dropped samples: 0 -- out of 641

data arguments are:

{"data_path":"/u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl","data_output_path":"/u/gsamu/.local/share/instructlab/internal","max_seq_len":4096,"model_path":"/u/gsamu/.cache/instructlab/models/instructlab/granite-7b-lab","chat_tmpl_path":"/u/gsamu/miniforge3/envs/my_env/lib/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py","num_cpu_procs":16}

tokenizing the dataset with /u/gsamu/.cache/instructlab/models/instructlab/granite-7b-lab tokenizer...

ten largest length percentiles:

quantile 90th: 1459.0

quantile 91th: 1466.0

quantile 92th: 1469.6000000000001

quantile 93th: 1478.2

quantile 94th: 1483.0

quantile 95th: 1488.0

quantile 96th: 1497.1999999999998

quantile 97th: 1516.5999999999997

quantile 98th: 1540.6000000000001

quantile 99th: 1656.0000000000016

quantile 100th: 2578.0

at 4096 max sequence length, the number of samples to be dropped is 0

(0.00% of total)

quantile 0th: 368.0

quantile 1th: 393.0

quantile 2th: 411.2

quantile 3th: 421.2

quantile 4th: 427.2

quantile 5th: 442.0

quantile 6th: 604.4

quantile 7th: 631.8

quantile 8th: 653.8000000000001

quantile 9th: 679.8

quantile 10th: 742.0

at 20 min sequence length, the number of samples to be dropped is 0

checking the validity of the samples...

Categorizing training data type...

unmasking the appropriate message content...

Samples Previews...

…

11. During the runtime of the training job, we can observe some GPU utilization information using the LSF lsload and bhosts commands. First we need to identify the server on which the training job is running. This is done using the bjobs command and checking for the execution host of the job.

$ bjobs -w

JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME

1135 gsamu RUN normal rmf-login-1 p1-r01-n1 ilab model train --pipeline accelerated --data-path /u/gsamu/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl --data-path /u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl --device cuda --gpus 8 Jan 2 17:51

$ lsload -w -gpu p1-r01-n1

HOST_NAME status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical

p1-r01-n1 ok 8 0% 22% 8

$ bhosts -w -gpu p1-r01-n1

HOST_NAME GPU_ID MODEL MUSED MRSV NJOBS RUN SUSP RSV

p1-r01-n1 0 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

1 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

2 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

3 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

4 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

5 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

6 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

7 NVIDIAH10080GBHBM3 10G 0G 1 1 0 0

12. Once the job is complete, detailed GPU accounting can again be viewed using the LSF bhist command as follows below.

$ bhist -l -gpu 1135

Job <1135>, User <gsamu>, Project <default>, Command <ilab model train --pipeli

ne accelerated --data-path /u/gsamu/.local/share/inst

ructlab/datasets/skills_train_msgs_2025-01-02T09_51_5

6.jsonl --data-path /u/gsamu/.local/share/instructlab

/datasets/knowledge_train_msgs_2025-01-02T09_51_56.js

onl --device cuda --gpus 8>

Thu Jan 2 17:51:48 2025: Submitted from host <rmf-login-1>, to Queue <normal>,

CWD <$HOME/.local/share/instructlab/checkpoints>, Ou

tput File </u/gsamu/job-output/%J.out>, Requested Res

ources <span[hosts=1]>, Requested GPU <num=8:j_exclus

ive=yes>;

Thu Jan 2 17:51:48 2025: Dispatched 1 Task(s) on Host(s) <p1-r01-n1>, Allocate

d 1 Slot(s) on Host(s) <p1-r01-n1>, Effective RES_REQ

<select[((ngpus>0)) && (type == local)] order[r15s:p

g] rusage[ngpus_physical=8.00] span[hosts=1] >;

Thu Jan 2 17:51:49 2025: Starting (Pid 3462241);

Thu Jan 2 17:51:49 2025: Running with execution home </u/gsamu>, Execution CWD

</u/gsamu/.local/share/instructlab/checkpoints>, Exe

cution Pid <3462241>;

Thu Jan 2 17:51:49 2025: External Message "p1-r01-n1:gpus=0,1,2,3,4,5,6,7;EFFE

CTIVE GPU REQ: num=8:mode=shared:mps=no:j_exclusive=y

es:gvendor=nvidia;" was posted from "gsamu" to messag

e box 0;

Thu Jan 2 17:57:56 2025: Done successfully. The CPU time used is 3024.0 second

HOST: p1-r01-n1; CPU_TIME: 3024 seconds

GPU ID: 0