
Job Starvation in Your HPC Cluster

By Michael Spriggs posted Mon April 15, 2024 03:21 PM

It can be extremely frustrating for end users of LSF to see their jobs pending for long periods of time. Jobs with large resource requirements or special topology requirements are especially susceptible to long pend times. In this blog, I discuss the problem and one of the key tools available in LSF to help alleviate it.

What causes job starvation in LSF? 

By default, LSF acts as a greedy scheduler, in the sense that within each scheduling session it does the following:

  1. Take a snapshot of all pending jobs and available resources in the cluster.
  2. Make job dispatch decisions, allocating idle cluster resources to jobs.
  3. Execute the job dispatch decisions.

LSF uses job prioritization policies configured by the administrator (queue priorities, FCFS, fairshare, etc.) to influence the order in which jobs are dispatched. However, this job order is not strict. It is possible that a low-priority job can dispatch ahead of a high-priority job if the low-priority job requires fewer resources than the high-priority job.
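As an illustration, queue priority lives in lsb.queues. Here is a minimal sketch of two queues with different priorities (the queue names and priority values are hypothetical); jobs in the queue with the larger PRIORITY value are considered first:

Begin Queue
QUEUE_NAME = high
PRIORITY   = 70
End Queue

Begin Queue
QUEUE_NAME = normal
PRIORITY   = 30
End Queue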

For example, suppose you have a single host in your cluster with 8 GPUs. A user submits a 1-GPU job and an 8-GPU job to an FCFS queue. The 1-GPU job will dispatch, and the 8-GPU job will be forced to pend. While the 8-GPU job is pending, any number of 1-GPU jobs submitted by other users may dispatch ahead of the 8-GPU job, delaying it further.

In many environments the greedy approach works well, achieving high levels of resource utilization while still avoiding job starvation. For example, it works well if all jobs in the cluster have similar resource requirements. It can also work well in very large clusters where there is lots of churn: if many jobs finish in each scheduling session, then often the right set of resources needed for any hard-to-place job will appear within a few scheduling sessions.

In other environments, the greedy approach can lead to very large pending times for hard-to-place jobs, and be a huge source of frustration for users.

A prime example is an AI training cluster shared by a team of data scientists. You might have some jobs that require a single GPU, while others may require all the GPUs on a node. Larger jobs may require multiple nodes. Jobs are typically long-running and there is little churn.

In such cases, you can enable LSF’s plan-based reservation feature to help alleviate job starvation. The idea is that instead of looking only at the current resource availability when trying to place a job, the scheduler looks into the future to make a planned allocation for the job. LSF will reserve sufficient resources on the cluster to ensure that the planned allocation can be carried out. 

To enable plan-based allocation on your cluster, configure the following parameters in lsb.params, and then run badmin reconfig.

ALLOCATION_PLANNER = Y
PLAN = Y
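After saving lsb.params, the reconfiguration step looks like the following; the output shown is typical, though the exact wording may vary by LSF version.

$ badmin reconfig
Checking configuration files ...
No errors found.
Reconfiguration initiated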

The ALLOCATION_PLANNER parameter enables the feature, while PLAN controls which jobs are eligible to receive planned allocations. You can instead configure PLAN at the queue or application level to narrow the scope of eligible jobs.
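For example, here is a minimal lsb.queues sketch that limits planned allocations to a single queue (the queue name gpu_batch and its priority are hypothetical). With queue-level scoping, you would drop PLAN = Y from lsb.params and keep only ALLOCATION_PLANNER = Y there.

Begin Queue
QUEUE_NAME  = gpu_batch
PRIORITY    = 50
PLAN        = Y
DESCRIPTION = GPU jobs eligible for plan-based reservation
End Queue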

Let’s have a look at how this works. For simplicity, I will focus on running jobs on a single 8-GPU compute node in my LSF simulation cluster.

$ bhosts -gpu cccxc454
HOST_NAME            GPU_ID           MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV
cccxc454                  0 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          1 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          2 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          3 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          4 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          5 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          6 TeslaV100_SXM2_      267M        0M      0      0      0      0
                          7 TeslaV100_SXM2_      267M        0M      0      0      0      0

First, I submit a single-GPU job that simply sleeps for 10 minutes. Plan-based reservation requires estimates of job run times, so I set an estimated run time of 10 minutes for the job with the -We option.

$ bsub -q normal -m cccxc454 -gpu "num=1:j_exclusive=yes" -We 10 sleep 600
Job <615> is submitted to queue <normal>.

Run time estimates don’t need to be perfect, though the more accurate they are, the better the scheduler can plan future job dispatches. If a job runs longer than its estimated run time, it might delay a pending job that the scheduler plans to place on the same host. Conversely, if a job finishes earlier than estimated, a pending job that was planned onto a different host may miss an opportunity to use the freed resources.
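As an aside, if an estimate turns out to be badly off, it can be revised on an already-submitted job with bmod. For example, to change the estimate for job 615 to 20 minutes (a hypothetical correction):

$ bmod -We 20 615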

The job runs, as expected, on the empty host.

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
615     msprigg RUN   normal     gutsy1.fyre cccxc454    sleep 600  Apr 10 14:01

Next, I submit an 8-GPU job that requests the same host.

$ bsub -q normal -m cccxc454 -gpu "num=8:j_exclusive=yes" -We 10 sleep 600
Job <616> is submitted to queue <normal>.

As expected, this job pends due to lack of GPUs on the host. However, LSF has computed an allocation plan whereby it will run the job after the previous job completes. We can check the jobs with planned allocations using the following syntax.

$ bjobs -plan
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME   PLAN_START_TIME    PLAN_FINISH_TIME
616     msprigg PEND  normal     gutsy1.fyre             sleep 600  Apr 10 14:01  Apr 10 14:11       Apr 10 14:21

We can see from the bhosts output that the idle GPUs on the host have been reserved.

$ bhosts -gpu cccxc454
HOST_NAME            GPU_ID           MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV
cccxc454                  0 TeslaV100_SXM2_      267M        0M      1      1      0      0
                          1 TeslaV100_SXM2_      267M        0M      1      0      0      1
                          2 TeslaV100_SXM2_      267M        0M      1      0      0      1
                          3 TeslaV100_SXM2_      267M        0M      1      0      0      1
                          4 TeslaV100_SXM2_      267M        0M      1      0      0      1
                          5 TeslaV100_SXM2_      267M        0M      1      0      0      1
                          6 TeslaV100_SXM2_      267M        0M      1      0      0      1
                          7 TeslaV100_SXM2_      267M        0M      1      0      0      1

Now let’s submit another 1-GPU job.

$ bsub -q normal -m cccxc454 -gpu "num=1:j_exclusive=yes" -We 10 sleep 600
Job <617> is submitted to queue <normal>.

Under a purely greedy policy, this job would be able to jump ahead of the 8-GPU job. However, since we have plan-based reservation enabled cluster-wide, the job is forced to wait. In fact, it gets its own plan.

$ bjobs -plan
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME   PLAN_START_TIME    PLAN_FINISH_TIME
616     msprigg PEND  normal     gutsy1.fyre             sleep 600  Apr 10 14:01  Apr 10 14:11       Apr 10 14:21
617     msprigg PEND  normal     gutsy1.fyre             sleep 600  Apr 10 14:02  Apr 10 14:21       Apr 10 14:31

If we wait until the first job finishes, we can verify that the 8-GPU job gets dispatched ahead of the 1-GPU job. Perfect!

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
616     msprigg RUN   normal     gutsy1.fyre cccxc454    sleep 600  Apr 10 14:01
617     msprigg PEND  normal     gutsy1.fyre             sleep 600  Apr 10 14:02

In summary, plan-based reservation can be enabled in your LSF cluster to help ensure that jobs dispatch in a timely manner. Note, however, that to achieve this the scheduler may need to hold GPUs idle for periods of time, thereby reducing cluster utilization.

LSF offers a number of parameters that you can configure to strike a balance between high resource utilization and timely job starts. I’ll leave that discussion for another blog.

Learn more about plan-based reservation here.
