Kubernetes is great for running services. What I mean is that it's great for an administrator to launch services like web servers, databases, and applications. But when it comes to jobs, whether simple processes or, at the other extreme, massively parallel jobs, Kubernetes has some challenges. The problems stem from the nature of the jobs. Running a job as a pod is not that hard, but that is not the end of the story.
For simple jobs, the issue is not necessarily the pod, but the governance. Jobs are run by people, and people run those jobs as part of their work. Who has not felt the pressure of a deadline? When users have a tool like Kubernetes they are going to discover:
- They can run lots of jobs
- They can run bigger jobs
Here's where the conflict comes in. Eventually they will use all the resources, and the harder questions creep up, like:
- Why is a user hogging all the resources?
- Why can't I get a fair share of the resources?
- Why am I hitting a quota limit when there are idle machines?
- Why has the high priority work not run?
These people issues are much harder to deal with given only the controls that Kubernetes provides.
For parallel jobs it is easy to see the issues. A massively parallel job spreads the work across many machines/pods. As calculations complete, the results are combined and distributed for the next step in the calculation. This means that if one host/pod is slower, or starts later, the calculation stalls waiting for the data. Therefore, for parallel jobs it is important to start the calculations on the hosts/pods at the same time, and that the hosts/pods be similarly configured. You would not want to mix a slow machine with fast ones.
This is where integrating the batch scheduling and resource policies of IBM Spectrum LSF makes sense. LSF has been dealing with the governance issues of managing large clusters with thousands of users and machines, and millions of jobs. Also, coming from a High-Performance Computing (HPC) background, it has been supporting massively parallel jobs for decades. Yes, it really has been around that long.
So how do we add LSF's batch and resource management capabilities to Kubernetes? It turns out Kubernetes was built with a provision for alternative schedulers. You can select the scheduler to use when you create the pod, e.g.
schedulerName: lsf
This way the request gets routed through LSF, and LSF then manages that workload.
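For example, a pod that should be handled by LSF instead of the default scheduler could look like the following. The name, image, and resource requests are just placeholders for illustration; the schedulerName field is the part that matters.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-batch-task            # placeholder name
spec:
  schedulerName: lsf             # hand this pod to the LSF scheduler
  restartPolicy: Never
  containers:
  - name: task
    image: busybox               # placeholder image; any batch workload works
    command: ["sh", "-c", "echo running my batch step"]
    resources:
      requests:
        cpu: "1"
        memory: 512Mi
```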
Parallel jobs needed a different approach. Kubernetes has a Job kind, but as we saw, it needs a few enhancements to make it more usable for parallel jobs. So we created a Custom Resource Definition (CRD). Basically, this is an extension to the Kubernetes API that adds a parallel job Kind. This parallel job kind allows us to express all the characteristics of a parallel job we want Kubernetes to run. It then provides the logic to help Kubernetes deploy that kind of job. A sketch of what such a resource might look like follows, and the picture below shows how it all works in more detail.
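To give a feel for the idea, here is an illustrative parallel job resource. The kind name, API group, and fields below are hypothetical and only sketch the concept; the actual CRD schema is documented in the project itself.

```yaml
# Hypothetical parallel job resource, for illustration only.
# The real kind, API group, and fields are defined by the CRD shipped with
# the lsf-kubernetes project; consult its documentation for the actual schema.
apiVersion: example.ibm.com/v1alpha1
kind: ParallelJob
metadata:
  name: mpi-simulation
spec:
  replicas: 16                   # number of pods (ranks) to start together
  gangScheduling: true           # start all pods at once, or none at all
  template:                      # pod template, same idea as in a Job
    spec:
      schedulerName: lsf
      containers:
      - name: rank
        image: my-mpi-app:latest # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
```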

- The LSF Scheduler components are packaged into containers and deployed into a Kubernetes environment.
- Users submit workload into the K8S API via kubectl. To make the LSF Scheduler aware of a pod, the "schedulerName" field must be set; otherwise the pod will be scheduled by the default scheduler. Scheduler directives can be specified using annotations on the pod, and those annotations can come from namespace annotations, mutating webhooks, or the user (see the annotation sketch after this list). The workload may also be a parallel or elastic job.
- In order to be aware of the status of pods and nodes, the LSF Scheduler uses a driver that listens to the Kubernetes API server and translates pod requests into jobs in the LSF Scheduler.
- Once the LSF Scheduler makes a policy decision on where to schedule the pod, the driver binds the pod to a specific node (see the Binding sketch after this list).
- The Kubelet then executes and manages the pod lifecycle on the target node in the normal fashion.
- The LSF Scheduler also supports jobs submitted from the native “bsub” CLI; these are mapped to K8S pods and executed by the Kubelet as well, so both entry points behave consistently.
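To make the annotation-based directives above concrete, here is a sketch of a pod that both selects the LSF scheduler and carries directives as annotations. The annotation keys shown (lsf.ibm.com/queue, lsf.ibm.com/project) are placeholders for illustration only; the keys the LSF driver actually recognizes are listed in the project documentation.

```yaml
# Illustrative pod with scheduler directives expressed as annotations.
# The annotation keys below are placeholders, not the real keys; see the
# lsf-kubernetes documentation for the annotations the driver understands.
apiVersion: v1
kind: Pod
metadata:
  name: annotated-task
  annotations:
    lsf.ibm.com/queue: "normal"       # placeholder: target LSF queue
    lsf.ibm.com/project: "chip-sim"   # placeholder: accounting project
spec:
  schedulerName: lsf
  restartPolicy: Never
  containers:
  - name: task
    image: busybox                    # placeholder image
    command: ["sh", "-c", "sleep 30"]
```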
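And for the bind step, this is conceptually what the driver does once LSF has picked a node: it creates a Binding object for the pod through the Kubernetes API. The driver does this programmatically; the node name here is just an example.

```yaml
# The bind step, expressed as the core API Binding object the driver creates.
apiVersion: v1
kind: Binding
metadata:
  name: annotated-task      # the pod being placed
target:
  apiVersion: v1
  kind: Node
  name: worker-node-07      # the node LSF selected (example name)
```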
The LSF scheduler adds other interesting capabilities to Kubernetes. If you are interested in trying it out, visit the GitHub site: https://github.com/IBMSpectrumComputing/lsf-kubernetes There you will find documentation, examples, and the downloads.
#SpectrumComputingGroup