IBM Cloud provides a fully managed, multi-tenant, serverless platform known as Code Engine, which can be used to run batch jobs. Code Engine (CE) allows users to focus on their applications and data without needing to manage the underlying compute infrastructure used to run them. It automatically deploys and scales compute capacity as needed, and when jobs complete, the resources that were used are removed. Unlike with a traditional HPC cluster, you incur no cost for unused capacity: you pay only for the compute resources you use. Furthermore, for applications that process regulated data, Code Engine supports all the additional security controls offered by the IBM Cloud for Financial Services.
What are Code Engine jobs?
Like their traditional HPC counterparts, Code Engine jobs are designed to run one time and exit. A typical job pulls its input data from a persistent data store such as Cloud Object Storage (COS), performs its computational work, and then stores the results back to COS or sends them to the user. Once the job has exited, the compute resources that ran it can be freed up.
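The fetch, compute, store and exit pattern described above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary is a hypothetical in-memory stand-in for an object store such as COS, and the names run_job and word_count are invented for this example rather than taken from any Code Engine or COS API.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an object store such as COS: object keys map to bytes.
ObjectStore = Dict[str, bytes]


def run_job(store: ObjectStore, input_key: str, output_key: str,
            compute: Callable[[bytes], bytes]) -> None:
    """Fetch the input, do the work, store the result, then return."""
    data = store[input_key]       # 1. pull the input from the persistent store
    result = compute(data)        # 2. perform the computational work
    store[output_key] = result    # 3. write the result back to the store


def word_count(data: bytes) -> bytes:
    """Toy workload: count whitespace-separated words in the input."""
    return str(len(data.split())).encode()


if __name__ == "__main__":
    store: ObjectStore = {"inputs/sample.txt": b"run one time and exit"}
    run_job(store, "inputs/sample.txt", "results/sample.count", word_count)
    print(store["results/sample.count"].decode())
    # The process ends here, so nothing needs to stay running after the job.
```

Because all state lives in the store rather than in the job, the same function can be run anywhere that has access to the input and output locations.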
A job can run as a single instance or as multiple instances. Code Engine supports running tasks in parallel to reduce overall compute time: each task runs the same code but with different input data. The number of job instances to run in parallel can be specified when the batch job is submitted. If a job instance fails, Code Engine will restart it.
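One common way for each parallel instance to pick its own share of the input is to read its position in the job array from the environment. The sketch below assumes that Code Engine exposes that position in the JOB_INDEX environment variable and the number of instances in JOB_ARRAY_SIZE, and that indices run from 0 to size minus 1; check these assumptions against the Code Engine documentation before relying on them. The function names are illustrative.

```python
import os
from typing import List, Mapping, Sequence


def select_inputs(items: Sequence[str], index: int, array_size: int) -> List[str]:
    """Return every array_size-th item, starting at this instance's index."""
    return list(items[index::array_size])


def instance_inputs(items: Sequence[str],
                    env: Mapping[str, str] = os.environ) -> List[str]:
    """Pick this instance's inputs from the (assumed) job array variables."""
    # Assumption: JOB_INDEX and JOB_ARRAY_SIZE are set by the platform and
    # indices are contiguous from 0. Defaults allow a local, single-instance run.
    index = int(env.get("JOB_INDEX", "0"))
    array_size = int(env.get("JOB_ARRAY_SIZE", "1"))
    return select_inputs(items, index, array_size)


if __name__ == "__main__":
    all_inputs = [f"inputs/part-{n}.csv" for n in range(10)]
    for key in instance_inputs(all_inputs):
        print("processing", key)
```

With this scheme every instance runs identical code, and the inputs are divided so that no item is processed twice.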
Code Engine runs containerised workloads. It can even build container images for you from your source code. A Code Engine job will run one or more instances of the container to perform its computational work. When you initially create the job, you specify the workload configuration, such as the number of processor cores and the amount of memory that it needs. This configuration is used each time the job is run.
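As a rough illustration of what such a workload configuration contains, the sketch below collects the settings in a small Python data class and assembles the matching command line without running it. The flags shown (--name, --image, --cpu, --memory, --array-indices) are believed to correspond to the ibmcloud ce job create command but should be verified against the CLI reference; the job name, image and sizes are hypothetical.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class JobConfig:
    """Workload configuration fixed when the job is created."""
    name: str            # job name (hypothetical value below)
    image: str           # container image to run (hypothetical value below)
    cpu: str             # number of processor cores per instance
    memory: str          # memory per instance
    array_indices: str   # which instances to run, for example "0-9"


def create_command(cfg: JobConfig) -> List[str]:
    """Build the argument list for creating the job; nothing is executed."""
    return [
        "ibmcloud", "ce", "job", "create",
        "--name", cfg.name,
        "--image", cfg.image,
        "--cpu", cfg.cpu,
        "--memory", cfg.memory,
        "--array-indices", cfg.array_indices,
    ]


if __name__ == "__main__":
    cfg = JobConfig("pricing-batch", "icr.io/example/pricing:latest",
                    "1", "4G", "0-9")
    print(shlex.join(create_command(cfg)))
```

Keeping the configuration in one place makes it easy to review the resources a job requests before it is created.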
For more information and to get started with Code Engine, please see: https://cloud.ibm.com/docs/codeengine?topic=codeengine-getting-started
Code Engine and LSF
Code Engine batch jobs can be created and managed through the IBM Cloud console or via a command-line interface. However, most organisations running HPC batch applications will be using a workload scheduler such as LSF or Slurm to support batch workloads on their HPC clusters. These clusters have a finite size, so when business demands for resource exceed the capacity of the cluster, organisations might decide to burst compute workload to additional resources running in the cloud.
If an organisation has only an infrequent need for such burst capacity, or would rather not take on the management of another HPC environment, one option is to use Code Engine as an extension to its existing cluster. This has been made possible by the work of Christof Westhues, who created a gateway that allows LSF or Slurm jobs to be run using Code Engine. Think of this as an additional pool of compute resource allowing simple workloads with basic computational needs to be run on Code Engine serverless “compute nodes”. There is no ongoing commitment to pay for cloud compute infrastructure, which improves overall cost efficiency. The typical use of this gateway is as shown in the diagram below:
The LSF Code Engine Gateway allows you to scale compute resources dynamically and on demand to support peak workloads. Because it uses a serverless platform, it offers improved cost efficiency through the “Pay for what you use” consumption model. The infrastructure and security of the Code Engine “compute pool” are managed for you, which helps reduce operational overheads and removes the need to develop and maintain skills in IBM Cloud infrastructure. Finally, the integration with your existing workload scheduler (LSF or Slurm) lets you submit and manage jobs using the tools you are familiar with, adding serverless compute to your existing HPC environment.
The use of serverless compute resources as an LSF compute pool is not new. Xun Pan, an LSF developer, extended LSF to use OpenWhisk serverless resources in 2016.
Code Engine as part of your overall HPC strategy
Code Engine compute can be a useful addition to your IBM Cloud HPC capabilities, but it’s not the only way that an on-premises LSF cluster can be extended with cloud resources. LSF Standard Edition includes the LSF Resource Connector, which can automatically scale resources on the IBM Cloud up and down as workload demand varies over time. For a deeper discussion of the factors influencing the choice of serverless vs. virtual machines for your cloud compute, and when you might use one or the other, please see this article published by Gabor Samu. Some workloads need the extra capabilities that virtual machines offer, but for those that work with serverless compute nodes, the advantages are significant and certainly worth trying out.
Christof’s code to implement the LSF-Code Engine Gateway is available as-is on GitHub at: https://github.com/cwesthues/cemaster.