
History and future of serverless containers - illustrated with IBM Cloud Code Engine

By Simon Daniel Moser

From PaaS and Serverless to a generic Compute Platform. And Why. 

Remember 2016? Yes, that was long before the pandemic, and LLMs were not on many radars yet. What was on the radar instead were “serverless functions”, and those were presented, as Mat Duggan puts it, “as the undeniable future of infrastructure” [1] across the tech industry. And indeed, every major cloud provider came out with its own derivative of a function service: AWS Lambda, Google Cloud Functions, IBM OpenWhisk, Microsoft Azure Functions, and so on and so forth.

To recap what serverless computing is, let’s look at the Wikipedia definition: “Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers” [2]. If, back in the day, I had matched that definition against the IBM Cloud service catalog, I would have gotten two hits: IBM Cloud Functions and IBM Cloud Foundry. Viewed from higher up, they had quite a few similarities: a developer pushed some code, that code somehow magically morphed into a container, and that container was executed. Looking closer, there were of course differences (the scaling model, the pricing, the way the container came into existence), but essentially, someone wanted to run some code, as a container, in a serverless way. As a senior technical leader for IBM Cloud, the question to ask was: why did we build and maintain more than one service for that?

Some time later, AI training scenarios became more present in people’s minds, and product management came around asking for a “serverless batch job as a service” capability. That requirement made sense, because neither the characteristics of functions (short-running and memory-constrained) nor those of applications (long-running, HTTP(S)-serving) were really a good fit for batch jobs, which were potentially long-running, had high CPU and memory demands, and usually did not serve HTTP(S).

Table 1 summarises the various characteristics: 

| Workload          | Duration | CPU demand  | Memory demand | HTTP serving | Capacity demand |
|-------------------|----------|-------------|---------------|--------------|-----------------|
| Application       | long     | medium/high | medium/high   | Y            |                 |
| Function          | short    | low         | low           | Y            |                 |
| Traditional batch | long     | medium/high | medium/high   | N            | medium          |
| LLM batch         | long     | high        | high          | N            | high            |
| HPC batch         | long     | high        | high          | N            | massive         |

So in the minds of product management, the conclusion was clear: great, let’s build and operate a third service (that takes some code and runs it as a container).
 
As you can imagine, I am telling you this story because in IBM Cloud we didn’t build that third service. In fact, we didn’t even want to have two services. Instead, we went back to the drawing board and said: let’s create one service that can cover all three use cases, and potentially other, similar ones.

In March 2021, the first iteration of IBM Cloud Code Engine came to market. At that time it wasn’t even a replacement for the existing Functions or Cloud Foundry services; it actually addressed a slightly orthogonal, third use case: purely serverless containers. Imagine you have a container image you built yourself, and you just want to push and run it without having to take care of, say, a Kubernetes cluster. So for a short period of time we did have a third serverless service (just to contradict my previous statement), but only to create a landing zone for the users of the other two services, and with the clear goal of sunsetting those two.
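To make the “push and run” idea concrete, here is roughly what that looks like with the Code Engine CLI (a minimal sketch: the project name is made up, the image is IBM’s public hello-world sample, and flags may differ slightly between CLI versions):

```
# Select the Code Engine project to work in (assumes the CLI plugin is installed)
ibmcloud ce project select --name my-project

# Run a prebuilt container image as a serverless app: no cluster to manage,
# and Code Engine scales it (down to zero) behind a generated HTTPS endpoint
ibmcloud ce application create --name hello --image icr.io/codeengine/helloworld

# Inspect the app; the output includes the generated URL
ibmcloud ce application get --name hello
```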

Up next, we had to implement the use case that was the value proposition of Cloud Foundry: “Here’s my code, run it on the cloud for me, I don’t care how”, to quote the official CF haiku. We developed a smooth migration on-ramp for Cloud Foundry users to Code Engine, including an equally simple equivalent of the popular “cf push” command called “ce app create”. All of that was done on a modern, Kubernetes-based technology stack, extended with open source CNCF/CDF projects like Knative (to allow scale to zero), Paketo Buildpacks (to transform source code into containers) and Shipwright (to oversee, orchestrate and execute container builds), and glued together with some custom code, e.g. to integrate the service with IBM Cloud IAM, billing, logging and monitoring.
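For former Cloud Foundry users, the source-to-container path looks roughly like the sketch below (“ce app create” being short for “ibmcloud ce application create” here); treat the flag names as illustrative rather than authoritative:

```
# From a directory containing your application source, no Dockerfile required:
# the buildpacks strategy (Paketo under the covers) detects the language,
# builds a container image and deploys it, the moral equivalent of "cf push"
ibmcloud ce application create --name myapp \
  --build-source . \
  --strategy buildpacks
```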

Once we had that, customers started migrating from Cloud Foundry to Code Engine, and we were in a position to tackle the next feature: the infamous “serverless batch job as a service” I spoke about earlier. We tagged along with the Kubernetes Batch initiative, as it seemed like a good fit, and were fairly quickly able to run batch jobs and offer them as a feature of the Code Engine service. By then we could strike two of our three goals off the list, and while we received delighted feedback from customers (“I had a solution where a web app periodically spawned batch job runs. I ran that on bare metal servers, on Cloud Foundry and on other services, and now I can just have it all in Code Engine behind a single API, that is so cool!”), we also quickly learned that Kubernetes Batch is not yet up there with high-profile, HPC-grade batch scheduling technologies. Keep that in mind for later…
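In Code Engine, a batch job is defined once as a template and then submitted as (possibly many parallel) job runs. A minimal sketch using IBM’s public sample image; the resource values are arbitrary examples:

```
# Define the job template: which image to run, and with what resources
ibmcloud ce job create --name myjob \
  --image icr.io/codeengine/firstjob \
  --cpu 4 --memory 16G

# Submit one run of the job as 100 parallel array instances; each instance
# can read its own index from the JOB_INDEX environment variable
ibmcloud ce jobrun submit --job myjob --array-indices "0-99"
```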

Fast forward to today: in the last couple of months we were able to ship the last missing piece of the three-services-to-one transition: functions support in Code Engine. That one was a bit trickier. As anyone familiar with Kubernetes will understand, in plain K8s it is very hard, if not impossible, to get a container deployed, started and responding to an HTTP request within a few hundred milliseconds, so a few tricks of the trade had to be applied to get to the desired outcome.
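Function support follows the same “here’s my code” pattern. A sketch of what creating a function looks like; the runtime name and exact flags are assumptions on my part, so check the CLI help for your version:

```
# Build and deploy a function from local source into a managed runtime
# that is tuned for the sub-second cold starts described above
# (runtime name is illustrative; available runtimes vary by CLI/region version)
ibmcloud ce fn create --name hello-fn \
  --runtime nodejs-18 \
  --build-source .
```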

So far, so good. Mission accomplished?! Indeed, we are pretty proud to have built the most complete and unified serverless container service that we know of, and to run it with extraordinary efficiency. But remember, in the beginning I said “LLMs were not on the radar of many yet” and “let’s create one service that can cover all three use cases, and potentially other, similar ones”. When you think about it, a lot of the recent AI/LLM use cases, such as model training, are quite similar in nature to serious HPC scenarios. So where are the challenges? First, HPC/AI workloads often have different infrastructure requirements than web apps or functions: they need special hardware such as GPUs, extraordinarily big machines or high network bandwidth, and they need either no scheduler at all or a very fault-tolerant scheduler specifically designed for these kinds of workloads.

As you can see, there are more challenges ahead. But the design and architecture decisions we have taken allow us to expand and grow our current generic container runtime platform. If we can find ways to address the use cases from the previous paragraph with Code Engine (and yes, we have some ideas), we will be very close to a generic compute platform instead of just a container platform. And then, only then, I’d call it mission accomplished. So stay tuned for what’s next; I can promise you it will be very cool!

References:

[1] https://matduggan.com/serverless-functions-post-mortem/

[2] https://en.wikipedia.org/wiki/Serverless_computing
