Overview
Turbonomic can now automate scale actions for serverless workloads managed by Knative Serving or a KServe Serverless installation. This blog explains how we separate the zero-to-one scenario from the one-to-many scenario and automatically apply a different approach to each.
Serverless Workloads and Challenges in Autoscaling
Knative is an open-source, enterprise-grade solution for building serverless and event-driven applications. Red Hat OpenShift Serverless is built on it, and the KServe Serverless installation also leverages Knative Serving for serving AI models, including LLMs. With a serverless framework, developers can focus on writing application code instead of managing intricate infrastructure details.
Among all the functions provided by Knative Serving, Scale to Zero is one of the most attractive. If an application is receiving no traffic and Scale to Zero is enabled, Knative Serving scales it down to zero replicas and frees all of its resources. Replicas are scaled back up as soon as traffic starts to hit the application again.
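As a minimal sketch, Scale to Zero is configured on the revision template of a Knative Service; the service name and image below are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                               # placeholder service name
spec:
  template:
    metadata:
      annotations:
        # Allow this revision to scale all the way down to zero replicas
        autoscaling.knative.dev/min-scale: "0"
    spec:
      containers:
        - image: example.com/hello:latest   # placeholder image
```

Note that Scale to Zero must also be enabled cluster-wide (it is on by default) via the `enable-scale-to-zero` key in the `config-autoscaler` ConfigMap.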
Scale to Zero is decided based on traffic, and the workload needs to scale back up very quickly once user requests arrive. General horizontal scaling has many considerations, starting with which metrics to use. Infrastructure metrics like CPU and memory usage are a good starting point, but in practice user-defined Service Level Objectives (SLOs) provide a more accurate signal, especially for modern AI inference services. Using the same metrics and criteria for both Scale to Zero and general horizontal scaling is not practical in a production environment.
Knative provides automatic scaling, or autoscaling, for applications to match incoming demand. It supports two autoscaler implementations: the Horizontal Pod Autoscaler (HPA) and the Knative Pod Autoscaler (KPA). KPA supports Scale to Zero but can only use "concurrency" and "rps" (requests per second) as metrics; HPA supports "cpu", "memory", and custom metrics, but it does not support Scale to Zero.
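The autoscaler class and metric are selected per revision through annotations on the revision template (`spec.template.metadata.annotations`). A sketch of the two options, with illustrative target values:

```yaml
# Option 1: KPA with requests per second -- supports Scale to Zero
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: rps
autoscaling.knative.dev/target: "100"   # assumed target: 100 rps per replica
---
# Option 2: HPA with CPU -- no Scale to Zero
autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: cpu
autoscaling.knative.dev/target: "80"    # assumed target: 80% CPU utilization
```

Only one class/metric pair applies to a given revision at a time.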
How Turbonomic Helped
Turbonomic supports horizontal scaling of cloud-native application services, as well as the underlying compute resources, based on customer-experience Service Level Objectives (SLOs). We have now married Turbonomic's SLO-based horizontal scaling with the autoscalers supported by Knative to provide different solutions for different situations:
- When the desired replica number is 1, Turbonomic sets the scale bounds to [0,1] and lets the KPA autoscaler handle the Scale to Zero scenario based on rps.
- When the desired replica number n is more than 1, Turbonomic sets the scale bounds to [n,n] so that the autoscaler can only keep the workload at n replicas, following Turbonomic's SLO-based decision.
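In Knative annotation terms, the two cases above roughly translate to the following revision-template settings (the concrete value of n and the rps target are illustrative):

```yaml
# Desired replicas == 1: bounds [0,1], KPA on rps handles Scale to Zero
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: rps
autoscaling.knative.dev/target: "100"   # assumed target: 100 rps per replica
autoscaling.knative.dev/min-scale: "0"
autoscaling.knative.dev/max-scale: "1"
---
# Desired replicas == n (here n = 3): bounds [n,n] pin the workload at
# the SLO-based replica count decided by Turbonomic
autoscaling.knative.dev/min-scale: "3"
autoscaling.knative.dev/max-scale: "3"
```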
This solution combines the strengths of Turbonomic and KPA to provide the best outcome for workloads. Everything is done automatically by Turbonomic; users enjoy this autopilot without any additional action.
However, there is a small flaw: changing the scale bounds from n1 to n2 creates additional resource demand. Knative treats a scale-bound change as a new revision of the service, and, as befits an enterprise solution, a new revision triggers a rolling update. In other words, Knative will not shut down the old revision until the new revision is fully up and running, so during this process up to n2 additional replicas' worth of resources is temporarily needed. This is usually fine because there is headroom in the system. For safety, Turbonomic checks the free resources in the system and will not start the scaling if there is not enough headroom.
How to Avoid the Temporary Resource Demand
In some extreme cases, users do not have much headroom for a critical resource such as GPUs. Here Turbonomic can still help by leveraging HPA's custom-metric support and automatically switching autoscalers in Knative. The updated solution is:
- (No change) When the desired replica number is 1, Turbonomic sets the scale bounds to [0,1] and sets the autoscaler to KPA to handle the Scale to Zero scenario based on rps.
- When the desired replica number is more than 1, Turbonomic sets the scale bounds to [1,max] and sets the autoscaler to HPA with a custom metric. Turbonomic then updates the replica count directly and ensures that HPA does not change it.
The simplest way to keep HPA away from the workload is to configure a custom metric that does not exist. As long as the desired replica count stays within the scale bounds, HPA is happy with whatever number Turbonomic sets.
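Under those assumptions, the HPA-based variant might look like the sketch below; the metric name and the max bound are hypothetical placeholders:

```yaml
# Desired replicas > 1: bounds [1,max], HPA class with a custom metric
# that is never published, so HPA leaves the replica count to Turbonomic
autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: turbonomic-noop-metric   # hypothetical, nonexistent metric
autoscaling.knative.dev/min-scale: "1"
autoscaling.knative.dev/max-scale: "10"                  # placeholder for max
```

Because only the min-scale and max-scale annotations keep changing under KPA, switching to the HPA class with a fixed [1,max] bound means replica changes no longer create new revisions, avoiding the rolling-update resource spike.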
This is an alpha feature of the product; reach out to your Turbonomic representative if you need it enabled.
Next Step
Now you know how Turbonomic can help your workloads dance with Knative Serving, scaling from zero to any number with the best metrics. Go ahead and try it!