Overview
Turbonomic can now automate scale actions for serverless workloads managed by Knative Serving or a KServe Serverless installation. This blog explains how we separate the zero-to-one scenario from the one-to-many scenario and automatically apply a different approach to each.
Serverless Workloads and Challenges in Autoscaling
Knative is an open-source, enterprise-grade solution for building serverless and event-driven applications. Red Hat OpenShift Serverless is built on it, and the KServe Serverless installation also leverages Knative Serving for serving AI models, including LLMs. With a serverless framework, developers can focus on writing application code instead of managing intricate infrastructure details.
Among all the functions provided by Knative Serving, Scale to Zero is one of the most attractive. If an application is receiving no traffic and Scale to Zero is enabled, Knative Serving scales it down to zero replicas and frees all of its resources. Replicas are scaled back up as soon as traffic starts to hit the application again.
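As a minimal sketch, Scale to Zero is configured on the revision template of a Knative Service; the service name and image below are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                               # placeholder service name
spec:
  template:
    metadata:
      annotations:
        # Allow this revision to scale all the way down to zero replicas
        autoscaling.knative.dev/min-scale: "0"
    spec:
      containers:
        - image: example.com/hello:latest   # placeholder image
```

Note that Scale to Zero must also be enabled cluster-wide (it is on by default) via the `enable-scale-to-zero` key in the `config-autoscaler` ConfigMap.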
Scale to Zero is decided based on traffic, and the workload needs to scale back up very quickly once user requests arrive. General horizontal scaling has many considerations, starting with which metrics to use. Infrastructure metrics like CPU and memory usage are a good starting point, but in practice user-defined Service Level Objectives (SLOs) provide a more accurate signal, especially for modern AI inference services. Using the same metrics and criteria for both Scale to Zero and general horizontal scaling is not practical in a production environment.
Knative provides automatic scaling, or autoscaling, for applications to match incoming demand. It supports two autoscaler implementations: the Horizontal Pod Autoscaler (HPA) and the Knative Pod Autoscaler (KPA). KPA supports Scale to Zero but can only use "concurrency" and "rps" (requests per second) as metrics; HPA supports "cpu", "memory", and custom metrics, but it does not support Scale to Zero.
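The autoscaler class and metric are selected per revision through annotations on the revision template (`spec.template.metadata.annotations`). A sketch of the two options, with illustrative target values:

```yaml
# Option 1: KPA with requests per second -- supports Scale to Zero
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: rps
autoscaling.knative.dev/target: "100"   # assumed target: 100 rps per replica
---
# Option 2: HPA with CPU -- no Scale to Zero
autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: cpu
autoscaling.knative.dev/target: "80"    # assumed target: 80% CPU utilization
```

Only one class/metric pair applies to a given revision at a time.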
How Turbonomic Helped
Turbonomic supports horizontal scaling of cloud-native application services, as well as the underlying compute resources, based on customer-experience Service Level Objectives (SLOs). We have now married Turbonomic's SLO-based horizontal scaling with the autoscalers supported by Knative to provide different solutions for different situations:
- When the desired replica number is 1, Turbonomic sets the scale bounds to [0,1] and lets the KPA autoscaler handle the Scale to Zero scenario based on rps.
- When the desired replica number n is more than 1, Turbonomic sets the scale bounds to [n,n] so that the autoscaler can only keep the workload at n replicas, following Turbonomic's SLO-based decision.
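In Knative annotation terms, the two cases above roughly translate to the following revision-template settings (the concrete value of n and the rps target are illustrative):

```yaml
# Desired replicas == 1: bounds [0,1], KPA on rps handles Scale to Zero
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: rps
autoscaling.knative.dev/target: "100"   # assumed target: 100 rps per replica
autoscaling.knative.dev/min-scale: "0"
autoscaling.knative.dev/max-scale: "1"
---
# Desired replicas == n (here n = 3): bounds [n,n] pin the workload at
# the SLO-based replica count decided by Turbonomic
autoscaling.knative.dev/min-scale: "3"
autoscaling.knative.dev/max-scale: "3"
```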
This solution combines the strengths of Turbonomic and KPA to provide the best outcome for workloads. Everything is done automatically by Turbonomic; users enjoy this autopilot without any additional action.
However, there is a small flaw: changing the scale bounds from n1 to n2 creates additional resource demand. Knative treats a scale-bound change as a new revision of the service, and, as befits an enterprise solution, a new revision triggers a rolling update. In other words, Knative will not shut down the old revision until the new revision is fully up and running, so during this process up to n2 additional replicas' worth of resources is temporarily needed. This is usually fine because there is headroom in the system. For safety, Turbonomic checks the free resources in the system and will not start the scaling if there is not enough headroom.
How to Avoid the Temporary Resource Demand
In some extreme cases, users do not have much headroom for a critical resource such as GPUs. Here Turbonomic can still help by leveraging HPA's custom-metric support and automatically switching autoscalers in Knative. The updated solution is:
- (No change) When the desired replica number is 1, Turbonomic sets the scale bounds to [0,1] and sets the autoscaler to KPA to handle the Scale to Zero scenario based on rps.
- When the desired replica number is more than 1, Turbonomic sets the scale bounds to [1,max] and sets the autoscaler to HPA with a custom metric. Turbonomic then updates the replica count directly and ensures that HPA does not change it.
The simplest way to keep HPA away from the workload is to configure a custom metric that does not exist. As long as the desired replica count stays within the scale bounds, HPA is happy with whatever number Turbonomic sets.
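Under those assumptions, the HPA-based variant might look like the sketch below; the metric name and the max bound are hypothetical placeholders:

```yaml
# Desired replicas > 1: bounds [1,max], HPA class with a custom metric
# that is never published, so HPA leaves the replica count to Turbonomic
autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: turbonomic-noop-metric   # hypothetical, nonexistent metric
autoscaling.knative.dev/min-scale: "1"
autoscaling.knative.dev/max-scale: "10"                  # placeholder for max
```

Because only the min-scale and max-scale annotations keep changing under KPA, switching to the HPA class with a fixed [1,max] bound means replica changes no longer create new revisions, avoiding the rolling-update resource spike.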
This is an alpha feature of the product; reach out to your Turbonomic representative if you need it enabled.
Next Step
Now you know how Turbonomic can help your workloads dance with Knative Serving, scaling from zero to any number with the best metrics. Go ahead and try it!