Scaling inferencing workloads in Cloud Pak for Data: Key considerations

By Theresa Xu posted Sat January 06, 2024 11:48 AM

The Cloud Pak for Data (CP4D) deployment space provides REST API interfaces through microservices, so each model deployment can be scaled independently to handle highly parallel inferencing workloads. This blog covers four crucial aspects of scaling these workloads efficiently.

1. Number of replicas for each AI model deployment

The right number of replicas depends on how many concurrent inferencing users you need to support and how long those users can acceptably wait for a response.

For instance, in one scenario, increasing the replicas (REST API endpoints) from 1 to 48 yielded an impressive 90% throughput scaling efficiency. Pushing further to 96 replicas, however, dropped the efficiency to a noticeable 70%. This underscores a critical trade-off: 96 REST API endpoints can accommodate twice as many concurrent users, but the average response time for each user grows by 30%. A workaround is to create a second deployment of the same model and scale it to 48 replicas as well. This approach achieves 90% scaling across 96 replicas while using the same compute and memory resources.
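
To get a feel for how a given replica count behaves under load, you can drive the deployment's scoring endpoint with a configurable number of concurrent users. The following Python sketch is only illustrative: the scoring URL, bearer token, and payload are placeholders that you must replace with values from your own deployment space, and the endpoint path shown follows the general WML scoring API shape.

# Minimal sketch: measure throughput and average latency for N concurrent
# users against one deployment endpoint. All credentials/URLs are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SCORING_URL = "https://<cp4d-host>/ml/v4/deployments/<deployment-id>/predictions?version=2023-05-01"
TOKEN = "<bearer-token>"
PAYLOAD = {"input_data": [{"fields": ["f1", "f2"], "values": [[1.0, 2.0]]}]}
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

def score_once(_):
    # verify=False is common with self-signed CP4D certificates; prefer a CA bundle.
    r = requests.post(SCORING_URL, json=PAYLOAD, headers=HEADERS, verify=False)
    r.raise_for_status()
    return r.elapsed.total_seconds()

def run(concurrent_users, requests_per_user):
    total = concurrent_users * requests_per_user
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(score_once, range(total)))
    elapsed = time.time() - start
    print(f"{concurrent_users} users: {total / elapsed:.1f} req/s, "
          f"avg latency {sum(latencies) / len(latencies):.3f} s")

# Compare throughput as concurrency grows, e.g. 48 vs. 96 concurrent users.
for users in (48, 96):
    run(users, 20)

Comparing the reported requests per second at different concurrency levels gives you the same kind of scaling-efficiency picture described above for your own models.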

2. CP4D sizing and performance impact for pod scaling

CP4D infrastructure pods are deployed to support an extra-small cluster size by default. To support highly concurrent inferencing workloads with more than 100 REST API endpoints, refer to the IBM CP4D documentation on manual scaling: Manually scaling resources for services.

Consider increasing the number of wml-deployment-envoy pods and adjusting the memory of the runtime-assemblies-operator pod for better performance. For more details, refer to this blog: How to address the OOMKilled issue in the runtime-assemblies-operator within a Cloud Pak for Data pod.
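
If you prefer scripting over the console, the Kubernetes Python client can report and adjust the replica count of wml-deployment-envoy. This is only a sketch: it assumes wml-deployment-envoy is backed by a Deployment in your CP4D instance namespace (the namespace below is an assumption), and the WML operator may reconcile manual changes, so the procedure in the IBM documentation linked above remains the supported route.

# Hedged sketch: inspect and bump the replica count of the
# wml-deployment-envoy Deployment via the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()           # or config.load_incluster_config()
apps = client.AppsV1Api()

NAMESPACE = "cpd-instance"           # assumption: adjust to your CP4D instance namespace
DEPLOYMENT = "wml-deployment-envoy"

current = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE)
print(f"current replicas: {current.spec.replicas}")

# Scale up (the operator may revert this if it manages the replica count).
apps.patch_namespaced_deployment_scale(
    DEPLOYMENT, NAMESPACE, body={"spec": {"replicas": 4}}
)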

3. Worker node pod limit of 250

For environments with large worker nodes capable of supporting more than 250 pods, refer to Red Hat's instructions for increasing the limit and related configuration parameters in Red Hat OpenShift: Recommended host practices.
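
Before raising maxPods, it can help to confirm that the 250-pod default is actually the constraint. This hedged sketch uses the Kubernetes Python client to compare the number of pods scheduled on each node with that node's allocatable pod capacity.

# Quick check (sketch): pods running per node vs. the node's allocatable pod
# capacity, to see whether the default pod limit is what you are hitting.
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods_per_node = Counter(
    p.spec.node_name
    for p in v1.list_pod_for_all_namespaces().items
    if p.spec.node_name
)

for node in v1.list_node().items:
    name = node.metadata.name
    capacity = node.status.allocatable["pods"]
    print(f"{name}: {pods_per_node[name]} pods running, allocatable {capacity}")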

4. Database resource limits and overall performance

If inferencing pods share a common database for correlating information, the database's performance can significantly impact inferencing throughput and latency. Considerations include the number of concurrent connections supported by the database and the type of read locks used for data selection.

Using Db2u as an example, the number of vcores allocated to the database deployment determines how many concurrent database connections it can handle and how large its memory pools are. Memory requirements for the Db2u pod also vary with query complexity. For more information, refer to How to scale a Db2 pod in Red Hat OpenShift Container Platform and IBM Cloud Pak for Data.
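
One simple way to keep the inferencing tier within the database's connection limit is a small bounded connection pool in each replica, so scaling out the REST API endpoints does not multiply the number of open Db2 connections unchecked. The sketch below uses the ibm_db driver with placeholder connection details; the pool size is an illustrative value that you would align with your Db2u sizing.

# Sketch: cap the concurrent Db2 connections held by one inferencing replica
# with a bounded pool. DSN values are placeholders.
import queue
from contextlib import contextmanager

import ibm_db

DSN = ("DATABASE=BLUDB;HOSTNAME=<db2u-host>;PORT=50000;"
       "PROTOCOL=TCPIP;UID=<user>;PWD=<password>")
POOL_SIZE = 10   # keep well below the database's concurrent-connection limit

_pool = queue.Queue(maxsize=POOL_SIZE)
for _ in range(POOL_SIZE):
    _pool.put(ibm_db.connect(DSN, "", ""))

@contextmanager
def pooled_connection():
    conn = _pool.get()          # blocks instead of opening another connection
    try:
        yield conn
    finally:
        _pool.put(conn)

# Example use inside an inferencing request handler:
with pooled_connection() as conn:
    stmt = ibm_db.exec_immediate(conn, "SELECT 1 FROM SYSIBM.SYSDUMMY1")
    print(ibm_db.fetch_tuple(stmt))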

Summary

This blog highlighted essential considerations for efficiently scaling inferencing workloads in CP4D: replica counts, CP4D sizing, worker node pod limits, and database resource limits all affect throughput and response times. Addressing them up front helps keep the inferencing environment well managed and scalable.
