With a fixed amount of CPU and memory on your worker nodes, giving more compute and memory to each pod means fewer pods can be created; at the same time, giving each pod more compute and memory may improve its response time or throughput. In this blog, we explore the crucial performance and memory trade-offs of choosing between 2 GB and 4 GB RAM configurations for your inferencing pods on Cloud Pak for Data (CP4D) version 4.8.
By default, CP4D allows users to define the runtime configuration with 1 vCPU and either 2 GB or 4 GB RAM. Our experimental testing, conducted on IBM Power10, revealed significant differences in performance between 2 GB and 4 GB RAM, highlighting the importance of this decision. Continue reading to learn more.
Prerequisites
To complete the testing we outline in this blog, ensure the following prerequisites are in place:
Estimated time to complete
Assuming all prerequisites and permissions are met, setting the configuration for the inferencing pod is instantaneous. However, switching between memory options requires deleting and redeploying the pods under the updated deployment.
How to select different runtime configurations
Refer to the steps outlined in Coding an inferencing endpoint for Long-Short Term Memory model in Cloud Pak for Data deployment space. When defining the configuration, choose one of the following runtimes: extra extra small (1 vCPU and 2 GB RAM) or extra small (1 vCPU and 4 GB RAM).
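If you prefer to script this step rather than use the deployment space UI, the runtime size can also be selected programmatically. Below is a minimal sketch using the ibm_watson_machine_learning Python client; it assumes you already have a deployment space and a stored model, the credential values and IDs are placeholders, and the hardware specification names XXS (1 vCPU, 2 GB RAM) and XS (1 vCPU, 4 GB RAM) correspond to the extra extra small and extra small runtimes.

```python
from ibm_watson_machine_learning import APIClient

# Placeholder credentials for a CP4D 4.8 cluster -- replace with your own values.
wml_credentials = {
    "url": "https://<cp4d-host>",
    "username": "<username>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "4.8",
}

client = APIClient(wml_credentials)
client.set.default_space("<deployment-space-id>")

# Choose "XXS" (1 vCPU, 2 GB RAM) or "XS" (1 vCPU, 4 GB RAM) for the runtime size.
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "lstm-inference",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "XS"},
}

# Deploy the stored LSTM model with the selected runtime configuration.
deployment = client.deployments.create("<model-id>", meta_props=meta_props)
print(client.deployments.get_uid(deployment))
```

Switching between the 2 GB and 4 GB options then amounts to deleting the existing deployment and recreating it with the other hardware specification name, which is why the pods must be redeployed as noted earlier.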
Performance and memory trade-offs
Our tests were conducted on an IBM Power S1022 (2x20c with DDR4, 2 TB of memory). The average throughput for inferencing with 48 and 96 pods running the same workload across different memory configurations is shown in Table 1. In our testing, 48 pods support 96 concurrent users, and 96 pods support 192 concurrent users.
| Memory configuration | Pods | Throughput (scores/second) |
|----------------------|------|----------------------------|
| 2 GB                 | 96   | 57.80                      |
| 4 GB                 | 48   | 65.60                      |
| 4 GB                 | 96   | 124.53                     |
Table 1: Inferencing with 48 and 96 pods across different memory configurations.
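The throughput in Table 1 is the number of scoring requests completed per second. As a rough illustration of how such a number can be obtained on the client side, the hedged sketch below drives concurrent scoring requests against a single deployment using the ibm_watson_machine_learning client; the deployment ID, field names, and payload values are placeholders, and this is not the exact harness used for our measurements.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumes `client` is the APIClient configured as in the earlier deployment sketch.
DEPLOYMENT_ID = "<deployment-id>"   # placeholder
CONCURRENT_USERS = 2                # two concurrent users per pod, as in our testing
REQUESTS_PER_USER = 100

payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [
        # Placeholder field names and values for the LSTM model input.
        {"fields": ["<feature-1>", "<feature-2>"], "values": [[0.1, 0.2]]}
    ]
}

def run_user() -> int:
    """Send a fixed number of scoring requests and return how many completed."""
    done = 0
    for _ in range(REQUESTS_PER_USER):
        client.deployments.score(DEPLOYMENT_ID, payload)
        done += 1
    return done

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    total = sum(pool.map(lambda _: run_user(), range(CONCURRENT_USERS)))
elapsed = time.time() - start

print(f"Throughput: {total / elapsed:.2f} scores/second")
```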
If your OpenShift environment is memory constrained to around 192 to 200 GB, you can create the inferencing pods with either 2 GB or 4 GB of memory. The 96 pods x 2 GB and 48 pods x 4 GB configurations consume the same total memory: 192 GB across the cluster. It is worth noting that the 48 pod case supports only 96 concurrent users, compared to 192 concurrent users in the 96 pod case; if you want to serve the same number of users with 48 pods, requests may wait longer in the 48 pods x 4 GB case. Comparing throughputs, the 48 pods x 4 GB scenario delivers 13.5% higher throughput than the 96 pods x 2 GB scenario. If throughput is a critical factor and memory is limited, the 48 pods x 4 GB configuration appears to be more favorable. However, if supporting more pods and concurrent users is more important, the 96 pods x 2 GB configuration does just that at the expense of throughput.
If your OpenShift environment has no memory constraints, you can test with more pods for each memory configuration. For example, the 48 pods x 4 GB configuration above uses 192 GB in total; without a memory constraint, you can run the 96 pods x 4 GB configuration, which uses 384 GB in total. Keep in mind that with 48 pods the system accommodates only 96 concurrent users, whereas with 96 pods it supports 192 concurrent users. The 96 pods x 4 GB configuration provides an 89.8% throughput advantage over the 48 pods x 4 GB configuration. If total memory is not a factor and higher throughput is the priority, the 96 pods x 4 GB configuration is likely the better choice.
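As a quick sanity check on the figures quoted in the two previous paragraphs, the short snippet below recomputes the total cluster memory and the relative throughput gains directly from the numbers in Table 1.

```python
# Throughput figures from Table 1 (scores/second), keyed by (pods, GB per pod).
throughput = {
    (96, 2): 57.80,
    (48, 4): 65.60,
    (96, 4): 124.53,
}

def total_memory_gb(pods: int, gb_per_pod: int) -> int:
    """Total memory the configuration consumes across the cluster."""
    return pods * gb_per_pod

def gain_pct(new: float, baseline: float) -> float:
    """Relative throughput advantage of `new` over `baseline`, in percent."""
    return (new / baseline - 1) * 100

print(total_memory_gb(96, 2), total_memory_gb(48, 4))                 # 192 192 -> same footprint
print(total_memory_gb(96, 4))                                         # 384
print(round(gain_pct(throughput[(48, 4)], throughput[(96, 2)]), 1))   # 13.5
print(round(gain_pct(throughput[(96, 4)], throughput[(48, 4)]), 1))   # 89.8
```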
It is also noteworthy that at the beginning of the runs, the throughput of the 2 GB RAM inferencing pods is similar to that of the 4 GB RAM inferencing pods. However, as the 2 GB RAM runs continue, the throughput gradually decreases. Additionally, we observed at least 10 times more errors, with over 60% of them being gateway timeouts, for inferencing with the 2 GB RAM pod configuration compared to the 4 GB RAM pod configuration.
Choosing the lower memory configuration allows you to deploy more pods for your workload within the same available resources. However, this comes at the cost of significantly reduced performance and increased errors. If the workload demands higher performance, the 4 GB memory configuration is more suitable, especially in scenarios with a larger number of pods and higher throughput requirements. The 2 GB memory configuration might be suitable for less resource-intensive workloads, or where cost considerations favor a lower memory footprint.
Summary
To summarize, this blog compared varying memory configurations for inferencing pods on the Power10 system, with an emphasis on inferencing throughput. If throughput is a critical factor and memory is limited, the higher memory pod configuration appears to be more favorable. If total memory is not a factor and higher throughput is the priority, the higher memory configuration with a larger number of pods is favorable. While initially comparable, the throughput of the 2 GB RAM configuration gradually decreased, and it produced at least 10 times more errors, predominantly gateway timeouts. Opting for lower memory enables you to deploy more pods but significantly hampers performance and increases errors.
For any questions or additional information, feel free to comment below or reach out to us at shadman.kaif@ibm.com or theresax@ca.ibm.com.