This blog provides a step-by-step approach to resolving the 'OOMKilled: Out of Memory Killed' issue encountered in the runtime-assemblies-operator pod within the Cloud Pak for Data (CP4D) platform. This issue generally occurs when a container in the pod tries to use more memory than its configured memory limit allows, at which point the container is terminated.
Initial encounter
This issue arose while we were horizontally scaling our Long Short-Term Memory (LSTM) model deployment to more than 96 API endpoints.
How to resolve the ‘OOMKilled’ issue
Step 1: Verify the deployment status
Ensure the deployment for the runtime-assemblies-operator pod is running using the following command in the OpenShift Container Platform (OCP) environment where CP4D is installed.
oc get deployments -n cpd-instance | grep runtime-assemblies-operator
Expected output:
[root@sphrapids2lp1 ~]# oc get deployments -n cpd-instance | grep runtime-assemblies-operator
runtime-assemblies-operator 1/1 1 1 14d
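Before changing anything, it can help to confirm that the restarts really were caused by the memory limit. The following is a minimal sketch, not part of our original procedure: the helper name last_termination_reasons is hypothetical, and the namespace and pod name filter are assumed to match the commands used in this blog. It lists each matching pod with the reason its container was last terminated.

```shell
# Hypothetical helper: print "<pod name><TAB><last termination reason>"
# for every pod whose name matches the filter.
# Defaults (cpd-instance, runtime-assemblies-operator) are taken from this blog.
last_termination_reasons() {
  local namespace="${1:-cpd-instance}"
  local name_filter="${2:-runtime-assemblies-operator}"
  oc get pods -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | grep -- "$name_filter"
}

# Usage (requires a logged-in oc session):
#   last_termination_reasons
#   last_termination_reasons my-namespace my-pod-prefix
```

If the second column shows OOMKilled, the container was stopped for exceeding its memory limit; an empty second column means the container has not been terminated since the pod started.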
Step 2: Modify the pod’s configuration
Modify the pod's resource configuration by editing its deployment with the following command.
oc edit deployment runtime-assemblies-operator -n cpd-instance
Expected block to modify:
resources:
  limits:
    cpu: 350m
    ephemeral-storage: 1Gi
    memory: 320Mi
  requests:
    cpu: 30m
    ephemeral-storage: 10Mi
    memory: 128Mi
securityContext:
Adjust the memory value under limits as needed, then save and close the editor to apply the change; for example, increasing the memory limit from 320Mi to 640Mi resolved our issue.
Note: The optimal memory amount may require further experimentation.
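If you prefer not to edit the deployment interactively, the same change can probably be made in one command with oc set resources. This is a sketch under assumptions rather than the method we used: the helper name set_memory_limit is hypothetical, and the defaults assume the namespace and deployment name shown in this blog. Only the memory limit is passed, so the CPU and ephemeral-storage values should stay as they are.

```shell
# Hypothetical helper: set the memory limit on a deployment's containers.
# Arguments: <memory limit> [namespace] [deployment]
set_memory_limit() {
  local memory_limit="${1:?usage: set_memory_limit <limit, e.g. 640Mi> [namespace] [deployment]}"
  local namespace="${2:-cpd-instance}"
  local deployment="${3:-runtime-assemblies-operator}"
  oc set resources "deployment/$deployment" -n "$namespace" \
    --limits="memory=$memory_limit"
}

# Usage (requires a logged-in oc session):
#   set_memory_limit 640Mi
```

As with oc edit, changing the pod template this way triggers a rollout of a new pod.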
Step 3: Verify pod status
After you modify the memory resources, a new pod is created in the 'ContainerCreating' state and transitions to 'Running' after 1-2 minutes. Run the following command to verify.
oc get pods -o wide -n cpd-instance | grep runtime-assemblies-operator
Expected output:
[root@sphrapids2lp1 ~]# oc get pods -o wide -n cpd-instance | grep runtime-assemblies-operator
runtime-assemblies-operator-6fcc886b5c-xz596 1/1 Running 0 17h 10.128.2.117 cp4i-w2.s2lp1.toropsp.com <none> <none>
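Instead of polling the pod list by hand, you could let oc wait for the rollout to finish. The sketch below is an optional addition: the helper name wait_for_rollout and the 180s default timeout are our assumptions, chosen to cover the 1-2 minutes mentioned above.

```shell
# Hypothetical helper: block until the deployment's rollout completes
# or the timeout expires (non-zero exit status on timeout or failure).
# Arguments: [namespace] [deployment] [timeout]
wait_for_rollout() {
  local namespace="${1:-cpd-instance}"
  local deployment="${2:-runtime-assemblies-operator}"
  local timeout="${3:-180s}"
  oc rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}

# Usage (requires a logged-in oc session):
#   wait_for_rollout
```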
Run the following commands to verify that the modified memory limit has been applied to the pod.
[root@sphrapids2lp1 ~]# oc get pods -o wide -n cpd-instance | grep runtime-assemblies-operator
runtime-assemblies-operator-6fcc886b5c-xz596 1/1 Running 5 (162m ago) 4d9h 10.128.2.117 cp4i-w2.s2lp1.toropsp.com <none> <none>
[root@sphrapids2lp1 ~]# oc describe pod runtime-assemblies-operator-6fcc886b5c-xz596 -n cpd-instance | grep Limits: -A8
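As an alternative to reading the describe output, the configured limit can be read directly from the deployment and compared with the value you expect. This is a minimal sketch: the helper name check_memory_limit is hypothetical, and it assumes the deployment has a single container (the pod shows 1/1 above), since with several containers the query would return several values.

```shell
# Hypothetical helper: compare the deployment's memory limit with an expected value.
# Arguments: <expected limit> [namespace] [deployment]
# Assumes a single-container deployment.
check_memory_limit() {
  local expected="${1:?usage: check_memory_limit <limit, e.g. 640Mi> [namespace] [deployment]}"
  local namespace="${2:-cpd-instance}"
  local deployment="${3:-runtime-assemblies-operator}"
  local actual
  actual="$(oc get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: memory limit is $actual"
  else
    echo "MISMATCH: expected $expected, found ${actual:-none}"
    return 1
  fi
}

# Usage (requires a logged-in oc session):
#   check_memory_limit 640Mi
```

The function returns a non-zero status on a mismatch, so it can also be used in a script that should stop if the new limit did not take effect.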
Summary
In conclusion, this blog highlights the 'OOMKilled' issue encountered when scaling deployment pods beyond 96 in the Cloud Pak for Data platform and provides a detailed resolution. Note that we observed this issue on both Power10 and x86 systems simultaneously during scaling.
For any queries or additional information, feel free to comment below or reach out to me at theresax@ca.ibm.com. Co-author credits go to Revanth Atmakuri, who can be contacted at revanth.atmakuri@ibm.com.