Data and AI on Power

IBM Power systems provide a robust and scalable platform for a wide range of data and AI workloads, offering benefits in performance, security, and ease of use.

How to address the OOMKilled issue in the runtime-assemblies-operator within a Cloud Pak for Data pod

By Theresa Xu posted Sat January 06, 2024 11:41 AM

This blog provides a step-by-step approach to resolving the 'OOMKilled' (Out of Memory Killed) error encountered in the runtime-assemblies-operator pod on the Cloud Pak for Data (CP4D) platform. This error occurs when a pod or container exceeds the total amount of memory allocated to it.

Initial encounter

This issue arose while horizontally scaling our Long Short-Term Memory (LSTM) model deployment to more than 96 API endpoints.

How to resolve the ‘OOMKilled’ issue

Step 1: Verify the deployment status

Ensure the deployment for the runtime-assemblies-operator pod is running by executing the following command in the OpenShift Container Platform (OCP) environment where CP4D is installed.

oc get deployments -n cpd-instance | grep runtime-assemblies-operator

Expected output:

[root@sphrapids2lp1 ~]# oc get deployments -n cpd-instance | grep runtime-assemblies-operator
runtime-assemblies-operator                             1/1     1            1           14d
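The same check can be scripted for automation. The sketch below parses the READY column of `oc get deployments` output; the `deployment_ready` helper name is ours, not part of CP4D or OCP.

```shell
# Sketch: succeed only if the READY column (e.g. "1/1") shows all
# desired replicas up. Helper name is hypothetical.
deployment_ready() {
  local ready="$1"                       # e.g. "1/1" from the READY column
  local up="${ready%/*}" want="${ready#*/}"
  [ "$up" = "$want" ] && [ "$up" -gt 0 ]
}

# Against a live cluster you would feed it real output, e.g.:
# ready=$(oc get deployments -n cpd-instance runtime-assemblies-operator \
#           --no-headers | awk '{print $2}')
# deployment_ready "$ready" && echo "operator is ready"
```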

Step 2: Modify the deployment’s configuration

Modify the deployment's resource configuration by running the following command.

oc edit deployment runtime-assemblies-operator -n cpd-instance

Expected block to modify:

        resources:
          limits:
            cpu: 350m
            ephemeral-storage: 1Gi
            memory: 320Mi
          requests:
            cpu: 30m
            ephemeral-storage: 10Mi
            memory: 128Mi
        securityContext:

Adjust the memory limit as needed; for example, increasing memory from 320Mi to 640Mi resolved our issue.

Note: The optimal memory amount may require further experimentation.
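If you prefer a non-interactive change (for scripting rather than `oc edit`), the deployment can also be patched directly. The snippet below is a sketch: the `build_mem_patch` helper is ours, and the container name passed to it is an assumption — check the actual container name with `oc describe deployment` before patching.

```shell
# Sketch: build a strategic-merge patch that raises one container's
# memory limit. Helper name and container name are assumptions.
build_mem_patch() {
  local container="$1" mem="$2"
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"limits":{"memory":"%s"}}}]}}}}' \
    "$container" "$mem"
}

# Usage against a live cluster (verify the container name first):
# oc patch deployment runtime-assemblies-operator -n cpd-instance \
#   --type=strategic -p "$(build_mem_patch <container-name> 640Mi)"
```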

Step 3: Verify pod status

After modifying the memory resources, a new pod is created in the 'ContainerCreating' state and transitions to 'Running' after one to two minutes. Run the following command to verify.

oc get pods -o wide -n cpd-instance | grep runtime-assemblies-operator

Expected output:

[root@sphrapids2lp1 ~]# oc get pods -o wide -n cpd-instance | grep runtime-assemblies-operator
runtime-assemblies-operator-6fcc886b5c-xz596                      1/1     Running            0                 17h     10.128.2.117   cp4i-w2.s2lp1.toropsp.com   <none>           <none>

Run the following commands to verify that the modified memory resources have been assigned.

[root@sphrapids2lp1 ~]# oc get pods -o wide -n cpd-instance | grep runtime-assemblies-operator
runtime-assemblies-operator-6fcc886b5c-xz596                      1/1     Running            5 (162m ago)       4d9h    10.128.2.117   cp4i-w2.s2lp1.toropsp.com   <none>           <none>
[root@sphrapids2lp1 ~]# oc describe pod runtime-assemblies-operator-6fcc886b5c-xz596 -n cpd-instance | grep Limits: -A8
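The new limit can also be confirmed programmatically by pulling it with a JSONPath query and comparing it as bytes. The `mem_to_bytes` converter below is a small sketch of ours covering the common Ki/Mi/Gi suffixes:

```shell
# Sketch: convert a Kubernetes memory quantity (Ki/Mi/Gi suffix or
# plain bytes) to bytes so limits can be compared numerically.
mem_to_bytes() {
  local q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$q" ;;
  esac
}

# Against a live cluster (pod name taken from the output above):
# limit=$(oc get pod runtime-assemblies-operator-6fcc886b5c-xz596 -n cpd-instance \
#           -o jsonpath='{.spec.containers[0].resources.limits.memory}')
# [ "$(mem_to_bytes "$limit")" -ge "$(mem_to_bytes 640Mi)" ] && echo "limit applied"
```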

Summary

In conclusion, this blog highlights the 'OOMKilled' issue encountered when scaling a model deployment beyond 96 API endpoints on the Cloud Pak for Data platform and provides a detailed resolution. Note that this issue was observed on both Power10 and x86 systems during scaling.

For any queries or additional information, feel free to comment below or reach out to me at theresax@ca.ibm.com. Co-author credits go to Revanth Atmakuri, who can be contacted at revanth.atmakuri@ibm.com.
