CPD Performance Troubleshooting Best Practice Part 1

By Hong Wei Jia posted Sun September 12, 2021 08:07 AM

Cloud Pak for Data 3.5 Performance Troubleshooting Best Practice - Part 1


This article is based on a real-world Cloud Pak for Data (CPD) case: Cloud Pak for Data 3.5 running on OpenShift Container Platform 4.6. The client is a public cloud provider that offers Data & AI services to its tenants. Multiple Cloud Pak for Data instances, each in a separate namespace, run on the same OpenShift 4 cluster. One of the CPD instances encountered a performance problem while delivering a Notebook workshop for 30 concurrent users. Performance troubleshooting in a customer’s multi-tenancy environment is challenging, so in this article we summarize the approaches that proved useful.

Performance troubleshooting approaches

The performance troubleshooting approaches are divided into the following aspects.

  • Confirm whether it’s a true performance problem
  • Cloud Pak for Data Application level analysis
  • Infrastructure level analysis
  • Advanced cluster-level analysis

Is it a true performance problem?

Answering the following questions helps to identify whether it’s a true performance problem.

1. Does this ‘performance problem’ only happen in scalability scenarios (many concurrent users)?
2. Is the Cloud Pak for Data cluster in a healthy status?

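1) Are the assemblies in Ready status?

./cpd-cli status -n targetnamespace

2) Are there any unhealthy pods? The following command lists pods that are neither fully ready and Running nor Completed, so apart from the header line it should return nothing.

oc get po --all-namespaces -o wide | grep -Ev '1/1 .* R|2/2 .* R|3/3 .* R|4/4 .* R' | grep -v 'Completed'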

3. Are the end users using supported web browsers at supported versions?

The supported web browsers are listed in the Cloud Pak for Data documentation.

4. Is the ‘performance problem’ encountered by all end users?
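The grep pattern in the pod check above only recognizes pods with up to four containers, so a healthy 5/5 pod would be reported as unhealthy. As a minimal sketch, assuming the default column layout of `oc get po --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...), the same check can compare the two halves of the READY column instead. The function name `filter_unhealthy_pods` is hypothetical.

```shell
# Reads `oc get po --all-namespaces -o wide` output on stdin and prints the
# pods that are not Completed and are either not Running or not fully ready.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE ...
filter_unhealthy_pods() {
  awk '$1 == "NAMESPACE" { next }   # skip the header line
       $4 == "Completed" { next }   # finished jobs are expected
       {
         split($3, ready, "/")      # READY looks like "1/2"
         if ($4 != "Running" || ready[1] != ready[2])
           print $1, $2, $3, $4
       }'
}
```

It would be used as `oc get po --all-namespaces -o wide | filter_unhealthy_pods`; no output means that every pod is either Completed or Running with all containers ready.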

If the answers to the above questions are Yes and no unhealthy pods are found, we can proceed to the subsequent sections.

Cloud Pak for Data Application level analysis

Check the important metrics of system health and status

On the Platform Management page of the Cloud Pak for Data console, users can get an overview of the system health and status by checking the important metrics displayed there.

Firstly, this page provides a very high-level view of what’s running on this cluster: which services are installed and how many pods are running.

The vCPU section shows the “currently in use” CPU amount, the total CPU requests and limits. The Memory section shows the “currently in use” memory amount, the total memory requests and limits.

A healthy system under active load can have CPU and memory usage that is 30%-50% above the total requests. However, the usage should not be near the total limits, because the total limits are expected to be overcommitted, that is, the sum of all limits can exceed what the cluster can actually provide.
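The rule of thumb above can be checked with a little arithmetic. The following is a minimal sketch; the function name `usage_ratio` and the sample numbers (56 vCPU in use, 40 vCPU requested, 120 vCPU limits) are hypothetical and would be replaced with the values read from the Platform Management page.

```shell
# usage_ratio USED REQUESTS LIMITS
# Prints the current usage as a percentage of total requests and total limits.
usage_ratio() {
  awk -v used="$1" -v req="$2" -v lim="$3" 'BEGIN {
    printf "%.0f%% of requests, %.0f%% of limits\n",
           used / req * 100, used / lim * 100
  }'
}

# Hypothetical sample values, in vCPU.
usage_ratio 56 40 120   # prints: 140% of requests, 47% of limits
```

In this sample the usage is 40% above the requests and well below the limits, which falls inside the range described as healthy.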

Secondly, on the monitoring page, there are four tabs. The “Services” tab shows more details at the individual service level. Other tabs cover more information for service instances, environments, and pods. 

Known issues

Check whether any of the known issues listed in the Cloud Pak for Data documentation could impact the service performance.

If there’s a known issue related to the problem, apply the fix or workaround if needed.

For example, there’s a known issue with zen-metastoredb in CPD 3.5.2 that may cause the zen-metastoredb pod to hit an OOM (out-of-memory) error under heavy workload. Upgrading the Lite assembly to 3.5.3 can mitigate this problem and increase the concurrency capability.
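One way to see whether pods are being OOM-killed is to look at the last termination reason of their containers. The sketch below assumes that the input comes from `oc get pods` with a custom-columns output exposing the Kubernetes field `lastState.terminated.reason`. The function name `find_oom_killed` and the namespace `zen` are hypothetical.

```shell
# Reads "NAME REASONS" rows on stdin and prints the names of pods in which
# a container was last terminated with the reason OOMKilled.
find_oom_killed() {
  awk '$1 != "NAME" && $2 ~ /OOMKilled/ { print $1 }'
}

# Assumed way to produce the input (the namespace "zen" is hypothetical):
#   oc get pods -n zen \
#     -o custom-columns='NAME:.metadata.name,REASONS:.status.containerStatuses[*].lastState.terminated.reason' \
#     | find_oom_killed
```

An empty result only means that no container's most recent termination was an OOM kill; earlier events may still appear in the pod events or monitoring history.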


Sizing & Scaling

Analyze the user scenarios and workload related to the performance problem. You may need to scale the services to increase their processing capacity and get better performance.

The scaling guidance in the product documentation can be used for reference.
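For a workshop such as the one in this case, a rough estimate of the compute demand helps to decide whether scaling is needed. The sketch below assumes that every concurrent user starts exactly one notebook runtime. The function name `notebook_demand` and the per-runtime size (1 vCPU, 2 GB) are hypothetical and should be replaced with the size of the environment actually used.

```shell
# notebook_demand USERS VCPU_PER_RUNTIME MEM_GB_PER_RUNTIME
# Estimates the total resources needed if each user starts one runtime.
notebook_demand() {
  awk -v n="$1" -v cpu="$2" -v mem="$3" 'BEGIN {
    printf "%d runtimes need %g vCPU and %g GB memory\n", n, n * cpu, n * mem
  }'
}

# 30 concurrent users with a hypothetical runtime size of 1 vCPU and 2 GB.
notebook_demand 30 1 2   # prints: 30 runtimes need 30 vCPU and 60 GB memory
```

This covers only the notebook runtimes themselves; the shared services that those runtimes depend on need headroom as well.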

Infrastructure level analysis

It covers the following two parts, both of which are key to performance.

  • Storage (I/O performance)
  • Network bandwidth

I/O performance analysis

To ensure that the storage partition for Cloud Pak for Data has good disk I/O performance and can meet the storage requirements, run the disk latency test and the disk throughput test as described in the product documentation.
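Both tests are commonly run with `dd` using synchronous writes. The following is a minimal sketch, assuming GNU `dd` (for `oflag=dsync`) and a directory on the storage under test. The function names and temporary file names are hypothetical, and the results should be compared against the target values given in the documentation for your release.

```shell
# disk_latency_test DIR [COUNT]
# Writes COUNT small (4 KiB) blocks synchronously; dd reports the rate on stderr.
disk_latency_test() {
  dir="$1"; count="${2:-1000}"
  dd if=/dev/zero of="$dir/dd_latency.tmp" bs=4096 count="$count" oflag=dsync
  rc=$?
  rm -f "$dir/dd_latency.tmp"   # clean up the test file
  return "$rc"
}

# disk_throughput_test DIR [BLOCK_SIZE]
# Writes one large block synchronously; the default needs 1 GiB of free space.
disk_throughput_test() {
  dir="$1"; bs="${2:-1G}"
  dd if=/dev/zero of="$dir/dd_throughput.tmp" bs="$bs" count=1 oflag=dsync
  rc=$?
  rm -f "$dir/dd_throughput.tmp"
  return "$rc"
}
```

For example, `disk_latency_test /path/to/pvc_mount` followed by `disk_throughput_test /path/to/pvc_mount`, run from a pod or node that mounts the storage, where the mount path is a placeholder.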

Network performance analysis

iPerf is a free tool for network performance testing.

The recommended network bandwidth is 10 Gbps.
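A typical measurement runs the tool in server mode on one node and in client mode on another, then compares the reported bandwidth with the recommendation. The sketch below assumes `iperf3` (the article does not say which iPerf version was used), a hypothetical host name, and a measurement expressed in Gbit/s. The helper `meets_bandwidth` is hypothetical as well.

```shell
# Assumed usage of iperf3 between two nodes (the host name is hypothetical):
#   on node A:  iperf3 -s
#   on node B:  iperf3 -c node-a.example.com -t 30

# meets_bandwidth MEASURED [RECOMMENDED]
# Compares a measured bandwidth with the recommended one (default 10);
# both values must use the same unit, assumed here to be Gbit/s.
meets_bandwidth() {
  awk -v got="$1" -v want="${2:-10}" 'BEGIN {
    if (got + 0 >= want + 0) { print "OK"; exit 0 }
    print "BELOW RECOMMENDED"; exit 1
  }'
}
```

For example, `meets_bandwidth 9.4` prints `BELOW RECOMMENDED` and returns a non-zero status, which makes the check easy to use in a script.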

In this article, we introduced the following performance troubleshooting approaches.

  • Confirm whether it’s a true performance problem
  • Cloud Pak for Data Application level analysis
  • Infrastructure level analysis

The advanced cluster-level analysis will be introduced in detail in Part 2.