Lessons from the Field #20: Things to check when there's a problem in an OpenShift environment

By Kevin Grigorenko posted Tue August 16, 2022 08:00 AM

If there’s a problem in an OpenShift environment and you see symptoms in a particular application, it’s most effective to investigate that application’s pods directly. But what do you do if you don’t know what the problem is? This post covers some things that, in our experience, are generally useful to investigate in an OpenShift environment. It focuses on the browser-based web console; a future post will cover similar techniques using the oc command line tool.

Step 1: Check the overall status of the cluster

We’ll be using the “Administrator” view throughout the following examples, so confirm that it’s selected at the top left of the console.

To check the overall status of the cluster, its control plane, and the installed operators, click Home > Overview. If any of these three items does not show a green check mark, click on it for more details and investigate any potential issues.
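The operator part of this check can also be scripted once the data is exported. The following is a minimal sketch, assuming the operator status has been saved as JSON (for example from `oc get clusteroperators -o json`); the sample data and operator names are illustrative, and the function name is hypothetical. It flags any operator whose Available, Progressing, or Degraded condition differs from the healthy value.

```python
import json

# Hypothetical sample shaped like `oc get clusteroperators -o json` output,
# trimmed to the fields used here; operator names are illustrative.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "authentication"},
   "status": {"conditions": [
     {"type": "Available", "status": "True"},
     {"type": "Progressing", "status": "False"},
     {"type": "Degraded", "status": "False"}]}},
  {"metadata": {"name": "ingress"},
   "status": {"conditions": [
     {"type": "Available", "status": "True"},
     {"type": "Progressing", "status": "False"},
     {"type": "Degraded", "status": "True"}]}}
]}
""")

# Condition values we treat as healthy in this sketch.
EXPECTED = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def unhealthy_operators(payload):
    """Return {operator name: [condition types that differ from EXPECTED]}."""
    problems = {}
    for item in payload.get("items", []):
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        # A missing condition is also reported, since it cannot be confirmed healthy.
        bad = [t for t, want in EXPECTED.items() if conditions.get(t) != want]
        if bad:
            problems[item["metadata"]["name"]] = bad
    return problems


if __name__ == "__main__":
    print(unhealthy_operators(SAMPLE))
```

An operator that is only Progressing is not necessarily broken (it may be mid-upgrade), so treat that result as something to look at rather than a failure.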

Step 2: Review node resource usage

On the same Overview page, go to the Cluster Utilization section and review CPU, memory, filesystem, and network utilization.

This view is deceptively simple because there’s a lot of investigative power in the subtle hyperlinked utilization number. If you see a potential utilization issue, click on that utilization number (such as a CPU value of 6.67), and from there you can group utilization in different ways. For example, the default “By Project” view shows utilization by namespace, which may help isolate a problematic application or business unit.

However, another useful grouping is “By Node” to show particular worker nodes that are overutilized, and, similarly, “By Pod” to show particular applications using significant resources.
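To make the idea behind these groupings concrete, here is a minimal sketch of the same roll-up done by hand. The per-pod samples, names, and numbers are all made up for illustration; the point is only that the same data answers different questions depending on whether it is summed by project, node, or pod.

```python
from collections import defaultdict

# Hypothetical per-pod CPU samples (in cores); all names and numbers are made up.
SAMPLES = [
    {"namespace": "shop", "node": "worker-1", "pod": "cart-1", "cpu_cores": 1.5},
    {"namespace": "shop", "node": "worker-2", "pod": "cart-2", "cpu_cores": 0.5},
    {"namespace": "billing", "node": "worker-1", "pod": "invoice-1", "cpu_cores": 3.0},
]


def group_usage(samples, key):
    """Sum cpu_cores by the given key and return (name, total) pairs, largest first."""
    totals = defaultdict(float)
    for sample in samples:
        totals[sample[key]] += sample["cpu_cores"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for key in ("namespace", "node", "pod"):
        print(key, group_usage(SAMPLES, key))
```

In this made-up data, grouping by namespace points at the billing project, while grouping by node shows that worker-1 carries most of the load.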

Step 3: Review critical and warning alerts

Click Observe > Alerting (or Monitoring > Alerting on older versions), then click Filter and check “Warning” and “Critical”. Review any such recent alerts to investigate any potential issues reported within the cluster.
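The same severity filter can be applied to exported alert data. This is a sketch under assumptions: the payload below is shaped like a Prometheus `/api/v1/alerts` response, the alert names are invented, and how you obtain that JSON from your cluster is not covered here.

```python
# Hypothetical payload shaped like a Prometheus /api/v1/alerts response;
# the alert names are illustrative only.
ALERTS = {"status": "success", "data": {"alerts": [
    {"labels": {"alertname": "PodRestartingOften", "severity": "warning"}, "state": "firing"},
    {"labels": {"alertname": "NodeDiskAlmostFull", "severity": "critical"}, "state": "firing"},
    {"labels": {"alertname": "InfoOnlyAlert", "severity": "info"}, "state": "firing"},
    {"labels": {"alertname": "SlowApiResponses", "severity": "warning"}, "state": "pending"},
]}}

# Severities to keep, with a sort rank so that critical alerts come first.
SEVERITY_RANK = {"critical": 0, "warning": 1}


def important_alerts(payload, state="firing"):
    """Return (severity, alertname) pairs for warning/critical alerts in the given state."""
    picked = [
        (alert["labels"]["severity"], alert["labels"].get("alertname", "unknown"))
        for alert in payload["data"]["alerts"]
        if alert.get("state") == state
        and alert["labels"].get("severity") in SEVERITY_RANK
    ]
    return sorted(picked, key=lambda pair: SEVERITY_RANK[pair[0]])


if __name__ == "__main__":
    for severity, name in important_alerts(ALERTS):
        print(severity, name)
```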

Step 4: Review recent warning and error events

Click Home > Events, and change the “All types” drop down to “Warning”. This filter includes both warnings and errors. Review any recent warnings and errors for potential issues.
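Kubernetes events carry a type of either Normal or Warning, which is why a single “Warning” filter covers both warnings and errors. As a minimal sketch, assuming events have been exported as JSON (for example from `oc get events -o json`), the following keeps only Warning events and lists the most recent first. The sample events and the function name are hypothetical.

```python
# Hypothetical sample shaped like `oc get events -o json` output, trimmed to
# the fields used here; all names and messages are made up.
EVENTS = {"items": [
    {"type": "Normal", "reason": "Pulled",
     "message": "Container image already present on machine",
     "lastTimestamp": "2022-08-16T08:05:00Z",
     "involvedObject": {"kind": "Pod", "namespace": "shop", "name": "cart-1"}},
    {"type": "Warning", "reason": "FailedScheduling",
     "message": "0/3 nodes are available",
     "lastTimestamp": "2022-08-16T07:55:00Z",
     "involvedObject": {"kind": "Pod", "namespace": "shop", "name": "cart-2"}},
    {"type": "Warning", "reason": "BackOff",
     "message": "Back-off restarting failed container",
     "lastTimestamp": "2022-08-16T08:10:00Z",
     "involvedObject": {"kind": "Pod", "namespace": "billing", "name": "invoice-1"}},
]}


def recent_warnings(payload, limit=20):
    """Return one summary line per Warning event, most recent first."""
    warnings = [e for e in payload.get("items", []) if e.get("type") == "Warning"]
    # Same-format UTC timestamps sort correctly as strings; with reverse=True,
    # events that have no lastTimestamp end up at the bottom of the list.
    warnings.sort(key=lambda e: e.get("lastTimestamp") or "", reverse=True)
    lines = []
    for event in warnings[:limit]:
        obj = event.get("involvedObject", {})
        lines.append("{} {}/{} {}: {}".format(
            event.get("lastTimestamp") or "unknown-time",
            obj.get("namespace", "-"),
            obj.get("name", "-"),
            event.get("reason", "-"),
            event.get("message", ""),
        ))
    return lines


if __name__ == "__main__":
    for line in recent_warnings(EVENTS):
        print(line)
```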

Step 5: Utilization deep dive

If the above steps haven’t identified an issue and you suspect a resource utilization issue, there are various ways to do a deeper dive into utilization. We suggest starting with Observe > Dashboards (or Monitoring > Dashboards in older versions) and changing the Dashboard to Node Exporter / USE Method / Cluster. USE stands for Utilization, Saturation, and Errors, and the USE Method is a common technique for isolating the cause of resource issues.
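As a rough illustration of the USE way of thinking, the sketch below asks the three questions of each node in turn: how busy is the resource, is work queuing up behind it, and is it reporting errors? The field names, thresholds, and numbers are assumptions chosen for this example and are not values taken from the OpenShift dashboard.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """Hypothetical per-node measurements; all field names are illustrative."""
    name: str
    cpu_utilization: float   # fraction of CPU time busy, 0.0 to 1.0
    load_per_cpu: float      # load average divided by CPU count (saturation)
    net_errors_per_s: float  # network receive/transmit errors per second


def use_findings(node, util_limit=0.8, sat_limit=1.0):
    """Return which of the three USE questions this node fails.

    The thresholds are assumptions for this sketch, not recommended values.
    """
    findings = []
    if node.cpu_utilization > util_limit:
        findings.append("utilization")
    if node.load_per_cpu > sat_limit:
        findings.append("saturation")
    if node.net_errors_per_s > 0:
        findings.append("errors")
    return findings


if __name__ == "__main__":
    nodes = [
        NodeSample("worker-1", 0.95, 2.5, 0.0),
        NodeSample("worker-2", 0.30, 0.4, 1.2),
        NodeSample("worker-3", 0.40, 0.5, 0.0),
    ]
    for node in nodes:
        print(node.name, use_findings(node) or "ok")
```

A node can be highly utilized without being saturated, and vice versa, which is why the method looks at each dimension separately.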

Step 6: Review application monitoring

In addition to products such as Instana OpenShift monitoring, OpenShift has some built-in monitoring capabilities for application workloads. These are enabled by configuring enableUserWorkload, after which application metrics integrate with the metrics and dashboards described above.

In addition, consider enabling cluster logging for the ability to do cluster-wide log searches and access Kibana from the browser through the App launcher.
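For reference, here is a sketch of what that setting looks like as a manifest. It rests on our understanding of the OpenShift monitoring documentation, namely that enableUserWorkload is set in the `config.yaml` key of a ConfigMap named `cluster-monitoring-config` in the `openshift-monitoring` namespace; verify this against the documentation for your OpenShift version before using it. The function name is hypothetical.

```python
import json


def user_workload_monitoring_configmap():
    """Build a ConfigMap manifest that turns on user workload monitoring.

    Assumption: the ConfigMap name, namespace, and config.yaml key follow the
    OpenShift monitoring documentation; check them for your version.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-monitoring-config",
            "namespace": "openshift-monitoring",
        },
        # The value is itself a small YAML document stored as a string.
        "data": {"config.yaml": "enableUserWorkload: true\n"},
    }


if __name__ == "__main__":
    # JSON manifests are accepted wherever YAML manifests are.
    print(json.dumps(user_workload_monitoring_configmap(), indent=2))
```

If a cluster-monitoring-config ConfigMap already exists in your cluster, merge the setting into its existing config.yaml content rather than replacing it, since this sketch contains only the one key.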

For WebSphere workloads, consider using some of IBM’s pre-built dashboards for such applications. For maximum value, add the mpMetrics-4.0 feature to WebSphere Liberty application configurations (<feature>mpMetrics-4.0</feature>), and install metrics.ear for WebSphere Application Server traditional Base at 8.5.5.20 or later, or at 9.0.5.7 or later.

You may also consider installing a Grafana instance and using some of IBM’s pre-built dashboards for WebSphere Liberty and WebSphere Application Server traditional Base.

Finally, consider the WebSphere Automation product for monitoring security vulnerabilities and potential issues such as memory leaks.

Conclusion

OpenShift has many built-in capabilities for investigating problems and instability. There are many different types of data and different ways of looking at it, and each has value in different circumstances. It’s useful to take some time to explore the different capabilities and become comfortable with how to interpret them so that when you have a problem that you’re not sure about, you won’t miss important symptoms that OpenShift exposes to you.
