Audience
Business Intelligence (BI) analysts, DevOps, Engineering and Operations, Leadership
Summary
In this blog post, we describe in some detail one of the reports offered by Turbonomic through the ThoughtSpot interface: the "Container cluster and Namespace Utilization" report. Its goal is to give interested parties a breakdown of how their Kubernetes and/or Red Hat OpenShift clusters are being utilized in terms of allocated resources. The report tracks two resources: CPU and memory.
There are two utilization metrics per resource. The first represents actual usage as a percentage of the available capacity in the cluster, while the second compares the resource requests (anticipated usage) with the available capacity.
The second part of the report provides a granular breakdown of the same two metrics by namespace.
Background
IT workloads run either on-premises or on cloud provider services. Recently there has been a growing effort to migrate applications to containerized workloads, which has led to the explosive popularity of container platform clusters. Containerized workloads provide several advantages, such as isolation, portability, and ease of deployment, while requiring a fraction of the resources of a full-fledged virtual machine. Over the past few years, Kubernetes has become the de facto container workload management system. It aggregates resources across multiple physical or virtual machines and presents them as a single cluster, which can then be used to schedule and manage all aspects of the container life cycle.
When workloads are distributed across multiple clusters and multiple providers, it becomes imperative to gauge their performance and efficiency in order to understand which ones are performing well and which are underperforming or over-allocated.
The report described in this blog provides clear insight into utilization versus capacity for clusters and namespaces. It is divided into two parts.
The first part shows a summary utilization per cluster. The second part provides a more detailed breakdown by namespace. The capacity of the cluster is the allocatable memory and CPU resources available.
Utilization is the aggregate of the memory and CPU resources used by the workloads running in the cluster plus the cluster overhead, which makes cluster utilization slightly higher than workload utilization.
The report also displays information about “requests”. Requests represent the budget, or guarantee, of memory and CPU resources that the DevOps team assumes the applications need. This value is taken directly from the resource requests in the workload definition created by DevOps and is aggregated by cluster and, in the second part of the report, by namespace. The request percentage is the ratio of the requests to the cluster capacity.
Requests are a critical piece of information because they cannot be over-allocated: a workload will not be scheduled in the cluster if the total requests would exceed the cluster capacity, even though the actual combined usage of all the workloads may be far less than the cluster capacity.
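The two percentages described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the report's implementation; the class name, field names, and sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterSample:
    """One aggregated reading for a cluster (hypothetical field names)."""
    capacity: float   # allocatable amount, e.g. CPU cores
    used: float       # actual usage, including cluster overhead
    requested: float  # sum of the workload resource requests


def utilization_pct(sample: ClusterSample) -> float:
    """Actual usage as a percentage of the allocatable capacity."""
    return 100.0 * sample.used / sample.capacity


def request_pct(sample: ClusterSample) -> float:
    """Requested (anticipated) usage as a percentage of the capacity."""
    return 100.0 * sample.requested / sample.capacity


# Made-up numbers: a 32-core cluster using 8 cores with 20 cores requested.
cpu = ClusterSample(capacity=32.0, used=8.0, requested=20.0)
print(utilization_pct(cpu))  # 25.0
print(request_pct(cpu))      # 62.5
```

In this made-up case the cluster is only a quarter utilized, yet more than 60% of its capacity is already reserved by requests.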
Insights Provided
Using this report, the DevOps team may quickly and easily gauge how their container platform clusters are measuring up to the workload requirements in terms of both actual utilization and request usage.
The data can be aggregated across varying time windows, which enables the DevOps team to pinpoint anomalies in resource utilization, that is, extremes of under- or over-utilization of CPU and memory resources.
Because the report also shows “requests”, which represent anticipated usage, it provides valuable insight into possibly unaccounted-for memory leaks or higher-than-anticipated CPU loads, so the DevOps team can take corrective measures if needed.
The report sections are user-customizable: the user may filter on different time windows and select specific clusters and namespaces. These customizations can be saved for future reference.
We also provide another report, which extends this one to show the cost associated with each namespace and individual workload running in public cloud container platform clusters.
Usage
The report has filters to select different time windows for the data aggregation and to select all or specific sets of clusters and namespaces. There are also a couple of display preference parameters: whether to represent CPU in cores or millicores, and memory in mebibytes (1024² bytes) or gibibytes (1024³ bytes).
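For readers who want to check the numbers by hand, the display units relate to each other as in the following Python sketch. The function names are hypothetical and are not part of the report; only the conversion factors (1000 millicores per core, 1024² bytes per mebibyte, 1024³ bytes per gibibyte) are standard.

```python
MIB = 1024 ** 2  # bytes in one mebibyte
GIB = 1024 ** 3  # bytes in one gibibyte


def cores_to_millicores(cores: float) -> float:
    """One core equals 1000 millicores."""
    return cores * 1000.0


def bytes_to_mib(n_bytes: float) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


def bytes_to_gib(n_bytes: float) -> float:
    """Convert a byte count to gibibytes."""
    return n_bytes / GIB


print(cores_to_millicores(0.25))  # 250.0
print(bytes_to_mib(3 * GIB))      # 3072.0
print(bytes_to_gib(1536 * MIB))   # 1.5
```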
The first table in the report shows the averages aggregated over the selected time window.
The request columns are gathered by aggregating the requests sections in the YAML files of the workloads; they are compared with the cluster capacity to give a request utilization percentage. The table also shows peak values for utilization and requests within the selected time window.
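To make the aggregation concrete, here is a minimal Python sketch that sums requests by namespace. It assumes the requests have already been read from the workload definitions into plain dictionaries; the namespaces and values are invented, the parser handles only a small subset of the Kubernetes quantity notation (millicores and the Ki/Mi/Gi suffixes), and none of this is meant to describe how Turbonomic collects the data.

```python
from collections import defaultdict

# Binary suffixes handled by this sketch; Kubernetes accepts more forms.
MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_cpu(quantity: str) -> float:
    """Return CPU cores from a quantity such as '250m' or '2'."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Return bytes from a quantity such as '512Mi' or '1Gi'."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


# Hypothetical workloads: (namespace, container requests as written in YAML).
workloads = [
    ("payments", {"cpu": "250m", "memory": "512Mi"}),
    ("payments", {"cpu": "1", "memory": "1Gi"}),
    ("search", {"cpu": "500m", "memory": "256Mi"}),
]

totals = defaultdict(lambda: {"cpu_cores": 0.0, "memory_bytes": 0})
for namespace, requests in workloads:
    totals[namespace]["cpu_cores"] += parse_cpu(requests["cpu"])
    totals[namespace]["memory_bytes"] += parse_memory(requests["memory"])

for namespace, total in sorted(totals.items()):
    mib = total["memory_bytes"] / 1024 ** 2
    print(f"{namespace}: {total['cpu_cores']} cores, {mib} MiB requested")
```

Dividing each namespace total by the cluster capacity would then give the request percentage for that namespace.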
The second table provides a breakdown of the same information by namespace. Note that the totals here are expected to be a bit less than the utilization values in the first table because the first table values include the cluster overhead in addition to the workloads themselves while the second table accounts for workloads only.
In the second screenshot, one cluster has a utilization of only about 5% of the available CPU resources, while the requests (anticipated need) are 41% of the cluster capacity. This is an example where the DevOps team may need to revisit its estimate of the anticipated resource requirements; a better estimate could reduce the resource requirements, the number of nodes, and the associated cost. In other words, the goal is to move towards higher utilization percentages and better-matched requests without hurting application performance.
The opposite case also warrants investigation, for example when the memory usage is more than double the requested memory. Drilling into the breakdown by namespace helps to pinpoint the workload using significantly more resources than budgeted.
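Both situations can be expressed as a simple comparison between usage and requests. The Python sketch below is one possible way to triage rows exported from the report; the function name and the factor-of-two threshold are illustrative assumptions, not rules the report itself applies.

```python
def classify(used_pct: float, requested_pct: float, ratio: float = 2.0) -> str:
    """Compare usage with requests; `ratio` is an arbitrary example threshold."""
    if requested_pct > ratio * used_pct:
        return "over-requested"   # far more reserved than actually used
    if used_pct > ratio * requested_pct:
        return "under-requested"  # usage far above what was budgeted
    return "balanced"


print(classify(used_pct=5.0, requested_pct=41.0))   # over-requested
print(classify(used_pct=9.0, requested_pct=4.0))    # under-requested
print(classify(used_pct=30.0, requested_pct=35.0))  # balanced
```

The first call mirrors the 5% versus 41% CPU example; the second mirrors the case where usage is more than double the requested amount.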
Conclusion
This report provides the team with continuous insight into containerized workload resource management and allocation at the cluster and namespace levels.
An available extension of this report, to be described in a future blog post, converts the utilization and request percentages into dollar amounts based on the monthly cost; it applies to public cloud-based clusters only.
A planned future extension will plot the capacity and usage over the user-selected time window, rather than averaging all the values within the window into a single value, so that spikes in usage become visible.