Cloud Pak for AIOps (CP4AIOps) is often deployed at the heart of a client’s enterprise, monitoring its critical environments, so it is of utmost importance that CP4AIOps itself is always available. Availability encompasses many areas such as high availability (HA), multi-zone (MZ) deployments, disaster recovery (DR) and many other acronyms, but here we are focusing on self-monitoring.
At its most basic, administrators must have a view of the current health of a CP4AIOps environment, which includes not just the CP4AIOps application itself, but also the platform and storage it relies on. There are multiple ways to convey this information: cluster-local dashboards, built-in views inside the CP4AIOps UI, or third-party applications such as Instana. In fact, Gurpreet has authored a great blog on how to configure Instana to monitor CP4AIOps. For cases where Instana is not available to monitor CP4AIOps, we now have cluster-local dashboards using a combination of Prometheus and Grafana.
Prometheus and Grafana
Prometheus is a lightweight data scraper that can be used in tandem with data visualization tools like Grafana to create real-time dashboards that can help users monitor the stability and performance of their CP4AIOps environment.
OCP comes with a Prometheus-based monitoring stack out of the box to track metrics about the cluster. It is possible to extend this monitoring to ‘user workloads’ (i.e. anything that is not a core OCP service) by enabling user workload monitoring. This same capability is also being added to the Linux-based installation option for CP4AIOps in a future version.
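As a sketch, enabling user workload monitoring on OCP comes down to setting a single flag in the cluster monitoring ConfigMap (see the OCP documentation for the full procedure):

```yaml
# ConfigMap that turns on the user workload monitoring stack in OCP.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

Once applied, OCP starts a second Prometheus instance in the openshift-user-workload-monitoring namespace that scrapes user-defined workloads such as CP4AIOps.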
Whilst all metrics can be explored in the OCP console, its visualization types and dashboarding options are limited. This is where Grafana comes in, as it is a platform-neutral, industry-standard way to create operational dashboards.
By installing Grafana into OpenShift and configuring it to connect to Thanos, Grafana gains access to all metrics across all of the Prometheus instances (both cluster and user workload metrics).
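One way to wire this up, assuming the community Grafana operator (v1beta1 API) is in use, is a datasource resource pointing at the thanos-querier service; the instance labels and token below are illustrative placeholders, not values from the CP4AIOps documentation:

```yaml
# Illustrative GrafanaDatasource pointing Grafana at the cluster Thanos Querier.
# The instanceSelector labels and the bearer token are placeholders for your own setup.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: thanos-querier
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Thanos
    type: prometheus
    access: proxy
    url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
    jsonData:
      tlsSkipVerify: true
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: "Bearer <service-account-token>"
```

The bearer token comes from a service account with permission to read cluster metrics; the CP4AIOps documentation walks through creating it.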
Easy to follow instructions are available in the CP4AIOps documentation.
Which Metrics and Dashboards are included?
A large number of metrics are provided so administrators can create their own dashboards. With the dashboards that are provided as part of CP4AIOps, the focus was on providing easy-to-understand visualizations, each of which answers a specific question administrators have and provides an actionable response.
The first dashboard answers the following questions:

1. Is the OCP cluster on which CP4AIOps is deployed healthy?
This fundamental question is answered in the cluster health dashboard through the amount of CPU, memory and disk that is consumed/available, through node availability, and through the nodes’ key health metrics.
Problems here are typically addressed by adding worker nodes (i.e. adding resources), or by improving a node’s health.
2. Do I have enough storage for CP4AIOps?
One of the most common causes of service interruptions for CP4AIOps is running low on, or out of, storage. As such, the two key metrics shown in the cluster health dashboard are whether all the PVCs used by CP4AIOps are available, and how full each one is.
A PVC can be expanded when it starts to fill, preventing CP4AIOps from running out of storage.
3. Is the CP4AIOps application healthy?
The final question answered in the cluster health dashboard is whether the CP4AIOps application is healthy. This is done by looking at the state of its pods. The dashboard includes metrics on the number of healthy and unhealthy pods, and, for those that are unhealthy, what the cause is.
This allows the administrator to dive in and further troubleshoot the unhealthy pods.
4. What volume is being processed by my CP4AIOps application?
The question answered by the usage dashboards is one that comes up frequently, namely how many events, alerts, incidents, metrics, logs etc. are being consumed or created by CP4AIOps. This is often a key consideration when evaluating whether it has been deployed at the right scale. It is also the first question to answer when one is concerned about performance, as an undersized cluster is the biggest cause of performance issues.
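To give a flavour of what sits behind panels like these, the PromQL expressions look roughly like the following sketch; the `cp4aiops` namespace label is an assumption, so adjust it to your installation namespace:

```promql
# Fraction of each CP4AIOps PVC in use (storage panel)
kubelet_volume_stats_used_bytes{namespace="cp4aiops"}
  / kubelet_volume_stats_capacity_bytes{namespace="cp4aiops"}

# Count of CP4AIOps pods per phase (application health panel)
sum by (phase) (kube_pod_status_phase{namespace="cp4aiops"})
```

Expressions like these can be tried out directly in the OCP console’s metrics view before being turned into Grafana panels.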
Looking forward
The focus of this first iteration is on enabling self-monitoring for OCP, as that is where current deployments are. In a future release, support for Linux-based deployments will be added. In addition, more dashboards are in development, diving deeper into all aspects of CP4AIOps.
We see this as the first step in the journey and look forward to suggestions on other metrics to monitor (and please share dashboards with us that you have created).
A special thanks to Jean Schunck, Arturo Cabre, Matt Thornhill and Hamel Khakhria.