Monitoring Kubernetes clusters adds complexity to managing application performance, troubleshooting, alerting, and controlling cluster capacity. This blog post walks through the main use cases for monitoring Kubernetes. Typically, you will need to cover several scenarios, including:
Cluster resource usage
Node health and availability
Missing or failed pods
Resource usage compared with the requests and limits
Application performance and health
If you're using a managed Kubernetes service such as IBM Cloud Kubernetes Service (IKS) or Red Hat OpenShift Kubernetes Service (ROKS), you'll need to account for the shared-responsibility model when building your monitoring stack. One option is to use the IBM Cloud Monitoring service (powered by Sysdig) to monitor your Kubernetes workloads and fulfill your responsibilities when using IBM Cloud Kubernetes Service.
Also, in certain cases, you may want to analyze Kubernetes API server metrics, which are exposed by the API server itself. Later in this blog, we explain how you can configure Prometheus remote write to monitor the Kubernetes API server.
Kubernetes workload monitoring
Golden Signals are the metrics that provide insight into the actual health and performance of your applications. In the context of Kubernetes, they break down into four signals:
Latency: The time required by your system to respond to a request.
Errors: The rate at which your service generates errors. This metric is a useful indicator of specific issues, such as the number of server errors (HTTP 500) or not-found responses (HTTP 404).
Saturation: The consumed capacity of your system. This can be measured by resource usage, such as CPU, storage, or memory, as well as the number of users or requests your system can handle, often estimated through load testing.
Traffic/Connections: The amount of usage your service experiences within a given time frame, such as the number of requests to an API.
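As an illustration, the four Golden Signals map naturally onto PromQL queries. The metric names below (http_request_duration_seconds, http_requests_total) are assumptions based on common Prometheus client-library conventions, not names guaranteed to exist in your environment:

```promql
# Latency: 95th-percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Errors: fraction of requests returning HTTP 5xx
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Traffic: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)
```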
After installing the IBM Monitoring Sysdig Agent on your Kubernetes and/or OpenShift cluster, you will have access to various metrics in the "Service Golden Signals" dashboard. This dashboard allows you to view response times, request errors, the number of requests per service, and resource usage.
Capacity optimization and resource request/limits rightsizing
Out-of-the-Box Kubernetes Dashboards
IBM Cloud Monitoring comes equipped with a variety of pre-built Kubernetes dashboards that make it easy to monitor your Kubernetes environment. These dashboards are designed with well-structured panels that guide you through your Kubernetes and OpenShift clusters, allowing you to monitor:
Cluster availability and capacity
Cluster and workload status
Resource usage compared to requests and limits
Each dashboard provides guidance on how to interpret its panels. For example, the Capacity Optimizer dashboard suggests best practices and provides tips to help you configure your requests and limits for optimized capacity.
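For context, requests and limits are set per container in the pod spec. The sketch below shows where they live; the names and values are placeholders, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical pod name
spec:
  containers:
    - name: app
      image: example/app:1.0   # placeholder image
      resources:
        requests:              # what the scheduler reserves for the container
          cpu: 250m
          memory: 256Mi
        limits:                # ceiling enforced at runtime (throttling / OOM kill)
          cpu: 500m
          memory: 512Mi
```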
The scope of each out-of-the-box dashboard allows for easy comparison between different clusters, namespaces, pods, or containers, enabling you to ensure that your application is working properly across all regions. This granularity of scope also enables you to troubleshoot potential issues at the cluster or container level.
By utilizing the pre-built dashboards and Kubernetes Advisor functionality within IBM Cloud Monitoring, you can effectively manage your resource utilization and diagnose common problems such as CrashLoopBackOff errors, pod evictions, and misconfigured resource requests and limits. This enables you to identify containers that are experiencing issues with CPU, memory, or file system resources, leading to throttled or terminated containers.
IBM Cloud Monitoring also allows you to identify application-level issues such as latency or saturation and correlate events occurring in your infrastructure with your metrics. It goes beyond traditional troubleshooting dashboards by highlighting Kubernetes setup issues that require attention, prioritizing which issues to address first, and providing a list of high-impact items to address. Most importantly, it also gives you details on how to fix these advisories, saving engineers precious time that is usually spent on getting to the root cause of issues.
Alerting in Kubernetes
Setting up and managing alerts using Prometheus Alertmanager can be a daunting task. However, IBM Cloud Monitoring with Sysdig offers a flexible solution for creating alerts. A well-designed alerting strategy forms the foundation for ensuring system reliability and performance.
IBM Cloud Monitoring includes a set of curated alert templates called the Alerts Library. These alerts can serve as a foundation for setting up your alerting system. The recommended alerts for the workloads detected in your instance are displayed, and can be enabled with a single click by selecting the cluster, namespace, or workload where you want to apply them.
When creating alerts for Kubernetes infrastructure and applications running on your cluster, it's important to follow a few simple rules:
Alert on impact: Creating alerts for every available metric can create noise for your on-call team. Alerts should only be triggered when there is an impact on your service.
Create alerts that require action: Triggering an alert when there is nothing to be done can be frustrating.
Follow standard methodologies such as the Golden Signals we discussed earlier: This makes alerts and dashboards easier to follow and understand.
Several layers of your stack, from infrastructure to application, can directly impact your service and warrant alerts.
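To make the rules above concrete, here is a minimal Prometheus-style alerting rule sketch that follows them: it alerts on impact (an elevated error rate, one of the Golden Signals) rather than on a raw resource metric. The metric name, threshold, and runbook URL are assumptions for illustration:

```yaml
groups:
  - name: service-impact
    rules:
      - alert: HighErrorRate
        # Fire only when users are actually affected: >5% of requests are 5xx
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m              # avoid paging on short, self-healing spikes
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 10 minutes"
          runbook: "https://example.com/runbooks/high-error-rate"  # placeholder
```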
Alerting during maintenance
If you have a planned downtime or maintenance window, you may not want to receive notifications from your alerting process. In such cases, IBM Cloud Monitoring enables you to silence alerts for a specific scope and a predetermined duration. Even though the alerts are still triggered, no notifications will be sent. You can find all the details on how to configure this feature at this link.
Prometheus monitoring with IBM Cloud Monitoring
Prometheus metric collection with Sysdig agent
With Sysdig's built-in Prometheus server, you can easily scrape your endpoints just as you would with Prometheus. To ensure your application endpoints are scraped, it's best to configure Prometheus Native Discovery directly in the Helm Values.
By adding the prometheus.io/scrape: true annotation to your Pods, you mark them for scraping, and the Sysdig agent picks them up automatically using Prometheus Kubernetes service discovery. You can also set up your own Kubernetes service discovery or static configurations. Check out the provided link for more information on how to configure this functionality.
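As a sketch, the annotations on a Deployment's pod template might look like the following. The prometheus.io/port and prometheus.io/path annotations are the companion conventions commonly honored alongside prometheus.io/scrape; the workload name, image, and port are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                  # hypothetical workload
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        prometheus.io/scrape: "true"    # mark the pod for scraping
        prometheus.io/port: "8080"      # port where metrics are exposed
        prometheus.io/path: "/metrics"  # metrics endpoint path
    spec:
      containers:
        - name: app
          image: example/app:1.0        # placeholder image
          ports:
            - containerPort: 8080
```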
Once you have installed the Sysdig agent in your Kubernetes or OpenShift cluster, open Advisor > Troubleshooting, where your new cluster should appear.
Also, confirm everything is correctly configured by accessing the dashboard template "Sysdig Agent Health & Status".
Simplified Querying and Remote Write
IBM Cloud Monitoring automatically enriches your metrics with Kubernetes and application context, without the need to instrument additional labels in your environment. This reduces operational complexity and cost, and makes all your queries much simpler.
You can query your Prometheus time series using a simple form-based approach, or you can use the powerful Prometheus Query Language (PromQL) to explore your metrics and build dashboards and alerts.
If you want to integrate metrics from other sources, you can use the IBM Cloud Monitoring Managed Service for Prometheus: using the remote write protocol, you can centralize all your metric sources in your IBM Cloud Monitoring instance.
If you're managing multiple ROKS clusters, it can be difficult to get a unified view. However, by utilizing remote write to IBM Cloud Monitoring's Managed Service for Prometheus, you can benefit from pre-built dashboards and alerts for the OpenShift control plane, as well as a centralized metric database and alerting system.
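A Prometheus remote_write section pointing at an IBM Cloud Monitoring endpoint might look like the sketch below. The endpoint URL is a placeholder (the real hostname depends on your instance's region, so check your instance documentation), and the credentials come from your instance's API key:

```yaml
remote_write:
  - url: "https://<region>.monitoring.cloud.ibm.com/prometheus/remote/write"  # placeholder endpoint
    authorization:
      credentials: "<YOUR_API_KEY>"   # replace with your instance's API key
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "apiserver_.*"         # example: forward only API server metrics
        action: keep
```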
Application Monitoring with Custom Metrics
It's essential to gather metrics from your applications to ensure effective monitoring. To facilitate troubleshooting and performance analysis, expose a few metrics for each component. Many programming languages offer client libraries for instrumentation, enabling you to monitor the performance metrics of your code.
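To show what instrumentation boils down to, here is a minimal hand-rolled sketch that counts requests, accumulates latency, and renders them in the Prometheus text exposition format. The metric names are hypothetical; in practice you would use an official client library (such as prometheus_client for Python) rather than rendering the format yourself:

```python
import time
from collections import Counter

request_count = Counter()  # keyed by (path, status)
latency_sum = 0.0          # total seconds spent handling requests

def handle_request(path, status="200"):
    """Simulate handling a request while recording latency and a counter."""
    global latency_sum
    start = time.perf_counter()
    # ... application logic would run here ...
    latency_sum += time.perf_counter() - start
    request_count[(path, status)] += 1

def metrics():
    """Render collected metrics in the Prometheus text exposition format."""
    lines = ["# TYPE app_requests_total counter"]
    for (path, status), n in sorted(request_count.items()):
        lines.append(f'app_requests_total{{path="{path}",status="{status}"}} {n}')
    lines.append("# TYPE app_request_latency_seconds_sum counter")
    lines.append(f"app_request_latency_seconds_sum {latency_sum}")
    return "\n".join(lines)

handle_request("/api/items")
handle_request("/api/items")
print(metrics())
```

An HTTP handler serving this output on /metrics is all a scraper needs to start collecting these series.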
In other situations, you may need availability metrics for your own or third-party services. The Blackbox Exporter requires no code instrumentation in your application and focuses purely on availability checks. In this document and in this blog, you can find more information about this exporter and its integration with the Sysdig agent.
Service Monitoring with Prometheus Exporters
Some applications, like NGINX, can be configured to expose metrics natively. Alternatively, you can install an exporter that extracts metrics from the service. IBM Cloud Monitoring provides a resource catalog for monitoring applications that works much like Prometheus, but without requiring you to deploy and maintain a separate Prometheus server.
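As an illustration of the exporter pattern, the sketch below runs the community NGINX exporter as a sidecar next to an NGINX container. The image tag and the -nginx.scrape-uri flag follow the nginx/nginx-prometheus-exporter project's conventions, but verify them against the integration guide for your version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-exporter     # hypothetical pod name
spec:
  containers:
    - name: nginx
      image: nginx:1.25
    - name: exporter
      image: nginx/nginx-prometheus-exporter:1.1   # check the current tag
      args:
        # Point the exporter at nginx's stub_status endpoint
        # (stub_status must be enabled in the nginx config)
        - "-nginx.scrape-uri=http://localhost:8080/stub_status"
      ports:
        - containerPort: 9113   # the exporter's default metrics port
```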
In the Monitoring Integrations page inside IBM Cloud Monitoring, you can find all the detected services and the step-by-step installation process to monitor them. These resources include the exporter installation process, service configuration where needed, and multiple dashboards and alerts for monitoring each service.
With IBM Cloud Monitoring, services deployed in your environment that have available integrations are automatically detected, and you are provided with a guide on how to install and configure them. These integrations come with out-of-the-box dashboards and alerts, which can be easily configured with just one click.
Getting started with IBM Cloud Monitoring with Sysdig
So, what are you waiting for? Click this link to start using IBM Cloud Monitoring today!