Golden Signals for Cloud Service Observability

By NEIL DELIMA posted Thu March 21, 2024 12:35 PM

Authors:

@Moses Galvan - Cloud Adoption Leader - IBM Consulting

@NEIL DELIMA - Cloud Solution Architect - IBM Cloud Center of Excellence

#Cloud #ibmcloud  #hybridcloud #operational-resilience

Introduction

While Golden Signals originate on the IT and operational side of an organization, their relevance and impact align directly with business imperatives such as availability, user experience and resilience, emphasizing how technical performance drives business outcomes.

We start by providing context and defining the term “Golden Signals”. Golden Signals in this context are relevant to Cloud Service Provider services that support user-facing digitized solutions.  They are the minimal set of metrics that a user-facing solution should focus on measuring to understand the health of the solution.

The real-time technical performance and health of a solution may be viewed through different lenses and objectives. Understanding real-time behavior is a critical practice in cloud computing that provides numerous benefits, making it invaluable for organizations using cloud services to support business functions. This document will further describe the “why”, “who”, “how” and “when” aspects of leveraging golden signals.

The Four Pillars

The four Golden Signal pillars [1][2] are Latency, Traffic, Errors and Saturation. Proposed by Google under its Site Reliability Engineering (SRE) practice, they have emerged as a widely adopted standard for classifying the minimal set of metrics that an application should focus on measuring to quickly evaluate its health.

  1. Latency is the time that it takes to service a request. It is often measured as response time, which should be measured and baselined both between services and end-to-end across services. Poor latency can have functional implications for an application as well as degrade the overall user experience. When monitoring for latency issues, it is advisable to use histograms instead of averages and to monitor against threshold values. It is also important to take the type of response (successful, not found, bad request, etc.) into consideration when monitoring latency.
  2. Traffic is the amount of activity, requests or demand on the application. How traffic is measured depends on the type of application. For example, traffic could be the number of requests to an application, the number of database connections, or the bandwidth consumed.
  3. Errors are the rate of requests that fail. Failures can be detected via response codes such as HTTP 500 errors, as well as requests that return a success response code but with incorrect response content. Errors can have functional implications for an application and can also impact its latency and saturation.
  4. Saturation is a measure of an application's resource utilization against available capacity; it indicates how loaded the system's resources are. Performance starts to degrade as the system crosses a certain saturation threshold percentage. Examples of saturation include CPU, memory, and disk utilization. Monitoring these metrics and observing patterns can provide predictive insight into future system saturation.
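As a minimal sketch of how the four pillars above can be derived from one observation window of request data, consider the following. The field names, window shape and CPU input are illustrative assumptions, not taken from any specific monitoring tool; note the use of percentiles rather than a bare average for latency, as recommended above.

```python
# Hypothetical sketch only: field names and the shape of the observation
# window are illustrative, not taken from any specific monitoring tool.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    duration_ms: float  # time taken to service this request
    status: int         # HTTP-style response code

def golden_signals(window, window_seconds, cpu_used_pct):
    """Summarize one observation window into the four golden signals."""
    ok = [r.duration_ms for r in window if r.status < 500]
    cuts = quantiles(ok, n=100)  # 99 percentile cut points (histogram view)
    return {
        # Latency: percentiles over successful responses, not a bare average
        "latency_p50_ms": cuts[49],
        "latency_p99_ms": cuts[98],
        # Traffic: demand on the service, here as requests per second
        "traffic_rps": len(window) / window_seconds,
        # Errors: fraction of requests that failed (5xx responses)
        "error_rate": sum(r.status >= 500 for r in window) / len(window),
        # Saturation: utilization against available capacity
        "saturation_cpu": cpu_used_pct / 100.0,
    }
```

Keeping error latencies out of the success percentiles reflects the advice above to consider the type of response when measuring latency.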

Two related methods are the RED (Rate, Errors, Duration) method by Tom Wilkie (CTO, Grafana Labs), focused on microservice architectures, and the USE (Utilization, Saturation, Errors) method by Brendan Gregg (Computer Engineer, Intel), focused on infrastructure [3]. The concepts and signals defined in these methods are similar in structure to the Golden Signals. But Golden Signals are not the only metrics that an operations or product team should monitor. Key Metrics are additional metrics that help measure system KPIs, such as query counts, queue lengths, error types and active user counts. It is essential to regularly review, and potentially update, what is considered a golden signal or key metric as applications evolve.

Why Are Golden Signals Important? 

The drivers and objectives for golden signals are multifaceted, encompassing performance optimization, cost management, availability, security, scalability, and more. Monitoring them is an essential practice for organizations looking to leverage the full potential of cloud computing while maintaining reliability and security. Golden Signals can provide assurance to business stakeholders and drive operational excellence, providing key capabilities such as:

  • Performance Optimization: By collecting and analyzing performance data related to cloud resources and applications, organizations can identify bottlenecks, optimize resource allocation, and ensure that their applications are running efficiently.
  • Cost Management: By tracking resource utilization and identifying opportunities to scale resources up or down as needed, organizations can proactively preserve user experience, prevent unexpected cost spikes due to misconfiguration, and avoid resource inefficiencies. Also, by monitoring golden signals, an organization may find that a service is overbuilt or oversubscribed for its intended purpose and that its design can be reoptimized for efficiency.
  • Availability and Reliability: By detecting and responding to issues such as downtime, service outages, or performance degradation in real time, organizations can ensure the availability and reliability of their cloud-based applications and thus minimize the impact to the business and to their users.
  • Security: Continuous monitoring is crucial for identifying and responding to security threats and vulnerabilities. It enables an organization to detect suspicious activities, unauthorized access, and potential data breaches, helping to protect sensitive information and remain compliant.
  • Scalability: Cloud resources are designed to be scalable, and monitoring helps an organization determine when and how to scale infrastructure to meet changing demands. It ensures that additional resources can be allocated when traffic spikes and released during quieter periods.
  • Resource Utilization: By gaining insight into resource utilization, an organization can optimize the allocation of resources, ensuring that cloud resources are not over or under provisioned. This leads to cost savings and better resource utilization.
  • Proactive Issue Resolution: Pattern identification and historical data provide an opportunity to proactively identify and address issues before they impact users thus reducing downtime and preventing service disruptions. 
  • Compliance: Many industries have specific regulatory requirements related to data security and compliance. Monitoring helps an organization maintain compliance by ensuring that their cloud environment adheres to relevant standards and policies.
  • Historical Data Analysis: Historical data can be analyzed for trends and capacity planning enabling an organization to make informed decisions about proactive issue avoidance, proactive issue resolution, resource provisioning and application improvements.
  • User Experience: Ultimately, cloud monitoring contributes to a better user experience. When an organization’s services are reliable, performant, and secure, their users are more satisfied, which can lead to increased customer loyalty, growth and retention.

Alert Response & Proactive Troubleshooting 

A proactive operational posture is a key objective, supported by initiatives such as alert response and proactive troubleshooting. Golden signals provide insight into anomalies or indications that the health of a service may be impacted if proactive corrective action is not taken.

Alert response may be defined as corrective action that can be automated based on recurring patterns without impacting production workloads. Alert responses may be triggered by indicators such as golden signal threshold breaches or events. These use cases may also leverage modernized compute options such as serverless, together with analytics, to take advantage of efficient capacity options and enable AI and machine learning.
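For illustration only, an alert response gated on a recurring pattern might be sketched as below. The threshold, repeat count, window length and the remediation hook are all hypothetical; a real remediation would be something non-disruptive such as scaling out or recycling a worker.

```python
# Hypothetical sketch: a threshold breach that recurs N times within a
# rolling window triggers an automated, non-disruptive remediation.
from collections import deque

class AlertResponder:
    def __init__(self, threshold, repeats=3, window_s=300, action=None):
        self.threshold = threshold      # golden-signal alerting threshold
        self.repeats = repeats          # breaches required before acting
        self.window_s = window_s        # rolling window in seconds
        self.action = action or (lambda: None)  # remediation hook
        self.breaches = deque()         # timestamps of recent breaches

    def observe(self, value, now):
        """Feed one metric sample; returns True if remediation fired."""
        if value > self.threshold:
            self.breaches.append(now)
        # drop breaches that have aged out of the window
        while self.breaches and now - self.breaches[0] > self.window_s:
            self.breaches.popleft()
        if len(self.breaches) >= self.repeats:
            self.breaches.clear()
            self.action()               # e.g. scale out, recycle a worker
            return True
        return False
```

The recurrence gate reflects the text's emphasis on acting on recurring patterns rather than on a single spike.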

Proactive troubleshooting may be defined as corrective action that is executed based on readily available operational information such as system logs, historical behavior and golden signal indicators. Aggregation and insight across these data sets may be aligned to known remedies, which can then be invoked. Over time, automation may be established to turn these remedies into alert responses.

Golden Signals for Monitoring Cloud Applications and Services

Cloud service providers produce metrics for several infrastructure and platform services. These metrics are often available in, or routed to, monitoring tools available on the cloud platform. For example, on IBM Cloud, metrics produced by various services are routed to and available via the IBM Cloud Monitoring service. These metrics are a valuable resource for the operations teams of service consumers: they help proactively detect and respond to issues before they escalate, monitor and analyze traffic and usage trends, and analyze and debug the root cause of issues.

The onus of monitoring these metrics, creating dashboards and setting alerts is on the consumer of the cloud service. The metrics that might be of interest, and the alerting thresholds that need to be set on them, are highly dependent on the type of workload and the cloud services consumed. When using cloud infrastructure services such as virtual private cloud services, a greater responsibility for managing and monitoring the availability of the provisioned resources falls on the service consumer than when using a managed platform service such as a managed database service.

With infrastructure services, such as virtual machines, there is an opportunity to obtain better monitoring data using a monitoring agent that is native to and runs on the infrastructure resource and is managed by the end user. One example is the Sysdig agent [5], part of the Sysdig monitoring system, which runs in containerized environments like Kubernetes and sends metrics back to the Sysdig backend for analysis and alerting. This also begins to recognize the notion of different capabilities across the layers of a workload stack, which may also align to differing roles and responsibilities, elaborated further in this paper.

Getting Started with Monitoring Cloud Services

The first step to getting started with setting up operational monitoring is to understand the roles and responsibilities and distinction between monitoring the application, managed cloud service instances and the service itself. The next step is to determine the Key Performance Indicators (KPIs) of the application and their dependencies on the cloud service instance. One approach to monitoring cloud applications running on public clouds such as IBM Cloud is to identify and monitor the key metrics that are indicators of health of the application and underlying platform and infrastructure. This is typically done using monitoring tools, creating metric dashboards, and setting up actionable alerts related to the application and service KPIs identified.  

A cloud platform and/or infrastructure service often produces a number of metrics for consumption. To minimize the overhead of excessive monitoring and alerting, another approach is to identify metrics that can be classified as Golden signals and Key metrics. Further, these can be grouped by metrics produced by infrastructure services, platform services, software services when applicable, and the application itself. Within each group, for each service, it is important to identify the key metrics, produced by the cloud service or collected by an agent, that are indicators of latency, traffic, saturation, and errors. These metrics then become the golden metrics for that service or application. Since infrastructure services sit at the bottom of an application stack, a monitoring alert from an infrastructure service is likely to bubble up the stack and impact the overall health of an application. Golden metrics are often key performance indicators of a workload or application and its dependent cloud services.
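One way to organize such a grouping is a simple layered registry, sketched below. Every service and metric name here is an illustrative placeholder, not a real cloud service metric; the real names depend on the services actually consumed.

```python
# Hypothetical golden-metric registry, grouped by stack layer and then by
# service. All names are illustrative placeholders.
GOLDEN_METRICS = {
    "infrastructure": {
        "virtual_server": {
            "latency": [],
            "traffic": ["network_in_bytes", "network_out_bytes"],
            "errors": [],
            "saturation": ["cpu_usage_percentage", "memory_free_kib"],
        },
    },
    "platform": {
        "managed_database": {
            "latency": ["avg_query_time_ms"],
            "traffic": ["active_connections"],
            "errors": ["failed_connections"],
            "saturation": ["disk_usage_percentage"],
        },
    },
    "application": {
        "storefront_api": {
            "latency": ["request_duration_p99_ms"],
            "traffic": ["requests_per_second"],
            "errors": ["http_5xx_rate"],
            "saturation": [],
        },
    },
}

def metrics_for_signal(signal):
    """List every metric across the stack that indicates one golden signal."""
    return [metric
            for layer in GOLDEN_METRICS.values()
            for service in layer.values()
            for metric in service[signal]]
```

Grouping by layer makes it easy to reason about alerts bubbling up the stack: an infrastructure saturation alert can be checked against the platform and application metrics above it.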

Golden Signals for IBM Cloud Virtual Server Instance, an Example

The IBM Cloud Virtual Server Instance [6] service for VPC produces several metrics, such as:

  • ibm_is_instance_average_cpu_usage_percentage
  • ibm_is_instance_cpus
  • ibm_is_instance_memory_free_kib
  • ibm_is_instance_running_state

From these metrics, we may identify those that can be classified as indicators of Saturation, Latency, Traffic and Errors. In the case of VPC VSIs, the metric ibm_is_instance_average_cpu_usage_percentage, the average usage percentage across all of the VSI's CPUs, is an indicator of resource utilization. This metric, produced by the IBM Cloud VPC service and available in IBM Cloud Monitoring, is collected at the hypervisor level and should be identified as a Golden signal for saturation.

Once the golden metrics have been identified, it is advisable to create dashboards that contain these metrics. Alerting thresholds can be set to notify an SRE or operations team when a metric value exceeds a certain workload-dependent threshold. Alerting threshold values can be determined by observing historic metric data and past incidents, with initial defaults based on organizational, operational and industry best practices. These alerting thresholds are workload dependent and need to be monitored and fine-tuned over time. Alerts should be associated with standardized, actionable notifications that are delivered to the operations team of the application utilizing the service that produced the notification, so that team can determine the best course of action. It is key here to ensure metrics follow a consistent context and lexicon; otherwise they are meaningless or hard to understand. Knowledge and expertise related to alerts and remedial actions can be acquired over time. Further, opportunities to create or utilize operational practices, runbooks and artificial intelligence based operational tools should be explored to address potential incidents identified by monitoring alerts.
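As a hedged sketch of seeding an initial threshold from historic metric data, one might start at a high percentile of observed values, floored at an organizational default. The 99th-percentile seed and the 80% floor are assumptions for illustration, not IBM Cloud guidance; the result would still be fine-tuned over time as the text describes.

```python
# Sketch under assumptions: the 99th-percentile seed and the default floor
# are illustrative starting points, to be fine-tuned against real incidents.
from statistics import quantiles

def initial_threshold(history, default=80.0):
    """Seed an alert threshold (percent) for a saturation-style metric."""
    if len(history) < 2:
        return default              # not enough history: use the default
    p99 = quantiles(history, n=100)[98]  # 99th percentile of observed values
    return max(p99, default)        # never alert below the agreed default
```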

Supplementing Cloud Service Metrics

When cloud service platform metrics aren’t sufficient indicators of the health of the cloud service, monitoring agents can be deployed via external pre-built integrations, or custom integrations can be developed. These agents typically exploit metrics generated by the underlying software, or use service APIs or health endpoints, to generate metrics and forward them to a monitoring backend. Examples of monitoring agent integrations for Sysdig include NGINX, Kafka and PostgreSQL.

An end-to-end Monitoring Perspective

While it is important to pay careful attention to golden signals at the infrastructure service level and for individual services in isolation, there is significant value in monitoring golden signals from an end-to-end application perspective. One way to do this is by developing and deploying a synthetic probe: a simple application that simulates requests to the application and monitors response metrics.
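A minimal synthetic probe might look like the sketch below. The target URL is an assumption, and shipping the resulting measurements to a monitoring backend is deliberately left out.

```python
# Minimal synthetic-probe sketch: issue one request to an endpoint and
# record its latency and outcome. Metric delivery is out of scope here.
import time
import urllib.request

def probe(url, timeout=5.0):
    """Issue one synthetic request; return (latency_s, status, errored)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return time.monotonic() - start, resp.status, False
    except Exception:
        # DNS failures, refused connections and timeouts all count as errors
        return time.monotonic() - start, None, True
```

Run on a schedule (for example every minute), the tuples it returns feed directly into end-to-end latency and error golden signals.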

Leveraging Golden Signals in Conversations with Cloud Stakeholders

Golden signals identify a sub-set of metrics that can be used to proactively monitor the availability of an application and help ensure that contractual obligations defined by service level agreements can be fulfilled. Latency, traffic, errors and saturation are non-functional measures that can have a significant impact on the usability of an application. If not detected and addressed in a timely manner, they could also impact the functionality of an application.

Several regional regulatory standards also require cloud solution providers to incorporate procedures and tools to ensure the reliability and resilience of their service and monitoring is a key aspect to achieve and demonstrate these objectives. Hence it becomes increasingly important for clients to monitor golden signals not just to provide a better user experience but as evidence to meet their contractual and regulatory obligations.

Validation of Golden Signals

High latency often results in a degraded user experience. Latency can manifest itself through slow web page load speeds, unresponsive app interfaces, lag in web animation and streaming video, etc. The impact to clients is abandoned web pages and shopping carts, negative reviews, switching to competitors, etc. Earning back trust after a negative experience takes a significant amount of effort. It is therefore important to continuously monitor latency against models such as the RAIL (Response, Animation, Idle and Load) [4] performance model originated by Google, continuously evaluate acceptable alerting thresholds, and demonstrate the responsiveness and reliability of an application or service to customers.

Monitoring Traffic metrics and correlating them with Latency metrics can be used to demonstrate how an application or service responds under peak load. It can also demonstrate that an application or service has been able to meet a response time objective under high traffic demand. Further, setting up alerts for when golden signal thresholds are exceeded is an indicator of the maturity of the operations team and its ability to respond to availability-related incidents.

In a similar way, monitoring Saturation metrics is an indication of the operations team’s ability to proactively respond to potential availability issues because of excessive load or resource consumption or demonstrate the optimal utilization of computing resources. Monitoring for Errors is an indication of the operations team's responsiveness to address and eliminate errors such as 404 resources not found and 500 internal server errors. 

Responsibility Considerations

It is important to acknowledge that different organizational stakeholders may have different cloud service lenses, objectives, roles and responsibilities. For example, a workload running in cloud and serving the business inevitably has three key layers to consider. One layer is the Cloud Service itself, which may be considered “infrastructure” and managed by the Cloud Service Provider. The next layer may be the application layer, which is deployed to consume either a single cloud service or a variety of them; for this layer, the customer (the consumer of the cloud services) may be responsible for management. The third layer focuses on operations, which may coalesce monitoring signals across the stack of an application (e.g. cloud services, infrastructure, application, etc.). Each of these layers may have teams aligned to monitoring according to their focus. These layers come together in the context of managing the overall workload and catering to business needs. Below is a depiction of how this may render into a RACI [7] matrix.

Conclusion

In this article we described the importance of monitoring cloud workloads and, in particular, aligning golden signals to business objectives. We revisited a commonly understood definition of Golden Signals and other key metrics and their importance. We outlined the key benefits of leveraging golden signals to provide insights on the cloud, such as assuring availability and reliability, utilization, scalability, and optimizing performance and cost. We highlighted how Golden Signals can be used for proactive issue identification and resolution before end users are affected. We then outlined the use of Golden Signals in the context of monitoring cloud applications and the underlying cloud platform and infrastructure services, with a specific example of identifying the golden metrics for a service on IBM Cloud and generic guidance on setting up alerts for each metric. We also described the value of Golden Signals and the need for continuous observation, adjustment and fine-tuning of metrics and alerts. This demonstrates how Golden Signals are a simple yet powerful tool for both cloud service providers and the cloud native applications leveraging their services, with a direct impact on an organization’s business. To realize these benefits, organizations are encouraged to proactively adopt a monitoring strategy that includes identifying and monitoring golden signals and defining procedures to respond to monitoring alerts. Further, they should look for opportunities to continuously learn and refine their golden signal monitoring and alerting strategy.

Future Outlook 

The topic of Golden Signals will continue to span complex and emerging use cases such as multi-cloud deployments, the emergence of AIOps [8] and the enablement of self-healing systems. As the industry evolves, the driving factors will remain the same: delightful user experiences, always-available business-critical systems and continuous operational efficiency improvement.

Additional Resources

[1] https://sre.google/sre-book/monitoring-distributed-systems/

[2] https://www.splunk.com/en_us/blog/learn/sre-metrics-four-golden-signals-of-monitoring.html

[3] https://sysdig.com/blog/golden-signals-kubernetes/

[4] https://web.dev/rail/

[5] https://docs.sysdig.com/en/docs/installation/sysdig-monitor/install-sysdig-agent/

[6] https://cloud.ibm.com/docs/vpc?topic=vpc-vpc-monitoring-metrics

[7] https://en.wikipedia.org/wiki/Responsibility_assignment_matrix

[8] https://www.ibm.com/aiops


Acknowledgements:

@DHRUV RAJPUT - Program Director, IBM Cloud Center of Excellence Infrastructure

@Boas Betzler -  IBM Distinguished Engineer, Master Inventor, Resilient Cloud Solutions Infrastructure

