SRE best practices: Monitoring Synthetic PoP

By LI JING MU posted Thu September 21, 2023 01:24 AM

  

This is the second part of the series on monitoring Synthetic PoP by using the Instana agent. The previous article is "Monitoring Synthetic PoP by using Instana host agent".

In this part, I introduce best practices for monitoring Synthetic PoP in a production environment, including default and customized events, tuning guidance, and how to make the most of Instana sensors, such as the Kubernetes sensor, the Docker sensor, and the Synthetic PoP sensor, for troubleshooting. This article is a reference for site reliability engineering (SRE) teams that deploy or maintain the Synthetic PoP, and for customers who want to run a self-hosted Synthetic PoP in their production environment.

Monitoring Synthetic PoP by using the Instana agent

The Instana agent is easy to install with a one-liner installation. After you install the Instana agent in the same Kubernetes cluster as the Synthetic PoP, the Synthetic PoP is discovered automatically and the Synthetic PoP sensor is activated without any manual configuration. The one exception is Redis TLS: if it is enabled in the Synthetic PoP, additional configuration is required on the Instana agent side to support the HTTPS communication. For detailed instructions, refer to the Instana documentation.

You can check the health status of the Synthetic PoP that is monitored by the Synthetic PoP sensor, as well as the health status of your pods and Kubernetes cluster that are monitored by the related sensors. It is important to know whether your Synthetic PoP is suffering from resource contention or starvation because one of its playback engines carries too much test workload. If so, you can scale the playback engines vertically or horizontally. Otherwise, your Synthetic PoP might run into the following issues, which are illustrated in this article with the browser playback engine:

·       Script timeout: The test fails because the script does not finish within the timeout value, for example, 60 seconds. When a script timeout occurs, the error message Test Script Timeout is shown on the test result details dashboard:

       Error Message

Test Script Timeout

Stacktrace

Timeout after: 60 seconds

·       Test expiration: Script timeout and test expiration are mechanisms in the Synthetic PoP to release resources and ensure availability. If the number of concurrently running tests exceeds BROWSERSCRIPT_MAX_TASKPOOL_SIZE, the browser playback engine does not pick up new test payloads from the Redis queue at that time. Tests that wait in the Redis queue for too long expire and are dropped by the playback engine.

·       Pod restart: If the situation continues, then your pod might be OOMKilled with exit code 137 and then the pod is restarted.
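The interaction between the task pool limit and test expiration can be illustrated with a small simulation. This is only a sketch of the general mechanism, not the playback engine's implementation: `pool_size` stands in for BROWSERSCRIPT_MAX_TASKPOOL_SIZE, and `max_wait` is a hypothetical expiration time for payloads that wait in the queue.

```python
import heapq


def simulate(arrivals, durations, pool_size, max_wait):
    """Simulate a bounded task pool that is fed by a FIFO queue.

    arrivals  -- arrival times (seconds) of test payloads, in ascending order
    durations -- execution time (seconds) of each payload
    pool_size -- maximum concurrent executions (stand-in for
                 BROWSERSCRIPT_MAX_TASKPOOL_SIZE)
    max_wait  -- hypothetical queue time after which a payload expires
    Returns (executed, expired) as lists of payload indexes.
    """
    running = []  # min-heap with the finish times of occupied slots
    executed, expired = [], []
    for i, (arrival, duration) in enumerate(zip(arrivals, durations)):
        # Release the slots of executions that finished before this arrival.
        while running and running[0] <= arrival:
            heapq.heappop(running)
        if len(running) < pool_size:
            # A slot is free, so the payload starts immediately.
            heapq.heappush(running, arrival + duration)
            executed.append(i)
            continue
        # The pool is full: the payload waits for the earliest free slot.
        start = running[0]
        if start - arrival > max_wait:
            expired.append(i)  # waited too long in the queue, dropped
        else:
            heapq.heapreplace(running, start + duration)
            executed.append(i)
    return executed, expired


# Five 60-second tests arrive one second apart on a pool of two slots.
executed, expired = simulate([0, 1, 2, 3, 4], [60] * 5, pool_size=2, max_wait=30)
print("executed:", executed, "expired:", expired)
```

With these made-up numbers, only the first two payloads find a free slot. The others would have to wait almost a minute, which is longer than the assumed expiration time, so they are dropped.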

Note: Not every timeout is caused by resource contention.

One example is an explicit wait in browser testing. If the web element is not visible when the explicit wait time elapses, the test fails as expected. This usually points to an issue with the target website that you are monitoring. The error message looks as follows:

Error Message

Waiting for element to be located By(xpath, //span[@id="loginButton"])
Wait timed out after 10206ms.

Furthermore, if a user sets the test frequency to 1 minute but the actual test execution time is longer than that, Script Timeout errors follow. To ensure availability and stability, the browser playback engine drops the executions that would overlap with a running one, and logs the following message:

ERROR Test B4KKfUkIFm5fIPewBqpR is in execution. Abandon new execution to avoid overlapping. Please increase test frequency.
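The overlap rule can be reasoned about with simple arithmetic. The following sketch assumes a fixed execution time per run and a hypothetical function name; it only illustrates which scheduled runs would be abandoned, and is not how the playback engine is implemented.

```python
def abandoned_runs(frequency_s, duration_s, horizon_s):
    """Return the scheduled start times (seconds) that are abandoned
    because the previous run of the same test is still executing.

    frequency_s -- interval between scheduled runs
    duration_s  -- assumed constant execution time of one run
    horizon_s   -- length of the observed period
    """
    abandoned = []
    busy_until = 0
    for start in range(0, horizon_s, frequency_s):
        if start < busy_until:
            abandoned.append(start)  # previous run is still in execution
        else:
            busy_until = start + duration_s
    return abandoned


# A test that takes 90 seconds but is scheduled every 60 seconds.
print(abandoned_runs(frequency_s=60, duration_s=90, horizon_s=300))
```

In this example every second run is abandoned. Scheduling the same test every 2 minutes instead leaves no overlap.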

How can you tell whether your Synthetic PoP is suffering from resource contention and overload? Predicting it from the number of test scripts is not recommended, because different test scripts issue different numbers of requests. Quantitative analysis is necessary. In a performance test, 100 BrowserScript tests with only 7 requests and a total response size of 138 KB each could be scheduled at a 5-minute test frequency on one pod, whereas far fewer BrowserScript tests with 434 requests and a total response size of 4.09 MiB could be scheduled on one pod.

The best way to monitor the health of your Synthetic PoP at runtime is to use the Instana agent. The following sections show how to do it with default and customized events.
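To see why the script count alone says little, the performance test figures can be converted into a per-pod rate. The sketch below assumes that executions are spread evenly over time, which the performance test does not state, and the function name is made up for illustration.

```python
def load_per_minute(tests, frequency_min, requests_per_test, kb_per_test):
    """Average load that a set of identical tests puts on one pod."""
    executions = tests / frequency_min
    return {
        "executions": executions,
        "requests": executions * requests_per_test,
        "response_kb": executions * kb_per_test,
    }


# The light script from the performance test: 100 tests, 5-minute frequency.
light = load_per_minute(tests=100, frequency_min=5,
                        requests_per_test=7, kb_per_test=138)
print(light)

# Requests issued by one execution of the heavy script versus the light one.
request_ratio = 434 / 7
print(request_ratio)
```

One execution of the heavy script issues 62 times as many requests as one execution of the light script, so the two cannot be counted as equal units of work.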


Monitoring Synthetic location health status

After you install the Instana agent, you can see your location, health and link on the Synthetic PoP monitoring dashboard. Here, one best practice is to deploy your Synthetic PoP and the Instana agent in namespaces that are separate from your business applications, for example, synthetic-pop and instana-agent. This separation lets you scope events and queries to the Synthetic PoP by namespace.

Click the link in the Location Name to navigate to the Synthetic PoP monitoring dashboard. The Health status shows the default events, such as playback engine overloaded. This event means that your tests are scheduled but cannot be picked up by the playback engine, so they wait in the queue and cannot be completed. If you see it, scale up your playback engine.

 

The following figure shows the Synthetic PoP monitoring dashboard when browser engine is overloaded:

Troubleshooting: If you cannot see location links and health status in the Synthetic monitoring dashboard, click Infrastructure > Comparison Table > Synthetic PoP to check whether your Synthetic PoP sensor is activated in the Infrastructure dashboard.

 

If you can see your Synthetic PoP entity in the Infrastructure dashboard but not the location link and health status, check the logs of the Instana agent. The issue might be caused by communication problems between the Synthetic PoP and the Instana agent. If Redis TLS is enabled on the Synthetic PoP, HTTPS communication must be supported, which requires more configuration steps on the Instana agent side.


Monitoring health status of your pod

Besides the Synthetic PoP sensor, you can take advantage of the Kubernetes and Docker sensors to monitor the health status of your pod and the resource utilization of your namespace, node, pod, and container. You can also use customized events.

The following is a sample customized event:

Use the built-in metrics of the Docker Container entity type and apply the event to the entities in the synthetic-pop namespace with the query entity.kubernetes.namespace:"synthetic-pop". Set two conditions over a time window of 10 seconds: the average total CPU usage is >= 380% (the maximum is 400% by default, with 4 CPUs in the CPU resource limit) and the average memory used percentage is >= 80%.
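The logic of this sample event can be written down as a short check. This is an illustration of the two conditions only, not Instana's event engine. The sample values are made up, and it is an assumption that the window contains ten one-second samples.

```python
from statistics import mean


def cpu_threshold(limit_cpus, pct_of_limit=95):
    """CPU threshold in percent, where 100% equals one full CPU.

    With a limit of 4 CPUs the maximum is 400%, so 95% of the limit is 380%.
    """
    return limit_cpus * pct_of_limit


def event_triggers(cpu_samples, mem_samples, cpu_min=380, mem_min=80):
    """True if both averages over the time window reach their thresholds."""
    return mean(cpu_samples) >= cpu_min and mean(mem_samples) >= mem_min


# Hypothetical samples from a 10-second window.
cpu = [385, 390, 395, 388, 392, 390, 391, 389, 393, 387]
mem = [82, 83, 85, 84, 86, 85, 84, 83, 85, 86]
print(cpu_threshold(4), event_triggers(cpu, mem))
```

Because both conditions must hold, a container that is CPU-bound but has plenty of free memory does not trigger this event. If you want to catch either condition on its own, create two separate events.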

 

When the event is triggered, click it to check the trend charts of the CPU and memory metrics. In this situation, you might encounter script timeouts because of resource contention. You can scale vertically by raising the CPU or memory limit, or scale horizontally by adding pods.

You can go to the Instana Kubernetes dashboard to see the resource utilization trend of your Docker container, pod, namespace, or cluster. The following screenshot shows the resource utilization status of the pod.

 

Once you decide to scale up your playback engines, pay attention to your capacity. Check the resource limits and the capacity of your worker node to avoid overcommitting node resources. If your Kubernetes node has enough memory, increasing the CPU or memory limit usually helps and improves performance. If you have a lot of tests and several worker nodes, you can scale your playback engines horizontally instead. Scaling the playback engines’ replicas to a number that exceeds the capacity of your VM or cluster makes things worse because of overcommitment and resource competition. Choose Stack > Kubernetes > Node/Cluster to see which option suits you better. If you find that you have much more capacity than the limit, raise your resource limit.
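A quick capacity calculation helps before you change the replica count. The sketch below compares the per-replica resource limits with the capacity of one node. All numbers are hypothetical, and the calculation ignores other workloads on the node, so treat the result as an upper bound.

```python
def max_replicas(node_cpu, node_mem_gib, limit_cpu, limit_mem_gib):
    """Largest replica count whose combined limits fit on one node.

    node_cpu, node_mem_gib   -- capacity of the worker node
    limit_cpu, limit_mem_gib -- resource limits of one playback engine pod
    """
    by_cpu = node_cpu // limit_cpu
    by_mem = node_mem_gib // limit_mem_gib
    return int(min(by_cpu, by_mem))


def is_overcommitted(replicas, node_cpu, node_mem_gib, limit_cpu, limit_mem_gib):
    """True if the combined limits of the replicas exceed the node capacity."""
    return replicas > max_replicas(node_cpu, node_mem_gib, limit_cpu, limit_mem_gib)


# Hypothetical node with 16 CPUs and 24 GiB; each pod is limited to 4 CPUs and 8 GiB.
print(max_replicas(16, 24, 4, 8))
print(is_overcommitted(4, 16, 24, 4, 8))
```

In this example memory is the limiting resource: four replicas fit by CPU, but only three fit by memory, so a fourth replica would overcommit the node.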

 

When you create events to monitor your pod, pay attention to the entity type. In an AWS cluster, the entity type of your container might be Containerd Container instead of Docker Container. To apply conditions to a specific playback engine, you can use a Dynamic Focus Query, for example, entity.kubernetes.namespace:"synthetic-prod" AND entity.kubernetes.pod.name:synthetic-pop-http-playback-engine-*

This Dynamic Focus Query applies the event to your HTTP playback engine in the synthetic-prod namespace.
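To check which pods such a query is meant to select, the filter can be approximated in a few lines. This sketch uses shell-style wildcard matching as a stand-in for the trailing `*`. It does not call Instana and makes no claim about how Instana evaluates the query, and the pod names are invented.

```python
from fnmatch import fnmatchcase


def in_scope(entity, namespace, pod_pattern):
    """Approximate the query: exact namespace AND wildcard pod name."""
    return (entity["namespace"] == namespace
            and fnmatchcase(entity["pod"], pod_pattern))


# Hypothetical pods; real pod names carry generated suffixes.
pods = [
    {"namespace": "synthetic-prod", "pod": "synthetic-pop-http-playback-engine-abc12"},
    {"namespace": "synthetic-prod", "pod": "synthetic-pop-browserscript-playback-engine-xyz34"},
    {"namespace": "shop", "pod": "synthetic-pop-http-playback-engine-abc12"},
]

selected = [p["pod"] for p in pods
            if in_scope(p, "synthetic-prod", "synthetic-pop-http-playback-engine-*")]
print(selected)
```

Only the HTTP playback engine pod in the synthetic-prod namespace is selected. The browser engine pod and the pod with the same name in another namespace are left out.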

 

Monitoring your Synthetic PoP entity

A customer once asked how to flag a failure when a Synthetic PoP is stopped or uninstalled. Usually, when a Synthetic PoP is uninstalled or stopped, the Synthetic PoP sensor is deactivated and no default events are triggered, because this is normal behavior. You can still flag it with a custom event, because the Synthetic PoP is defined as an entity in the Instana system. Create a customized event that uses the System Rule type with the offline event, and choose entity.type:syntheticpop from the dropdown.

Alerting Health Issues in advance

After you create events, you can use Alerts and Alert Channels to be notified of monitoring issues early, by email or in Slack channels. You can create alerts with built-in events and customized events. In this example, the alerts are created with the built-in events that the Synthetic PoP provides by default, to monitor playback engine overloaded issues and PoP – Instana backend communication issues.

Inside the Alert definition dashboard, you can add Alert Channels to send alerts to email addresses or Slack channels.

 

Conclusion

This article describes the benefits of using the Instana agent to monitor the Synthetic PoP, and how to monitor and resolve resource contention issues. It lists some issues that you might encounter under resource contention and how to troubleshoot them. More best practices will be shared in upcoming Synthetic PoP blogs.
