A story about how to diagnose an ACE performance issue

By HOU FANG ZHAO posted Wed April 10, 2024 10:05 PM

Authors: Peng Deng, Hou Fang Zhao, Li Jian Wang

The Instana agent integrates many sensors. Because some of these sensors are enabled by default, when users encounter performance issues it can be challenging to determine which sensor is causing the issue and to identify the root cause. In this blog, we'll cover a success story about how we diagnosed a performance issue when using the ACE sensor and how we resolved it.

Problem

A customer had an environment with more than 70 integration servers running on a single OCP node. The customer complained that the agent's CPU usage consistently reached 1.5 cores (sometimes as high as 2.4 cores) when the ACE sensor was activated.

Diagnosis

The first thing we checked was the scale of the environment: the more integration servers a node runs, the more resources the agent needs.

When checking with the customer, we confirmed that their environment had more than 70 integration servers on a single node, and one node even had more than 110 running. We understood that at this scale the ACE processes themselves would consume a lot of resources. What we didn't understand was why the ACE sensor would consume so much, especially since we couldn't reproduce the issue on our local environment, even after running 50 integration servers with 1100+ messages for an hour.

Next, we had to reproduce the issue, either in the customer's environment or in our local environment. As the customer was on a SaaS environment, we were able to see their UI. While keeping an eye on the customer's environment, we deployed more integration servers locally (increasing from 50 to 100) and let them run continuously. Fortunately, we were able to reproduce the issue after 6 to 8 hours of running. We then reproduced it several more times to make sure it was consistent.

We also double-checked that the customer's environment showed a similar CPU curve.

After reproducing the problem locally, we could debug it directly in our own environment.

CPU load increased after 8 hours on VM:

CPU load increased after 6 hours on OCP:

Now, we had to narrow down the possible causes for high CPU usage. Here are the steps we followed:

  1. ACE sensor: The customer reported that the issue only occurred when the ACE sensor was activated, so the first step we took was to disable the ACE sensor. After running for about 10 hours, we confirmed that with the ACE sensor disabled, everything worked well. (A configuration sketch for disabling a sensor follows this step.)
    The CPU load was stable after running 18 hours with the ACE sensor disabled:
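    Sensors are switched on and off in the agent's configuration.yaml. As a minimal sketch, disabling a sensor generally looks like the snippet below; note that the plugin key shown is an assumption based on Instana's usual com.instana.plugin.* naming, so check the ACE sensor documentation for the exact key:

        # Sketch only: the plugin key is assumed, not confirmed
        com.instana.plugin.ace:
          enabled: false   # rule the sensor out during diagnosis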

  2. Remote monitoring: When investigating the issue, we found that both the customer's environment and our local environment were using local monitoring. Local monitoring can introduce extra workload (such as JVM attachment and the process sensor), while remote monitoring focuses only on the metrics of the monitored technology itself. With remote monitoring, the CPU usage was very stable even over long runs. (An illustrative configuration sketch follows this step.)
    Note: Remote monitoring is not supported in a cloud-based environment, so we could only test on a VM with a host agent.
    CPU load on VM with remote monitoring was stable over 14 hours of running:
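    For reference, remote monitoring is configured in the agent's configuration.yaml by pointing the sensor at a remote endpoint instead of attaching to local processes. The sketch below is purely illustrative: the field names are hypothetical, and the actual schema is described in the ACE sensor documentation.

        # Hypothetical sketch: these keys are illustrative, not the documented schema
        com.instana.plugin.ace:
          remote:
            - host: 'ace-host.example.com'   # hypothetical: remote integration node host
              port: 4414                     # hypothetical: ACE admin REST API port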

  3. Analyzing the differences between remote monitoring and local monitoring:
    From the above results, we could see that remote monitoring did not cause high CPU usage, so the ACE sensor itself was probably not the problem. Taking a closer look into the ACE sensor code, the only difference we found between remote and local monitoring was that with local monitoring, every ACE process, whether an integration node process or an independent integration server process, activates a process sensor instance in addition to the ACE sensor instance.
    In other words, with local monitoring the number of process sensor instances is determined by the total number of ACE processes, counting the integration node (or independent integration servers in a cloud-based environment) as well as every integration server. For example, the customer's node running more than 110 integration servers would activate over 110 process sensor instances.
    Now, let's see what result we get if we don't activate the process sensor.

  4. Stop activating the process sensor:
    We introduced a forceRemote attribute in the ACE sensor, so that users can force the local agent to treat local monitoring as remote monitoring and thereby avoid activating the process sensor. We tested with this option and discovered that as long as the process sensor was not activated, everything worked well. (A configuration sketch follows this step.)
    CPU load on VM after 16 hours of running with forceRemote set to true:
    CPU load on OCP after 16 hours of running with forceRemote set to true:
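    A minimal sketch of what setting this attribute in configuration.yaml might look like; forceRemote is the attribute introduced above, while the plugin key is assumed to follow Instana's com.instana.plugin.* convention:

        # Sketch only: the plugin key is assumed; forceRemote is the attribute described above
        com.instana.plugin.ace:
          forceRemote: true   # treat local monitoring as remote; the process sensor is not activated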

  5. Investigating what causes high CPU usage in the process sensor:
    From the above steps, we had already discovered a workaround for this issue. But we still wanted to know why the process sensor was causing it and whether there was anything more we could do to improve the situation.

    By investigating with the Instana profiler, we found that the Sigar library, which is used by the host sensor and the process sensor, was somewhat expensive, especially as it was called frequently to retrieve metrics from the processes. Even so, we couldn't ascertain that the high CPU symptom was caused by it, because we didn't have enough evidence. What we did know was that the Sigar library would be called frequently by the host and process sensors, which means a lot of dynamic class loading (the sketch below shows one way to verify this).
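    One lightweight way to confirm the class-loading suspicion, independent of the profiler, is to sample the JVM's class-loading counters over time. Here is a minimal sketch using only the standard java.lang.management API (it must run inside, or be adapted to attach to, the JVM being observed; no Instana internals are assumed):

        import java.lang.management.ClassLoadingMXBean;
        import java.lang.management.ManagementFactory;

        public class ClassLoadWatcher {
            public static void main(String[] args) throws InterruptedException {
                ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
                long previous = cl.getTotalLoadedClassCount();
                while (true) {
                    Thread.sleep(10_000); // sample every 10 seconds
                    long total = cl.getTotalLoadedClassCount();
                    // A count that keeps climbing for hours indicates continuous
                    // dynamic class loading, as we suspected for the Sigar calls.
                    System.out.printf("loaded=%d (+%d in 10s), unloaded=%d%n",
                            total, total - previous, cl.getUnloadedClassCount());
                    previous = total;
                }
            }
        }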

  6. Tuning the JVM options to improve performance:
    Considering what we found in the previous step, we decided to tune the JVM options to get better performance. When exploring the JVM options used by the Instana agent, we found the option -XX:ReservedCodeCacheSize. The code cache is where the JVM stores compiled code, such as JIT-compiled methods, and this option specifies the maximum size the code cache can reach. Once the code cache reaches that maximum, the JVM halts JIT compilation, so new hot paths keep running in the much slower interpreter. The default value in the instana-agent is 16m, which is rather small. In our case, we needed a bigger cache because of the frequent calls into the Sigar library, so we adjusted the value to 256m and conducted a new test. (A sketch for checking code cache pressure follows this step.)
    CPU load was stable after 3 days of running:
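    To see whether a JVM is approaching its ReservedCodeCacheSize limit, you can read the code cache memory pools through the standard management API. A minimal sketch (standard JDK API only; the name filter covers both the single "Code Cache" pool of older JDKs and the segmented "CodeHeap" pools of newer ones):

        import java.lang.management.ManagementFactory;
        import java.lang.management.MemoryPoolMXBean;

        public class CodeCacheUsage {
            public static void main(String[] args) {
                for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                    String name = pool.getName();
                    // Match "Code Cache" (older JDKs) and "CodeHeap '...'" (newer JDKs)
                    if (name.contains("Code")) {
                        long usedKb = pool.getUsage().getUsed() / 1024;
                        long maxKb = pool.getUsage().getMax() / 1024;
                        System.out.printf("%s: used=%d KB of max=%d KB%n", name, usedKb, maxKb);
                    }
                }
            }
        }

    Alternatively, jcmd <pid> Compiler.codecache prints the same information for a running JVM, and the limit itself is raised on the JVM command line with, for example, -XX:ReservedCodeCacheSize=256m.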

  7. We finally figured out the solution and shared our feedback with the agent team. The agent team chose a proper value for this option and delivered the change. Here is a snapshot of the ACE sensor running for long hours with the latest agent settings.


    In the future, if you encounter a similar issue, you can try to diagnose it with these steps and optimize the JVM options.

Summary

In conclusion, if you encounter a performance issue, make sure to narrow down the possible causes with the following steps:

  1. Determine the scale of the environment and how many sensors are enabled in the environment.

  2. Disable all other sensors if possible, keeping only one sensor enabled, to check whether the issue is still reproducible. Repeat this step to identify which sensor is causing the performance issue.

  3. Check when and how the issue can be reproduced. Does the issue occur as soon as you enable the sensor, or only after running for a prolonged period of time?

    1. For issues that reproduce as soon as the sensor is enabled, check the scale of the environment, then change the agent settings to increase the Max Memory or CPU Quota. You can also try remote monitoring if it is supported. If possible, collect data from the profiler to determine whether the sensor itself can be improved.

    2. For issues that reproduce only after the servers have been running for a long period of time, try remote monitoring if it is supported, and try adjusting the JVM options used to start the Instana agent.

In our case, while addressing the performance issue, we also identified areas for improvement within the ACE sensor itself and made some significant enhancements to support large-scale environments. We will post another blog going into more detail about what we have done to the ACE sensor for large-scale environments. We wrote this blog to share our experience diagnosing performance issues in large-scale environments, and we hope to see more people sharing their ideas for troubleshooting and solving the problems that arise in daily work.


#CaseStudy