Order Management & Fulfillment

Observability at Scale: Mastering Billions of Transactions with Sterling Order Management - Part 3

By Yaduvesh Sharma posted Sat April 19, 2025 01:27 PM

  

End to end observability of application transactions

A typical business API flow consists of multiple service and component interactions, which may span multiple transaction boundaries. To debug and triage problems in a flow, it is important to have end-to-end visibility of all the interactions that happened as part of that flow.

Sterling Order Management (OMS) SaaS relies on a combination of transaction logging and application tracing to achieve end-to-end observability for every business transaction. Each API may consist of multiple interactions and in turn, each interaction may consist of one o
This article focuses on how OMS SaaS uses Instana automatic tracing for Java applications to achieve observability.

Using the Java tracing sensor to capture traces from a Java application

Instana can automatically trace applications through bytecode manipulation of standard libraries and frameworks. It has sensors for several standard libraries, which are responsible for collecting the trace data.

Tracing a Java application is very useful: Instana automatically captures all incoming and outgoing HTTP, messaging, and database calls in the application. It can also stitch together calls across applications to form a trace that represents the larger overall call, e.g. an external client performing an API invocation that is broken up into child interactions across different components. Each interaction is broken down into up to two spans. Instana tries to capture every span, without any sampling.

The spans from an application are collected in a queue within the application and then pushed to the Instana agent periodically. The Instana agent then processes them and uploads them to the backend server.

Below are some of the considerations for using tracing with Instana:

Sizing the application to accommodate tracing

Unlike custom metrics scraping in Instana, tracing consumes considerable resources such as CPU and memory within the application, because the spans need to be captured and staged within the application before they are offloaded to the Instana agent.

The number of spans captured grows with the number of outbound calls (database, HTTP, and messaging) that the application makes at a given time. High incoming traffic to the application likewise increases the number of spans collected at a given time.

As an example, if a single business API call received by an application results in X database queries (some of which may run in parallel), Y external HTTP calls, and Z messaging calls, and 100 such business API calls are processed at the same time, then in the worst case there may be 100 times (X + Y + Z) spans for Instana to collect in a small window of time.

Hence, based on the call patterns of the application, additional resources, especially CPU and memory, should be added for Instana to function properly.
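The worst-case estimate above can be written as a small calculation. This is only a back-of-the-envelope sketch: the function name and the sample values for X, Y and Z are hypothetical, and real span counts depend on how each call is instrumented.

```python
def worst_case_spans(db_queries: int, http_calls: int, messaging_calls: int,
                     concurrent_api_calls: int) -> int:
    """Upper bound on spans produced in one small window of time.

    Mirrors the estimate in the text: concurrent calls * (X + Y + Z).
    """
    counts = (db_queries, http_calls, messaging_calls, concurrent_api_calls)
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    return concurrent_api_calls * (db_queries + http_calls + messaging_calls)


# Hypothetical call pattern: X=20 database queries, Y=3 HTTP calls,
# Z=2 messaging calls, 100 business API calls in flight at once.
estimate = worst_case_spans(20, 3, 2, 100)
print(estimate)  # 2500
```

An estimate like this can be a starting point for deciding how much CPU and memory headroom to add for tracing.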

Tuning the span collection queue contained within the application

This is an advanced configuration that most applications will not need. However, if an application generates many spans in a small window, for example because it makes many parallel database calls or processes a batch of transactions in parallel, it may be useful to tune the span queue that stages the generated spans before they are pushed to the Instana agent.

The span queue size can be altered using the configuration below in the Instana configuration yaml:

  com.instana.plugin.javatrace:
    instrumentation:
      spanQueueSize: 50000

Alternatively, it can be configured for a specific Java application using the environment variable below:

INSTANA_SPAN_QUEUE_SIZE=50000

Note that the default span queue size is sufficiently large, above 100k. The Instana agent flushes the queue and pushes the spans to the Instana backend periodically or when the queue fills up. If a large number of spans is generated in a short window, the queue can grow, increasing memory pressure on the application. It can also cause more CPU spikes on the Instana agent, as it has to deal with a larger chunk of spans.

On the other hand, setting the value too low can increase the network communication between the application and the Instana backend, and in some cases can cause spans to be dropped more frequently. For example, if there are problems connecting to the Instana agent or the agent is overloaded, pushing spans to the agent may fail, and the spans waiting in the queue are then dropped.
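The trade-off between a large and a small queue can be illustrated with a toy model. This is a sketch under stated assumptions, not Instana's actual queueing or flush logic: the `simulate` function, the rule that a full queue triggers a flush, and the rule that a failed flush drops everything queued are all hypothetical simplifications of the behaviour described above.

```python
def simulate(burst_size: int, capacity: int, outage_until: int = 0):
    """Push `burst_size` spans through a bounded staging queue.

    Assumptions (hypothetical model):
    - a flush is attempted whenever the queue is full;
    - the agent is unreachable while the span index is < outage_until,
      and a flush attempted during that time drops all queued spans.

    Returns (flush_attempts, dropped_spans, peak_queue_length).
    """
    queued = flush_attempts = dropped = peak = 0
    for i in range(burst_size):
        if queued >= capacity:
            flush_attempts += 1
            if i < outage_until:
                dropped += queued  # push failed, queued spans are lost
            queued = 0
        queued += 1
        peak = max(peak, queued)
    if queued:
        flush_attempts += 1  # final flush, agent assumed reachable again
    return flush_attempts, dropped, peak


# Same burst of 1000 spans, agent unreachable for the first 300 of them.
print(simulate(1000, capacity=100, outage_until=300))  # (10, 200, 100)
print(simulate(1000, capacity=500, outage_until=300))  # (2, 0, 500)
```

In this model the small queue makes five times as many pushes and loses 200 spans during the outage, while the larger queue rides through the outage at the cost of holding up to 500 spans in application memory.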

Disable tracing for applications you are not interested in tracing

Instana automatically traces many standard technologies such as Java, Python, and Go. If you run a mix of applications using these technologies and are not interested in monitoring some of them, it is better to disable monitoring for them explicitly. Although Instana enables plugins based on the type of technology it encounters, you can still disable specific technology plugins in the Instana configuration yaml if you are not interested in metrics or traces for the applications using that technology, as shown below:

    com.instana.plugin.nginx:
      enabled: false  # if you are not interested
                      # in monitoring nginx apps

Further, if you are interested in Java application monitoring and tracing in general, but there are specific applications which do not need tracing, you can disable tracing for just those applications. This not only avoids sending unnecessary traces to the Instana agent and backend server but also reduces resource utilization on the applications themselves.

As an example, if you have a Java application that reads log messages from a messaging queue and posts them to Elasticsearch, you may want to prevent Instana from tracing this logging application. Otherwise, a large volume of spans about logging will be captured and processed by Instana.

You can disable Java tracing for specific applications by setting the environment variable below in the respective Java application environment:

INSTANA_JAVA_TRACER_ENABLED=false

Care with custom tracing for application code

Other than automatic tracing of HTTP, messaging, and database calls, Instana provides the capability to add custom tracing in application code to collect spans representing the execution of that code. For instance, you may want to trace an interesting Java method and create a custom span for every invocation of that method. This can be achieved through the Instana configuration as well as by using the Instana SDK.

Take care when tracing Java methods to avoid methods that are called many times in a single transaction, e.g. from inside a loop. Otherwise, an unnecessarily large number of spans will be generated and collected. Instead, you could trace a top-level method.

Also, when using the configuration yaml to configure custom tracing, it is better to provide the name of the specific implementation class that has the desired method, instead of specifying a higher-level parent class or interface. This helps Instana find the class to be monitored more efficiently.
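The effect of tracing a method inside a loop can be seen with a small stand-in. This sketch does not use the Instana SDK: the `traced` decorator is a hypothetical placeholder that simply records one entry per call, the way a custom span would be created per invocation, and the order-pricing functions are invented for illustration.

```python
import functools

recorded_spans = []  # stand-in for the spans a tracer would collect


def traced(func):
    """Hypothetical custom-span hook: records one span per invocation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        recorded_spans.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def price_line(line):
    """Hot helper, called once per order line."""
    return line["qty"] * line["unit_price"]


traced_price_line = traced(price_line)


def price_order_inner_traced(lines):
    # Tracing the helper inside the loop: one span per order line.
    return sum(traced_price_line(line) for line in lines)


@traced
def price_order_top_traced(lines):
    # Tracing only the top-level method: one span per order.
    return sum(price_line(line) for line in lines)


lines = [{"qty": 2, "unit_price": 5.0} for _ in range(500)]

recorded_spans.clear()
total_inner = price_order_inner_traced(lines)
spans_when_tracing_inner = len(recorded_spans)

recorded_spans.clear()
total_top = price_order_top_traced(lines)
spans_when_tracing_top = len(recorded_spans)

print(spans_when_tracing_inner, spans_when_tracing_top)  # 500 1
```

Both variants compute the same result, but the inner-loop variant produces 500 spans for a single order while the top-level variant produces one.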

Tuning span suppression due to rate limiting

Instana has a built-in mechanism that protects the Instana agent by dropping spans when a specific agent is processing a large number of them. Processing more spans requires sufficient resources on the Instana agent side. Based on the volume of tracing and spans generated by the workload, the span drop limits can be configured as below.

    # span suppression threshold on the agent
    com.instana.plugin.javatrace:
      instrumentation:
        minspanduration: 1
        span-suppression-start: 1000

By default, span suppression may begin as soon as the span rate goes above a few hundred spans per second. Once suppression begins, certain spans with a lower duration (fast spans) may be dropped to reduce load.

The rate at which span suppression should begin can be configured using the ‘span-suppression-start’ attribute, with the value given in spans per second. Once suppression begins, the ‘minspanduration’ attribute sets a time in milliseconds; spans that take less than this time may be dropped.
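How the two attributes interact can be sketched as a simple filter. This is an illustrative model only, not the agent's actual algorithm: the function name is hypothetical, and it drops every span below the minimum duration once the rate threshold is crossed, whereas the agent only "may" drop such spans.

```python
def apply_suppression(span_durations_ms, spans_per_second,
                      suppression_start, min_span_duration_ms):
    """Return the span durations that survive suppression in this toy model.

    - Below or at `suppression_start` spans per second, nothing is dropped.
    - Above it, spans shorter than `min_span_duration_ms` are dropped.
    """
    if spans_per_second <= suppression_start:
        return list(span_durations_ms)
    return [d for d in span_durations_ms if d >= min_span_duration_ms]


durations = [0.2, 0.8, 1.0, 5.0]  # hypothetical span durations in ms

# Values matching the sample configuration: start at 1000 spans/s, 1 ms minimum.
print(apply_suppression(durations, 500, 1000, 1))   # [0.2, 0.8, 1.0, 5.0]
print(apply_suppression(durations, 1500, 1000, 1))  # [1.0, 5.0]
```

Raising the start threshold delays suppression but requires more agent resources, while raising the minimum duration makes suppression more aggressive once it starts.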

Avoid unnecessary log statements

Instana provides a rich capability to automatically capture log statements generated by the application by tracing popular logging libraries. It can also link application traces with error log messages to help triage the errors.

However, at times you may not want to capture all the log statements generated by the application, because the data must be sent to the Instana server, which adds network traffic and extra work for the agent. This applies even more if you are using another solution to capture and analyze logs.

For Java applications using the Log4j2 library, you may consider disabling log capture for a specific cluster by using the configuration below in the Instana configuration yaml:

com.instana.plugin.javatrace:
  instrumentation:
    plugins:
      Log4j2Exit: false

Conclusion

In this blog, we discussed how Sterling Order Management uses Instana to trace Java applications at massive scale and achieve end-to-end transactional observability, which helps in gaining real-time visibility into the state of the system and in quickly triaging and debugging problems.


Co-Author: Sreedhar Kodali, Senior Technical Staff Member, Instana Engineering
