

Observability at Scale: Mastering Billions of Transactions with Sterling Order Management - Part 1

By Yaduvesh Sharma posted 27 days ago

  


Considerations for monitoring large-scale multi-tenant enterprise applications

For any modern microservice-based enterprise SaaS solution that supports mission-critical applications, monitoring comes with inherent complexities. Even so, it is pivotal to have an end-to-end, real-time view of the application's performance, health, and general functioning.

Here are some key considerations when implementing real-time observability for such applications.


Multiple services scaled independently

A typical distributed solution serving enterprise workloads comprises multiple services running on a wide variety of technology stacks, working in concert to deliver the desired outcome. These services are also deployed across different clusters and zones for high availability. Because each service can be scaled independently, the monitoring solution needs to keep up in real time with the elasticity of these components and provide an end-to-end view across all the services.

High volume of business transactions

Enterprise solutions, especially those running workloads for multiple tenants, often experience large fluctuations in volume. Sudden spikes can overload the monitoring solution and delay the processing of observability signals. They can also cause resource contention between the monitoring system and the application.

Tenant-wise segregation

In addition to holistic application monitoring, a multi-tenant enterprise solution needs to measure tenant-specific metrics to keep track of SLOs, failure rates, and noisy-neighbour concerns. This enables prompt actions such as throttling and traffic rerouting, and supports tenant-specific visualizations of key metrics.
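As a rough illustration of tenant-wise segregation, the sketch below aggregates request outcomes per tenant and flags tenants that take a disproportionate share of traffic. The event shape, the `TenantStats` type, and the 50% traffic-share threshold are assumptions made for the example, not details of the OMS implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TenantStats:
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        # Avoid division by zero for tenants with no traffic.
        return self.errors / self.requests if self.requests else 0.0


def aggregate_by_tenant(events):
    """Group (tenant_id, succeeded) events into per-tenant counters."""
    stats = defaultdict(TenantStats)
    for tenant_id, succeeded in events:
        stats[tenant_id].requests += 1
        if not succeeded:
            stats[tenant_id].errors += 1
    return dict(stats)


def noisy_neighbours(stats, max_share=0.5):
    """Return tenants whose share of total traffic exceeds max_share (hypothetical rule)."""
    total = sum(s.requests for s in stats.values())
    if total == 0:
        return []
    return sorted(t for t, s in stats.items() if s.requests / total > max_share)


# Hypothetical sample: tenant "a" sends 8 requests, tenant "b" sends 2 (one fails).
events = [("a", True)] * 8 + [("b", False), ("b", True)]
stats = aggregate_by_tenant(events)
flagged = noisy_neighbours(stats)
```

A list like `flagged` could feed the throttling or rerouting decisions mentioned above, while the per-tenant error rates back tenant-specific dashboards.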

Portability of application monitoring metrics

For enterprise applications that can be deployed outside their own SaaS environment, for example on-premises or in a different public cloud, it is important to consider the portability of application metrics across observability solutions. This lets customers easily integrate application-level metrics into the observability tool of their choice.

IBM Order Management: Architecture

The IBM Order Management (OMS) SaaS solution is a mix of modular business services that serve the shopper's order management journey, from pre-purchase inquiry through order capture and order fulfillment, along with optimizations backed by an AI engine.

The architecture is a mix of multi-tenant and tenant-dedicated workloads and provides high availability by maintaining redundancy at the service and cluster levels. This enables a low recovery time of a few minutes for certain critical services.

The SaaS solution uses a wide range of technology stacks for specific capabilities. It consists of hundreds of environments, almost two dozen clusters, and over ten thousand container deployments spread across more than one thousand nodes. The services within these environments can be scaled horizontally and vertically to absorb spikes in customer workloads. During the recent retail holiday peak sale, OMS processed more than twenty billion business transactions during the sale week, with a peak volume of more than sixty thousand transactions per second.
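To put these volumes in perspective, a quick back-of-the-envelope calculation compares the weekly total with the peak rate. Both inputs are the lower bounds quoted above ("more than"), so the results are only indicative.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

weekly_transactions = 20_000_000_000  # "more than twenty billion" (lower bound)
peak_tps = 60_000                     # "more than sixty thousand" per second (lower bound)

# Average rate if the weekly volume were spread evenly over the week.
average_tps = weekly_transactions / SECONDS_PER_WEEK
peak_to_average = peak_tps / average_tps

print(f"average: {average_tps:,.0f} transactions per second")  # average: 33,069 ...
print(f"peak / average: {peak_to_average:.2f}")                # peak / average: 1.81
```

Even averaged over a full week, the load is in the tens of thousands of transactions per second, and the monitoring pipeline has to absorb bursts of nearly twice that.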

 

Observability in Order Management SaaS

The following are some of the key facets of observability in Order Management SaaS:

Open standard application metrics

Because OMS is deployed across on-premises, private, and public cloud environments, it is crucial to provide an open and portable standard for monitoring the application's KPIs. All the services within the OMS suite expose a set of Prometheus metrics that measure system and functional KPIs for easy integration with any observability tool. Any such tool can collect these metrics from the exposed service endpoints and use them to power SRE and IT dashboards and to create alerts for critical conditions.
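For readers unfamiliar with what such an endpoint returns, the sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. The metric name `oms_api_requests_total` and the `tenant` and `status` labels are hypothetical, chosen for illustration; the actual metric names exposed by OMS are not listed in this article.

```python
def escape_label_value(value: str) -> str:
    # The text format requires escaping backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
text = render_counter(
    "oms_api_requests_total",
    "Total API requests handled.",
    [({"tenant": "acme", "status": "200"}, 1027)],
)
print(text, end="")
# # HELP oms_api_requests_total Total API requests handled.
# # TYPE oms_api_requests_total counter
# oms_api_requests_total{status="200",tenant="acme"} 1027
```

Because the format is plain text served over HTTP, any scraper that understands it can consume the metrics, which is what makes them portable across observability tools. In practice a Prometheus client library would normally generate this output.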

Request-specific tracing for triaging and root cause analysis

While Prometheus metrics are helpful for macro analysis and trends of relevant KPIs at the component, tenant, or cluster level, some scenarios call for deeper debugging of the business process flow to gain insight into specific requests and how the components interact with each other.

Part of this is achieved by logging important aspects of each request, so that the logs can be reviewed later for deeper investigation and root cause analysis. Additionally, OMS SaaS relies on application request tracing to analyse the end-to-end flow of requests when debugging performance problems in specific components or services.
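One common way to make request logs useful for later triage is to stamp every log line with a request-scoped identifier. The sketch below shows this generic correlation technique with Python's `logging` and `contextvars` modules. It is not how OMS or Instana implement tracing; the logger name, log format, and function names are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the identifier of the request currently being processed.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    def filter(self, record):
        # Attach the current trace id so the formatter can print it.
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("oms.example")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def reserve_inventory(logger, order_id):
    # A downstream step: it logs without knowing about the trace id.
    logger.info("inventory reserved id=%s", order_id)


def handle_request(logger, order_id, trace_id=None):
    # Reuse an incoming trace id if present, otherwise start a new one.
    token = trace_id_var.set(trace_id or uuid.uuid4().hex)
    try:
        logger.info("order received id=%s", order_id)
        reserve_inventory(logger, order_id)
    finally:
        trace_id_var.reset(token)


stream = io.StringIO()
logger = build_logger(stream)
handle_request(logger, "A1", trace_id="abc123")
print(stream.getvalue(), end="")
```

Every line emitted while handling the request carries the same `trace=abc123` value, so the log lines for one request can be pulled together during root cause analysis.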

Proactive alerting mechanism

Based on a combination of application and infrastructure metrics, request traces, logs, and automated synthetic API calls, the services are proactively monitored for key metrics such as SLOs, error rates, and latency to catch potential problems before they begin to impact customers. This enables prompt corrective actions such as automatic traffic rerouting, applying thresholds, and sending notifications to stakeholders.
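A threshold check of this kind can be sketched in a few lines. The example below evaluates one observation window against an error-rate limit and a latency limit. The `Window` type and the threshold values (1% errors, 500 ms at the 95th percentile) are assumptions for illustration and do not reflect the actual OMS alert rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Aggregated observations for one service over one time window."""
    requests: int
    errors: int
    p95_latency_ms: float


def evaluate(window, max_error_rate=0.01, max_p95_ms=500.0):
    """Return a list of alert messages; an empty list means the window is healthy."""
    alerts = []
    if window.requests == 0:
        return alerts  # nothing to evaluate in an empty window
    error_rate = window.errors / window.requests
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    if window.p95_latency_ms > max_p95_ms:
        alerts.append(f"p95 latency {window.p95_latency_ms:.0f} ms exceeds {max_p95_ms:.0f} ms")
    return alerts


# Hypothetical windows: one degraded, one healthy.
degraded = evaluate(Window(requests=1000, errors=30, p95_latency_ms=750.0))
healthy = evaluate(Window(requests=1000, errors=5, p95_latency_ms=200.0))
```

Here `degraded` contains two messages and `healthy` is empty. A real alerting pipeline would route such messages to notifications or automated actions like traffic rerouting.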

Visualization dashboards

OMS SaaS provides real-time, self-service monitoring dashboards where customers can visualize the rate and summary of several metrics, including API calls, inventory actions, errors, and published events. Users can choose from different time ranges to analyse metrics for specific intervals of their workloads.

Additionally, there are internal dashboards that are used by operations and site reliability teams to monitor the workloads.

For the SaaS solution, OMS relies on IBM Instana as its observability solution to provide a holistic view across application and infrastructure metrics and traces. In the next part of this series, we will dive deep into how we integrate our application with Instana to support Prometheus-based monitoring in Order Management System at scale.

Conclusion

In this article, we covered the considerations and complexities involved in implementing observability for an enterprise application like IBM Order Management. We also discussed the observability capabilities that Order Management SaaS provides to monitor customer workloads at large scale. In the next part of the series, we will discuss in detail the tuning and best practices applied to support Prometheus metrics in Order Management using IBM Instana.
