Co-authored by Nirmal Ramesan
Your IBM Sterling Order Management System (OMS) has a number of components, each with its own performance characteristics. To optimize performance, you need to consider parameters at multiple levels (application, infrastructure, database, and so on) and conduct thorough performance testing.
Over time you may experience performance issues, such as high response times, low throughput, or out-of-memory errors, for a variety of reasons. Sometimes these issues can be resolved within the OMS, but other times the source may lie in the application server, database, middleware, network, or even a third-party application.
In this blog series, we’ll take a look at some of the real-world performance issues customers have faced and how they were resolved. We’ll also provide some tips on how to optimize OMS performance, including common issues that affect performance and best practices for addressing them.
Keep in mind that these are general guidelines and thorough performance testing should be carried out before applying any changes to your production environment.
Let’s kick off this series with our first performance insight.
A production application server outage caused by high row lock contention on the Oracle database.
The customer uses pre-checkout web services (SOAP and REST APIs) to retrieve delivery options (e.g., truck or parcel delivery) and the corresponding delivery time slots before placing an order.
- The application servers went into OVERLOADED status, and the WebLogic data source connection pools were exhausted, hitting their upper limit of 200 connections.
- Upon further investigation, it was observed that connections were not being released due to high row lock contention on the YFS_INBOX table.
- Excessive logging to YFS_INBOX was observed, caused by a very high number of NullPointerExceptions (NPEs) during TimeWindows (time slot) web service calls within a short span of time.
- These exceptions were being registered in the YFS_INBOX table, and multiple transactions were repeatedly trying to acquire a lock on the same error record in order to update its error consolidation count. This caused almost all the application server threads to go into a WAIT state.
SELECT /*YANTRA*/ YFS_INBOX.* FROM YFS_INBOX YFS_INBOX WHERE (YFS_INBOX.INBOX_KEY = :1 ) FOR UPDATE
- As a result, the application server connection pools reached peak capacity, bringing down the application servers.
- This resulted in a critical incident in which the servers were down for almost 30 minutes.
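When row lock contention like this is suspected, the blocking chain can usually be seen directly in the database. The query below is a minimal sketch, assuming SELECT access to Oracle's V$SESSION dynamic performance view; exact column availability can vary by Oracle version.

```sql
-- Sketch: find sessions waiting on row locks and who is blocking them
-- (assumes SELECT privileges on the V$SESSION dynamic performance view)
SELECT sid,
       serial#,
       blocking_session,   -- SID of the session holding the lock
       event,              -- wait event, e.g. 'enq: TX - row lock contention'
       seconds_in_wait,
       sql_id              -- join to V$SQL to see the statement text
FROM   v$session
WHERE  event = 'enq: TX - row lock contention'
ORDER  BY seconds_in_wait DESC;
```

In an incident like this one, you would expect to see many sessions queued behind a single blocker, all running the same YFS_INBOX FOR UPDATE statement.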
Snapshot showing DB Connection Pool Utilization hitting max limit:
YFS_INBOX FOR UPDATE SQL consuming high DB resources:
To stop writing errors to the database and instead have them recorded only in the logs, the customer took the following steps:
- Set the property onerror.raisealert=N in customer_overrides.properties on the application servers. With this property in place, exceptions are no longer inserted into or updated in the YFS_INBOX table.
- Shifted exception monitoring to Splunk, with custom dashboards in place that read exception details from the server and access logs generated by WebLogic.
- Eliminated the risk of the application servers going down, even during bursts of exceptions.
- Reduced WAIT events on the database, with fewer blocking sessions.
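As a sketch, the override from the first step might look like the following. Note that the exact property key can depend on your Sterling version, and properties overridden in customer_overrides.properties may require a yfs. prefix; verify against your installation's property files before applying.

```properties
# customer_overrides.properties (on each application server)
# Sketch: stop registering exceptions in the YFS_INBOX table so that
# errors are written only to the application logs.
# Depending on your Sterling version, the key may need a yfs. prefix
# (i.e., yfs.onerror.raisealert); verify against your yfs.properties.
onerror.raisealert=N
```

As with any such change, test it thoroughly in a lower environment first, since it changes where exceptions are recorded.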
For more details on this particular scenario, please contact me or Nirmal Ramesan, and stay tuned for the next blog in this series.