Introduction
This blog post looks at BPEL navigation problems in IBM Business Automation Workflow. We walk through the relevant configuration and point out methods for debugging.
BPEL navigation in Business Automation Workflow can be Work Manager based or JMS based. Work Manager based navigation is the default, is the newer of the two, and normally performs better. In some situations JMS based navigation acts as the fallback, for example if the Work Manager based navigation configuration settings are exceeded in the context of API calls.
Enabling Work Manager based navigation also involves some decisions about configuration settings. For further background I would like to refer to the References section at the end. The document WebSphere Process Server 6.1 - Business Process Choreographer - Performance Tuning Automatic Business Processes for Production Scenarios with DB2 can be especially valuable. Its settings also apply to other database vendors.
JVM properties
Garbage collection
A central part of operations is the JVM. By default the gencon garbage collection policy is enabled. It is highly recommended to enable verbosegc logging. This has a small performance impact, but it provides more details about the garbage collection cycles and helps to identify problem areas in the garbage collection process. When sizing the heap, be aware that more is not always better.
A larger heap size can result in a longer time period without garbage collection, but when it happens, it will take more time for the garbage collection cycle as there are more objects to be scanned.
Minimum and maximum Java heap sizing has been discussed at length in the past, and we will not go into the details here. For performance reasons it can make sense to set the minimum and maximum heap size to the same value, while there are also arguments for choosing different values to optimize memory usage. The conclusion here is that Java heap settings can affect both system behavior and performance.
Settings in context of BPEL navigation
JMS based navigation
There are a number of critical tuning parameters. While the concrete settings differ depending on the workload, the general tuning procedure stays the same. As a starting point, one might use the settings from the article mentioned earlier. To verify the results, one can monitor the connection pool usage, e.g. with the built-in PMI module. In some cases one might reach the pool size limit, which has a negative impact on performance. General system monitoring (including CPU, memory, disk I/O and network) is always highly recommended, as tuning operations or even changing workloads can bring a system to its limit.
Work Manager based navigation
With Work Manager based navigation, messages are handled directly instead of being sent to the messaging system. This reduces overhead, which shows up as better performance numbers. Work Manager based navigation can also use the Intertransaction cache, which caches database requests and thereby reduces database traffic, which can bring further performance benefits. The sizing of the Intertransaction cache depends on the available memory, as the cached data needs to be kept in memory, and on the workload. The right size therefore needs to be determined through your own performance testing and through system monitoring of production systems.
Messaging
BPEL internal messaging
Messages which cannot be processed are parked on the BPERetention queue and are retried 15 times (each message is retried 3 times and the default setting for the retry limit of the retention queue itself is 5, which can be changed in the Business Flow Manager configuration). If all attempts fail, they are parked on the BPEHold queue, from where they need to be replayed actively after potential problems have been solved. For debugging purposes it therefore makes sense to check the queue utilization. This can be done under the service integration section of the admin console.
Another problem area can be errors which are saved in the System exception queue. Especially when larger numbers of messages are seen on the system exception queue, it is recommended to find the reason for them and solve the underlying problems.
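The retry arithmetic above can be written down as a small calculation. This is only a sketch of the numbers stated in the text. The function and parameter names are made up for illustration and are not Business Flow Manager settings.

```python
def total_delivery_attempts(retries_per_message: int = 3,
                            retention_retry_limit: int = 5) -> int:
    """Attempts before a message ends up on the hold queue.

    Both parameter names are hypothetical; the defaults are the
    values quoted in the text (3 retries per message, retention
    queue retry limit of 5).
    """
    return retries_per_message * retention_retry_limit


# Defaults as described in the text.
print(total_delivery_attempts())

# Effect of raising the retention queue retry limit to 10.
print(total_delivery_attempts(retention_retry_limit=10))
```

With the default values this yields the 15 attempts mentioned above. Changing the retry limit of the retention queue scales the total linearly.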
Scheduling
In some cases activities are saved for later execution. If there is any concern about activities that have not been processed, it might be worth checking the corresponding scheduler database table SCHED_TASK to see when these activities are actually planned to be executed. Be aware that the time information is stored in UTC time.
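Because the stored time is in UTC, it helps to convert a value explicitly before comparing it with local log time stamps. The following is a minimal sketch that assumes the value read from the table is a number of milliseconds since the Unix epoch. That storage format is an assumption and should be checked against your own schema. The function name and the sample value are made up for illustration.

```python
from datetime import datetime, timezone


def fire_time_from_epoch_millis(epoch_millis: int) -> datetime:
    """Convert an epoch-millisecond value into a timezone-aware UTC datetime.

    Assumption: the scheduler stores the planned execution time as
    milliseconds since 1970-01-01T00:00:00Z.
    """
    return datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)


# Illustrative sample value, not taken from a real system.
planned_utc = fire_time_from_epoch_millis(1_700_000_000_000)
print("UTC:  ", planned_utc.isoformat())

# astimezone() without arguments converts to the local time zone of the machine.
print("Local:", planned_utc.astimezone().isoformat())
```

Keeping the value timezone-aware avoids accidentally reading a UTC time as local time when matching it against trace or SystemOut time stamps.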
Troubleshooting
As BPEL navigation can involve a number of components, collecting only one part of the picture might not be sufficient. The database in particular can have a significant performance impact on the navigation. Database information should therefore be collected for the same time frame as the trace taken during the problem recreation. This especially includes performance data such as long running queries.
Trace specification
In a number of cases problems can be analyzed best if a corresponding trace was collected. This can be tricky, depending on the nature of the problem: a trace that is too detailed can affect the system under investigation, while a trace that is too lightweight might not contain sufficient information to debug the issue. The final decision depends on the specific situation, so a few trace settings which might be beneficial are mentioned here. If full tracing is possible, one can stick to the more extensive trace like:
com.ibm.bpe.*=all:com.ibm.task.*=all:WAS.clientinfopluslogging=all
plus required additional information like RRA=all or messaging tracing.
In all other cases one can reduce the trace to the lightweight trace settings that were created specifically for this purpose.
As we are focusing on BPEL navigation issues, only the related trace settings are listed here:
Trace setting | Description
com.ibm.bpe.basic.navigation.* | Major navigation steps and state changes for BPEL processes
com.ibm.bpe.basic.api* | Time and sequence of methods called by the Business Flow Manager API
Depending on the navigation in use (especially for JMS based navigation) and the messaging products involved, the messaging part should be covered as well:
jmsApi=all:Messaging=all:com.ibm.mq.*=all:JMSApi=all
One standard trace setting to check transaction boundaries is the WAS.clientinfopluslogging=all trace setting.
If a more in-depth tracking of database activities is intended, one can consider a JDBC trace or the RRA=all trace, but this tracing is very rarely required. In most cases database information can be captured much more easily on the database server itself. The next section gives some hints on this.
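The trace settings in this section are combined into one specification string by joining them with a colon. The following sketch shows one way to assemble such a string from the settings listed above. The helper name and the grouping into lists are made up for illustration. Only the trace components themselves come from this post.

```python
# Trace components taken from this post, grouped for convenience.
FULL_TRACE = ["com.ibm.bpe.*=all", "com.ibm.task.*=all", "WAS.clientinfopluslogging=all"]
MESSAGING_TRACE = ["jmsApi=all", "Messaging=all", "com.ibm.mq.*=all", "JMSApi=all"]
DATABASE_TRACE = ["RRA=all"]


def build_trace_spec(*groups):
    """Join trace components with ':' and drop exact duplicates, keeping order.

    Hypothetical helper, not part of any IBM tooling.
    """
    components = []
    for group in groups:
        for component in group:
            if component not in components:
                components.append(component)
    return ":".join(components)


# Full trace plus the rarely needed database tracing.
print(build_trace_spec(FULL_TRACE, DATABASE_TRACE))

# Full trace plus messaging, e.g. for JMS based navigation.
print(build_trace_spec(FULL_TRACE, MESSAGING_TRACE))
```

Building the string from named groups makes it easier to document which parts of the trace were active when a problem was recreated.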
Database impact
All BPEL navigation requires database resources. It is therefore advisable to monitor the database at the time a problem occurs.
Some examples of what you can do:
Db2 database
You can use the following two SQL statements for querying in-memory monitoring (make sure you are connected to the involved database):
db2 "call monreport.dbsummary(300)"
This produces a database summary for a monitoring interval of 300 seconds (the number can be adjusted if needed). The data is collected after the command is issued, so the output appears once the interval has passed. The output can also be redirected to a file for later reference.
db2 "call monreport.pkgcache(30)"
This reports all dynamic and static SQL statements that were updated in the last 30 minutes (the number can be adjusted if needed). This command provides its output immediately.
Oracle database
In newer Oracle database server versions the AWR report includes additional information such as ADDM reports. For troubleshooting it is important to focus on the time at which a problem is observed. It is therefore advisable to generate the AWR report for a short time frame that contains the problem. With a default configuration this could be a one hour time frame around the potential problem period. Be aware that such a report can only be created afterwards. By default an AWR report can be created for up to 8 days back, so nothing is lost immediately if the report was not created directly after the problem occurred.
General system monitoring
To identify problems caused by system overload it is critical to understand any system bottleneck. Monitoring the classic resource utilization of CPU, memory, disk I/O and network is therefore important.
Checklist:
- JVM settings
- JMS based / Work Manager based settings
- Tracing of the problem scenario, with the time stamp and the affected process instance and activity instance documented so that the navigation can be tracked
- General system utilization information
- Database performance / configuration
References:
For reference, here are some useful resources:
Redbook IBM Business Process Manager V8.5 Performance Tuning and Best Practices: https://www.redbooks.ibm.com/redbooks/pdfs/sg248216.pdf
WebSphere Process Server 6.1 - Business Process Choreographer - Performance Tuning Automatic Business Processes for Production Scenarios with DB2