z/TPF - Group home

z/TPF system monitoring for Java applications (PJ46312 & PJ46275)

By Jennifer Chiarieri posted Mon April 19, 2021 01:38 PM

  

Problem statement

For this project, we started with the following problem statement:

With the Health Center, I can monitor a single Java virtual machine (JVM) that is connected to a single system, but I need to monitor the entire z/TPF system to better understand the impact of introducing Java to my environment.

 

Solution

z/TPF system monitoring for Java applications gives you a system-wide view of all the JVMs that are running on your z/TPF system. JVM data is collected and passed to real-time runtime metrics collection, which sends it to Apache Kafka for further analysis. The data is then displayed in Grafana dashboards.

Two APARs were created to address this solution:

  • APAR PJ46312 provides the infrastructure for monitoring Java applications
  • APAR PJ46275 provides enhancements to real-time runtime metrics collection for monitoring Java applications 

In addition to the two APARs, the z/TPF real-time insights dashboard starter kit was updated to include 10 new dashboards that are fully customizable. You can also create your own dashboards to display the JVM data that you want to monitor.

APAR PJ46312

To better understand the infrastructure for this solution, let's examine the following illustration:

PJ46312 data flow

We start on the left side with Java applications, which might or might not run in a Java application manager (JAM). To monitor JVM activity for either type of Java application, specify the following options before you start the JAM or the stand-alone application:

-Xhealthcenter:level=inprocess
-javaagent:/sys/tpf_pbfiles/apps/tpfjmon/tpfagent.jar
-cp /sys/tpf_pbfiles/apps/tpfjmon/tpfagent.jar

For a JAM, you specify these options in the appropriate JAM descriptor file. These options attach a z/TPF agent, which runs inside each JVM to collect data. As data is collected, the agent creates JSON documents to represent that data and sends those documents to the tpfrtmc offline utility by using a JNI routine.
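For readers who are new to the -javaagent mechanism, here is a generic sketch of how a Java agent can start collecting data inside a JVM. It is not the code in tpfagent.jar: the class name SampleAgent, the one-second schedule, the JSON field name, and the emit method (which only prints, standing in for the JNI handoff to tpfrtmc) are all assumptions made for illustration. The JVM calls premain before the application's main method when the agent JAR's manifest names the class in its Premain-Class attribute.

```java
import java.lang.instrument.Instrumentation;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical minimal Java agent (NOT the z/TPF tpfagent.jar).
 * Packaged in a JAR whose manifest contains "Premain-Class: SampleAgent".
 */
public class SampleAgent {

    // Invoked by the JVM before main() when started with -javaagent:<jar>.
    public static void premain(String agentArgs, Instrumentation inst) {
        ScheduledExecutorService collector = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "sample-agent-collector");
            t.setDaemon(true); // do not keep the JVM alive after the application exits
            return t;
        });
        // Build and hand off one document per second (assumed interval).
        collector.scheduleAtFixedRate(() -> emit(buildDocument()), 0, 1, TimeUnit.SECONDS);
    }

    // Builds a tiny JSON document; the field name is hypothetical.
    static String buildDocument() {
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        return "{\"usedHeapBytes\":" + usedHeap + "}";
    }

    // Placeholder: the real agent passes documents to tpfrtmc through a JNI routine.
    static void emit(String json) {
        System.out.println(json);
    }
}
```

Running the collector on a daemon thread is a deliberate choice: monitoring should never be the reason a JVM fails to shut down.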

In the following illustration, let's take a closer look at the JSON documents that are sent to tpfrtmc:

PJ46312 JSON document content

By default, a JSON document is created each time a Health Center event is triggered and, for JMX data, once per second. The contents of the JSON documents vary depending on how the application was started. For applications that are running in a JAM, the JSON documents can include:

  • JAM relational data, which is the same in every JSON document
  • Health Center data
  • Select JMX data, which can include:
    • Platform JMX metrics
    • Kafka producer metrics
    • CXF performance metrics
    • Custom MBean data

For applications that are not running in a JAM, the JSON documents can include:

  • JVM relational data, which is the same in every JSON document
  • Health Center data
  • Select JMX data, which can include:
    • Platform JMX metrics
    • CXF performance metrics
    • Custom MBean data
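As an illustration of what platform JMX metrics can look like, the sketch below reads a few values from the standard java.lang.management MXBeans and formats them as one JSON document. The choice of metrics, the field names, and the PlatformJmxSnapshot class are hypothetical; the z/TPF agent defines its own document layout.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper that turns platform MXBean readings into one JSON document. */
public class PlatformJmxSnapshot {

    // jvmName is assumed to need no JSON escaping (no quotes or backslashes).
    public static String snapshot(String jvmName) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long gcCount = 0;
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMs += Math.max(0, gc.getCollectionTime());
        }

        return String.format(
            "{\"jvm\":\"%s\",\"timestamp\":%d,\"heapUsedBytes\":%d,"
                + "\"threadCount\":%d,\"gcCount\":%d,\"gcTimeMs\":%d}",
            jvmName,
            System.currentTimeMillis(),
            memory.getHeapMemoryUsage().getUsed(),
            threads.getThreadCount(),
            gcCount,
            gcTimeMs);
    }
}
```

For example, PlatformJmxSnapshot.snapshot("flightrules") returns a single-line document that starts with {"jvm":"flightrules", followed by the current readings.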

Regardless of how your application is started, custom MBean data is present only if your applications are instrumented with custom MBeans and you include the -Dcom.ibm.tpf.mbean.config command-line option.
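If you have not instrumented an application with custom MBeans before, the sketch below shows one way to do it with the standard JMX API: a standard MBean interface, an implementation, and registration with the platform MBean server. The RequestStats attributes and the ObjectName are hypothetical, and the sketch does not cover the value you pass with -Dcom.ibm.tpf.mbean.config; see IBM Documentation for that.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class CustomMBeanExample {

    // Standard MBean naming rule: interface name = implementation class name + "MBean".
    public interface RequestStatsMBean {
        long getRequestCount();            // exposed as attribute "RequestCount"
        double getAverageResponseTimeMs(); // exposed as attribute "AverageResponseTimeMs"
    }

    public static class RequestStats implements RequestStatsMBean {
        private final AtomicLong count = new AtomicLong();
        private final AtomicLong totalNanos = new AtomicLong();

        // Called by application code after each request completes.
        public void record(long elapsedNanos) {
            count.incrementAndGet();
            totalNanos.addAndGet(elapsedNanos);
        }

        @Override
        public long getRequestCount() {
            return count.get();
        }

        // The two counters are read separately, so under concurrent updates
        // the average is approximate; that is acceptable for monitoring.
        @Override
        public double getAverageResponseTimeMs() {
            long n = count.get();
            return n == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / n;
        }
    }

    // Registers a RequestStats MBean as "<domain>:type=RequestStats" on the platform MBean server.
    public static RequestStats register(String domain) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        RequestStats stats = new RequestStats();
        try {
            server.registerMBean(stats, new ObjectName(domain + ":type=RequestStats"));
        } catch (JMException e) {
            throw new IllegalStateException("MBean registration failed", e);
        }
        return stats;
    }
}
```

After registration, the RequestCount and AverageResponseTimeMs attributes can be read by any JMX client that queries the platform MBean server in that JVM.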

 

APAR PJ46275

The enhancements to real-time runtime metrics collection that were completed for APAR PJ46275 are shown in the following illustration:

PJ46275 data flow


You can see that JVM data is now an additional data type that can be collected along with name-value pair collection (NVPC) data and continuous data collection (CDC) data. Like NVPC and CDC data, JVM data is sent to the tpfrtmc offline utility and then to Apache Kafka. Unlike NVPC and CDC data, JVM data then flows to a Docker container that runs a Python script, which moves the data into a MariaDB database. From there, the data can be displayed in a tool such as Grafana.

The following Grafana dashboard shows an example of JAM summary data with two active JAMs (kafka and flightrules):

PJ46275 sample Grafana dashboard

On this summary page, you can easily see that the average response time for the kafka JAM is unusually high and should be investigated. You can drill down further into the JVM data from this dashboard to complete your investigation.

What about performance?

During our testing of this solution, we did not detect any measurable impact on our systems with this support enabled. For this reason, we recommend that you always run your production systems with this support enabled.

 

For more information about APARs PJ46275 and PJ46312, see the APEDITs or review our information in IBM Documentation.


