IBM Z® performance monitoring through Instana sensor

By Bipin Chandra posted Tue November 09, 2021 08:35 AM

  
Authors: Priyanka, Shivangi, Bipin
IBM Observability by Instana provides a comprehensive observability platform for an enterprise to track requests spanning from mobile to mainframe, and from bare metal machines to hybrid multi-cloud deployments. 
Mainframe computers play a central role in the daily operations of most of the world’s largest corporations. In banking, finance, health care, insurance, public utilities, government, and a multitude of other public and private enterprises, mainframe computers continue to form the foundation of modern business. They are highly stable, with negligible downtime.
Meet the IBM z15 mainframe
System z, as with all computing systems, is built on hardware components. At their core, System z mainframes are high-performance computers with large amounts of memory and processors that process billions of simple calculations and transactions in real time. 
The IBM Z® Instana sensor monitors the availability and performance of hardware systems and subsystem resources across an enterprise with multiple System z mainframes. It can connect to one or many zHMCs at the same time and collect the required metrics. If behaviour deviates from normal, alerts and events can be created based on these metrics, and the abnormal behaviour can be analysed and diagnosed by using the relevant data.
Monitoring In-Depth
The IBM Z® Instana sensor makes use of the Web Services API provided by the Hardware Management Console (HMC or zHMC). The zHMC is an IBM System z feature that provides the end-user interface to control and monitor the status of the system.
zHMC controls and monitors the system
The zHMC provides the Web Services API, which can be used to query, configure, control and monitor Central Processing Complexes (CPCs) and their various components. By default, the Web Services API is disabled on the zHMC and must be enabled before it can be used.
Once a user is permitted to establish API sessions, the actions within those sessions are subject to the zHMC's access control model.
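As a rough illustration of what establishing an API session involves, the sketch below builds (but does not send) a logon request. It assumes the usual HMC Web Services API conventions: a POST to /api/sessions on port 6794 with a JSON body that carries userid and password. The helper name, host name and credentials are hypothetical placeholders, and the details should be checked against the Web Services API documentation for your HMC version.

```python
import json
import urllib.request

# Assumption: 6794 is the default HTTPS port of the HMC Web Services API.
HMC_API_PORT = 6794


def build_logon_request(host: str, userid: str, password: str) -> urllib.request.Request:
    """Build (without sending) a logon request for the zHMC Web Services API."""
    body = json.dumps({"userid": userid, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}:{HMC_API_PORT}/api/sessions",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and credentials, for illustration only.
request = build_logon_request("hmc.example.com", "apiuser", "secret")
print(request.get_method(), request.full_url)
```

If the logon succeeds, the HMC is expected to return a session identifier that later requests pass back in a request header; a user whose profile does not allow API access is rejected at this step.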
zHMC Instana Sensor Architecture
The IBM Z® Instana sensor collects performance metrics and quickly transforms them into actionable key performance indicators and reports. It monitors the availability and performance of hardware systems and subsystem resources across the enterprise through a single tool: one sensor can connect to one or many zHMCs at the same time and collect the performance metrics for all of them.
With IBM Z Instana sensor, you can:
  • Visualise performance metrics with built-in dashboards.
  • Get business-critical application monitoring data that is available in near real-time. 
  • Support one or many zHMCs by using only one Instana instance. 
  • Create events and alerts based on any of the performance metrics for the supported alert channels.
Details of the latest available metrics can be found here.
Setup
  • Enabling and accessing the API
      By default, the Web Services API is disabled on the HMC. You can enable it, and configure the scope of access to it, by using the Customize API Settings task in the HMC UI.
  • API User Permissions
      By default, API access is disabled for an HMC user profile, so attempts by that user to establish an API session are rejected. Use the Customize API Settings or User Management tasks of the HMC to enable API access for one or more HMC users.
  • Install Instana Agent
  • Configurations
    • Follow this document to configure the Instana sensor to connect to one or multiple zHMCs
  • Kickoff Agent
    • To start the Instana Agent, run the command:
      • INSTANA_AGENT_FOLDER/bin/start
    • To stop the Instana Agent, run the command:
      • INSTANA_AGENT_FOLDER/bin/stop
    • To get the status of the Instana Agent, run the command:
      • INSTANA_AGENT_FOLDER/bin/status
  • UI navigation
    • IBM Z can be found under the Platform section of the navigation pane.
Supported Metric Groups
The zHMC provides metrics for system resources, such as power consumption, environmental data and processor usage. The utilization and environment data that is displayed on the user interface is also provided through the Metrics Service API in the following metric groups.
 
Metric Group | Mode | Details
cpc-usage-overview | C | This metric group reports the aggregated processor usage and channel usage, the ambient temperature, and total system power consumption for each system. The cpc-processor-usage is the average of the percentages of processing capacity for all the physical processors in the CPC. The channel-usage is the average of the percentages of I/O capacity for all the channels and adapters in the CPC.
logical-partition-usage | C | This metric group reports the processor usage and z/VM paging rate for each active logical partition (such as Image, LPAR Image, Zone, PR/SM virtual server) on the system.
channel-usage | C | This metric group reports the channel usage for each channel on the system. An instance of this metric group is created for each channel of a CPC.
crypto-usage | C | This metric group reports the adapter usage for each crypto on the system. An instance of this metric group is created for each crypto adapter. This metric group is not used for a DPM system. For DPM, crypto adapters are reported in the Adapters metric group.
flash-memory-usage | C | This metric group reports the adapter usage for each Flash memory (Flash Express) adapter on the system. An instance of this metric group is created for each Flash memory adapter of the CPC. If a CPC has no flash memory adapters, then no data will appear in this metric group for that CPC.
roce-usage* | C | This metric group reports the adapter usage for each RoCE (10GbE RoCE) adapter on the system. An instance of this metric group is created for each RoCE adapter of the CPC.
dpm-system-usage-overview | D | This metric group reports the aggregated processor usage, network usage, storage usage, accelerator usage, crypto usage, power consumption and temperature for each DPM enabled system.
partition-usage | D | This metric group reports the processor usage, network usage, storage usage, accelerator usage, and crypto usage for each active partition on a DPM enabled system.
adapter-usage | D | This metric group reports the adapter usage for each adapter on the DPM enabled system. An instance of this metric group is created for each adapter.
network-physical-adapter-port | D | OSA and RoCE network adapters have up to two physical ports that connect to the network. Metrics are collected from these ports on a DPM enabled system and provided to the user. This metrics group will contain metrics data representing metrics for one physical port. Metrics are collected and provided on an interval, and each metric provided is the total cumulative value, and not a delta.
partition-attached-network-interface* | D | This metric group reports metrics for NICs on a DPM enabled system. NICs are network resources associated with DPM partitions. Only NICs that are activated will report metric data. This metrics group will contain metrics data representing metrics for one NIC. Metrics are collected and provided on an interval, and each metric provided is the total cumulative value, and not a delta.
zcpc-environmentals-and-power | C+D | This metric group reports environmental data and power consumption for the zCPC.
environmental-power-status* | C+D | This metric group reports line cord power information of connected Power Distribution Units (PDU) or BPAs (Bulk Power Assembly) in the system.
zcpc-processor-usage | C+D | This metric group reports the processor usage for each physical zCPC processor on the system. This includes the System Assist Processors (SAPs). An instance of this metric group is created for each processor of a CPC.
* The metric group will be available in the next release.
C means Classic mode, D means DPM mode, and C+D means both modes.
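To make the Mode column concrete, here is a minimal sketch of how a client might pick the metric groups that apply to a CPC and turn cumulative counters into rates. The mapping is copied from the table above, leaving out the groups marked with *. The function names are hypothetical, and the request-body field names anticipated-frequency-seconds and metric-groups are assumptions about the Metrics Service API that should be verified against the HMC documentation. This is not a description of how the Instana sensor is implemented.

```python
from typing import Dict, List, Optional, Tuple

# Mode column from the table: "C" = Classic, "D" = DPM, "C+D" = both.
# Groups marked with * (next release) are left out.
METRIC_GROUP_MODES: Dict[str, str] = {
    "cpc-usage-overview": "C",
    "logical-partition-usage": "C",
    "channel-usage": "C",
    "crypto-usage": "C",
    "flash-memory-usage": "C",
    "dpm-system-usage-overview": "D",
    "partition-usage": "D",
    "adapter-usage": "D",
    "network-physical-adapter-port": "D",
    "zcpc-environmentals-and-power": "C+D",
    "zcpc-processor-usage": "C+D",
}


def groups_for_mode(dpm_enabled: bool) -> List[str]:
    """Return the metric groups that apply to a CPC in the given mode."""
    wanted = "D" if dpm_enabled else "C"
    return [name for name, modes in METRIC_GROUP_MODES.items()
            if wanted in modes.split("+")]


def metrics_context_body(groups: List[str], interval_seconds: int = 15) -> dict:
    """Build a request body for creating a metrics context (field names assumed)."""
    return {
        "anticipated-frequency-seconds": interval_seconds,
        "metric-groups": groups,
    }


def rate_per_second(previous: Tuple[float, int],
                    current: Tuple[float, int]) -> Optional[float]:
    """Turn two (timestamp_seconds, cumulative_value) samples into a rate.

    Returns None if time did not advance or the counter went backwards,
    which this sketch treats as a counter reset.
    """
    (t0, v0), (t1, v1) = previous, current
    if t1 <= t0 or v1 < v0:
        return None
    return (v1 - v0) / (t1 - t0)


print(groups_for_mode(dpm_enabled=True))
print(rate_per_second((0.0, 1000), (15.0, 4000)))
```

The rate helper matters for groups such as network-physical-adapter-port, where each value is a running total: a dashboard normally shows the difference between two consecutive samples divided by the elapsed time, not the raw counter.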
Alerts and Events Configuration
Based on these performance metrics, we can create alerts and events. Click the following links to get the details about how to create and configure them.
The following is an example of a custom event based on the server's temperature.
If the system's temperature stays above the specified value for 60 seconds, the event is triggered.
An alert can be created by using these events, and can be routed through any available alert channels.
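The rule behind such an event can be sketched in a few lines. The function below is only an illustration of the "above a threshold for 60 seconds" logic, not Instana's implementation; the function name, the sample values and the threshold of 30 are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def threshold_breached(samples: Iterable[Tuple[float, float]],
                       threshold: float,
                       window_seconds: float = 60.0) -> bool:
    """Return True once the value has stayed above threshold for window_seconds.

    samples: (timestamp_seconds, value) pairs in chronological order.
    """
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of the current breach
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # value recovered, so the window restarts
    return False


# Hypothetical temperature readings taken every 15 seconds.
sustained = [(0, 31.0), (15, 32.0), (30, 33.0), (45, 32.0), (60, 31.0)]
with_dip = [(0, 31.0), (15, 29.0), (30, 31.0), (45, 31.0), (60, 31.0)]

print(threshold_breached(sustained, threshold=30.0))
print(threshold_breached(with_dip, threshold=30.0))
```

The first series stays above the threshold for the full 60 seconds and triggers the event. The second dips below it at 15 seconds, so the window restarts and the event is not triggered.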
Troubleshooting
  • API authentication error
      Check whether API access is enabled for this zHMC.
  • Timeout
      Check whether the configuration YAML file contains the correct zHMC host name, user ID and password.
  • Permission error
      Check whether the zHMC user has the permissions required to collect the metrics.
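When chasing a timeout, it can help to first confirm that the machine running the agent can open a TCP connection to the zHMC at all. The sketch below is a generic reachability check using only the Python standard library; it assumes the Web Services API listens on port 6794, and the function name is hypothetical.

```python
import socket

# Assumption: 6794 is the default HTTPS port of the HMC Web Services API.
HMC_API_PORT = 6794


def can_reach(host: str, port: int = HMC_API_PORT,
              timeout_seconds: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Calling can_reach with the host name from the configuration YAML file and getting False points to a wrong host name, a firewall, or a disabled API rather than to a credentials problem. A True result only shows that the port is open, so the user ID, password and permissions still need to be checked separately.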
Reporting issues
If you encounter a problem, please report it as an issue here.
