ITM Communications Validation


by John Alvord, IBM Corporation
jalvord@us.ibm.com

Introduction

ITM Communication Services has requirements. When those requirements are not met, things break in strange and non-obvious ways. Most communication is via TCP socket links. After setup, these links are used to implement Remote Procedure Calls. This often works beautifully by default, but in a new environment it pays to perform some manual checks. The same checks are also helpful when processes fail to connect.

Manual Validation

Let's review the case where a hub TEMS has already been installed and is working. A new remote TEMS is installed and we want to validate that the network is prepared. Usually, resolving the problems in one such case resolves them for many other cases as well.

 

1) The remote TEMS needs to know where the hub TEMS is located. This is controlled by a file named glb_site.txt, created during install and located at:

Windows: <installdir>\cms
Linux/Unix: <installdir>/tables/<temsnodeid>
z/OS: RKANDATU(KDCSSITE)

In the simplest case of a single hub TEMS, this will look like

protocol:htems

such as

ip.pipe:HTEMS

or

ip.pipe:#10.11.20.34

If there are two hub TEMSes [the Fault Tolerant Option], you will see two such lines. That configuration also requires the CMS_FTO=YES environment variable.

You should never have more than one line for a single hub TEMS. Duplicate lines only slow connection processing and add no value.

The hub TEMS itself does not need a glb_site.txt. Having one does no harm, but it does not help anything.
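
As a quick sanity check, you can display the file and confirm it contains exactly the line you expect. This sketch assumes a hypothetical Linux/Unix install directory of /opt/IBM/ITM and a TEMS nodeid of REMOTE_TEMS; substitute your own values:

cat /opt/IBM/ITM/tables/REMOTE_TEMS/glb_site.txt
ip.pipe:#10.11.20.34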

 

2) To manually verify the setup is correct, you can use the glb_site.txt values to test:
ping HTEMS
or
ping 10.11.20.34

The ping commands will not always respond, depending on the network. However, you can at least verify that the name resolves correctly. If it does not, the Domain Name Server [DNS] may have incorrect information, or the local hosts file (/etc/hosts on Linux/Unix) might be incorrect.
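
Even when ping is blocked, name resolution can be checked directly. A minimal check with nslookup, using the HTEMS hostname from the glb_site.txt example above:

nslookup HTEMS

On Linux/Unix, host HTEMS or getent hosts HTEMS serve the same purpose, and getent also consults the local hosts file.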

 

3) To manually verify the hub TEMS is reachable, use telnet. Assuming you are using ip.pipe communications:
telnet 10.11.20.34 1918

If you use ip.spipe, the port target would be 3660.

If this fails, a firewall router along the network path is missing the rule to allow such communications. If there is no firewall involved, there is no problem. However, if a rule exists it must allow communication to the well-known port - 1918 in this case. The rule must be bidirectional. If the test fails, your networking support team must change the router firewall rules to allow the communication. Until that is done, there is no hope of a remote TEMS to hub TEMS connection working.
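
On systems where telnet is not installed, other common tools can make the same TCP connection test. These are generic alternatives, not ITM-specific (nc is netcat; the /dev/tcp form works in bash):

nc -vz 10.11.20.34 1918
timeout 5 bash -c 'echo > /dev/tcp/10.11.20.34/1918' && echo open

On Windows, PowerShell's Test-NetConnection 10.11.20.34 -Port 1918 performs the same check.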

 

4) Another ITM communication requirement is that the entire path allow DF [do not fragment] packets. The packet size is most commonly 1500 bytes, although ITM will work with almost any size. From a performance standpoint, a small MTU leads to more transmissions and lower throughput. Following are the tests for 1500-byte packets using ping options:

Linux:  ping -M do -s 1472 10.11.20.34
Unix:  ping -s 1472 10.11.20.34
Windows: ping -l 1472 -f 10.11.20.34

If these work with no complaint, all is well. The 1472-byte setting specifies the ICMP payload; the automatically added IP and ICMP headers contribute another 28 bytes, producing a 1500-byte packet. The Linux -M do option means REALLY no fragmentation, even locally. A typical error seen recently looked like this:

From 10.99.0.250 icmp_seq=1 Frag needed and DF set (mtu = 1442)

That means the router at that address along the network path is preventing transmission of DF packets larger than 1442 bytes.

See (7) below for network performance comments.

Your networking support team must resolve this issue before ITM communications can possibly work, but it can be relatively easy to manage. See the next section.
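
On Linux, the tracepath command can report the path MTU directly, hop by hop, which helps identify where the limit is imposed (tracepath ships with most distributions; traceroute --mtu is another option where available):

tracepath 10.11.20.34

Look for the pmtu value in the output; it shows the largest DF packet the path will carry.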

 

5) When 1500-byte packets fail
One recent case had a Virtual Private Network [VPN] link in the path that added more bytes to each packet. A 1500-byte DF packet became a 1514-byte DF packet, an intermediate router dropped the packet, and communication failed. The solution was to change the interface on the hub TEMS from MTU 1500 to 1350. The remote TEMS and hub TEMS negotiated an MTU size of 1350, and then the added VPN bytes did not exceed the 1500-byte DF maximum at the routers. They could have gone higher, of course. Changing MTUs on interfaces is platform dependent, and you will normally get sysadmins or networking people involved to make such changes.
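
As an illustration of such a change on a Linux system (eth0 is a hypothetical interface name, and a change made this way does not persist across reboots without distribution-specific configuration):

ip link set dev eth0 mtu 1350
ip link show eth0

The second command confirms the new mtu value in its output.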

Another recent case involved a customer router configured with a DF maximum packet size of 1448 bytes. In that case the router was reconfigured to the more standard 1500-byte DF limit.

Another recent case was a Linux environment where the configured DF maximum packet size was 992 bytes. There was a good reason for this, so the hub and remote TEMS system interface MTUs were changed to that number.

 

6) z/OS HiperSockets
Another recent issue involved z/OS HiperSockets. The interface had an MTU of 16K, and its logic prevented negotiating down to 1500 bytes. The solution was to configure a second HiperSockets instance set to an MTU of 1500 bytes.

 

7) TEMS to TEMS communication requirements
In step (4) earlier, note the rtt average and the percent packet loss. TEMS to TEMS communication is unstable if the rtt average is too high *or* if there is much packet loss. A general rule of thumb is that 50 milliseconds or lower is best and 100 milliseconds is OK. At 250 milliseconds or higher, many installations will see instability, including remote TEMS going offline.


These rules are extremely general and depend on the amount of TEMS to TEMS network traffic. A low-traffic environment can often survive at higher latency levels.
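
A simple way to sample both numbers is an extended ping run (Linux/Unix syntax shown; on Windows use ping -n 100). The summary lines report the packet loss percentage and the rtt min/avg/max:

ping -c 100 10.11.20.34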

The reason for this sensitivity in TEMS/TEMS communications is that much of the work happens via Remote Procedure Calls. After startup, there are large call structures, up to 30,000 bytes or more. ITM divides each call into MTU [Maximum Transmission Unit] sized packets. All packets must arrive and be assembled at the target before logic can continue. With any degree of packet loss, many such RPCs fail and need to be re-transmitted. At a higher level in ITM communications there are timeout rules for transmission, typically 30 or 60 seconds. In cases of high latency and some packet loss, the resulting failures prevent normal work from proceeding. That means the remote TEMS does not get full instructions, like situation definitions. It also means that the remote TEMS - which has been gathering situation results and likely generating events - is unable to send the events back to the hub TEMS.
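
As a rough illustration of why modest packet loss hurts so much (the loss rates here are hypothetical): a 30,000-byte call at an MTU of 1500 bytes becomes roughly 21 packets. If each packet independently has a 2% chance of being lost, the chance that all 21 arrive is about 0.98^21, or roughly 65%, so about one call in three must be retried. At 5% loss the success rate falls to about 0.95^21, or roughly 34%, and retransmissions dominate.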

The usual solution for a high-latency link is to architect a hub TEMS at that location. That is extra work, of course, but it may be less expensive than upgrading a network. The connection from a hub TEMS to an event receiver like Netcool/OMNIbus is relatively insensitive to latency in event data transmission.

Summary

There are other potential issues. The good news is that most such cases are rare and that ITM has the controls to adapt to almost any environment. Contact IBM Support if further help is needed.

If you are interested in ITM communication control options, see this document:

Sitworld: ITM Protocol Usage and Protocol Modifiers




