A Platform Symphony host cannot join the cluster or is unavailable because the MTU sizes on the master and compute hosts differ

By Archive User posted Thu March 17, 2016 03:08 PM

  

Originally posted by: GeoffreyY.


Problem

One or more Platform Symphony hosts fail to join the cluster after running egosh ego start and waiting some time. Running egosh resource list does not show the hosts, or they are listed with a status of "unavail".

Additionally, the following symptoms are present:

  • When checking Platform Symphony processes on a problematic host, there is only a lim process which appears to be hanging. Other processes like pem and melim have not started. The lim log on the host logs some information on initial startup, but then stops writing any new messages.
  • There is no firewall between the master and problematic host, and the Platform Symphony ports are fully open.
  • Communication between the master and problematic hosts appears to work fine. For example, you can run ssh from one host and connect to another.
  • The cluster's master host may be connected to the compute hosts on a network using jumbo frames.

 

Cause

The MTU sizes on the network interfaces used to communicate between the master and compute hosts are different.


Verifying the problem

Run the following commands on the master host and problematic host to verify that the interfaces used to communicate between the hosts use different MTU sizes.

Linux:
bash-4.1$ ifconfig

The following example output shows the MTU size:

eth0      Link encap:Ethernet  HWaddr 52:54:00:8F:31:84
          inet addr:9.21.62.219  Bcast:9.21.63.255  Mask:255.255.252.0
          inet6 addr: fe80::5054:ff:fe8f:3184/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST   MTU:1500   Metric:1
          RX packets:231373691 errors:0 dropped:0 overruns:0 frame:0
          TX packets:213955803 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:59950566296 (55.8 GiB)  TX bytes:77119522815 (71.8 GiB)
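
When several Linux hosts need checking, it can help to print only the MTU value instead of reading the full output. The following is a minimal sketch, not part of Platform Symphony: the function names are hypothetical, eth0 is only an assumed default interface name, and the sysfs path exists on Linux only.

```shell
#!/bin/sh
# Print the first MTU value found in ifconfig-style text read from stdin.
# Handles both the older "MTU:1500" layout and the newer "mtu 1500" layout.
mtu_from_ifconfig() {
    sed -n 's/.*[Mm][Tt][Uu][: ]\{1,\}\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the MTU of one interface directly from sysfs (Linux only).
# "eth0" is an assumed default; pass the interface your cluster uses.
mtu_from_sysfs() {
    cat "/sys/class/net/${1:-eth0}/mtu"
}

# Example usage (run on both the master and the problematic host):
#   ifconfig eth0 | mtu_from_ifconfig
#   mtu_from_sysfs eth0
```

Run the same command on the master host and on the problematic host, then compare the two numbers.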

Windows:
C:\Users\IBM_ADMIN>netsh interface ipv4 show interfaces

The following example output shows the MTU size:

Idx     Met         MTU          State                Name
---  ----------  ----------  ------------  ---------------------------
  1          50  4294967295  connected     Loopback Pseudo-Interface 1
 15          25        1500  connected     Wireless Network Connection
 55          20        1500  disconnected  Wireless Network Connection 2
 12          10        1500  disconnected  Local Area Connection
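
As an optional extra check on Linux, a ping that forbids fragmentation can show whether full-size packets actually get from one host to the other. This is a sketch under stated assumptions: it relies on the -M do option of the iputils ping, the function names and the host name computehost1 are hypothetical, and ICMP must be allowed between the hosts. The payload size is the MTU minus 28 bytes (20 bytes of IPv4 header without options plus 8 bytes of ICMP header).

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header, no options) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Build, but do not run, a ping command that sets the "do not fragment" flag.
# $1 is the peer host name (placeholder), $2 is the MTU to test.
df_ping_command() {
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$2") $1"
}

# Example usage from the host with the larger MTU:
#   df_ping_command computehost1 9000
#   df_ping_command computehost1 1500
```

If the command built for the larger MTU fails while the one built for the smaller MTU succeeds, that is consistent with an MTU mismatch between the two hosts, although other network conditions can produce the same result.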


Resolving the problem

Change the MTU size on the master and problematic host interfaces so that both use the same value. The problematic host should then be able to join the cluster and become available.
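
The exact procedure for changing the MTU depends on the operating system and on how the network is managed. As a hedged sketch, the helpers below only print the commands typically used for this (ip link from iproute2 on Linux, netsh on Windows) so that they can be reviewed before being run with administrator rights. The function names, the interface names, and the value 1500 are illustrative assumptions. On Linux, a change made this way does not persist across a reboot unless the distribution's network configuration is updated as well.

```shell
#!/bin/sh
# Print the Linux (iproute2) command that sets an interface's MTU.
# Nothing is changed here; run the printed command as root to apply it.
linux_set_mtu_command() {
    echo "ip link set dev $1 mtu $2"
}

# Print the Windows command for the same change (run it from an
# elevated command prompt on the Windows host).
windows_set_mtu_command() {
    echo "netsh interface ipv4 set subinterface \"$1\" mtu=$2 store=persistent"
}

# Example usage:
#   linux_set_mtu_command eth0 1500
#   windows_set_mtu_command "Local Area Connection" 1500
```

Whichever value is chosen, the switches and any other devices between the hosts must also support it, which matters most when standardizing on jumbo frames.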

 

