Originally posted by: GeoffreyY.
Problem
One or more Platform Symphony hosts fail to join the cluster after running egosh ego start and waiting some time. Running egosh resource list does not show the hosts, or they are listed with a status of "unavail".
Additionally, the following symptoms are present:
- When checking Platform Symphony processes on a problematic host, there is only a lim process which appears to be hanging. Other processes like pem and melim have not started. The lim log on the host logs some information on initial startup, but then stops writing any new messages.
- There is no firewall between the master and problematic host, and the Platform Symphony ports are fully open.
- Communication between the master and the problematic hosts appears to work fine. For example, you can run ssh from one host and connect to another.
- The cluster's master host may be connected to the compute hosts on a network using jumbo frames.
Cause
The network interfaces used for communication between the master and compute hosts are configured with different MTU sizes.
Verifying the problem
Run the following commands on the master host and problematic host to verify that the interfaces used to communicate between the hosts use different MTU sizes.
Linux:
bash-4.1$ ifconfig
The following example output shows the MTU size:
eth0      Link encap:Ethernet  HWaddr 52:54:00:8F:31:84
          inet addr:9.21.62.219  Bcast:9.21.63.255  Mask:255.255.252.0
          inet6 addr: fe80::5054:ff:fe8f:3184/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:231373691 errors:0 dropped:0 overruns:0 frame:0
          TX packets:213955803 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:59950566296 (55.8 GiB)  TX bytes:77119522815 (71.8 GiB)
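On hosts with many interfaces, the MTU values can be pulled out of the ifconfig output so that the master and the problematic host are easier to compare. The following is a minimal sketch, not part of Platform Symphony; the function name list_mtu is hypothetical, and it assumes ifconfig output in either the older "MTU:1500" style shown above or the newer "mtu 1500" style.

```shell
# Print "interface mtu" pairs from ifconfig-style output read on stdin.
# Handles both the older "MTU:1500" and the newer "mtu 1500" formats.
list_mtu() {
    awk '
        # A line that starts in column 1 names the interface.
        /^[^ \t]/ { iface = $1; sub(/:$/, "", iface) }
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^MTU:[0-9]+$/) {
                    sub(/^MTU:/, "", $i)
                    print iface, $i
                } else if ($i == "mtu" && $(i + 1) ~ /^[0-9]+$/) {
                    print iface, $(i + 1)
                }
            }
        }
    '
}
```

Run ifconfig | list_mtu on both hosts and compare the values for the interfaces that carry cluster traffic. On Linux, the same value can also be read directly from /sys/class/net/eth0/mtu (replace eth0 with the interface name).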
Windows:
C:\Users\IBM_ADMIN>netsh interface ipv4 show interfaces
The following example output shows the MTU size:
Idx     Met         MTU          State                Name
---  ----------  ----------  ------------  ---------------------------
  1          50  4294967295  connected     Loopback Pseudo-Interface 1
 15          25        1500  connected     Wireless Network Connection
 55          20        1500  disconnected  Wireless Network Connection 2
 12          10        1500  disconnected  Local Area Connection
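On Linux hosts, a further check is to send a full-size packet with the Don't Fragment flag set. Small packets such as those in an interactive ssh session usually pass even when the MTU sizes differ, so basic connectivity tests can look healthy. The sketch below assumes the iputils version of ping (which provides the -M do option); the function names and the host name compute01 are placeholders, not part of Platform Symphony.

```shell
# ICMP echo payload that exactly fills one IPv4 packet of the given MTU:
# the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three probes to a peer with Don't Fragment set (iputils ping).
# Arguments: peer host name, MTU size to test.
probe_mtu() {
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Example (compute01 is a placeholder host name):
#   probe_mtu compute01 9000
```

If the probe at the larger MTU fails while a probe at 1500 succeeds, that is consistent with the interfaces, or a device between them, not agreeing on the larger size.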
Resolving the problem
Set the interfaces on the master host and the problematic host to the same MTU size. The problematic host should then be able to join the cluster and become available.
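As an illustration for Linux hosts, the helper below prints the ip command that would bring an interface to the target MTU, without executing anything. It is a sketch only: the function name mtu_fix_cmd is hypothetical, and the interface name and values must come from your own environment.

```shell
# Print the command that would set an interface to the target MTU,
# or print nothing if the current MTU already matches.
# Arguments: interface name, current MTU, target MTU.
# Nothing is executed; review the output and run it as root.
mtu_fix_cmd() {
    iface=$1 current=$2 target=$3
    if [ "$current" -ne "$target" ]; then
        echo "ip link set dev $iface mtu $target"
    fi
}

# Example: mtu_fix_cmd eth0 1500 9000
```

A change made with ip link set does not persist across a reboot; making it permanent depends on the distribution's network configuration. On Windows, the equivalent is likely netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent, with the interface name and value adjusted to match your host.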