This sounds like an active/passive cluster or load balancer issue. Do you have any UM clusters in your environment? If so, what have you configured in the VIP? Is there an F5 or an HA cluster? Are they properly detecting dead nodes?
It sounds like the cluster manager isn't registering the dead node properly. It can also be caused by a wrong VIP/DNS configuration on the endpoints. If you reach the UM servers through a VIP, use only the VIP to access them; never use hostnames in any configuration if you have a dedicated entry point.
You can test whether the F5/HA cluster is the problem by changing the VIP name to a list of hosts, for example from
nsp://umvip:9000
to
nsp://umserver1:9000,nsp://umserver2:9000
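Before switching the URL, it can help to confirm that the VIP and each node accept TCP connections on the UM port at all. Below is a minimal Python sketch, not part of any UM tooling: parse_rname and is_reachable are hypothetical helper names, and the hosts are the placeholders from the example above. An open port only shows that something is listening; it does not prove the realm is healthy or that the node is the active one.

```python
import socket
from urllib.parse import urlsplit


def parse_rname(rname):
    """Split a comma-separated list of nsp:// URLs into (host, port) pairs."""
    endpoints = []
    for part in rname.split(","):
        part = part.strip()
        if not part:
            continue
        url = urlsplit(part)
        if url.hostname is None or url.port is None:
            raise ValueError(f"not a host:port URL: {part!r}")
        endpoints.append((url.hostname, url.port))
    return endpoints


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, parse_rname("nsp://umserver1:9000,nsp://umserver2:9000") returns [("umserver1", 9000), ("umserver2", 9000)], and each pair can then be passed to is_reachable, once against the VIP and once against every node.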
You don’t have to test this through MWS or IS.
First use Enterprise Manager to test the VIP, then kill the nodes while connected to see if it reconnects.
If Enterprise Manager reconnects without any problems, then the VIP is working and it's probably a defect.
If Enterprise Manager doesn't reconnect through the VIP, try connecting from Enterprise Manager directly to the new active node by hostname. If you can connect to the active node, your cluster configuration is wrong. If you can't connect to the new active node, the installation on that node probably needs some patching or configuration, or maybe a reinstallation.
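The decision steps above can be written down as a small lookup, which is handy if you want to record the outcome of each test run. This is only a sketch of the reasoning in this post; diagnose and its two parameters are hypothetical names, and the returned strings are summaries, not output from any UM tool.

```python
def diagnose(vip_reconnects, direct_connects=None):
    """Map the two Enterprise Manager test results to a likely cause.

    vip_reconnects:  True if Enterprise Manager reconnected through the VIP
                     after a node was killed.
    direct_connects: True/False result of connecting directly to the new
                     active node by hostname; None if not tested yet.
    """
    if vip_reconnects:
        return "VIP failover works; probably a defect"
    if direct_connects is None:
        return "connect directly to the new active node by hostname"
    if direct_connects:
        return "cluster configuration is wrong"
    return "node installation needs patching, configuration or reinstallation"
```

For instance, diagnose(False, True) returns "cluster configuration is wrong", which matches the case where the node itself is fine but the VIP does not fail over to it.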
Also checking this configuration may help.
#Universal-Messaging-Broker#webMethods