WebSphere Application Server & Liberty

  • 1.  Some App Servers will not join HA

    Posted Thu March 10, 2022 04:59 AM









    Hi,

    Can anyone help with this error, please?

    On WAS 9.0.5.8 Trad.
    Cell
       - Virtual Machine 1: Dmgr & Node Agent 1.
       - Virtual Machine 2: Node Agent 2.

    The JVMs are clustered across both Nodes, 2 JVMs per cluster.
    About half of the application servers on VM2/Node2 will not start.
    The other half (~20 JVMs) start fine and show as running in the Admin console.
    I've increased the log level. The ones that won't start appear not to be joining the HA.


    [09/03/22 16:22:13:445 GMT] 00000046 MbuRmmAdapter I   DCSV1032I: DCS Stack DefaultCoreGroup at Member N02: Connected a defined member cell001\vm2\nodeagent.
    [09/03/22 16:22:13:504 GMT] 00000046 MbuRmmAdapter I   DCSV1032I: DCS Stack DefaultCoreGroup at Member N02: Connected a defined member cell001\cellManager\dmgr.
    [09/03/22 16:22:13:728 GMT] 00000046 MbuRmmAdapter I   DCSV1032I: DCS Stack DefaultCoreGroup at Member N02: Connected a defined member cell001\vm1\nodeagent.
    [09/03/22 16:22:23:869 GMT] 00000059 Peer          I   ODCF8531I: Added neighbor ip=10.123.45.67 udp=10311 tcp=10301 ID=zxcvbnm version=0;cellName=cell001;bridgedCells=[];structuredGateway=true;properties={inOdc=1, memberName=cell001\vm2\nodeagent, MEMBER_STARTUP_TIME=1646841637875, epoch=1646841639251, MEMBER_VERSION=4}, neighbor set size is now 1.
    [09/03/22 16:22:23:870 GMT] 00000059 Peer          I   ODCF8517I: The unstructured overlay is operational, with security: 10.123.45.67  udp_port=31327  tcp_port=31328.
    [09/03/22 16:22:27:103 GMT] 0000005c NGUtil$Server I   ASND0002I: Detected server nodeagent started on node vm2
    [09/03/22 16:23:11:547 GMT] 00000055 RLSHAGroupCal W   CWRLS0030W: Waiting for HAManager to activate recovery processing for local WebSphere server.
    For more information about how to troubleshoot startup, see https://www.ibm.com/support/knowledgecenter/en/SSAW57_9.0.0/com.ibm.websphere.nd.multiplatform.doc/ae/ttrb_server_startup.html


    I have read the suggested 'Troubleshooting server start caused by the CWRLS0030W message' article; however, I do not see the subsequent 'DCSV8030I' message.

    I doubt it is a port conflict: even with only the node agent running on the VM, some of the application servers will not start and show the above message.

    Thanks,
    Mark.

    ------------------------------
    Mark Shaw
    ------------------------------


  • 2.  RE: Some App Servers will not join HA

    Posted Thu March 10, 2022 05:55 AM
    Edited by Mark Shaw Thu March 10, 2022 05:58 AM
    To answer my own question...
    I ran telnet from Node2 to Node1 against the WC_defaulthost and WC_defaulthost_secure ports, comparing a 'starting' server with a 'non-starting' server.
    For the 'non-starting' servers the telnet does not connect, so the firewall is blocking the WC_defaulthost and WC_defaulthost_secure ports from Node2 to Node1.
    (This is a newly built cell, and I'm even newer to this project.)
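
    The telnet comparison above gets tedious across many servers. A minimal Python sketch of the same check follows, assuming Python 3 is available on Node2. The helper names `check_port` and `check_ports` are made up for illustration, and a False result cannot distinguish a firewall drop from a server that simply is not listening.

```python
import socket

def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

def check_ports(host, ports, timeout=3.0):
    """Map each port to True (connected) or False (refused, filtered or timed out)."""
    return {port: check_port(host, port, timeout) for port in ports}
```

    For example, `check_ports("node1-hostname", [9080, 9443])` would test two ports on Node1. The hostname and port numbers here are placeholders, not values from this cell.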

    ------------------------------
    Mark Shaw
    ------------------------------



  • 3.  RE: Some App Servers will not join HA

    IBM Champion
    Posted Thu March 10, 2022 06:54 AM
    Hi Mark,

    So you have solved the problem?

    While the server is starting, you can check for firewall-related networking issues with the following commands (you can filter by SYN_SENT instead of NODE_TO_CONNECT_IP if you prefer):

    *NIX:    netstat -na | grep NODE_TO_CONNECT_IP
             ss -na | grep NODE_TO_CONNECT_IP
    WINDOWS: netstat -na | findstr NODE_TO_CONNECT_IP
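
    When there are many connections, it can help to summarise which remote address:port pairs are stuck in SYN_SENT. The following is a rough Python sketch that assumes the usual `netstat -na` layout, where the state is the last column and the remote address is the column before it. The function name and the sample lines (using documentation-range IP addresses) are invented for illustration.

```python
from collections import Counter

def syn_sent_targets(netstat_output):
    """Count connections in SYN_SENT, keyed by remote address:port."""
    targets = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # State is the last column; the remote (foreign) address precedes it.
        if len(fields) >= 4 and fields[-1] == "SYN_SENT":
            targets[fields[-2]] += 1
    return targets

# Invented sample in Linux `netstat -na` format.
sample = """\
tcp        0      1 192.0.2.20:51234      192.0.2.10:9353       SYN_SENT
tcp        0      0 192.0.2.20:9060       192.0.2.10:40112      ESTABLISHED
tcp        0      1 192.0.2.20:51240      192.0.2.10:9353       SYN_SENT
"""

for remote, count in syn_sent_targets(sample).most_common():
    print(remote, count)
```

    The remote ports that appear most often are the ones to raise with the firewall team. Note that `ss` prints the state in the first column and spells it SYN-SENT, so this sketch only fits netstat output.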

    Tell us if you need more support.


    Regards

    ------------------------------
    Gabriel Aberasturi
    Versia tecnologias emergentes
    ------------------------------



  • 4.  RE: Some App Servers will not join HA

    IBM Champion
    Posted Tue March 15, 2022 05:32 AM
    Hi Mark,

    you said that you were starting 20 JVMs.

    How are they structured?
    Are they grouped in clusters?

    Are you starting them all at the same time?
    I have seen similar problems when JVMs are starting at the same time and they can't bind to the ports because they are all competing with each other.
    Starting the JVMs in a cluster one at a time using the Ripple start button slows down the startup process and allows the JVMs to bind correctly to their ports and join the mesh.

    This article explains a bit more about the mesh - https://www.linkedin.com/pulse/jvms-joining-leaving-websphere-mesh-dcsv1036w-mark-robbins/

    ------------------------------
    Mark Robbins
    Support Lead/Technical Design Authority / IBM Champion 2017 & 2018 & 2019 & 2020 & 2021
    Vetasi Limited
    https://www.linkedin.com/pulse/maximo-support-advice-from-non-ibm-engineer-article-mark-robbins/
    ------------------------------



  • 5.  RE: Some App Servers will not join HA

    Posted Tue March 29, 2022 08:08 AM
    Hi,

    I'm still having the same problems. Both nodes have 50+ JVMs.
    Node1/Dmgr: all JVMs start fine.
    Node2: JVMs 1-15 will start consistently (starting one at a time).
    Node2: JVMs 26-50 will not start consistently (even if nothing else is running on Node2 other than the node agent, a 'broken' JVM will not start on its own).

    They all hang with
    [09/03/22 16:23:11:547 GMT] 00000055 RLSHAGroupCal W   CWRLS0030W: Waiting for HAManager to activate recovery processing for local WebSphere server.


    When starting a JVM on Node2, I can see lots of connections in SYN_SENT to the Dmgr / Node_to_Connect_IP.
    The port numbers seem quite high and don't match the WC_defaulthost or WC_defaulthost_secure ports.
    Any guidance on what I should be looking for, please?
    My gut feeling is a firewall issue, but there are so many ports across the 50+ JVMs.

    Thanks
    Mark.

    ------------------------------
    Mark Shaw
    ------------------------------



  • 6.  RE: Some App Servers will not join HA

    IBM Champion
    Posted Tue March 29, 2022 08:32 AM
    Hi Mark,
    I / others would need to see the JVM logs from startup to understand what is actually happening. There are likely to be clues in the logs.

    The WebSphere team is very good, so you might want to consider raising a PMR with them.
    I may be able to run an automated log analysis over the logs for the 50 JVMs. Feel free to reach out to me at mark . robbins @ vetasi . com to discuss.


    ------------------------------
    Mark Robbins
    Support Lead/Technical Design Authority / IBM Champion 2017 & 2018 & 2019 & 2020 & 2021
    Vetasi Limited
    https://www.linkedin.com/pulse/maximo-support-advice-from-non-ibm-engineer-article-mark-robbins/
    ------------------------------



  • 7.  RE: Some App Servers will not join HA

    IBM Champion
    Posted Tue March 29, 2022 09:26 AM
    Hello Mark,

    In my experience, connections stuck in SYN_SENT indicate a firewall issue. You have a lot of servers on each node. Usually the binding ports for some services are set to *, so the server tries to establish a connection using a random port.

    In environments like this (with firewalls), it is good practice not to leave random ports, so that you know which ports each server is using and can share them with the networking team.

    I can't check right now which ports HAManager uses; I think they are the DCS ones, but I'm not sure. My recommendation is to fix the ports in each server's configuration and open them in the firewall.
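
    To build the port list for the networking team, one option is to read each node's serverindex.xml (under the profile's config/cells/<cell>/nodes/<node>/ directory), which records the named endpoints per server. The Python sketch below assumes the usual layout of that file: `serverEntries` elements containing `specialEndpoints`, each with an `endPoint` carrying `host` and `port`. The sample XML, server names and port numbers are invented, so check the structure against a real file first. `DCS_UNICAST_ADDRESS` is the named endpoint normally associated with core group (DCS) traffic.

```python
import xml.etree.ElementTree as ET

def list_endpoints(root, name=None):
    """Return (serverName, endPointName, host, port) tuples.

    root is the parsed root element of a serverindex.xml file.
    If name is given, only endpoints with that endPointName are returned.
    """
    rows = []
    for entry in root.iter("serverEntries"):
        for special in entry.iter("specialEndpoints"):
            if name is not None and special.get("endPointName") != name:
                continue
            for ep in special.iter("endPoint"):
                rows.append((entry.get("serverName"), special.get("endPointName"),
                             ep.get("host"), int(ep.get("port"))))
    return rows

# Invented sample modelled on the assumed serverindex.xml layout.
SAMPLE = """
<serverindex:ServerIndex
    xmlns:serverindex="http://www.ibm.com/websphere/appserver/schemas/5.0/serverindex.xmi"
    hostName="vm2">
  <serverEntries serverName="server26" serverType="APPLICATION_SERVER">
    <specialEndpoints endPointName="DCS_UNICAST_ADDRESS">
      <endPoint host="*" port="9353"/>
    </specialEndpoints>
    <specialEndpoints endPointName="WC_defaulthost">
      <endPoint host="*" port="9080"/>
    </specialEndpoints>
  </serverEntries>
  <serverEntries serverName="nodeagent" serverType="NODE_AGENT">
    <specialEndpoints endPointName="DCS_UNICAST_ADDRESS">
      <endPoint host="*" port="9354"/>
    </specialEndpoints>
  </serverEntries>
</serverindex:ServerIndex>
"""

for row in list_endpoints(ET.fromstring(SAMPLE), name="DCS_UNICAST_ADDRESS"):
    print(row)
```

    For a real node, pass `ET.parse(path_to_serverindex_xml).getroot()` instead of the sample. Comparing the DCS ports of the servers that start with those that hang may show which port range the firewall is missing.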

    Hope this helps.

    Regards

    ------------------------------
    Gabriel Aberasturi
    Versia tecnologias emergentes
    ------------------------------