Informix

nested-group-icon.png

DB2

Expand all | Collapse all

Bizarre CM/Connectivity Issue on 12.10.FC14

  • 1.  Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Sun May 24, 2020 11:45 AM

    All:

    I've opened a case on this behavior, but I'm wondering if anyone else here can replicate it.

    We recently updated our production cluster from a patched version of 12.10.FC12 (W1X1) to 12.10.FC14. After the upgrade, I noticed that certain queries were hanging. The common element is that the queries take this format:

    SELECT [columns] FROM [db]@[cm]:[table];

    Where "cm" is a connection manager that ultimately points back to the server on which the query is run.

    Sample of the behavior below:

    $ echo "SELECT COUNT(*) FROM sysutils:systables;" | dbaccess sysmaster@cm__chaos_jbe
    Database selected.
    (count(*))
    76
    1 row(s) retrieved.
    Database closed.
    $ echo "SELECT COUNT(*) FROM sysutils@cm__chaos_jbe:systables;" | dbaccess sysmaster@cm__chaos_jbe
    Database selected.
    908: Attempt to connect to database server (cm__chaos_jbe, conerr=-27001, oserr=0) failed.
    Error in line 1
    Near character position 53
    Database closed.

    Engine and shell are using the same sqlhosts file:

    $ echo $INFORMIXSQLHOSTS
    /informix/ENV/sqlhosts.ids__chaos__b1
    $ onstat -g env | grep SQLHOSTS
    INFORMIXSQLHOSTS /informix/ENV/sqlhosts.ids__chaos__b1


    Relevant sqlhosts entries:

    ids__chaos__b1 onsoctcp ifxprod1-b1 sqlexec_chaos g=gprim_ids_chaos,s=1
    ids__chaos__b2 onsoctcp ifxprod1-b2 sqlexec_chaos g=gprim_ids_chaos,s=1
    ids__chaos__b3 onsoctcp ifxprod1-b3 sqlexec_chaos g=gprim_ids_chaos,s=1
    ids__chaos__b4 onsoctcp ifxprod1-b4 sqlexec_chaos g=gprim_ids_chaos,s=1
    cm__chaos_jbe group - - c=1,i=42025
    cm__chaos_jbe__b1 onsoctcp aecm-b1 sqlexec_chaos_jbe g=cm__chaos_jbe
    cm__chaos_jbe__b2 onsoctcp aecm-b2 sqlexec_chaos_jbe g=cm__chaos_jbe

    When the remote query goes to a different node in the cluster, it works fine:
    $ echo $INFORMIXSQLHOSTS
    /informix/ENV/sqlhosts.ids__chaos__b1
    $ echo "SELECT DBSERVERNAME FROM systables WHERE tabid = 1;" | dbaccess sysmaster@cm_chaos_ro Database selected.(expression) ids__chaos__b21 row(s) retrieved.Database closed.$ echo "SELECT COUNT(*) FROM sysutils@cm_chaos_ro:systables;" | dbaccess sysmaster@cm__chaos_jbeDatabase selected. (count(*)) 761 row(s) retrieved.Database closed.$ grep cm_chaos_ro $INFORMIXSQLHOSTS
    cm_chaos_ro group - - c=1,i=42405
    cm_chaos_ro_b1 onsoctcp aecm-b1 sqlexec_chaos_ro g=cm_chaos_ro
    cm_chaos_ro_b2 onsoctcp aecm-b2 sqlexec_chaos_ro g=cm_chaos_ro




    ------------------------------
    TOM GIRSCH
    ------------------------------


  • 2.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Sun May 24, 2020 07:36 PM
    Huh. I never tried to use a CM SLA in a query so that the engine goes out to the Connection Manager. Interesting. I see the utility of it, just never tried. 

    Art

    Art S. Kagel, President and Principal Consultant
    ASK Database Management


    Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference.  Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.








  • 3.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Sun May 24, 2020 08:17 PM
    This code has been around forever, and it dates back to when the databases used to be on disparate, non-clustered DB servers. It has worked forever until this upgrade. What's bizarre is that there are a couple of SLAs that exhibit this behavior and a couple that do not, even though they're all running on the same oncmsm process.

    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 4.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue May 26, 2020 09:57 AM
    ​Hey Tom

    We also had some Problems which look similar with FC14 and Connection Manager. As the Connection Manager is new for us, we could not really say if this is because of FC14 Update, and worked with 12W1.

    In our Case it was a Problem with multiple CM_SLAs in the Server sqlhosts file. Perhaps you could check for this. In our case it was neccesary to restart the informix Server to get the changed sqlhosts activated.

    This one was not working because dbtest1_cm12, dbtest1_cm13 and dbtest1_cm14 are not unique
    dbtest1_1        group           -                 -                        c=1,e=dbtest1_cm14
    dbtest1_cm12    onsocssl        ARDB012V0S0P00    dbtest1_rw_sqli_cm12      g=dbtest1_1
    dbtest1_cm13    onsocssl        ARDB013V0S0P00    dbtest1_rw_sqli_cm13      g=dbtest1_1
    dbtest1_cm14    onsocssl        ARDB014V0S0P00    dbtest1_rw_sqli_cm14      g=dbtest1_1
    
    dbtest1_2       group           -                 -                         c=1,e=dbtest1_cm14
    dbtest1_cm12    onsocssl        ARDB012V0S0P00    dbtest1_rw_sqli_cm12      g=dbtest1_2
    dbtest1_cm13    onsocssl        ARDB013V0S0P00    dbtest1_rw_sqli_cm13      g=dbtest1_2
    dbtest1_cm14    onsocssl        ARDB014V0S0P00    dbtest1_rw_sqli_cm14      g=dbtest1_2​

    This one works:
    dbtest1_rw_sqli group           -                 -                         c=1,e=dbtest1_cm14
    dbtest1_cm12    onsocssl        ARDB012V0S0P00    dbtest1_rw_sqli_cm12      g=dbtest1_rw_sqli
    dbtest1_cm13    onsocssl        ARDB013V0S0P00    dbtest1_rw_sqli_cm13      g=dbtest1_rw_sqli
    dbtest1_cm14    onsocssl        ARDB014V0S0P00    dbtest1_rw_sqli_cm14      g=dbtest1_rw_sqli


    result:

    $ echo "SELECT COUNT(*) FROM sysutils@dbtest1_rw_sqli:systables;" | dbaccess sysmaster@dbtest1_rw_sqli
    Database selected.
          (count(*))
                  76
    1 row(s) retrieved.
    Database closed.
    
    $ echo "SELECT COUNT(*) FROM sysutils:systables;" | dbaccess sysmaster@dbtest1_rw_sqli
    Database selected.
          (count(*))
                  76
    1 row(s) retrieved.
    Database closed.


    ------------------------------
    --------------------------
    Andy Weis
    --------------------------
    ------------------------------



  • 5.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue May 26, 2020 11:58 AM
    One new item in the version, actually in FC13, is the sqlhosts cache, part of NS_CACHE config.
    It should be on by default, same 900 seconds as all the other 'name resolution' caches
     -> check 'onstat -g cfg [full] NS_CACHE'

    You might want to turn this off and see whether this makes a difference:  onmode -wm NS_CACHE=sqlhosts=0

    ------------------------------
    Andreas Legner
    ------------------------------



  • 6.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue May 26, 2020 12:26 PM
    Yep, disabling that was one of the first things I did to speed up testing. I think @Andreas Weis has identified the same problem I have. We've got a pretty complex CM setup and the behavior seems similar. (Though Mr. Weis may benefit from your NS_CACHE tip -- could have saved him a bounce.)

    It seems when you've got aliased entries (i.e., multiple dbservernames that point to the same host/port/protocol) in sqlhosts in FC14, some of them will work and some of them will hang on remote joins. I'm still trying to nail down further details. If FC12, at least, all such aliases would work.


    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 7.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue May 26, 2020 05:35 PM
    Edited by TOM GIRSCH Mon June 01, 2020 10:42 AM

    All:

    I think I'm zeroing in on something here as I test more.​​​​ First, by setting up a test CM and enabling logging, I can see that even when the connection hangs, the connection manager logs the attempt and onstat -g cmsm shows the connection count go up. So it's making it as far as the CM and hanging and eventually failing on the return trip, after the CM has done the forwarding. I've replicated the behavior with CMs running 4.10.FC12, 4.10.FC14 and 4.50.FC3, so it doesn't appear to have to do with the CM version. Instead, it has to do with the DB server version. Works in 12.10.FC12, not in 12.10.FC14 (I haven't tested FC13 yet).

    Here's what I've been able to replicate in controlled tests:

    1. If the CM forwards to a different INFORMIXSERVER than the originating one, the connection succeeds. In my tests, if the query originates on the primary and the remote portion hits the HDR secondary, that works, as does the converse. If the query originates on the primary and the CM redirects back to the primary for the remote portion, or if it originates on the HDR secondary and the CM redirects back to the HDR secondary, that's where I _sometimes_ have trouble.
    2. I _think_ that Mr. Weis' duplicate entry thing is at play, but I haven't been able to safely replicate that condition.
    3. In addition, it looks like the trouble begins when the INFORMIXSERVER value in the sqlhosts file on the originating DB server does not actually match the SLA name on the CM. (We've used aliases in sqlhosts for backward compatibility to emulate decommissioned servers so that old code doesn't need to be rebuilt, for example.) So when the [dbsvr] in SELECT * FROM [db]@[dbsvr]:[table] does not match the SLA name on the CM _and_ the CM redirects back to the originating DB server, the connection successfully gets to the CM, and the CM's redirect attempt hangs for around 30 seconds before returning -908/-27001.
    4. DISREGARD THIS, as it seems to have been a configuration error on my part; further testing needed. HOWEVER, if we reconfigure the SLA as MODE=PROXY and rerun the test listed in 3 above, instead of hanging it fails immediately with "761: INFORMIXSERVER does not match either DBSERVERNAME or DBSERVERALIASES." But only if the redirect points right back to the originating server; if the target server is different than the originating server, the connection succeeds irrespective of whether [dbsvr] and the SLA name match.
    5. Repeating again, all of the above tests connect successfully in FC12. I've actually reproduced the behavior in FC12, too.


    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 8.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Mon June 01, 2020 10:33 AM
    It appears you have found a bug worth note:
    IT33019: SPORADIC -908 / -27001 WHEN CONNECTING TO SLA ALIAS IN 12.10.XC14

    Have you gotten this tested on 12.10.XC13? - Asking for a friend...

    Eric Rowell

    ------------------------------
    Eric Rowell
    Database Administrator
    BFS
    Dallas TX
    2147653575
    ------------------------------



  • 9.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Mon June 01, 2020 10:41 AM

    Cart/Horse confusion. The bug was entered in response to my tech support case. ;) HCL is having internal arguments as to whether or not this should be "expected behavior." Which is silly because even if it's not supposed to work, it should fail immediately and with a different error message.

    I've replicated the behavior on FC12, FC13 and FC14. It doesn't necessarily have to be an alias, though that's the easiest way to reproduce. Can tweak my repro case and post it if you want to try it yourself.



    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 10.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Mon June 01, 2020 11:25 AM
    Tom,
        I have been at the bottom of that type of wreck before (this is my second time trying to reply to this post - so maybe two will appear).  Looking to use CM and SLA's for connection management and redirection.  I would love to see your reproduction notes (publicly or emailed).  I have a feeling from your description it is something I'm looking to do.  If my end goal is not valid (HCL) or long term supported I need to jump in or find another solution.

        Thank you,


    ------------------------------
    ~~~~~~~~~
    Eric Rowell
    ------------------------------



  • 11.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Thu June 04, 2020 11:58 AM


    I saw this APAR and I am chasing this as well!

    Regards,

    David.



    ------------------------------
    David Williams
    ------------------------------



  • 12.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Thu June 04, 2020 04:24 PM
    If you have a ticket open, be sure to have them reference mine, TS003740811

    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 13.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue August 11, 2020 10:40 AM
    False alarm! The bug fix does NOT resolve the issue.

    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 14.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Mon June 01, 2020 12:46 PM
      |   view attached

    All, @Eric Rowell:

    I've created a reproduction, and am posting it here for you to test. Here's sqlhosts:

    # IANA (www.iana.org) assigned port number/service names for Informix:
    # sqlexec 9088/tcp
    # sqlexec-ssl 9089/tcp
    
    ids__3740811            onsoctcp        idshost         42001
    
    ##########################################
    #          Connection Managers           #
    ##########################################
    tj_test                 group           -               -                       c=1,i=42001
    cm__3740811             onsoctcp        cmhost          42001                   g=tj_test
    
    tj_kaboom               onsoctcp        cmhost          42001
    


    Here's CMCONFIG:

    NAME            aecm__3740811
    DEBUG           0
    LOG             1
    LOGFILE         /informix/logs/aecm__3740811.log
    CMALARMPROGRAM  /informix/scripts/alarm_cm.sh
    SQLHOSTS        LOCAL
    
    CLUSTER 3740811_cluster # cluster name
    {
            INFORMIXSERVER ids__3740811
            SLA cm__3740811 PRIMARY
            FOC ORDER=DISABLED PRIORITY=4101
    }
    

    Attaching onconfig.

    Port 42001 needs to be open on any firewalls. Database should live on a separate host from the CM.

    From the DB host, running a query via dbaccess sysmaster@tj_kaboom will work. Running, e.g., "SELECT COUNT(*) FROM sysmaster@tj_kaboom:systabnames;" will fail. Sub any other sqlhosts entry for tj_kaboom and it will work in all cases.



    ------------------------------
    TOM GIRSCH
    ------------------------------

    Attachment(s)



  • 15.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Thu June 04, 2020 04:22 PM
    Edited by TOM GIRSCH Thu June 04, 2020 04:23 PM
      |   view attached

    I'm attaching a more robust and more generalized reproduction case here. Create the instance ids__3740811 on host1 using the attached onconfig. Create the instance ids__3740811__hdr on host2 using that onconfig, setting it up as an HDR secondary for ids__3740811. Create a connection manager on host3 using the attached cmconfig file. All three get the same sqlhosts file, also attached. Ensure that port 42001 is open on all firewalls.


    With that all configured, run pretest.sh (attached) and then test.sh (also attached) from either host1 or host2 (doesn't matter which). You'll see that of the five commands it runs, four of them work, and two out of three that reference the SLA alias tj_kaboom.


    Having run those, now reconfigure the cm__3740811 SLA to redirect to the HDR rather than to the primary and rerun both scripts; you'll see that all of the pretest commands still work, while the commands from test.sh have swapped in terms of which succeeds and which fails. In total, six commands that reference tj_kaboom. Four of them work, two fail. And the failures take approximately one minute.


    To hear development tell it, the problem is not the two scenarios which fail, but the four which succeed.


    Even if we accepted their suggestion -- obviously I do not -- it should fail right away and consistently rather than hanging up for a minute doing who knows what.

    One additional note. If you monitor the CM log file while running these tests, you'll see that the CM actually receives the connection request and forwards it even in the cases where it fails. So the issue seems to be on the DB server side.



    ------------------------------
    TOM GIRSCH
    ------------------------------

    Attachment(s)

    zip
    cm_hang_repro.zip   42K 1 version


  • 16.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Thu June 04, 2020 04:51 PM
    Tom:

    This reminds me of when I called Madison because the 2.80 release of the CSDK broke using DBPATH to name a failover server. When I told him that I had been able to use an sqlhosts GROUP entry instead until they fixed DBPATH, he said "Groups don't work that way, they are only for ER!"

    I, of course, responded, "Madison, I don't know if you intended groups to work this way, but they do! Certainly don't 'FIX' them, at least no until the CSDK guys fix DBPATH!" Is it a feature or a bug?

    Art

    Art S. Kagel, President and Principal Consultant
    ASK Database Management


    Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference.  Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.








  • 17.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Thu June 04, 2020 04:27 PM

    On my production server, this behavior replicates even when the referenced SLA name matches a running SLA. But I haven't been able to replicate this behavior in my repro case.


    PRI-->CM (via alias)-->PRI hangs

    PRI-->CM (via alias)-->HDR works

    HDR-->CM (via alias)-->PRI works

    HDR-->CM (via alias)-->HDR hangs



    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 18.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Thu June 11, 2020 09:20 AM
    Additional update: I've replicated this on 14.10.FC3

    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 19.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Wed June 24, 2020 01:50 PM

    IMPORTANT UPDATE from Tech Support on this. I originally said this appeared when I upgraded from FC12 to FC14, but later was able to reproduce it on FC12, too. From TS:

    The reason I am asking is that the stack you sent me yesterday matches the following defect
    which was supposed to be fixed in 12.10.xC6.
     
    APAR IT08052 - SECOND REMOTE SELECT IN A SESSION TO A DIFFERENT SERVERNAME
    FAILS WITH ERROR -908 -27001



    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 20.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue July 14, 2020 05:58 PM

    Now hear this! Bug has allegedly been identified and a fix coded. I've requested a special build. If you need one, here's the APAR:

    APAR IT33019 - SPORADIC -908 / -27001 WHEN CONNECTING TO SLA ALIAS

    Which is the exact bug @Eric Rowell identified earlier.



    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 21.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue July 14, 2020 06:05 PM
    I'm so glad this has been addressed!...   I'm playing with the CM and finding it to be a pain to know what is valid and not...  

    Eric R.

    ------------------------------
    ~~~~~~~~~
    Eric Rowell
    ------------------------------



  • 22.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted Tue August 11, 2020 10:46 AM
    False alarm! The bug fix did not change anything. And it turns out they didn't even really _attempt_ to fix the actual problem. Their "fix" was to make the engine quicker to recognize that you had "corrected" the "mistake" in the sqlhosts file. Which isn't at all what we asked for.

    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 23.  RE: Bizarre CM/Connectivity Issue on 12.10.FC14

    Posted 16 hours ago

    I finally received an explanation from support on what seems to be going on here:

    With a distributed query like 
    select count(*) from sysmaster@dummy:systables;
     
    This is a server-side issue with Connection Manager involved for distributed query, the coordinator (local) server needs to pre-determine the server 'dummy' in the query is the LOCAL server or a REMOTE server. If it's a remote server then the coordinator creates a connection to the remote server, otherwise no connection is needed.
     
    Only two cases here, 'dummy' is either a remote server or the local server. If it's a remote server, (like we run it on the SDS), then local server will create a connection to 'dummy', and CM re-routes the connection to the primary, it's a remote server to SDS, just like a normal DISTRIBUTED query. 
     
    If we run the distributed query on the primary, 'dummy' is not defined in the CM system, the coordinator server looks up 'dummy' by name, and cannot find that name in system, the coordinator treats it as a 'remote' server and creates a connection to the 'dummy' host/port, then CM re-routes the connection to the primary self, that causes -908. If the coordinator server looks up 'dummy' by host/ip, then the server knows 'dummy' is the same as oltp1, it points to primary self, then  no connection to create. In this case Connection Manager is not involved since no connection is created.
    Distributed query does not work if the coordinator tries to connect to itself, underline I-STAR transaction design does not support that, we need to pre-determine server 'dummy' is local or remote

    Support provided a kludgy patch that works around the behavior by setting a flag in SQL_DEF_CTRL.


    ------------------------------
    TOM GIRSCH
    ------------------------------