I finally received an explanation from support on what seems to be going on here:
With a distributed query like
select count(*) from sysmaster@dummy:systables;
This is a server-side issue with Connection Manager involved for distributed query, the coordinator (local) server needs to pre-determine the server 'dummy' in the query is the LOCAL server or a REMOTE server. If it's a remote server then the coordinator creates a connection to the remote server, otherwise no connection is needed.
Only two cases here, 'dummy' is either a remote server or the local server. If it's a remote server, (like we run it on the SDS), then local server will create a connection to 'dummy', and CM re-routes the connection to the primary, it's a remote server to SDS, just like a normal DISTRIBUTED query.
If we run the distributed query on the primary, 'dummy' is not defined in the CM system, the coordinator server looks up 'dummy' by name, and cannot find that name in system, the coordinator treats it as a 'remote' server and creates a connection to the 'dummy' host/port, then CM re-routes the connection to the primary self, that causes -908. If the coordinator server looks up 'dummy' by host/ip, then the server knows 'dummy' is the same as oltp1, it points to primary self, then no connection to create. In this case Connection Manager is not involved since no connection is created.
Distributed query does not work if the coordinator tries to connect to itself, underline I-STAR transaction design does not support that, we need to pre-determine server 'dummy' is local or remote
Support provided a kludgy patch that works around the behavior by setting a flag in SQL_DEF_CTRL.
------------------------------
TOM GIRSCH
------------------------------
Original Message:
Sent: Sun May 24, 2020 11:45 AM
From: TOM GIRSCH
Subject: Bizarre CM/Connectivity Issue on 12.10.FC14
All:
I've opened a case on this behavior, but I'm wondering if anyone else here can replicate it.
We recently updated our production cluster from a patched version of 12.10.FC12 (W1X1) to 12.10.FC14. After the upgrade, I noticed that certain queries were hanging. The common element is that the queries take this format:
SELECT [columns] FROM [db]@[cm]:[table];
Where "cm" is a connection manager that ultimately points back to the server on which the query is run.
Sample of the behavior below:
$ echo "SELECT COUNT(*) FROM sysutils:systables;" | dbaccess sysmaster@cm__chaos_jbe
Database selected.
(count(*))
76
1 row(s) retrieved.
Database closed.
$ echo "SELECT COUNT(*) FROM sysutils@cm__chaos_jbe:systables;" | dbaccess sysmaster@cm__chaos_jbe
Database selected.
908: Attempt to connect to database server (cm__chaos_jbe, conerr=-27001, oserr=0) failed.
Error in line 1
Near character position 53
Database closed.
Engine and shell are using the same sqlhosts file:
$ echo $INFORMIXSQLHOSTS
/informix/ENV/sqlhosts.ids__chaos__b1
$ onstat -g env | grep SQLHOSTS
INFORMIXSQLHOSTS /informix/ENV/sqlhosts.ids__chaos__b1
Relevant sqlhosts entries:
ids__chaos__b1 onsoctcp ifxprod1-b1 sqlexec_chaos g=gprim_ids_chaos,s=1
ids__chaos__b2 onsoctcp ifxprod1-b2 sqlexec_chaos g=gprim_ids_chaos,s=1
ids__chaos__b3 onsoctcp ifxprod1-b3 sqlexec_chaos g=gprim_ids_chaos,s=1
ids__chaos__b4 onsoctcp ifxprod1-b4 sqlexec_chaos g=gprim_ids_chaos,s=1
cm__chaos_jbe group - - c=1,i=42025
cm__chaos_jbe__b1 onsoctcp aecm-b1 sqlexec_chaos_jbe g=cm__chaos_jbe
cm__chaos_jbe__b2 onsoctcp aecm-b2 sqlexec_chaos_jbe g=cm__chaos_jbe
When the remote query goes to a different node in the cluster, it works fine:
$ echo $INFORMIXSQLHOSTS
/informix/ENV/sqlhosts.ids__chaos__b1
$ echo "SELECT DBSERVERNAME FROM systables WHERE tabid = 1;" | dbaccess sysmaster@cm_chaos_ro Database selected.(expression) ids__chaos__b21 row(s) retrieved.Database closed.$ echo "SELECT COUNT(*) FROM sysutils@cm_chaos_ro:systables;" | dbaccess sysmaster@cm__chaos_jbeDatabase selected. (count(*)) 761 row(s) retrieved.Database closed.$ grep cm_chaos_ro $INFORMIXSQLHOSTS
cm_chaos_ro group - - c=1,i=42405
cm_chaos_ro_b1 onsoctcp aecm-b1 sqlexec_chaos_ro g=cm_chaos_ro
cm_chaos_ro_b2 onsoctcp aecm-b2 sqlexec_chaos_ro g=cm_chaos_ro
------------------------------
TOM GIRSCH
------------------------------
#Informix