I'm using Informix IDS 10.00.UC6 on Solaris 11, with two machines having the same database schema and all tables replicating in both directions using Enterprise Replication, so in theory both databases should have the same content.
However, a problem has arisen where one direction of replication (Host A to Host B) continues to work correctly, but the other direction (Host B to Host A) does not. The symptoms are:
- Changes made to a table on Host B do not propagate to Host A (as determined by changing a row on Host B and inspecting the table on Host A)
- `cdr list serv` shows `Active` and `Connected` (both directions), but on Host B there is a queue of millions of bytes.
- `cdr list repl` shows non-zero queues for several of the replicates.
- `cdr stats recvq` on Host A shows nothing received from Host B recently.
- `cdr stats rqm` shows data in the spool `trg_send_stxn` with flags `SEND_Q, SPOOLED, PROGRESS_TABLE, NEED_ACK, SENDQ_MASK, SREP_TABLE`.
- There are no errors or relevant messages in `online.log` or `cdr_mon.log`, or any other place I can think to look.
- Some of the tables are "out of sync" in that rows have conflicting data or are missing; this is for various reasons relating to past errors where one host was offline. However, even changes to tables with correct data on Host B are not propagated to Host A.
- I did a `cdr cleanstart` on Host B yesterday, after this problem had been occurring in both directions. That at least made the A -> B direction start working (the opposite of what I expected), and the queues on Host B were 0 at that time. After the cleanstart, changes to some tables (with correct data) would propagate to Host A, while changes to other tables on B would not. But today, no tables are propagating from B to A.
- Before the `cleanstart` I had found by experimenting that sometimes deleting an individual replicate would reduce the size of the stuck queue but the queue remained stuck all the same; and sometimes, deleting a replicate would make the queue move for a time before being stuck again.
- There is also a DR host that both A and B do one-way propagation to, and that is propagating correctly with no queue backup.
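To make it easier to tell whether the queues are moving at all, the command outputs listed above could be captured at intervals and compared. This is only a minimal sketch: it assumes `cdr` is on the `PATH` of the user running it, it uses the subcommands exactly as written in this post, and the function name `snapshot_er` and the output directory are hypothetical.

```shell
#!/bin/sh
# Capture the ER status commands mentioned in this post into timestamped
# files, so that two snapshots taken some minutes apart can be diffed.
# snapshot_er and the directory layout are illustrative names only.
snapshot_er() {
    outdir=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$outdir" || return 1
    for args in "list serv" "list repl" "stats rqm" "stats recvq"; do
        file="$outdir/cdr-$(echo "$args" | tr ' ' '-')-$stamp.txt"
        if command -v cdr >/dev/null 2>&1; then
            # $args is deliberately unquoted so it splits into two words.
            cdr $args >"$file" 2>&1
        else
            echo "cdr not found on PATH; skipped: cdr $args" >"$file"
        fi
    done
    # Print the timestamp so the caller can locate this snapshot's files.
    echo "$stamp"
}
```

Running `snapshot_er /tmp/er-snap` on Host B twice, a few minutes apart, and diffing the two `cdr-stats-rqm-*` files would show whether anything in the send queue changed between snapshots.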
I'm now at a loss as to how to diagnose why the data in the replication queues is not moving. If there were sync errors (i.e. the replicated change could not be applied because the data on Host A differs), I would expect messages in `online.log` saying that the update was rejected, with information saved to `$INFORMIXDIR/ats_dr` and so on; this has happened recently. It seems as if something in the queue is being refused but is neither cleared nor logged, and is blocking the queue. Host A has heavy live traffic and (thankfully) is correctly replicating to Host B, but not vice versa.
Any ideas of more things to try or ways to diagnose the problem would be most welcome.
From searching, I have seen other people advise dropping `syscdr`, but nothing was mentioned about how to recreate it and resume replication afterwards.
------------------------------
Matt McNabb
------------------------------
#Informix