Is the repack doing lots of commits? If so, it will need to coordinate for each commit.
David.
Original Message:
Sent: Tue January 02, 2024 01:07 PM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?
David,
(a) I was checking `onstat -l | grep C-` while the replica was lagging, and it showed almost the same values as `onstat -g rss verbose`.
(b) Could apply contention be an issue if a single thread performs 99.99% of the apply work when lagging?
------------------------------
Sincerely,
Dennis
Original Message:
Sent: Mon January 01, 2024 10:47 PM
From: David Williams
Subject: What prevents a RSS to roll forward as fast as its primary?
Hi,
So in order of the flow we have:
Primary
Current Log Page:543836,63218
Next log page to send(log id,page): 543813,121374
Last log page acked(log id,page): 543813,120349
Pending Log Pages to be ACKed: 1032
Approximate Log Page Backlog:3015340
RSS
Last log page received(log id,page): 543814,29439
Queues are full.
Secondary Apply Queue: Total Buffers:12 Size:512K Free Buffers:0
Log Recovery Queue: Total Buffers:4 Size:8192K Free Buffers:0
Log Page Queue: Total Buffers:128 Size:4K Free Buffers:7
Log Record Queue: Total Buffers:150 Size:512K Free Buffers:4
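The positions quoted above can be lined up as a rough sketch. The `(log id, page)` values are copied from the onstat output in this thread; the helper name `logs_behind` is only illustrative, and the snapshots were taken at slightly different moments, so small inconsistencies between primary and RSS figures are to be expected.

```python
# Positions copied from the onstat output quoted in this thread.
# Each is a (log id, page) pair; Python compares tuples element by element,
# so a higher log id always sorts after a lower one.
positions = {
    "primary current": (543836, 63218),
    "primary next to send": (543813, 121374),
    "primary last acked": (543813, 120349),
    "RSS last received": (543814, 29439),
}


def logs_behind(current, other):
    """Whole logical logs between two positions (page offsets are ignored)."""
    return current[0] - other[0]


# Newest position first, oldest last.
ordered = sorted(positions.items(), key=lambda kv: kv[1], reverse=True)
current = positions["primary current"]
for name, pos in ordered:
    print(f"{name:22} {pos[0]},{pos[1]:<8} {logs_behind(current, pos)} logs behind")
```

The gap of more than twenty logs sits between the primary's current position and what has been sent and acknowledged, while the send and ack positions are close together.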
onstat -g cluster on Version 11.70, unlike 12.10 (https://www.ibm.com/docs/en/SSGU8G_12.1.0/com.ibm.adref.doc/ids_adr_1087.htm), does not show "Applied Log (log, page)". Can you repeat this with "onstat -l | grep C" from the RSS?
I would say it is an apply issue.
NOTE: There are some APARs in this area that are fixed in 14.10.FC10:
https://www.ibm.com/support/pages/apar/IT37242
IT37242: WITH DBSPACES COMPRISED OF MANY CHUNKS, FREQUENT 'CHUNK DOWN' CHECKING CAN BE VERY EXPENSIVE
The bld_logrecs thread checks for down chunks for EVERY log record that is applied.
This does not show here, as the "Log Record Queue" is almost full, but it will not be helping.
https://www.ibm.com/support/pages/apar/IT32067
RA_Q_LIST MUTEX CONTENTION AND HOT READAHEAD SPIN LOCK WHEN THERE ARE MANY READAHEAD THREADS
which can also affect replication, certainly in 12.10; I am not sure whether readahead is the same in 11.70.
Check storage performance on the RSS.
This could also be due to the contention between apply threads I mentioned earlier; someone from HCL can comment further.
Regards,
David.
------------------------------
David Williams
Original Message:
Sent: Sat December 30, 2023 03:19 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?
David,
SEC_APPLY_POLLTIME has no meaning for 11.70.
We are performing a large table repack, and the replica is lagging behind.
onstats on the primary:
$ onstat -g cluster
IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:57:11 -- 361806176 Kbytes
Primary Server:elids5
Current Log Page:543836,63218
Index page logging status: Enabled
Index page logging was enabled at: 2012/08/24 16:10:30
Server     ACKed Log      Supports   Status
           (log, page)    Updates
elids6     0,0            No         ASYNC(RSS),Disconnected,Defined
elids6_r   543813,76895   No         ASYNC(RSS),Connected,Active
$ onstat -g rss verbose
IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:57:39 -- 361806176 Kbytes
Local server type: Primary
Index page logging status: Enabled
Index page logging was enabled at: 2012/08/24 16:10:30
Number of RSS servers: 2
RSS Server information:
RSS Server control block: 0x0
RSS server name: elids6
RSS server status: Defined
RSS connection status: Disconnected
RSS Server control block: 0x700001f230ae028
RSS server name: elids6_r
RSS server status: Active
RSS connection status: Connected
RSS flow control:0/0
Log transmission status: Blocked
Next log page to send(log id,page): 543813,121374
Last log page acked(log id,page): 543813,120349
Time of Last Acknowledgement: 2023-12-30.10:46:38
Pending Log Pages to be ACKed: 1032
Approximate Log Page Backlog:3015340
Sequence number of next buffer to send: 89167954
Sequence number of last buffer acked: 89167889
Supports Proxy Writes: N
$ onstat -g rss log
IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:58:29 -- 361806176 Kbytes
Log Pages Snooped:
RSS Srv    From        From       Tossed
name       Cache       Disk       (LBC full)
elids6_r   167171226   13287270   15911638
Onstats on the replica:
$ onstat -g rss verbose
IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 18 days 15:15:20 -- 262286176 Kbytes
RSS Server control block: 0x700001e4ea85e60
Local server type: RSS
Server Status : Active
Source server name: elids5_r
Connection status: Connected
Last log page received(log id,page): 543814,29439
Sequence number of last buffer received: 89170395
Sequence number of last buffer acked: 89170395
$ onstat -g laq
IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 18 days 15:16:08 -- 262286176 Kbytes
Log Apply Info:
Thread | Queue Size | Total Queued | Avg Depth
xchg_1.0 0 112159631 20.15
xchg_1.1 0 7924415 5.82
xchg_1.2 0 7189459 3.64
xchg_1.3 0 2782265 6.04
xchg_1.4 0 9741266 7.09
xchg_1.5 0 29073979 19.87
xchg_1.6 0 23416392 3.57
xchg_1.7 0 20301067 2.25
xchg_1.8 0 20963291 4.72
xchg_1.9 0 2255620 22.28
xchg_1.10 0 7046118 2.34
xchg_1.11 0 5665037 2.83
xchg_1.12 0 7656817 3.50
xchg_1.13 0 2163023 5.55
xchg_1.14 0 6675688 2.43
xchg_1.15 0 6682011 3.20
xchg_1.16 0 8666645 6.72
xchg_1.17 0 3638224 4.32
xchg_1.18 0 3434523 7.04
xchg_1.19 0 3706118 5.16
xchg_1.20 0 3015094 4.42
xchg_1.21 0 4097616 5.35
xchg_1.22 0 3545439 3.19
xchg_1.23 0 6714137 5.09
xchg_1.24 0 18110006 3.48
xchg_1.25 0 4760233 5.37
xchg_1.26 0 2513045 4.01
xchg_1.27 136 195407918 105.51
xchg_1.28 0 7531591 4.96
xchg_1.29 0 13571912 12.91
Secondary Apply Queue: Total Buffers:12 Size:512K Free Buffers:0
Log Recovery Queue: Total Buffers:4 Size:8192K Free Buffers:0
Log Page Queue: Total Buffers:128 Size:4K Free Buffers:7
Log Record Queue: Total Buffers:150 Size:512K Free Buffers:4
Next, we have 120 kio queues and 1 aio queue on the RSS.
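To see how unevenly the apply work is spread, the `onstat -g laq` rows can be turned into per-thread shares. This is a minimal sketch: the function name `thread_shares` is made up for illustration, the sample is three rows taken from the output above, and it assumes "Total Queued" is a running total, in which case the difference between two snapshots taken during the lag would be more telling than a single one.

```python
def thread_shares(laq_text):
    """Return [(thread, total_queued, share)] sorted by share, largest first."""
    rows = []
    for line in laq_text.splitlines():
        parts = line.split()
        # Data rows look like: xchg_1.27 136 195407918 105.51
        if len(parts) == 4 and parts[0].startswith("xchg_"):
            rows.append((parts[0], int(parts[2])))
    total = sum(queued for _, queued in rows)
    if total == 0:
        return []
    return sorted(
        ((thread, queued, queued / total) for thread, queued in rows),
        key=lambda row: row[2],
        reverse=True,
    )


# Three rows copied from the onstat -g laq output in this thread.
sample = """\
xchg_1.0 0 112159631 20.15
xchg_1.26 0 2513045 4.01
xchg_1.27 136 195407918 105.51
"""

for thread, queued, share in thread_shares(sample):
    print(f"{thread:10} {queued:>12} {share:6.1%}")
```

Feeding the full output through the same function would show whether one or two xchg threads dominate.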
------------------------------
Sincerely,
Dennis
Original Message:
Sent: Wed December 27, 2023 12:26 PM
From: David Williams
Subject: What prevents a RSS to roll forward as fast as its primary?
Hi,
Sometimes, because of the synchronization points I mentioned, having FEWER apply threads can be faster.
Try OFF_RECVRY_THREADS 31, 23,17,11 and see which comes out best.
Also try SEC_APPLY_POLLTIME 100 and 50 and 0.
When the lag happens send
- onstat -g cluster on the primary (*)
- onstat -g rss verbose on the primary (*) and the RSS
- onstat -g rss log on the primary
- onstat -g laq on the RSS
When the lag happens, run onmode -l to switch logs and onlog -n to dump the contents of the previous log, then count the log records/commits for each partnum associated with the repack.
The starred ones tell you a lot about which two of these points the bottleneck is occurring between:
- Current Log position on the primary
- Current Send position on the primary
- Current Acknowledged position on the RSS
- Current Applied Position on the RSS
laq on the RSS also shows which replication queue on the RSS is getting backed up.
I suspect that with the repack only one partnum is being hit, which means only one apply thread is doing the work; there is not much you can do about that.
NOTE: Also check onstat -g ioq: are you using KAIO or AIO? With fewer CPU VPs there are fewer KAIO threads and so less I/O bandwidth, though I am not sure whether that makes a difference in this scenario.
Yes Version 14 does improve throughput!
Regards,
David.
------------------------------
David Williams
Original Message:
Sent: Tue December 26, 2023 11:26 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?
Doug,
The fact is that the replica rarely lags behind the primary during regular activities.
This usually occurs when an operation is performed on a large table, such as repacking the table or building an index.
Would increasing OFF_RECVRY_THREADS help in this case?
------------------------------
Sincerely,
Dennis
Original Message:
Sent: Tue December 26, 2023 03:26 AM
From: Doug Lawry
Subject: What prevents a RSS to roll forward as fast as its primary?
Hi Dennis.
OFF_RECVRY_THREADS is most significant, as also mentioned by David:
https://www.ibm.com/docs/en/informix-servers/14.10?topic=cptarr-off-recvry-threads-recvry-threads-their-effect-fast-recovery
https://www.ibm.com/docs/en/informix-servers/14.10?topic=parameters-off-recvry-threads-configuration-parameter
Traditionally, the rule for this was the first prime number greater than three times the number of CPU VPs. Having it too low will throttle it compared to the primary, where the equivalent is the number of user sessions.
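The traditional rule can be written out as a small sketch. The function names are illustrative only, and the example assumes 5 CPU VPs purely because the RSS has 5 cores; the thread gives the core count, not the actual number of CPU VPs.

```python
def is_prime(n):
    """Trial division; fine for the small values involved here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def recvry_threads_rule(cpu_vps):
    """First prime strictly greater than three times the number of CPU VPs."""
    candidate = 3 * cpu_vps + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


# Assumed example: 5 CPU VPs -> 3 * 5 = 15 -> next prime above 15.
print(recvry_threads_rule(5))
```

This only gives a starting point by that rule of thumb; the best value for a given workload still has to be found by testing.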
------------------------------
Doug Lawry
Oninit Consulting
Original Message:
Sent: Mon December 25, 2023 03:03 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?
We have two servers of the same architecture, IBM Power 870.
Each has storage allocated on separate IBM FlashSystem 9200.
Primary's resources:
Cores: 51
RAM: 600 GB
RSS:
Cores: 5
RAM: 400 GB
During a table repack, the primary generates logical logs quickly, while the RSS replays them much more slowly.
Is this by design on the RSS, or are we missing some relevant settings?
------------------------------
Sincerely,
Dennis
------------------------------