What prevents a RSS to roll forward as fast as its primary?

5. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Tue December 26, 2023 07:38 AM
Edited by Dennis Melnikov Thu December 28, 2023 02:48 AM

Art,

Thank you so much for the detailed reply.
Please see my answers below.

1. In case of a failover, we transfer resources within an Enterprise pool.
2. 'Maximum server connections 8926.' That is not far from the 11000 threshold.
3. AIO VPs are as follows, identical for both,

VPCLASS aio,num=4
AUTO_AIOVPS 1

4. I monitor the CPU load with `topas` and LPAR2RRD. As one can see, the load doesn't exceed 40%. At the same time, however, the load looks as if it is hitting some sort of limit it can't overcome, so some sort of CPU contention is likely.

5. All the cores are physical, i.e. 'processing units' as the HMC names them. The systems support eight SMT threads per core and are running in SMT-4 mode.
6. Currently, `onstat -k` shows 0 lock table overflows on the replica, 14+ days uptime. I will monitor it.
7. The storage systems are identical. RAID level: Distributed RAID 6 - does it make sense for flash drives?

------------------------------
Sincerely,
Dennis
------------------------------

Original Message:
Sent: Mon December 25, 2023 10:44 AM
From: Art Kagel
Subject: What prevents a RSS to roll forward as fast as its primary?

Dennis:

Tuning Informix on Power is a complex subject, but, in general, secondaries work best if the secondary is identical to the primary. Otherwise, assuming those resources are needed on the primary for normal operations, how will you survive if the secondary has to become the primary? Beyond that, you need to be able to process the same transaction volume on the secondaries as on the primary, and sometimes even faster, to minimize the latency from network transit and overhead.

Every one of those differences in configuration happens to be in the critical performance path.

NUMFDSERVERS manages the number of threads used for network connections that migrate among CPU VPs, and you have 160 CPU VPs on the primary and 140 on the secondary, so every time a thread has to wait for IO it might migrate to another VP when it awakens. Also, the recommendation is to have as many threads configured as NETWORK listeners. You have 50 on the primary and 40 on the secondary, so you probably should be setting NUMFDSERVERS to 50 on both. See the notes on NETTYPE below.
NETTYPE - should be the same on both machines, so: NETTYPE onsoctcp,50,220,NET as things are now. I wonder why you have over 11000 concurrent connections configured here (50 * 220), do you really have that many client connections at peak load? That brings us to VPCLASS cpu.
VPCLASS cpu,num=160 -vs- VPCLASS cpu,num=140 - These should be the same, but I do understand that there are a different number of cores on the two machines. Over clocking the secondary with 28 CPU VPs per core is WAY more than those cores can handle, especially with 40 or 50 NET VPs also running and I do not know how many AIO VPs as well. There is a lot of overhead to managing over 200 VPs on 5 cores. You say there is no CPU contention; how are you monitoring that?
As I mentioned in my first reply, 5 cores versus 50 is a significant difference. Also how many of these "cores" are physical cores and how many are SMT threads? You should be running whatever physical cores you have in an SMT mode that disables 50% of those SMT threads. That is both my and IBM's recommendation for running database systems!
STMT_CACHE - this also should be set the same on both machines and my personal recommendation is to have it set to 0 (off) on both systems. Especially important on the secondary which is not running "normal" queries for the most part, and if it is being used for reporting, those queries will not be similar to the needs of the replication going on.
LOCKS - you need exactly as many locks on the secondary as you do on the primary to process transactions and rollbacks. Since you have the setting MUCH lower on the secondary, it has to increase the number of locks dynamically when needed which pauses any processing that requires locks. You can check onstat -k at the bottom of the report to see how many locks the secondary is actually using and how many lock overflows were triggered to get it there. Each overflow doubled the number of locks which might result in MORE locks actually created than on the primary (which you should also check FWIW).
Finally, are the two FlashSystem 9200 storage systems configured identically? Do they have the same number of drives on them, and are the same number being used for chunks? Are they configured with the same RAID level and configuration (i.e. block size)? Do they have the same amount of cache memory installed?
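
The arithmetic behind several of the points above can be sketched in a few lines of Python. The figures are the ones quoted in this reply (NETTYPE 50 x 220, 140 CPU VPs on 5 cores, LOCKS 320000 versus 26280000). The doubling-per-overflow behaviour is taken from the description above and is not verified against the server, and the function names are made up for illustration.

```python
def max_connections(poll_threads: int, conns_per_thread: int) -> int:
    """Connections allowed by one NETTYPE entry: poll threads x connections per thread."""
    return poll_threads * conns_per_thread


def vps_per_core(cpu_vps: int, cores: int) -> float:
    """Average number of CPU VPs that have to share one core."""
    return cpu_vps / cores


def doublings_needed(start_locks: int, target_locks: int) -> int:
    """Count doublings until start_locks reaches target_locks.

    Assumption: every lock table overflow doubles the table, as described above.
    """
    count = 0
    locks = start_locks
    while locks < target_locks:
        locks *= 2
        count += 1
    return count


if __name__ == "__main__":
    print(max_connections(50, 220))               # 11000
    print(vps_per_core(140, 5))                   # 28.0
    print(doublings_needed(320_000, 26_280_000))  # 7
```

Under that doubling assumption, seven overflows would take 320000 locks to 40960000, which is more than the 26280000 configured on the primary.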

If this is not enough to resolve the issues you are facing, then I would make the rare recommendation that you get someone to help you out. At the very least you seem to need a server Health Check performed. I can help.

Art

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com

Original Message:
Sent: Mon December 25, 2023 07:49 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?

Art,

Here are the meaningful differences between the primary (<) and the RSS (>):

236,238c236
< NETTYPE onsoctcp,50,220,NET
---
> NETTYPE onsoctcp,40,220,NET
243,245c241
< NUMFDSERVERS 8
---
> NUMFDSERVERS 4
278,279c274,275
< VPCLASS cpu,num=160
---
> VPCLASS cpu,num=120
315,316c311,312
< LOCKS 26280000
---
> LOCKS 320000
360,361c356,357
< SHMADD 20480000
---
> SHMADD 10240000
602,603c597,598
< STMT_CACHE 0
---
> STMT_CACHE 2
780,782c775
< AUTO_READAHEAD 1,16
---
> AUTO_READAHEAD 1,32
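
A line-based `diff` like the one above depends on both files having the same layout. As an alternative, here is a minimal Python sketch that compares two ONCONFIG files parameter by parameter. It assumes the usual `NAME value` layout with `#` comments; the function names are illustrative and not part of any Informix tool.

```python
from collections import defaultdict


def parse_onconfig(text: str) -> dict[str, list[str]]:
    """Map each parameter name to its values, in file order.

    Parameters such as NETTYPE or VPCLASS may appear more than once,
    so every parameter maps to a list of values.
    """
    params: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        params[parts[0]].append(parts[1].strip() if len(parts) > 1 else "")
    return dict(params)


def onconfig_differences(primary: str, secondary: str) -> dict[str, tuple[list[str], list[str]]]:
    """Return {parameter: (primary values, secondary values)} for settings that differ."""
    a, b = parse_onconfig(primary), parse_onconfig(secondary)
    return {
        name: (a.get(name, []), b.get(name, []))
        for name in sorted(set(a) | set(b))
        if a.get(name, []) != b.get(name, [])
    }


if __name__ == "__main__":
    # A few of the values quoted in this thread, used as sample input.
    primary = "NETTYPE onsoctcp,50,220,NET\nLOCKS 26280000\nSTMT_CACHE 0\nAUTO_AIOVPS 1\n"
    secondary = "NETTYPE onsoctcp,40,220,NET\nLOCKS 320000\nSTMT_CACHE 2\nAUTO_AIOVPS 1\n"
    for name, (p, s) in onconfig_differences(primary, secondary).items():
        print(f"{name}: primary={p} secondary={s}")
```

In practice the two strings would be read from the ONCONFIG files of the two servers.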

Next, I don't see CPU contention while redoing the logs, so why would a shortage of cores affect the speed?

------------------------------
Sincerely,
Dennis

Original Message:
Sent: Mon December 25, 2023 06:11 AM
From: Art Kagel
Subject: What prevents a RSS to roll forward as fast as its primary?

Your RSS has less memory and fewer cores. It is no wonder that the secondary cannot keep up with the activity on the primary! Are the ONCONFIG settings identical? If not, that is one more reason why it falls behind.

Art

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com

Original Message:
Sent: Mon December 25, 2023 03:03 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?

We have two servers of the same architecture, IBM Power 870.
Each has storage allocated on separate IBM FlashSystem 9200.

Primary's resources:
Cores: 51
RAM: 600 GB

RSS:
Cores: 5
RAM: 400 GB

During a table repack, the primary generates logical logs pretty fast, while the RSS redoes them much more slowly.
Does the RSS do this by design, or are we missing some relevant settings?

------------------------------
Sincerely,
Dennis
------------------------------

6. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

Art Kagel

Posted Tue December 26, 2023 08:14 AM

Dennis:

The SMT-4 mode is optimal, so that's fine.
Make certain that the "processor unit" assignments for the LPAR are dedicated and not shared.
I am not a fan of any parity based RAID level (so 2, 3, 4, 5, 6, etc.) and when I say "not a fan" I mean that I have actively opposed the use of any of these RAID levels for over 30 years and have written extensively about why they should NEVER be used for any data that you care about. And that would be ignoring the fact that write performance on RAID5 & RAID6 is nearly 50% less than the full IO capacity of the number of data drives (so subtract the 2 parity units - yes, I know parity is distributed) if it were simply RAID0.

See this article you can download from my web site:

Why RAID5 should be avoided at all costs.

It focuses on RAID5, but everything it discusses is also applicable to RAID6. The second parity sectors only mitigate one shortcoming of RAID5 and only partially at that.

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com
------------------------------

7. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

Art Kagel

Posted Tue December 26, 2023 08:25 AM

Oh, forgot. Topas is only showing a total average CPU loading. You would be best served to monitor using mpstat -P ALL 5 5 (I think that the command line arguments are slightly different for AIX than for Linux & Solaris, so you may have to adjust). OK, I looked it up: there is no -P ALL on the AIX version of mpstat, but it will be one of these two that you want to see:

mpstat 5 5

-- or --

mpstat -v 5 5 # see virtual processor details.

If you want to see all of the cores on the physical frame, you can use:

mpstat -X -@ 5 5

Note the number of cores that have low idle time.

Art

Art S. Kagel, President and Principal Consultant

ASK Database Management

www.askdbmgt.com

Blog: http://informix-myview.blogspot.com/

Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference. Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.

8. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Sat December 30, 2023 03:30 AM

Art,

Please see the output of `mpstat 5 5` below. As you can see, no processor has low idle time.
I've added 2 cores, for a total of 7.

$ mpstat 5 5

System configuration: lcpu=28 mode=Capped

cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc
0 1756 0 0 1001 3132 101 1 1021 99 11162 45 6 0 49 0.35
1 0 0 0 177 0 0 0 0 - 0 0 0 0 100 0.22
2 0 0 0 176 0 0 0 0 - 0 0 0 0 100 0.22
3 0 0 0 174 0 0 0 0 - 0 0 0 0 100 0.22
4 1524 0 0 824 3436 108 2 1068 99 11564 40 7 0 53 0.34
5 0 0 0 173 0 0 0 0 - 0 0 0 0 100 0.22
6 0 0 0 176 0 0 0 0 - 0 0 0 0 100 0.22
7 0 0 0 177 0 0 0 0 - 0 0 0 0 100 0.22
8 1562 0 0 978 3391 27 1 131 99 9081 51 5 0 44 0.37
9 0 0 0 175 0 0 0 0 - 0 0 0 0 100 0.21
10 0 0 0 175 0 0 0 0 - 0 0 0 0 100 0.21
11 0 0 0 176 0 0 0 0 - 0 0 0 0 100 0.21
12 1171 0 0 986 3453 19 1 106 100 11202 48 5 0 46 0.36
13 0 0 0 176 0 0 0 0 - 0 0 0 0 100 0.21
14 0 0 0 175 0 0 0 0 - 0 0 0 0 100 0.21
15 0 0 0 173 0 0 0 0 - 0 0 0 0 100 0.21
16 1337 0 0 743 2195 15 0 89 99 6640 53 4 0 43 0.37
17 0 0 0 175 0 0 0 0 - 0 0 0 0 100 0.21
18 0 0 0 175 0 0 0 0 - 0 0 0 0 100 0.21
19 0 0 0 177 0 0 0 0 - 0 0 0 0 100 0.21
20 1587 0 0 869 3976 118 0 1123 99 11647 42 6 0 52 0.35
21 0 0 0 174 0 0 0 0 - 0 0 0 0 100 0.22
22 0 0 0 177 0 0 0 0 - 0 0 0 0 100 0.22
23 0 0 0 176 0 0 0 0 - 0 0 0 0 100 0.22
24 1923 0 0 659 1801 25 1 91 99 8180 58 5 0 38 0.39
25 0 0 0 175 0 0 0 0 - 0 0 0 0 100 0.20
26 0 0 0 176 0 0 0 0 - 0 0 0 0 100 0.20
27 0 0 0 178 0 0 0 0 - 0 0 0 0 100 0.20
ALL 10860 0 0 9746 21384 413 6 3629 0 69476 18 2 0 80 7.01
------------------------------------------------------------------------------------------

cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc
0 451 0 0 821 2069 94 1 701 99 6412 91 2 0 7 0.54
1 0 0 0 102 0 0 0 1 100 0 0 0 0 100 0.16
2 0 0 0 104 0 0 0 1 100 0 0 0 0 100 0.16
3 0 0 0 104 0 0 0 1 100 0 0 0 0 100 0.16
4 2145 0 0 599 2378 86 1 749 99 7696 26 6 0 68 0.31
5 0 0 0 103 0 0 0 1 100 0 0 0 0 100 0.23
6 0 0 0 104 0 0 0 1 100 0 0 0 0 100 0.23
7 0 0 0 104 0 0 0 1 100 0 0 0 0 100 0.23
8 1601 0 0 560 1888 12 0 73 99 6139 31 4 0 65 0.31
9 0 0 0 103 0 0 0 1 100 0 0 0 0 100 0.23
10 0 0 0 102 0 0 0 1 100 0 0 0 0 100 0.23
11 0 0 0 103 0 0 0 1 100 0 0 0 0 100 0.23
12 1522 0 0 532 1941 12 0 71 99 7775 29 4 0 66 0.31
13 0 0 0 104 0 0 0 1 100 0 0 0 0 100 0.23
14 0 0 0 104 0 0 0 1 100 0 0 0 0 100 0.23
15 0 0 0 105 0 0 0 1 100 0 0 0 0 100 0.23
16 2311 0 0 459 1170 8 0 59 99 6906 24 6 0 70 0.30
17 0 0 0 103 0 0 0 1 100 0 0 0 0 100 0.23
18 0 0 0 105 0 0 0 1 100 0 0 0 0 100 0.23
19 0 0 0 102 0 0 0 1 100 0 0 0 0 100 0.23
20 1872 0 0 544 2224 81 0 781 99 8485 28 5 0 67 0.31
21 0 0 0 103 0 0 0 1 100 0 0 0 0 100 0.23
22 0 0 0 104 0 0 0 2 100 0 0 0 0 100 0.23
23 0 0 0 105 0 0 0 2 100 0 0 0 0 100 0.23
24 2169 0 0 391 987 8 0 46 99 7613 22 5 0 73 0.30
25 0 0 0 103 0 0 0 1 100 0 0 0 0 100 0.23
26 0 0 0 102 0 0 0 1 100 0 0 0 0 100 0.23
27 0 0 0 102 0 0 0 1 100 0 0 0 0 100 0.23
ALL 12071 0 0 6077 12657 301 2 2503 100 51026 14 2 0 84 6.99
------------------------------------------------------------------------------------------

cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc
0 2017 0 0 834 2647 85 0 767 99 11113 44 6 0 50 0.35
1 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.22
2 0 0 0 144 0 0 0 0 - 0 0 0 0 100 0.22
3 0 0 0 144 0 0 0 0 - 0 0 0 0 100 0.22
4 2396 0 0 752 2779 78 1 788 99 12424 35 7 0 57 0.33
5 0 0 0 140 0 0 0 0 - 0 0 0 0 100 0.22
6 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.22
7 0 0 0 145 0 0 0 0 - 0 0 0 0 100 0.22
8 2273 0 0 814 2591 44 0 123 99 9473 51 6 0 43 0.37
9 0 0 0 144 0 0 0 0 - 0 0 0 0 100 0.21
10 0 0 0 142 0 0 0 0 - 0 0 0 0 100 0.21
11 0 0 0 142 0 0 0 0 - 0 0 0 0 100 0.21
12 1353 0 0 764 2608 46 1 96 99 10134 67 4 0 29 0.42
13 0 0 0 141 0 0 0 0 - 0 0 0 0 100 0.19
14 0 0 0 142 0 0 0 0 - 0 0 0 0 100 0.19
15 0 0 0 141 0 0 0 0 - 0 0 0 0 100 0.19
16 2362 0 0 500 1150 10 1 66 99 6856 21 5 0 74 0.30
17 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.24
18 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.24
19 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.24
20 1269 0 0 721 2555 80 2 714 99 7318 52 4 0 44 0.37
21 0 0 0 142 0 0 0 0 - 0 0 0 0 100 0.21
22 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.21
23 0 0 0 145 0 0 0 0 - 0 0 0 0 100 0.21
24 2041 0 0 744 2079 11 1 54 100 6703 34 5 0 61 0.32
25 0 0 0 142 0 0 0 0 - 0 0 0 0 100 0.23
26 0 0 0 143 0 0 0 0 - 0 0 0 0 100 0.23
27 0 0 0 144 0 0 0 0 - 0 0 0 0 100 0.23
ALL 13711 0 0 8128 16409 354 6 2608 0 64021 16 2 0 82 7.01
------------------------------------------------------------------------------------------

cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc
0 1487 0 0 1105 3417 123 1 1442 99 11090 71 5 0 24 0.44
1 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.19
2 0 0 0 191 0 0 0 1 100 0 0 0 0 100 0.19
3 0 0 0 188 0 0 0 1 100 0 0 0 0 100 0.19
4 880 0 0 978 3243 116 1 1279 99 11385 64 5 0 32 0.41
5 0 0 0 186 0 0 0 1 100 0 0 0 0 100 0.19
6 0 0 0 189 0 0 0 1 100 0 0 0 0 100 0.19
7 0 0 0 187 0 0 0 1 100 0 0 0 0 100 0.19
8 1645 0 0 1046 3176 25 1 172 99 9491 39 6 0 54 0.34
9 0 0 0 189 0 0 0 1 100 0 0 0 0 100 0.22
10 0 0 0 188 0 0 0 1 100 0 0 0 0 100 0.22
11 0 0 0 186 0 0 0 1 100 0 0 0 0 100 0.22
12 938 0 0 860 3124 27 1 157 99 7739 41 5 0 54 0.34
13 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.22
14 0 0 0 191 0 0 0 1 100 0 0 0 0 100 0.22
15 0 0 0 188 0 0 0 1 100 0 0 0 0 100 0.22
16 1700 0 0 847 2007 23 1 183 99 7020 32 5 0 63 0.32
17 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.23
18 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.23
19 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.23
20 1261 0 0 1094 3530 108 1 1377 99 9876 39 6 0 56 0.33
21 0 0 0 189 0 0 0 1 100 0 0 0 0 100 0.22
22 0 0 0 187 0 0 0 1 100 0 0 0 0 100 0.22
23 0 0 0 189 0 0 0 1 100 0 0 0 0 100 0.22
24 2201 0 0 832 2030 20 0 190 99 8225 30 6 0 64 0.32
25 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.23
26 0 0 0 190 0 0 0 1 100 0 0 0 0 100 0.23
27 0 0 0 188 0 0 0 1 100 0 0 0 0 100 0.23
ALL 10112 0 0 10728 20527 442 6 4821 100 64826 17 2 0 81 6.98
------------------------------------------------------------------------------------------

cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc
0 1129 0 0 951 3043 95 0 1381 99 9041 68 4 0 28 0.43
1 0 0 0 169 0 0 0 0 - 0 0 0 0 100 0.19
2 0 0 0 166 0 0 0 0 - 0 0 0 0 100 0.19
3 0 0 0 169 0 0 0 0 - 0 0 0 0 100 0.19
4 1192 0 0 932 3590 129 1 1585 99 12514 45 6 0 49 0.35
5 0 0 0 165 0 0 0 0 - 0 0 0 0 100 0.21
6 0 0 0 168 0 0 0 0 - 0 0 0 0 100 0.22
7 0 0 0 167 0 0 0 0 - 0 0 0 0 100 0.21
8 1052 0 0 885 2879 19 0 111 100 7976 40 5 0 55 0.34
9 0 0 0 170 0 0 0 0 - 0 0 0 0 100 0.22
10 0 0 0 166 0 0 0 0 - 0 0 0 0 100 0.22
11 0 0 0 169 0 0 0 0 - 0 0 0 0 100 0.22
12 1112 0 0 911 2993 16 1 128 99 7940 34 5 0 61 0.32
13 0 0 0 167 0 0 0 0 - 0 0 0 0 100 0.23
14 0 0 0 167 0 0 0 0 - 0 0 0 0 100 0.23
15 0 0 0 166 0 0 0 0 - 0 0 0 0 100 0.23
16 1777 0 0 744 2233 26 0 237 99 9329 34 6 0 60 0.33
17 0 0 0 168 0 0 0 0 - 0 0 0 0 100 0.22
18 0 0 0 168 0 0 0 0 - 0 0 0 0 100 0.22
19 0 0 0 169 0 0 0 0 - 0 0 0 0 100 0.23
20 1373 0 0 970 3433 110 2 1550 99 10473 34 6 0 59 0.33
21 0 0 0 168 0 0 0 0 - 0 0 0 0 100 0.22
22 0 0 0 166 0 0 0 0 - 0 0 0 0 100 0.22
23 0 0 0 168 0 0 0 0 - 0 0 0 0 100 0.22
24 1064 0 0 811 2208 28 0 226 99 6829 51 4 0 45 0.37
25 0 0 0 168 0 0 0 0 - 0 0 0 0 100 0.21
26 0 0 0 166 0 0 0 0 - 0 0 0 0 100 0.21
27 0 0 0 162 0 0 0 0 - 0 0 0 0 100 0.21
ALL 8699 0 0 9716 20379 423 4 5218 0 64102 16 2 0 82 7.00
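
With this many rows it is easy to miss a single busy logical CPU. Here is a minimal Python sketch for scanning the output mechanically. It assumes the AIX `mpstat` column layout shown above, with an `id` (idle %) column named in each header row; the function name and the 20% threshold are arbitrary choices for illustration.

```python
def busy_cpus(mpstat_text: str, max_idle: float = 20.0) -> list[tuple[str, float]]:
    """Return (logical cpu, idle %) for every row whose idle % is at or below max_idle."""
    idle_col = None
    flagged = []
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # Header row: locate the idle column by name rather than by position.
            idle_col = fields.index("id")
            continue
        # Skip everything that is not a per-CPU row (ALL summary, separators, prompts).
        if idle_col is None or not fields[0].isdigit() or len(fields) <= idle_col:
            continue
        idle = float(fields[idle_col])
        if idle <= max_idle:
            flagged.append((fields[0], idle))
    return flagged


if __name__ == "__main__":
    sample = """\
cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc
0 451 0 0 821 2069 94 1 701 99 6412 91 2 0 7 0.54
1 0 0 0 102 0 0 0 1 100 0 0 0 0 100 0.16
4 2145 0 0 599 2378 86 1 749 99 7696 26 6 0 68 0.31
ALL 12071 0 0 6077 12657 301 2 2503 100 51026 14 2 0 84 6.99
"""
    print(busy_cpus(sample))  # [('0', 7.0)]
```

Run against the five intervals above with the default threshold, the only row it should flag is logical CPU 0 in the second interval (7% idle).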

------------------------------
Sincerely,
Dennis
------------------------------

18. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Tue January 09, 2024 08:11 AM

Benjamin,

I wish we were running 14.10. It is 11.70 in reality.

------------------------------
Sincerely,
Dennis
------------------------------

22. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Sat December 30, 2023 03:20 AM

David,

SEC_APPLY_POLLTIME has no meaning for 11.70.

We are performing a large table repack, and the replica is lagging behind.

onstats on the primary:

$ onstat -g cluster

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:57:11 -- 361806176 Kbytes

Primary Server:elids5
Current Log Page:543836,63218
Index page logging status: Enabled
Index page logging was enabled at: 2012/08/24 16:10:30

Server ACKed Log Supports Status
(log, page) Updates
elids6 0,0 No ASYNC(RSS),Disconnected,Defined
elids6_r 543813,76895 No ASYNC(RSS),Connected,Active

$ onstat -g rss verbose

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:57:39 -- 361806176 Kbytes

Local server type: Primary
Index page logging status: Enabled
Index page logging was enabled at: 2012/08/24 16:10:30
Number of RSS servers: 2

RSS Server information:

RSS Server control block: 0x0
RSS server name: elids6
RSS server status: Defined
RSS connection status: Disconnected

RSS Server control block: 0x700001f230ae028
RSS server name: elids6_r
RSS server status: Active
RSS connection status: Connected
RSS flow control:0/0
Log transmission status: Blocked
Next log page to send(log id,page): 543813,121374
Last log page acked(log id,page): 543813,120349
Time of Last Acknowledgement: 2023-12-30.10:46:38
Pending Log Pages to be ACKed: 1032
Approximate Log Page Backlog:3015340
Sequence number of next buffer to send: 89167954
Sequence number of last buffer acked: 89167889
Supports Proxy Writes: N

$ onstat -g rss log

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:58:29 -- 361806176 Kbytes

Log Pages Snooped:
RSS Srv From From Tossed
name Cache Disk (LBC full)

elids6_r 167171226 13287270 15911638

Onstats on the replica:

$ onstat -g rss verbose

IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 18 days 15:15:20 -- 262286176 Kbytes

RSS Server control block: 0x700001e4ea85e60
Local server type: RSS
Server Status : Active
Source server name: elids5_r
Connection status: Connected
Last log page received(log id,page): 543814,29439
Sequence number of last buffer received: 89170395
Sequence number of last buffer acked: 89170395

$ onstat -g laq

IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 18 days 15:16:08 -- 262286176 Kbytes
Log Apply Info:
Thread Queue Total Avg
Size Queued Depth
xchg_1.0 0 112159631 20.15
xchg_1.1 0 7924415 5.82
xchg_1.2 0 7189459 3.64
xchg_1.3 0 2782265 6.04
xchg_1.4 0 9741266 7.09
xchg_1.5 0 29073979 19.87
xchg_1.6 0 23416392 3.57
xchg_1.7 0 20301067 2.25
xchg_1.8 0 20963291 4.72
xchg_1.9 0 2255620 22.28
xchg_1.10 0 7046118 2.34
xchg_1.11 0 5665037 2.83
xchg_1.12 0 7656817 3.50
xchg_1.13 0 2163023 5.55
xchg_1.14 0 6675688 2.43
xchg_1.15 0 6682011 3.20
xchg_1.16 0 8666645 6.72
xchg_1.17 0 3638224 4.32
xchg_1.18 0 3434523 7.04
xchg_1.19 0 3706118 5.16
xchg_1.20 0 3015094 4.42
xchg_1.21 0 4097616 5.35
xchg_1.22 0 3545439 3.19
xchg_1.23 0 6714137 5.09
xchg_1.24 0 18110006 3.48
xchg_1.25 0 4760233 5.37
xchg_1.26 0 2513045 4.01
xchg_1.27 136 195407918 105.51
xchg_1.28 0 7531591 4.96
xchg_1.29 0 13571912 12.91

Secondary Apply Queue: Total Buffers:12 Size:512K Free Buffers:0
Log Recovery Queue: Total Buffers:4 Size:8192K Free Buffers:0
Log Page Queue: Total Buffers:128 Size:4K Free Buffers:7
Log Record Queue: Total Buffers:150 Size:512K Free Buffers:4
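
To see how unevenly the work is spread over the apply threads, the `onstat -g laq` rows can be ranked by their Total Queued counts. This is a minimal Python sketch, assuming the four-column `xchg_*` rows shown above; the function name is illustrative, and the counters are presumably cumulative since server start, so they do not isolate the lag period.

```python
def apply_thread_shares(laq_text: str) -> list[tuple[str, int, float]]:
    """Return (thread, total queued, share of all queued), largest first."""
    rows = []
    for line in laq_text.splitlines():
        fields = line.split()
        # Per-thread rows look like: xchg_1.27 136 195407918 105.51
        if len(fields) == 4 and fields[0].startswith("xchg_"):
            rows.append((fields[0], int(fields[2])))
    total = sum(queued for _, queued in rows)
    shares = [(name, queued, queued / total if total else 0.0) for name, queued in rows]
    return sorted(shares, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    sample = """\
xchg_1.0 0 100 1.00
xchg_1.1 0 300 2.00
"""
    for name, queued, share in apply_thread_shares(sample):
        print(f"{name:10s} {queued:12d} {share:6.1%}")
```

In the output above, xchg_1.27 and xchg_1.0 have by far the largest Total Queued counts, and xchg_1.27 is the only thread with a non-empty queue.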

Next, we have 120 kio queues and 1 aio queue on the RSS.

------------------------------
Sincerely,
Dennis
------------------------------

23. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

David Williams

Posted Mon January 01, 2024 10:48 PM

Hi,

So in order of the flow we have:

Primary
Current Log Page:543836,63218
Next log page to send(log id,page): 543813,121374
Last log page acked(log id,page): 543813,120349
Pending Log Pages to be ACKed: 1032
Approximate Log Page Backlog:3015340

RSS
Last log page received(log id,page): 543814,29439
Queues are full.
Secondary Apply Queue: Total Buffers:12 Size:512K Free Buffers:0
Log Recovery Queue: Total Buffers:4 Size:8192K Free Buffers:0
Log Page Queue: Total Buffers:128 Size:4K Free Buffers:7
Log Record Queue: Total Buffers:150 Size:512K Free Buffers:4

Unlike 12.10 (https://www.ibm.com/docs/en/SSGU8G_12.1.0/com.ibm.adref.doc/ids_adr_1087.htm), onstat -g cluster on version 11.70 does not show "Applied Log (log, page)". Can you repeat this with "onstat -l | grep C" on the RSS?

I would say it is an apply issue.
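
The position comparison above can be sketched in Python. The `log id,page` pairs are the ones from the onstat output in this thread; `pages_per_log` is a hypothetical parameter (the logical log size is not given anywhere in the thread), and the page estimate assumes that all logical logs are the same size.

```python
from typing import NamedTuple


class LogPos(NamedTuple):
    log_id: int
    page: int


def parse_pos(text: str) -> LogPos:
    """Parse a 'log id,page' pair such as '543813,121374'."""
    log_id, page = text.strip().split(",")
    return LogPos(int(log_id), int(page))


def logs_behind(ahead: LogPos, behind: LogPos) -> int:
    """Number of logical log files between two positions."""
    return ahead.log_id - behind.log_id


def pages_behind(ahead: LogPos, behind: LogPos, pages_per_log: int) -> int:
    """Rough page distance, assuming every logical log holds pages_per_log pages."""
    return (ahead.log_id - behind.log_id) * pages_per_log + (ahead.page - behind.page)


if __name__ == "__main__":
    current = parse_pos("543836,63218")        # Current Log Page on the primary
    next_to_send = parse_pos("543813,121374")  # Next log page to send
    last_acked = parse_pos("543813,120349")    # Last log page acked
    print(logs_behind(current, next_to_send))     # 23
    print(logs_behind(next_to_send, last_acked))  # 0
```

With these numbers the primary is 23 logical logs ahead of what it has sent, while the send and acknowledge positions are in the same log.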

NOTE: There are some APARs in this area that are fixed in 14.10.FC10:

https://www.ibm.com/support/pages/apar/IT37242
IT37242: WITH DBSPACES COMPRISED OF MANY CHUNKS, FREQUENT 'CHUNK DOWN' CHECKING CAN BE VERY EXPENSIVE
The bld_logrecs thread checks for down chunks for EVERY log record that is applied.
This does not show up here, as the "Log Record Queue" is almost full, but it will not be helping.

https://www.ibm.com/support/pages/apar/IT32067
RA_Q_LIST MUTEX CONTENTION AND HOT READAHEAD SPIN LOCK WHEN THERE ARE MANY READAHEAD THREADS
which can also affect replication, certainly in 12.10, not sure if readahead is the same in 11.70!

Check storage performance on the RSS.

This could also be due to the contention between apply threads I mentioned earlier; someone from HCL may be able to comment further.

Regards,

David.

------------------------------
David Williams
------------------------------

Original Message:
Sent: Wed December 27, 2023 12:26 PM
From: David Williams
Subject: What prevents a RSS to roll forward as fast as its primary?

Hi,

Sometimes, due to the synchronization points I mentioned, having FEWER apply threads can be faster.

Try OFF_RECVRY_THREADS 31, 23, 17, 11 and see which comes out best.

Also try SEC_APPLY_POLLTIME 100 and 50 and 0.

When the lag happens send

onstat -g cluster on the primary (*)
onstat -g rss verbose on the primary (*) and the RSS
onstat -g rss log on the primary
onstat -g laq on the RSS

When the lag happens, run onmode -l to switch logs and onlog -n to dump the contents of the previous log, then count log records/commits for each partnum associated with the repack.

The starred ones tell you a lot about which two points the bottleneck lies between:

Current Log position on the primary
Current Send position on the primary
Current Acknowledged position on the RSS
Current Applied Position on the RSS

laq on the RSS also shows which replication queue on the RSS is getting backed up.

I suspect that with the repack only 1 partnum is being hit which means only 1 apply thread doing the work, not much you can do about that.

NOTE: Also check onstat -g ioq: are you using KAIO or AIO? With fewer CPU VPs there are fewer KAIO threads and so less I/O bandwidth; not sure if that makes a difference in this scenario though.

Yes Version 14 does improve throughput!

Regards,

David.

------------------------------
David Williams

Original Message:
Sent: Tue December 26, 2023 11:26 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?

Doug,

The fact is that the replica rarely lags behind the primary during regular activities.

This usually occurs when an operation is performed on a large table, such as repacking the table or building an index.

Would increasing OFF_RECVRY_THREADS help in this case?

------------------------------
Sincerely,
Dennis

Original Message:
Sent: Tue December 26, 2023 03:26 AM
From: Doug Lawry
Subject: What prevents a RSS to roll forward as fast as its primary?

Hi Dennis.

OFF_RECVRY_THREADS is most significant, as also mentioned by David:

https://www.ibm.com/docs/en/informix-servers/14.10?topic=cptarr-off-recvry-threads-recvry-threads-their-effect-fast-recovery

https://www.ibm.com/docs/en/informix-servers/14.10?topic=parameters-off-recvry-threads-configuration-parameter

Traditionally, the rule for this was the first prime number greater than three times the number of CPU VPs. Having it too low will throttle it compared to the primary, where the equivalent is the number of user sessions.
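
The traditional rule can be sketched in Python as follows. The function names are illustrative, the CPU VP counts are the ones from this thread, and whether the server accepts or benefits from values this large is not checked here.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, sufficient for small values."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0:
        return False
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def recovery_threads_rule_of_thumb(cpu_vps: int) -> int:
    """First prime strictly greater than three times the number of CPU VPs."""
    candidate = 3 * cpu_vps + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


if __name__ == "__main__":
    print(recovery_threads_rule_of_thumb(120))  # 367
    print(recovery_threads_rule_of_thumb(160))  # 487
```

Note that elsewhere in this thread much lower values (31, 23, 17, 11) are suggested for testing, so the rule is a starting point rather than a target.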

------------------------------
Doug Lawry
Oninit Consulting

24. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Tue January 02, 2024 01:07 PM

David,

(a) I was checking `onstat -l | grep C-` while the replica was lagging; it showed almost the same values as `onstat -g rss verbose`.

(b) Could apply contention be an issue if a single thread performs 99.99% of the apply work during the lag?

------------------------------
Sincerely,
Dennis
------------------------------

Original Message:
Sent: Sat December 30, 2023 03:19 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?

David,

SEC_APPLY_POLLTIME has no meaning for 11.70.

We are performing a large table repack, and the replica is lagging behind.

onstats on the primary:

$ onstat -g cluster

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:57:11 -- 361806176 Kbytes

Primary Server:elids5
Current Log Page:543836,63218
Index page logging status: Enabled
Index page logging was enabled at: 2012/08/24 16:10:30

Server     ACKed Log       Supports  Status
           (log, page)     Updates
elids6     0,0             No        ASYNC(RSS),Disconnected,Defined
elids6_r   543813,76895    No        ASYNC(RSS),Connected,Active

$ onstat -g rss verbose

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:57:39 -- 361806176 Kbytes

Local server type: Primary
Index page logging status: Enabled
Index page logging was enabled at: 2012/08/24 16:10:30
Number of RSS servers: 2

RSS Server information:

RSS Server control block: 0x0
RSS server name: elids6
RSS server status: Defined
RSS connection status: Disconnected

RSS Server control block: 0x700001f230ae028
RSS server name: elids6_r
RSS server status: Active
RSS connection status: Connected
RSS flow control:0/0
Log transmission status: Blocked
Next log page to send(log id,page): 543813,121374
Last log page acked(log id,page): 543813,120349
Time of Last Acknowledgement: 2023-12-30.10:46:38
Pending Log Pages to be ACKed: 1032
Approximate Log Page Backlog:3015340
Sequence number of next buffer to send: 89167954
Sequence number of last buffer acked: 89167889
Supports Proxy Writes: N

$ onstat -g rss log

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 5 days 10:58:29 -- 361806176 Kbytes

Log Pages Snooped:
RSS Srv    From        From       Tossed
name       Cache       Disk       (LBC full)

elids6_r   167171226   13287270   15911638

Onstats on the replica:

$ onstat -g rss verbose

IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 18 days 15:15:20 -- 262286176 Kbytes

RSS Server control block: 0x700001e4ea85e60
Local server type: RSS
Server Status : Active
Source server name: elids5_r
Connection status: Connected
Last log page received(log id,page): 543814,29439
Sequence number of last buffer received: 89170395
Sequence number of last buffer acked: 89170395

$ onstat -g laq

IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 18 days 15:16:08 -- 262286176 Kbytes
Log Apply Info:
Thread  Queue Size  Total Queued  Avg Depth
xchg_1.0 0 112159631 20.15
xchg_1.1 0 7924415 5.82
xchg_1.2 0 7189459 3.64
xchg_1.3 0 2782265 6.04
xchg_1.4 0 9741266 7.09
xchg_1.5 0 29073979 19.87
xchg_1.6 0 23416392 3.57
xchg_1.7 0 20301067 2.25
xchg_1.8 0 20963291 4.72
xchg_1.9 0 2255620 22.28
xchg_1.10 0 7046118 2.34
xchg_1.11 0 5665037 2.83
xchg_1.12 0 7656817 3.50
xchg_1.13 0 2163023 5.55
xchg_1.14 0 6675688 2.43
xchg_1.15 0 6682011 3.20
xchg_1.16 0 8666645 6.72
xchg_1.17 0 3638224 4.32
xchg_1.18 0 3434523 7.04
xchg_1.19 0 3706118 5.16
xchg_1.20 0 3015094 4.42
xchg_1.21 0 4097616 5.35
xchg_1.22 0 3545439 3.19
xchg_1.23 0 6714137 5.09
xchg_1.24 0 18110006 3.48
xchg_1.25 0 4760233 5.37
xchg_1.26 0 2513045 4.01
xchg_1.27 136 195407918 105.51
xchg_1.28 0 7531591 4.96
xchg_1.29 0 13571912 12.91

Secondary Apply Queue: Total Buffers:12 Size:512K Free Buffers:0
Log Recovery Queue: Total Buffers:4 Size:8192K Free Buffers:0
Log Page Queue: Total Buffers:128 Size:4K Free Buffers:7
Log Record Queue: Total Buffers:150 Size:512K Free Buffers:4
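One way to see how unevenly the apply work is spread is to express each thread's "Total Queued" value as a share of the sum. The sketch below is plain Python under stated assumptions: `apply_share` is a hypothetical helper, the row layout is the four-field form pasted above, and the counters are cumulative since startup (and may wrap or reset), so the result is only a rough indicator. For a per-interval picture, two snapshots would have to be compared.

```python
def apply_share(laq_text):
    """Map each xchg thread to its share of the 'Total Queued' column."""
    totals = {}
    for line in laq_text.splitlines():
        fields = line.split()
        # Data rows look like: xchg_1.27 136 195407918 105.51
        if len(fields) == 4 and fields[0].startswith("xchg_"):
            totals[fields[0]] = int(fields[2])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: count / grand_total for name, count in totals.items()}


# Three rows copied from the output above; paste the full table in practice.
SAMPLE = """\
xchg_1.0 0 112159631 20.15
xchg_1.1 0 7924415 5.82
xchg_1.27 136 195407918 105.51
"""

if __name__ == "__main__":
    shares = apply_share(SAMPLE)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.1%}")
```

In the full output, xchg_1.27 is also the only thread with a non-zero queue size (136) and has by far the highest average depth, which points at a single hot apply thread.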

Next, we have 120 kio queues and 1 aio queue on the RSS.

------------------------------
Sincerely,
Dennis

Original Message:
Sent: Wed December 27, 2023 12:26 PM
From: David Williams
Subject: What prevents a RSS to roll forward as fast as its primary?

Hi,

Sometimes, due to the synchronization points I mentioned, having FEWER apply threads can be faster.

Try OFF_RECVRY_THREADS 31, 23, 17, 11 and see which comes out best.

Also try SEC_APPLY_POLLTIME 100 and 50 and 0.

When the lag happens send

onstat -g cluster on the primary (*)
onstat -g rss verbose on the primary (*) and the RSS
onstat -g rss log on the primary
onstat -g laq on the RSS

When the lag happens, run onmode -l to switch logs, then onlog -n to dump the contents of the previous log, and count log records/commits for each partnum associated with the repack.

The starred ones tell you a lot about which two points the bottleneck is occurring between:

Current Log position on the primary
Current Send position on the primary
Current Acknowledged position on the RSS
Current Applied Position on the RSS

laq on the RSS also shows which replication queue on the RSS is getting backed up.

I suspect that with the repack only 1 partnum is being hit which means only 1 apply thread doing the work, not much you can do about that.

NOTE: Also check onstat -g ioq. Are you using KAIO or AIO? With fewer CPU VPs there are fewer KAIO threads and so less I/O bandwidth; I am not sure whether that makes a difference in this scenario, though.

Yes Version 14 does improve throughput!

Regards,

David.

------------------------------
David Williams

Original Message:
Sent: Tue December 26, 2023 11:26 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?

Doug,

The fact is that the replica rarely lags behind the primary during regular activities.

This usually occurs when an operation is performed on a large table, such as repacking the table or building an index.

Would increasing OFF_RECVRY_THREADS help in this case?

------------------------------
Sincerely,
Dennis

Original Message:
Sent: Tue December 26, 2023 03:26 AM
From: Doug Lawry
Subject: What prevents a RSS to roll forward as fast as its primary?

Hi Dennis.

OFF_RECVRY_THREADS is most significant, as also mentioned by David:

https://www.ibm.com/docs/en/informix-servers/14.10?topic=cptarr-off-recvry-threads-recvry-threads-their-effect-fast-recovery

https://www.ibm.com/docs/en/informix-servers/14.10?topic=parameters-off-recvry-threads-configuration-parameter

Traditionally, the rule for this was the first prime number greater than three times the number of CPU VPs. Having it too low will throttle it compared to the primary, where the equivalent is the number of user sessions.
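The traditional rule quoted above (first prime number greater than three times the number of CPU VPs) can be worked out with a few lines of Python. This is only an illustration of the arithmetic: the function names are made up for this sketch, and the CPU VP counts passed in are hypothetical examples, not values taken from the servers in this thread.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small thread counts."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def first_prime_above(n):
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def suggested_recovery_threads(cpu_vps):
    """Rule of thumb from the post: first prime above 3 x CPU VPs."""
    return first_prime_above(3 * cpu_vps)


if __name__ == "__main__":
    # Hypothetical CPU VP counts, for illustration only.
    for cpu_vps in (5, 10, 20):
        print(cpu_vps, "CPU VPs ->", suggested_recovery_threads(cpu_vps))
```

The rule is a starting point rather than a guarantee; as noted elsewhere in the thread, fewer apply threads can sometimes be faster, so candidate values are worth measuring.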

------------------------------
Doug Lawry
Oninit Consulting

Original Message:
Sent: Mon December 25, 2023 03:03 AM
From: Dennis Melnikov
Subject: What prevents a RSS to roll forward as fast as its primary?

We have two servers of the same architecture, IBM Power 870.
Each has storage allocated on separate IBM FlashSystem 9200.

Primary's resources:
Cores: 51
RAM: 600 GB

RSS:
Cores: 5
RAM: 400 GB

Performing a table repack, the primary generates logical logs pretty fast, while the RSS redoes them much slower.
Does the RSS do it by design, or do we miss some relevant settings?

------------------------------
Sincerely,
Dennis
------------------------------

25. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

David Williams

Posted Fri January 05, 2024 02:43 AM

Hi,

Is the repack doing lots of commits? If so, the apply threads will need to coordinate for each commit.

Regards,

David.

------------------------------
David Williams
------------------------------

26. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Tue March 05, 2024 03:12 AM

David,

After changing OFF_RECVRY_THREADS from 30 to 31, the logs are being applied about twice as fast.

------------------------------
Sincerely,
Dennis
------------------------------

27. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

Andreas Legner

Posted Tue March 05, 2024 11:23 AM

... which would suggest that the distribution of log records to those worker threads is now more even, so more threads can work on the load simultaneously (assuming there are enough CPU VPs).

Might be pure luck, might be the magic of a prime number. (An old suggestion for this parameter was: a prime number slightly above the number of CPU VPs.)

------------------------------
Andreas Legner
------------------------------

29. RE: What prevents a RSS to roll forward as fast as its primary?

Dennis Melnikov

Posted Fri December 29, 2023 09:56 AM

David,

At the moment the replica is not lagging behind, but `onstat -g rss log` shows something strange:

$ onstat -g rss log

IBM Informix Dynamic Server Version 11.70.FC5XE -- On-Line -- Up 4 days 17:00:00 -- 361806176 Kbytes

Log Pages Snooped:
RSS Srv    From        From      Tossed
name       Cache       Disk      (LBC full)

elids6_r   151268124   4920597   4512565

IMHO, the number of tossed pages is rather large, and I have no clue what is LBC, and the docs are of no help.

Next, `onstat -g laq` surprises me, too:

$ onstat -g laq

IBM Informix Dynamic Server Version 11.70.FC5XE -- Read-Only (RSS) -- Up 17 days 21:33:38 -- 262286176 Kbytes
Log Apply Info:
Thread  Queue Size  Total Queued  Avg Depth
xchg_1.0 0 17359156 78.22
xchg_1.1 0 24247308 4.61
xchg_1.2 0 25561972 10.78
xchg_1.3 0 8699038 18.23
xchg_1.4 1 21663938 75.07
xchg_1.5 0 62545351 37.23
xchg_1.6 0 18108179 21.98
xchg_1.7 0 9762415 20.15
xchg_1.8 4294967295 12843611 33.89
xchg_1.9 4294967288 9484334 40.46
xchg_1.10 0 29077454 15.19
xchg_1.11 1 22566832 13.19
xchg_1.12 1 26365147 7.40
xchg_1.13 2 9200278 7.19
xchg_1.14 0 16668520 9.16
xchg_1.15 0 8394155 20.29
xchg_1.16 3 15112685 7.76
xchg_1.17 0 9186048 12.46
xchg_1.18 1 9165889 15.12
xchg_1.19 0 6295815 22.91
xchg_1.20 0 6622899 22.21
xchg_1.21 0 7136092 17.96
xchg_1.22 0 7323773 23.89
xchg_1.23 0 8790363 12.90
xchg_1.24 0 6826952 6.61
xchg_1.25 0 7822841 16.41
xchg_1.26 0 6459145 12.81
xchg_1.27 0 7736199 14.61
xchg_1.28 0 17405971 67.20
xchg_1.29 0 16082069 23.48

Secondary Apply Queue: Total Buffers:12 Size:512K Free Buffers:11
Log Recovery Queue: Total Buffers:4 Size:8192K Free Buffers:2
Log Page Queue: Total Buffers:128 Size:4K Free Buffers:128
Log Record Queue: Total Buffers:150 Size:512K Free Buffers:132

What are these enormous figures in the 2nd column? Are they signed integer printed as unsigned? If so, what does negative queue size mean?

As for your questions,

(a) Yes, our logical logs are in 4K dbspace.

(b) Is it possible the Log Record Queue buffer size is taken from LOGBUFF, too?

And one more question,

Does a RSS_FLOW_CONTROL setting have any pitfall?

------------------------------
Sincerely,
Dennis
------------------------------

30. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

David Williams

Posted Mon January 01, 2024 10:11 PM

Hi,

"What are these enormous figures in the 2nd column? Are they signed integer printed as unsigned? If so, what does negative queue size mean?"

Yes the value is overflowing.
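The two large values are what small negative numbers look like when a 32-bit counter is printed as unsigned. The Python sketch below shows the reinterpretation; it assumes the queue size is held in a 32-bit field, which fits the values being just below 2^32 but is not confirmed anywhere in this thread, and `as_signed_32` is a helper name invented for the example.

```python
def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    if value >= 1 << 31:  # top bit set means negative in two's complement
        return value - (1 << 32)
    return value


if __name__ == "__main__":
    # Queue sizes reported for xchg_1.8 and xchg_1.9 in the output above.
    for raw in (4294967295, 4294967288):
        print(raw, "->", as_signed_32(raw))
```

Read this way, the reported sizes are -1 and -8, which suggests the counter was decremented below zero rather than that the queues really held billions of entries.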

"Does a RSS_FLOW_CONTROL setting have any pitfall?"

As per https://www.ibm.com/docs/en/informix-servers/14.10?topic=parameters-rss-flow-control-configuration-parameter

"Users connected to the primary server may experience slower response time when flow control is active."

What this means is that sessions on the primary that try to write to the logical log will be blocked.
onstat -u https://www.ibm.com/docs/en/informix-servers/14.10?topic=utility-onstat-u-command-print-user-activity-profile will show

G Waiting for a write of the logical-log buffer

NOTE: I know you are on 11.70 but 14.10 adds https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-380 so that
onstat -g rss verbose https://www.ibm.com/docs/en/informix-servers/14.10?topic=ogmo-onstat-g-rss-command-print-rs-secondary-server-information shows Number of Delays and Last Delay.

Perhaps someone from HCL can add where "Log Record Queue" size comes from!

Regards,
David.

------------------------------
David Williams
------------------------------

31. RE: What prevents a RSS to roll forward as fast as its primary?

IBM Champion

David Williams

Posted Mon January 01, 2024 10:30 PM

Hi,

" I have no clue what is LBC, and the docs are of no help."

See https://ideas.ibm.com/ideas/INFX-I-377

"From my interpretation of the source the information displayed on the primary by the onstat -g rss log command relates to the log buffer pages sent to an RSS server. These pages are copied from the logical log buffers when they are flushed and put onto an internal queue. If the RSS is connected (and caught up) then the required log pages can be taken from this queue and are counted as a cached copy. When the RSS requires a log page that is earlier than those in the queue then they have to be read directly from disk which is then counted as a disk copy. When flushing pages from the logical log buffers if the queue for the RSS is full then the page cannot be saved in the queue and it is counted as tossed. The size of the queue is fixed at 3 x the size of the logical log buffer LOGBUFF (when expressed in pages). "
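Based on that description, the three counters from onstat -g rss log can be summarized as the fraction of pages served from the cache versus read from disk, and the queue capacity follows from LOGBUFF. The Python sketch below is only an illustration of that arithmetic: the function names are invented, the interpretation of the counters is the one quoted above rather than documented behaviour, and the LOGBUFF value of 512 KB is a hypothetical example (the 4 KB page size matches the 4K dbspace mentioned in this thread).

```python
def snoop_breakdown(from_cache, from_disk, tossed):
    """Summarize 'Log Pages Snooped' counters as simple ratios."""
    sent = from_cache + from_disk  # pages delivered to the RSS either way
    return {
        "cache_fraction": from_cache / sent,
        "disk_fraction": from_disk / sent,
        "tossed_per_sent": tossed / sent,
    }


def lbc_queue_pages(logbuff_kb, page_size_kb):
    """Queue capacity in pages: 3 x LOGBUFF, expressed in pages."""
    return (3 * logbuff_kb) // page_size_kb


if __name__ == "__main__":
    # Counters from the primary while the replica was lagging (Dec 30 output).
    for key, value in snoop_breakdown(167171226, 13287270, 15911638).items():
        print(f"{key}: {value:.2%}")
    # Hypothetical LOGBUFF of 512 KB with 4 KB log pages.
    print("queue pages:", lbc_queue_pages(512, 4))
```

A queue of a few hundred pages is tiny compared with a backlog of millions of pages, which is why a lagging RSS ends up being fed from disk once it falls behind.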

I would like to know why the idea was marked Not Under Consideration. When there is lag, the last thing you want to be doing is reading log pages from disk!

In fact this applies to all Ideas marked "Not Under Consideration". This one seems a simple thing to implement and would be a clear performance win in cases of lag! On large systems you could even allocate a few gigabytes of cache, so that in case of lag at least the first few gigabytes of log would be replicated more quickly!

------------------------------
David Williams
------------------------------

32. RE: What prevents a RSS to roll forward as fast as its primary?

Benjamin Thompson

Posted Mon January 08, 2024 06:06 AM

I will go out on a limb here and say RSS_FLOW_CONTROL, at least how it is implemented, is something almost nobody wants or needs. The pitfall with the setting is that if set to '0' it is actually on: '-1' is the setting that actually turns it off. Who came up with that one?

I understand it was implemented for a very specific use case, to ensure the primary and RSS could be relied on to be exactly in sync, but to achieve this it abruptly pauses updates on the primary until the RSS catches up, which is likely to have your end users complaining.

Ben.

------------------------------
Benjamin Thompson
------------------------------
