Db2 HADR is a disaster recovery solution which uses "log shipping" to keep the remote (standby) instance in sync. Slow HADR can cause problems upstream and stall or slow down applications to the point where the system becomes unusable. Since this solution has a lot of moving parts, it can be difficult to diagnose and troubleshoot issues. In this blog entry, I'll try to address that.
In order to understand a slow HADR system, knowledge of the HADR architecture and the current HADR sync mode is an absolute must. Let me try and explain both of these as I understand them :-)
First, let me start with Sync modes:
The sync modes are set in the database configuration ( HADR_SYNCMODE ) and they are ( a quick example of checking and changing the mode follows the list ):
1) SYNC -- Both Primary AND Standby are completely in sync, i.e. a transaction is deemed committed ONLY if it has made it to the disks of both P and S. Under the hood, the commit is written to disk on P, the log buffer is sent to S, S writes it to disk and sends an ACK back to P.
2) NEARSYNC ( Default ) -- A transaction is deemed committed when P gets an ACK from S that it has received the log buffer ( into memory, not necessarily onto disk ). Under the hood, the log buffer is handed to the db2hadrp thread to ship to the S server when a commit is about to happen on P. If the network is fast enough, the overhead of getting the ACK is minimal, and committing on P and sending the log buffer to S are almost simultaneous.
3) ASYNC -- A transaction is committed on P and, at the same time, the log buffer is sent across to S. P does not have to wait for an ACK from S to commit.
4) SUPERASYNC -- Transaction log writing and shipping of logs are totally independent. The two servers P and S will never be in PEER state; the S server will always be in REMOTE_CATCHUP state.
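As a quick illustration, here is a minimal sketch of checking and changing the sync mode through the database configuration. The database name SAMPLE is just an assumption, and a sync mode change typically only takes effect after HADR is stopped and restarted:

# Show the current HADR settings, including HADR_SYNCMODE
db2 get db cfg for SAMPLE | grep -i HADR

# Change the sync mode ( requires an HADR stop/start to take effect )
db2 update db cfg for SAMPLE using HADR_SYNCMODE NEARSYNC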
In order to understand the sync modes in detail please read this excellent blog.
The next thing to know is which thread ( EDU ) does what within the engine ( a quick way to list these EDUs is shown after the Standby list ):
On the Primary side:
- db2hadrp -- Main HADR thread. It is responsible for log shipping and receiving ACKs from standby
- db2loggw -- This is the normal Log Writer thread which writes to the logs.
- db2lfr -- This is the thread which reads the Log Files ( required in SuperAsync mode )
On the Standby side:
- db2hadrs -- The main HADR thread which receives the logs from the primary and sends ACKs back to the primary. This thread is also responsible for writing the log records to the Standby's logs. There is one such thread per database
- db2shred -- This is the shredder thread which receives the log buffer from db2hadrs and shreds them into individual log records to be applied to the tablespaces. There is one such thread per database
- db2redom -- This is the Redo Master thread which receives the log records from db2shred and manages the log records. There is one such thread per database
- db2redow -- These are the Redo Worker threads which are assigned log records by the db2redom thread and replay them into the tablespaces. There are multiple such threads per database.
- db2lfr -- This is the log file reader which is sometimes required to read old log files and feed the records to the db2shred thread.
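To see these EDUs on a live system, db2pd can list them. A minimal sketch ( the grep pattern is just an illustration ):

# List all EDUs for the instance and pick out the HADR / log / redo related ones
db2pd -edus | grep -iE "hadr|loggw|lfr|shred|redo"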
With the above knowledge in hand, you can attempt to figure out what is happening to slow down HADR.
For example, in NEARSYNC mode, if you are seeing a slowdown on the Primary server ( the symptom is usually a higher number of applications in CommitActive state ), you know that an ACK is required from the Standby to commit a transaction on the Primary. So either the Standby is slow in sending the ACK for whatever reason, or there is something on the wire which is slowing down the ACK packet. You know the db2hadrs and db2hadrp threads are involved in that exchange, so you can take a quick db2trc of those EDUs ( db2trc on -l 512m -t -p <db2sysc pid>.<db2hadrp|db2hadrs EDU id for AIX or Thread id for Linux> ) and then see the flow of how P sends the packet to S, how quickly S sends the ACK and when P receives it. You don't really need source code knowledge to follow this flow ( the function names are intuitive ). You can then figure out where the slowdown is, i.e. is the Standby slow in sending the ACK, or is P receiving it much later than when S sent it, meaning there is something on the wire which requires tuning.
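Here is a minimal sketch of that trace capture. The PID and EDU id are placeholders you substitute from db2pd -edus, and the 60-second window is just an assumption:

# Find the db2sysc PID and the EDU id of db2hadrp ( on P ) or db2hadrs ( on S )
db2pd -edus | grep -i hadr

# Trace only that EDU for a short window while the slowdown is happening
db2trc on -l 512m -t -p <db2sysc pid>.<EDU id>
sleep 60
db2trc dmp hadr.trc.dmp
db2trc off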
Ok, now that you have kind of an idea of what can be done, let us get down to the nitty gritty.
What are the tools at hand to debug or set up HADR
To test network bandwidth, ideal LOGBUFSZ, TCP send and receive buffer sizes, and HADR buffer size:
- simhadr ( Read this blog for more information )
( I feel using this tool beforehand will ensure that HADR is tuned optimally right from the beginning rather than scrambling when an issue is hit. In my opinion, anyone who is planning to set up HADR should use this tool first and then go about setting up HADR. )
OS
- vmstat
- iostat
- netstat
- tprof ( AIX ), perf or oprofile ( Linux ) -- when CPU usage spikes up on the server ( a sample collection sketch follows this list )
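A minimal sketch of collecting these in the background on both servers while the slowdown is in progress ( the 5-second interval and 60 samples are assumptions, and the iostat flags shown are the Linux form; AIX flags differ ):

# CPU, memory and run queue
vmstat 5 60 > vmstat.out 2>&1 &
# Disk I/O
iostat -x 5 60 > iostat.out 2>&1 &
# TCP statistics ( e.g. retransmissions ), taken before and after the window
netstat -s > netstat.before.out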
Db2
- db2pd ( -hadr, -dpsprcb, -dpsdbcb -- the last 2 are internal options, typically run on the standby, and generate binary output )
- MON_GET_HADR table function
- db2trc
- db2fodc -hadr ( for data collection on Primary and Standby at the same time ).
- Typical HADR tuning variables
- db2set
- DB2_HADR_BUF_SIZE
- This controls the standby's log receive buffer size. The needed size depends on the anticipated log generation rate and the duration of congestion. The default is 2x the Primary's log buffer size ( LOGBUFSZ ).
- DB2_HADR_SOSNDBUF / DB2_HADR_SORCVBUF
- These control the TCP socket send and receive buffer sizes. When the send buffer is bigger than or equal to the size of a log flush, only one send() call is needed to ship the log buffer. You can read the simhadr blog, where you will find a tool named "db2flushsize" to calculate the flush size for your system ( a db2set sketch follows this list ).
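A minimal sketch of setting these registry variables; the values shown are purely illustrative and should come out of your own simhadr / db2flushsize runs, and registry changes generally require an instance restart to take effect:

# Standby log receive buffer ( value is an assumption )
db2set DB2_HADR_BUF_SIZE=16384
# TCP socket send / receive buffers, set on both servers ( values are assumptions )
db2set DB2_HADR_SOSNDBUF=1048576
db2set DB2_HADR_SORCVBUF=1048576
# Verify what is set
db2set -all | grep DB2_HADR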
HADR monitoring
From Db2 version 10.1 onwards, you can use either db2pd -hadr or db2 "select * from table(mon_get_hadr(null))" to see what is happening within HADR ( a sample query pulling just the key fields follows below ). The main things to look at are:
- HADR_STATE -- PEER is the best state. For SUPERASYNC, however, it will always be REMOTE_CATCHUP
- HADR_CONNECT_STATUS -- CONNECTED / DISCONNECTED
- HADR_CONNECT_STATUS_TIME -- If the status is CONNECTED, this shows the time when the connection was established; if the status is DISCONNECTED, it shows when HADR got disconnected
- PRIMARY_LOG_TIME -- The timestamp of the last log record on Primary
- STANDBY_LOG_TIME -- The timestamp of the last log record on the Standby. From this you can figure out how far behind the Standby is.
- HADR_LOG_GAP -- This shows, in bytes, how far the Standby is behind the Primary. If this number stays high constantly in a non-SUPERASYNC mode, it means the Standby is not able to keep up with the Primary or has slowed down for whatever reason. The type of log records being applied also matters, as some of them block at higher levels, i.e. tablespace, database, etc. A tablespace-blocking record ( such as adding an extent to an object, or merging index leaves as done by a reorg ) cannot be parallelized, so all tables within that tablespace have to wait before proceeding. The worst is a database-blocking record ( such as log migration ), where everything has to stop until that record is processed and only then can the Standby proceed.
- STANDBY_RECV_BUF_PERCENT -- Reports the percentage of the Standby's receive buffer in use. If it reaches 100%, the Primary will block on sending further log buffers till there is some space in the buffer. However, if Standby spooling is set up, this is not a problem as the Standby will start spooling the log buffers for replay.
- LOG_HADR_WAIT_CUR -- How long the primary logger has been blocked waiting on log replication. This is important as it tells you the Standby is not able to receive log buffers from the Primary. If this number keeps growing, you have a problem on your hands.
There is an excellent blog on HADR monitoring which goes into this in much more detail.
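As promised above, here is a minimal monitoring sketch that pulls just the fields discussed. The database name SAMPLE is an assumption; note that MON_GET_HADR can only be queried on the Primary or on a reads-enabled Standby, otherwise use db2pd -hadr there:

db2 connect to SAMPLE
db2 "SELECT HADR_STATE, HADR_CONNECT_STATUS, HADR_LOG_GAP,
            STANDBY_RECV_BUF_PERCENT, LOG_HADR_WAIT_CUR,
            PRIMARY_LOG_TIME, STANDBY_LOG_TIME
      FROM TABLE(MON_GET_HADR(NULL))"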
What to do when HADR is slow
1) First off, start by monitoring either db2pd -hadr or MON_GET_HADR and check the fields of interest specified above.
2) If you see that the Standby is slowing down ( the log gap is increasing, the standby receive buffer is close to 100% ) and you know for certain that HADR is tuned properly as far as DB2_HADR_BUF_SIZE and DB2_HADR_SOSNDBUF/SORCVBUF are concerned -- in other words, you have run simhadr and db2flushsize to set the parameters correctly -- initiate data collection:
- Option#1: db2fodc -db <dbname> -hadr -host all -- this will invoke the script db2cos_hadr on both the Primary and the Standby
- Option#2: If the event is short lived and you are sure it was an HADR Congestion issue, then you can use the following:
- db2fodc -db <dbname> -hadr -detect -- It will keep an eye on HADR congestion and, the moment it happens, it will trigger data collection on both Primary and Standby ( more info at this link )
- Option#3: Run the script collectPrimary.ksh <dbname> [ <outputdir> ] on Primary, collectStandby.ksh <dbname> [ <outputdir> ] on Standby
- Option#4: If the issue is hit at odd times, you can automate the data collection using socket scripts which can talk to each other across servers
- HADRTriggerPrimary.pl
- HADRTriggerStandby.pl
- Since these scripts follow a client/server architecture, you have to start HADRTriggerStandby.pl first on the Standby. Syntax is:
- HADRTriggerStandby.pl -serverport <port#> -scriptname <scriptname> [ -scriptargs <args> ] -- where you choose a free port on the server to which the other side can connect. Example syntax: HADRTriggerStandby.pl -serverport 90010 -scriptname /tmp/collectStandby.ksh -scriptargs sample
- Now start HADRTriggerPrimary.pl on Primary. Syntax is:
- HADRTriggerPrimary.pl -remhost <host> -remport <port> -dbname <dbname> -scriptName <name> [ -scriptargs <args> -num <num app> -trigger <time> -sleep <time> -appstatus <status> ] -- where you provide the standby host name, and port# you started HADRTriggerStandby.pl with. Example syntax:
- HADRTriggerPrimary.pl -remhost remote.server.com -remport 90010 -dbname sample -scriptname /tmp/collectPrimary.ksh -scriptargs sample -num 10 ( with default values for trigger, sleep and appstatus, this means it will wake up every 1.5 seconds to check on the number of applications in CommitActive state; if they go beyond 10, it'll trigger data collection on both Primary and Standby ).
- The main difference between Option#2 and Option#4 is that Option#2 looks only for congestion, whereas Option#4 is a bit more flexible and the user can provide a specific application status which they observe as a sign of an HADR slowdown.
For the adventurous, you can attempt to format the db2trc output and look at the stacks to see if any bottleneck can be spotted ( use the db2trc perfrep option if on 10.1, 10.5 or 11.1 ). However, if you have been able to collect the above data, you are already in much better shape to provide relevant data for the PMR.
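If you do go down that path, here is a minimal formatting sketch, assuming the dump file name hadr.trc.dmp from the capture example earlier:

# Chronological flow of function entry / exit per EDU, with timestamps
db2trc flw hadr.trc.dmp hadr.trc.flw -t
# Fully formatted trace records
db2trc fmt hadr.trc.dmp hadr.trc.fmt
# Per-function performance report ( perfrep option on 10.1, 10.5 and 11.1 )
db2trc perfrep hadr.trc.dmp hadr.trc.perfrep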
I hope this blog post has helped you in some way to find your way through HADR slowdowns.
DISCLAIMER: The scripts provided in this blog are as-is without any support.