IBM i Global

Batch performance degradation in OS 740 ?

  • 1.  Batch performance degradation in OS 740 ?

    Posted Tue September 13, 2022 09:34 AM
    All, 
    I have a question about performance after upgrading from OS 730 to OS 740. Since the upgrade, the nightly batch process takes much longer, and I'm still looking at system level for the cause. I have to say there was already a CPU resource issue in 730: the system ran with 2 processor cores, which were fully used during the night (from 10 pm to 5 am). Now, running in 740, I have enabled a third processor (via uncapped processing mode) to get the same batch runtimes, and this third processor is now fully used as well.
    About the application: it is an older ERP application (a mix of native programming and SQL) for which a lot of indexes are advised, and where the number of full DB open operations is very high (in some programs/jobs). So improvements can be made there.
    But I'm looking for a stronger explanation for this additional CPU utilization in relation to the OS upgrade.

    Any information, ideas, or recommendations?
    I am already in contact with IBM support, but no cause has been found or identified yet ...

    Thanks,
    Jozef Thijs
    Kyndryl Belgium

    ------------------------------
    Jozef Thijs
    ------------------------------


  • 2.  RE: Batch performance degradation in OS 740 ?

    Posted Tue September 13, 2022 08:05 PM
    Edited by Satid Singkorapoom Tue September 13, 2022 11:37 PM
    Dear Jozef

    In my long experience helping many IBM i customers with issues similar to yours, I can say there are many possible explanations for batch runtime degradation. Unfortunately, these can be identified only when the customer keeps a sample of the performance profile for the entire batch run by capturing a number of key PDI performance charts before the problem appears, and then compares them with the charts from the period at issue. For example, you should have at least the following charts: disk read/write response time, workload amount (DB IO per second), Wait Overview and Wait by Generic Job or Task, Physical Disk IO Overview and Physical Disk IO by Generic Job or Task, and (Memory) Page Fault by Generic Job or Task.

    If no such past PDI charts were kept (do this for a few days known to have peak/high workload), the best we can do is to look at these PDI charts for the current days at issue and identify the current performance bottleneck. I hope you will read all 5 of my articles on which PDI charts are useful and how to analyze them here:   https://www.itjungle.com/author/satid-singkorapoom/

    If you need my help here, we can start with you capturing the PDI charts on Wait Overview and Wait by Generic Job or Task for the batch run duration and posting them here, and I will try to analyze them for you. We will likely need to see more PDI charts after I see the Wait charts.


    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------



  • 3.  RE: Batch performance degradation in OS 740 ?

    Posted Wed September 14, 2022 10:46 AM
    Satid,

    Thanks already for this feedback. I will certainly verify every point you mention.

    In fact, in OS 730 the nightly batch processing completed well within the 10 pm to 6 am window, although the server was limited on CPU resources (less than 2 cores). The iDoctor time signature overview clearly showed a CPU queuing issue (only during the night), but the night processing still completed before 6 am.
    After upgrading to OS 740, still running with the same CPU resources and memory pools, the CPU queuing is still clearly visible, and over a longer period, as the batch processing now completes around 10 am.
    Currently, more CPU resources are assigned (increased to 2.7 and set to uncapped, up to a maximum of 3 processors). So, at this moment, my batch process is running in an acceptable window again (completed before 6 am), but I can't explain this high CPU requirement to the customer.
    So, at this moment, with 2.7 uncapped, I see that the 3 processors are fully used during the batch window. Looking at the iDoctor wait time signature overview, the red 'dispatched CPU' is dominant: about 90%-95% is dispatched CPU, and the other (very small) waits are related to 'disk page faults' and 'disk non fault reads'. So, at this moment, I don't see any direct indication!

    Yesterday, I also re-enabled parallel processing (it had been disabled through QAQQINI, set to *NONE), but during the past night no real improvement was seen, as the CPU utilization (also with 3 processors available) is very high.

    Any further recommendation is welcome.

    Thanks,
    Jos (Jozef) Thijs
    Kyndryl Belgium

    ------------------------------
    Jozef Thijs
    ------------------------------



  • 4.  RE: Batch performance degradation in OS 740 ?

    Posted Wed September 14, 2022 08:01 PM
    Edited by Satid Singkorapoom Thu September 15, 2022 04:31 AM
    Dear Jozef

    Using DB2 SMP when there is a substantial amount of CPU queuing wait would not deliver much performance benefit. And if, after a few more nights of batch runs, the runtime does not improve back to what it was before the upgrade, then it may not be about missing autonomic indexes.

    You should also look at the Wait by Generic Job or Task chart and focus on the jobs of the batch process. Compare the proportion of Dispatched CPU Time versus waits to check whether wait time is excessive and, if so, which specific wait is dominant for those jobs. Do not look at Wait Overview alone, as the overview can obscure subtle details.

    You did not mention disk response time at all, but this is also an important performance factor. Please compare disk response time during the batch run before versus after the upgrade to see whether it degrades substantially during the batch run period. If so, then compare Physical Disk IO Overview before and after the upgrade to see whether it increases substantially. If confirmed, compare the disk IO workload of the batch jobs before and after the upgrade to confirm its substantial increase. If so, compare DB IO per Second before and after the upgrade (read case 5 of my article). If it increases substantially, then it should finally mean there is now more workload for batch processing to handle, which explains the longer runtime.
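
    The chain of before/after comparisons above can be written down as a small checklist. This is only a sketch: the `Metric` class, the function name, the 20% threshold standing in for "substantially", and all sample numbers are assumptions for illustration, not values from PDI or from this system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Metric:
    """One measurement averaged over the batch window (hypothetical structure)."""
    name: str
    before: float  # value before the upgrade
    after: float   # value after the upgrade

    def increased(self, threshold: float = 0.20) -> bool:
        # "Substantially" is assumed here to mean more than +20%.
        return self.after > self.before * (1.0 + threshold)


def first_unconfirmed_step(chain: List[Metric]) -> Optional[str]:
    """Walk the comparisons in order and return the name of the first metric
    that did NOT increase substantially, or None if every step is confirmed
    (which would point to a genuine increase in batch workload)."""
    for metric in chain:
        if not metric.increased():
            return metric.name
    return None


# Hypothetical sample numbers, for illustration only.
chain = [
    Metric("disk response time (ms)", 3.0, 6.5),
    Metric("physical disk IO overview (IO/s)", 4000, 6000),
    Metric("physical disk IO of batch jobs (IO/s)", 3000, 4800),
    Metric("DB IO per second", 20000, 20500),
]

stopped_at = first_unconfirmed_step(chain)
print(stopped_at or "all steps confirmed: batch workload has increased")
```

    If the walk stops early, the metric it stops at is where the "more workload" explanation breaks down and another cause should be sought.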

    If a batch workload increase is confirmed, you can improve the runtime by improving disk response time, and by checking for and rectifying non-optimized object authority assignments for the user profiles that run batch jobs, which can negatively affect batch runtime (read case 1 of my article). Creating useful indexes according to the System-wide Index Advisor can also help with the SQL/Query portion of the batch workload.

    Another thing to do is to review the job logs of the batch jobs. Choose a few of the longest-running jobs and check whether you see any message that appears repeatedly and looks out of place.

    One last point that comes to my mind: does your customer have an in-house application development team that keeps delivering program modifications/enhancements? If so, check whether the AD team happened to deploy new code at the same time as this upgrade. If so, the new code may increase the workload by intention or, if there is no such intention, it may contain bad coding that unnecessarily increases CPU consumption.

    In my past experience with customers' in-house AD, too many programmers do not know good coding from bad in terms of performance implications, at least in ASEAN. But this may not be easy to prove, as we may not be able to identify such code. There were very few cases in which I could, by sheer luck, identify bad code that caused a performance issue, but this was sadly rare and I was reluctant to do this on my own.

    Let me know the progress if you still cannot solve the issue.  


    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------



  • 5.  RE: Batch performance degradation in OS 740 ?

    Posted Thu September 15, 2022 11:49 AM
    Dear Satid, 

    Thanks very much for your feedback & recommendations.
    I picked up your point about the IO area. The disk response times were fine (similar values before and after the upgrade, hovering around 3 ms, with no increase found during the batch window). But I investigated the disk IOs further and found a strong increase in the average IO rate (from 4000 IOs/sec before the upgrade to 6000 IOs/sec after), with peaks now reaching 18K IOs/sec, while before the upgrade the highest peaks during the batch window were around 11K.
    Looking further, several jobs show a strong increase in R/W operations, mostly write operations. In fact, for a lot of jobs, I now see many 'async NON-DB write' operations, which were almost non-existent before the upgrade. For some jobs, there are even millions of these NON-DB write operations.

    It looks like almost all jobs are impacted by these NON-DB write operations.
    I have to continue the investigation... but any recommendation is welcome.
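
    To put the IO figures quoted above into relative terms, here is a minimal calculation. The helper name is just for illustration; the four input values are the ones given in this post.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures from the post: average and peak physical IO rates (IOs/sec).
avg_growth = percent_increase(4000, 6000)
peak_growth = percent_increase(11000, 18000)

print(f"average IO rate: +{avg_growth:.0f}%")
print(f"peak IO rate:    +{peak_growth:.0f}%")
```

    So the average rate is up by half and the peaks by roughly two thirds.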

    Thanks,

    Jos (Jozef) Thijs


    ------------------------------
    Jozef Thijs
    ------------------------------



  • 6.  RE: Batch performance degradation in OS 740 ?

    Posted Thu September 15, 2022 08:08 PM
    Edited by Satid Singkorapoom Thu September 15, 2022 08:37 PM
    Dear Jozef

    Considering that you added one more CPU core and yet all cores are still highly busy during the batch run, this implies significantly more workload, and therefore it should not be surprising that the disk IO rate also increases. The question is whether this workload increase comes from more core business data to process in the batch run, from missing MTIs causing more unnecessary table scans for SQL/Query, or from other causes.

    Non-DB disk writes are about such things as spool files, job logs, data area, data queues, etc.  You can try to reduce unnecessary spooled file creation by enabling what is called job log pending:  https://dawnmayi.com/2011/07/11/job-log-pending/ 

    What about comparing PDI chart on Total Logical DB IO per Sec during the batch run before and after the upgrade?  If it is substantially higher after the upgrade, then it can serve as a potential explanation that the workload just increases.

    While waiting for the item above, another thing you can explore is to dump a Plan Cache Snapshot and check whether batch jobs with SQL/Queries perform excessive, unnecessary table scans on tables larger than 500 MB. The best way to do this is to dump the Plan Cache early in the morning and set the snapshot filtering parameters to start at last night's batch run start time and to include only jobs taking longer than 1 second.

    My guess is that you have never run a meta-data query on at least the number of MTIs in your production data library (there is now a new alternative service function, MTI_INFO) to compare before and after the upgrade. (I think the Plan Cache Snapshot contains the MTI count as well.) So, proving a theory about missing MTIs is not easy, but I say from my past experience that this is a factor we cannot discard when there is substantial SQL/Query workload and not a sufficient number of permanent indexes. But relying on MTIs is not a sensible policy. It is better to identify and create useful permanent indexes, which can be done conveniently in DB2i using the System-wide Index Advisor. You can ask the Lab Services team to help with this. I delivered this task to some 30 customers in ASEAN in the 10 years before I retired from the team, and the result was always a satisfactory performance improvement. The more SQL/Query workload, the better the improvement.

    Have you checked whether the customer deployed any code change in the batch programs at the same time as the upgrade? Another thing to ask is whether they increased the number of concurrent batch jobs.

    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------



  • 7.  RE: Batch performance degradation in OS 740 ?

    Posted Fri September 16, 2022 10:32 AM
    Dear Satid,

    Regarding the logical IO operations, they are almost the same. Even at job level, I see almost the same number of logical IO operations. And looking at the batch window, the average number of logical IO operations is even slightly lower now, running in 740. So, in fact, the batch workload has not really increased.
    I'm currently looking into the SQL Plan Cache to verify the advised indexes.

    I investigated the NON-DB write operations further, and they seem to be related to DB trigger processing. For all jobs that update database files with triggers enabled, I see this large number of NON-DB write operations. If these triggers are disabled, the NON-DB write operations are not detected.

    These trigger programs are probably not well optimized as trigger programs (they also perform full DB open operations). But in 730, the jobs and programs ran and completed in time, while now, in 740, these trigger programs are causing a strong workload increase (seen in all these NON-DB write operations), leading to long runs and more CPU utilization.

    Hopefully, this trigger behavior can be analyzed and investigated.

    Kind regards, 

    Jos

    ------------------------------
    Jozef Thijs
    ------------------------------



  • 8.  RE: Batch performance degradation in OS 740 ?

    Posted Fri September 16, 2022 08:18 PM
    Edited by Satid Singkorapoom Sat September 17, 2022 01:56 AM
    Dear Jozef

    If the logical DB IO per second workload appears similar to what it was before the upgrade but the physical disk IO workload has increased, then my best theory is that the MTIs are now gone due to the IPL and, somehow, have not been re-created in exactly the same number for the same set of tables. This causes unnecessary table scans, which increase the physical disk IO workload even when the logical DB IO workload remains the same. So, it is sensible to check from a Plan Cache snapshot whether there are a lot of unnecessary table scans in the long-running batch jobs. If this turns out to be the case, useful indexes need to be created. Please use Visual Explain to analyze whether long-running SQL/Queries can benefit from an index, using the guidelines described in the article that I attach herewith.

    As for the triggers, do they contain a substantial amount of SQL/Query, or do they use 100% SQL with no native IO at all? If so, the increased physical disk IO workload may be the result of missing MTIs. You can try focusing on creating indexes for tables larger than 100 MB that the trigger programs access and see whether that improves the situation.

    If the triggers use only native IO with no SQL/Query at all, please check the IBM i 7.4 Memo to Users document, focusing on anything relevant to the code or operations used in the triggers, to see if there is any change that may negatively affect their operation or performance.

    By the way, please run the WRKSYSACT command and make sure that "Average CPU Rate" displays a value not much lower than 100%. If it shows a value substantially lower than 100, then something is not right with the CPU. I ask this because I encountered 3 cases in the past of a bug in server firmware (POWER8 and 9) that reduced the CPU clock rate by half and caused a performance problem. Using the most current level of server firmware solved this bug. In such cases, Average CPU Rate showed a value around 50%.

    One last point: in the only iDoctor chart you provided previously, I notice an LIC task named SMXCAGER01. From a Google search, I found this is an LIC task for the Expert Cache Ager in memory pool 1, which is the Machine Pool. This task produced a total of 800 physical disk IOs, which may be high for the Machine Pool, which does not tolerate much page faulting. Can you capture the WRKSYSSTS screen for me to see? I suspect the pool size for the Machine Pool may be too small. If this is the case, it can slow down all long-running jobs in the system.

    I learned long ago from an expert in IBM Rochester that a proper size for the Machine Pool is somewhere between 2 and 2.5 times its RESERVED SIZE, which can be seen on the WRKSYSSTS screen. Please allocate 2.5 times the reserved size. But if you turn on the automatic performance adjuster (QPFRADJ), you also need to make sure the Machine Pool is not reduced by this function, by using the WRKSHRPOOL command to set the MIN size of the Machine Pool to 2.5 times its reserved size. For its MAX size, you just add 0.1% to the MIN value. This is what I always do to prevent system slowdown. And the MAX ACT parameter of pool 2 (*BASE) should be set to a value not less than 1,000.
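
    As a worked example of the Machine Pool sizing rule above, here is a small sketch. It assumes that the MIN and MAX sizes are entered as percentages of total main storage and that "add 0.1%" means 0.1 percentage points; the function name and the sample figures (2,000 MB reserved, 50,000 MB total) are hypothetical and should be replaced with the values from your own WRKSYSSTS screen.

```python
def machine_pool_limits(reserved_mb: float,
                        total_memory_mb: float,
                        factor: float = 2.5,
                        max_margin_pct: float = 0.1):
    """Return (target size in MB, MIN size %, MAX size %) for the Machine Pool.

    factor         : multiple of the reserved size (2 to 2.5 per the rule above)
    max_margin_pct : assumed to be percentage points added to MIN to get MAX
    """
    target_mb = reserved_mb * factor
    min_pct = target_mb * 100.0 / total_memory_mb
    max_pct = min_pct + max_margin_pct
    return target_mb, min_pct, max_pct


# Hypothetical system: 2,000 MB reserved in the Machine Pool, 50,000 MB total memory.
target_mb, min_pct, max_pct = machine_pool_limits(2000, 50000)

print(f"target Machine Pool size: {target_mb:.0f} MB")
print(f"MIN size: {min_pct:.2f}%  MAX size: {max_pct:.2f}%")
```

    Keeping MAX only slightly above MIN effectively pins the pool, so the performance adjuster has almost no room to shrink or grow it.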
     



    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------



  • 9.  RE: Batch performance degradation in OS 740 ?

    Posted Sat September 17, 2022 01:59 AM
    Here is a useful article providing guidance on how to analyze an access plan in Visual Explain.

    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------




  • 10.  RE: Batch performance degradation in OS 740 ?

    Posted Tue September 13, 2022 08:50 PM
    Edited by Satid Singkorapoom Tue September 13, 2022 11:40 PM
    As a side note, let me give you some samples from my experience of performance issues encountered after an IBM i release upgrade.

    1) Memory pool size and its corresponding MAX ACT value reduced after the IBM i release upgrade. (This issue does not always happen.) If we keep a screenshot of WRKSYSSTS from before the upgrade and compare it to one taken after, we may sometimes notice that the memory pool size of the Machine Pool and/or Base Pool has been reduced, which contributes to the performance issue through an increased memory fault rate and/or fewer concurrent threads. If these reductions are not significant, little or no performance problem is detected. Adjusting the reduced values back should solve the issue. But then the next issue may come up.

    2) Object First Touch Overhead. This is not widely known in IBM i circles. After every IPL, when a job accesses any object for the first time, IBM i takes more CPU cycles and disk IO to read and build a repository of the object header information (and a few more things?) in an OS structure for faster subsequent access to that object (until it is deleted again at the next IPL). If the batch process jobs are the first to access a lot of objects at the same time or in immediate succession, this overhead will slow down the entire batch runtime. But this overhead happens only once for each object (until the next IPL), and subsequent batch runs should be faster as the overhead no longer occurs. But if, between the end of the IPL and the start of the batch process, there are a lot of interactive jobs (or a data backup job) running and accessing most or all of these objects, then the first batch run after the IPL will not suffer this overhead because the interactive jobs suffer it instead.

    3) Missing Autonomic Indexes (Maintained Temporary Index - MTI). Db2i can create autonomic indexes for SQL statements and Queries that run very frequently without any "permanent" indexes in place, but all MTIs are cleared away at IPL. This means that SQL and Queries that benefit from MTIs will suffer after an IPL, and you have to let them run for some days before the MTIs are created again. But the best practice is to analyze the DB2i System-wide Index Advisor output and create permanent indexes so that the workload no longer depends on MTIs.

    More on MTI here : https://developer.ibm.com/articles/i-manage-mti/

    Does your batch process still take a long time to run after its first run following the IPL for the IBM i release upgrade? If so, it may not be the object first touch overhead issue but more likely the missing autonomic indexes. Or it may be that there is simply much more workload to process, and you need to compare the DB IO per Second PDI charts (mentioned in case 5 of my article) before and after the upgrade to see this.


    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------



  • 11.  RE: Batch performance degradation in OS 740 ?