AIOps

RTEMS Reporting High CPU

By IMWUC Community Team posted Fri October 20, 2017 06:06 AM

  

by Sandra Jones

We had reports of an RTEMS using high levels of CPU, which then resulted in the RTEMS crashing. The first step was to review the logs to see if any messages could explain the issue.

In the logs, the following message was found:

(56D699F0.0000-2F0:kdepenq.c,124,"KDEP_Enqueue") (6019:1918) receive limit (8192) reached: 0.0.0.75

This is normally an indication that too many agents are connected to an RTEMS, or that an agent is trying to send too much data back to a TEMS. The limit can be tuned; the default maximum size of data that an agent will attempt to return in one RPC request is 4096 KB.
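When reviewing a large set of logs, it can help to pull out every occurrence of this message and count them. The following is a minimal Python sketch, not an IBM tool. The pattern is based only on the single log line quoted above, so other builds may format the message differently, and treating the trailing field as the sender is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this post (assumption:
# other TEMS versions may word or lay out this message differently).
RX_LIMIT = re.compile(r'"KDEP_Enqueue"\).*?receive limit \((\d+)\) reached: (\S+)')


def find_rx_limit_hits(lines):
    """Return (limit, trailing_field) for each 'receive limit reached' line."""
    hits = []
    for line in lines:
        match = RX_LIMIT.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


def count_by_source(hits):
    """Count hits per trailing field (assumed here to identify the sender)."""
    return Counter(source for _limit, source in hits)


sample = [
    '56D699F0.0000-2F0:kdepenq.c,124,"KDEP_Enqueue") (6019:1918) '
    'receive limit (8192) reached: 0.0.0.75',
    'an unrelated log line',
]
hits = find_rx_limit_hits(sample)
counts = count_by_source(hits)
```

In practice the `sample` list would be replaced by the lines read from the RTEMS log files.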

If the size exceeds this limit, the error KDE1_STC_RXLIMITEXCEEDED should be seen. The limit exists to avoid possible memory overruns from an unusually large RPC request. If the limit leads to problems processing requests, consider increasing it by adding or editing the KDCFC_RXLIMIT value in the %CANDLE_HOME%\CMS\KBBENV file (on Windows) or

The maximum value allowed is 65536; the minimum value allowed is 1024.
For example, to change it to 32768 KB (32 MB):
KDCFC_RXLIMIT=32768 
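Before editing the environment file, the new value can be checked against the range given above. This is a small illustrative Python sketch; the helper names `rxlimit_line` and `kb_to_mb` are hypothetical and are not part of any IBM tooling.

```python
# Bounds as stated in this post, in KB.
RXLIMIT_MIN_KB = 1024
RXLIMIT_MAX_KB = 65536


def kb_to_mb(kb):
    """Convert a KB value to MB (1 MB = 1024 KB)."""
    return kb / 1024


def rxlimit_line(kb):
    """Return the KDCFC_RXLIMIT line for the env file, or raise if out of range."""
    if not RXLIMIT_MIN_KB <= kb <= RXLIMIT_MAX_KB:
        raise ValueError(
            f"KDCFC_RXLIMIT must be between {RXLIMIT_MIN_KB} and "
            f"{RXLIMIT_MAX_KB} KB, got {kb}"
        )
    return f"KDCFC_RXLIMIT={kb}"


line = rxlimit_line(32768)
size_mb = kb_to_mb(32768)
```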

However, in this case this was not the issue. Further diagnostic work had to be done using the tool published here:

https://www.ibm.com/developerworks/community/blogs/jalvord/entry/sitword_tems_audit_process_and_tool?lang=en 

A trace of:
KBB_RAS1='error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kdssqprs in er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)'

was set at the TEMS, left running for an hour, and then turned off again.

This is very important if you use this tool: the tracing should only run for a short time, because it is verbose and only a short window is needed to measure what the situations are doing. A report was then produced from the trace logs using the Perl script given in the above blog. It identified one situation as the main user of the CPU. On review, that situation was found to use more than one multi-row attribute group.

In this case the situation scanned a file and then checked for a missing process. Both of these return more than one row of data. A situation like this cannot be created in the TEP Situation Editor, but it can be created by hand and inserted into the TEMS with tacmd editsys. The situation did not error, but it caused massive amounts of data to be sent to the RTEMS which, once the situation was distributed to a number of agents, caused the RTEMS to overload and shut down. Once the situation was deleted, the RTEMS functioned normally.
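A quick way to spot hand-built situations of this kind is to check how many attribute groups a formula references. The Python sketch below is only illustrative. It assumes attributes appear in the formula as `Group.Attribute`, and the formula text and the group names `Log_Entries` and `Process` are hypothetical examples, not the situation from this case. It cannot tell whether a group is multi-row, so anything it flags still needs manual review.

```python
import re

# Assumption: attributes are written as Group.Attribute in the formula text.
ATTRIBUTE = re.compile(r"\b([A-Za-z_]\w*)\.[A-Za-z_]\w*")


def attribute_groups(formula):
    """Return the sorted attribute group names referenced in a formula."""
    # Drop quoted values first so a value like 'myapp.bin' is not
    # mistaken for a Group.Attribute reference.
    unquoted = re.sub(r"'[^']*'", "", formula)
    return sorted(set(ATTRIBUTE.findall(unquoted)))


def uses_multiple_groups(formula):
    """True when the formula references more than one attribute group."""
    return len(attribute_groups(formula)) > 1


# Hypothetical formula combining a file scan with a missing-process check.
example = (
    "*IF *SCAN Log_Entries.Text *EQ 'ERROR' "
    "*AND *MISSING Process.Command *EQ ('myapp.bin')"
)
groups = attribute_groups(example)
flagged = uses_multiple_groups(example)
```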
