by Aldo Bucossi
In recent years, significant improvements have made data collection for the Process attribute group more stable under all conditions: it no longer depends
on the number of monitored processes, the length of the command line, or other elements that caused trouble in the past.
Despite this, it can still be useful to know what to look at when the output of the "All Process" workspace view is not what you expect.
For example, I recently worked on a case where, from time to time, some processes were not shown in the view, which could also have affected the reliability of situations relying on this attribute group.
Process data collection is performed directly by the kuxagent process.
By enabling traces on specific modules, you can see how the agent discovers the processes and the information collected for each process.
The needed traces are:
KBB_RAS1=ERROR (UNIT:kux03agt ALL) (UNIT:get_ps_table ALL) (UNIT:get_process ALL)
You can set these traces dynamically using the command line (tacmd settrace), or you can put them into the ux.ini file and then restart the agent.
When the traces are active, you can find rows like the following in the agent log:
(56F295CC.0EE0-4:get_ps_table.cpp,1093,"display_ras1_output") Entry
(56F295CC.0EE1-4:get_ps_table.cpp,1099,"display_ras1_output") iteration: 5306 - cmd:/opt/IBM/ITM/aix526/ux/bin/kuxagent , uid:200, pid:2344190, ppid:1
(56F295CC.0EE2-4:get_ps_table.cpp,1102,"display_ras1_output") uxdc_output CMD:.../kuxagent
(56F295CC.0EE3-4:get_ps_table.cpp,1105,"display_ras1_output") uxdc_output COMMAND:/opt/IBM/ITM/aix526/ux/bin/kuxagent
(56F295CC.0EE4-4:get_ps_table.cpp,1108,"display_ras1_output") uxdc_output BASE_COMMAND: kuxagent
(56F295CC.0EE5-4:get_ps_table.cpp,1111,"display_ras1_output") uxdc_output RT ALLP F: 240001A S:A UID:200 pid:23440190 ppid:1
(56F295CC.0EE6-4:get_ps_table.cpp,1119,"display_ras1_output") uxdc_output C:5 PRI:60 NI:20 SZ:52888 VSIZE:45580
(56F295CC.0EE7-4:get_ps_table.cpp,1126,"display_ras1_output") uxdc_output WCHAN:* TTY:- 00017:27.../kuxagent TIME:00017:27.../kuxagent
(56F295CC.0EE8-4:get_ps_table.cpp,1131,"display_ras1_output") uxdc_output GID:806 EUID:0 EGID:806 PGID:2344190
(56F295CC.0EE9-4:get_ps_table.cpp,1137,"display_ras1_output") uxdc_output session_ID:9830618 sche_class:N/A CPU_ID:-1
(56F295CC.0EEA-4:get_ps_table.cpp,1142,"display_ras1_output") uxdc_output start_time:1160321073619000002d05:34:17 elapsed_time:002d05:34:17
(56F295CC.0EEB-4:get_ps_table.cpp,1146,"display_ras1_output") uxdc_output pctcpu:0 pctmem:100 tpctcpu:54
(56F295CC.0EEC-4:get_ps_table.cpp,1151,"display_ras1_output") uxdc_output heap:-1 stack:-1 major_fault:0 minor_fault:5202952
(56F295CC.0EED-4:get_ps_table.cpp,1157,"display_ras1_output") uxdc_output context:889446 invol_con:170692 readwrite:-1340452233 thread_cnt:57
(56F295CC.0EEE-4:get_ps_table.cpp,1163,"display_ras1_output") uxdc_output utime:000d00:12:59000d00:04:28000d00:17:27 stime:000d00:04:28000d00:17:27 ttime:000d00:17:27 ltime:0 0 wtime:0
(56F295CC.0EEF-4:get_ps_table.cpp,1170,"display_ras1_output") uxdc_output cutime:0 0 0 cstime:0 0 ctime:0
(56F295CC.0EF0-4:get_ps_table.cpp,1179,"display_ras1_output") Exit: 0x0
The above output is provided for each process that the agent is able to discover.
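Because every process produces the same Entry/Exit block, the rows can be grouped mechanically. The following is a minimal Python sketch, not part of the agent or of any IBM tool: the names ROW_RE, ITERATION_RE and process_blocks are hypothetical, and the regular expressions assume that the rows look exactly like the sample above.

```python
import re

# Prefix of a trace row: (time.seq-thread:file,line,"function") message
# Assumption: all rows follow the layout of the sample shown above.
ROW_RE = re.compile(
    r'^\(([0-9A-Fa-f]+)\.[0-9A-Fa-f]+-[0-9A-Fa-f]+:([^,]+),(\d+),"([^"]+)"\)\s*(.*)$'
)
ITERATION_RE = re.compile(
    r'iteration:\s*(\d+)\s*-\s*cmd:(.*?)\s*,\s*uid:(\d+),\s*pid:(\d+),\s*ppid:(\d+)'
)


def process_blocks(lines):
    """Yield one dict per Entry/Exit block written by display_ras1_output."""
    block = None
    for line in lines:
        row = ROW_RE.match(line.strip())
        if not row or row.group(4) != "display_ras1_output":
            continue  # row written by another function, ignore it
        message = row.group(5)
        if message == "Entry":
            block = {"rows": []}
        elif block is not None and message.startswith("Exit"):
            yield block
            block = None
        elif block is not None:
            block["rows"].append(message)
            found = ITERATION_RE.search(message)
            if found:
                # The "iteration" row carries the identity of the process.
                block.update(
                    iteration=int(found.group(1)),
                    cmd=found.group(2),
                    uid=int(found.group(3)),
                    pid=int(found.group(4)),
                    ppid=int(found.group(5)),
                )
```

Feeding the function the lines of an agent log returns one dictionary per process, so you can, for instance, print only the blocks whose "pid" equals the PID you are investigating.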
The trace enabled for the get_process module is instead useful to get a list of all the processes discovered at the beginning of the task.
For example:
(5707F778.0002-6:get_process-aix.cpp,474,"get_processes") getprocs64 first block invocation returned 170 processes in 2 ms
(5707F778.0003-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 0 - pid = 0
(5707F778.0004-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 1 - pid = 1
(5707F778.0005-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 2 - pid = 8196
(5707F778.0006-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 3 - pid = 12294
(5707F778.0007-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 4 - pid = 16392
(5707F778.0008-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 5 - pid = 20490
(5707F778.0009-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 6 - pid = 24588
(5707F778.000A-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 7 - pid = 28686
(5707F778.000B-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 8 - pid = 32784
..
You can use the output produced by these traces to verify whether, at the time the wanted process was not shown in TEP, the agent was actually having any problem
retrieving data for that specific process.
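To line up the trace rows with the time of the error, it helps to translate the first field of each row. The sketch below assumes, as is usual for RAS1 logs, that the value before the first dot is the Unix epoch time in hexadecimal; the function name ras1_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def ras1_timestamp(line):
    """Return the UTC time encoded at the start of a trace row.

    Assumption: the field before the first '.' is epoch seconds in hex,
    for example '5707F778' in '(5707F778.0002-6:get_process-aix.cpp,...'.
    """
    hex_seconds = line.strip().lstrip("(").split(".", 1)[0]
    return datetime.fromtimestamp(int(hex_seconds, 16), tz=timezone.utc)
```

The result is a timezone-aware UTC datetime, so remember to convert it to the local time of the machine or of the TEP client before comparing.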
Of course, knowing the PID of the missing process makes the analysis of the agent log easier, and it can be quickly obtained with:
ps -ef | grep -i
This applies only if you are sure that the wanted process is actually running, of course :-)
If you see the target PID in the list of processes provided by the get_process module, then you will likely also find other related information returned for the same PID by the get_ps_table module.
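This check can be scripted. Here is a small Python sketch (the function where_is_pid and both patterns are hypothetical, written against the sample rows shown earlier) that reports whether a PID was seen by each of the two modules:

```python
import re

# Patterns modelled on the sample rows above; adjust them if your log differs.
BLOCKLIST_RE = re.compile(r'processBlockList: proc_index = (\d+) - pid = (\d+)')
PS_TABLE_RE = re.compile(r'"display_ras1_output"\) iteration: .*\bpid:(\d+), ppid:')


def where_is_pid(lines, pid):
    """Tell which traced module mentions the given PID."""
    seen = {"get_process": False, "get_ps_table": False}
    for line in lines:
        found = BLOCKLIST_RE.search(line)
        if found:
            if int(found.group(2)) == pid:
                seen["get_process"] = True
            continue
        found = PS_TABLE_RE.search(line)
        if found and int(found.group(1)) == pid:
            seen["get_ps_table"] = True
    return seen
```

If "get_process" is True but "get_ps_table" is False, the process was discovered but no detail block was written for it, which is where I would start digging.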
While troubleshooting this kind of scenario, the most important thing is to identify the failing component.
If a process is not shown in the TEP view, the problem can be with data collection, with data transfer when the record buffer is sent to TEMS, or with TEMS itself, for example
when TEMS is suffering from instability caused by an unexpectedly high workload or other error conditions.
In that last case, though, you would notice more serious symptoms than processes missing from a TEP workspace view.
Focusing a bit more on the Unix OS agent side, I'd say that if the process you are looking for is listed by the get_process module, then the problem cannot be with
data collection: as previously said, there is a good chance you will also see it listed in the messages written by the get_ps_table module.
There is an additional check we can perform on the agent side: verify that the number of processes discovered by the agent matches the number of rows the agent sends to TEMS.
You can do it by having the following traces activated on the agent:
ERROR (UNIT:KRAADSPT ALL) (UNIT:get_process ALL)
We already know why tracing get_process can be useful.
The KRAADSPT module is instead useful because it is the one that prints the message:
(5707FA89.078A-F:kraadspt.cpp,895,"sendDataToProxy") Sending xxx rows for
To be sure that all the discovered processes are sent to TEMS, you can do the following:
1) Find the highest value of proc_index with a pid different from zero, by looking at the processBlockList rows. For example:
(5707FFEE.00AB-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 165 - pid = 1175724
(5707FFEE.00AC-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 166 - pid = 1179668
(5707FFEE.00AD-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 167 - pid = 0
(5707FFEE.00AE-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 168 - pid = 0
(5707FFEE.00AF-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 169 - pid = 0
(5707FFEE.00B0-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 170 - pid = 0
(5707FFEE.00B1-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 171 - pid = 0
(5707FFEE.00B2-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 172 - pid = 0
(5707FFEE.00B3-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 173 - pid = 0
(5707FFEE.00B4-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 174 - pid = 0
(5707FFEE.00B5-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 175 - pid = 0
(5707FFEE.00B6-6:get_process-aix.cpp,482,"get_processes") processBlockList: proc_index = 176 - pid = 0
In this case the highest proc_index with a non-zero pid is 166, which means there are 166 processes the agent will try to get additional information for.
The entries with pid=0 are excluded at this stage, so they must not be considered.
There are, however, other processes that could be excluded from the total: those for which the agent logs the following message:
(5707FA89.00CA-F:get_process-aix.cpp,517,"get_processes") Arguments datacollection failed for process 12294
So it could be useful to verify that the process you were expecting to see in TEP is not among the ones excluded with the aforementioned message.
If it is not, you can continue with the next step.
2) Find the number of processes excluded with the message "Arguments datacollection failed".
You can do it by searching for the message "Arguments datacollection failed" and leaving out of the count the entries
with pid=0 (these are already excluded by the code itself, so we must not count them even though the message is issued for them too).
In my case I found 61 instances of "Arguments datacollection failed".
3) To find the number of processes that the agent should be sending to TEMS, subtract the value found in step 2 from the highest proc_index with a PID different from 0, found in step 1.
So in my case: 166 - 61 = 105.
The agent should send to TEMS 105 rows containing metrics about the Unix processes.
4) Now we need to check what the function "sendDataToProxy" is going to send to TEMS.
We can do it by searching for the string "rows for OMUNX.UNIXPS" (UNIXPS is the internal name, the table name, of the Process attribute group).
In my case I found:
(5707FA89.078A-F:kraadspt.cpp,895,"sendDataToProxy") Sending 105 rows for OMUNX.UNIXPS, <4002415294,238027736>.
So everything worked fine.
The number of collected processes matches the number of rows sent to TEMS.
If the values differ, there may be a problem in the KRA area, and additional investigation may be required there.
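The four steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this post: the function check_row_count is hypothetical, the patterns are taken from the sample rows, and the input is assumed to contain the rows of a single collection cycle (with several cycles in the same log, the counts would add up and the result would be meaningless).

```python
import re

BLOCKLIST_RE = re.compile(r'processBlockList: proc_index = (\d+) - pid = (\d+)')
ARGS_FAILED_RE = re.compile(r'Arguments datacollection failed for process (\d+)')
SENDING_RE = re.compile(r'Sending (\d+) rows for OMUNX\.UNIXPS')


def check_row_count(lines):
    """Compare the expected number of UNIXPS rows with the number sent."""
    highest_index = 0  # step 1: highest proc_index with a non-zero pid
    failed = 0         # step 2: "Arguments datacollection failed", pid != 0
    sent = None        # step 4: value printed by sendDataToProxy
    for line in lines:
        found = BLOCKLIST_RE.search(line)
        if found:
            index, pid = int(found.group(1)), int(found.group(2))
            if pid != 0:
                highest_index = max(highest_index, index)
            continue
        found = ARGS_FAILED_RE.search(line)
        if found:
            if int(found.group(1)) != 0:
                failed += 1
            continue
        found = SENDING_RE.search(line)
        if found:
            sent = int(found.group(1))
    expected = highest_index - failed  # step 3
    return {
        "highest_index": highest_index,
        "failed": failed,
        "expected": expected,
        "sent": sent,
        "match": sent == expected,
    }
```

With the figures from my case, the function would report highest_index 166, failed 61, expected 105 and sent 105.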
This blog post is just meant to provide some elements for debugging Process attribute group data collection.
I hope it helps some of you do a first analysis if you face similar scenarios, before engaging customer support.
PS: About the case I worked on, after I verified that the agent was correctly discovering the missing process and that all the discovered processes were sent to TEMS, I was a bit lost because it was not clear why, from time to time,
some processes were disappearing from the TEP view.
A closer look at the log revealed that this agent was disconnecting every 10 minutes, and that the times when the agent was disconnected roughly matched the times when the process was missing from the TEP view.
Further investigation showed that another node using the same hostname was connecting and appearing on TEP in place of the desired one.
Basically, when the expected processes were not shown on TEP, it was because we were looking at the process list of another machine...
So, before starting the above troubleshooting steps, if you experience something similar to the scenario I described, be sure that the Unix OS agent is not being disconnected periodically from TEMS.
This may indicate that another Unix OS agent is registering on TEMS with the same MSN, causing the disconnection of the "original" node.
You can also double-check whether an "impostor" OS agent is shown on TEP by looking at the PIDs of the system processes (errdemon, cron, syslogd).
Compare the ones listed on TEP with the ones shown by the ps -ef command executed on the machine where the "original" OS agent runs.
If they don't match, you have to get rid of the duplicate OS agent node by finding its IP address and then stopping the Unix OS agent that runs on it.
Alternatively, change CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME as suggested by this technote:
http://www-01.ibm.com/support/docview.wss?uid=swg21636802
This technote also suggests how to find the IP address of the second Unix OS agent.
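The PID comparison described above can also be scripted. The following Python sketch is only an illustration: local_pids and mismatches are hypothetical names, the PIDs seen on TEP must be typed in by hand, and the parsing assumes the usual ps -ef layout with the PID in the second column. A process is matched by the base name of any later field, so a command that merely has "cron" as an argument could be picked up by mistake.

```python
import os

SYSTEM_PROCESSES = ("errdemon", "cron", "syslogd")


def local_pids(ps_output, names=SYSTEM_PROCESSES):
    """Map each wanted process name to the first PID found in ps -ef output."""
    pids = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # header or malformed row
        for field in fields[2:]:
            base = os.path.basename(field)
            if base in names:
                pids.setdefault(base, int(fields[1]))
                break
    return pids


def mismatches(tep_pids, ps_output):
    """Return {name: (pid_on_tep, pid_on_machine)} for every difference."""
    local = local_pids(ps_output, tuple(tep_pids))
    return {
        name: (pid, local.get(name))
        for name, pid in tep_pids.items()
        if local.get(name) != pid
    }
```

An empty result means the PIDs shown on TEP belong to the machine where ps -ef was run; any entry in the result suggests that TEP is displaying another node.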