
How to Analyze Abend Files

Wed July 01, 2020 09:04 AM

Most of the time when an Integration Server or Integration Node crashes or restarts unexpectedly, you will see an abend file generated. This is the broker's record of what was happening at the time of the crash. While these files can look a little confusing, they contain a great deal of information and can be helpful in determining the root cause.

These abend files will be indicated in the syslog and can be found in the /common/errors directory. Typically:
Windows – C:\ProgramData\IBM\MQSI\common\errors
Unix – /var/mqsi/common/errors
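
If you are not sure whether any abend files have been produced recently, a minimal sketch in Python that lists the newest files in that directory (assuming the default Unix work path shown above; adjust ERRORS_DIR for Windows or a non-default work path) is:

import glob
import os
import time

# Minimal sketch: list the newest files in the common/errors directory.
# ERRORS_DIR assumes the default Unix work path; change it for Windows
# (C:\ProgramData\IBM\MQSI\common\errors) or a non-default work path.
ERRORS_DIR = '/var/mqsi/common/errors'

files = sorted(glob.glob(os.path.join(ERRORS_DIR, '*')),
               key=os.path.getmtime, reverse=True)
for path in files[:5]:
    print(time.ctime(os.path.getmtime(path)), path)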

Near the top, the abend file contains System Information and Integration Node Information, as in the following:

From this information we can see the following:

1. The time the process started was Thu Mar 03 08:25:21 2016
2. The version/fixpack of IBM Integration Bus is 9.0.0.2
3. The Operating System is Windows 7
4. The Installation and Work Paths
5. The process that crashed was the Integration Server. This is known because of the following:
   Executable Name       :- DataFlowEngine.exe
   DataFlowEngine = Integration Server
   bipservice = Integration Node
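
To pull these header fields out of an abend file quickly, a rough sketch in Python (the field labels are taken from the example above and may vary slightly between versions) is:

import sys

# Rough sketch: print the header fields discussed above from an abend file
# passed as the first argument. Labels can vary between versions, so extend
# FIELDS as needed.
FIELDS = ('Executable Name', 'Component Name', 'Component UUID',
          'Queue Manager', 'Execution Group', 'EG UUID', 'Time of Report')

with open(sys.argv[1], errors='replace') as f:
    for line in f:
        if any(field in line for field in FIELDS):
            print(line.rstrip())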

As we continue through the file, we get more specific information regarding the actual Integration Node:

From this section we get the following information:

1. The Integration Node Name – Component Name 'TEST'
2. The Integration Node UUID - Component UUID  982c3700-c40b-4dd6-b8e9-edbe465d9810   
3. The Queue Manager Name – Queue Manager 'TESTQM'
4. The Integration Server Name – Execution Group 'default'
5. The Integration Server UUID - EG UUID  408127d2-4f01-0000-0080-b90d0b172dc3   
6. The time the abend was generated - Time of Report (GMT) Thu Mar 03 08:26:18 2016 **
7. The message flow name that is indicated -  Message Flow  MbOutputTerminalPropagate    
** If this were on a Unix machine, the time would be an epoch value, and a converter would be needed.
For example: 
Time of Report (GMT)  secs since 1/1/1970: 1498729312
This converts to GMT Thursday, June 29, 2017 9:41:52 AM
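
If you do not have a converter to hand, the epoch value can be converted with a couple of lines of Python, for example:

from datetime import datetime, timezone

# Convert the 'secs since 1/1/1970' value from a Unix abend file to GMT/UTC.
print(datetime.fromtimestamp(1498729312, tz=timezone.utc))
# 2017-06-29 09:41:52+00:00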

This information now lets us know more specifically where to look for problems.

The next section specifically points to what happened at the time of the abend. On Unix this can be very helpful, as the actual signal that was received is recorded. For example:

This tells you what happened at the time of the abend. Here is a list of the most common signals:

 1	SIGHUP	Hangup.
 2	SIGINT	Interrupt.
 3	SIGQUIT	Quit.  (1)
 4	SIGILL	Invalid instruction (not reset when caught).  (1)
 5	SIGTRAP	Trace trap (not reset when caught).  (1)
 6	SIGABRT	End process (see the abort() function).	(1)
 7	SIGEMT	EMT instruction.
 8	SIGFPE	Arithmetic exception, integer divide by 0 (zero), or floating-point exception. (1)
 9	SIGKILL	Kill (cannot be caught or ignored).
10	SIGBUS	Specification exception.	 (1)
11	SIGSEGV	Segmentation violation.	(1)
12	SIGSYS	Invalid parameter to system call.  (1)
13	SIGPIPE	Write on a pipe when there is no process to read it.
14	SIGALRM	Alarm clock.
15	SIGTERM	Software termination signal.
16	SIGURG	Urgent condition on I/O channel.	 (2)
17	SIGSTOP	Stop (cannot be caught or ignored).  (3)
18	SIGTSTP	Interactive stop.  (3)
19	SIGCONT	Continue the process if stopped. (4)
20	SIGCHLD	To parent on child stop or exit.	 (2)
21	SIGTTIN	Background read attempted from control terminal.	 (3)
22	SIGTTOU	Background write attempted from control terminal.  (3)
23	SIGIO   Input/Output possible or completed.  (2)
24	SIGXCPU	CPU time limit exceeded (see the setrlimit() function).
25	SIGXFSZ	File size limit exceeded (see the setrlimit() function).
26	SIGVTALRM Virtual time alarm (see the setitimer() function).
27	SIGPROF	Profiling time alarm (see the setitimer() function).
28	SIGWINCH Window size change.  (2)
29	SIGINFO	Information request.  (2)
30	SIGUSR1	User-defined signal 1.
31	SIGUSR2	User-defined signal 2.

Notes to table:
(1) Default action includes creating a core dump file.
(2) Default action is to ignore these signals.
(3) Default action is to stop the process receiving these signals.
(4) Default action is to restart or continue the process receiving these signals.
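
Note that signal numbering can differ slightly between platforms, so if in doubt you can look a number up on the machine where the abend occurred rather than relying on the table above. For example, in Python:

import signal

# Map a signal number from the abend file to its name on this machine,
# since numbering can differ between platforms.
print(signal.Signals(11).name)    # SIGSEGV on most platforms
print(int(signal.SIGSEGV))        # and the number for a given name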

 

After this point, the Windows and Linux abend files have a different format, but still contain the same information. In a Windows abend file, the next section contains the environment variables set on the Integration Node, followed by the stack dump at the time of the issue.

On Unix, the next section is the stack dump followed by the environment variables set on the Integration Node.

We are going to focus on the stack trace.

In both Unix and Windows, the first few lines at the top of the stack will typically be the actual abend handling. These lines can be ignored:

Unix:

The Windows version looks more confusing, as the stack appears to the right of the other information in the abend file:

You may ignore any lines containing the words ‘abend’, ‘abort’, or ‘terminate’ near the top of the stack.

While the stacks may look different, they contain the same information, and basically show what was processing at the time of the issue.

At this point you can do a search based upon the information below the abend handling. In the Unix example, a portion of the stack is the following:

You can begin the investigation by searching for a couple of key terms just below the abend handling portion of the stack:

When searching, you can ignore the random letters/numbers (memory addresses and offsets) indicated with the red boxes.
For example, a simple internet search on ‘propagateInner ImbDataFlowTerminal’ returned a possible known defect, a link to the Knowledge Center, and a post on MQSeries.net (a well-known question-and-answer forum). All of these could be helpful in determining the root cause.

Defect:
https://www-304.ibm.com/support/docview.wss?uid=swg1PI69900

Knowledge Center:
https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/au14185_.htm

MQSeries:
http://www.mqseries.net/phpBB2/viewtopic.php?t=27862

While none of these may be the complete answer or root cause of the abend, they do give a starting place for the investigation.
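
If you prefer to pull candidate search terms out of the stack programmatically rather than by eye, a rough sketch in Python (the frame layout differs between platforms, so treat the output purely as suggestions) is:

import re
import sys

# Rough sketch: list identifier-like tokens from an abend file so they can be
# used as internet search terms. Hex addresses and offsets are skipped.
TOKEN = re.compile(r'[A-Za-z_][A-Za-z0-9_:]{5,}')
HEXLIKE = re.compile(r'^[0-9a-fA-F]+$')

seen = []
with open(sys.argv[1], errors='replace') as f:
    for line in f:
        for token in TOKEN.findall(line):
            if not HEXLIKE.match(token) and token not in seen:
                seen.append(token)

for token in seen:
    print(token)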

There are some abends that we see more commonly than others. Here are a few of those:
1. A semaphore locking issue.
If you see Function: semop or Function: semctl above the stack, with ImbNamedMutex in the stack itself, the issue can typically be resolved by completing the steps in the following DWAnswers post:
https://developer.ibm.com/answers/questions/169895/why-does-iib-or-wmb-fail-after-a-failover-or-a-res.html
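
As an illustration only (the actual recovery steps are in the post above), you can first see which System V semaphores are currently allocated on the machine, for example with this small Python sketch; 'mqsiuser' is just a placeholder for whichever ID runs the Integration Node:

import subprocess

# Illustration only: list System V semaphores held by the ID that runs the
# Integration Node. BROKER_USER is a placeholder - substitute your own ID.
BROKER_USER = 'mqsiuser'

output = subprocess.run(['ipcs', '-s'], capture_output=True, text=True).stdout
for line in output.splitlines():
    if BROKER_USER in line:
        print(line)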

2. JVM Out of Memory issues:
If you see in the stack trace that the Integration Node or Integration Server is not exiting the JVM libraries, it is very possible you are exhausting the JVM max heap size. You can confirm this by looking in the stderr file for the Integration Node or Integration Server and finding the out-of-memory exception:
Integration Node stderr: <work path>/components/<integration node name>/stderr
Integration Server stderr: <work path>/components/<integration node name>/<integration server UUID>/stderr
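
A minimal sketch for confirming this in Python (passing the stderr path as the first argument; it simply looks for the standard Java out-of-memory exception text) is:

import sys

# Minimal sketch: scan a stderr file for the standard Java out-of-memory
# exception text and print the matching lines with their line numbers.
with open(sys.argv[1], errors='replace') as f:
    for number, line in enumerate(f, 1):
        if 'OutOfMemoryError' in line:
            print(number, line.rstrip())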

The JVM max heap size can be increased to avoid this abend:
https://developer.ibm.com/answers/questions/176620/how-do-you-change-the-max-jvm-heap-size-in-iib-or.html

3. Incorrect ODBC configurations
If you see from the stack trace that the libraries indicated are ODBC libraries, it is worth first taking a look at your odbc.ini and odbcinst.ini files to verify that they are correct. The locations of these files are indicated by the environment variables ODBCINI and ODBCSYSINI. Any additional white space or stray carriage returns in these files can cause issues.
https://developer.ibm.com/answers/questions/271466/odbc-connection-errors.html
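
To spot stray white space or carriage returns quickly, a minimal sketch in Python (reading the files pointed to by ODBCINI and ODBCSYSINI) is:

import os

# Minimal sketch: flag carriage returns and trailing white space in odbc.ini
# and odbcinst.ini. ODBCINI points at odbc.ini itself; ODBCSYSINI points at
# the directory containing odbcinst.ini.
paths = [os.environ.get('ODBCINI', 'odbc.ini'),
         os.path.join(os.environ.get('ODBCSYSINI', '.'), 'odbcinst.ini')]

for path in paths:
    try:
        with open(path, 'rb') as f:
            for number, raw in enumerate(f, 1):
                if b'\r' in raw:
                    print(f'{path}:{number}: carriage return found')
                elif raw.rstrip(b'\n') != raw.rstrip():
                    print(f'{path}:{number}: trailing white space')
    except OSError as error:
        print(f'Could not read {path}: {error}')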