Most of the time when an Integration Server or Integration Node crashes or restarts unexpectedly, you will see an abend file generated. This is the broker’s output of what was happening at the time of the crash. While these files can look a little confusing, they contain a great deal of information and can be helpful in determining the root cause.
These abend files will be indicated in the syslog and can be found in the common/errors directory under the work path. Typically:
Windows – C:\ProgramData\IBM\MQSI\common\errors
Unix – /var/mqsi/common/errors
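If you are not sure which abend file corresponds to the crash, a minimal sketch like the following lists the most recent files in that directory. It assumes the default Unix work path and that abend file names end in ".abend"; adjust both for your installation:

import glob
import os

# Minimal sketch: list the newest files in the common/errors directory.
# Assumes the default Unix work path and that abend file names end in
# ".abend"; adjust both for your installation.
errors_dir = "/var/mqsi/common/errors"
abend_files = sorted(glob.glob(os.path.join(errors_dir, "*.abend")),
                     key=os.path.getmtime, reverse=True)

for path in abend_files[:5]:
    print(path)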
Near the top, the abend file contains System Information and Integration Node Information sections, as in the following example:
From this information we can see the following:
1. The time the process started was Thu Mar 03 08:25:21 2016
2. The version/fixpack of the Integration Bus is 9.0.0.2
3. The Operating System is Windows 7
4. The Installation and Work Paths
5. The process that crashed was the Integration Server. This is known from the executable name (a small parsing sketch follows this list):
*Executable Name :- DataFlowEngine.exe
DataFlowEngine = Integration Server
bipservice = Integration Node
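As a rough sketch, assuming the header keeps the "Key :- Value" layout shown above, these fields can be pulled out of the file programmatically (the file name below is a placeholder):

import re

# Rough sketch: collect the "Key :- Value" header fields from an abend file,
# assuming the layout shown above. The file name is a placeholder.
fields = {}
with open("/var/mqsi/common/errors/example.abend", errors="replace") as f:
    for line in f:
        m = re.match(r"\s*\*?\s*(.+?)\s*:-\s*(.+)", line)
        if m:
            fields[m.group(1)] = m.group(2).strip()

# DataFlowEngine.exe indicates an Integration Server, bipservice an Integration Node.
print(fields.get("Executable Name"))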
As we continue through the file, we get more specific information regarding the actual Integration Node:
From this section we get the following information:
1. The Integration Node Name – Component Name 'TEST'
2. The Integration Node UUID - Component UUID 982c3700-c40b-4dd6-b8e9-edbe465d9810
3. The Queue Manager Name – Queue Manager 'TESTQM'
4. The Integration Server Name – Execution Group 'default'
5. The Integration Server UUID - EG UUID 408127d2-4f01-0000-0080-b90d0b172dc3
6. The time the abend was generated - Time of Report (GMT) Thu Mar 03 08:26:18 2016 **
7. The message flow name that is indicated - Message Flow MbOutputTerminalPropagate
** If this were on a Unix machine, the time would be an epoch value, and a converter would be needed.
For example:
Time of Report (GMT) secs since 1/1/1970: 1498729312
This converts to GMT Thursday, June 29, 2017 9:41:52 AM
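As a minimal sketch, the same conversion can be done with a couple of lines of Python (the epoch value below is the one from the example):

from datetime import datetime, timezone

# Minimal sketch: convert the "secs since 1/1/1970" value to a readable GMT time.
secs = 1498729312  # value taken from the "Time of Report (GMT)" line
print(datetime.fromtimestamp(secs, tz=timezone.utc).strftime("%A, %B %d, %Y %H:%M:%S GMT"))
# Thursday, June 29, 2017 09:41:52 GMT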
This information now lets us know more specifically where to look for problems.
The next section points specifically to what happened at the time of the abend. On Unix this can be very helpful, as the actual signal that was received is recorded. For example:
This insert can let you know what happened at the time. Here is a list of the most common signals:
1 SIGHUP Hangup.
2 SIGINT Interrupt.
3 SIGQUIT Quit. (1)
4 SIGILL Invalid instruction (not reset when caught). (1)
5 SIGTRAP Trace trap (not reset when caught). (1)
6 SIGABRT End process (see the abort() function). (1)
7 SIGEMT EMT instruction.
8 SIGFPE Arithmetic exception, integer divide by 0 (zero), or floating-point exception. (1)
9 SIGKILL Kill (cannot be caught or ignored).
10 SIGBUS Specification exception. (1)
11 SIGSEGV Segmentation violation. (1)
12 SIGSYS Invalid parameter to system call. (1)
13 SIGPIPE Write on a pipe when there is no process to read it.
14 SIGALRM Alarm clock.
15 SIGTERM Software termination signal.
16 SIGURG Urgent condition on I/O channel. (2)
17 SIGSTOP Stop (cannot be caught or ignored). (3)
18 SIGTSTP Interactive stop. (3)
19 SIGCONT Continue the process if stopped. (4)
20 SIGCHLD To parent on child stop or exit. (2)
21 SIGTTIN Background read attempted from control terminal. (3)
22 SIGTTOU Background write attempted from control terminal. (3)
23 SIGIO Input/Output possible or completed. (2)
24 SIGXCPU CPU time limit exceeded (see the setrlimit() function).
25 SIGXFSZ File size limit exceeded (see the setrlimit() function).
26 SIGVTALRM Virtual time alarm (see the setitimer() function).
27 SIGPROF Profiling time alarm (see the setitimer() function).
28 SIGWINCH Window size change. (2)
29 SIGINFO Information request. (2)
30 SIGUSR1 User-defined signal 1.
31 SIGUSR2 User-defined signal 2.
Notes to table:
(1) Default action includes creating a core dump file.
(2) Default action is to ignore these signals.
(3) Default action is to stop the process receiving these signals.
(4) Default action is to restart or continue the process receiving these signals.
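To translate a signal number from an abend file into its name and description on the machine where the crash occurred, a quick sketch like this can help (signal numbers vary slightly by platform, which is why a local lookup is useful):

import signal

# Quick sketch: look up a signal number on the local machine.
# Signal numbers differ between platforms, so run this where the abend occurred.
# strsignal() requires Python 3.8 or later.
signum = 11  # signal number reported in the abend file
print(signal.Signals(signum).name, "-", signal.strsignal(signum))
# e.g. SIGSEGV - Segmentation fault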
After this point, the Windows and Unix abend files have different formats, but they still contain the same information. In a Windows abend file, the next section is the environment variables set on the Integration Node, followed by the stack dump at the time of the issue.
On Unix, the next section is the stack dump followed by the environment variables set on the Integration Node.
We are going to focus on the stack trace.
In both Unix and Windows, the top of the stack will typically start with a few lines of the actual abend handling. These lines can be ignored:
Unix:
Windows looks more confusing, as the stack is to the right in the actual abend file:
You may ignore any lines containing the words ‘abend’, ‘abort’, or ‘terminate’ near the top of the stack.
While the stacks may look different, they contain the same information, and basically show what was processing at the time of the issue.
At this point you can do a search based upon the information below the abend handling. In the Unix example, a portion of the stack is the following:
You can begin the investigation by searching for a couple of key terms just below the abend handling portion of the stack:
When searching, you can ignore the random letters/numbers indicated with the red boxes.
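If you want to automate that cleanup, a rough sketch like the following strips the hexadecimal addresses and offsets from a stack frame so that only the class and method names remain as search terms (the sample frame is illustrative, not copied from a real abend file):

import re

# Rough sketch: reduce a stack frame to searchable terms by stripping
# hexadecimal addresses and offsets. The sample frame is illustrative only.
frame = "0x00002aaab1234567 ImbDataFlowTerminal::propagateInner(...) + 0x1f4"
cleaned = re.sub(r"0x[0-9a-fA-F]+|\+\s*\S+|\(\.\.\.\)", " ", frame)
terms = " ".join(t for t in re.split(r"[^A-Za-z0-9_]+", cleaned) if t)
print(terms)  # ImbDataFlowTerminal propagateInner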
For example, a simple internet search on ‘propagateInner ImbDataFlowTerminal’ resulted in a possible known defect, a link to the Knowledge Center, and a post on MQSeries.net (a known question/answer site). All could be helpful in determining the root cause.
Defect:
https://www-304.ibm.com/support/docview.wss?uid=swg1PI69900
Knowledge Center:
https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/au14185_.htm
MQSeries:
http://www.mqseries.net/phpBB2/viewtopic.php?t=27862
While none of these may be the complete answer or root cause of the abend, they do give a starting place to begin investigating.
There are some abends that we see more commonly than others. Here are a few of those:
1. A semaphore locking issue.
If you see Function: semop or Function: semctl above the stack, with ImbNamedMutex in the stack, the issue can typically be resolved by completing the steps in the following DWAnswers post:
https://developer.ibm.com/answers/questions/169895/why-does-iib-or-wmb-fail-after-a-failover-or-a-res.html
2. JVM Out of Memory issues:
If you see in the stack trace that the Integration Node or Integration Server is not exiting the JVM libraries, it is very possible you are exhausting the JVM max heap size. You can confirm this by navigating to the stderr for the Integration Node or Integration Server and finding the out of memory exception:
Integration Node stderr: /components//stderr
Integration Server stderr /components///stderr
The JVMMaxHeap can be increased to avoid this abend:
https://developer.ibm.com/answers/questions/176620/how-do-you-change-the-max-jvm-heap-size-in-iib-or.html
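As a quick check, a sketch like this scans a stderr file for the JVM out of memory exception (the path is a placeholder; substitute your own work path, Integration Node name, and Integration Server directory):

# Minimal sketch: scan a stderr file for the JVM out of memory exception.
# The path is a placeholder; substitute your own work path, node name,
# and Integration Server directory.
stderr_path = "/var/mqsi/components/MYNODE/myserver-uuid/stderr"

with open(stderr_path, errors="replace") as f:
    for n, line in enumerate(f, 1):
        if "OutOfMemoryError" in line:
            print(f"line {n}: {line.rstrip()}")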
3. Incorrect ODBC configurations
If you see from the stack trace that the libraries indicated are ODBC libraries, it is worth first taking a look at your odbc.ini and odbcinst.ini files to verify that they are correct. The locations of these files are indicated by the environment variables ODBCINI and ODBCSYSINI. Any additional white space or carriage returns in these files can cause issues.
https://developer.ibm.com/answers/questions/271466/odbc-connection-errors.html
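As a rough sketch, the following flags trailing white space and carriage returns in the files pointed to by those variables (it assumes ODBCINI points at odbc.ini itself and ODBCSYSINI at the directory containing odbcinst.ini, as is typical):

import os

# Rough sketch: flag trailing white space and carriage returns in the ODBC
# definition files. Assumes ODBCINI points at odbc.ini itself and ODBCSYSINI
# at the directory containing odbcinst.ini, as is typical.
paths = []
if os.environ.get("ODBCINI"):
    paths.append(os.environ["ODBCINI"])
if os.environ.get("ODBCSYSINI"):
    paths.append(os.path.join(os.environ["ODBCSYSINI"], "odbcinst.ini"))

for path in paths:
    with open(path, "rb") as f:
        for n, raw in enumerate(f, 1):
            if raw.rstrip(b"\n") != raw.rstrip():
                print(f"{path} line {n}: trailing white space or carriage return")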