MQ

  • 1.  Multi-instance shared drive failure causes damaged queues

    Posted Tue December 01, 2020 04:35 PM
    Hi Folks,

    We have a Windows multi-instance configuration with lots of queue managers on the same machine. The shared drive was brought down without stopping MQ first and we have ended up with lots of corrupted queues on disk, mainly SYSTEM.CLUSTER.REPOSITORY.QUEUE and SYSTEM.HIERARCHY.STATE, along with some other files such as the AMQERR01.LOG file.

    These files are, in most cases, "corrupted" on disk and cannot even be deleted from the disk until the disk is checked and fixed using disk tools.
    I'm assuming that these specific files are corrupted on disk because MQ had them open?

    I'm trying to figure out if it's expected MQ behaviour that the files end up being corrupted or whether this is something specific to the type of disk that was used and actually the disk should have been able to cope with the fact that MQ had these files open (if that is why these specific files are corrupted). I don't have details of the type of disk being used yet so can't confirm it is a supported configuration - sorry. Let's assume it is for now please?

    What's your experience of this situation please folks?

    many thanks,
    John.

    ------------------------------
    John Hawkins

    TallJHawkins Consulting Ltd
    ------------------------------


  • 2.  RE: Multi-instance shared drive failure causes damaged queues

    Posted Wed December 02, 2020 01:36 AM

    Hi John,

    MQ is just a user of the file system like any other. If the file system has been brought down hard, then, like any file system that has been through a power outage or similar, you will likely need to run a disk recovery on it before any user of the file system is allowed to open a file. You see the same with any computer that has a power outage - chkdsk will run before you get to start up fully.

    You say the files are "corrupted" - are you sure, or is this just the file system's way of saying "I can't guarantee the consistency of this file until I check it, so I won't allow anyone to open it"? Are they still "corrupted" after you run the disk tools you mention?
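One rough way to tell "locked until checked" apart from "really gone" is to try opening each suspect file before and after the disk tools have run, and compare the results. The sketch below is only an illustration: the function name `probe_files` is made up here, and the paths you pass in are whatever your queue manager data directory actually contains. It only tests whether the operating system will open the file and read one byte; it says nothing about whether MQ considers the object damaged.

```python
import os


def probe_files(paths):
    """Try to open each path read-only and read one byte.

    Returns a dict mapping each path to None if the open and read
    succeeded, or to a short error string if the OS refused.
    """
    results = {}
    for path in paths:
        key = os.fspath(path)
        try:
            with open(key, "rb") as handle:
                handle.read(1)
            results[key] = None
        except OSError as exc:
            # strerror carries the OS-level reason, e.g. a sharing
            # violation or a "file or directory is corrupted" message.
            results[key] = f"{type(exc).__name__}: {exc.strerror}"
    return results


if __name__ == "__main__":
    # Hypothetical usage: pass the suspect files on the command line.
    import sys

    for name, error in probe_files(sys.argv[1:]).items():
        print(f"{name}: {'readable' if error is None else error}")
```

Running it once before the disk check and once after shows whether the errors were the file system protecting itself or genuine loss.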

    You should probably be asking the question, "Is it expected behaviour that a file system can cause file corruption just because a program has a file open when the file system is brought down?" It's not something that would be specific to MQ. As I'm sure you are well aware, MQ is not doing anything particularly unusual in its use of file systems.

    Cheers,
    Morag



    ------------------------------
    Morag Hughson
    MQ Technical Education Specialist
    MQGem Software Limited
    Website: https://www.mqgem.com
    ------------------------------



  • 3.  RE: Multi-instance shared drive failure causes damaged queues

    Posted Wed December 02, 2020 02:29 AM
    Edited by John Hawkins Wed December 02, 2020 03:00 AM
    I agree Morag,

    I think it's partly the way that MQ talks about damaged files - almost as if it's to blame when, clearly, it's just a general disk issue that could have happened to any file that was open when the disk went down.

    I guess my real question could be: how on earth can a FS controller, "in this day and age", not close files down nicely when it undergoes maintenance (as in this case)?
    It feels like the FS controller was either not brought down nicely at all and/or could have done a much better job of closing files when it went into maintenance. Fundamentally, why does a file being open mean that it gets corrupted? This sounds counter-intuitive to me. I would have naturally assumed that the controller was capable of understanding that a file was, e.g., being written to, and then either backing out that write or completing it using some kind of finalize routine. Then I begin to wonder what on earth MQ was doing with all those queue definition files open. Does it need them open all the time, presumably in write mode? Isn't MQ then to blame (just a little) for having files open in write mode that it doesn't need to? Presumably a file open in read-only mode wouldn't cause an inconsistency?

    any knowledge in this area anyone?

    thanks,
    john.

    ------------------------------
    John Hawkins
    Integration Consultant
    ------------------------------



  • 4.  RE: Multi-instance shared drive failure causes damaged queues

    Posted Wed December 02, 2020 02:57 AM
    Hi John,

    We use MQ on Windows Server 2012 R2 with multi-instance queue managers in our test environment and, before 2018, also in production.
    But we had a separate Windows 2012 file server running for file storage. All our systems were running on VMware and we used vMotion at the start. (By the way, you should not use vMotion on a running queue manager.) After we switched off vMotion we had less trouble with our queue managers.

    We lost the connection to the file system many times because our system admins, who own this file server, were updating and restarting it without warning the MQ team.
    We only had the trouble with the queue managers that you would expect, but we never had issues like corrupted files or anything like that. The issues were solved after restarting the queue manager.

    ------------------------------
    Bernard Pittens
    Integration Engineer
    Sligro Foodgroup B.V.
    Veghel
    ------------------------------



  • 5.  RE: Multi-instance shared drive failure causes damaged queues

    Posted Wed December 02, 2020 03:02 AM
    Thanks Bernard - what sorts of issues did you see?

    ------------------------------
    John Hawkins
    Integration Consultant
    ------------------------------



  • 6.  RE: Multi-instance shared drive failure causes damaged queues

    Posted Wed December 02, 2020 03:19 AM
    Hi John,

    FDC errors like this:

    +-----------------------------------------------------------------------------+
    | |
    | IBM MQ First Failure Symptom Report |
    | ========================================= |
    | |
    | Date/Time :- di november 17 2020 10:45:11 W. Europe Standard Time |
    | UTC Time :- 1605606311.126000 |
    | UTC Time Offset :- 60 (W. Europe Daylight Time) |
    | Host Name :- TSTIMQ01 |
    | Operating System :- Windows Server 2012 R2 Server Standard Edition, Build |
    | 9600 |
    | PIDS :- 5724H7251 |
    | LVLS :- 9.1.0.1 |
    | Product Long Name :- IBM MQ for Windows (x64 platform) |
    | Vendor :- IBM |
    | O/S Registered :- 1 (amqxcs2.dll) |
    | Data Path :- D:\ProgramData\IBM\MQ |
    | Installation Path :- D:\Program Files\IBM\WebSphere MQ |
    | Installation Name :- Installation1 (1) |
    | License Type :- Production |
    | Probe Id :- XC560030 |
    | Application Name :- MQM |
    | Component :- xcsCreateDirectory |
    | SCCS Info :- F:\build\slot1\p910_P\src\lib\cs\pc\winnt\amqxcrtn.c, |
    | Line Number :- 258 |
    | Build Date :- Nov 8 2018 |
    | Build Level :- LAIT27071-221114 |
    | Build Type :- IKAP - (Production) |
    | UserID :- SA_xxxxxx |
    | Process Path :- D:\Program Files\IBM\WebSphere MQ\bin64 |
    | Process Name :- amqzmur0.exe |
    | Arguments :- -m QMAN_xxx |
    | Addressing mode :- 64-bit |
    | Process :- 00003500 |
    | Thread :- 00000008 DiagMsgService (6072) |
    | Session :- 00000000 |
    | UserApp :- FALSE |
    | ConnId(1) IPCC :- 16 |
    | Last HQC :- 1.0.0-78272 |
    | Last HSHMEMB :- 0.0.0-0 |
    | Last ObjectName :- |
    | Major Errorcode :- xecF_E_UNEXPECTED_SYSTEM_RC |
    | Minor Errorcode :- OK |
    | Probe Type :- MSGAMQ6119 |
    | Probe Severity :- 1 |
    | Probe Description :- AMQ6119S: An internal IBM MQ error has occurred (WinNT |
    | error 53 from CreateDirectory.) |
    | FDCSequenceNumber :- 1 |
    | Comment1 :- WinNT error 53 from CreateDirectory. |
    | Comment2 :- The network path was not found. |
    | |
    +-----------------------------------------------------------------------------+

    But after restarting the queue managers and switching them over with   endmqm -r -s -i QMAN_xxxx   and   strmqm -x QMAN_xxx   the system continues running.
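When there are many FDC files to go through, it can help to pull the header fields out into something searchable. Below is a minimal sketch, assuming the header uses the "Field :- value" layout shown in the report above and that a line without ":-" continues the previous value. The function name `parse_fdc_header` and the `SAMPLE` excerpt are just for illustration; this is not an IBM tool.

```python
def parse_fdc_header(text):
    """Collect 'Field :- value' pairs from an FDC header box into a dict."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        # Drop indentation and the box's vertical bars.
        line = raw.strip().strip("|").strip()
        # Skip blank rows, the +---+ borders and the ==== underline.
        if not line or set(line) <= set("+-="):
            continue
        if ":-" in line:
            key, _, value = line.partition(":-")
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # No separator: treat as a wrapped continuation of the last value.
            fields[last_key] = (fields[last_key] + " " + line).strip()
    return fields


# A few lines copied from the report above.
SAMPLE = """
+-----------------------------------------------------------------------------+
| IBM MQ First Failure Symptom Report |
| ========================================= |
| Probe Id :- XC560030 |
| Component :- xcsCreateDirectory |
| Probe Description :- AMQ6119S: An internal IBM MQ error has occurred (WinNT |
| error 53 from CreateDirectory.) |
| Comment2 :- The network path was not found. |
+-----------------------------------------------------------------------------+
"""

if __name__ == "__main__":
    header = parse_fdc_header(SAMPLE)
    print(header["Probe Id"], "-", header["Comment2"])
```

Grouping the results by Probe Id and Comment2 makes it easy to see that reports like this one all point at the network path being unavailable rather than at damaged files.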


    ------------------------------
    Bernard Pittens
    Integration Engineer
    Sligro Foodgroup B.V.
    Veghel
    ------------------------------