Data Protection Software

  • 1.  ISP 8.1.18.0 : volume not mounting, process not cancelling

    Posted Tue April 04, 2023 10:27 AM

    Hi,
    I recently (last Thursday) updated our ISP server (SLES12SP5) to V8.1.18.0.
    Today I found a process (stgp migration against the local disk buffer pool) that had asked for an LTO volume to be mounted and was still waiting after 5 h with a drive reserved.
    At the same time a backup stgp process was running against the same stgp (with other volumes mounted, so the two were not competing for a mount).
    The server was also showing cpu and sys values over 50%, which always makes me suspicious, although it seemed to work fine.

    I nevertheless submitted a cancel for the backup stgp and, an hour later, a cancel for the stgp migration.
    Three and two hours later, respectively, both processes were still running. So I halted ISP, rebooted the server, and it looks good now.

    Any idea where to look for a cause?
    Thanks, Igor



    ------------------------------
    Igor MERKU'
    ------------------------------


  • 2.  RE: ISP 8.1.18.0 : volume not mounting, process not cancelling

    Posted Wed April 12, 2023 12:01 PM

    Hello Igor,
    if a process cannot be cancelled, it may be hanging in I/O. Here is more detail, at a very high level and in my own words:
    The SP server runs in user space. If a process/thread needs to issue an I/O, for example to a tape drive, this has to be done through kernel space. As soon as the thread has issued the I/O, it no longer has any control over it and can only wait for the I/O to time out, fail, or return successfully. A thread waiting for an I/O response usually cannot be cancelled. Even if you halt the SP server, the thread may still be "hanging around". To clean this up, it may be necessary to reboot the Linux operating system to free the resources. This is not unique to the SP server; it's the way operating systems work.
    Nevertheless, this is a guess based on what you describe. I would check for I/O issues with the tape drive/tape library around the time of the issue.
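    One way to check whether a thread is stuck in a kernel I/O wait is sketched below. This is a minimal sketch, not an SP server tool: it relies only on the general Linux convention that a task blocked in uninterruptible sleep shows state "D" in /proc/<pid>/task/<tid>/stat. The function names are made up for this example, and whether the hung SP server threads in this thread were actually in that state is an assumption.

```python
import os


def parse_stat_line(stat_line):
    """Return (id, command, state) from the content of a /proc .../stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the LAST ')'.
    left, _, right = stat_line.rpartition(")")
    id_text, _, comm = left.partition("(")
    state = right.split()[0]
    return int(id_text), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, tid, command) for every thread currently in state 'D'."""
    found = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    _, comm, state = parse_stat_line(fh.read())
            except OSError:
                continue  # thread exited while we were scanning
            if state == "D":
                found.append((int(pid), int(tid), comm))
    return found


if __name__ == "__main__":
    for pid, tid, comm in uninterruptible_tasks():
        print(f"pid={pid} tid={tid} comm={comm} state=D")
```

    A thread that shows up here repeatedly over several minutes is a candidate for the kind of hung I/O described above. A single sighting proves little, because short waits in state D are normal.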
    If the issue recurs, you may want to check for locking conflicts in the SP server (resources are locked while in use; SP server threads may wait for a lock to be released before they can access the resource). The SP server has a built-in deadlock detector. If the SP server detects a deadlock, it cancels a process/session to resolve it. In your case the SP server does not seem to have detected a deadlock.
    You can get some more details by issuing the following commands as SP server admin:
    q mount f=d
    show mp
    q sess f=d
    show session 
    show thread
    show locks
    q drive f=d
    show library 
    ---
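    To capture all of these outputs in one go while the problem is present, something like the following could be used. It is a minimal sketch and assumes that the dsmadmc administrative command-line client is installed and on the PATH of the machine you run it from. The helper names and the 300-second timeout are made up for this example. Note that a password passed on the command line is visible in the process list.

```python
import subprocess

# The commands listed in the post above, unchanged.
DIAGNOSTIC_COMMANDS = [
    "q mount f=d",
    "show mp",
    "q sess f=d",
    "show session",
    "show thread",
    "show locks",
    "q drive f=d",
    "show library",
]


def build_dsmadmc_argv(admin_id, password, command):
    """Build the argument list for one dsmadmc call (batch mode, data only)."""
    return [
        "dsmadmc",
        f"-id={admin_id}",
        f"-password={password}",
        "-dataonly=yes",
        command,
    ]


def collect_diagnostics(admin_id, password, run=subprocess.run, timeout=300):
    """Run every diagnostic command and return {command: captured output}."""
    results = {}
    for command in DIAGNOSTIC_COMMANDS:
        argv = build_dsmadmc_argv(admin_id, password, command)
        try:
            completed = run(argv, capture_output=True, text=True, timeout=timeout)
            results[command] = completed.stdout
        except subprocess.TimeoutExpired:
            # An unresponsive server is a finding in itself, so record it.
            results[command] = f"<no answer within {timeout} s>"
    return results


if __name__ == "__main__":
    # Placeholder credentials; replace them with a real administrator ID.
    for command, output in collect_diagnostics("admin", "secret").items():
        print(f"===== {command} =====")
        print(output)
```

    The run parameter exists only so that the collection logic can be exercised without a live server.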
    The show library output has a stanza for each drive. In the 3rd line of each drive stanza you will find "polled = 0/1". If it's 1, the SP server is polling the drive because it is not responding. This may indicate an I/O issue with the drive, which can cause a process's tape mount to hang forever: the process has already been assigned the drive, so it is not waiting for a drive but for the mount to complete.
    By the way, show thread and show locks may generate a lot of output, depending on the load on the SP server.
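    If the show library output has been saved to a file, the polled flag can be picked out mechanically. The sketch below is hedged accordingly: the only thing taken from the post is that each drive stanza contains the text "polled = 0" or "polled = 1". The rest of the stanza layout is not documented here, so the code deliberately assumes nothing else about it, and the function name is made up.

```python
import re

# Matches "polled = 0" or "polled = 1"; tolerant of spacing and letter case.
POLLED = re.compile(r"polled\s*=\s*([01])\b", re.IGNORECASE)


def polled_lines(show_library_output):
    """Return (line number, line text) for every line that has polled = 1."""
    hits = []
    for number, line in enumerate(show_library_output.splitlines(), start=1):
        match = POLLED.search(line)
        if match and match.group(1) == "1":
            hits.append((number, line.strip()))
    return hits
```

    Each reported line number points into the saved output, so the drive it belongs to can be read from the lines just above it.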
    I hope I did not confuse you.
    Regards,
    Holger



    ------------------------------
    Holger Martens
    ------------------------------



  • 3.  RE: ISP 8.1.18.0 : volume not mounting, process not cancelling

    Posted Thu April 13, 2023 02:48 AM
    Edited by Igor P. Merkù Thu April 13, 2023 02:49 AM

    Hello Holger,
    thank you very much for your detailed answer, I very much appreciate your effort.
    There is a lot for me to learn - you never stop learning, do you...
    When we started off with ISP 8.1 (migrating to a new machine, coming from TSM 7) we ran into these performance/lock/unresponsive situations from the start, which has been very frustrating. For no apparent reason things settled down and were quite stable for several months, up until the recent update from 8.1.17 to 8.1.18 (plus a kernel update and a lin_tape update), and "here we go again" ...

    Yesterday afternoon, after a hard reset, I had the ISP server completely to myself: no sessions, no processes, nothing. So I started a backup db full to file ... and the server went nuts in the blink of an eye. Very frustrating.

    I have a case open with IBM; hopefully we can make sense of it all eventually. It might be some stupid option (stupid in the sense of a change/reset to some default value or behaviour) that needs to be set differently with 8.1.18/lin_tape/kernel ... I'll keep this thread posted.

    Thanks again, Holger, much appreciated.
    Cheers, Igor



    ------------------------------
    Igor MERKU'
    ------------------------------



  • 4.  RE: ISP 8.1.18.0 : volume not mounting, process not cancelling

    Posted Fri April 14, 2023 04:25 AM

    Hello Igor,

    I have been dealing with ADSM/TSM/SP since V1.0 ... given the huge number of supported devices and the different operating systems, it's never boring and I learn new things every day - that's why I love being part of the SP community.

    Taking the issue to our support team is definitely the right thing to do; I hope they will be able to track it down and fix it soon.

    Regards,

    Holger



    ------------------------------
    Holger Martens
    ------------------------------



  • 5.  RE: ISP 8.1.18.0 : volume not mounting, process not cancelling

    Posted Fri April 14, 2023 04:53 AM

    Hello Holger,
    well, I have been on TSM/SP since V5.0 ...

    For the last 24 h the ISP server has behaved, nothing to see ... the nightly schedules (including backup db tape and backup db file) went well.
    I really hope support can come up with something, because it might be very hard to pin down the particular circumstances in which things escalate and the server becomes unresponsive. I have not heard back from support in the last 48 h, though ...

    Thanks again and kind regards, Igor



    ------------------------------
    Igor MERKU'
    ------------------------------