AIX

  • 1.  Detaching SVC disks damages other disks

    Posted Thu October 06, 2011 06:30 AM

    Originally posted by: Tibor_B


    Hi,
    I have a problem that is not easy to describe.

    The problem happens when we work with SVC disks on our AIX hosts, mostly when attaching or detaching them. Sometimes this somehow corrupts other, unrelated disks: they become unwritable and the applications using them crash.
    The problem is not limited to a specific version; just today it happened on an AIX 5.2 host.
    We use multipathing, so all operations are done on vpaths.
    Today I was detaching disks:

    umount ...
    varyoffvg ....
    exportvg ...

    A short time later, an Oracle DB using some other disks crashed, and I noticed that our error log started filling up with errors like:
    Description
    USER DATA I/O ERROR

    Probable Causes
    ADAPTER HARDWARE OR MICROCODE
    DISK DRIVE HARDWARE OR MICROCODE
    SOFTWARE DEVICE DRIVER
    STORAGE CABLE LOOSE, DEFECTIVE, OR UNTERMINATED

    Recommended Actions
    CHECK CABLES AND THEIR CONNECTIONS
    INSTALL LATEST ADAPTER AND DRIVE MICROCODE
    INSTALL LATEST STORAGE DEVICE DRIVERS
    IF PROBLEM PERSISTS, CONTACT APPROPRIATE SERVICE REPRESENTATIVE

    Detail Data
    JFS2 MAJOR/MINOR DEVICE NUMBER
    002A 0001
    FILE SYSTEM DEVICE AND MOUNT POINT
    /dev/s......lv, /.....
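To see which filesystems are hit when entries like this pile up, the detail section of `errpt -a` can be filtered. This is only a minimal sketch, assuming the report has the same layout as the excerpt above; the function name `affected_filesystems` is made up for illustration and is not part of AIX.

```shell
# Print each distinct "device, mount point" line that follows a
# "FILE SYSTEM DEVICE AND MOUNT POINT" header in errpt -a style text.
# Reads the report on stdin, so it can be tried on a saved copy first.
affected_filesystems() {
    awk '
        grab { sub(/^[ \t]+/, ""); print; grab = 0; next }
        /FILE SYSTEM DEVICE AND MOUNT POINT/ { grab = 1 }
    ' | sort -u
}

# Hypothetical usage on the affected host (not run here):
#   errpt -a | affected_filesystems
```

Because it only reads stdin, the same function works on an error report that was saved to a file before the log was cleared.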

    The filesystem is still mounted rw (according to the mount command).

    The usual recovery is to detach and reattach the filesystem (umount, varyoffvg, exportvg, and then the reverse). Sometimes it goes without a problem, and sometimes it completes but with an error during varyoffvg:
    0516-062 lqueryvg: Unable to read or write logical volume manager
    record. PV may be permanently corrupted. Run diagnostics
    But it carries on, everything looks all right, and the disk can be attached back to the host.

    But sometimes importvg returns:
    Method error (/usr/lib/methods/chgvpath):
    0514-047 Cannot access a device.

    Or even:

    0516-062 lqueryvg: Unable to read or write logical volume manager
    record. PV may be permanently corrupted. Run diagnostics
    0516-062 lqueryvg: Unable to read or write logical volume manager
    record. PV may be permanently corrupted. Run diagnostics
    0516-1140 importvg: Unable to read the volume group descriptor area
    on specified physical volume.

    The way to fix it is:
    chdev -l $vpath -a pv=clear
    chdev -l $vpath -a pv=yes
    rmdev all hdisks and vpath
    recreatevg .....

    So we can usually recover, but we want to find the root cause to avoid the crashes.

    Any ideas?

    Tibor
    #AIX-Forum


  • 2.  Re: Detaching SVC disks damages other disks

    Posted Sat October 08, 2011 01:37 AM

    Originally posted by: Kosala


    Hi,

    I don't have an exact idea of what's going on, but looking at what you posted, I noticed you're using SDD on 5.2, not SDDPCM. For SVC the recommended SDDPCM version is 2.4.0.2. You might want to migrate to SDDPCM, since SDD is kind of old now, and check the firmware levels of your FC cards to see whether they need upgrading.

    I have not faced this specific scenario, but most of the abnormal behavior I have seen is caused by incompatible versions of SW and FW.

    Cheers,
    Ko
    #AIX-Forum


  • 3.  Re: Detaching SVC disks damages other disks

    Posted Wed October 12, 2011 04:18 AM

    Originally posted by: Tibor_B


    Hi,
    thank you for your response. You are right, we still use SDD. But the situation is not that clear. We have two pSeries 690 servers with 6 LPARs in total. All of them have devices.sdd.61.rte 1.7.2.0 and oslevel 5200-07-00, and only two of them have these problems. We also had this problem on one pSeries 570 box, and again it happens on only one of its 3 LPARs. But that one runs AIX 6100-03-03-0943 with driver devices.sdd.61.rte 1.7.2.0.

    Another aspect is that it happens only on some disks. For example, we have 20 SVC disks on a box, but only the same 3 or 4 of them have these problems from time to time. And when we moved such a disk to another, unaffected server, it worked there flawlessly.

    I also looked at the version of Oracle running on the troublesome and the "no problem" LPARs. If Oracle were the culprit, it would be version 9, but I cannot say that there is a 100% correlation.

    OK, it seems the only good conclusion is to move from the SDD drivers to SDDPCM. That is something at least...

    Thanks

    Tibor
    #AIX-Forum


  • 4.  Re: Detaching SVC disks damages other disks

    Posted Wed October 12, 2011 03:13 PM

    Originally posted by: UncAndy79


    Another thing to look at is your fibre channel switches and the firmware they're running. We had a problem where we would get errors on certain LUNs when zoning work was done on some of the switches for unrelated systems. Turns out the switch version / firmware wasn't supported by the disk array we were using. When we moved the LUNs to a fabric with a supported switch / firmware combination we no longer had those problems.
    #AIX-Forum


  • 5.  Re: Detaching SVC disks damages other disks

    Posted Tue October 18, 2011 09:43 AM

    Originally posted by: Tibor_B


    Hi,
    A lot has happened since my last post, and it seems that we have got closer to the core of the problem.

    We had another series of disk failures, connected to some work on the FC optical fiber cables between our two locations.
    Even on hosts where the disks did not fail totally, we are getting a lot of these errors:

    Type: TEMP
    Resource Name: fscsi0
    Resource Class: driver
    Resource Type: efscsi
    Location: U789D.001.DQD8P77-P1-C6-T1

    Description
    ADAPTER ERROR

    Probable Causes
    ADAPTER HARDWARE OR CABLE
    ADAPTER MICROCODE
    FIBRE CHANNEL SWITCH OR FC-AL HUB

    Failure Causes
    ADAPTER
    CABLES AND CONNECTIONS
    DEVICE

    Recommended Actions
    PERFORM PROBLEM DETERMINATION PROCEDURES
    CHECK CABLES AND THEIR CONNECTIONS
    VERIFY DEVICE CONFIGURATION

    We use 8 paths for our disks, 4 of them going to the other location over two cables. One of those two long-distance cables was reconfigured (speed and mode change) and had performance and reliability problems afterwards.

    Then I found that our usual settings look like this:

    lsattr -El fscsi0
    dyntrk        no
    fc_err_recov  delayed_fail

    The combination of these two facts might cause the problems described above.

    For now I picked a few servers where I changed these settings to:

    dyntrk        yes
    fc_err_recov  fast_fail

    (Of course, I deleted the paths and hdisks and ran cfgmgr and addpaths as required.) No errors since, but I will wait a few more days to see whether it holds.
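Checking many fscsi adapters for these two attributes by eye gets tedious, so here is a minimal sketch of a filter for `lsattr -El fscsiN` output. It assumes the usual column layout (attribute name first, current value second). The function name `check_fscsi_attrs` is made up for illustration, and the "wanted" values are simply the ones being tried in this post, not an official recommendation.

```shell
# Read `lsattr -El fscsiN` output on stdin and report whether dyntrk and
# fc_err_recov have the values tried in this thread (yes / fast_fail).
# Prints "ok" and returns 0 if both match; otherwise lists each
# mismatch and returns 1.
check_fscsi_attrs() {
    awk '
        $1 == "dyntrk"       { dyntrk = $2 }
        $1 == "fc_err_recov" { recov = $2 }
        END {
            status = 0
            if (dyntrk != "yes")      { print "dyntrk=" dyntrk " (wanted yes)"; status = 1 }
            if (recov != "fast_fail") { print "fc_err_recov=" recov " (wanted fast_fail)"; status = 1 }
            if (status == 0) print "ok"
            exit status
        }'
}

# Hypothetical usage on a host (not run here):
#   lsattr -El fscsi0 | check_fscsi_attrs
#
# One way to set both attributes; -P only records the change in the ODM,
# so it takes effect when the device is next reconfigured (e.g. at reboot):
#   chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
```

Without `-P` the adapter's child devices have to be removed first, which matches the delete, cfgmgr, and addpaths steps mentioned above.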

    So I am a bit optimistic about this.
    #AIX-Forum


  • 6.  Re: Detaching SVC disks damages other disks

    Posted Thu October 13, 2011 12:22 AM

    Originally posted by: Kosala


    Hi, going from SDD to SDDPCM is not an upgrade, it is a migration. Before you uninstall the SDD filesets, take care to remove all the vpaths and related devices (dpo as well, IIRC). You'll need to export and import the VGs with the new device names. If you use Oracle RAC, check with the DBA how they can handle the device name changes. I have done this for a couple of machines, but never for an Oracle box. Good luck.

    Cheers,
    Ko
    #AIX-Forum