AIX


Unhealthy filesystems

  • 1.  Unhealthy filesystems

    Posted Wed October 15, 2008 06:15 AM

    Originally posted by: SystemAdmin


    I am having some odd problems with our Oracle JFS2 file systems. Whenever we have to power down the LPARs, errpt flags problems with the /u01/*/oracle file systems a couple of hours after they are powered back on (see below). Unmounting the file system and running fsck on it fixes the problems (quite often a corrupt sibling chain and inode problems), but it only ever happens with these 2 file systems. Any ideas? Could it be the Oracle install, or processes not closing cleanly?

    oslevel 5.3.0.0
    LABEL: J2_FSCK_INFO
    IDENTIFIER: AE3E3FAD

    Date/Time: Wed 15 Oct 09:17:46 2008
    Sequence Number: 3904
    Machine Id: 00CDCDEA4C00
    Node Id:
    Class: O
    Type: INFO
    Resource Name: SYSJ2

    Description
    FSCK FOUND ERRORS

    Probable Causes
    INVALID FILE SYSTEM CONTROL DATA

    Detail Data
    ERROR CODE
    0000 0000
    RESOLUTION STATE
    0000 0000
    FILE SYSTEM DEVICE
    /dev/fslv01

    LABEL: J2_FSCK_INFO
    IDENTIFIER: AE3E3FAD

    Date/Time: Wed 15 Oct 09:16:45 2008
    Sequence Number: 3903
    Machine Id: 00CDCDEA4C00
    Node Id:
    Class: O
    Type: INFO
    Resource Name: SYSJ2

    Description
    FSCK FOUND ERRORS

    Probable Causes
    INVALID FILE SYSTEM CONTROL DATA

    Detail Data
    ERROR CODE
    0000 0000
    RESOLUTION STATE
    0000 0000
    FILE SYSTEM DEVICE
    /dev/fslv01

    LABEL: J2_IMAP_CORRUPT
    IDENTIFIER: 61277850

    Date/Time: Mon 13 Oct 11:06:41 2008
    Sequence Number: 3902
    Machine Id: 00CDCDEA4C00
    Node Id:
    Class: U
    Type: UNKN
    Resource Name: SYSJ2
    Resource Class: NONE
    Resource Type: NONE
    Location:
    VPD:

    Description
    FILE SYSTEM CORRUPTION

    Probable Causes
    INVALID FILE SYSTEM CONTROL DATA

    Recommended Actions
    PERFORM FULL FILE SYSTEM RECOVERY USING FSCK UTILITY
    OBTAIN DUMP
    CHECK ERROR LOG FOR ADDITIONAL RELATED ENTRIES
    IF PROBLEM PERSISTS, CONTACT APPROPRIATE SERVICE REPRESENTATIVE

    Detail Data
    FILE NAME
    j2_imap.c
    LINE NUMBER
    2053
    JFS2 MAJOR/MINOR DEVICE NUMBER
    0021 0003
    JFS2 ERROR LOG FLAG
    0008 0010
    FILE SYSTEM DEVICE AND MOUNT POINT
    /dev/fslv01, /u01/app/oracle

    LABEL: J2_FSCK_REQUIRED
    IDENTIFIER: B6DB68E0

    Date/Time: Mon 13 Oct 11:03:38 2008
    Sequence Number: 3901
    Machine Id: 00CDCDEA4C00
    Node Id:
    Class: O
    Type: INFO
    Resource Name: SYSJ2

    Description
    FILE SYSTEM RECOVERY REQUIRED

    Probable Causes
    INVALID FILE SYSTEM CONTROL DATA DETECTED

    Recommended Actions
    PERFORM FULL FILE SYSTEM RECOVERY USING FSCK UTILITY
    OBTAIN DUMP
    CHECK ERROR LOG FOR ADDITIONAL RELATED ENTRIES

    Detail Data
    ERROR CODE
    0000 0005
    JFS2 MAJOR/MINOR DEVICE NUMBER
    0021 0003
    CALLER
    0023 E3B4
    CALLER
    0022 734C
    CALLER
    0026 FD6C
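
    Entries like the ones above can be tallied per device and label to see which file systems keep recurring. A minimal Python sketch, assuming the text is saved errpt -a output; the function name parse_errpt and the abbreviated SAMPLE (cut down from the entries in this post) are illustrative only and not part of any AIX tool:

```python
from collections import Counter


def parse_errpt(text):
    """Split errpt -a style text into one dict per entry (keyed on LABEL lines)."""
    entries = []
    current = None
    want_device = False  # True when the previous line was a device heading
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("LABEL:"):
            current = {"label": line.split(":", 1)[1].strip(), "device": None}
            entries.append(current)
            want_device = False
        elif current is None:
            continue  # ignore anything before the first LABEL line
        elif want_device and line:
            # The value line may be "/dev/x" or "/dev/x, /mount/point"
            current["device"] = line.split(",")[0].strip()
            want_device = False
        elif line.startswith("FILE SYSTEM DEVICE"):
            want_device = True
        elif line.startswith("Sequence Number:"):
            current["sequence"] = int(line.split(":", 1)[1])
        elif line.startswith("Date/Time:"):
            current["when"] = line.split(":", 1)[1].strip()
    return entries


# Abbreviated sample taken from the entries quoted in this post.
SAMPLE = """\
LABEL: J2_FSCK_INFO
IDENTIFIER: AE3E3FAD
Date/Time: Wed 15 Oct 09:17:46 2008
Sequence Number: 3904
Detail Data
FILE SYSTEM DEVICE
/dev/fslv01

LABEL: J2_IMAP_CORRUPT
IDENTIFIER: 61277850
Date/Time: Mon 13 Oct 11:06:41 2008
Sequence Number: 3902
Detail Data
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/fslv01, /u01/app/oracle
"""

entries = parse_errpt(SAMPLE)
counts = Counter((e["device"], e["label"]) for e in entries)
for (device, label), n in sorted(counts.items()):
    print(device, label, n)
```

    Entries that carry only a major/minor device number, such as J2_FSCK_REQUIRED above, are left with device set to None in this sketch.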



  • 2.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 06:28 AM

    Originally posted by: tony.evans


    How do you power down the LPARs?
    What disk subsystem do you run?
    Are the filesystems set to check on start?


  • 3.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 06:48 AM

    Originally posted by: SystemAdmin


    How do you power down the LPARs? We wait until the DBA has closed the databases, then umount the non-root file systems, then run shutdown (no grace) on the LPARs. There are multiple LPARs (80+) across a p590.
    What disk subsystem do you run? All the /u01 file systems are in a volume group on VIO.
    Are the filesystems set to check on start? Normal UNIX boot.


  • 4.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 07:34 AM

    Originally posted by: tony.evans


    You get this on any of the 80 LPARs?

    How many VIO servers? Are the disks local to the VIO servers or are they SAN / NAS or something?

    By checked on boot, I mean do the filesystems have check set to true or false (or nothing) in /etc/filesystems?

    Are the VIO servers also being rebooted?


  • 5.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 07:43 AM

    Originally posted by: tony.evans


    And what does oslevel -s return?

    And what level of software on the VIO servers?


  • 6.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 07:55 AM

    Originally posted by: SystemAdmin


    You get this on any of the 80 LPARs? Randomly, but only on the /u01 file systems

    How many VIO servers? Are the disks local to the VIO servers or are they SAN / NAS or something? 8 VIO servers. The Oracle application servers use local VIO disks; the Oracle RAC servers use fibre-attached SAN disks.

    By checked on boot, I mean do the filesystems have check set to true or false (or nothing) in /etc/filesystems? false

    Are the VIO servers also being rebooted? Yes

    And what does oslevel -s return? 5300-03-00

    And what level of software on the VIO servers? 5300-03-00
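
    A level string like the one returned by oslevel -s above can be split into base level, technology level and service pack, which makes comparisons across many LPARs easier. A minimal Python sketch; the function names and the minimum level used in the example are hypothetical and chosen only for illustration, not taken from any IBM support statement:

```python
def parse_oslevel(level):
    """Turn an oslevel -s style string such as '5300-03-00' into (base, tl, sp).

    Some levels may carry a fourth field; it is ignored in this sketch.
    """
    parts = level.strip().split("-")
    if len(parts) < 3 or not all(p.isdigit() for p in parts):
        raise ValueError("unexpected oslevel string: %r" % level)
    base, tl, sp = (int(p) for p in parts[:3])
    return base, tl, sp


def is_older(level, minimum):
    """True if the parsed level sorts below the (base, tl, sp) minimum."""
    return parse_oslevel(level) < minimum


# Hypothetical minimum, for illustration only.
EXAMPLE_MINIMUM = (5300, 6, 0)

print(parse_oslevel("5300-03-00"))
print(is_older("5300-03-00", EXAMPLE_MINIMUM))
```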


  • 7.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 08:05 AM

    Originally posted by: SystemAdmin


    Sorry, my answer to "By checked on boot, I mean do the filesystems have check set to true or false (or nothing) in /etc/filesystems?" should have been "nothing", not "false".


  • 8.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 08:24 AM

    Originally posted by: tony.evans


    Ok.

    AIX 5.3 ML3 is pretty old (pre 2005). It may be unrelated, and I fully appreciate the difficulty of upgrading production servers, but I strongly recommend you move to a supported TL. It's entirely possible you're suffering from an issue in early versions of AIX / VIO that causes corruption on virtual SCSI disks during a reboot cycle (I don't know of specific PMRs, I'm just suggesting it as a possibility).

    If you were to open this as a full PMR with IBM, they'd recommend upgrading to a supported level before proceeding.

    What's the order of shutdown? Is it shutdown -h on all LPARs, then reboot the VIO servers, make sure the VIO servers are all up and running, and then restart the LPARs? Is there any chance that the LPARs are coming back up before the VIO servers?

    If you modify the filesystems to check on boot, then it'll fix the corruption before the applications come up, but that just works around the issue rather than resolving it. It would mean you don't need to stop the applications and take manual action, though, assuming the corruption is occurring during the reboot phase rather than while the servers are in use.
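
    To see which file systems have no check attribute at all, the stanzas in /etc/filesystems can be scanned. A minimal Python sketch, assuming the usual stanza layout (a mount point followed by a colon, then attribute = value lines, with comment lines starting with an asterisk); the SAMPLE stanzas, including the log device name, are made up for illustration:

```python
def parse_filesystems(text):
    """Parse /etc/filesystems style stanzas into {mount_point: {attr: value}}."""
    stanzas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and comment lines
        if line.endswith(":") and "=" not in line:
            current = stanzas.setdefault(line[:-1], {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return stanzas


def missing_check(stanzas):
    """Return mount points whose stanza has no check attribute."""
    return [mount for mount, attrs in stanzas.items() if "check" not in attrs]


# Illustrative stanzas only; real attribute values will differ.
SAMPLE = """\
* sample comment line
/home:
        dev             = /dev/hd1
        vfs             = jfs2
        mount           = true
        check           = true

/u01/app/oracle:
        dev             = /dev/fslv01
        vfs             = jfs2
        log             = /dev/loglv00
        mount           = true
        options         = rw
"""

print(missing_check(parse_filesystems(SAMPLE)))
```

    On a real system the text would be read from /etc/filesystems rather than from SAMPLE.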

    Are we sure the VIO servers aren't having any disk connection issues, being rebooted, or that the routes to the disks aren't being affected in some other way?


  • 9.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 09:12 AM

    Originally posted by: SystemAdmin


    Shutdown order from HMC is (wait for each group to shutdown before continuing):

    Shutdown APP LPARs
    Shutdown RAC LPARs
    Shutdown VIOS and RMAN LPARs
    Shutdown tie breaker P520 LPARs
    Shutdown NIM server

    obviously reverse on powerup.
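
    The sequence above can be written down as data so that the power-up order is derived from it rather than maintained separately. A minimal Python sketch; the group names are copied from the list in this post, and the helper names are hypothetical:

```python
# Shutdown groups in the order given in this post.
SHUTDOWN_ORDER = [
    "APP LPARs",
    "RAC LPARs",
    "VIOS and RMAN LPARs",
    "tie breaker P520 LPARs",
    "NIM server",
]


def startup_order(shutdown_order):
    """Power-up order is simply the shutdown order reversed."""
    return list(reversed(shutdown_order))


def starts_before(order, first, second):
    """True if group `first` comes earlier than group `second` in `order`."""
    return order.index(first) < order.index(second)


startup = startup_order(SHUTDOWN_ORDER)
for group in startup:
    print(group)

# The VIO servers should be up before any client LPARs that use their disks.
print(starts_before(startup, "VIOS and RMAN LPARs", "APP LPARs"))
```

    This only checks the planned order; it says nothing about whether each group had actually finished booting before the next one was started.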

    There are no errors or syslog errors on the vio servers.
    The corruptions seem to occur several hours after the OS has been running...which makes me think the oracle app is the likely suspect when it accesses the file system.


  • 10.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 09:49 AM

    Originally posted by: tony.evans


    The fact that the corruption is detected after a couple of hours doesn't mean it didn't happen earlier.

    That's why I suggest setting the filesystems to fsck on boot; at least you'll know for certain they were clean when they came up. You could fsck them before shutdown as well.

    It could be Oracle, or it could easily be your very out-of-date version of AIX. As I say, if you raise a PMR with IBM they'll suggest you patch as a first action.


  • 11.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 09:54 AM

    Originally posted by: SystemAdmin


    Every time we raise a PMR with IBM they advise patching and firmware upgrades...but like you say easier said than done on 24/7 servers!

    Thanks for the advice.


  • 12.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 10:16 AM

    Originally posted by: tony.evans


    Well, since the problem comes to light after you reboot, you obviously have some room for upgrades, and you can use at least two methods to do 90% of the work without any downtime.

    The reason they suggest moving to a recent version is that it has a pretty high hit rate on fixing weird stuff. There are several thousand fixes between the version you're running and the latest TL.


  • 13.  Re: Unhealthy filesystems

    Posted Wed October 15, 2008 10:53 AM

    Originally posted by: CRM


    Just a thought: from what VIO version was RAC supported on virtual disks? I seem to recall it was something like 1.3 (my Metalink login is not working at the moment, so I can't confirm), which was based on something like 5.3 TL6. You look to be running 1.1.2 or some other very early and unsupported version of VIO.

    I would seriously recommend updating the code as per IBM's recommendations!

    regards

    Chris