Informix

down chunk

  • 1.  down chunk

    Posted Wed April 01, 2020 11:18 AM

    Informix 11.50.FC6 on HP-UX 11.31 PA-RISC

    We're working toward upgrading to 14.10 on Linux, but until then, we're running the environment shown above.  Since 11.50 is way past EOL, we're kind of out of luck for support from IBM/HCL.  If it was a production system, I might try to push them for support, but since it's not technically a "system down" situation, I doubt I'd have much luck.

    The current problem involves our disaster recovery server, which is kept nearly up-to-date via Continuous Log Restore.  Every 15 minutes, our primary server does an 'onmode -l' to change logical log files, backs up any logical logs used during those 15 minutes, then transfers the backups to the DR server.  A job on the DR server then applies the log backups to the instance.
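
    A minimal sketch of that 15-minute cycle, in Python. Only 'onmode -l' is taken from the description above; the 'ontape -a' and 'ontape -l -C' invocations, the scp transfer, the directory, and the host name are assumptions for illustration and are not the actual scripts in use. The function only builds the command lines; nothing is executed.

```python
from typing import List, Tuple


def clr_cycle_commands(backup_dir: str, dr_host: str) -> Tuple[List[str], List[str]]:
    """Return (primary_cmds, dr_cmds) for one Continuous Log Restore cycle.

    Assumptions (not from the post): logical logs are backed up with
    'ontape -a' into backup_dir (LTAPEDEV pointing at a directory), copied
    with scp, and applied on the DR side with 'ontape -l -C'.
    """
    primary = [
        "onmode -l",   # switch to the next logical log file
        "ontape -a",   # back up the logical logs that are now full
        # hypothetical transfer step; any file copy mechanism would do
        f"scp -p {backup_dir}/* {dr_host}:{backup_dir}/",
    ]
    dr = [
        "ontape -l -C",  # apply the new backups and stay in log-restore mode
    ]
    return primary, dr


# Print the plan for a hypothetical directory and host.
primary_cmds, dr_cmds = clr_cycle_commands("/backups/llogs", "drserver")
for cmd in primary_cmds + dr_cmds:
    print(cmd)
```

    Keeping the plan as plain data makes it easy to log each step and its exit status, so a failed transfer or restore is noticed before the next cycle starts.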

    This has been running for months without a problem.  We've occasionally brought the DR server to online mode to confirm that everything is working correctly, then restored from a level 0 and restarted the Continuous Log Restore.

    Last night, we got an error which resulted in a down chunk on our DR server:

    23:16:00  Resuming Logical Restore
    23:16:00  Logical Log 64637 Complete, timestamp: 0x2befe189.
    23:16:02  Checkpoint Completed:  duration was 0 seconds.
    23:16:02  Tue Mar 31 - loguniq 64638, logpos 0xe018, timestamp: 0x2befe240 Interval: 5170003
    
    23:16:02  Maximum server connections 0
    23:16:02  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 719, Llog used 0
    
    23:16:03  Checkpoint Completed:  duration was 0 seconds.
    23:16:03  Tue Mar 31 - loguniq 64638, logpos 0x1ab018, timestamp: 0x2beff0d7 Interval: 5170004
    
    23:16:03  Maximum server connections 0
    23:16:03  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 329, Llog used 0
    
    23:16:03  Suspending Logical Restore
    23:31:26  Resuming Logical Restore
    23:31:26  Logical Log 64638 Complete, timestamp: 0x2beff66e.
    23:31:30  Rollforward of log record failed. iserrno = 0
    23:31:30  Log Record: log = 64639, pos = 0x1530584, type = OLDRSAM:CHALLOC(51), trans = 5074
    23:31:43  Assert Warning: Chunk 7 is being taken OFFLINE.
    23:31:43  IBM Informix Dynamic Server Version 11.50.FC6WE
    23:31:43   Who: Session(41, informix@drserver, 0, c0000001456c5288)
                    Thread(98, xchg_2.0, c00000014568b0a8, 1)
                    File: rsmirror.c Line: 1794
    23:31:43   Results: Dynamic Server will block at next checkpoint
    23:31:43   Action: Shutdown (onmode -k) or override (onmode -O)
    23:31:43  stack trace for pid 3951 written to /ifmx_dump/my_instance/af.44a0b12
    23:31:43   See Also: /ifmx_dump/my_instance/af.44a0b12
    23:31:44  Chunk 7 is being taken OFFLINE.
    23:31:44  Rollforward of log record failed. iserrno = 0
    23:31:44  Log Record: log = 64639, pos = 0x1530584, type = OLDRSAM:CHALLOC(51), trans = 5074
    23:32:20  Logical Log 64639 Complete, timestamp: 0x2bf1dc64.
    23:32:39  Checkpoint blocked by down space, waiting for override or shutdown
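
    For anyone who wants to pull the failing record out of a long online.log, here is a small Python sketch. It assumes the message layout shown in the excerpt above (a "Rollforward of log record failed" line followed by a "Log Record:" line); the function and variable names are made up for illustration.

```python
import re

# Matches lines such as:
#   Log Record: log = 64639, pos = 0x1530584, type = OLDRSAM:CHALLOC(51), trans = 5074
LOG_RECORD = re.compile(
    r"Log Record: log = (?P<log>\d+), pos = (?P<pos>0x[0-9a-fA-F]+), "
    r"type = (?P<type>[^,]+), trans = (?P<trans>\d+)"
)


def failed_log_records(lines):
    """Return the distinct (log, pos, type, trans) tuples that directly
    follow a 'Rollforward of log record failed' message."""
    seen = []
    expect_record = False
    for line in lines:
        if "Rollforward of log record failed" in line:
            expect_record = True
            continue
        if expect_record:
            match = LOG_RECORD.search(line)
            if match:
                record = (
                    int(match["log"]),
                    int(match["pos"], 16),
                    match["type"],
                    int(match["trans"]),
                )
                if record not in seen:
                    seen.append(record)
            expect_record = False
    return seen


sample = [
    "23:31:30  Rollforward of log record failed. iserrno = 0",
    "23:31:30  Log Record: log = 64639, pos = 0x1530584, type = OLDRSAM:CHALLOC(51), trans = 5074",
    "23:31:43  Assert Warning: Chunk 7 is being taken OFFLINE.",
    "23:31:44  Rollforward of log record failed. iserrno = 0",
    "23:31:44  Log Record: log = 64639, pos = 0x1530584, type = OLDRSAM:CHALLOC(51), trans = 5074",
]
for log, pos, rec_type, trans in failed_log_records(sample):
    print(f"log {log} pos {pos:#x} type {rec_type} trans {trans}")
```

    The same record is reported twice in the log above, so the sketch de-duplicates before printing.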
    



    Looking at the af file, there are several HINSERT and ADDITEM entries listed, until we get to this:

    logpos:64639:15303fc HINSERT  tx:5074 pn:00611ad7 fl:             112
    c000000146da0060: 00000084 00000028 00000112 00000000   .......( ........
    c000000146da0070: 00000000 00000000 000013d2 015326c0   ........ .....S&.
    c000000146da0080: 91e8c3aa 00611ad7 00611ad7 00078817   .....a.. .a......
    c000000146da0090: 00430004 00000000 00000000 80017f0a   .C...... ........
    c000000146da00a0: 00017f0b 30313139 39208000 00000000   ....0119 9 ......
    c000000146da00b0: 00800000 00000000 80000000 00000080   ........ ........
    c000000146da00c0: 00000000 00008000 00000000 00800000   ........ ........
    c000000146da00d0: 00000000 80000000 00000080 00000000   ........ ........
    c000000146da00e0: 000000d7                              ....
    
    logpos:64639:1530544 ADDITEM  tx:5074 pn:00611ad8 fl:              10
    c000000147004060: 00000040 0000001c 00000010 00000000   ...@.... ........
    c000000147004070: 00000000 00000000 000013d2 01532700   ........ .....S'.
    c000000147004080: 91e8c3aa 00611ad8 00611ad8 00611ad7   .....a.. .a...a..
    c000000147004090: 00078817 000002a9 00010004 80017f0b   ........ ........
    
    logpos:64639:1530338 HINSERT  tx:5074 pn:00611ad7 fl:             112
    c000000146db1060: 00000084 00000028 00000112 00000000   .......( ........
    c000000146db1070: 00000000 00000000 000013d2 01532784   ........ .....S'.
    c000000146db1080: 91e8c3ac 00611ad7 00611ad7 00078818   .....a.. .a......
    c000000146db1090: 00430004 00000000 00000000 80017f0b   .C...... ........
    c000000146db10a0: 00017f0c 30363339 39208000 00000000   ....0639 9 ......
    c000000146db10b0: 00800000 00000000 80000000 00000080   ........ ........
    c000000146db10c0: 00000000 00008000 00000000 00800000   ........ ........
    c000000146db10d0: 00000000 80000000 00000080 00000000   ........ ........
    c000000146db10e0: 000000d7                              ....
    23:31:30  End of queued log recs
    Log Record: log = 64639, pos = 0x1530584, type = OLDRSAM:CHALLOC(51), trans = 5074
    c000000146f6b060: 00000034 00000033 00000090 00000000   ...4...3 ........
    c000000146f6b070: 00000000 00000000 000013d2 01530544   ........ .....S.D
    c000000146f6b080: 91e8c375 00000000 00220730 0000000a   ...u.... .".0....
    c000000146f6b090: 00000080                              ....
    23:31:43
    23:31:43  IBM Informix Dynamic Server Version 11.50.FC6WE Software Serial Number AAA#B000000
    
    23:31:43  Assert Warning: Chunk 7 is being taken OFFLINE.
    23:31:43   Who: Session(41, informix@drserver, 0, c0000001456c5288)
                    Thread(98, xchg_2.0, c00000014568b0a8, 1)
                    File: rsmirror.c Line: 1794
    23:31:43   Results: Dynamic Server will block at next checkpoint
    23:31:43   Action: Shutdown (onmode -k) or override (onmode -O)
    23:31:43  Raw hex dump of stack located in /ifmx_dump/my_instance/af.44a0b12.rawstk
    23:31:43  Stack for thread: 98 xchg_2.0
    
     base: 0xc000000147673000
      len:   69632
       pc: 0x0000000000000000
      tos: 0xc000000147675380
    state: running
       vp: 1
    
    ( 0)  0x4000000000fb0008   legacy_hp_afstack + 0x320  [/informix/IDS11.50.fc6/bin/oninit]
    ( 1)  0x4000000000faf4a4   afstack + 0x64  [/informix/IDS11.50.fc6/bin/oninit]
    ( 2)  0x4000000000fae410   afhandler + 0xa98  [/informix/IDS11.50.fc6/bin/oninit]
    ( 3)  0x4000000000fad904   afwarn_interface + 0x4c  [/informix/IDS11.50.fc6/bin/oninit]
    ( 4)  0x4000000000a1eac8   bring_media_down + 0x9a0  [/informix/IDS11.50.fc6/bin/oninit]
    ( 5)  0x4000000000b31c78   rollfwd_error + 0x2b8  [/informix/IDS11.50.fc6/bin/oninit]
    ( 6)  0x4000000000b7f534   rlogm_redo + 0x82c  [/informix/IDS11.50.fc6/bin/oninit]
    ( 7)  0x4000000000b20e48   scan_logredo + 0x998  [/informix/IDS11.50.fc6/bin/oninit]
    ( 8)  0x4000000000b216e4   scan_logredo + 0x1234  [/informix/IDS11.50.fc6/bin/oninit]
    ( 9)  0x4000000000b1f80c   next_lscan + 0x87c  [/informix/IDS11.50.fc6/bin/oninit]
    (10)  0x4000000000fbb598   prod_loop1 + 0x2e8  [/informix/IDS11.50.fc6/bin/oninit]
    (11)  0x4000000000fbbb30   producer_thread + 0x330  [/informix/IDS11.50.fc6/bin/oninit]
    (12)  0x4000000000f7cf34   startup + 0xd4  [/informix/IDS11.50.fc6/bin/oninit]
    (13)  0x4000000000f7cd1c   resume + 0x10c  [/informix/IDS11.50.fc6/bin/oninit]
    
     base: 0xc000000147673000
      len:   69632
       pc: 0x0000000000000000
      tos: 0xc000000147675380
    state: running
       vp: 1
    
    
    
    23:31:43   See Also: /ifmx_dump/my_instance/af.44a0b12
    
    ---------------------------------
    Begin System Alarm Program Output
    ---------------------------------
    
    Assertion Failure Type: Warning
    Host Name:              drserver
    Database Server Name:   my_instance
    Time of failure:        Tue Mar 31 23:31:44 EDT 2020
    AF file:                /ifmx_dump/my_instance/af.44a0b12
    Shared memory file:     None
    System Blocking:        OFF
    
    



    I'm not sure what the OLDRSAM:CHALLOC entry is showing.  Is it saying that the table (partition) added an extent?  

    Our production instance is running with no reported problems.  I've looked in the online.log for the relevant time period and there is nothing other than log complete/backup started/backup completed messages, and some checkpoint messages.  Since the log backups came from there, I would expect any problems other than a failed disk to show up on that server as well, but as I said, it looks fine.  Users are on the system, doing their normal work.

    Our Unix sysadm has looked in syslog and dmesg, but does not see anything that looks out of place.  He also ran ioscan, and no issues were found.  Looking at vgdisplay shows all volumes syncd and available.  He has not run chkdsk yet, as the volume group is a RAID 10 striped across several disks, so it would take a while to complete.

    Any suggestions on what to look for?  I can just restore from the latest Level 0 archive and restart the continuous log restore, but I'd really like to be sure that there are no underlying problems first.





    ------------------------------
    Mark Collins
    ------------------------------

    #Informix


  • 2.  RE: down chunk

    Posted Wed April 01, 2020 11:26 AM

    Mark, did you try running oncheck on the CDR server?

    It looks like you have corruption in chunk 7, which would be the cause of the down chunk. Take a look at the .af file and see what it says.

    Is this RSS that you are using to replicate?

     






  • 3.  RE: down chunk

    Posted Wed April 01, 2020 11:40 AM
    Eric,

    We're not using RSS or CDR in this situation.  It is the Continuous Log Restore, where the ontape is called iteratively to apply new logical log backups over a period of time.  As each logical log is applied, the instance is left in a state that allows more logical log backups to be applied.  The plan is that if we ever have an actual emergency and have to switch over to the DR server, we just run one last ontape command to bring the instance online.

    I'm going to see if I can find the table tied to that partition so that I can run oncheck against it.  I wish there was an oncheck option to check a whole chunk (or dbspace) at a time.


    ------------------------------
    Mark Collins
    ------------------------------



  • 4.  RE: down chunk

    Posted Wed April 01, 2020 11:57 AM

    I see.

    That oncheck option would be a good candidate for a Request for Enhancement, which can be submitted here:

    https://ibm-data-and-ai.ideas.aha.io/?project=INFX

    I will vote for it!

    Eric Vercelletto
    Data Management Architect and Owner / Begooden IT Consulting
    Board of Directors, International Informix Users group
    IBM Champion 2013,2014,2015,2016,2017,2018,2019,2020

    Tel:     +33(0) 298 51 3210
    Mob : +33(0)626 52 50 68
    skype: begooden-it
    Google Hangout: eric.vercelletto@begooden-it.com
    Email:
    eric.vercelletto@begooden-it.com
    www :
    http://www.vercelletto.com
    www  https://kandooerp.org






  • 5.  RE: down chunk

    Posted Wed April 01, 2020 12:01 PM
    I like your thinking.  I'll get on that shortly.

    ------------------------------
    Mark Collins
    ------------------------------



  • 6.  RE: down chunk

    Posted Wed April 01, 2020 11:56 AM
    I found the database:table combination identified by the partnum (pn) in the HINSERT log record, and ran oncheck -cd against that table.  When I run it on our production instance, it does not report anything, and returns with a 0 return code.  When I run it on our DR server (where the error occurred), I get "ISAM error: Primary and Mirror chunks are bad", with return code = 2.  It ran so quickly that I'm not sure whether it actually looked at the table pages on the disk, or if it is basing its assessment on the fact that the chunk is down.
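
    One way to do that kind of lookup, sketched in Python (not necessarily the method used here). The pn value comes from the HINSERT record in the af file. The bit split (upper 12 bits = dbspace number, lower 20 bits = page in the tblspace tblspace) is the commonly described partnum layout and is treated as an assumption; the query targets the sysmaster:systabnames table, and the function names are hypothetical.

```python
def decode_partnum(partnum_hex: str):
    """Split a hex partnum into (decimal partnum, dbspace number, page).

    Assumed layout: the upper 12 bits hold the dbspace number and the
    lower 20 bits hold the logical page within the tblspace tblspace.
    """
    partnum = int(partnum_hex, 16)
    dbspace = partnum >> 20
    page = partnum & 0xFFFFF
    return partnum, dbspace, page


def systabnames_query(partnum_hex: str) -> str:
    """Build a query that maps the partnum to its database and table."""
    partnum, _, _ = decode_partnum(partnum_hex)
    return (
        "SELECT dbsname, owner, tabname FROM sysmaster:systabnames "
        f"WHERE partnum = {partnum};"
    )


# pn value from the HINSERT log record in the af file
partnum, dbspace, page = decode_partnum("00611ad7")
print(f"partnum {partnum} -> dbspace {dbspace}, tblspace page {page:#x}")
print(systabnames_query("00611ad7"))
```

    The resulting database and table names can then be passed to oncheck -cd in the form database:table.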

    ------------------------------
    Mark Collins
    ------------------------------



  • 7.  RE: down chunk

    Posted Wed April 01, 2020 12:08 PM

    That's exactly where Tech Support (TS) could help.

    I once had a customer in a more or less similar situation (wanting to migrate to 14.10, which I obviously recommend).
    TS has tools to look deep into your chunks, so what I would do is talk with your IBM representative (if he even knows what the word INFORMIX means), commit in some way to migrating to 14.10, and ask for special authorization to receive exceptional support.

    My customer had stopped paying for support; he was able to negotiate reinstating the TS contract, and TS was finally able to solve the problem.

    Worth a try, but I might lose some friends here.

     

     

    Eric

     

     







  • 8.  RE: down chunk

    IBM Champion
    Posted Wed April 01, 2020 12:16 PM
    Given that CHALLOC failure, and with CHALLOC indeed being an extent of currently FREE pages being assigned/allocated to an object (typically a partition), one has to suspect some sort of extent problem.
    As this replicated over from the primary, there's a chance the same problem was introduced there, unnoticed.

    To be sure your real problem isn't on the primary (where it might cause further havoc), I'd first run an 'oncheck -ce <dbspace_name>' on the dbspace containing chunk #7.  Should that come back clean, you'd be good to recreate the 'secondary' from a fresh backup.  Should it show an extent overlap or other error, one would have to take it from there; either way, make sure you're not losing your latest backup from before the initial problem.


    ------------------------------
    Andreas Legner
    ------------------------------



  • 9.  RE: down chunk

    Posted Wed April 01, 2020 12:56 PM
    Andreas,

    The oncheck -ce did not report any errors on the primary.  When I run it on the DR server, I get:

    > oncheck -ce
    
    Validating extents for Space 'rootdbs' ...
    ERROR: Failed to get header page for partnum 0x100001 (buffer may be locked).
           Please limit DDL/DML activity when running this command.
    
    Validating extents for Space 'tempdbs1' ...
    
     Chunk Pathname                             Pagesize(k)  Size(p)  Used(p)  Free(p)
         2 /informix/links/rdb1                           2  1750000       53  1749947
    
    
    Validating extents for Space 'llogsdbs' ...
    
    Validating extents for Space 'indexdbs' ...
    ERROR: Failed to get header page for partnum 0x400001 (buffer may be locked).
           Please limit DDL/DML activity when running this command.
    
    Validating extents for Space 'tempdbs2' ...
    
     Chunk Pathname                             Pagesize(k)  Size(p)  Used(p)  Free(p)
         5 /informix/links/rdb2                           2  1750000       53  1749947
    
    
    Validating extents for Space 'tempdbs3' ...
    
     Chunk Pathname                             Pagesize(k)  Size(p)  Used(p)  Free(p)
         9 /informix/links/rdb3                           2  1750000       53  1749947
    
    
    Validating extents for Space 'trainingdbs' ...
    ERROR: Failed to get header page for partnum 0x900001 (buffer may be locked).
           Please limit DDL/DML activity when running this command.
    
    Validating extents for Space 'plog2dbs' ...

     Chunk Pathname                             Pagesize(k)  Size(p)  Used(p)  Free(p)
        21 /informix/links/rdb9                           2  3145728  3145728        0
    


    I don't know whether the fact that the DR instance is still in Fast Recovery mode (due to the Continuous Log Restore) is causing any of those errors.




    ------------------------------
    Mark Collins
    ------------------------------



  • 10.  RE: down chunk

    Posted Wed April 01, 2020 03:54 PM

    Not sure if this defect is applicable to a CLR instance, but it's worth a look since you cannot obtain support:

     

    https://www.ibm.com/support/pages/node/4915095

     

    IC68817: MISMATCH IN MIRROR SETTING IN ONCONFIG CAN LEAD TO CHALLOC ROLLFORWARD ERRORS ON ALL TYPES OF SECONDARY SERVERS

     






  • 11.  RE: down chunk

    Posted Wed April 01, 2020 07:03 PM
    Mark,

    Thanks.  Both servers have same settings for mirroring - no mirroring enabled.

    Still searching.

    ------------------------------
    Mark Collins
    ------------------------------



  • 12.  RE: down chunk

    IBM Champion
    Posted Wed April 01, 2020 01:02 PM

    If your admins don't see any problems with the disk structures on the DR site then I would run a set of onchecks on the primary during the next quiet time and if all is well there then take and restore a level 0 archive and put the server into log restore mode again. I suggest the following:

    oncheck -cR
    oncheck -ce
    oncheck -cc <for each database>

    That should be sufficient since the problem appears to be structural rather than a data/index problem. The following are therefore optional:

    oncheck -cDI <for each database>
    oncheck -cS

    Of the first group, only the -cR will take much time (though if you have lots of dumb blobs, the -ce may take a while as it validates the blobspace pages).
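
    The first group of checks could be scripted along these lines. This is a rough Python sketch, assuming oncheck is on the PATH of the informix user and that a non-zero return code signals a problem (post 6 in this thread saw 0 for a clean run and 2 for an error). The function name and the runner parameter are made up for illustration; the database names are supplied by the caller.

```python
import subprocess


def run_structural_checks(databases, runner=subprocess.run):
    """Run oncheck -cR, -ce, and -cc <database> for each database.

    Returns a list of (command, return_code) for every check that did not
    exit with 0. 'runner' exists only so the function can be exercised
    without an Informix installation.
    """
    commands = [
        ["oncheck", "-cR"],  # reserved pages, physical and logical logs
        ["oncheck", "-ce"],  # chunk free lists and extents
    ]
    # one system catalog check per database
    commands += [["oncheck", "-cc", db] for db in databases]

    failures = []
    for cmd in commands:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd, result.returncode))
    return failures
```

    Calling run_structural_checks(["mydb"]) on the primary during a quiet period returns an empty list when every check exits with 0; "mydb" is a placeholder database name. The output of each run is still worth reading, since oncheck can print warnings without failing.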



    ------------------------------
    Art Kagel
    ------------------------------



  • 13.  RE: down chunk

    Posted Wed April 01, 2020 01:13 PM
    Art,

    I've run the oncheck -ce in response to Andreas' post, and it did not report any problems on the primary.  I just now ran the oncheck -cc for the database identified based on partnum from the HINSERT log entry, and the only thing that it reported was no sysdepend record for a view, but the message said that this could be ignored for views on tables in external databases, which this is, so no problems there.

    I will try to get the oncheck -cR later today.  I did run oncheck -cr, and it did not find anything.  I know, the -cR will be much more thorough, but I figured it wouldn't hurt to do the quicker version first.

    ------------------------------
    Mark Collins
    ------------------------------



  • 14.  RE: down chunk

    Posted Wed April 01, 2020 07:04 PM
    Art,

    Got a chance to run oncheck -cR.  Nothing reported as errors:

    > oncheck -cR
    
    Validating IBM Informix Dynamic Server reserved pages
    
        Validating PAGE_PZERO...
    
        Validating PAGE_CONFIG...
    
    
        Validating PAGE_1CKPT & PAGE_2CKPT...
              Using check point page PAGE_1CKPT.
    
    Validating physical log pages ...
    
    Validating logical logs ...
    
        Validating PAGE_1DBSP & PAGE_2DBSP...
              Using DBspace page PAGE_2DBSP.
    
        Validating PAGE_1PCHUNK & PAGE_2PCHUNK...
              Using primary chunk page PAGE_2PCHUNK.
    
        Validating PAGE_1ARCH & PAGE_2ARCH...
              Using archive page PAGE_1ARCH.



    ------------------------------
    Mark Collins
    ------------------------------



  • 15.  RE: down chunk

    Posted Wed April 01, 2020 08:31 PM
    @Mark Collins
    I'm not sure if the Continuing Support Offering option is still available, but if so, I think it would be possible to contact IBM for this issue.
    https://www.ibm.com/support/pages/informix-continuing-support-offering

    ------------------------------
    SangGyu Jeong
    Software Engineer
    Infrasoft
    Seoul Korea, Republic of
    ------------------------------