IBM i Global

  • 1.  Does FAA's computer system use data replication? If so, what type is it that caused this epic disruption?

    IBM Champion
    Posted Thu January 12, 2023 08:17 AM
    Edited by Satid Singkorapoom Thu January 12, 2023 08:23 AM
    This question is not really relevant to the purpose of this group, but I could not resist my curiosity. I hope others share the same urge and that the discussion helps us learn something about data replication.

    I just watched a short CNN report about the widely reported disruption of the FAA's computer system (I also heard on CNN the day before that it was an IBM system). The report delivered the same content as the following news piece from CNN's web site:

    [QUOTE from CNN]
    The computer system that failed was the central database for all NOTAMs (Notice to Air Missions) nationwide. Those notices advise pilots of issues along their route and at their destination. It has a backup, which officials switched to when problems with the main system emerged, according to the source.

    FAA officials told reporters early Wednesday that the issues developed in the 3 p.m. ET hour on Tuesday.

    Officials ultimately found a corrupt file in the main NOTAM system, the source told CNN. A corrupt file was also found in the backup system.
    [UNQUOTE]

    The very last sentence above is the key point I would like to discuss here. How was the data corruption propagated to the backup system, causing this mishap? From the mention of a "backup system", I assume the FAA's computer system uses data replication of some type to a DR system.

    A long time ago, I heard an ISV that sold a "logical data replication" product on iSeries (MIMIX, or DataMirror, which is now iCluster; I do not recall which) compare its solution with "disk HW replication". One point that caught my interest was the ISV's claim that logical replication would NEVER propagate corrupted data to the target HA/DR system, because it never read the source tables or propagated their data at all. It read the change records in the journal and propagated from there. (With IBM i remote journaling, I believe this still holds.) So the target files would never be corrupted BY logical replication.

    In contrast, the ISV said that disk HW replication worked by copying an entire physical disk sector image (or page, or cluster, or whatever the jargon is) from memory to the target disk sector, verbatim, bit by bit. If any glitch in the system (SAN box or server) or in application-level SW corrupted the data in the source tables, the disk HW replication microcode could NOT know about it and would therefore faithfully propagate the corrupted sector without delay! I remember the ISV's tech rep insisting that he knew of such a rare but unfortunately possible mishap. I also heard about this from a BP, but I have never personally encountered a case myself.
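
    The ISV's claim, as described above, can be illustrated with a toy model. This is only a sketch: the function names, the record layout, and the one-record-per-sector "disk" are all hypothetical and do not reflect how MIMIX, iCluster, remote journaling, or any SAN microcode is actually implemented.

```python
# Toy model only: every name and structure here is hypothetical.

def replicate_blocks(source_disk: list[bytes]) -> list[bytes]:
    """Block-level replication: copy each sector verbatim, with no idea what it means."""
    return [bytes(sector) for sector in source_disk]


def replicate_journal(journal: list[tuple[str, str, int]]) -> dict[str, int]:
    """Logical replication: rebuild the target by replaying journaled change records."""
    target: dict[str, int] = {}
    for op, key, value in journal:
        if op == "PUT":
            target[key] = value
        elif op == "DELETE":
            target.pop(key, None)
    return target


# The application makes two valid changes, and both are journaled.
journal = [("PUT", "acct-1", 100), ("PUT", "acct-2", 250)]

# The source table as stored on disk, one record per sector.
source_disk = [b"acct-1=100", b"acct-2=250"]

# A fault damages a sector on the source AFTER the journal entry was written.
source_disk[1] = b"acct-2=\x00\xff\x00"

block_copy = replicate_blocks(source_disk)   # carries the damaged sector along
logical_copy = replicate_journal(journal)    # never looked at the damaged sector

print(block_copy[1])
print(logical_copy)
```

    Note that the sketch only covers damage that happens outside the journaled path. If the application itself writes a bad value, that write is journaled and replayed on the target like any other change.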

    So I am wondering whether any of you (especially in the US) know if the problematic FAA system uses disk HW replication (and therefore had the issue described above). If so, do you know whether this weak point has been or will be addressed in SAN microcode? (As I remember it, the SAN disk system microcode is owned by a company named FalconStor, or whatever new name it may have now, not by IBM.)

    Just curious and want to understand the issue.

    ------------------------------
    Right action is better than knowledge; but in order to do what is right, we must know what is right.
    -- Charlemagne

    Satid Singkorapoom
    ------------------------------


  • 2.  RE: Does FAA's computer system use data replication? If so, what type is it that caused this epic disruption?

    IBM Champion
    Posted Thu January 12, 2023 09:59 AM
    Often it depends on your definition of 'corruption'.  Does your definition of corruption only include what we call a 'damaged object' on IBM i, or does it also include bad data such as characters in a numeric column?

    ------------------------------
    Robert Berendt
    ------------------------------



  • 3.  RE: Does FAA's computer system use data replication? If so, what type is it that caused this epic disruption?

    IBM Champion
    Posted Thu January 12, 2023 07:35 PM
    Edited by Satid Singkorapoom Thu January 12, 2023 07:48 PM
    Dear Robert

    >>>>  Does your definition of corruption only include what we call a 'damaged object' on IBM i, or does it also include bad data such as characters in a numeric column? <<<<

    When I asked my question, I had no specific notion of what kind of data corruption this was, but I think I see what you are implying. The latter case you mention can be propagated by logical replication (and this can happen only with a DDS-created file, not an SQL-created one, because the latter validates data at write time while the former validates at read time). So I now realize my question has more to do with the case of a damaged or partially damaged object (I also wonder whether this can happen on other operating systems as well).

    This also jogs my memory: I once helped a customer who used logical replication and had a case of an inexplicable partially damaged object that crashed the core application. The solution was to stop the application, rename the damaged object, save the object from the DR system, and restore it to the production system. I do not remember encountering a similar case with a customer who used disk HW replication, and I wonder whether this could be the case with the FAA (even if it does not run IBM i).

    ------------------------------
    Satid Singkorapoom
    ------------------------------



  • 4.  RE: Does FAA's computer system use data replication? If so, what type is it that caused this epic disruption?

    Posted Sat January 14, 2023 03:30 PM
    You bring up an interesting topic. I think it's unlikely we'll ever know the details of the FAA data corruption, but they have provided further information that the file "was damaged by personnel who failed to follow procedures" (https://www.forbes.com/sites/suzannerowankelleher/2023/01/13/faa-contractor-corrupt-software-file-ground-stop/?sh=52a30af3353d_), with the personnel identified in the same article as a contractor.

    So, administrator error or scapegoat, again, we'll probably never know.

    Modern databases are designed to provide crash-consistent recovery with every operation they undertake, such that they are never in an unrecoverable condition if, for any reason, they lose access to the underlying storage. This is true of pretty much any enterprise-level database, because being able to recover from a crash is critically important. Most have some form of journaling (aka logs) to provide further protection.

    Some of those systems (not IBM i) store the database files in a filesystem that could be damaged directly by administrator mistakes. Other operating systems (especially Unix variants) allow direct write access to disk devices. With sufficient authority, those devices can be corrupted directly.

    If I were a betting man, I'd put my money on file- or device-level corruption on one of those systems, caused by an admin fat-fingering a command while running at a level of authority that was not consistent with "procedures".

    All of that aside, on the question of logical replication (e.g. based on remote journaling) versus disk or storage based replication (e.g. PowerHA Geographic mirroring, or storage based replication), what is the truth?

    While it may be true that block-level corruption on a disk will not be replicated by logical replication via a journal (unless it occurs during the write of the journal!), it's not a factor that really warrants much consideration. It is not a given that block-level corruption in a write to a disk will get replicated. If the replication draws from the RAM cache of the original write, it does not necessarily follow that the bytes transmitted to the replicated copy will be incorrect. In synchronous replication (e.g. Metro Mirror) especially, the data is almost certainly not written to a disk on the source, then read back from the disk and sent to the target.

    There are many systems in place in hardware systems to ensure that data does not get corrupted in transit or storage, such as checksums, error correction codes, RAID arrays, etc.  
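
    As a rough illustration of the checksum idea, here is a minimal sketch using Python's standard `zlib.crc32`. The `frame` and `verify` helpers and the 4-byte trailer layout are assumptions made up for this example, not the format used by any particular storage or replication product.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload (hypothetical trailer layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(framed: bytes) -> bytes:
    """Return the payload if its CRC32 matches the trailer, else raise ValueError."""
    payload, stored = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data damaged in transit or storage")
    return payload


good = frame(b"NOTAM record 42")

# Simulate a single flipped bit somewhere between sender and receiver.
damaged = bytearray(good)
damaged[3] ^= 0x01

try:
    verify(bytes(damaged))
    detected = False
except ValueError:
    detected = True

print("intact frame verifies:", verify(good) == b"NOTAM record 42")
print("bit flip detected:", detected)
```

    A checksum only protects data after it has been framed. If the payload was already wrong when the checksum was computed, it verifies cleanly and is stored or transmitted as-is.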

    That example of logical replication's "superiority" is simply a sales tactic. In that logical replication scenario, I'd be much more concerned about maintaining the logical integrity of the data, the order of database writes, and the upkeep of what does and does not get replicated. When you copy the storage, you get a copy of the storage at a point in time. Period. Let's not forget that journals were created to provide crash recovery and commitment control, so if you have journals, you already have exactly the same type of protection against this corruption.

    If your data is critical, you need more than one method of backup.  Replicated HA and DR systems, logical or storage level, will blissfully replicate bad data just as well as they will replicate good data.  You need to journal critical files.  You need point in time copies of that data that you can recover or revert to.  That may be in the form of traditional tape backups, virtual tape libraries, or immutable storage based technologies like IBM Safeguarded Copy.
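
    The difference between replication and a point-in-time copy can be sketched with a toy model. The `ReplicatedStore` class, its method names, and the sample keys below are all hypothetical; real products such as Safeguarded Copy work at the storage level and are far more involved.

```python
import copy


class ReplicatedStore:
    """Toy primary/replica pair with labelled point-in-time copies (hypothetical)."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}
        self.replica: dict[str, str] = {}
        self.snapshots: list[tuple[str, dict[str, str]]] = []

    def write(self, key: str, value: str) -> None:
        # Replication mirrors every write; it has no notion of good or bad data.
        self.primary[key] = value
        self.replica[key] = value

    def snapshot(self, label: str) -> None:
        # A point-in-time copy is frozen and unaffected by later writes.
        self.snapshots.append((label, copy.deepcopy(self.primary)))

    def revert(self, label: str) -> None:
        # Restore both sides from the most recent snapshot with this label.
        for name, state in reversed(self.snapshots):
            if name == label:
                self.primary = copy.deepcopy(state)
                self.replica = copy.deepcopy(state)
                return
        raise KeyError(label)


store = ReplicatedStore()
store.write("notam-1", "RWY 09 closed")
store.snapshot("before-maintenance")

# A bad write arrives (for example, an administrator mistake).
store.write("notam-1", "\x00\x00corrupt")
replica_had_bad_data = store.replica["notam-1"] == "\x00\x00corrupt"

# Only the point-in-time copy offers a way back.
store.revert("before-maintenance")
print(replica_had_bad_data, store.primary, store.replica)
```

    In this model the replica is never a way back, because it always matches the primary. Only the snapshot taken before the bad write allows recovery.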

    You also need to put procedures in place to protect against administrative errors. Adopt the principle of least privilege. Don't use a *SECADM account if you don't need a *SECADM account. Even as an administrator you can accomplish most of your tasks with a plain old user account. If your security is set up right, you should be able to do most tasks with a less privileged account. If your security is not set up right, don't you think it should be?


    ------------------------------
    Vincent Greene
    IT Consultant
    Technology Services
    IBM
    Vincent.Greene@ibm.com


    The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
    ------------------------------



  • 5.  RE: Does FAA's computer system use data replication? If so, what type is it that caused this epic disruption?

    IBM Champion
    Posted Mon January 16, 2023 04:10 AM
    Dear Vincent

    Thank you for your post, which gives more information to take note of on the matter discussed.

    ------------------------------
    Satid Singkorapoom
    ------------------------------



  • 6.  RE: Does FAA's computer system use data replication? If so, what type is it that caused this epic disruption?

    IBM Champion
    Posted Wed January 25, 2023 09:31 AM
    This is a very interesting topic. Thanks for sharing your thoughts...

    ------------------------------
    Carsten Schulz
    ------------------------------