AIX

  • 1.  System Time

    Posted Mon October 17, 2022 05:03 PM

    I'm working with a vendor supplying cloud-based AIX environments.
    I've deployed several 7.2 and 7.3 virtuals and I'm seeing some unusual issues with the system time on all of them.
    I set the time zone to my preferred setting (America/New_York), reboot the virtual and then set the system time.
    I can reboot the virtual all day long and the time stays set.

    However, when I power the virtual off and then power it back on, the system time is ahead by 6 hours and 19 minutes, sometimes a few minutes more.

    The time zone has not changed, just the server time.
    The virtual has access to the internet and I've configured NTP to several external sources.
    When the virtual is powered on, xntpd starts and then fails after 3 minutes. Failure is logged in errpt.

    ---------------------------------------------------------------------------
    LABEL: SRC_SVKO
    Date/Time: Mon Oct 17 16:26:14 2022
    Type: PERM
    Resource Name: SRC
    Description
    SOFTWARE PROGRAM ERROR
    Detail Data
    SYMPTOM CODE
    256
    SOFTWARE ERROR CODE
    -9017
    ERROR CODE
    0
    DETECTING MODULE
    'srchevn.c'@line:'401'
    FAILING MODULE
    xntpd
    ---------------------------------------------------------------------------

    The NTP log shows this:

    17 Oct 16:26:14 xntpd[5833150]: time error -22778.690494 is way too large (set clock manually)
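
    For reference, the offset in that log line can be converted to hours, minutes and seconds with a small helper. This is only a sketch; the function name offset_to_hms is made up for illustration.

```shell
#!/bin/sh
# Convert a signed offset in seconds (as printed by xntpd) to hours, minutes, seconds.
offset_to_hms() {
    awk -v s="$1" 'BEGIN {
        sign = (s < 0) ? "-" : "+"
        if (s < 0) s = -s
        t = int(s)                       # drop the fractional part
        printf "%s%dh %02dm %02ds\n", sign, int(t / 3600), int((t % 3600) / 60), t % 60
    }'
}

offset_to_hms -22778.690494    # prints "-6h 19m 38s"
```

    That is consistent with the roughly 6 hours and 19 minutes seen after a power-off/power-on cycle.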

    I know very little about what's running under the covers (it's the cloud!).
    What I can see is this:

    System Model: IBM,9009-22A
    Machine Serial Number: XXXXXXX
    Processor Type: PowerPC_POWER9
    Processor Implementation Mode: POWER 9
    Processor Version: PV_9_Compat
    Number Of Processors: 1
    Processor Clock Speed: 2500 MHz
    CPU Type: 64-bit
    Kernel Type: 64-bit
    LPAR Info: 25 vm-xxxxxxxxx-xxxxxx
    Memory Size: 2048 MB
    Good Memory Size: 2048 MB
    Platform Firmware level: VL950_092
    Firmware Version: IBM,FW950.30 (VL950_092)
    <and so on>

    And a little bit about the VIO (from kdb):

    vscsi0 0x000007 0x0000000000 0x0 xxxxxxxvios13a->vhost23

    Vendor support is trying to convince me the problem is with the virtuals.
    I've deployed extra virtuals from their AIX templates, set the time zone and see the same behavior.
    I think this is caused by the time on the physical host being off, or possibly the hypervisor not providing the correct time when the virtual is powered on.

    I wanted to see if anyone has seen this or can shed some light on the situation.

    Thanks!

    Bob

    ------------------------------
    Robert Bothwell
    ------------------------------


  • 2.  RE: System Time

    Posted Tue October 18, 2022 04:18 AM

    That looks squarely like a cloud provider issue.

    Internally, both the P9 and the LPAR represent time as the good old unix time_t.


    When the clock changes inside an LPAR, a time offset is recorded in the partition (I don't know enough internals to tell you if it's in the partition NVRAM or some other representation of the partition inside the P9).

    Assuming(*) that when you reboot the virtual, it's not stopping and starting the LPAR, then the offset will stay fixed and the LPAR's own "clock" (P9 clock + offset) won't jump.  This matches what you are seeing (no jump on reboot).

    When you shut down, the offset should be recorded on the LPAR.   There are at least two firmware levels after yours, and none mentions a time issue, so it could be that the cloud provider deletes the LPAR after shutdown and recreates it on startup, which would explain why the offset is lost.

    Two things to look into:
    - Usually the VIOS are the system time references (via the PHYP). You won't be able to query them directly, but the vendor can.  Tell them to show you the output of "echo $TZ" and "date" on the VIOS, and the ASMI date.  Do not budge on this.
    - You are absolutely correct that the initial time comes from the P9 PHYP. Extraordinary claims need extraordinary proof: if they say the problem is with the template, they'll need to demonstrate that it's not with the P9.  Either way, the template is theirs, so they must fix it.


    My guess is that the support team is somewhere between GMT-5 and GMT-6 (-22778.690494 seconds is about -6h19m). An offset like this is usually the result of setting the clock before setting the TZ.  After that, xntpd won't work inside the VIOS (delta > 20m), and the P9 time will drift.   Another possible cause is that the VIOS TZ was still CST6CDT when the date was set according to GMT/CUT0GDT.  (Living in Portugal - CUT0GDT - I've seen my share of systems with a time_t offset of 6 hours, due to the default TZ being CST6CDT).
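
    To illustrate the TZ point, here is a minimal sketch that prints the same instant as wall-clock time in two zones. The helper name show_zones is made up; UTC0 and CST6CDT are plain POSIX TZ strings.

```shell
#!/bin/sh
# Print the current instant as wall-clock time in each zone given.
show_zones() {
    for tz in "$@"; do
        printf '%-8s %s\n' "$tz" "$(TZ=$tz date '+%Y-%m-%d %H:%M:%S')"
    done
}

show_zones UTC0 CST6CDT
```

    The two lines differ by 6 hours in standard time (5 during daylight saving). Typing the wall-clock time of one zone into "date" while TZ is still set to the other bakes that difference into time_t.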

    (*) - We all know how *that* goes.



    ------------------------------
    José Pina Coelho
    IT Specialist at Kyndryl
    ------------------------------



  • 3.  RE: System Time

    Posted Tue October 18, 2022 06:52 AM
    As Jose pointed out, the vendor may (and for resource management, probably does) delete the LPAR when you shut it down.  Now, starting the virtual back up again, do you know if you're even on the same physical P9 that you were just on previously?

    Track the serial number and LPAR ID against the issues you're seeing. You may find that your skew depends on which piece of hardware you're on.
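
    One way to do that tracking is to append a record at every boot. This is a rough sketch: the log path and the function names are hypothetical. On AIX, "uname -u" prints the system ID and "uname -L" prints the LPAR number and name; elsewhere the fields fall back to "unknown".

```shell
#!/bin/sh
# Hypothetical log location; pick any path that survives a power-off.
LOG=${LOG:-/var/adm/host_track.log}

# Build one record: timestamp (UTC, as the system currently believes it), system ID, LPAR info.
host_record() {
    printf '%s|%s|%s\n' "$(TZ=UTC0 date '+%Y-%m-%dT%H:%M:%SZ')" "$1" "$2"
}

# Collect the identifiers and append a record to the log.
log_host() {
    sysid=$(uname -u 2>/dev/null) || sysid=unknown
    lpar=$(uname -L 2>/dev/null) || lpar=unknown
    host_record "$sysid" "$lpar" >> "$LOG"
}
```

    Calling log_host early in the boot sequence, before any time correction, records the clock the host handed to the LPAR next to the hardware it came from.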





  • 4.  RE: System Time

    Posted Tue October 18, 2022 04:23 AM
    Ask your "cloud service provider" to configure time reference partitions on each box. It is not a problem with the LPARs, but with their systems.

    https://www.ibm.com/support/pages/time-drift-power8-and-power9-servers

    https://www.ibm.com/support/pages/time-and-date-management-powervm

    ------------------------------
    Andrey Klyachkin

    https://www.power-devops.com
    ------------------------------



  • 5.  RE: System Time

    Posted Wed October 19, 2022 04:50 AM

    Time and Date Management with PowerVM: https://www.ibm.com/support/pages/time-and-date-management-powervm




    ------------------------------
    José Pina Coelho
    IT Specialist at Kyndryl
    ------------------------------



  • 6.  RE: System Time

    Posted Mon October 24, 2022 07:06 AM
    Edited by Esa Kärkkäinen Mon October 24, 2022 07:07 AM
    FWIW there is a workaround: run ntpdate when AIX boots.
    This is the way many *nix operating systems work, or can be set up to work.

    IMHO this is a fairly robust way to run ntpdate on boot on AIX.

    The ntpdate command must be run before xntpd starts, so run the shell script pasted below before "rctcpip" from /etc/inittab.

    The shell script:

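    #!/bin/sh
    # Step the clock once at boot, before xntpd starts.
    PATH=/usr/bin:/usr/sbin
    # Use every "server" line in /etc/ntp.conf except local reference clocks (127.127.x.x).
    ntpdate -b $(awk '/^server / && !/127\.127\./{print $2}' /etc/ntp.conf | xargs)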

    Then the usual rigamarole: create the file, paste, set appropriate permissions, and check whether "srcmstr" is the entry that starts before "rctcpip". If so, "mkitab -i srcmstr 'appropriate inittab entry here'".
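
    As a sketch of what such an entry could look like: the identifier "ntpboot", the script path, run level 2 and the "wait" action below are all assumptions, not something from a tested system. The snippet only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical identifier and script path; adjust both for your system.
ITAB_ID=ntpboot
SCRIPT=/usr/local/sbin/ntpdate_boot.sh

# Build an inittab record in the form identifier:runlevel:action:command.
itab_entry() {
    printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

ENTRY=$(itab_entry "$ITAB_ID" 2 wait "$SCRIPT >/dev/console 2>&1")

# Dry run: print the commands instead of executing them.
echo "lsitab srcmstr    # confirm srcmstr exists and precedes rctcpip"
echo "mkitab -i srcmstr '$ENTRY'"
```

    The "wait" action is chosen here so that init does not move on to "rctcpip" (and therefore xntpd) until the script has finished.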

    ------------------------------
    Esa Kärkkäinen
    ------------------------------



  • 7.  RE: System Time

    Posted Mon October 24, 2022 09:55 AM
    All,

    I appreciate all the feedback on this.
    I like to think I'm pretty knowledgeable after 30+ years of AIX, but you never know when IBM might slide a change into a release!

    I escalated this with the vendor and it appears they have addressed the issue.
    I will follow up with them in the next day or so and find out what changed, maybe even get an RCA.

    Tom's suggestion to track the serial number is a great troubleshooting idea.
    When this issue first appeared, I did exactly that.
    I have a 'test' area with the same vendor and proceeded to deploy a bunch of new virtuals from their templates.
    One thought I had was to try and get 'new' virtuals deployed on different physical hosts in their cloud.
    I did get one virtual deployed on a host I had not seen on previous deployments, and it did not have the time issue.
    However, when I powered this virtual off and then powered it back on, it was running on a physical host with the time issue.

    The cloud provider support team mentioned something about the virtual being 'unprovisioned' in their 'power-off' action.
    I suspect their 'power-on' action re-provisions the virtual on a host based on some algorithm.
    When I select the 'power-on' action in their UI, it takes a couple of minutes before the typical AIX splash screen appears on the console.

    Thanks!

    Bob



    ------------------------------
    Robert Bothwell
    ------------------------------