Live Update Cookbook

By Christian Sonnemans posted Tue March 30, 2021 05:24 AM

  

Live Update NIM Cookbook

AIX Live Kernel Updates (Live Update) has been available for AIX 7.2 for several years now and from a customer perspective it is a very good feature. It can help system administrators avoid working after hours to patch their systems.

As a customer myself and an AIX admin, I made a concentrated effort to enable the Live Update feature in my environment. So, I’ve written this “Live Update cookbook” to help other customers do the same. This article guides you through a TL upgrade, invoked either from a NIM master or manually from the LPAR itself. The same steps can be used for updating to an SP or for a single ifix that requires a kernel update (and reboot).

 

Purpose: Set up a working NIM environment for Live Update

Steps to take:

Set up a NIM or local LPAR environment in the right order.
Work through the checklist for the LPAR and take the right steps to make a Live Update successful.
Pitfalls and workarounds, logging and tracing.

 

Minimum requirements for a NIM environment:

In this section I describe how to set up your NIM environment and show some sample configurations. I’ve included some other helpful links on this topic at the end of this blog, under the references section.
We start by configuring and defining all the required NIM objects and the communication between the NIM master, HMC, CEC and VIO servers (VIOS).

 

First steps on NIM master:

Step-1:  filesets that must be present:

lslpp -l|grep dsm
dsm.core
dsm.dsh

 

Step-2: Communication from NIM master to HMC.
Create a password keystore that stores the password of the user that can log on to the HMC. See the following example:

/usr/bin/dpasswd -f /export/NIM/hmc_liveupdate_passwd -U liveupdate
Password:
Re-enter password:
Password file created.

Save this password in a secure place, and create the user liveupdate on the HMC.

 

On the NIM master, define an HMC object with the following NIM command.

Syntax example:
nim -o define -t hmc -a passwd_file=EncryptedPasswordFilePath \
-a if1=InterfaceDescription \
-a net_definition=DefinitionName \
HMCName

More detailed example:
nim -o define -t hmc -a if1="find_net <hmc-name> 0" \
-a net_definition="ent 255.255.255.0" \
-a passwd_file=/export/NIM/hmc_liveupdate_passwd \
HMCname

Check the NIM object with the command lsnim -l HMCname

Tip: to learn more about your existing NIM networks use the following commands:
lsnim -c networks
lsnim -l <network name found in command above>

Exchange ssh keys between the HMC and NIM master:

dkeyexch -f /export/NIM/hmc_liveupdate_passwd -I hmc -H <hmc_name>

Steps 3 and 4 are required if you use NPIV for storage paths, and they are also useful if you want to create VIOS backups via a central NIM server. A pitfall for VIOS backups is that you have to open all the NIM port ranges if you use a VIOS firewall. If you don’t use a VIOS firewall, there is no need to open these ports.
Example:
viosecure -firewall allow -port <portnumber> -address <nim_server>
viosecure -firewall allow -port <portnumber> -address <nim_server> -remote

Ports are: 1019, 1020, 1021, 1022, 1023, 1058, 1059, 3901, 3902, 5813
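
Opening these ports one by one is repetitive. The following is a minimal sketch, not part of the original procedure, that only prints the viosecure commands for the ports listed above so that you can review them before running them in a padmin session. The function name print_firewall_rules and the host name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Print (do not execute) the viosecure commands for every NIM port.
# print_firewall_rules is a hypothetical helper name; the port list is
# taken from the text above, with the duplicate 3901 removed.
NIM_PORTS="1019 1020 1021 1022 1023 1058 1059 3901 3902 5813"

print_firewall_rules() {
    nim_server=$1
    for port in $NIM_PORTS; do
        echo "viosecure -firewall allow -port $port -address $nim_server"
        echo "viosecure -firewall allow -port $port -address $nim_server -remote"
    done
}

# Usage (hypothetical host name):
#   print_firewall_rules nim01.example.com
```

Printing instead of executing is a deliberate choice here: the restricted padmin shell may not allow loops, and a printed list is easy to check against your firewall policy.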

Step 3: Create a NIM CEC object for each managed system:
Example:
nim -o define -t cec -a hw_type=9009 -a hw_model=42A -a hw_serial=<number> -a mgmt_source=hmc-name cec-name

Check with the command:
lsnim -t cec
<cec-name> management cec

lsnim -l <cec-name>

output example:
<cec-name>:
class = management
type = cec
serial = 9009-42A********
hmc = <hmc-name>
manages = <LPAR-names>
manages = <LPAR-name>
manages = <vio-name>
manages = <vio-name>
Cstate = ready for a NIM operation
prev_state =

Step 4: Create a VIOS NIM object for each managed system.

nim -o define -t vios -a if1="<network-name> <vio-name> 0 entX" -a mgmt_source="<cec-name>" -a identity=<LPAR-id> <vio-name>

More detailed example:
nim -o define -t vios -a if1="adminnetwork vio1 0 ent7" -a mgmt_source="p9-S924-testserver" -a identity=1 <vio-name>

Each VIOS must be registered as a NIM client on the NIM master. This can be done in one of two ways.

The first method uses the default padmin user:
$ remote_management -interface <interface> <nimserver>

The disadvantage of this method is that if you have more than one network interface on the VIOS you cannot specify the desired interface.

The second method gives you more control by allowing the administrator to choose the desired FQDN:

Using oem_setup_env, run the following command as root on the VIOS:

$ oem_setup_env
# niminit -a master_port=1058 -a master=<FQDN-name-nimmaster> -a name=<vios-FQDN-name>

Also, in the oem_setup_env environment you can now check the file /etc/niminfo to ensure the NIM definitions for the client and the master are correct:
cat /etc/niminfo


As a final check, verify on your NIM server that you can reach your VIO servers via NIM:

nim -o lslpp <vio-server name>

Also, on the NIM server check if all the management resources are defined and available with:

lsnim -c management

and for each object

lsnim -l <management_object>

Step 5: For each NIM client (LPAR) that you wish to update using Live Update via NIM, it is necessary to update the NIM client object with a valid management profile.

This can be done via the command below:

nim -o change -a identity=<LPAR_ID> -a mgmt_source=<CEC> <lpar_name>

More detailed example:

nim -o change -a identity=8 -a mgmt_source=p9-S924-test testlpar

This is a very boring job if you have a lot of LPARs, so we created a script that gathers the right information from the LPAR and the HMC and then updates the NIM client object with the management profile.

See Sample script-1 below.

Step 6: For Live Update to function, each LPAR needs two additional (unused) disks before the surrogate LPAR can be created. These “spare” disks must be available on the LPAR before you can run Live Update. Also, on the NIM master, there must be an object that holds the correct information about those disks (i.e. the Live Update data configuration file).

If you want to run NIM Live Updates in parallel, it’s useful to create a live_update_data NIM object for every LPAR.

nim -o define -t live_update_data -a server=master -a location=/install/nim/liveupdate/liveupdate_data_<LPARname> liveupdate_data_<lpar-name>

Below is an example of a liveupdate_data_<lpar_name> file.
Note: the path is only an example; it can be any path where you store your NIM objects.

cat /install/nim/liveupdate/liveupdate_data_<LPARname>
general:
kext_check = yes

hmc:
lpar_id =
management_console = <HMC_object_name>
user = liveupdate

# trace:
# trc_option = -anl -T 20M -L 40M

disks:
nhdisk = hdisk1
mhdisk = hdisk2
tohdisk =
tshdisk =

This is again a boring job, so you can create a script for it as well. The script that I created is customer specific and is therefore not published.
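
As a starting point for such a script, here is a minimal sketch that prints a liveupdate_data file in the layout of the example above. It is not the customer-specific script mentioned in the text. The function name gen_liveupdate_data, its argument order, and the names in the usage comment are assumptions for illustration only.

```shell
#!/bin/sh
# Print a liveupdate_data stanza file for one LPAR to stdout.
# gen_liveupdate_data is a hypothetical helper; arguments:
#   $1 = NIM name of the HMC object, $2 = nhdisk, $3 = mhdisk
gen_liveupdate_data() {
    hmc_object=$1
    nhdisk=$2
    mhdisk=$3
    cat <<EOF
general:
kext_check = yes

hmc:
lpar_id =
management_console = ${hmc_object}
user = liveupdate

disks:
nhdisk = ${nhdisk}
mhdisk = ${mhdisk}
tohdisk =
tshdisk =
EOF
}

# Usage (hypothetical names and path):
#   gen_liveupdate_data hmc01 hdisk1 hdisk2 \
#       > /install/nim/liveupdate/liveupdate_data_testlpar
```

The disk names still have to come from your own inventory of free disks per LPAR; the sketch only removes the copy-and-paste work.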

Step 7: Create your NIM lpp_source. Depending on what you want to update with Live Update, you need to prepare your lpp_source for an SP or TL upgrade.

SP or TL lpp_source:
The steps to create a valid lpp_source from base images can be found on the IBM support page:
How to create a spot and lpp_source from an ISO image.
There are several ways to create your lpp_source, but that is beyond the scope of this blog.
I would like to give one example that can be used on the NIM master:
-1 Download the right TL/SP levels from Fix Central.
-2 Put extra software, for example from the expansion pack, into this source/download directory as well.
-3 Run the example below with your own paths:

nim -o define -t lpp_source -a server=master \
-a location="/install/nim/lpp_source" \
-a source=<path_downloaded_files_from_fixcentral> <lpp_source_name>

ifix lpp_source:
It is also possible to use Live Update for ifixes. Some ifixes include kernel fixes and would normally require a reboot. With Live Update, no reboot is needed.
The update source is also an lpp_source, but this source contains only ifixes.
You can put multiple ifixes in one directory and then define it as an lpp_source:

NIM example:
nim -o define -t lpp_source -a server=master -a location="/install/nim/lpp_source/<ifixes-dir>" <lpp_source_ifixes>

To check if your ifix is Live Update capable, use the emgr command as in the following example:

emgr -d -e IJ28227s2a.200922.epkg.Z -v2 | grep "LU CAPABLE:"
LU CAPABLE: yes
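
If an ifix directory holds several packages, the same check can be wrapped in a small helper. This is a sketch under the assumption that emgr prints a line containing "LU CAPABLE:" followed by yes or no, as in the output above. The function name is_lu_capable and the directory in the usage comment are hypothetical.

```shell
#!/bin/sh
# Read "emgr -d -e <package> -v2" output on stdin and succeed only when
# it reports "LU CAPABLE: yes". is_lu_capable is a hypothetical helper.
is_lu_capable() {
    grep "LU CAPABLE:" | grep -qi "yes"
}

# Usage (hypothetical directory):
#   for pkg in /install/nim/lpp_source/ifixes/*.epkg.Z; do
#       if emgr -d -e "$pkg" -v2 | is_lu_capable; then
#           echo "$pkg: Live Update capable"
#       else
#           echo "$pkg: not Live Update capable"
#       fi
#   done
```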

After this step 7 you can do a final check if you have all the required resources:

lsnim | egrep "hmc|cec|vios|live"

After this your NIM environment is ready for Live Update from your NIM master.

Sample script-1
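#!/bin/ksh93
# Add the management profile (LPAR id and managed system) to a NIM client object.
# Usage: <script> <LPARname>

LPAR=$1

if [ -z "${LPAR}" ]
then
printf "usage: %s <LPARname>\n" "$0"
exit 1
fi

ping -c3 "${LPAR}" >/dev/null 2>&1
if [ $? -ne 0 ]
then
printf "could not ping host or wrong name: %s\n" "${LPAR}"
exit 1
fi

# Get the serial number and the LPAR id from the LPAR itself
serialnumber=$(ssh "${LPAR}" prtconf | grep "Machine Serial Number" | awk '{ print $4 }')
LPAR_id=$(ssh "${LPAR}" lparstat -i | grep "Partition Number" | awk '{ print $4 }')

# Get the managed system name from the HMC
# (this grep only matches when the managed system name contains the serial number)
mgmt_sys=$(ssh user@hmc "lssyscfg -r sys -F name | grep ${serialnumber}")

printf "%s has LPAR id: %s and is currently running on mgmt_system: %s\n" "${LPAR}" "${LPAR_id}" "${mgmt_sys}"

# Update the NIM client object with the management profile
nim -o change -a identity="${LPAR_id}" -a mgmt_source="${mgmt_sys}" "${LPAR}"

lsnim -l "${LPAR}"

exit 0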

Preparing the LPAR before we kick off the actual Live Update:

Before the actual Live Update starts, there are several requirements you must meet.
Users of TE (Trusted Execution) must take some extra steps.
The same applies to users of ipsec_v4 or ipsec_v6 (IP filters).
In this section I describe those extra steps. In our case we created a pre-script that runs those checks for us and puts the LPAR into the right state. Because this script is very customer specific, I describe only the common steps here.

Step 1:
Users that are using TE must first disable it during Live Update. On the LPAR you run:

trustchk -p te=off

Note: as a special case, users that store the TE and/or RBAC database on a read-only LDAP server must modify /etc/nscontrol.conf so that during the update only the local /etc/security/tsd/tsd.dat and /etc/security/<rbacfiles> are modified. This is not Live Update specific; it should be done with any update to another SP or TL level. Example for the tsd.dat file:

chsec -f /etc/nscontrol.conf -s tsddat -a secorder=files

Step 2:
Make a backup of your ip-filter lists and stop ip-filters on the LPAR.

After this remove the devices with:

rmdev -Rdl ipsec_v4
rmdev -Rdl ipsec_v6

Step 3:
Stop the processes that can cause Live Update to fail. We discovered several of them:
WLM (Workload Manager): we found out that this was started via the inittab even though we do not use it. Therefore we shut it down and also disable it in /etc/inittab (if you do not disable this entry in inittab, the surrogate will execute it and start WLM again).
On the LPAR use:
/usr/sbin/wlmcntrl -o

chitab "rcnetwlm:23456789:off:/etc/rc.netwlm start > /dev/console 2>&1 # disable netwlm"

Alternatively, if you do not use WLM, you can of course simply remove it from /etc/inittab with:

# rmitab rcnetwlm

Other processes that we stop before we start the Live Update are:
Backup agents
Monitoring scripts, such as nimon, nmon, lpar2rrd etc.
Reason for this: You do not want a backup starting during the upgrade phase.
Monitoring can cause some open files in /proc and Live Update does not like this.
See also the restrictions on the IBM page:
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/lvupdate_detail_restrict.html

Step 4:
Remove all ifixes on the LPAR, if present. This is a normal action before an upgrade to a higher TL or SP. Without Live Update it usually requires a reboot if the ifixes contain kernel patches.

In this case we only remove the ifixes and do not reboot; otherwise it would not be a Live Update.

Step 5:
If the following filesets are installed on the LPAR, they must be removed; otherwise Live Update will fail.
This can be found in the requirements for Live Update:
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/lvupdate_limitations.html

It concerns the filesets bos.mls.cfg and security.pkcs11

geninstall -u -I "g -J -w" -Z -f filelist

where filelist is a file that contains the names of the filesets mentioned above

or

installp -ugw bos.mls.cfg security.pkcs11

Step 6:
This step is also TL02 SP2 specific and is solved in TL03 (see pitfall number 4).
Copy a modified version of c_cust_lku to /usr/lpp/bos.sysmgt/nim/methods/ on the LPAR.
The modification made was:

vi /usr/lpp/bos.sysmgt/nim/methods/c_cust_lku
Around line 892
Change:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k"

To:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k -Y"

With this change, Live Update passes the -Y (agree to licenses) flag on the LPAR.

Another solution is to install the right APAR for this; see the list below:
IJ02913 for level: 7200-01-05-1845
IJ08917 for level: 7200-02-03-1845
IJ05198 for TL level: 7200-03

Of course it’s always better to start with LKU at the latest available level.
But the goal of LKU is updating to a higher level; this is the reason why I describe those pitfalls and mention the solutions for those levels.

Step 7:
We also needed this step only on TL02 SP2.
For more detail see pitfall number 5.
Run the following once on the NIM master and also on every LPAR:

lvupdateRegScript -r -n AUTOFS_UMOUNT -d orig -P LVUP_CHECK

Step 8:
We discovered that if programs running on the LPAR have open files in /proc, Live Update will fail during the blackout period (see pitfall 1).
You can prevent this by checking the LPAR with lsof |grep proc
If this produces output, you should stop the executables that are causing it, such as truss or debugging programs.
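
A plain grep for proc also matches unrelated names that happen to contain "proc". The sketch below narrows the check to files whose name starts with /proc and prints each offending command and PID once. It assumes the usual lsof column layout (COMMAND first, PID second, NAME last); verify that against the lsof version on your LPAR. The function name procfs_users is hypothetical.

```shell
#!/bin/sh
# procfs_users: read "lsof" output on stdin and print "COMMAND PID" once
# for every process that has a file under /proc open.
# Assumes the header is line 1 and the file name is the last column.
procfs_users() {
    awk 'NR > 1 && $NF ~ /^\/proc(\/|$)/ { print $1, $2 }' | sort -u
}

# Usage on the LPAR:
#   lsof | procfs_users
```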

Step 9:
Check if there is enough space in /var and in the root filesystem, /.
You need a minimum of 100 MB free space, but for safety we verify that we have at least 500 MB free space in those filesystems.
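
This check is easy to automate in a pre-script. The following sketch assumes the AIX df -m layout, where the free space in MB is the third column and the mount point is the last column (on other platforms the columns differ). The function name low_space_filesystems is hypothetical.

```shell
#!/bin/sh
# low_space_filesystems MIN_MB: read AIX "df -m" output on stdin and print
# every mount point whose Free column (field 3) is below MIN_MB.
low_space_filesystems() {
    awk -v min="$1" 'NR > 1 && $3 + 0 < min + 0 { print $NF }'
}

# Usage on the LPAR (prints nothing when both filesystems have enough space):
#   df -m /var / | low_space_filesystems 500
```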

Step 10:
Before running Live Update, I recommend creating a backup of your system.
You have several options for this: NIM, mksysb, alt_disk_copy.
I prefer to use alt_disk_copy because it only takes one reboot to fall back to the old rootvg.

Use alt_disk_copy to make a copy of your current rootvg on a separate disk.

The steps are simple and straightforward, and this is a real time saver if something goes terribly wrong!

Steps are:

Run cfgmgr -v after you have added the extra disk via zoning or storage pools.

If you use TE, run trustchk -p te=off

Run the command:

alt_disk_copy -B -d <destination-disk>

The -B option prevents alt_disk_copy from changing the bootlist.

Check your boot list and, if necessary, set it back to the original disk:

bootlist -m normal <original_rootvg_disk>

Step 11:
Check the profile for the LPAR on the HMC.
The minimum amount of memory for the active profile must be 2048 MB (2 GB).
Be aware that the LPAR must be booted with this profile; it cannot be changed afterwards.
This is actually an AIX 7.2 requirement: you need a minimum of 2 GB to run the OS.

Let’s update without doing a reboot!

Now that we have done a lot of checks and verifications, it’s time for the actual goal: updating the TL / SP level without a reboot.
In this section we describe several ways to initiate Live Update with NIM. My advice is: if you are new to this technology, first do some test runs in a test environment. To speed up your tests, first make an alt_disk_copy on a new disk. In this way you can roll back with a reboot; see step 10 in the previous section.

Live Update with NIM
For a preview you can run the following examples with your valid lpp_source and live_update_data NIM resources:

Dry run: note the -p flag for preview

nim -o cust -a lpp_source=72000402_lpp_source \
-a installp_flags="-p" \
-a fixes=update_all \
-a live_update=yes \
-a live_update_data=liveupdate_data_<LPAR-name> <LPAR-name>

Real update with Live Update without the -p flag:

nim -o cust -a lpp_source=72000402_lpp_source \
-a fixes=update_all \
-a live_update=yes \
-a live_update_data=liveupdate_data_<LPAR-name> <LPAR-name>

When you start this command via a script on your NIM server or via the command line, you can monitor the standard output on your NIM master.
For the logging on the LPAR, see the Logging section.
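
The preview and the real run differ only in the installp_flags attribute. As a minimal sketch, a wrapper can print the right nim command for either case so that you can review it before running it. The function name lvup_nim_cmd is hypothetical, and it assumes the liveupdate_data_<LPAR-name> naming convention used in this article.

```shell
#!/bin/sh
# lvup_nim_cmd LPAR LPP_SOURCE [preview]: print the nim command for a
# Live Update of LPAR; pass "preview" as third argument for a dry run.
# The live_update_data resource name follows the liveupdate_data_<LPAR>
# convention from this article.
lvup_nim_cmd() {
    lpar=$1
    lpp_source=$2
    preview=""
    if [ "$3" = "preview" ]; then
        preview=' -a installp_flags="-p"'
    fi
    echo "nim -o cust -a lpp_source=${lpp_source}${preview} -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_${lpar} ${lpar}"
}

# Usage (hypothetical LPAR name):
#   lvup_nim_cmd testlpar 72000402_lpp_source preview
#   lvup_nim_cmd testlpar 72000402_lpp_source
```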

Testing Live Update on a LPAR without using NIM:
For testing purposes, you can start Live Update manually on the LPAR itself without using NIM.
If you did not apply any new updates or ifixes, you can use this method as a kind of “dry run” to see if it creates a surrogate and moves your workload to the new LPAR (monitor the LPAR id).
Be aware: this is not a real “dry run”. It does a Live Update to a new LPAR, but it is a good test of whether your workload can survive a Live Update action.

Make sure that you first authenticate against the HMC where you created the liveupdate user.

Command example below:

hmcauth -u <liveupdate-user> -p <password> -a <hostname-hmc>

check with:

hmcauth -l

output example:

Address : www.xxx.yyy.zzz
User name: liveupdate
Port : 12443

On the LPAR, run the command below to start the Live Update.
First check that the two free disks are listed in the file lvupdate.data in the path /var/adm/ras/liveupdate on the LPAR.

geninstall -kp <-- for preview first
geninstall -k   <-- for the real update

During the Live Update you can monitor the process on the NIM master or, if it was started on the LPAR, on the LPAR itself.
On the HMC you can partly monitor the process, such as the creation of the surrogate LPAR and the cleanup of the old LPAR after the process is done.

During the Live Update action, the HMC shows the original LPAR renamed to <LPAR-name>_lku0 and the surrogate created with the original name.

Apply an ifix with Live Update via the NIM master:
This is roughly the same method as a TL or SP upgrade, and NIM needs the same steps.
See also the first part for how to define an ifix NIM lpp_source.
The minimum requirements are also the same: at least two free disks on the LPAR, a liveupdate_data file defined in NIM, connection and authentication to the HMC, IP filtering disabled, etc.

Apply the ifix preview with the following example:

nim -o cust -a live_update=yes \
-a lpp_source=<lpp_source_ifixes> \
-a installp_flags="-p" \
-a live_update_data=liveupdate_data_<lpar_name> \
-a filesets=1022100a.210112.epkg.Z <lpar_name>

Leaving out the line with the -p (preview) flag will apply the ifix with Live Update.

Tip: You can find detailed logging for the ifix install on the LPAR in the following location:

/var/adm/ras/emgr.log

Workaround: Separate upgrade from Live Update actions:

Workaround for pitfall 3:
Pitfall 3 description: PFC daemon waits too long to start.

The workaround for this problem is a good example of mitigating problems that can arise during the upgrade of a TL or SP level. It separates the upgrade from the actual Live Update actions.
In this case, it makes it possible to start required daemons before the Live Update actions start.
See the example below. The first part is a “normal” NIM update, after which all the updates are committed. The second part is the actual Live Update, but now there are no packages to apply, so it performs only the Live Update actions.

nim -o cust -a lpp_source=72000402_lpp_source \
-a fixes=update_all \
-a accept_licenses=yes <lpar_name>

# now commit updates and start pfcdaemon
ssh <lpar_name> installp -c ALL
ssh <lpar_name> startsrc -s pfcdaemon

nim -o cust -a lpp_source=72000402_lpp_source \
-a fixes=update_all \
-a live_update=yes \
-a live_update_data=liveupdate_data_<lpar_name> <lpar_name>

Fallback option for pitfall 6:
The same workaround that is used above can also be used as a kind of fallback option for the chicken-and-egg problem that is described in pitfall 6.
When your workload fails to move to the surrogate LPAR (a failure during the blackout), the normal behavior without this workaround is a rollback to the original state.
With this workaround, the actual upgrade is already done before the Live Update action starts.
So now you have the choice:
-1 Stop your workload and make another attempt to move to the upgraded surrogate LPAR. In other words, repeat the Live Update action without workload. This is not a real Live Update action any longer, because the goal was to keep your workload running.
OR
-2 Because you have to shut down your workload anyway, you can of course also reboot your LPAR. After this your upgrade is done, but not via Live Update.

Logging:
Main logging for Live Update can be found on the LPAR itself in the following path:
/var/adm/ras/liveupdate/logs
In this path you can find the file lvupdlog, or this file with a timestamp.
The log file can be cryptic to read, but searching for keywords like ERROR can help you find a clue about what went wrong in case of a failure.
Search for ERROR, all in caps, in the log file.
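
A minimal sketch of that search, assuming the log records carry the severity as a separate word (as in the DEBUG and ERROR lines shown under pitfall 1). The function name lvup_errors is hypothetical.

```shell
#!/bin/sh
# lvup_errors: print the ERROR records from a Live Update log.
# Reads the files given as arguments, or stdin when none are given.
# -w matches ERROR only as a whole word, in capitals.
lvup_errors() {
    grep -w "ERROR" "$@"
}

# Usage on the LPAR:
#   lvup_errors /var/adm/ras/liveupdate/logs/lvupdlog
```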

When you must send logs to IBM for analysis, it is good to gather the following:
Make a tar file of everything in the path /var/adm/ras/liveupdate
Example:
cd /var/adm/ras/liveupdate
tar cvf liveupdate_logs.tar *
Also run snap -ac and send this information to IBM as well.

Pitfalls:
1: Open files in /proc:

Message during blackout:
1020-311 The state of the LPAR is currently non-checkpointable. Checkpoint should be retried later. [02.229.0730]

Problem: programs such as truss or a debugger have files open in /proc
In the log file:

OLVUPDMCR 1610089744.160 DEBUG mcr_fd_srm_dev - Calling lvup_chckpnt_status(LIVE_UPDATE_TRY)
KLVUPD 1610089744 DEBUG - 6750576/24904167 DENY Blocking entry: procfs open, count: 9
OLVUPDMCR 1610089754.160 ERROR mcr_fd_srm_dev - 1020-311 The state of the LPAR is currently non-
OLVUPDMCR 1610089754.160 ERROR mcr_sync - Action LU_FREEZE failed in SRM with status 6

2: WLM manager running or entry in inittab:
rcnetwlm:23456789:wait:/etc/rc.netwlm start> /dev/console 2>&1 # Start netwlm
change to:
chitab "rcnetwlm:23456789:off:/etc/rc.netwlm start> /dev/console 2>&1 # disable netwlm"
and run
/usr/sbin/wlmcntrl -o

3: PFC daemon waits too long to start
On TL02 SP2, the script for this daemon waits too long to start (5 minutes).
Message in the log:

1430-134 Notification script '/opt/triton/daemon-stop' failed with exit status 1

Cause: this happens because the pfc daemon is not started or takes too long to start.

Workaround: first apply all updates,
then start the pfcdaemon again before the Live Update starts with:

startsrc -s pfcdaemon

4: Using Live Update on TL02 SP2 gives a strange license error during preview.
This is IJ08917: license errors when running Live Update through NIM. It is solved in 7200-02-03.
On the NIM client LPAR, edit the following file and add the -Y (agree to licenses) flag:

# vi /usr/lpp/bos.sysmgt/nim/methods/c_cust_lku

Around line 892

Change:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k"

To:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k -Y"

Or apply the following ifix:

IJ02913 for level: 7200-01-05-1845
IJ08917 for level: 7200-02-03-1845
IJ05198 for TL level: 7200-03

5: Using Live Update on TL02-SP2 after the message notifying applications of impending Live Update:

1430-134 Notification script '/usr/lpp/nfs/autofs/autofs_lku orig' failed with exit status 127
....1430-025 An error occurred while notifying applications.
1430-045 The Live Update operation failed.
Initiating log capture.

Solution: run the following once on the NIM master and also on every LPAR:

lvupdateRegScript -r -n AUTOFS_UMOUNT -d orig -P LVUP_CHECK

This is fixed in:

IJ04531 - 7200-02-02-1832
IJ04532 - 7200-03-00-1837
IV89963 - 7200-01-02-1717

6: In blackout window you get several messages like below:

The Live Update logs will show the following error against the library that is causing Live Update to fail.

mcr: cke_errno 52 cke_Sy_error 23349 cke_Xtnd_error 534 cke_buf /usr/lib/libC.a(shr_64.o)

In the NIM output you get:

Blackout Time started.

1020-007 System call mcrk_sys_spl_getfd error (50). [03.156.0064]
1020-007 System call mcrk_sys_spl_getfd error (50). [03.156.0064]
1020-007 System call MCR_GET_SHLIB_ASYNCINFO error (52). A file, file system or message queue is no longer available. [02.239.0142]
........1430-030 An error occurred while moving the workload.
1430-045 The Live Update operation failed.

This is not quite a pitfall but more a bug that you are hitting.
Problem description: the current kernel cannot properly release the shared libraries for non-root users.

Workaround: No real workaround.

Option 1: Shut down your workload and restart the Live Update operation. But in this way, it’s not a real Live Update anymore!

Option 2: Request an ifix for the level you are updating to. If you plan on updating to 7200-05-01, you need an ifix for 7200-05-01.
But with this option you have a chicken-and-egg problem, because the ifix is only active (loaded) in the new kernel. In other words, it is fixed only after a reboot, or after a Live Update has succeeded (option 1).

The impacted levels are:
Service pack: 7200-05-01-2038 (bos.mp64 7.2.5.1) You need IJ30267 --> Official APAR is now in 7200-05-02.
Technology Level: 7200-05-00-2037 (bos.mp64 7.2.5.1) You need IJ30267
Service pack: 7200-04-02-2028 (bos.mp64 7.2.4.6) You need IJ29492
Service pack: 7200-04-01-1939 (bos.mp64 7.2.4.2) You need IJ29492
Technology Level: 7200-04-00-1937 (bos.mp64 7.2.4.2) You need IJ29492

Because this fix is for a shared library, the only way for your application to know to use the latest library code is to stop your application and restart it using the updated code. Or reboot your system.

Going forward though, you do not need to stop your applications if you have the ifix installed and update to the next update of AIX.
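
The list of impacted levels under pitfall 6 can be turned into a quick lookup for a pre-check script. This is only a sketch built from the levels and APAR numbers listed above; it does not know about levels that are not in that list. The function name lvup_shlib_ifix is hypothetical.

```shell
#!/bin/sh
# lvup_shlib_ifix LEVEL: print the APAR listed in this article for the
# shared-library bug at the given level (format of "oslevel -s").
# Prints nothing when the level is not in the list.
lvup_shlib_ifix() {
    case "$1" in
        7200-05-01-2038|7200-05-00-2037) echo "IJ30267" ;;
        7200-04-02-2028|7200-04-01-1939|7200-04-00-1937) echo "IJ29492" ;;
    esac
}

# Usage: pass the level you are updating to, for example:
#   lvup_shlib_ifix 7200-05-01-2038
```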

7: The program /usr/bin/domainname fails if no NIS domain name is defined.

This pitfall only happens when you have installed the NIS fileset

lslpp -w /usr/bin/domainname
File                Fileset            Type
----------------------------------------------------------------------------
/usr/bin/domainname bos.net.nis.client File

NIS, also known as Yellow Pages, is an old way of resolving hostnames. My advice is to remove this fileset because it’s old and insecure, unless you still use it.
Another workaround is to replace the executable with a script that does nothing but exit with return code 0.

8: In blackout window you get the following misleading message:

1020-154 Resource /old/dev/nvram is not checkpointable (pid: 19005844) (name: lscfg_chrp). Query returned 22 Invalid argument [02.229.0462]

        1020-089 Internal error detected. [02.335.0458]

 

However, when investigating the logs you also see messages like:

mcr: cke_errno 52 cke_Sy_error 23349 cke_Xtnd_error 534 cke_buf /usr/lib/libC.a(shr_64.o)

 

Digging further into the logs shows that the connection with the HMC(s) is lost.

[RPP#rpp::ManagedSystem::display] #### MANAGED SYSTEM #################

[RPP#rpp::ManagedSystem::display] ## NAME       : p9-S924-zwl-1-7835130

[RPP#rpp::ManagedSystem::display] ## OP STATUS  : OperationError::OP_PARAM_ERROR

[RPP#rpp::ManagedSystem::display] ## RUUID      : d1197d99-bb08-38ea-b447-76f6073f4210

[RPP#rpp::ManagedSystem::display] ## MTMS       : 9009-42A*7835130

[RPP#rpp::ManagedSystem::display] ## HMC_RELS   : hmc-NAME 

[RPP#rpp::ManagedSystem::display] ## MEM / PU   : Values stored are not considered valid

[RPP#rpp::ManagedSystem::display] ##-- ONOFF COD POOL

[RPP#rpp::ManagedSystem::display] ## MEM        : Values stored are not considered valid

[RPP#rpp::ManagedSystem::display] ## PROC       : Values stored are not considered valid

[RPP#rpp::ManagedSystem::display] ## EPCOD      : No pool related to this managed system

[RPP#rpp::ManagedSystem::display] ##-- RESOURCE REQUEST

 And:

[RPP#rpp::ManagementConsole::rcToError] (hmc-zw-2): librpp error: Operation encountered an error that can be parsed. Check for REST or HSCL errors

[RPP#rpp::ManagementConsole::extractError] (hmc-zw-2): Content of error: hmcGetJsonQuick: -- Could not retrieve </rest/api/uom/PowerEnterprisePool/quick/All>: <[HSCL350B The user does not have the appropriate authority. ]>.

[RPP#rpp::ManagementConsole::parseHsclError] (hmc-zw-2): Found HSCL error code: 350B

[RPP#rpp::ManagementConsole::getAllEntitiesAttrs] (hmc-zw-2): Encountered an error related to the management console. Invalidate it

[RPP#rpp::ManagedSystem::refreshResources] (p9-S924-zwl-1-7835130): librpp error: No valid management console to execute the request

[RPP#rpp::ManagedSystem::refreshOnOffResources] (p9-S924-zwl-1-7835130): librpp error: No valid management console to execute the request

    <ReasonCode>Unknown internal error.</ReasonCode>

[2021-11-04][16:03:24]libhmc_http.c:getHmcErrorMessage:1395: HMC error msg: [[HSCL350B The user does not have the appropriate authority.

[2021-11-04][16:03:24]libhmc_http.c:getHmcErrorMessage:1395: HMC error msg: [[HSCL350B The user does not have the appropriate authority.


Solution for this problem:

Make sure that the connection between the LPAR and HMC is in a working state. If it is not, use the following procedure:

First check the current status:

/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

This should give a similar output as below:

Management Domain Status: Management Control Points

  I A  0x7edb4439af21a36f  0002  xxx.xxx.xxx.xxx (ip address)

  I A  0x94580b842086855a  0001  xxx.xxx.xxx.xxx (ip address)

 

And on the HMC cli:

lspartition -dlpar

 <#55> Partition:<13*9009-42A*7xxxxxxx, lparname, lpar-ip-address>

       Active:<1>, OS:<AIX, 7.2, 7200-04-02-2028>, DCaps:<0xc2c5f>, CmdCaps:<0x4000003b, 0x3b>, PinnedMem:<6559>

If not, you can reconfigure the RMC daemon on the failing LPAR.

First reconfigure RSCT:

/usr/sbin/rsct/bin/rmcctrl -z # (Stop)

/usr/sbin/rsct/install/bin/uncfgct -n # (Unconfigure)

/usr/sbin/rsct/install/bin/recfgct # (Reconfigure)

The output shows something like this:
2610-656 Could not obtain information about the cluster: cu_obtain_cluster_info:Invalid current cluster pointer. fopen(NODEDEF FILE=/var/ct/IW/cfg/nodedef.cfg) fail (errno=2)

0513-071 The ctcas Subsystem has been added.

0513-071 The ctrmc Subsystem has been added.

0513-059 The ctrmc Subsystem has been started. Subsystem PID is 3277166.

 

Allow remote connections again:

/usr/sbin/rsct/bin/rmcctrl -p #(allow remote connections)

 
Check the status again:

/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

Finally, make a second attempt at the LKU.


References:
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/live_update_NIM.html
gibsonnet.net/blog/cgaix/html/Chriss_AIX_Live_Update_Best_Practices.html
gibsonnet.net/blog/cgaix/html/AIX Live Update using NIM.html
https://www.ibm.com/support/pages/node/632547

I would like to thank the following people:
Nayden Stoyanov and Kenneth Anderson who helped me with solving cases and making LKU successful for us.
Chris Gibson who encouraged me and gave me invaluable help with reviewing this blog.

