This blog explains the troubleshooting procedure to follow when the IP address of a deployed virtual machine is not pingable or the virtual machine does not boot.
Starting with the PowerVC 1.4.4 release, support for SLES 11 is being withdrawn. This topic applies to RHEL 7.5, RHEL 7.6, RHEL 7.7, RHEL 8, RHEL 8.1, SLES 12, SLES 15, and Ubuntu 16.04.
Cloud-init input and output files
Cloud-init paths
The paths in this document are the normal paths cloud-init uses on Linux. AIX uses slightly different paths. The Linux to AIX path mapping is as follows:
Linux | AIX |
/var/lib/cloud | /opt/freeware/var/lib/cloud |
/etc/cloud | /opt/freeware/etc/cloud |
/usr/lib/python2.7/site-packages/cloudinit (SLES 15: /usr/lib/python3/site-packages/cloudinit) | /opt/freeware/lib/python2.7/site-packages/cloudinit |
/usr/bin/cloud-init | /opt/freeware/bin/cloud-init |
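If you script checks that run on both Linux and AIX, the table above can be captured as a small lookup. The following is only a sketch: the function name to_aix_path is made up for illustration, and the mapping contains nothing beyond the prefixes listed in the table.

```python
# Linux-to-AIX prefix mapping copied from the table above.
LINUX_TO_AIX = {
    "/var/lib/cloud": "/opt/freeware/var/lib/cloud",
    "/etc/cloud": "/opt/freeware/etc/cloud",
    "/usr/lib/python2.7/site-packages/cloudinit":
        "/opt/freeware/lib/python2.7/site-packages/cloudinit",
    "/usr/bin/cloud-init": "/opt/freeware/bin/cloud-init",
}


def to_aix_path(linux_path):
    """Return the AIX equivalent of a Linux cloud-init path, or the path unchanged."""
    # Try longer prefixes first so the most specific mapping wins.
    for prefix in sorted(LINUX_TO_AIX, key=len, reverse=True):
        if linux_path == prefix or linux_path.startswith(prefix + "/"):
            return LINUX_TO_AIX[prefix] + linux_path[len(prefix):]
    return linux_path


print(to_aix_path("/var/lib/cloud/instance/boot-finished"))
```

Paths that do not start with one of the listed prefixes, such as /var/log/messages, are returned unchanged.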
Cloud-init config drive input
On a virtual machine, you can mount the CD drive and look at the files that provide input to cloud-init:
mkdir /cdrom
mount /dev/sr0 /cdrom
# Note: on AIX the mount command is: mount -rv cdrfs /dev/cd0 /cdrom
cat /cdrom/openstack/content/0000            # network configuration
cat /cdrom/openstack/latest/meta_data.json   # server hostname and other properties
cat /cdrom/openstack/latest/user_data        # scripts and configuration passed as user data
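Because meta_data.json is JSON, it can be easier to pull out a few fields than to read the raw file. The sketch below assumes the config drive is mounted at /cdrom as shown above. The key names "hostname", "name" and "uuid" are assumptions about what the file contains, and summarize_config_drive is a hypothetical helper name.

```python
import json
import os


def summarize_config_drive(mount_point="/cdrom"):
    """Return a few fields from meta_data.json on a mounted config drive."""
    path = os.path.join(mount_point, "openstack", "latest", "meta_data.json")
    with open(path) as handle:
        meta = json.load(handle)
    # ASSUMPTION: "hostname", "name" and "uuid" are the keys of interest.
    # .get() returns None for any key that is not present.
    return {key: meta.get(key) for key in ("hostname", "name", "uuid")}


# Example use once the drive is mounted:
#   print(summarize_config_drive("/cdrom"))
```

A value of None for hostname suggests that the metadata did not carry the value you expected from the deploy.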
Cloud-init output files
Cloud-init logs its output to different files depending on the operating system release and cloud-init release that is used. You can find varying amounts of cloud-init output in the following files:
/var/log/cloud-init.log
/var/log/cloud-init-output.log
/var/log/messages
/var/lib/cloud/instance/boot-finished
/var/lib/cloud/data/result.json
/var/lib/cloud/data/status.json
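Since the set of files varies by release, a quick first step is to see which of them exist on the virtual machine. This is a minimal sketch using only the paths listed above; the root parameter is an illustrative addition so that the check can also be pointed at a copied file tree.

```python
import os

# Output locations listed in this section (Linux paths).
OUTPUT_FILES = [
    "/var/log/cloud-init.log",
    "/var/log/cloud-init-output.log",
    "/var/log/messages",
    "/var/lib/cloud/instance/boot-finished",
    "/var/lib/cloud/data/result.json",
    "/var/lib/cloud/data/status.json",
]


def find_output_files(root="/"):
    """Map each known output path to True if it exists under root."""
    return {path: os.path.exists(os.path.join(root, path.lstrip("/")))
            for path in OUTPUT_FILES}


for path, present in sorted(find_output_files().items()):
    print("%-45s %s" % (path, "present" if present else "missing"))
```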
On some distros, it may be necessary to edit /etc/cloud/cloud.cfg.d/05_logging.cfg and change the following WARNING to DEBUG in order to get more log data written to the files:
[handler_consoleHandler]
class=StreamHandler
level=WARNING
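If you need to make this change on several images, it can be scripted. The sketch below works on the text of the file only and leaves reading and writing the file to you. The function name set_console_log_level is hypothetical, and the code assumes the handler section looks like the excerpt above, possibly indented.

```python
def set_console_log_level(config_text, new_level="DEBUG"):
    """Return config_text with the level under [handler_consoleHandler] changed."""
    result = []
    in_console_handler = False
    for line in config_text.splitlines(True):  # True keeps the line endings
        stripped = line.strip()
        if stripped.startswith("["):
            # A section header: remember whether it is the console handler.
            in_console_handler = (stripped == "[handler_consoleHandler]")
        elif in_console_handler and stripped.startswith("level="):
            # Preserve the original indentation and line ending.
            indent = line[:len(line) - len(line.lstrip())]
            ending = line[len(line.rstrip("\r\n")):]
            line = indent + "level=" + new_level + ending
        result.append(line)
    return "".join(result)
```

Only the level line inside the console handler section is rewritten; level settings in other sections are left alone. Keep a backup copy of the original file before writing the result back.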
Does the virtual machine boot into its operating system?
When a deployed virtual machine does not ping, you must first determine whether it is able to boot into its operating system. The easiest way to determine this is to open a partition console, also known as a console terminal, and try to start the virtual machine.
The console terminal will show the messages and the status of the boot. If the messages eventually show a login prompt, then the virtual machine is booting into its operating system.
There are several ways to open a console terminal. For more information, refer to the following topics:
For HMC managed systems: Remote console terminal - HMCconsole, mkvterm and vtmenu
For NovaLink managed systems, use the "mkvterm" command. See the command line interface documentation for more information.
For KVM managed systems, you can use the “virsh console” command from the KVM host.
If the partition boots into its operating system, continue to "1. The virtual machine (partition) boots".
If the partition does not boot into its operating system, continue to "2. The partition (virtual machine) does not boot".
1. The virtual machine (partition) boots
Determine whether the IP configuration specified during the deploy is configured.
Run the following commands to determine whether the network configuration has been applied:
ifconfig -a and netstat -nr
The ifconfig command will show you the network interfaces and their IP addresses. Check if the IP address is the one specified during the deploy or if the interface is missing an IP address.
The netstat -nr command will show the routing table. Check the routing table to see if the default IPv4 gateway that was specified on the deploy is set. Note that the default gateway will be listed with either a destination of "default" or 0.0.0.0.
Here are two examples, the first from Linux and the second from AIX:
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.32.42.1 0.0.0.0 UG 0 0 0 eth0
Routing tables
Destination Gateway Flags Refs Use If Exp Groups
Route Tree for Protocol Family 2 (Internet):
default 10.32.42.1 UG 4 1016792 en2 - -
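The default-gateway check can also be done programmatically on saved netstat -nr output. The following is a rough sketch that relies only on the rule stated above, that the default route has a destination of "default" or 0.0.0.0. The function name find_default_gateways is made up for illustration.

```python
def find_default_gateways(netstat_output):
    """Return the gateway of every default route found in `netstat -nr` output."""
    gateways = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # The destination is the first column and the gateway is the second.
        if len(fields) >= 2 and fields[0] in ("default", "0.0.0.0"):
            gateways.append(fields[1])
    return gateways


linux_example = """Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.32.42.1 0.0.0.0 UG 0 0 0 eth0"""
print(find_default_gateways(linux_example))
```

An empty list means that no default route was found, which corresponds to the "gateway is not set" case.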
If the IP address and gateway are both set, continue to "1.1 IP address and gateway are both set."
If either the IP address or gateway is not set, continue to "1.2 IP address and gateway are not set."
1.1 IP address and gateway are both set
Ping the gateway, for example: ping 10.32.42.1
If the ping is successful, there is likely a problem with the network configuration outside of your virtual machine and operating system that is not allowing the virtual machine to be pinged.
If the ping is unsuccessful, there might be a configuration problem with the network or the Shared Ethernet Adapter that is preventing the operating system from reaching its gateway.
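If you want to fold the gateway test into a script, a minimal sketch is shown below. It assumes that the ping command on the virtual machine accepts -c to limit the number of packets and exits with status 0 on success. The run parameter is an illustrative hook that lets the command runner be swapped out when testing the function.

```python
import subprocess


def gateway_reachable(gateway, count=3, run=subprocess.call):
    """Return True if `ping -c COUNT GATEWAY` exits with status 0."""
    # ASSUMPTION: ping supports -c and returns 0 when replies are received.
    command = ["ping", "-c", str(count), gateway]
    return run(command) == 0


# Example use on the virtual machine:
#   gateway_reachable("10.32.42.1")
```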
1.2 IP address and gateway are not set
1.2.1 Ensure that modifications to /etc/cloud/cloud.cfg did not leave an unparseable file
When the IP address and gateway are not set as expected, first check to see if any modifications to the /etc/cloud/cloud.cfg file resulted in an unparseable file. The cloud.cfg file is in YAML format and this format is dependent on whitespace being correct. If /etc/cloud/cloud.cfg is unparseable, messages similar to the following will appear in one or more of the cloud-init logs on a newly-deployed virtual machine running cloud-init:
Jun 19 03:21:43 localhost cloud-init-local: 2016-06-19 03:21:43,425 - util.py[WARNING]: Failed loading yaml blob
Jun 19 03:21:43 localhost journal: [CLOUDINIT] util.py[WARNING]: Failed loading yaml blob
To invoke the YAML parser to check the syntax of a /etc/cloud/cloud.cfg file, you can run python and the following python subcommands on the virtual machine where the /etc/cloud/cloud.cfg file resides:
[root@svcimg-capture-1 ~]# python
Python 2.7.5 (default, Feb 20 2018, 09:25:48)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> cloudcfg = open('/etc/cloud/cloud.cfg', 'rb').read()
>>> yaml.load(cloudcfg, Loader=yaml.SafeLoader)
{'ssh_pwauth': 0, 'users': ['default'], 'disable_root': False, 'resize_rootfs_tmp': '/dev', 'mount_default_fields': [None, None, 'auto', 'defaults,nofail', '0', '2'], 'system_info': {'paths': {'cloud_dir': '/var/lib/cloud/', 'templates_dir': '/etc/cloud/templates/'}, 'default_user': {'shell': '/bin/bash', 'gecos': 'fedora Cloud User', 'name': 'fedora', 'groups': ['wheel', 'adm', 'systemd-journal'], 'sudo': ['ALL=(ALL) NOPASSWD:ALL'], 'lock_passwd': True}, 'distro': 'fedora', 'ssh_svcname': 'sshd'}, 'cloud_init_modules': ['migrator', 'seed_random', 'bootcmd', 'write-files', 'growpart', 'resizefs', 'disk_setup', 'mounts', 'set_hostname_from_dns', 'update_etc_hosts', 'ca-certs', 'rsyslog', 'users-groups', 'ssh'], 'preserve_hostname': False, 'cloud_final_modules': ['package-update-upgrade-install', 'puppet', 'chef', 'mcollective', 'salt-minion', 'rightscale_userdata', 'scripts-vendor', 'scripts-per-once', 'scripts-per-boot', 'scripts-per-instance', 'scripts-user', 'ssh-authkey-fingerprints', 'keys-to-console', 'phone-home', 'final-message', 'power-state-change', 'reset_rmc'], 'datasource_list': ['ConfigDrive', 'None'], 'cloud_config_modules': ['ssh-import-id', 'locale', 'set-passwords', 'spacewalk', 'yum-add-repo', 'ntp', 'timezone', 'runcmd'], 'datasource': {'ConfigDrive': {'dsmode': 'local'}}}
>>> exit()
The output of the 'yaml.load(cloudcfg, Loader=yaml.SafeLoader)' subcommand above shows the result of a successful YAML load. If the yaml.load subcommand finds a syntax error in the cloud.cfg file, it will generate an error message such as this:
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
in "<string>", line 50, column 2:
- phone-home
^
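The error above was caused by a tab character, which is a common result of editing cloud.cfg by hand. If PyYAML is not available on the virtual machine, a check that uses only the Python standard library can at least locate tab-indented lines. This sketch catches only that one class of error and does not replace the yaml.load check. The function name find_tab_indented_lines is hypothetical.

```python
def find_tab_indented_lines(text):
    """Return (line_number, line) pairs whose leading whitespace contains a tab."""
    problems = []
    for number, line in enumerate(text.splitlines(), 1):
        # Everything before the first non-whitespace character.
        leading = line[:len(line) - len(line.lstrip())]
        if "\t" in leading:
            problems.append((number, line))
    return problems


# Example use:
#   with open("/etc/cloud/cloud.cfg") as handle:
#       for number, line in find_tab_indented_lines(handle.read()):
#           print("line %d starts with a tab: %r" % (number, line))
```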
1.2.2 Check the input files to ensure they contain the expected data
Refer to the input files section of this document. Check the content/0000 file to see if it contains the network configuration you expect.
1.2.3 Check if and when cloud-init ran
Run the date command and verify that the time on the operating system is accurate. Compare the timestamp on the /var/lib/cloud/instance/boot-finished file with the date in the operating system and the time the virtual machine was deployed. Note that the boot-finished file also contains the time at which it was written, expressed in UTC. This can help you compare deploy and cloud-init times without having to account for different UTC offsets on the VM and PowerVC.
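To compare times without worrying about time zones, you can print the modification time of the boot-finished file and the current system time, both in UTC. This is a small sketch that uses the file's modification time rather than parsing the file's contents; boot_finished_utc is a hypothetical helper name.

```python
import os
import time

TIME_FORMAT = "%Y-%m-%d %H:%M:%S UTC"


def boot_finished_utc(path="/var/lib/cloud/instance/boot-finished"):
    """Return the file's modification time as a UTC string, or None if absent."""
    if not os.path.exists(path):
        return None
    return time.strftime(TIME_FORMAT, time.gmtime(os.path.getmtime(path)))


print("boot-finished written: %s" % boot_finished_utc())
print("current system time:   %s" % time.strftime(TIME_FORMAT, time.gmtime()))
```

A result of None means that the file does not exist.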
If the boot-finished file is not present then cloud-init did not run or it may still be running. You can check to see if cloud-init is still running by looking at the cloud-init log entries in the /var/log/cloud-init* logs or /var/log/messages. You can also check the status.json file to see the status of the various cloud-init stages.
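The status.json file can be summarized per stage with a few lines of Python. The layout assumed here is not described in this blog: a top-level "v1" key holding one dictionary per stage, each with "start", "finished" and "errors" entries. Treat the sketch as a starting point and adjust it to what your file actually contains. The function name summarize_status is made up for illustration.

```python
import json


def summarize_status(status):
    """Return {stage: state} from a parsed status.json dictionary."""
    summary = {}
    # ASSUMPTION: stages live under a top-level "v1" key and each stage is a
    # dictionary with "start", "finished" and "errors" entries.
    for stage, info in status.get("v1", {}).items():
        if not isinstance(info, dict):
            continue  # skip entries that are not stage dictionaries
        if info.get("errors"):
            summary[stage] = "errors"
        elif info.get("finished"):
            summary[stage] = "finished"
        elif info.get("start"):
            summary[stage] = "running"
        else:
            summary[stage] = "not started"
    return summary


# Example use:
#   with open("/var/lib/cloud/data/status.json") as handle:
#       print(summarize_status(json.load(handle)))
```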
If cloud-init did not run and is not currently running, one possible reason is that the operating system does not have device drivers installed for CD/DVD devices. This can happen if you installed AIX using an lpp_source from a NIM server into a VM that did not have a CD/DVD device.
1.2.4 Activation takes a long time to run
If activation takes too long to run, the virtual optical device can be removed before activation completes. This results in an incomplete activation.
Issues that can cause activation and the system in general to run slowly include:
1. Specifying DNS IP addresses, in the /etc/resolv.conf of your image or in the PowerVC network, that are not pingable in the environment you are deploying into. This results in a generally slow system. To address this situation, it is recommended that you remove /etc/resolv.conf before capturing images that will be deployed into environments where this is a problem. On AIX, you can also set 'hosts=local,bind4' in /etc/netsvc.conf.
2. Specifying a gateway that is not pingable in the PowerVC network which you are deploying into.
3. Running AIX in a partition that has less than the operating system minimum recommended memory.
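For the first cause, it helps to list the DNS servers that the image actually carries so that each one can be pinged from the target environment. The following is a minimal sketch that parses the text of /etc/resolv.conf; the function name list_nameservers is hypothetical.

```python
def list_nameservers(resolv_conf_text):
    """Return the addresses on `nameserver` lines of a resolv.conf file."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with "#" or ";" and so never match "nameserver".
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Example use:
#   with open("/etc/resolv.conf") as handle:
#       for server in list_nameservers(handle.read()):
#           print(server)
```

Any address in the list that does not answer a ping from the deployed virtual machine is a candidate for the slowdown described above.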
2. The partition (virtual machine) does not boot
There can be several reasons a virtual machine does not boot into its operating system. For details, read the following information.
2.1 Operating system level is not supported on the system
Some operating system levels are not supported on certain levels of server hardware. For example, AIX 6100-01 will not run on POWER7 servers.
2.2 The deployed virtual machine does not have a bootable disk
It is possible to capture virtual machines that do not have bootable disk images by misidentifying the boot disks when capturing a newly managed virtual machine for the first time, or when editing a virtual machine. If an image is captured without a bootable disk, the VM will not boot on deploy.
Conclusion:
Now you know about the general errors that might happen while configuring cloud-init and how to troubleshoot them.
Authors:
Aman Kumar Sinha
Sam Matzek