Debug issues in MachineConfig when installing Cloud Pak for Data

By DA WEI ZHANG posted Thu March 24, 2022 04:02 AM


When installing or upgrading Cloud Pak for Data, you may run into issues with the OpenShift Machine Config. This blog introduces how to handle them.


Machine Config Operator

The Machine Config Operator manages and applies configuration and updates of the base operating system and container runtime, including everything between the kernel and kubelet.

There are four components:

  • machine-config-server: Provides Ignition configuration to new machines joining the cluster.

  • machine-config-controller: Coordinates the upgrade of machines to the desired configurations defined by a MachineConfig object. Options are provided to control the upgrade for sets of machines individually.

  • machine-config-daemon: Applies new machine configuration during update. Validates and verifies the state of the machine against the requested machine configuration.

  • machine-config: Provides a complete source of machine configuration at installation, first start up, and updates for a machine.

Machine config

The Machine Config Operator (MCO) manages updates to systemd, CRI-O and Kubelet, the kernel, Network Manager and other system features. It also offers a MachineConfig CRD that can write configuration files onto the host (see machine-config-operator). Understanding what MCO does and how it interacts with other components is critical to making advanced, system-level changes to an OpenShift Container Platform cluster. 

The controller generates Machine Configs for pre-defined roles (master and worker) and monitors whether an existing Machine Config CR (custom resource) is modified or new Machine Config CRs are created. When the controller detects any of those events, it will generate a new rendered Machine Config object that contains all of the Machine Configs based on MachineConfigSelector from each MachineConfigPool.
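The rendered config a pool currently targets is recorded on the MachineConfigPool object itself. A minimal sketch of pulling it out: the jsonpath query in the comment is the live-cluster command, while the inline JSON is a trimmed, hypothetical copy of a pool object so the extraction can be tried without a cluster.

```shell
# On a live cluster, the rendered config name lives at .spec.configuration.name:
#   oc get mcp worker -o jsonpath='{.spec.configuration.name}'
# Offline demonstration against a trimmed, hypothetical MachineConfigPool object:
MCP_JSON='{"spec":{"configuration":{"name":"rendered-worker-57ebc7c3ed99525130a09c63553cab00"}}}'
printf '%s' "$MCP_JSON" \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["spec"]["configuration"]["name"])'
```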


MCP Issues

When installing Cloud Pak for Data, we can use this command to watch the MCP status:

watch -n1 'oc get mcp -o wide; echo; oc get node -o "custom-columns=NAME:metadata.name,STATE:metadata.annotations.machineconfiguration\\.openshift\\.io/state,DESIRED:metadata.annotations.machineconfiguration\\.openshift\\.io/desiredConfig,CURRENT:metadata.annotations.machineconfiguration\\.openshift\\.io/currentConfig,REASON:metadata.annotations.machineconfiguration\\.openshift\\.io/reason"'


After the machine config update finishes, UPDATED should be True and UPDATING should be False for each pool, and on each node STATE should be Done with DESIRED matching CURRENT:

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-e1843a5f905763d8e9f2bd70fb8523b3   True      False      False      3              3                   3                      0                      95d
worker   rendered-worker-57ebc7c3ed99525130a09c63553cab00   True      False      False      3              3                   3                      0                      95d

NAME                                       STATE   DESIRED                                            CURRENT                                            REASON
master0.dbai-wml-osp4813.cp.fyre.ibm.com   Done    rendered-master-e1843a5f905763d8e9f2bd70fb8523b3   rendered-master-e1843a5f905763d8e9f2bd70fb8523b3
master1.dbai-wml-osp4813.cp.fyre.ibm.com   Done    rendered-master-e1843a5f905763d8e9f2bd70fb8523b3   rendered-master-e1843a5f905763d8e9f2bd70fb8523b3
master2.dbai-wml-osp4813.cp.fyre.ibm.com   Done    rendered-master-e1843a5f905763d8e9f2bd70fb8523b3   rendered-master-e1843a5f905763d8e9f2bd70fb8523b3
worker0.dbai-wml-osp4813.cp.fyre.ibm.com   Done    rendered-worker-57ebc7c3ed99525130a09c63553cab00   rendered-worker-57ebc7c3ed99525130a09c63553cab00
worker1.dbai-wml-osp4813.cp.fyre.ibm.com   Done    rendered-worker-57ebc7c3ed99525130a09c63553cab00   rendered-worker-57ebc7c3ed99525130a09c63553cab00
worker2.dbai-wml-osp4813.cp.fyre.ibm.com   Done    rendered-worker-57ebc7c3ed99525130a09c63553cab00   rendered-worker-57ebc7c3ed99525130a09c63553cab00


When I added a private registry to install Cloud Pak for Data, I incorrectly added an empty registry, and then MCP got stuck on master0 and worker0.

# $PRIVATE_REGISTRY was not defined, so an empty registry ended up in the cluster image config
oc patch image.config.openshift.io/cluster --type=merge -p '{"spec":{"registrySources":{"insecureRegistries":["$PRIVATE_REGISTRY"]}}}'
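One way to avoid this mistake is to fail fast when the variable is empty and to build the patch with double quotes so the shell actually expands it. A sketch, with a hypothetical registry host as the fallback and a leading echo so the oc command is printed rather than applied:

```shell
# Refuse to patch with an empty registry value
# (the default host below is a hypothetical placeholder)
PRIVATE_REGISTRY="${PRIVATE_REGISTRY:-myregistry.example.com:5000}"
if [ -z "$PRIVATE_REGISTRY" ]; then
  echo "PRIVATE_REGISTRY is empty; refusing to patch" >&2
  exit 1
fi
# Double quotes let the shell expand the variable into the patch payload
PATCH="{\"spec\":{\"registrySources\":{\"insecureRegistries\":[\"${PRIVATE_REGISTRY}\"]}}}"
# Drop the leading echo to apply against a live cluster
echo oc patch image.config.openshift.io/cluster --type=merge -p "$PATCH"
```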

Every 1.0s: oc get mcp -o wide; echo; oc get node -o "custom-columns=NAME:metadata.name,STATE:metadata.annotations.machineconfiguration\\.openshift\\.io/state,DESIRED:metadata.annotations.machineconfig... api.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com: Wed Mar 23 19:13:08 2022
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-00be86b020ab6063df5dbfebbcc407e9   False     True       False      3              0                   0                      0                      5d23h
worker   rendered-worker-7ae75ff9ac8c1b452e3f60e1d478ac8d   False     True       False      3              0                   0                      0                      5d23h
NAME                                           STATE     DESIRED                                            CURRENT                                            REASON
master0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Working   rendered-master-51e0e3942c2ef29014445e3ca9eecbd3   rendered-master-00be86b020ab6063df5dbfebbcc407e9
master1.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Done      rendered-master-00be86b020ab6063df5dbfebbcc407e9   rendered-master-00be86b020ab6063df5dbfebbcc407e9
master2.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Done      rendered-master-00be86b020ab6063df5dbfebbcc407e9   rendered-master-00be86b020ab6063df5dbfebbcc407e9
worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Working   rendered-worker-d0a347d6c8a9c3267fec666c1afecd5f   rendered-worker-7ae75ff9ac8c1b452e3f60e1d478ac8d
worker1.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Done      rendered-worker-7ae75ff9ac8c1b452e3f60e1d478ac8d   rendered-worker-7ae75ff9ac8c1b452e3f60e1d478ac8d
worker2.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Done      rendered-worker-7ae75ff9ac8c1b452e3f60e1d478ac8d   rendered-worker-7ae75ff9ac8c1b452e3f60e1d478ac8d


I found that the rendered MachineConfig had incorrect contents for /etc/containers/registries.conf:

oc describe mc rendered-worker-d0a347d6c8a9c3267fec666c1afecd5f

Source: data:text/plain,unqualified-search-registries%20%3D%20%5B%22registry.access.redhat.com%22%2C%20%22docker.io%22%5D%0A%0A%5B%5Bregistry%5D%5D%0A%20%20prefix%20%3D%20%22%22%0A%20%20insecure%20%3D%20true%0A
Mode: 420
Overwrite: true
Path: /etc/containers/registries.conf
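The Source field is URL-encoded. Assuming python3 is available on the workstation, it can be decoded like this (the SRC value is copied from the output above):

```shell
# Decode the URL-encoded Source field of the rendered MachineConfig file entry
SRC='data:text/plain,unqualified-search-registries%20%3D%20%5B%22registry.access.redhat.com%22%2C%20%22docker.io%22%5D%0A%0A%5B%5Bregistry%5D%5D%0A%20%20prefix%20%3D%20%22%22%0A%20%20insecure%20%3D%20true%0A'
printf '%s' "${SRC#data:text/plain,}" \
  | python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.stdin.read()))'
```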

Decoding the content shows the incorrect entry: the [[registry]] block has insecure = true but no location.

unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]

[[registry]]
  prefix = ""
  insecure = true


When an MCP update is stuck due to other problems, it can often be resolved by rebooting the nodes. But this time a reboot won't work, because CRI-O fails to start. I found the errors in systemd:


ssh core@$node_name

$ sudo su -
Last login: Thu Mar 24 02:15:00 UTC 2022 on pts/0
[systemd]
Failed Units: 1
crio.service
# systemctl status crio
Warning: The unit file, source configuration file or drop-ins of crio.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● crio.service - Open Container Initiative Daemon
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/crio.service.d
└─10-mco-default-env.conf, 10-mco-default-madv.conf, 20-nodenet.conf
Active: failed (Result: exit-code) since Thu 2022-03-24 02:30:26 UTC; 14s ago
Docs: https://github.com/cri-o/cri-o
Process: 7138 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 7138 (code=exited, status=1/FAILURE)
CPU: 146ms

Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com crio[7138]: time="2022-03-24 02:30:26.956936601Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com crio[7138]: time="2022-03-24 02:30:26.957067683Z" level=info msg="Node configuration value for pid cgroup is true"
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com crio[7138]: time="2022-03-24 02:30:26.957082533Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com crio[7138]: time="2022-03-24 02:30:26.979861390Z" level=info msg="Node configuration value for systemd CollectMode is true"
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com crio[7138]: time="2022-03-24 02:30:26.989914800Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_>
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com crio[7138]: time="2022-03-24 02:30:26.990113183Z" level=fatal msg="Validating runtime config: invalid registries: error loading registries configuration \"/etc/containers/registries.conf\": invalid location: ca>
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com systemd[1]: crio.service: Failed with result 'exit-code'.
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com systemd[1]: Failed to start Open Container Initiative Daemon.
Mar 24 02:30:26 worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com systemd[1]: crio.service: Consumed 146ms CPU time


$  cat /etc/containers/registries.conf
unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]

[[registry]]
  prefix = ""
  insecure = true

Solution


First, fix the wrong setting with an oc command:

# Set $REGISTRY to your actual private registry host; note the quoting so the shell expands the variable
oc patch image.config.openshift.io/cluster --type=merge -p '{"spec":{"registrySources":{"insecureRegistries":["'"$REGISTRY"'"]}}}'


But MCP was still blocked on some nodes. Why? Because the previous render had not finished: since CRI-O failed to start, kubelet could not start either, so the MCP update could not complete.

# oc get nodes

NAME                                           STATUS                        ROLES    AGE     VERSION
master0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   NotReady,SchedulingDisabled   master   5d23h   v1.19.16+3d19195
master1.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Ready                         master   5d23h   v1.19.16+3d19195
master2.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Ready                         master   5d23h   v1.19.16+3d19195
worker0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   NotReady,SchedulingDisabled   worker   5d23h   v1.19.16+3d19195
worker1.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Ready                         worker   5d23h   v1.19.16+3d19195
worker2.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Ready                         worker   5d23h   v1.19.16+3d19195


Normally we should not manually modify any file managed by machine config; any mismatch leads to MCP update failure. But this time I had to fix CRI-O first:


ssh core@$node_name
Last login: Thu Mar 24 02:44:01 2022 from 10.17.37.52
$ sudo su -
[systemd]
Failed Units: 1
  crio.service

# Fix /etc/containers/registries.conf
# systemctl restart crio

# systemctl restart kubelet
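A [[registry]] entry needs a location to be valid. A minimal sketch of a repaired /etc/containers/registries.conf, using a hypothetical registry host that must be replaced with your actual private registry:

```toml
unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]

[[registry]]
  prefix = ""
  # hypothetical host: substitute your real private registry here
  location = "myregistry.example.com:5000"
  insecure = true
```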

The watch command output then shows a mismatch error:

NAME                                           STATE      DESIRED                                            CURRENT                                            REASON
master0.dw-cpd-upgrade-ocp46.cp.fyre.ibm.com   Degraded   rendered-master-51e0e3942c2ef29014445e3ca9eecbd3   rendered-master-8d53441fc59a6f0e2e8118433b2eeab4   unexpected on-disk state validating against rendered-master-51e0e3942c2ef29014445e3ca9eecbd3: content mismatch for file "/etc/containers/registries.conf"

Then set an override flag on each blocked node:

ssh core@$node_name sudo touch /run/machine-config-daemon-force

Get the correct render name and patch the nodes:

oc patch node $node_name --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/currentConfig": "'"$CORRECT_RENDER"'"}}}'
oc patch node $node_name --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/desiredConfig": "'"$CORRECT_RENDER"'"}}}'
oc patch node $node_name --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/reason": ""}}}'
oc patch node $node_name --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/state": "Done"}}}'
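The four patches above can be sketched as one loop, which helps when several nodes are stuck. The $node_name and $CORRECT_RENDER defaults below are hypothetical placeholders, and the leading echo prints each oc command instead of running it:

```shell
# Re-stamp the machine-config annotations on a stuck node in one pass
node_name="${node_name:-worker0.example.com}"                 # hypothetical node
CORRECT_RENDER="${CORRECT_RENDER:-rendered-worker-0123abcd}"  # hypothetical render name
for kv in "currentConfig=${CORRECT_RENDER}" "desiredConfig=${CORRECT_RENDER}" "reason=" "state=Done"; do
  key="${kv%%=*}"; val="${kv#*=}"
  # Drop the leading echo to apply against a live cluster
  echo oc patch node "$node_name" --type merge \
    --patch "{\"metadata\":{\"annotations\":{\"machineconfiguration.openshift.io/${key}\":\"${val}\"}}}"
done
```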

Uncordon and reboot the nodes:

kubectl uncordon $node_name
# reboot nodes

ssh core@$node_name sudo shutdown -r 1

In my environment I encountered a "failed to drain node" error after one hour. I had to run uncordon again, and then the MCP render continued to proceed correctly on the nodes.

References



  • OpenShift Container Platform 4: How does Machine Config Pool work?
  • Machine configuration tasks
  • How to skip validation of failing / stuck MachineConfig in OCP 4?
  • Node in degraded state because of the use of a deleted machineconfig

