The flow diagram shows, in simplified form, how the OLM components (CatalogSource, OperatorGroup, Subscription, InstallPlan, CSV) interact in sequence to install and manage operators in a Kubernetes cluster. Each component plays a distinct role in OLM's declarative, automated lifecycle management of operators.
Step-by-Step Explanation:
- CatalogSource:
  - Defines where the operator packages (bundles) are stored, typically a catalog image in a container registry that OLM queries over gRPC.
  - It is the starting point where OLM looks for available operator bundles and their metadata.
- OperatorGroup:
  - Selects the target namespaces that the operators installed alongside it can watch, and scopes the RBAC resources generated for them.
  - If an operator needs multiple namespaces, the OperatorGroup ensures that it can access the correct ones.
  - It also helps manage shared RBAC permissions across those namespaces for the operator.
- Subscription:
  - Created by the user to subscribe to a specific operator from a CatalogSource.
  - Specifies the operator’s package name, the update channel (and optionally a starting version), and the CatalogSource to use; the operator is installed in the Subscription’s own namespace.
  - It triggers the creation of an InstallPlan.
- InstallPlan:
  - Created automatically by OLM when a Subscription is created.
  - Lists the resources (CRDs, roles, deployments, etc.) that need to be installed to support the operator.
  - It may require manual approval before the resources are applied to the cluster.
- ClusterServiceVersion (CSV):
  - Describes a specific version of the operator, including its metadata, CRDs, roles, deployment strategy, and installation details.
  - Ensures that the correct version of the operator is installed and guides OLM in handling lifecycle events such as upgrades.
- Operator Installed:
  - After the InstallPlan is approved and its resources are applied, the operator is fully installed and begins running in the cluster.
  - The operator manages resources as described in its CSV and follows its lifecycle from installation to upgrade.
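To make the flow above concrete, here is a minimal sketch of the two objects a user typically creates by hand (an OperatorGroup and a Subscription), plus the patch that approves an InstallPlan when approval is set to Manual. The function names, the `<namespace>-og` naming pattern, and the argument values are illustrative assumptions, and `sourceNamespace` is assumed to be `openshift-marketplace` as in the OpenShift examples in this post.

```shell
# Sketch: render an OperatorGroup and a Subscription for a single operator.
# All arguments are caller-supplied placeholders; the "<namespace>-og" name
# is a hypothetical naming choice, not an OLM requirement.
render_subscription() {
  namespace=$1
  package=$2
  channel=$3
  catalog=$4
  approval=${5:-Automatic}   # Automatic or Manual
  cat <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: ${namespace}-og
  namespace: ${namespace}
spec:
  targetNamespaces:
    - ${namespace}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ${package}
  namespace: ${namespace}
spec:
  name: ${package}
  channel: ${channel}
  source: ${catalog}
  sourceNamespace: openshift-marketplace
  installPlanApproval: ${approval}
EOF
}

# Approve a pending InstallPlan (only needed when installPlanApproval is Manual).
approve_installplan() {
  namespace=$1
  installplan=$2
  oc patch installplan "$installplan" -n "$namespace" \
    --type merge --patch '{"spec":{"approved":true}}'
}
```

A possible usage would be `render_subscription my-namespace my-operator stable redhat-operators Manual | oc apply -f -`, followed by `approve_installplan my-namespace <INSTALLPLAN_NAME>` once OLM has created the InstallPlan.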
“And that’s it! Now you know what OLM is and how it works.”
Common Issues and Troubleshooting Steps
Finally, let's look at some practical, everyday issues that you, as a DevOps engineer or SRE, may face when dealing with OLM, and how you can fix them.
- First and foremost, if there seems to be any issue with OLM, check the logs of the olm-operator, catalog-operator and packageserver pods in the olm namespace (upstream OLM) or the openshift-operator-lifecycle-manager namespace (OpenShift).
oc get po -n openshift-operator-lifecycle-manager
NAME                                     READY   STATUS    RESTARTS   AGE
catalog-operator-58494bdc8b-qv6nb        1/1     Running   0          26d
olm-operator-7ddb9745d4-fmgd2            1/1     Running   0          26d
package-server-manager-d754fbd58-xl6c9   2/2     Running   0          34d
packageserver-fdb96cd46-kv264            1/1     Running   0          34d
packageserver-fdb96cd46-w7lbc            1/1     Running   0          34d
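Once the pods are located, their logs can be pulled in one go. The helper below is a small sketch rather than an official tool: the function name `olm_logs` and its defaults are assumptions, and the deployment names are taken from the output above.

```shell
# Sketch: print the last N log lines from each OLM deployment.
# $1 = namespace (defaults to the OpenShift one), $2 = number of lines.
olm_logs() {
  ns=${1:-openshift-operator-lifecycle-manager}
  lines=${2:-50}
  for deploy in olm-operator catalog-operator packageserver; do
    echo "=== ${deploy} ==="
    # "oc logs deployment/<name>" picks one pod of the deployment.
    oc logs "deployment/${deploy}" -n "$ns" --tail="$lines"
  done
}
```

On upstream OLM, `olm_logs olm 100` would look in the `olm` namespace instead.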
- Ideally, when everything is healthy and the catalog-operator pod is running, the CSV is created or updated within a minute of applying the Subscription. At times, however, the olm-operator may stall for various reasons, including intermittent issues. In such cases, a rolling restart of the OLM deployments (or simply deleting the pods so that they are recreated) is a good option.
oc rollout restart deploy -n openshift-operator-lifecycle-manager
deployment.apps/catalog-operator restarted
deployment.apps/olm-operator restarted
deployment.apps/package-server-manager restarted
deployment.apps/packageserver restarted
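After a restart it helps to confirm that every deployment actually came back before re-checking the Subscription. The following is a hedged sketch: `wait_for_olm` is a hypothetical helper, the deployment list is copied from the restart output above (it may differ on upstream OLM), and the 120s timeout is an arbitrary default.

```shell
# Sketch: wait until each OLM deployment has finished rolling out.
# $1 = namespace, $2 = per-deployment timeout (any value oc accepts, e.g. 120s).
wait_for_olm() {
  ns=${1:-openshift-operator-lifecycle-manager}
  timeout=${2:-120s}
  for deploy in catalog-operator olm-operator package-server-manager packageserver; do
    oc rollout status "deployment/${deploy}" -n "$ns" --timeout="$timeout" || {
      echo "rollout of ${deploy} did not complete" >&2
      return 1
    }
  done
}
```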
- A cluster with multiple operators will often have multiple catalog sources. If even one of those catalog sources is unhealthy, it can affect the lifecycle management and operations of every operator on the cluster, not just the ones it serves. In such cases, fixing the troublesome catalog source restores normal operation. Start by checking the catalog sources and their pods:
oc get catsrc -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     7d1h
community-operators   Community Operators   grpc   Red Hat     7d1h
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     7d1h
redhat-operators      Red Hat Operators     grpc   Red Hat     7d1h
oc get po -n openshift-marketplace
NAME                                   READY   STATUS    RESTARTS       AGE
certified-operators-gnpzf              1/1     Running   0              36h
community-operators-xbbmb              1/1     Running   0              103m
marketplace-operator-d8bfdb9df-w6m56   1/1     Running   1 (7d1h ago)   7d1h
redhat-marketplace-j4jjx               1/1     Running   0              3d9h
redhat-operators-lgmll                 1/1     Running   0              43m
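A Running pod does not always mean the catalog is being served, so it can be worth reading the connection state that OLM records on each catalog source. The sketch below assumes grpc-type catalog sources, which report `.status.connectionState.lastObservedState`; the function name `unhealthy_catalogs` and the `UNKNOWN` label for an empty state are illustrative choices.

```shell
# Sketch: list catalog sources whose last observed connection state is not READY.
# $1 = namespace holding the catalog sources.
unhealthy_catalogs() {
  ns=${1:-openshift-marketplace}
  oc get catalogsource -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.connectionState.lastObservedState}{"\n"}{end}' |
    # Print name and state for anything not READY; an empty state is shown as UNKNOWN.
    awk -F'\t' 'NF && $2 != "READY" {print $1 "\t" ($2 == "" ? "UNKNOWN" : $2)}'
}
```

An empty result suggests every catalog source in the namespace is connected; anything listed is a candidate for `oc describe catsrc <CATSRC_NAME> -n openshift-marketplace`.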
- Bundle-unpacking job not coming up: this generally happens when the node on which the unpacking job is scheduled is short on resources, or when that node is not in the Ready state. This is not an OLM-specific issue but a generic one: make sure that all the nodes in the cluster are healthy and Ready.
oc get nodes
NAME                             STATUS   ROLES                  AGE    VERSION
master0.sc-kho.cp.fyre.ibm.com   Ready    control-plane,master   7d1h   v1.32.7
master1.sc-kho.cp.fyre.ibm.com   Ready    control-plane,master   7d1h   v1.32.7
master2.sc-kho.cp.fyre.ibm.com   Ready    control-plane,master   7d1h   v1.32.7
worker0.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker1.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker2.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker3.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker4.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
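On a larger cluster, scanning that list by eye gets tedious. The two helpers below are a rough sketch: the function names are hypothetical, and the default namespace for `pending_pods` assumes the unpack job's pods are created in `openshift-marketplace`, as on a default OpenShift install.

```shell
# Sketch: print nodes whose STATUS column is anything other than plain "Ready"
# (this also flags cordoned nodes, shown as "Ready,SchedulingDisabled").
not_ready_nodes() {
  oc get nodes --no-headers | awk '$2 != "Ready" {print $1 "\t" $2}'
}

# Sketch: list pods stuck in Pending, e.g. an unpack job that cannot be scheduled.
pending_pods() {
  ns=${1:-openshift-marketplace}
  oc get pods -n "$ns" --field-selector=status.phase=Pending --no-headers
}
```

For any pod that `pending_pods` returns, `oc describe pod <POD_NAME> -n <NAMESPACE>` usually shows the scheduling reason in its events.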
- Another very common problem I have run into is orphaned CSVs: during cleanup of an operator, the CSV is somehow missed, and on re-installation the existing CSV is not associated with the newly created Subscription, so OLM stalls. In such situations, I recommend deleting the orphaned CSV.
As a precaution, I suggest deleting all the OLM resources when cleaning up an operator from the cluster, in the following sequence: Subscription, CSV, InstallPlan, OperatorGroup, CatalogSource.
oc delete subscription <SUBSCRIPTION_NAME> -n <NAMESPACE>
oc delete csv <CSV_NAME> -n <NAMESPACE>
oc delete ip --all -n <NAMESPACE>
oc delete operatorgroup <OPERATORGROUP_NAME> -n <NAMESPACE>
oc delete catsrc <CATSRC_NAME> -n openshift-marketplace
NOTE:
- All the InstallPlans in the namespace can be deleted; doing so does not harm the other operators in that namespace.
- If multiple operators share the namespace, and therefore its single OperatorGroup, it is fine to skip the OperatorGroup deletion.
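To spot a leftover CSV before re-installing, one option is to compare the CSVs in a namespace against what the Subscriptions report as installed. This is only a heuristic sketch: `orphan_csvs` is a hypothetical helper, it assumes every legitimately installed CSV appears in some Subscription's `.status.installedCSV`, and it skips CSVs carrying the `olm.copiedFrom` label, which OLM puts on copies made for operators installed in all-namespaces mode.

```shell
# Sketch: print CSVs in a namespace that no Subscription reports as installed.
# $1 = namespace to inspect. Review the output before deleting anything.
orphan_csvs() {
  ns=$1
  # CSV names that the Subscriptions in this namespace currently own.
  installed=$(oc get subscriptions.operators.coreos.com -n "$ns" \
    -o jsonpath='{range .items[*]}{.status.installedCSV}{"\n"}{end}')
  # "-o name" prints "<resource>/<name>"; strip the prefix and compare.
  oc get csv -n "$ns" -l '!olm.copiedFrom' -o name | sed 's|.*/||' |
    while IFS= read -r csv; do
      printf '%s\n' "$installed" | grep -qxF "$csv" || echo "$csv"
    done
}
```

Each name it prints can then be removed with the `oc delete csv <CSV_NAME> -n <NAMESPACE>` command shown earlier.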
All the scenarios above are ones I face frequently when dealing with OLM, along with the measures I took to fix them.
“And now, finally, you are ready to start working with Kubernetes Operators, their lifecycle, and the challenges you might face when dealing with them. I hope this helped you learn something new today. Happy learning!”
References