Operator Lifecycle Manager in Kubernetes Operators

By Siddharth Choudhary posted yesterday

  

In this blog, I will explain the Operator Lifecycle Manager (OLM) in Kubernetes Operators and how exactly it works, in a fairly simple manner.

Prerequisites

This blog assumes that you are familiar with the following topics:

  • Docker
  • Kubernetes and/or Red Hat OpenShift
  • Kubernetes Operators

What exactly is OLM and why is it needed?

The definition of OLM that I could find on the internet goes like this:
The Operator Lifecycle Manager (OLM) is a Kubernetes-native tool designed to help you manage the entire lifecycle of Operators and their dependencies in a Kubernetes cluster.
The above definition is fairly simple to understand, and I will just extend it by explaining what the lifecycle of Operators refers to in this context.

A Kubernetes operator’s lifecycle consists of its installation, upgrade, and management.

Okay! And how does it do that?

Now this is where things start to get interesting!
The OLM, which was created to manage operators in Kubernetes, itself uses operators underneath to do what it does. Yes, that’s right, you read it correctly: it uses operators to manage operators.

Technical Details please!

Let's try to understand the technicalities underneath.
As mentioned earlier, OLM uses a set of operators, namely olm-operator, catalog-operator, and package-server, which work collectively to manage the lifecycle of Kubernetes operators. These operators reside in the olm namespace (in OpenShift clusters, where OLM is installed by default, they reside in the openshift-operator-lifecycle-manager namespace).

There are a few components, each with associated functionality, that we should know about to understand how OLM works. Let's look at a flow diagram, followed by each component and its functionality, to understand the overall process.

The flow diagram simplifies the overall flow of the OLM components, showing how each component (CatalogSource, OperatorGroup, Subscription, InstallPlan, CSV) interacts sequentially to install and manage operators in a Kubernetes cluster. Each component plays a crucial role in the declarative and automated lifecycle management of operators using OLM.

Step-by-Step Explanation:

  1. CatalogSource:
    - Defines where the operator packages (bundles) are stored, such as in a container registry or an HTTP server.
    - It is the starting point where OLM looks for available operator bundles and metadata.
  2. OperatorGroup:
    - Defines the namespaces where the operator can be installed and provides scoping for RBAC resources.
    - If multiple namespaces are required, the OperatorGroup ensures that operators can access the correct namespaces.
    - It also helps manage shared RBAC permissions across multiple namespaces for the operator.
  3. Subscription:
    - Created by the user to subscribe to a specific operator from a CatalogSource (see the example manifests after this list).
    - Specifies the operator’s name, desired version, and target namespace for installation.
    - It triggers the creation of an InstallPlan.
  4. InstallPlan:
    - Created automatically by OLM when a Subscription is made.
    - Defines the resources (CRDs, roles, deployments, etc.) that need to be installed to support the operator.
    - It may require manual approval (when the Subscription’s installPlanApproval is set to Manual) before the resources are applied to the cluster.
  5. ClusterServiceVersion (CSV):
    - Defines the specific version of the operator, including its metadata, CRDs, roles, deployment strategies, and installation details.
    - Ensures that the correct version of the operator is installed and guides OLM in handling operator lifecycle events like upgrades.
  6. Operator Installed:
    - After the InstallPlan is approved and resources are applied, the operator is fully installed and begins running in the cluster.
    - The operator manages resources as described in its CSV and follows its lifecycle from installation to upgrade.
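
To make this concrete, below is a minimal sketch of the three resources you typically create yourself (OLM generates the InstallPlan and CSV from them). All names, the namespace, the index image, and the channel are hypothetical placeholders, not tied to any particular operator:

oc apply -f - <<'EOF'
# Hypothetical CatalogSource pointing at an operator index image
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/example/my-operator-index:latest   # placeholder index image
  displayName: My Operators
---
# OperatorGroup scoping the operator (and its RBAC) to a single target namespace
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: my-operatorgroup
  namespace: my-operator-ns
spec:
  targetNamespaces:
    - my-operator-ns
---
# Subscription to the operator package from the CatalogSource above;
# OLM resolves it into an InstallPlan and a CSV
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-operator
  namespace: my-operator-ns
spec:
  channel: stable                            # placeholder channel name
  name: my-operator                          # package name inside the catalog
  source: my-catalog
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic             # set to Manual to gate InstallPlans
EOF

Once these are applied, you can watch OLM do the rest by listing the generated resources:

oc get installplan,csv -n my-operator-ns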

“And that’s it! Now you know what OLM is and how it works.”

Common issues and Troubleshooting Steps

Now, finally, let's look at some practical, everyday issues that you as a DevOps or SRE person may face when dealing with OLM, and how you can fix them.

  • First and foremost, if there seems to be any issue with OLM, check the logs of olm-operator, catalog-operator, and package-server in the olm (or openshift-operator-lifecycle-manager) namespace, as shown below.
oc get po -n openshift-operator-lifecycle-manager
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-58494bdc8b-qv6nb        1/1     Running     0          26d
olm-operator-7ddb9745d4-fmgd2            1/1     Running     0          26d
package-server-manager-d754fbd58-xl6c9   2/2     Running     0          34d
packageserver-fdb96cd46-kv264            1/1     Running     0          34d
packageserver-fdb96cd46-w7lbc            1/1     Running     0          34d
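For example, to pull recent logs from these components (the default deployment names match the pods listed above):
oc logs deploy/olm-operator -n openshift-operator-lifecycle-manager --tail=100
oc logs deploy/catalog-operator -n openshift-operator-lifecycle-manager --tail=100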
  • Ideally, if everything is normal and the catalog-operator pod is in a Running state, the CSV gets created/updated within a minute after the subscription is applied, but at times the OLM operators may choke due to various or even intermittent issues. In such scenarios, a rolling restart or a simple pod deletion and recreation of the OLM operators is a good option; after the restart you can watch the CSV recover, as shown below.
oc rollout restart deploy -n openshift-operator-lifecycle-manager
deployment.apps/catalog-operator restarted
deployment.apps/olm-operator restarted
deployment.apps/package-server-manager restarted
deployment.apps/packageserver restarted
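After the restart, one way to confirm recovery is to watch the CSV in the operator's namespace until it reaches the Succeeded phase (it normally moves through Pending, InstallReady, and Installing first):
oc get csv -n <NAMESPACE> -w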
  • There can be a scenario where there are multiple operators on the cluster and, as a result, multiple catalog sources. If even one of those catalog sources is not in a healthy state, it will affect the lifecycle management and operations of all the operators present on the cluster. In such cases we need to identify and fix the troublesome catalog source (see the status check below), and that will fix everything.
oc get catsrc -n openshift-marketplace                                                           
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     7d1h
community-operators   Community Operators   grpc   Red Hat     7d1h
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     7d1h
redhat-operators      Red Hat Operators     grpc   Red Hat     7d1h

oc get po -n openshift-marketplace
NAME                                   READY   STATUS    RESTARTS       AGE
certified-operators-gnpzf              1/1     Running   0              36h
community-operators-xbbmb              1/1     Running   0              103m
marketplace-operator-d8bfdb9df-w6m56   1/1     Running   1 (7d1h ago)   7d1h
redhat-marketplace-j4jjx               1/1     Running   0              3d9h
redhat-operators-lgmll                 1/1     Running   0              43m
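To confirm which catalog source is the unhealthy one, you can check its connection state; on the grpc catalog sources I have worked with this is reported under status.connectionState, where READY means healthy and TRANSIENT_FAILURE usually points at the culprit:
oc get catsrc <CATSRC_NAME> -n openshift-marketplace -o jsonpath='{.status.connectionState.lastObservedState}'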
  • The bundle-unpacking job does not come up. This generally happens when there is a resource crunch on the node where the unpacking job is scheduled, or when that node is not in a Ready state for some reason. This is not an OLM-specific issue but a generic one: you need to make sure that all the nodes in the cluster are healthy and in a Ready state (see the checks after the node listing below).
oc get nodes
NAME                             STATUS   ROLES                  AGE    VERSION
master0.sc-kho.cp.fyre.ibm.com   Ready    control-plane,master   7d1h   v1.32.7
master1.sc-kho.cp.fyre.ibm.com   Ready    control-plane,master   7d1h   v1.32.7
master2.sc-kho.cp.fyre.ibm.com   Ready    control-plane,master   7d1h   v1.32.7
worker0.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker1.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker2.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker3.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
worker4.sc-kho.cp.fyre.ibm.com   Ready    worker                 7d1h   v1.32.7
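If the nodes themselves look Ready, two generic checks I fall back on are listing any Pending pods (a stuck unpack-job pod will usually show up here) and describing the node it was scheduled on to review its conditions and allocatable resources:
oc get pods -A --field-selector=status.phase=Pending
oc describe node <NODE_NAME>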
  • I have also experienced the very common problem of orphan CSVs, where during cleanup of an operator from the cluster the CSV is somehow missed, and on re-installation the already existing CSV doesn’t match the newly created subscription, so OLM chokes. In such situations, deleting the orphan CSV is what I would recommend (see below for how to spot it).
    As a precautionary measure, I would suggest deleting all of the operator's OLM components when cleaning up the operator from the cluster, in the following sequence: Subscription, CSV, InstallPlan, OperatorGroup, CatalogSource.
oc delete subscription <SUBSCRIPTION_NAME> -n <NAMESPACE>
oc delete csv <CSV_NAME> -n <NAMESPACE>
oc delete ip --all -n <NAMESPACE>
oc delete catsrc <CATSRC_NAME> -n openshift-marketplace
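To spot the orphaned CSV in the first place, list the CSVs alongside the Subscriptions in the namespace; a CSV with no matching Subscription (or one stuck in a Failed phase) is the likely leftover to delete:
oc get csv,sub -n <NAMESPACE>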

NOTE: 

  1. All the InstallPlans can be deleted; it won't cause any harm to other operators in the namespace.
  2. If multiple operators exist in the namespace and there is only one OperatorGroup, then skipping the OperatorGroup deletion is fine.

All the above-mentioned scenarios are the ones I face most frequently when dealing with OLM, along with the measures I take to fix them.

“And now, finally, you are ready to start working with Kubernetes Operators, their lifecycle, and the challenges you might face when dealing with them. I hope this helped you learn something new today. Happy learning!”
