Tips for implementing your Cloud Pak for Data upgrade

By Hong Wei Jia posted Thu March 31, 2022 06:25 AM

  
After the Pre-upgrade tasks for CPD upgrade are done, the next step is to implement your Cloud Pak for Data upgrade. In this article, I'd like to share some tips about the upgrade implementation.

1. Pre-check before the upgrade
Capture the cluster state and make sure the cluster is healthy before the upgrade.
This step is critical to the success of the upgrade. We have to make sure the following conditions are met before we start.

1) Check the OpenShift cluster status
a) Make sure the cluster operators are in a healthy status
Run the following command.
oc get co

All the cluster operators should be AVAILABLE, and none of them should be PROGRESSING or DEGRADED.
Example:

NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.52 TRUE FALSE FALSE 27m
cloud-credential 4.6.52 TRUE FALSE FALSE 2d2h
cluster-autoscaler 4.6.52 TRUE FALSE FALSE 2d2h
config-operator 4.6.52 TRUE FALSE FALSE 2d2h
console 4.6.52 TRUE FALSE FALSE 12h
csi-snapshot-controller 4.6.52 TRUE FALSE FALSE 57m
dns 4.6.52 TRUE FALSE FALSE 14h
etcd 4.6.52 TRUE FALSE FALSE 2d2h
image-registry 4.6.52 TRUE FALSE FALSE 20h
ingress 4.6.52 TRUE FALSE FALSE 2d2h
insights 4.6.52 TRUE FALSE FALSE 2d2h
kube-apiserver 4.6.52 TRUE FALSE FALSE 2d2h
kube-controller-manager 4.6.52 TRUE FALSE FALSE 2d2h
kube-scheduler 4.6.52 TRUE FALSE FALSE 2d2h
kube-storage-version-migrator 4.6.52 TRUE FALSE FALSE 172m
machine-api 4.6.52 TRUE FALSE FALSE 2d2h
machine-approver 4.6.52 TRUE FALSE FALSE 2d2h
machine-config 4.6.52 TRUE FALSE FALSE 148m
marketplace 4.6.52 TRUE FALSE FALSE 159m
monitoring 4.6.52 TRUE FALSE FALSE 154m
network 4.6.52 TRUE FALSE FALSE 2d2h
node-tuning 4.6.52 TRUE FALSE FALSE 2d2h
openshift-apiserver 4.6.52 TRUE FALSE FALSE 27m
openshift-controller-manager 4.6.52 TRUE FALSE FALSE 2d2h
openshift-samples 4.6.52 TRUE FALSE FALSE 2d2h
operator-lifecycle-manager 4.6.52 TRUE FALSE FALSE 2d2h
operator-lifecycle-manager-catalog 4.6.52 TRUE FALSE FALSE 2d2h
operator-lifecycle-manager-packageserver 4.6.52 TRUE FALSE FALSE 158m
service-ca 4.6.52 TRUE FALSE FALSE 2d2h
storage 4.6.52 TRUE FALSE FALSE 2d2h

b) Make sure all the nodes are in Ready status
Run the following command.
oc get nodes
All the nodes should be in Ready status.
Example:

NAME STATUS   ROLES  AGE  VERSION
master0.jhwocp4652.cp.xxx.com Ready master 2d2h v1.19.16+3d19195
master1.jhwocp4652.cp.xxx.com Ready master 2d2h v1.19.16+3d19195
master2.jhwocp4652.cp.xxx.com Ready master 2d2h v1.19.16+3d19195
worker0.jhwocp4652.cp.xxx.com Ready worker 2d2h v1.19.16+3d19195
worker1.jhwocp4652.cp.xxx.com Ready worker 2d2h v1.19.16+3d19195
worker2.jhwocp4652.cp.xxx.com Ready worker 2d2h v1.19.16+3d19195

c) Make sure all the machine config pools are in a healthy status.
Run the following command.
oc get mcp

Example:
NAME  CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-dd5251204366d7a3c25261ce8bc5c9fb True False False 3 3 3 0 2d3h
worker rendered-worker-b2b008ac8cae4a98839cbfe309007fea True False False 3 3 3 0 2d3h
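
To capture this cluster state in one pass and keep a record for later comparison, you can bundle the checks into a small script. This is just a convenience sketch (the log file name and the awk filters are my own additions, not part of the official procedure):

# save the pre-upgrade cluster state to a timestamped file
PRECHECK=precheck-$(date +%Y%m%d-%H%M%S).log
{
  echo "== cluster operators =="; oc get co
  echo "== nodes =="; oc get nodes -o wide
  echo "== machine config pools =="; oc get mcp
} > "$PRECHECK"
# print only the items that are not healthy; no output means these checks passed
oc get co --no-headers | awk 'toupper($3) != "TRUE" || toupper($4) != "FALSE" || toupper($5) != "FALSE"'
oc get nodes --no-headers | awk '$2 != "Ready"'
oc get mcp --no-headers | awk 'toupper($3) != "TRUE" || toupper($4) != "FALSE" || toupper($5) != "FALSE"'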

2) Check the Cloud Pak for Data status
If you are upgrading from Cloud Pak for Data 3.5 to 4.0, you can run the following command.
cpd-cli status -n your-cpd-project
Make sure that Lite and all the services are in Ready status.

In addition, you can run the following command to check whether all the pods are healthy.

oc get po --no-headers --all-namespaces -o wide| grep -Ev '([[:digit:]])/\1.*R' | grep -v 'Completed'

Sometimes an unhealthy pod, such as a pod left behind by a failed job, doesn't mean the cluster itself is unhealthy. But we should identify why these pods are unhealthy. Once we have confirmed that there is no impact on the upgrade, you can ignore them.
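
To investigate such pods before deciding that they can be ignored, the usual first step is to check the pod's events and recent logs. The pod and namespace names below are placeholders:

# show the pod's events and container states
oc describe pod your-pod-name -n your-namespace
# show the logs of the current container, and of the previous one if it crashed
oc logs your-pod-name -n your-namespace
oc logs your-pod-name -n your-namespace --previous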

3) Check the image registry
Since Cloud Pak for Data 4.0, a private image registry is recommended, and it is required for air-gapped environments. The private image registry is important because it hosts all the images that your Cloud Pak for Data 4.X services need to be up and running. So I strongly recommend that you check the image registry status and get an overview of the images in it.

Run the following command to log in to your private image registry server.
podman login --username $PRIVATE_REGISTRY_USER --password $PRIVATE_REGISTRY_PASSWORD $PRIVATE_REGISTRY --tls-verify=false

If the login succeeds, your private image registry is up and running.

Run the following command to get an overview of the images in the registry.
curl -k -u ${PRIVATE_REGISTRY_USER}:${PRIVATE_REGISTRY_PASSWORD} https://${PRIVATE_REGISTRY}/v2/_catalog?n=6000 | jq .
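
If you want to drill down into a specific repository, the same Docker Registry v2 API also lists the tags it holds. REPOSITORY_NAME below is a placeholder; set it to one of the names returned by the catalog query above.

# list the tags available for a given repository in the private registry
curl -k -u ${PRIVATE_REGISTRY_USER}:${PRIVATE_REGISTRY_PASSWORD} https://${PRIVATE_REGISTRY}/v2/${REPOSITORY_NAME}/tags/list | jq .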

2. Implement the upgrade following the runbook prepared in the pre-upgrade phase
In section 8) of the article Pre-upgrade tasks for CPD upgrade, a validated and well-prepared upgrade runbook is recommended as one of the pre-upgrade tasks.
By following this runbook, you can implement the upgrade with less risk and effort. But the tips below may be helpful.
1) Temporarily disable the route to the Cloud Pak for Data cluster
This helps prevent end users from accessing the cluster during the upgrade.
First, back up your Cloud Pak for Data route with the following command. Note: change your-cpd-route and your-cpd-project accordingly.
oc get route your-cpd-route -o yaml -n your-cpd-project > your-cpd-route-backup.yaml
Then you can delete the route.
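For example (the route, project, and file names are the same placeholders as above), you can delete the route and recreate it from the backup once the upgrade is done:

# remove the route so that end users cannot reach the web client during the upgrade
oc delete route your-cpd-route -n your-cpd-project
# after the upgrade, restore the route from the backup taken above
# (if apply complains, strip the status, uid, resourceVersion, and creationTimestamp fields from the backup file first)
oc apply -f your-cpd-route-backup.yaml -n your-cpd-project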

2) For an upgrade from 3.0.1 to 3.5, applying the latest patch for Lite and upgrading SPSS (if installed) to 3.0.2 are required.

3) Uninstall the services that are deprecated in the target upgrade version
Please refer to section 3) of the article Pre-upgrade tasks for CPD upgrade.
4) Stopping the environment runtimes and cron jobs is recommended.

As mentioned in section 5) Evaluate and decide the time window of the article Pre-upgrade tasks for CPD upgrade, end users are advised to stop their own environment runtimes and scheduled jobs before the upgrade. But we'd better take the following actions to make sure that the environment runtimes and scheduled jobs are actually stopped or suspended.

First, list the active environment runtimes.

for mydeploy in $(oc get deploy -l created-by=spawner -n your-cpd-project --no-headers | awk '{print $1}') ; do echo $mydeploy; done

Stop the active environment runtimes

for mydeploy in $(oc get deploy -l created-by=spawner -n your-cpd-project --no-headers | awk '{print $1}') ; do oc delete deployment $mydeploy -n your-cpd-project; done

Suspend the cron jobs before the upgrade.
oc get cronjobs -n your-cpd-project | grep False | grep -v spark | cut -d' ' -f 1 | xargs oc patch cronjobs -n your-cpd-project -p '{"spec" : {"suspend" : true }}'
 
But note that you need to re-enable the cron jobs after the upgrade is done.
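
A sketch of the reverse operation is shown below. It assumes the same your-cpd-project namespace and, like the suspend command, excludes the spark cron jobs. Note that it resumes every suspended cron job, so if some were intentionally suspended before the upgrade, resume them selectively instead.

# resume the cron jobs that were suspended for the upgrade
oc get cronjobs -n your-cpd-project | grep True | grep -v spark | cut -d' ' -f 1 | xargs oc patch cronjobs -n your-cpd-project -p '{"spec" : {"suspend" : false }}'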

5) OpenShift upgrade
For the OpenShift upgrade, I assume your OpenShift cluster version is OCP 4.X. If your OpenShift cluster is still on 3.11, a migration rather than an upgrade is required, because OpenShift doesn't support an in-place upgrade from 3.11 to 4.X. I'll cover the migration in detail in a separate article later.
a) OpenShift doesn't support skipping minor versions during an upgrade; a sequential upgrade is required

For example, if your current OpenShift version is 4.5.X and you plan to upgrade to 4.8, your upgrade path would be 4.5.X -> 4.6.Y -> 4.7.Z -> 4.8.N.

b) When upgrading OpenShift, subscribe to the EUS channel to get Extended Update Support if it's available.
For example, standard support for OCP 4.6 has ended, but Extended Update Support for OCP 4.6 (4.6 EUS) is still available. If you want to stay on OCP 4.6 with support, you'll have to switch your OpenShift cluster's update channel to eus-4.6.
For more information, please refer to the OpenShift lifecycle policy.
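As an illustration of how the channel switch and the subsequent update can be driven from the command line (a sketch only; double-check the channel name and the target version against the official upgrade documentation for your cluster):

# point the cluster at the EUS channel
oc patch clusterversion version --type merge -p '{"spec":{"channel":"eus-4.6"}}'
# review the updates available in the new channel
oc adm upgrade
# start the upgrade to the latest version available in the channel
oc adm upgrade --to-latest=true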
c) Validate the OpenShift cluster status
Make sure the cluster operators are in a healthy status.
oc get co
Make sure all the nodes are in Ready status.
oc get nodes
Make sure all the machine config pools are in a healthy status.
oc get mcp

6) Storage upgrade
When upgrading from Cloud Pak for Data 3.5 to 4.0, if you are running an unsupported version of Red Hat® OpenShift® Container Storage or Portworx, you must upgrade your storage before you upgrade to IBM® Cloud Pak for Data Version 4.0. For information about supported versions of shared persistent storage, see Storage requirements.
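
To find out which storage version you are currently running, you can check the installed operator version. This is a sketch that assumes OpenShift Container Storage is installed in the openshift-storage namespace; for Portworx, the namespace and labels depend on your installation:

# show the installed OpenShift Container Storage operator version
oc get csv -n openshift-storage
# for Portworx, one common way is to read the image version from a Portworx pod (namespace and label may differ)
oc get pods -n kube-system -l name=portworx -o jsonpath='{.items[0].spec.containers[0].image}{"\n"}'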

7) Cloud Pak for Data platform and services upgrade
Different services may have different prerequisites and procedures for the upgrade from 3.5 to 4.0. For example, some services (e.g. Data Virtualization) require that you create the Db2U operator subscription manually, while others do not. And for some Watson services (e.g. Watson Discovery, Watson Assistant), you may have to enable the License Service of the IBM Cloud Pak foundational services.

There may be service instances provisioned for some particular services, such as Spark, Data Virtualization, or Db2 Warehouse instances. For these kinds of services, apart from upgrading the services themselves, you also need to upgrade the service instances accordingly. A way to list them is sketched below.
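
To see which service instances exist before and after the upgrade, the cpd-cli for Cloud Pak for Data 4.0 provides a service-instance command. Treat the following as a sketch, since the exact flags and the profile name depend on your cpd-cli version and configuration:

# list the provisioned service instances on the platform
cpd-cli service-instance list --profile=your-cpd-profile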

Record the commands you run and the corresponding results during the upgrade. After each service upgrade is done, make sure the service is in a healthy status before you proceed to the upgrade of the next service.
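
For Cloud Pak for Data 4.0, one way to confirm that the platform is healthy after an upgrade step is to check the status reported by the ZenService custom resource (lite-cr is the default name; adjust the name and project if yours differ). Individual services expose a similar status field on their own custom resources that you can check in the same way.

# check the platform (control plane) status; it should report Completed
oc get ZenService lite-cr -n your-cpd-project -o jsonpath='{.status.zenStatus}{"\n"}'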

8) Troubleshooting
If a Cloud Pak for Data service upgrade fails, be cautious about rolling back during the troubleshooting. Rollback during the upgrade from 3.5 to 4.0.X is not supported. Even for the upgrade from 3.0.1 to 3.5, rollback is not supported for some services, e.g. WML.

Reach out to IBM Support with a support ticket for help and assistance if needed.


Restoring from a backup should be the last resort.

Summary
In this article, I introduced some tips for the upgrade implementation. Most of them come from lessons learnt and experience we have accumulated. I hope it's helpful! In my next article, I'll introduce some tips about the post-upgrade phase.



#CloudPakforDataGroup