Tips for implementing your Cloud Pak for Data upgrade
After you complete the tasks described in Pre-upgrade tasks for CPD upgrade, the next step is to implement your Cloud Pak for Data upgrade. In this article, I'd like to share some tips about the upgrade implementation.
1. Pre-check before the upgrade
Capture the cluster state and make sure the cluster is healthy before the upgrade.
This step is critical to a successful upgrade. Make sure the following conditions are met before you begin.
1) Check the OpenShift cluster status
a) Make sure the cluster operators are healthy
Run the following command.
oc get co
All the cluster operators should report AVAILABLE as True, and PROGRESSING and DEGRADED as False.
Example:
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.52    True        False         False      27m
cloud-credential                           4.6.52    True        False         False      2d2h
cluster-autoscaler                         4.6.52    True        False         False      2d2h
config-operator                            4.6.52    True        False         False      2d2h
console                                    4.6.52    True        False         False      12h
csi-snapshot-controller                    4.6.52    True        False         False      57m
dns                                        4.6.52    True        False         False      14h
etcd                                       4.6.52    True        False         False      2d2h
image-registry                             4.6.52    True        False         False      20h
ingress                                    4.6.52    True        False         False      2d2h
insights                                   4.6.52    True        False         False      2d2h
kube-apiserver                             4.6.52    True        False         False      2d2h
kube-controller-manager                    4.6.52    True        False         False      2d2h
kube-scheduler                             4.6.52    True        False         False      2d2h
kube-storage-version-migrator              4.6.52    True        False         False      172m
machine-api                                4.6.52    True        False         False      2d2h
machine-approver                           4.6.52    True        False         False      2d2h
machine-config                             4.6.52    True        False         False      148m
marketplace                                4.6.52    True        False         False      159m
monitoring                                 4.6.52    True        False         False      154m
network                                    4.6.52    True        False         False      2d2h
node-tuning                                4.6.52    True        False         False      2d2h
openshift-apiserver                        4.6.52    True        False         False      27m
openshift-controller-manager               4.6.52    True        False         False      2d2h
openshift-samples                          4.6.52    True        False         False      2d2h
operator-lifecycle-manager                 4.6.52    True        False         False      2d2h
operator-lifecycle-manager-catalog         4.6.52    True        False         False      2d2h
operator-lifecycle-manager-packageserver   4.6.52    True        False         False      158m
service-ca                                 4.6.52    True        False         False      2d2h
storage                                    4.6.52    True        False         False      2d2h
b) Make sure all the nodes are in Ready status
Run the following command.
oc get nodes
All the nodes should be in Ready status.
Example:
NAME                            STATUS   ROLES    AGE    VERSION
master0.jhwocp4652.cp.xxx.com   Ready    master   2d2h   v1.19.16+3d19195
master1.jhwocp4652.cp.xxx.com   Ready    master   2d2h   v1.19.16+3d19195
master2.jhwocp4652.cp.xxx.com   Ready    master   2d2h   v1.19.16+3d19195
worker0.jhwocp4652.cp.xxx.com   Ready    worker   2d2h   v1.19.16+3d19195
worker1.jhwocp4652.cp.xxx.com   Ready    worker   2d2h   v1.19.16+3d19195
worker2.jhwocp4652.cp.xxx.com   Ready    worker   2d2h   v1.19.16+3d19195
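The same pattern works for the node check. A minimal sketch (the function name is my own) that flags any node whose STATUS column is not exactly Ready, which also catches cordoned nodes such as Ready,SchedulingDisabled:

```shell
# Print any node that is not in plain Ready status.
# Expects the output of: oc get nodes --no-headers
check_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage: oc get nodes --no-headers | check_nodes
```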
c) Make sure all the machine config pools are healthy. All the pools should report UPDATED as True, UPDATING and DEGRADED as False, and a DEGRADEDMACHINECOUNT of 0.
Run the following command.
oc get mcp
Example:
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dd5251204366d7a3c25261ce8bc5c9fb   True      False      False      3              3                   3                     0                      2d3h
worker   rendered-worker-b2b008ac8cae4a98839cbfe309007fea   True      False      False      3              3                   3                     0                      2d3h
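And the machine config pool check can be scripted the same way. A sketch (the function name is my own) that flags any pool that is not fully updated, is degraded, or has degraded machines:

```shell
# Print any machine config pool that is not fully updated and healthy.
# Expects the output of: oc get mcp --no-headers
# Columns: NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT
#          READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
check_mcp() {
  awk '$3 != "True" || $5 != "False" || $9 != "0" { print $1 }'
}

# Usage: oc get mcp --no-headers | check_mcp
```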
2) Check the Cloud Pak for Data status
If you are upgrading from Cloud Pak for Data 3.5 to 4.0, you can run the following command.
cpd-cli status -n your-cpd-project
Make sure that Lite and all the installed services are in Ready status.
In addition, you can run the following command to find any pods that are not healthy.
oc get po --no-headers --all-namespaces -o wide| grep -Ev '([[:digit:]])/\1.*R' | grep -v 'Completed'
Sometimes an unhealthy pod does not mean the cluster itself is unhealthy; for example, a failed job can leave a pod behind in Error status. But you should identify why these pods are unhealthy, and only ignore them once you have confirmed that they have no impact on the upgrade.
3) Check Image registry
As of Cloud Pak for Data 4.0, a private image registry is recommended, and it is required in air-gapped environments. The private image registry is important because it hosts all the images that your Cloud Pak for Data 4.x services need in order to run. So I strongly recommend that you check the image registry status and get an overview of the images in it.
Run the following command for logging into your private image registry server.
podman login --username $PRIVATE_REGISTRY_USER --password $PRIVATE_REGISTRY_PASSWORD $PRIVATE_REGISTRY --tls-verify=false
If the login succeeds, your private image registry is up and running.
Run the following command to list the image repositories in the registry.
curl -k -u ${PRIVATE_REGISTRY_USER}:${PRIVATE_REGISTRY_PASSWORD} https://${PRIVATE_REGISTRY}/v2/_catalog?n=6000 | jq .
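If you want to confirm that a specific repository made it into the registry, you can filter the catalog output. The helper below is a simple grep-based sketch; the function name and the repository name in the usage example are placeholders of my own.

```shell
# Check whether a given repository name appears in the /v2/_catalog JSON.
# Pipe in the catalog output; pass the repository name as the argument.
repo_present() {
  if grep -q "\"$1\""; then echo "present"; else echo "missing"; fi
}

# Usage (placeholder repository name):
# curl -k -s -u ${PRIVATE_REGISTRY_USER}:${PRIVATE_REGISTRY_PASSWORD} \
#   https://${PRIVATE_REGISTRY}/v2/_catalog?n=6000 | repo_present cpd/some-service-image
```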
2. Implement the upgrade following the runbook prepared in the pre-upgrade phase
In section 8) of the article Pre-upgrade tasks for CPD upgrade, a validated and well-prepared upgrade runbook is recommended as one of the pre-upgrade tasks. By following this runbook, you can implement the upgrade with less risk and effort. The following tips may also be helpful.
1) Temporarily disable the route to the Cloud Pak for Data cluster
This helps prevent end users from accessing the cluster during the upgrade.
First, back up your Cloud Pak for Data route with the following command. Note: replace your-cpd-route and your-cpd-project with your own values.
oc get route your-cpd-route -o yaml -n your-cpd-project > your-cpd-route-backup.yaml
Then you can delete the route. After the upgrade completes, you can restore it from the backup file with oc apply -f your-cpd-route-backup.yaml -n your-cpd-project.
2) For the upgrade from 3.0.1 to 3.5, you are required to apply the latest patch for Lite and to upgrade SPSS (if installed) to 3.0.2.
3) Uninstall the services that are deprecated in the target upgrade version
Please refer to section 3) of the article Pre-upgrade tasks for CPD upgrade.
4) Stopping the environment runtimes and cron jobs is recommended
As mentioned in section 5) Evaluate and decide the time window of the article Pre-upgrade tasks for CPD upgrade, end users are advised to stop their own environment runtimes and scheduled jobs before the upgrade. Even so, it is best to take the following actions to make sure that environment runtimes and scheduled jobs are actually stopped or suspended.
First, list the active environment runtimes.
for mydeploy in $(oc get deploy -l created-by=spawner --no-headers| awk '{print $1}') ; do echo $mydeploy; done
Stop the active environment runtimes
for mydeploy in $(oc get deploy -l created-by=spawner --no-headers| awk '{print $1}') ; do oc delete deployment $mydeploy; done
Suspend cron jobs before the upgrade.
oc get cronjobs -n your-cpd-project | grep False| grep -v spark | cut -d' ' -f 1 | xargs oc patch cronjobs -p '{"spec" : {"suspend" : true }}'
Note that you must re-enable the cron jobs after the upgrade is done.
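To make the re-enabling step less error-prone, you can mirror the suspend filter when resuming. The helper below is a sketch (the function name is my own): it reads the output of oc get cronjobs and prints the suspended, non-Spark jobs so they can be piped to oc patch. One caveat: it cannot distinguish jobs you suspended from jobs that were already suspended before the upgrade.

```shell
# Print suspended cron jobs (excluding spark jobs) from `oc get cronjobs` output.
# Columns: NAME SCHEDULE SUSPEND ACTIVE LAST-SCHEDULE AGE
list_suspended_cronjobs() {
  awk 'NR > 1 && $3 == "True" && $1 !~ /spark/ { print $1 }'
}

# Usage: oc get cronjobs -n your-cpd-project | list_suspended_cronjobs \
#   | xargs oc patch cronjobs -n your-cpd-project -p '{"spec" : {"suspend" : false }}'
```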
5) OpenShift upgrade
For the OpenShift upgrade, I assume your cluster is already on OCP 4.x. If your cluster is still on 3.11, a migration rather than an upgrade is required, because OpenShift does not support an in-place upgrade from 3.11 to 4.x. I'll cover the migration in detail in a separate article later.
a) OpenShift doesn't support skipping minor versions during an upgrade; you must upgrade sequentially, one minor version at a time.
For example, your current OpenShift version is 4.5.X and you plan to upgrade it to 4.8. Your upgrade path would be 4.5.X -> 4.6.Y -> 4.7.Z -> 4.8.N.
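The sketch below makes the sequencing concrete by printing the minor versions you would pass through; it is purely illustrative (the function name is my own) and does not run any upgrade.

```shell
# Print the sequential OCP upgrade path between two minor versions.
# Purely illustrative: it only prints the hops, it does not upgrade anything.
upgrade_path() {
  local cur=$1 target=$2 path=""
  for m in $(seq "$cur" "$target"); do
    path="$path 4.$m"
  done
  echo "${path# }"
}

# Example: upgrade_path 5 8  →  4.5 4.6 4.7 4.8
```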
b) When upgrading OpenShift, subscribe to the EUS channel to get Extended Update Support if it's available.
For example, OCP 4.6 has reached its end of support, but the Extended Update Support for OCP 4.6 (4.6 EUS) is still available. If you want to stay on OCP 4.6 with support, you'll have to switch your OpenShift cluster to the eus-4.6 channel.
For more information, refer to the OpenShift lifecycle policy.
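One way to switch channels from the command line is to patch the ClusterVersion resource (the web console offers the same choice). A sketch, assuming cluster-admin access and the eus-4.6 channel:

```shell
# Point the cluster at the EUS channel (example: eus-4.6),
# then list the updates that are now available on that channel.
oc patch clusterversion version --type merge -p '{"spec":{"channel":"eus-4.6"}}'
oc adm upgrade
```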
c) Validate the OpenShift cluster status
Make sure the cluster operators are healthy
oc get co
Make sure all the nodes are in Ready status
oc get nodes
Make sure all the machine config pools are healthy
oc get mcp
6) Storage upgrade
When upgrading from Cloud Pak for Data 3.5 to 4.0, if you are running an unsupported version of Red Hat® OpenShift® Container Storage or Portworx, you must upgrade your storage before you upgrade to IBM® Cloud Pak for Data Version 4.0. For information about supported versions of shared persistent storage, see Storage requirements.
7) Cloud Pak for Data platform and services upgrade
Different services may have different prerequisites and procedures for the upgrade from 3.5 to 4.0. For example, some services (e.g. Data Virtualization) require that you create the Db2U operator subscription manually, while others do not. And for some Watson services (e.g. Watson Discovery, Watson Assistant), you may have to enable the License Service of the IBM Cloud Pak foundational services.
There may also be service instances provisioned for particular services, such as Spark, Data Virtualization, or Db2 Warehouse instances. For these services, apart from upgrading the services themselves, you also need to upgrade the service instances accordingly.
Record the commands and the corresponding results during the upgrade. After each service upgrade completes, make sure that service is healthy before you proceed to the upgrade of the next one.
8) Troubleshooting
If a Cloud Pak for Data service upgrade fails, be cautious about rolling back during troubleshooting. Rollback during the upgrade from 3.5 to 4.0.x is not supported. Even for the upgrade from 3.0.1 to 3.5, rollback is not supported for some services, e.g. WML.
Reach out to IBM Support with a support ticket for help and assistance if needed.
Restoring from a backup should be the last resort.
Summary
In this article, I introduced some tips about the upgrade implementation. Most of them come from lessons learned and experience we have accumulated. I hope it's helpful! In my next article, I'll introduce some tips about the post-upgrade tasks.
#CloudPakforDataGroup