In Cloud Pak for Data, the Spark service creates a Spark cluster on demand to run a notebook or a Spark job. When a kernel fails to start from the notebook UI, you may find jkg-deployment-xxxx pods stuck in Pending or CrashLoopBackOff when you list the pods, which means the Spark kernel failed to start. These failed deployments continue to consume cluster resources. You can use the Helm tool to clean them up.
Prerequisites
Install the Helm tool on the Linux bastion node using the following steps:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod +x get_helm.sh
./get_helm.sh
helm version
Use the Helm tool to clean up the Spark runtime release
Find the release ID that you need to clean up:
oc get pods --show-labels | grep "jkg-deployment" | grep "release="
Here is a sample of the output:
jkg-deployment-5abf9792-5a3b-4c99-a26c-e5b982704074-964846fxq98 1/1 Running 0 56m app=kernel-start-deployment,chart=create-kernel-v3-icp4d-1.1.1,heritage=Helm,icpdsupport/addOnId=spark,icpdsupport/app=api,icpdsupport/cloudpakInstanceId=e4bc0741-143b-45bf-a782-d88380834ee8,icpdsupport/createdBy=1000330999,icpdsupport/environmentType=python310,icpdsupport/jobRunId=5abf9792-5a3b-4c99-a26c-e5b982704074,icpdsupport/projectId=2307fdf2-246c-4ff7-a0a6-2ae66ad0dcc0,icpdsupport/runtimeEnvId=spark33py310-2307fdf2-246c-4ff7-a0a6-2ae66ad0dcc0,isDynamic=true,kernel_id=5abf9792-5a3b-4c99-a26c-e5b982704074,name=jkg-selector-5abf9792-5a3b-4c99-a26c-e5b982704074,pod-template-hash=96484648b,release=597ef0d1-1d85-40db-a55e-7b3f947fb1cf,spark/exclude-from-backup=true,unique_id=5abf9792-5a3b-4c99-a26c-e5b982704074,velero.io/exclude-from-backup=true
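The release ID is the value of the release label in the output above (597ef0d1-1d85-40db-a55e-7b3f947fb1cf in this sample). As a minimal sketch, the label can be extracted with sed instead of being read by eye. The helper name extract_release is hypothetical, and the sketch assumes the labels are comma-separated key=value pairs as in the sample.

```shell
# Hypothetical helper: print the value of the "release" label for each
# line of `oc get pods --show-labels` output read from stdin.
extract_release() {
  # Match "release=<value>" when it follows a space or a comma, and
  # print only the value; lines without a release label print nothing.
  sed -n 's/.*[ ,]release=\([^, ]*\).*/\1/p'
}

# Usage on the cluster (assumes you are logged in with oc and are in
# the project where the Spark service runs):
# oc get pods --show-labels | grep "jkg-deployment" | extract_release | sort -u
```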
Run helm list to confirm that the release ID appears in the list:
helm list
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/auth/kubeconfig
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
0ffc5baa-c97c-4134-802c-b324a3516037 1 2024-04-12 07:03:44.259173312 +0000 UTC deployed create-kernel-v3-icp4d-1.1.1
597ef0d1-1d85-40db-a55e-7b3f947fb1cf 1 2024-05-22 14:06:39.395715714 +0000 UTC deployed create-kernel-v3-icp4d-1.1.1
b7c5addc-575e-420f-a926-427a2685164e 1 2024-04-12 07:04:45.052835848 +0000 UTC deployed create-kernel-v3-icp4d-1.1.1
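When many releases are listed, checking for one ID by eye is error-prone. The following sketch tests for an exact match against the short listing produced by helm list --short, which prints release names only. The helper name release_listed is hypothetical.

```shell
# Hypothetical helper: succeed if the release ID given as $1 appears as a
# whole line on stdin (stdin is expected to be `helm list --short` output).
release_listed() {
  # -x: match the whole line, -F: treat the ID as a fixed string, -q: quiet
  grep -qxF -- "$1"
}

# Usage on the cluster:
# helm list --short | release_listed 597ef0d1-1d85-40db-a55e-7b3f947fb1cf && echo "release found"
```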
Delete the release to clean up the resources that the deployment is using:
helm delete 597ef0d1-1d85-40db-a55e-7b3f947fb1cf
Run helm list again to confirm that the release has been removed:
helm list
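If several kernels are stuck, the steps above can be combined. The following is a sketch only, not a supported procedure: it reads the oc get pods --show-labels output, keeps jkg-deployment pods whose STATUS column is Pending or CrashLoopBackOff, and prints the matching helm delete commands. The function names and the DRY_RUN variable are hypothetical, and the sketch assumes STATUS is the third column and the labels are the last column, as in the sample output.

```shell
# Hypothetical helper: read `oc get pods --show-labels` output on stdin and
# print the release IDs of jkg-deployment pods in Pending or CrashLoopBackOff.
stuck_releases() {
  awk '$1 ~ /^jkg-deployment/ && ($3 == "Pending" || $3 == "CrashLoopBackOff") {
         # The labels are the last field: comma-separated key=value pairs.
         n = split($NF, labels, ",")
         for (i = 1; i <= n; i++)
           if (labels[i] ~ /^release=/) print substr(labels[i], 9)
       }' | sort -u
}

# Hypothetical helper: read release IDs on stdin. By default it only prints
# the helm commands (dry run); set DRY_RUN=0 to actually delete the releases.
cleanup_stuck() {
  while read -r rel; do
    if [ "${DRY_RUN:-1}" = "0" ]; then
      helm delete "$rel"
    else
      echo "helm delete $rel"
    fi
  done
}

# Usage on the cluster (review the dry-run output before setting DRY_RUN=0):
# oc get pods --show-labels | stuck_releases | cleanup_stuck
```

Pods in the Running state are skipped, so active kernels such as the one in the sample output are left alone.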