Clean up a failed Spark runtime

By Yun Dai posted 16 days ago

  

Problem

In Cloud Pak for Data, the Spark service creates a Spark cluster on demand to run a notebook or a Spark job. If a kernel fails to start from the notebook UI, you may find that jkg-deployment-xxxx pods are stuck in Pending or CrashLoopBackOff, which means the Spark kernel failed to start up. The failed deployment keeps holding cluster resources. You can use the Helm tool to clean up such deployments. To check for stuck pods, run the following command:

oc get pods | grep jkg-deployment

jkg-deployment-7ba8b2ff-8879-4018-8b5f-90f222eaad8c-65fbf8wnkqv   0/1     CrashLoopBackOff   11         1h
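
When many kernels are running, it can help to list only the pods in the two failure states described above. The following is a minimal sketch, assuming the default column layout of `oc get pods` (NAME, READY, STATUS, ...); the function name `stuck_jkg_pods` is hypothetical and not part of any IBM tooling.

```shell
# Print the names of jkg-deployment pods whose STATUS column is
# Pending or CrashLoopBackOff. Reads `oc get pods` output on stdin.
# Assumption: STATUS is the third whitespace-separated column.
stuck_jkg_pods() {
  awk '$1 ~ /^jkg-deployment-/ && ($3 == "Pending" || $3 == "CrashLoopBackOff") { print $1 }'
}

# Usage (requires a logged-in oc session in the right project):
#   oc get pods --no-headers | stuck_jkg_pods
```

Healthy (Running) kernels are left out of the result, so only candidates for cleanup are listed.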

Resolving the Problem

Prerequisites

Install the Helm tool on the Linux bastion node. You can use the following steps:

# 1. Download the Helm installation script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3

# 2. Make the script executable:
chmod +x get_helm.sh

# 3. Run the installation script:
./get_helm.sh

# This script will download and install Helm version 3 on your Linux system.
# 4. Verify the installation by checking the Helm version:
helm version

Use the Helm tool to clean up the Spark runtime release

Find the release ID that you need to clean up. It is the value of the release label on the pod:

oc get pods --show-labels | grep "jkg-deployment" | grep release

Here is a sample of the output:
jkg-deployment-5abf9792-5a3b-4c99-a26c-e5b982704074-964846fxq98 1/1 Running 0 56m app=kernel-start-deployment,chart=create-kernel-v3-icp4d-1.1.1,heritage=Helm,icpdsupport/addOnId=spark,icpdsupport/app=api,icpdsupport/cloudpakInstanceId=e4bc0741-143b-45bf-a782-d88380834ee8,icpdsupport/createdBy=1000330999,icpdsupport/environmentType=python310,icpdsupport/jobRunId=5abf9792-5a3b-4c99-a26c-e5b982704074,icpdsupport/projectId=2307fdf2-246c-4ff7-a0a6-2ae66ad0dcc0,icpdsupport/runtimeEnvId=spark33py310-2307fdf2-246c-4ff7-a0a6-2ae66ad0dcc0,isDynamic=true,kernel_id=5abf9792-5a3b-4c99-a26c-e5b982704074,name=jkg-selector-5abf9792-5a3b-4c99-a26c-e5b982704074,pod-template-hash=96484648b,release=597ef0d1-1d85-40db-a55e-7b3f947fb1cf,spark/exclude-from-backup=true,unique_id=5abf9792-5a3b-4c99-a26c-e5b982704074,velero.io/exclude-from-backup=true
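
The label list is long, so picking out the release value by eye is error-prone. As a rough sketch, the hypothetical helper below (`release_from_labels` is an illustrative name, not an existing tool) extracts the value of the `release` label from a comma-separated label string like the one shown above.

```shell
# Extract the value of the "release" label from a comma-separated
# label string such as the LABELS column of `oc get pods --show-labels`.
release_from_labels() {
  # Put each label on its own line, then print what follows "release=".
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^release=//p'
}

# Usage with the labels taken from the sample output (shortened here):
#   release_from_labels "pod-template-hash=96484648b,release=597ef0d1-1d85-40db-a55e-7b3f947fb1cf,unique_id=5abf9792-5a3b-4c99-a26c-e5b982704074"
```

Alternatively, `oc get pod <pod-name> -o jsonpath='{.metadata.labels.release}'` should return the same value for a single pod, without any text parsing.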


Run helm list and confirm that the release ID appears in the list:

helm list
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/auth/kubeconfig
NAME                                	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART                       	APP VERSION
0ffc5baa-c97c-4134-802c-b324a3516037	         	1       	2024-04-12 07:03:44.259173312 +0000 UTC	deployed	create-kernel-v3-icp4d-1.1.1	
597ef0d1-1d85-40db-a55e-7b3f947fb1cf	         	1       	2024-05-22 14:06:39.395715714 +0000 UTC	deployed	create-kernel-v3-icp4d-1.1.1	
b7c5addc-575e-420f-a926-427a2685164e	         	1       	2024-04-12 07:04:45.052835848 +0000 UTC	deployed	create-kernel-v3-icp4d-1.1.1	

Delete the release to clean up the resources that the deployment is using (in Helm 3, helm delete is an alias of helm uninstall):

helm delete 597ef0d1-1d85-40db-a55e-7b3f947fb1cf


Run helm list again to confirm that the release has been removed:

helm list
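
The check-then-delete sequence above could be wrapped in a small function so that a release is deleted only when Helm actually reports it. This is a sketch under stated assumptions: the function name `cleanup_release` and the `DRY_RUN` variable are hypothetical, and the release is assumed to live in the namespace of the current context, as in the helm list output shown earlier.

```shell
# Delete a Helm release only if `helm list` reports it.
# Set DRY_RUN=1 to print the command instead of running it.
cleanup_release() {
  release="$1"
  # `helm list -q` prints release names only, one per line.
  if ! helm list -q | grep -Fxq "$release"; then
    echo "release $release not found in helm list; nothing to do" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: helm delete $release"
  else
    helm delete "$release"
  fi
}

# Usage:
#   DRY_RUN=1 cleanup_release 597ef0d1-1d85-40db-a55e-7b3f947fb1cf
#   cleanup_release 597ef0d1-1d85-40db-a55e-7b3f947fb1cf
```

Running with DRY_RUN=1 first is a cheap way to confirm that the ID matches an existing release before anything is removed.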

Finally, you can use the following command to review all resources that belong to a Spark runtime release:

helm status 45453b98-8778-4d75-bb2c-3656a8369307 --show-resources
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/auth/kubeconfig
NAME: 45453b98-8778-4d75-bb2c-3656a8369307
LAST DEPLOYED: Thu May 23 05:41:17 2024
NAMESPACE:
STATUS: deployed
REVISION: 1
RESOURCES:
==> v1/Deployment
NAME                                                           READY   UP-TO-DATE   AVAILABLE   AGE
jkg-deployment-21068868-7fa3-4774-aaa8-e3189da7b135            1/1     1            1           56s
spark-worker-deployment-21068868-7fa3-4774-aaa8-e3189da7b135   2/2     2            2           56s

==> v1/Pod(related)
NAME                                                              READY   STATUS    RESTARTS   AGE
jkg-deployment-21068868-7fa3-4774-aaa8-e3189da7b135-7c9fbdrnn7j   1/1     Running   0          56s
spark-worker-deployment-21068868-7fa3-4774-aaa8-e3189da7b17z7h2   1/1     Running   0          56s
spark-worker-deployment-21068868-7fa3-4774-aaa8-e3189da7b1zbr92   1/1     Running   0          56s

==> v1/ConfigMap
NAME                                                           DATA   AGE
spark-conf-21068868-7fa3-4774-aaa8-e3189da7b135                3      56s
spark-hb-kernel-wrapper-21068868-7fa3-4774-aaa8-e3189da7b135   1      56s

==> v1/Service
NAME                                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
jkg-headless-21068868-7fa3-4774-aaa8-e3189da7b135            ClusterIP   None            <none>        8888/TCP,4040/TCP   56s
spark-master-headless-21068868-7fa3-4774-aaa8-e3189da7b135   ClusterIP   None            <none>        8080/TCP,7077/TCP   56s
sparkui-21068868-7fa3-4774-aaa8-e3189da7b135                 ClusterIP   172.30.116.92   <none>        4040/TCP            56s

Additional References

https://helm.sh/docs/intro/install/

https://www.redhat.com/en/technologies/cloud-computing/openshift/helm
