Hi all
I have the same issue. The environment was OK for about a week after installation, but now the zen-watchdog pod is constantly crashing and Platform Management is not available.
zen-watchdog log:
oc logs -f zen-watchdog-74875c4645-w2m9n
time="2021-02-15T10:34:47Z" level=info msg=Started CORE_API_TRACE= event=initLog
2021/02/15 10:34:47 [INFO] Cockroachdb database is selected the connection string is postgresql://zen_user@zen-metastoredb-public:26257/zen?ssl=true&sslmode=require&sslrootcert=/tmp/metastore/ca.crt&sslkey=/tmp/metastore/client.zen_user.key&sslcert=/tmp/metastore/client.zen_user.crt&application_name=zen-watchdog-74875c4645-w2m9n
2021/02/15 10:34:47 [INFO] DB URL: postgresql://zen_user@zen-metastoredb-public:26257/zen?ssl=true&sslmode=require&sslrootcert=/tmp/metastore/ca.crt&sslkey=/tmp/metastore/client.zen_user.key&sslcert=/tmp/metastore/client.zen_user.crt&application_name=zen-watchdog-74875c4645-w2m9n
2021/02/15 10:34:47 [INFO] Cockroachdb database is selected the connection string is postgresql://zen_user@zen-metastoredb-public:26257/zen?ssl=true&sslmode=require&sslrootcert=/tmp/metastore/ca.crt&sslkey=/tmp/metastore/client.zen_user.key&sslcert=/tmp/metastore/client.zen_user.crt&application_name=zen-watchdog-74875c4645-w2m9n
2021/02/15 10:34:47 [INFO] Cronjob already exists
2021/02/15 10:34:47 [INFO] Metrics collection started at 2021-02-15 10:34:47.795172814 +0000 UTC m=+0.666982260
2021/02/15 10:34:48 [INFO] Metrics collection finished at 2021-02-15 10:34:48.280921075 +0000 UTC m=+1.152730514
2021/02/15 10:34:48 [INFO] Inside init scheduler...
2021/02/15 10:34:48 [INFO] Get resource plan for parent -- 404
2021/02/15 10:34:48 [INFO] Issue with scheduler - 404 page not found
2021/02/15 10:34:48 [INFO] Scheduler status500
2021/02/15 10:34:48 [ERROR] Issue with scheduler
time="2021-02-15T10:34:48Z" level=info msg="Pods found for ccs"
time="2021-02-15T10:34:48Z" level=info msg="Pods found for cognos-analytics-app"
time="2021-02-15T10:34:48Z" level=info msg="Pods found for dfd"
time="2021-02-15T10:34:49Z" level=info msg="Pods not found for jupyter-py37"
time="2021-02-15T10:34:49Z" level=info msg="Pods found for rshaper"
time="2021-02-15T10:34:50Z" level=info msg="Pods not found for volumes"
time="2021-02-15T10:34:50Z" level=info msg="Pods not found for wkc-full"
time="2021-02-15T10:34:51Z" level=info msg="Pods found for wkc"
time="2021-02-15T10:34:52Z" level=info msg="Pods found for ws"
time="2021-02-15T10:34:52Z" level=info msg="Pods found for zen-lite"
2021/02/15 10:34:55 [INFO] Checking purge for check-replica-status
2021/02/15 10:34:55 [INFO] Expected next purge at - 12 Feb 21 00:00 +0000
2021/02/15 10:34:55 [INFO] Cutoffseconds 1613126095
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
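The only non-INFO entry in the log above is the scheduler error, and the lines just before it show the 404 that leads up to it. A minimal sketch for pulling such entries out of a longer log, assuming the two line formats seen above (`[ERROR]` tags and `level=...` fields); the function name `watchdog_errors` is hypothetical:

```shell
#!/bin/sh
# Print error-level log lines together with the 3 lines that precede each one.
# Assumes the formats seen in this thread: "[ERROR]" tags and level=error fields.
# Note: -B (context before a match) is supported by GNU and BSD grep, not strict POSIX.
watchdog_errors() {
  grep -E -B 3 '\[ERROR\]|level=(error|fatal)'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc logs zen-watchdog-XXXX | watchdog_errors
```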
oc get po |grep zen-watchdog
zen-watchdog-74875c4645-w2m9n 0/1 Running 3 7m10s
zen-watchdog-cronjob-1613376600-d9dwl 0/1 Completed 0 146m
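For anyone comparing against their own cluster, the READY and RESTARTS columns in a listing like the one above can be filtered with a short script. This is only a sketch, assuming the default `oc get po` column layout (NAME READY STATUS RESTARTS AGE); the function name `not_ready_pods` is hypothetical:

```shell
#!/bin/sh
# Print pods whose ready-container count is below the desired count,
# skipping pods of finished jobs (STATUS "Completed").
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '$2 ~ /^[0-9]+\/[0-9]+$/ && $3 != "Completed" {
         split($2, r, "/")
         if (r[1] + 0 < r[2] + 0) print $1, $2, $3, "restarts=" $4
       }'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get po --no-headers | not_ready_pods
# For a pod that keeps restarting, the log of the previous container
# instance is often more telling than the current one:
#   oc logs --previous <pod-name>
```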
Any ideas?
------------------------------
Andrey Kirilov
------------------------------
Original Message:
Sent: Fri January 15, 2021 11:41 AM
From: Valerie Le Roy
Subject: Unable to administer CPD 3.5
Hi Tomasz,
I am facing a similar issue.
I installed CP4D 3.5 a few weeks ago and everything was OK.
A few days ago I noticed that I got an error when accessing Manage the platform. The zen-watchdog pod was running, and I was told to delete it.
So I deleted it, but since then the pod cannot start successfully. I could not find any interesting message in the logs or in the events.
Would you have any idea?
Many thanks in advance for your support!
------------------------------
Valerie Le Roy
Original Message:
Sent: Thu January 14, 2021 04:47 AM
From: TOMASZ HANUSIAK
Subject: Unable to administer CPD 3.5
Hi,
The overcommitment is fine (to a certain degree), and that's not the cause of the issue.
I doubt the resources were causing this; I'd be more inclined towards storage issues (an intermittent drop in connectivity, for example).
If you face this again, please raise a support ticket, and we should be able to pinpoint the root cause.
Thanks
------------------------------
TOMASZ HANUSIAK
Original Message:
Sent: Fri January 08, 2021 01:24 PM
From: Phil Fox
Subject: Unable to administer CPD 3.5
Thanks Tomasz, I appreciate the time you've taken to look into this. I am now able to pull up the Platform Management page. Could this be caused simply by a lack of resources? The cluster is a bare-minimum configuration with 3 masters and 3 workers, running the wsl and wml assemblies. I was hoping the Platform Management page would give me some pointers as to how 'full' the environment is, but nothing jumps out. Most of the 633 issues seem to be related to failed cron jobs (diagnostics, watchdog-alert-monitoring, zen-watchdog).
------------------------------
Phil Fox
Original Message:
Sent: Thu January 07, 2021 07:08 AM
From: TOMASZ HANUSIAK
Subject: Unable to administer CPD 3.5
Hi,
I know it's not much of an answer, but I see the pod and jobs working fine now.
Can you check the UI?
If you still face these problems, please open a ticket via IBM Support.
Thanks
------------------------------
TOMASZ HANUSIAK
Original Message:
Sent: Tue January 05, 2021 01:23 PM
From: Phil Fox
Subject: Unable to administer CPD 3.5
Hi,
oc describe pod zen-watchdog-778fb6bbb7-shqjm > oc_describe.txt
oc get events --sort-by='{.lastTimestamp}' > oc_events.txt
Thanks,
------------------------------
Phil Fox
Original Message:
Sent: Tue January 05, 2021 05:52 AM
From: TOMASZ HANUSIAK
Subject: Unable to administer CPD 3.5
Hi,
Please note that the pod has restarted a lot over the last few weeks:
(x142209 over 21d)
Some events/details may have been removed.
Can you try to delete that pod (oc delete pod ....)? It will come back with a slightly different name.
Please collect the describe output of the new pod plus `oc get events --sort-by='{.lastTimestamp}'` (or `oc get events --sort-by=.metadata.creationTimestamp`).
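The collection step above can be sketched as a small script. This is a minimal sketch, not an official procedure: the function name `collect_pod_diagnostics`, the `OC` override variable, and the output file names (borrowed from elsewhere in this thread) are assumptions.

```shell
#!/bin/sh
# OC can be overridden (for example with a wrapper script); defaults to the oc CLI.
OC=${OC:-oc}

# Save the describe output of one pod and the namespace events,
# sorted by last timestamp, into two text files.
# $1 = pod name, $2 = output directory (defaults to the current directory).
collect_pod_diagnostics() {
  pod=$1
  outdir=${2:-.}
  "$OC" describe pod "$pod" > "$outdir/oc_describe.txt"
  "$OC" get events --sort-by='{.lastTimestamp}' > "$outdir/oc_events.txt"
}

# Usage (requires a logged-in oc session in the CPD namespace):
#   mkdir -p ./diag && collect_pod_diagnostics <new-pod-name> ./diag
```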
Thanks
------------------------------
TOMASZ HANUSIAK
Original Message:
Sent: Mon January 04, 2021 01:05 PM
From: Phil Fox
Subject: Unable to administer CPD 3.5
Hi Tomasz, Happy New Year
As requested:
oc describe po zen-watchdog-778fb6bbb7-5gqxz
------------------------------
Phil Fox
Original Message:
Sent: Fri December 18, 2020 04:39 AM
From: TOMASZ HANUSIAK
Subject: Unable to administer CPD 3.5
Hi,
Yes, in fact a lot of pods/collectors are not working.
What's causing the 500 is:
zen-watchdog-778fb6bbb7-5gqxz 0/1 CreateContainerError 1 21d
Can you please describe that pod?
oc describe po zen-watchdog-778fb6bbb7-5gqxz
Thanks
------------------------------
TOMASZ HANUSIAK
Original Message:
Sent: Thu December 17, 2020 11:39 AM
From: Phil Fox
Subject: Unable to administer CPD 3.5
Hi Tomasz, thanks for taking a look. Seems like many of the pods are actually in an error state.
------------------------------
Phil Fox
Original Message:
Sent: Thu December 17, 2020 04:44 AM
From: TOMASZ HANUSIAK
Subject: Unable to administer CPD 3.5
Hi Phil.
Could you please try to find a pod with a name like:
zen-watchdog
and provide us with the logs?
oc logs zen-watchdog-XXXX > log.txt
Thanks
------------------------------
TOMASZ HANUSIAK
Original Message:
Sent: Wed December 16, 2020 12:57 PM
From: Phil Fox
Subject: Unable to administer CPD 3.5
Hi all,
On a new install of CPD 3.5 I'm getting this error when I choose 'Manage the platform'. I'm logged in as admin and have confirmed that the admin user has the 'Administer platform' permission enabled. I also tried creating a new user with the Administrator role but still get the same error.
Any ideas?
------------------------------
Phil Fox