Cloud Pak for Data


Unable to administer CPD 3.5

  • 1.  Unable to administer CPD 3.5

    Posted Wed December 16, 2020 01:24 PM
    Hi all,

    On a new install of CPD 3.5, I'm getting this error when I choose 'Manage the platform'. I'm logged in as admin and have confirmed that the admin user has the 'Administer platform' permission enabled. I also tried creating a new user with the Administrator role, but I still get the same error.

    Any ideas?



    ------------------------------
    Phil Fox
    ------------------------------

    #CloudPakforDataGroup


  • 2.  RE: Unable to administer CPD 3.5

    Posted Thu December 17, 2020 04:45 AM

    Hi Phil.

    Could you please try to find a pod with a name like:

    zen-watchdog

    and provide us the logs?

    oc logs zen-watchdog-XXXX > log.txt
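
For readers following along, a minimal sketch of how the pod name might be looked up and its logs saved. It assumes the current project is the one where Cloud Pak for Data is installed; the function names `watchdog_pod` and `save_watchdog_logs` are hypothetical helpers, not part of `oc`. Pods created by the zen-watchdog cron job also start with `zen-watchdog` (see the pod listing later in this thread), so they are filtered out.

```shell
# Print the name of the first zen-watchdog pod, skipping pods that were
# created by the zen-watchdog-cronjob cron job.
watchdog_pod() {
  oc get pods -o name | grep '^pod/zen-watchdog-' | grep -v -- '-cronjob-' | head -n 1
}

# Save the logs of that pod to log.txt (hypothetical helper).
save_watchdog_logs() {
  pod=$(watchdog_pod)
  [ -n "$pod" ] || { echo "no zen-watchdog pod found" >&2; return 1; }
  oc logs "$pod" > log.txt
}

# Usage, once logged in to the cluster:
#   save_watchdog_logs
```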

    Thanks



    ------------------------------
    TOMASZ HANUSIAK
    ------------------------------



  • 3.  RE: Unable to administer CPD 3.5

    Posted Thu December 17, 2020 11:39 AM

    Hi Tomasz, thanks for taking a look. Seems like many of the pods are actually in an error state.



    ------------------------------
    Phil Fox
    ------------------------------

    Attachment(s)

    txt
    zen-watchdog-log.txt   6.32 MB 1 version
    txt
    oc-get-pods.txt   27 KB 1 version


  • 4.  RE: Unable to administer CPD 3.5

    Posted Fri December 18, 2020 04:39 AM

    Hi,

    Yes, in fact a lot of pods/collectors are not working.

    What's causing the 500 is:

    zen-watchdog-778fb6bbb7-5gqxz                                 0/1     CreateContainerError   1          21d

    Can you please describe that pod?

    oc describe po zen-watchdog-778fb6bbb7-5gqxz 
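
When many pods are failing, it can help to list only the unhealthy ones before describing them. The filter below is a rough sketch: `unhealthy_pods` is a hypothetical helper that reads `oc get pods --no-headers` output on standard input and assumes the usual column order (NAME, READY, STATUS, ...). It treats `Completed` pods as healthy and flags anything that is not `Running` with all containers ready.

```shell
# Read `oc get pods --no-headers` output on stdin and print the
# NAME, READY and STATUS columns of pods that look unhealthy.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "0/1"
    if ($3 == "Completed") next      # finished job pods are fine
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage, once logged in to the cluster:
#   oc get pods --no-headers | unhealthy_pods
```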

    Thanks



    ------------------------------
    TOMASZ HANUSIAK
    ------------------------------



  • 5.  RE: Unable to administer CPD 3.5

    Posted Mon January 04, 2021 01:06 PM
    Hi Tomasz, Happy New Year

    As requested:
    oc describe po zen-watchdog-778fb6bbb7-5gqxz


    ------------------------------
    Phil Fox
    ------------------------------

    Attachment(s)

    txt
    oc-describe-zen.txt   7 KB 1 version


  • 6.  RE: Unable to administer CPD 3.5

    Posted Tue January 05, 2021 05:52 AM
    Hi,

    Please note that the pod has restarted a lot over the last few weeks:
    (x142209 over 21d)

    Some events/details may have been removed.
    Can you try deleting that pod (oc delete pod ....)? It will come back with a slightly different name.

    Please collect a describe of the new pod, plus `oc get events --sort-by='{.lastTimestamp}'` (or `oc get events --sort-by=.metadata.creationTimestamp`).

    Thanks

    ------------------------------
    TOMASZ HANUSIAK
    ------------------------------



  • 7.  RE: Unable to administer CPD 3.5

    Posted Tue January 05, 2021 01:23 PM
    Hi,  

    oc describe pod zen-watchdog-778fb6bbb7-shqjm > oc_describe.txt
    oc get events --sort-by='{.lastTimestamp}' > oc_events.txt


    Thanks,

    ------------------------------
    Phil Fox
    ------------------------------

    Attachment(s)

    txt
    oc_events.txt   535 KB 1 version
    txt
    oc_describe.txt   8 KB 1 version


  • 8.  RE: Unable to administer CPD 3.5

    Posted Thu January 07, 2021 07:09 AM

    Hi,

    I know it's not much of an answer, but I see the pod and jobs working fine now.
    Can you check the UI?

    If you still face these problems, please open a ticket via IBM Support.

    Thanks



    ------------------------------
    TOMASZ HANUSIAK
    ------------------------------



  • 9.  RE: Unable to administer CPD 3.5

    Posted Fri January 08, 2021 01:25 PM
    Thanks Tomasz, I appreciate the time you've taken to look into this. I am now able to pull up the Platform Management page. Could this be caused simply by a lack of resources? The cluster is a bare-minimum configuration with 3 masters and 3 workers, running the wsl and wml assemblies. I was hoping the Platform Management page would give me some pointers as to how 'full' the environment is, but nothing jumps out. Most of the 633 issues seem to be related to failed cron jobs (diagnostics, watchdog-alert-monitoring, zen-watchdog).



    On the OCP dashboard, the only metric that looks worrying is CPU Limits Commitment at 185.14%.



    ------------------------------
    Phil Fox
    ------------------------------



  • 10.  RE: Unable to administer CPD 3.5

    Posted Thu January 14, 2021 04:48 AM
    Hi,

    The overcommitment is fine (to a certain degree), and that's not the cause of the issue.

    I doubt resources were causing this; I'd be more inclined towards storage issues (an intermittent drop in connectivity, for example).

    If you face this again, please raise a support ticket, and we should be able to pinpoint the root cause.

    Thanks

    ------------------------------
    TOMASZ HANUSIAK
    ------------------------------



  • 11.  RE: Unable to administer CPD 3.5

    Posted Fri January 15, 2021 11:42 AM
    Hi Tomasz,

    I am facing a similar issue.
    I installed CP4D 3.5 a few weeks ago and everything was OK.
    A few days ago I noticed that I got an error when accessing Manage the platform. The zen-watchdog pod was running, and I was told to delete it.
    So I deleted it, but since then the pod cannot start successfully. I could not find any interesting messages in the logs or in the events.
    Would you have any idea?
    Many thanks in advance for your support!

    ------------------------------
    Valerie Le Roy
    ------------------------------



  • 12.  RE: Unable to administer CPD 3.5

    Posted Tue January 19, 2021 04:11 AM
    Hi,

    Let's review your system together and share our findings here.

    Thanks

    ------------------------------
    TOMASZ HANUSIAK
    ------------------------------



  • 13.  RE: Unable to administer CPD 3.5

    Posted Thu January 21, 2021 05:07 AM
    Hi,

    Tomasz investigated this issue on my system and found that the problem was related to the dsx-influxdb pod.

    So he deleted the dsx-influxdb pod (which restarted) and the zen-watchdog pod could restart successfully.

    A big thank you to Tomasz for his great support !

    ------------------------------
    Valerie Le Roy
    ------------------------------



  • 14.  RE: Unable to administer CPD 3.5

    Posted Mon February 15, 2021 05:40 AM
    Hi all
    I have the same issue. The environment was OK about a week after installation but now the zen-watchdog pod is constantly crashing and Platform Management is not available.

    zen-watchdog log:

    oc logs -f zen-watchdog-74875c4645-w2m9n
    time="2021-02-15T10:34:47Z" level=info msg=Started CORE_API_TRACE= event=initLog
    2021/02/15 10:34:47 [INFO] Cockroachdb database is selected the connection string is postgresql://zen_user@zen-metastoredb-public:26257/zen?ssl=true&sslmode=require&sslrootcert=/tmp/metastore/ca.crt&sslkey=/tmp/metastore/client.zen_user.key&sslcert=/tmp/metastore/client.zen_user.crt&application_name=zen-watchdog-74875c4645-w2m9n
    2021/02/15 10:34:47 [INFO] DB URL: postgresql://zen_user@zen-metastoredb-public:26257/zen?ssl=true&sslmode=require&sslrootcert=/tmp/metastore/ca.crt&sslkey=/tmp/metastore/client.zen_user.key&sslcert=/tmp/metastore/client.zen_user.crt&application_name=zen-watchdog-74875c4645-w2m9n
    2021/02/15 10:34:47 [INFO] Cockroachdb database is selected the connection string is postgresql://zen_user@zen-metastoredb-public:26257/zen?ssl=true&sslmode=require&sslrootcert=/tmp/metastore/ca.crt&sslkey=/tmp/metastore/client.zen_user.key&sslcert=/tmp/metastore/client.zen_user.crt&application_name=zen-watchdog-74875c4645-w2m9n
    2021/02/15 10:34:47 [INFO] Cronjob already exists
    2021/02/15 10:34:47 [INFO] Metrics collection started at 2021-02-15 10:34:47.795172814 +0000 UTC m=+0.666982260
    2021/02/15 10:34:48 [INFO] Metrics collection finished at 2021-02-15 10:34:48.280921075 +0000 UTC m=+1.152730514
    2021/02/15 10:34:48 [INFO] Inside init scheduler...
    2021/02/15 10:34:48 [INFO] Get resource plan for parent -- 404
    2021/02/15 10:34:48 [INFO] Issue with scheduler - 404 page not found
    2021/02/15 10:34:48 [INFO] Scheduler status500
    2021/02/15 10:34:48 [ERROR] Issue with scheduler
    time="2021-02-15T10:34:48Z" level=info msg="Pods found for ccs"
    time="2021-02-15T10:34:48Z" level=info msg="Pods found for cognos-analytics-app"
    time="2021-02-15T10:34:48Z" level=info msg="Pods found for dfd"
    time="2021-02-15T10:34:49Z" level=info msg="Pods not found for jupyter-py37"
    time="2021-02-15T10:34:49Z" level=info msg="Pods found for rshaper"
    time="2021-02-15T10:34:50Z" level=info msg="Pods not found for volumes"
    time="2021-02-15T10:34:50Z" level=info msg="Pods not found for wkc-full"
    time="2021-02-15T10:34:51Z" level=info msg="Pods found for wkc"
    time="2021-02-15T10:34:52Z" level=info msg="Pods found for ws"
    time="2021-02-15T10:34:52Z" level=info msg="Pods found for zen-lite"
    2021/02/15 10:34:55 [INFO] Checking purge for check-replica-status
    2021/02/15 10:34:55 [INFO] Expected next purge at - 12 Feb 21 00:00 +0000
    2021/02/15 10:34:55 [INFO] Cutoffseconds 1613126095
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116
    2021/02/15 10:34:55 [INFO] Deletion - Total number of events - 116

    oc get po |grep zen-watchdog
    zen-watchdog-74875c4645-w2m9n 0/1 Running 3 7m10s
    zen-watchdog-cronjob-1613376600-d9dwl 0/1 Completed 0 146m

    Any ideas?

    ------------------------------
    Andrey Kirilov
    ------------------------------



  • 15.  RE: Unable to administer CPD 3.5

    Posted Mon February 15, 2021 06:07 AM
    Hi Andrey,

    After investigation it seems that my problem was related to the dsx-influxdb pod.

    So we deleted the dsx-influxdb pod (which restarted) and the zen-watchdog pod could restart successfully.

    I hope this might help you !

    ------------------------------
    Valerie Le Roy
    ------------------------------



  • 16.  RE: Unable to administer CPD 3.5

    Posted Mon February 15, 2021 08:45 AM
    Hi Valerie,
    Thanks for the reply. Restarting the influxdb pod didn't help, but after I deleted the 3 zen-metastoredb pods one by one, the watchdog is up and running :)
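
A rough sketch of what deleting the pods "one by one" could look like when scripted. It assumes the metastore pods are recreated under the same name after deletion (as StatefulSet pods are) and that the current project is the Cloud Pak for Data one. Only `zen-metastoredb-0` is named in this thread; `zen-metastoredb-1` and `zen-metastoredb-2` are assumed names. `restart_one_by_one` is a hypothetical helper.

```shell
# Delete each named pod in turn and wait until its replacement reports
# Ready before moving on to the next one.
restart_one_by_one() {
  for pod in "$@"; do
    oc delete pod "$pod"
    # Poll the Ready condition; an empty result means the pod does not
    # exist yet or has not reported its conditions.
    until [ "$(oc get pod "$pod" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)" = "True" ]; do
      sleep 5
    done
    echo "$pod is Ready"
  done
}

# Usage (pod names other than zen-metastoredb-0 are assumptions):
#   restart_one_by_one zen-metastoredb-0 zen-metastoredb-1 zen-metastoredb-2
```

Waiting for each pod to become Ready before deleting the next keeps the other metastore replicas available during the restart.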

    ------------------------------
    Andrey Kirilov
    ------------------------------



  • 17.  RE: Unable to administer CPD 3.5

    Posted Wed February 17, 2021 02:17 PM

    Please follow the instructions below if you face this issue. It occasionally happens when the metastore is overloaded, and we have a patch being readied for 3.5.2 to address it.

    --- Clear metastore events ---
    
    oc exec -it zen-metastoredb-0 -- /bin/bash
    cp -r /certs/ /tmp/
    cd /tmp/ && chmod -R 0700 certs/
    cd /cockroach
    ./cockroach sql --certs-dir=/tmp/certs/ --host=zen-metastoredb-0.zen-metastoredb
    use zen;
    drop table policies;
    drop table products;
    drop table monitors;
    drop table monitor_events;
    drop table event_types;
    
    --- Delete cronjobs (to ensure all pending jobs, if any, are killed) and restart the watchdog pod (recreates all cronjobs and the deleted tables) ---
    
    oc delete cronjob watchdog-alert-monitoring-cronjob watchdog-alert-monitoring-purge-cronjob zen-watchdog-cronjob diagnostics-cronjob
    oc delete pod <zen-watchdog-xxxx>
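
Since the restarted watchdog pod is expected to recreate the cronjobs, one way to confirm this afterwards is to check that all four are present again. The sketch below is only illustrative: `missing_cronjobs` is a hypothetical helper that reads `oc get cronjob -o name` output on standard input and prints any of the four cronjob names from the delete command above that are not listed.

```shell
# Read `oc get cronjob -o name` output on stdin and print the expected
# cronjobs that are missing from it.
missing_cronjobs() {
  present=$(sed 's|.*/||')   # strip the "cronjob.batch/" style prefix
  for cj in watchdog-alert-monitoring-cronjob \
            watchdog-alert-monitoring-purge-cronjob \
            zen-watchdog-cronjob \
            diagnostics-cronjob; do
    printf '%s\n' "$present" | grep -qxF "$cj" || echo "$cj"
  done
}

# Usage, once the new zen-watchdog pod is running:
#   oc get cronjob -o name | missing_cronjobs
```

No output means all four cronjobs exist again.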

    Let me know if you still face any concerns. Thanks.

    ------------------------------
    Lalit Somavarapha
    ------------------------------


