AIOps

Model training debug

  • 1.  Model training debug

    Posted Fri August 27, 2021 07:05 AM
    Hi all.

    I just started with Watson AIOps 3.1.1 and would like to train some models based on log files coming from ELK. I was able to create the integration and it tests successfully. However, after creating a new model and starting the training, I get an error: "Start training failed: Could not find any data to train on". The error message is clear enough, but I can see data in Kibana for the defined training period.

    I would like to know the general steps to debug this kind of issue. Mainly:

    - Where are the relevant logs located?
    - How can I check why/if data is not being transferred over from my target towards the internal Elastic?
    - Any other hint?

    Thanks

    Danilo

    ------------------------------
    Danilo Luna
    ------------------------------


  • 2.  RE: Model training debug

    Posted Fri August 27, 2021 07:11 AM
    Hi. Thanks for the question. @Angus Jamieson @Fred Harald Klein, for your attention please.

    ------------------------------
    VEERAMANI NAMBI
    Offering Manager, GoToMarket - Communities
    ------------------------------



  • 3.  RE: Model training debug

    User Group Leader
    Posted Fri August 27, 2021 10:14 AM
    Edited by Angus Jamieson Fri August 27, 2021 10:15 AM
    Hi Danilo,

    Some ideas from my colleagues. As you are a Business Partner, if you send me your email I can share some internal material with you if this doesn't get things going.
    As far as I know, there are a few ways this could have happened:
    1. The data has not been transformed for training,
    2. there is not enough data or the data is not healthy, or
    3. the cluster is broken, which we would need to check.
    ** To verify that the data is present, use the steps below to track the data flow that is currently enabled:
    1. oc project <namespace>
    2. oc get pods | grep api-server
    3. oc exec -it <api-server-pod-name> -- bash
    4. curl -X GET -u $ES_USERNAME:$ES_PASSWORD $ES_URL/_cat/indices -k | sort (use this curl command to see the log indices).
    For an app without much logging, you may need to wait a number of days before you can start training (it takes that long to generate enough data for training).
    These are the common causes we have seen for this error. If all of them look fine, it may be something else that needs further debugging.
    Also a couple of other things:
    i) Make sure there is data available for the dates on which you are trying to train your model.
    ii) Please check that the Kafka integration is set correctly, as shown in the image below.
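
    The index check in step 4 can be narrowed to the training indices only. The following is a minimal sketch, assuming the same ES_URL, ES_USERNAME and ES_PASSWORD variables are set inside the api-server pod and that training indices end in "logtrain" (as in the output posted later in this thread). The function names are illustrative and not part of the product.

```shell
# Build the _cat/indices URL for a given index pattern.
# v       = include a header row naming each column
# s=index = sort the rows by index name
cat_indices_url() {
  base_url=$1
  pattern=$2
  printf '%s/_cat/indices/%s?v&s=index\n' "${base_url%/}" "$pattern"
}

# List only the log training indices.
# -k skips TLS certificate verification, as in step 4 above.
# Assumes ES_URL, ES_USERNAME and ES_PASSWORD are already exported in the pod.
list_logtrain_indices() {
  curl -s -k -u "$ES_USERNAME:$ES_PASSWORD" \
    "$(cat_indices_url "$ES_URL" '*logtrain*')"
}
```

    Running list_logtrain_indices inside the pod should print one row per matching index. An empty result would suggest that no log data has reached the internal Elasticsearch yet.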


    ------------------------------
    Angus Jamieson
    IT Service Management Solutions Architect
    IBM
    Edinburgh
    ------------------------------



  • 4.  RE: Model training debug

    Posted Fri August 27, 2021 11:09 AM
    Thanks for the fast answer. I tried the command mentioned and got the following:

    sh-4.4$ curl -X GET -u $ES_USERNAME:$ES_PASSWORD $ES_URL/_cat/indices -k |sort
    yellow open 1000-1000-20210823-logtrain tUfS8nWcQ769QOco9AVvAw 3 1 1302000 0   143mb   143mb
    yellow open 1000-1000-20210826-logtrain glKokYdgSF-XOTeV8N6z8g 3 1 5651888 0 440.6mb 440.6mb
    yellow open algorithmregistry           TgB0lRAuRuOtXgTZDn8uow 1 1       4 0  19.9kb  19.9kb
    yellow open buildended                  PLPeek70RvO10za3IU75aw 1 1       0 0    208b    208b
    yellow open buildinfo                   l7feMWMgQ3iJ7RVKduI3kQ 1 1       0 0    208b    208b
    yellow open buildresult                 Ye9Y7rqyQfmr7RvfzBVQWQ 1 1       0 0    208b    208b
    yellow open buildstarted                27GYtmRmRaeXDxQfrEjF4g 1 1       0 0    208b    208b
    yellow open comment                     zNfEQWASS2m25EyIbrBdUg 1 1       0 0    208b    208b
    yellow open commit                      kONcVRJjQdWaIYGuIYca7w 1 1       0 0    208b    208b
    yellow open connection                  dsGVfusKRcqTNBUxzSvMXQ 1 1       0 0    208b    208b
    yellow open dataset                     nuKfCvtJQ0aj6tqUpGMK7g 1 1       1 1   5.2kb   5.2kb
    yellow open filechange                  nazyejwVQca9HIA9tlelfQ 1 1       0 0    208b    208b
    yellow open issue                       eg68Y4DqQAGcGYiwT7yyvg 1 1       0 0    208b    208b
    yellow open language                    fjd12P1hT2mgeQWyvPB5JA 1 1       0 0    208b    208b
    yellow open postchecktrainingdetails    V7bKW23sQ_WSz4nCpSJiEQ 1 1       0 0    208b    208b
    yellow open prechecktrainingdetails     WPQdHUU3QJi5wu8fA0Pt2w 1 1       0 0    208b    208b
    yellow open pullrequest                 oc6lpTp7SCeOF9OZnxxuIw 1 1       0 0    208b    208b
    yellow open repository                  I3BohAqrQKinXAyiIm3QqA 1 1       0 0    208b    208b
    yellow open repositoryscan              NN64k02FTrSkZ5Y5z3KPqw 1 1       0 0    208b    208b
    yellow open repositoryscanreport        kZUe-zpIQcqBFnbPMG-QKQ 1 1       0 0    208b    208b
    yellow open repositoryscanreportdata    9LYUsJQbTEKTyLHyIn6vKQ 1 1       0 0    208b    208b
    yellow open snowchangerequest           Z8qaHZMjTQ2ZS53N1n59ew 1 1       0 0    208b    208b
    yellow open snowincident                ATSzyZLdQb2N1GHmsEUlTQ 1 1       0 0    208b    208b
    yellow open snowproblem                 4wUzvPRNSsq-6_c-h1az3g 1 1       0 0    208b    208b
    yellow open trainingdefinition          RSFvStURSkKh98qIefolvA 1 1       1 0   8.7kb   8.7kb
    yellow open trainingsrunning            QS03CgRJQNG__cEDQdA6kw 1 1       0 0    208b    208b

    Not sure how to interpret this, though. The first two lines look promising :)
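
    For reading that output: by default the columns of _cat/indices are health, status, index name, UUID, primary shards, replicas, docs.count, docs.deleted, store.size and pri.store.size. A "yellow" health value means the primary shards are allocated but at least one replica is not. A rough sketch that totals the document count of the training indices follows. It assumes the "-logtrain" suffix marks them, and the function name is illustrative.

```shell
# Print "<index> <docs.count>" for each *-logtrain index, then a total line.
# Reads _cat/indices output (without a header row) on stdin.
# Column 3 is the index name, column 7 is docs.count.
summarize_logtrain() {
  awk '$3 ~ /-logtrain$/ { print $3, $7; total += $7 }
       END { print "total", total + 0 }'
}
```

    Usage inside the api-server pod: curl -s -k -u $ES_USERNAME:$ES_PASSWORD $ES_URL/_cat/indices | summarize_logtrain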

    I will find a way to share my email with you via private channel. 



    ------------------------------
    Danilo Luna
    ------------------------------



  • 5.  RE: Model training debug

    User Group Leader
    Posted Fri August 27, 2021 04:42 PM

    Yes that looks good …

    You should set the parallelism in your log connector to 4 both for base and for source (field is below the date range there).

    This is assuming you have turned on the connection for historical data for initial training and set consistent dates there and in the training definition under model management.
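
    One way to cross-check the dates is sketched below. It assumes that the eight-digit field in each logtrain index name (for example 20210823) is the day the data belongs to. That reading of the name is a guess based on the output earlier in this thread, and the function name is illustrative.

```shell
# Print "<index> <docs.count>" for logtrain indices whose date field
# falls within [lo, hi]. Dates are given as YYYYMMDD.
# Reads _cat/indices output on stdin.
indices_in_range() {
  awk -v lo="$1" -v hi="$2" '
    $3 ~ /-logtrain$/ {
      n = split($3, parts, "-")
      day = parts[n - 1]            # field just before "logtrain"
      if (length(day) == 8 && day ~ /^[0-9]+$/ &&
          day + 0 >= lo + 0 && day + 0 <= hi + 0)
        print $3, $7
    }'
}
```

    For example, piping the _cat/indices output into indices_in_range 20210820 20210827 would list the indices that fall inside that window. If nothing is printed for the period set in the training definition, the date ranges are probably not consistent.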




    ------------------------------
    Angus Jamieson
    IT Service Management Solutions Architect
    IBM
    Edinburgh
    ------------------------------