Cloud Pak for Data


Ask the room: Question for data scientists: Out of all the data you have, how do you figure out which data you need to feed the AI models so that they give you good results?

  • 1.  Ask the room: Question for data scientists: Out of all the data you have, how do you figure out which data you need to feed the AI models so that they give you good results?

    Posted Thu November 12, 2020 02:06 PM
    Edited by System Fri January 20, 2023 04:40 PM
    How would you answer this question:

    Question for data scientists: Out of all the data you have, how do you figure out which data you need to feed the AI models so that they give you good results?

    About the "Ask the room" questions: We'll be posting regularly for the next several weeks to ask the full Cloud Pak for Data community how they would answer your questions. These questions were selected from the results of the survey we conducted recently. If you took the survey, you might remember this prompt: If you were in a room full of Cloud Pak for Data users, what question would you ask?


    ------------------------------
    Shannon Rouiller
    ------------------------------


    #AskTheRoom
    #CloudPakforDataGroup


  • 2.  RE: Ask the room: Question for data scientists: Out of all the data you have, how do you figure out which data you need to feed the AI models so that they give you good results?

    Posted Fri November 13, 2020 09:31 AM
    The answer is easy: you don't! You can't know a priori which data is "best". Indeed, the grail of data science is to "discover" the best data: the data that will give the "best" results.

    Now this notion is failing, in my opinion, in the modern data science community. The thing is, most data will produce "something." That is, blindly applying some statistical formula to some data will produce some results. The question should be, but often is not: is this result valid? And voilà, you are at the beginning of the scientific method.

    The problem today is that "scientists" are pressured to publish, or to get results. So, quite often, if the results simply match the scientist's expectation, they are considered valid. This is not good science.

    The process is iterative. We have, in fact, extensive tools at our disposal, especially at IBM, to test the validity of a specific formula applied to a specific data set. In a product like Watson Discovery, for example, the tools are built into the UI used to create the model: the results will show you clearly whether your data provides accuracy, recall, or both. Therein lies a use case question: which do you prefer? Police would prefer recall; lawyers, accuracy.
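
    To make the accuracy-versus-recall trade-off concrete, here is a minimal sketch in plain Python. It does not reflect how Watson Discovery computes its metrics; it reads "accuracy" in the sense of precision (the share of returned items that are correct), which is an assumption, and the labels and the two hypothetical models ("cautious" and "broad") are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative data only: four relevant items (1) and four irrelevant ones (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]

# A "cautious" model flags only what it is sure about (the lawyer's preference).
cautious = [1, 0, 0, 0, 0, 0, 0, 0]

# A "broad" model flags anything suspicious (the police preference).
broad = [1, 1, 1, 1, 1, 0, 0, 1]

for name, preds in (("cautious", cautious), ("broad", broad)):
    precision, recall = precision_recall(y_true, preds)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

    The cautious model never raises a false alarm but misses most of the relevant items; the broad model finds every relevant item at the cost of some false alarms. Which one is "better" depends on the use case, not on the data alone.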

    A tool like SPSS gives you the ability to test 20+ predictive and descriptive analytic formulae on your data, and to view the statistical accuracy of those formulae on that data. In other words, you will see which data work and which don't. But then again, painful science requires that you know WHY one works and the other does not.
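
    The idea of seeing "which data work, and which don't" can be sketched without any particular product. The example below is not SPSS's procedure; it is a toy illustration that scores candidate feature subsets with a nearest-centroid classifier and leave-one-out validation. The data set, the column names ("signal", "noise"), and the function names are all hypothetical.

```python
# Each row: (signal, noise, label). "signal" separates the classes; "noise" does not.
ROWS = [
    (1.0, 3.0, 0), (1.2, 7.0, 0), (0.8, 3.1, 0), (1.1, 6.9, 0),
    (5.0, 3.2, 1), (5.2, 7.1, 1), (4.8, 2.9, 1), (5.1, 7.2, 1),
]
FEATURES = {"signal": 0, "noise": 1}


def centroid(rows, cols):
    """Mean of the selected columns over the given rows."""
    return [sum(r[c] for r in rows) / len(rows) for c in cols]


def predict(train, cols, row):
    """Assign the row to the class whose centroid is nearest (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted({r[-1] for r in train}):
        center = centroid([r for r in train if r[-1] == label], cols)
        dist = sum((row[c] - m) ** 2 for c, m in zip(cols, center))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(rows, feature_names):
    """Leave-one-out accuracy: hold out each row once, train on the rest."""
    cols = [FEATURES[name] for name in feature_names]
    hits = 0
    for i, row in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        hits += predict(train, cols, row) == row[-1]
    return hits / len(rows)


for subset in (["signal"], ["noise"], ["signal", "noise"]):
    print(subset, loo_accuracy(ROWS, subset))
```

    A score like this only tells you that a column helps on held-out rows; it does not tell you why, which is the harder question raised above.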

    Just a final note about machine learning: the "machine" is not so smart.  This means that the training data you provide the model must be "perfect".  Otherwise, you just don't get results.  Think of training a dog (not that dogs aren't smart! they just don't think like humans).  You need to consistently provide the dog with the same feedback.  Every time. With exactly the right input data, the ML model will come out "perfect".  Unfortunately, if you apply this model to human communication, you are possibly making the wrong assumption.  Humans don't think like machines.

    ------------------------------
    Kameron Cole
    ------------------------------



  • 3.  RE: Ask the room: Question for data scientists: Out of all the data you have, how do you figure out which data you need to to feed the AI models so that they give you good results?

    Posted Thu December 10, 2020 04:55 PM
    It is important to understand the business needs first, and only then move on to analyzing and interpreting the data.
    
    With these business drivers and an understanding of the data in hand, in my opinion, there are important steps to take in discovering a good AI model for your business problem.
    
    At the model selection stage, it will probably be necessary to test several options before identifying the one with the best accuracy and efficiency. AutoAI has been an excellent option to support this discovery.


    ------------------------------
    Miguel Povoa
    ------------------------------