Global AI and Data Science

Train, tune and distribute models with generative AI and machine learning capabilities

  • 1.  Data Model Question

    Posted Wed May 06, 2020 05:12 PM

    I'm used to working with data models in 4th or 5th normal form, and I'm wondering why the data structures used in platforms like Jupyter are not normalized.  They tend to be very wide (many columns) rather than long (many rows).  Today I was watching the part 2 Pandas video and noticed that Johns Hopkins modified its COVID-19 data by pivoting dates that were stored vertically in rows into columns running horizontally.  Can anyone explain this, or point to an article or book on best practices for designing or preparing data for analytical models?

     

    Jerome P Roberts

    IT Specialist

    Philips Oral Healthcare - Los Angeles

    jerome.roberts@philips.com

    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: Data Model Question

    Posted Thu May 14, 2020 11:56 AM
    Hi Jerome,

    Like you, I have a long history of working hard to normalize data, for all of the great reasons to do that in a database, so these big fat rows make me shudder a bit.  :-)  Most machine learning algorithms are trying to determine the effect of a bunch of features on a target, so they like to have the data arranged as one row per observation with one column per feature, even if that means it's been denormalized.  You'll probably find that much less time is spent on data design in DS/ML.  Data scientists often have to just play the data "where it lies" because it's the right data to answer a problem, even if it's not organized in the "right" manner.  There are loads of articles and classes out there that cover the topic.

    https://www.edx.org/course/data-science-wrangling
    https://developers.google.com/machine-learning/data-prep/
    https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73
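
    To make the rows-to-columns pivot from your question concrete, here is a minimal pandas sketch.  The table, column names (region, date, cases) and values are made up for illustration and are not the actual Johns Hopkins data.  It reshapes a long, normalized-style table into a wide one with pivot, then goes back again with melt.

```python
import pandas as pd

# Hypothetical long ("normalized-style") table: one row per region per date.
long_df = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "date": ["2020-05-01", "2020-05-02", "2020-05-01", "2020-05-02"],
    "cases": [10, 12, 3, 5],
})

# Long -> wide: each distinct date becomes its own column,
# leaving one row per region.
wide_df = long_df.pivot(index="region", columns="date", values="cases")

# Wide -> long: melt turns the date columns back into rows.
round_trip = (
    wide_df.reset_index()
    .melt(id_vars="region", var_name="date", value_name="cases")
    .sort_values(["region", "date"])
    .reset_index(drop=True)
)

print(wide_df)
print(round_trip)
```

    Nothing is lost either way.  The wide form is just handier when each column is treated as a feature or a point in a time series, and the long form is closer to what you'd store in a database.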

    ------------------------------
    Craig Maddux
    ------------------------------