Data Science

Finding Messy Datasets and Combining them?

  • 1.  Finding Messy Datasets and Combining them?

    Posted Thu March 14, 2019 02:41 PM
    Which Kaggle competitions have the messiest datasets, and how can I find them?

    Can you give (or link to) some advice on combining multiple data sets from Kaggle and outside Kaggle for an overall analysis?

    I might be starting a professionally supervised Data Science project that will involve Data Collection, Business Understanding, Data Cleaning, Missing Data, Systematic Noise, Data Visualization, Feature Engineering, Modeling / Forecasting, Optimization and Report Writing.

    I want the data cleaning and control of systematic noise to be challenging and to accurately represent the data wrangling a Data Scientist or Data Engineer does on a large project. Thus, I need appropriate datasets.

    Three sources of messy data I currently know of outside Kaggle are: Medicare Advantage Plan Benefit Package data, Medicare Provider Utilization / Payment Data, and MIT's MIMIC critical care database.

    However, all of these external datasets are about healthcare.

    ------------------------------
    Bryan Atkinson
    ------------------------------


  • 2.  RE: Finding Messy Datasets and Combining them?

    Posted 30 days ago
    Hello @Bryan Atkinson, I admire that you are looking for the messiest or noisiest datasets. Data cleaning and wrangling are fun; I do them a lot. When it comes to Kaggle competitions, most of the datasets are messy, though maybe not the kind of messiness you want. One question, please: are you looking for healthcare datasets only?

    Cheers,
    Damilola


    ------------------------------
    Damilola Omifare
    ------------------------------



  • 3.  RE: Finding Messy Datasets and Combining them?

    Posted 30 days ago
    @Damilola Omifare I am interested in any messy datasets which deal with complex network data.

    I also recently found a database of something like 30,000 anime characters which categorizes them by skin tone, eye color, hair color, role, and a dozen other characteristics.

    I'm thinking about doing an analysis of how strongly the makeup of an anime's character cast is associated with that anime's popularity.

    ------------------------------
    Bryan Atkinson
    ------------------------------



  • 4.  RE: Finding Messy Datasets and Combining them?

    Posted 28 days ago
    @Bryan Atkinson, this is honestly interesting to hear. I would not mind working on it with you. You might end up with something that can be turned into a machine learning algorithm that others can build on. I am trying to see what I can find on this.
    Where did you get the data from, if you do not mind sharing?

    ------------------------------
    Damilola Omifare
    ------------------------------