Which Kaggle competitions have the messiest data sets, and how can you find them?
Can you give (or link to) some advice on combining multiple data sets from Kaggle and outside Kaggle for an overall analysis?
I might be starting a professionally supervised Data Science project that will involve Data Collection, Business Understanding, Data Cleaning, Missing Data, Systematic Noise, Data Visualization, Feature Engineering, Modeling / Forecasting, Optimization and Report Writing.
I want the data cleaning and control of systematic noise to be challenging and to accurately represent the data wrangling a Data Scientist or Data Engineer does on a large project. Thus, I need appropriate data sets.
Three sources of messy data I currently know of outside Kaggle are: Medicare Advantage Plan Benefit Package data, Medicare Provider Utilization / Payment Data, and MIT's MIMIC critical care database.
However, all three of these non-Kaggle data sets cover only healthcare.
------------------------------
Bryan Atkinson
------------------------------