Hello Jojo,
Thanks for your advice. I have since realized that only about 11% of the values in the column are outliers (I had miscalculated this as 90% earlier). I have therefore replaced the outliers with the median.
However, the distributions of these columns still tend to be extremely positively skewed, even after square-root, log, and Box-Cox transformations. Hence, I'm looking to apply non-parametric ML models to this data. As Meni suggested, I will try the C5.0 decision tree model.
That being said, I have another question, if you could help clarify: is there a difference between data cleaning and feature engineering? As I understand it, data cleaning falls under feature engineering, which also includes one-hot encoding, feature selection, etc. Is my interpretation correct?
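For readers who want to reproduce this kind of check, here is a minimal sketch that compares skewness before and after the three transformations mentioned above. The sample values, the helper names `skewness` and `box_cox`, and the fixed Box-Cox lambda of -0.5 are all assumptions made for illustration; in practice lambda is usually estimated from the data rather than fixed by hand.

```python
import math
import statistics


def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values and a given lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical right-skewed column with one extreme value.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]

transformed = {
    "raw": sample,
    "sqrt": [math.sqrt(v) for v in sample],
    "log": [math.log(v) for v in sample],
    "box_cox (lambda=-0.5, assumed)": box_cox(sample, -0.5),
}

for name, vals in transformed.items():
    print(f"{name}: skewness = {skewness(vals):.2f}")
```

If the skewness stays large after every transform, as described here, a tree-based model may be a reasonable next step, since splits on a single feature depend only on the ordering of its values and not on their scale.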
------------------------------
Abhinay Kattingeri
------------------------------
Original Message:
Sent: Wed July 14, 2021 02:53 AM
From: Jojo Jacob
Subject: What if a column has about 90% of data as outliers?
Hi Abhinay,
That is definitely extremely right-skewed data. If you have a large number of explanatory variables, maybe you could drop these variables. With so many outliers, the effect on prediction will be bad.
If you have to use these variables, transforming them is the only way. You could use a log transform.
Also try replacing the most extreme 10% (a heuristic figure) of outliers with the median/mean and see the results.
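The replacement step could be sketched as follows. This assumes "most extreme" means the largest values, which fits right-skewed data, and that the replacement is the median of the whole column; the function name `replace_top_share_with_median` and the sample values are hypothetical.

```python
import statistics


def replace_top_share_with_median(values, share=0.10):
    """Replace the largest `share` of values with the median of all values."""
    median = statistics.median(values)
    n_replace = int(len(values) * share)
    if n_replace == 0:
        return list(values)
    # Indices of the n_replace largest values, so original order is preserved.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    to_replace = set(order[:n_replace])
    return [median if i in to_replace else v for i, v in enumerate(values)]


# Hypothetical right-skewed column with one extreme value.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
cleaned = replace_top_share_with_median(sample, share=0.10)
print(cleaned)
```

The 10% share is a parameter so that other heuristic cut-offs can be tried and the results compared.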
Jojo Jacob
------------------------------
Jojo Jacob
Original Message:
Sent: Tue July 13, 2021 04:29 AM
From: Abhinay Kattingeri
Subject: What if a column has about 90% of data as outliers?
Hi Meni,
Last night, when I was going through the problem again, I realized that only 11% of the data in these columns are outliers, not 90%. I apologize for the confusion.
However, could you please suggest if I'm taking the right approach here? I've attached the information that you requested as screenshots.
The distributions of these columns are positively skewed. Would it be a good option to replace all the outliers with the median, or would you suggest another way to handle them? Please go through the descriptive statistics and let me know if I should take any additional steps.
Please advise.
------------------------------
Abhinay Kattingeri
Original Message:
Sent: Tue July 13, 2021 03:31 AM
From: Meni Berger
Subject: What if a column has about 90% of data as outliers?
Hello Abhinay,
Could you please share some descriptive statistics supporting your conclusion, e.g. central tendency, dispersion, distribution, maybe even a box plot?
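A minimal sketch of the kind of summary being asked for, using only the Python standard library. The `describe` helper and the sample column are hypothetical; the quartiles are the same numbers a box plot would draw.

```python
import statistics


def describe(values):
    """Summarize central tendency, dispersion, and shape of one column."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Population skewness: positive values indicate a long right tail.
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    return {
        "count": n,
        "mean": mean,
        "median": median,
        "std": sd,
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
        "skewness": skew,
    }


# Hypothetical right-skewed column with one extreme value.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
for key, value in describe(sample).items():
    print(f"{key}: {value}")
```

A mean well above the median, together with positive skewness, is the usual signature of a right-skewed column.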
------------------------------
Meni Berger
Original Message:
Sent: Mon July 12, 2021 11:05 AM
From: Abhinay Kattingeri
Subject: What if a column has about 90% of data as outliers?
I'm working on a prediction problem where one of the columns, when checked for outliers, shows that almost 90% of its data are outliers. What should be done in this scenario? Should the column be dropped, or should we continue to treat its outliers as we would in any other column?
Please advise.
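One way to sanity-check a figure like this is to recompute the outlier share directly. The sketch below assumes the common 1.5 x IQR rule, which may not be the rule used in this post; the function name `iqr_outlier_fraction` and the sample values are hypothetical.

```python
import statistics


def iqr_outlier_fraction(values, k=1.5):
    """Return the share of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)


# Hypothetical right-skewed column with one extreme value.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
print(iqr_outlier_fraction(sample))
```

Under this convention roughly half of the values lie between Q1 and Q3 by construction, so the rule cannot flag anything close to 90% of a column. A figure that high usually points to a different outlier definition or to a slip in the calculation.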
------------------------------
Abhinay Kattingeri
------------------------------
#GlobalAIandDataScience
#GlobalDataScience