Global AI and Data Science

What if a column has about 90% of data as outliers?

  • 1.  What if a column has about 90% of data as outliers?

    Posted Mon July 12, 2021 11:05 AM
    I'm working on a prediction problem where one of the columns, when checked for outliers, shows that almost 90% of its values are outliers. What should be done in this scenario? Should the column be dropped, or should its outliers be treated the same way as in any other column?

    Please advise.

    ------------------------------
    Abhinay Kattingeri
    ------------------------------

    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 13, 2021 03:32 AM
    Hello Abhinay,

    Could you please share some descriptive statistics supporting your conclusion, e.g. central tendency, dispersion, and distribution, maybe even a box plot?


    ------------------------------
    Meni Berger
    ------------------------------



  • 3.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 13, 2021 04:30 AM
    Hi Meni, 

    Last night, when I was going through the problem again, I realized that only about 11% of the column's values are outliers, not 90%. I apologize for the confusion.
    However, could you please tell me whether I'm taking the right approach here? I've attached the information that you requested as screenshots.

    The distributions of these columns are positively skewed. Would it be a good option to replace all the outliers with the median, or do you suggest another way to handle them? Please go through the descriptive statistics and let me know if I should be taking any additional steps.

    Please advise.

    ------------------------------
    Abhinay Kattingeri
    ------------------------------
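
For reference, one common way to arrive at a figure like the 11% above is the box-plot (Tukey) rule, which flags values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR. The thread does not say which rule was actually used, so this is only a minimal sketch under that assumption; the `column` values and the helper name `iqr_outlier_share` are made up for illustration.

```python
import statistics

def iqr_outlier_share(values, k=1.5):
    """Fraction of values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)

# Hypothetical right-skewed column: mostly small values plus a long right tail.
column = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 40, 120]

share = iqr_outlier_share(column)
print(f"{share:.1%} of values fall outside the fences")
```

Running the same function on every numeric column gives a quick per-column outlier share to compare against the box plots.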



  • 4.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 13, 2021 05:10 AM
    Yes, it looks awfully right-skewed. I would approach this with transformations. First, I would try a square-root transformation to see if it has any effect. If not (it usually doesn't work...), a log transformation can be used; this usually works well on right-skewed data. If the log fails (e.g. the transformed data still shows skewness and kurtosis beyond roughly ±0.5 to ±1.0), you might try more "aggressive" techniques such as the Box-Cox transformation, which are highly effective but much more complicated to implement.
    *** Please remember: once a variable is transformed, you must use the transformed variable consistently in all your future models. You must also mark it clearly so that other users/clients know for sure that they are dealing with a transformed variable! ***



    ------------------------------
    Meni Berger
    ------------------------------
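
The sequence described above (square root, then log, then Box-Cox, checking skewness after each step) can be sketched as follows. This is a standard-library illustration, not a definitive implementation: the data are hypothetical, the helper names `skewness`, `box_cox` and `best_lambda` are made up, and lambda is picked by a coarse grid search over the Box-Cox profile log-likelihood. In practice a library routine such as `scipy.stats.boxcox` is the usual choice. All three transformations assume strictly positive values.

```python
import math

def skewness(xs):
    """Biased sample skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def box_cox(xs, lam):
    """Box-Cox transform of strictly positive data for a given lambda."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]

def box_cox_loglik(xs, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(xs)
    ys = box_cox(xs, lam)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(x) for x in xs)

def best_lambda(xs):
    """Grid search over lambda in [-2, 2] in steps of 0.1 (an assumption)."""
    grid = [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: box_cox_loglik(xs, lam))

# Hypothetical strictly positive, right-skewed column.
column = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 45, 150]

lam = best_lambda(column)
results = {
    "raw": skewness(column),
    "sqrt": skewness([math.sqrt(x) for x in column]),
    "log": skewness([math.log(x) for x in column]),
    "box-cox": skewness(box_cox(column, lam)),
}
for name, value in results.items():
    print(f"{name:8s} skewness = {value:+.2f}")
print("selected lambda:", lam)
```

Comparing the printed skewness values against the ±0.5 to ±1.0 band mentioned above shows which transformation, if any, is good enough for a given column.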



  • 5.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 13, 2021 05:24 AM
    Thanks for your suggestions, Meni. There are a few other columns in this data that behave similarly; I'll try both square-root and log transformations to see if they help. This brings me to another question: let's say I've performed log transformations and they worked. After that, I wouldn't need to apply a Min-Max scaler to the numeric variables, right? I'm assuming the log transformation will take care of scaling the data as well?

    ------------------------------
    Abhinay Kattingeri
    ------------------------------



  • 6.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 02:45 AM
    Yes. It is not best practice to re-transform an already transformed variable; this might lead to scale reduction and loss of data.

    ------------------------------
    Meni Berger
    ------------------------------
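
Whether a scaler is still useful after a log transformation depends on the model being trained. As a point of reference, the sketch below only shows what each step does to the range of two hypothetical columns (`income` and `visits` are made-up names and values): the log compresses the right tail but leaves each column on its own scale, whereas min-max scaling maps every column to [0, 1].

```python
import math

def min_max_scale(xs):
    """Map values linearly so that min(xs) -> 0.0 and max(xs) -> 1.0."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Two hypothetical right-skewed columns on very different scales.
income = [20_000, 35_000, 50_000, 120_000, 900_000]
visits = [1, 2, 3, 8, 60]

log_income = [math.log(x) for x in income]
log_visits = [math.log(x) for x in visits]

# After the log, the two columns still span different ranges.
print("log income range:", round(min(log_income), 2), "to", round(max(log_income), 2))
print("log visits range:", round(min(log_visits), 2), "to", round(max(log_visits), 2))

# Min-max scaling of the log-transformed values puts both on [0, 1].
scaled_income = min_max_scale(log_income)
scaled_visits = min_max_scale(log_visits)
print("scaled income:", [round(v, 2) for v in scaled_income])
print("scaled visits:", [round(v, 2) for v in scaled_visits])
```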



  • 7.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 04:04 AM
    Sure. As you suggested, I performed square-root, log, and Box-Cox transformations. None of them succeeded in transforming the data into a near-Gaussian distribution. In this case, I'm assuming I should use non-parametric ML regression models such as KNN, decision tree, and SVR. Do you think this is a good approach?

    ------------------------------
    Abhinay Kattingeri
    ------------------------------



  • 8.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 07:28 AM
    Sorry to hear that. I think the C5.0 decision tree is the most efficient approach.


    ------------------------------
    Meni Berger
    ------------------------------



  • 9.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 11:16 AM
    Could you please explain why a C5.0 decision tree would be the most efficient approach in this case? I'd like to understand the thought process behind it. Please note that this is a regression problem.

    ------------------------------
    Abhinay Kattingeri
    ------------------------------
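
The thread does not record the reasoning asked for here. One property commonly cited for tree-based models on skewed features is that a split depends only on the ordering of the feature values, so a monotone transformation such as the log does not change which rows end up on each side. The sketch below illustrates this with a single variance-reduction split (a regression stump), not C5.0 itself; the data and the helper name `best_split` are hypothetical.

```python
import math

def best_split(xs, ys):
    """Row indices on the left side of the split on x with the lowest squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        if xs[left[-1]] == xs[right[0]]:
            continue  # cannot split between two equal feature values
        sse = 0.0
        for part in (left, right):
            mean = sum(ys[i] for i in part) / len(part)
            sse += sum((ys[i] - mean) ** 2 for i in part)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left

# Hypothetical skewed feature and numeric target.
x = [1, 2, 3, 5, 8, 40, 150, 900]
y = [10, 12, 11, 13, 30, 32, 31, 33]

raw_partition = best_split(x, y)
log_partition = best_split([math.log(v) for v in x], y)

print("rows on the left (raw x):", sorted(raw_partition))
print("rows on the left (log x):", sorted(log_partition))
print("same partition:", raw_partition == log_partition)
```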



  • 10.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 13, 2021 05:23 AM
    Also, please do not replace missing values (RMV) with the median or mean! This will introduce substantial bias into your data and severely affect your model's validity. If you must impute (above 25% missing values), please use multiple imputation to handle the missing values.

    ------------------------------
    Meni Berger
    ------------------------------
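
One concrete form of the bias mentioned above is that filling every gap with the same value shrinks the spread of the column. The sketch below shows only that effect on a hypothetical column with missing entries encoded as `None`; the helper name `impute_with_median` is made up, and multiple imputation itself is not shown.

```python
import statistics

def impute_with_median(xs):
    """Replace None entries with the median of the observed values."""
    observed = [x for x in xs if x is not None]
    med = statistics.median(observed)
    return [med if x is None else x for x in xs]

# Hypothetical column with three missing values.
column = [2.0, None, 3.0, 5.0, None, 8.0, 13.0, None, 40.0, 4.0]

observed = [x for x in column if x is not None]
filled = impute_with_median(column)

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)

print(f"std. dev. of observed values:    {sd_observed:.2f}")
print(f"std. dev. after median imputing: {sd_filled:.2f}")
```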



  • 11.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 13, 2021 05:45 AM
    Edited by System Fri January 20, 2023 04:16 PM
    In the current case study, around 11% of the values are missing in a couple of columns. How would you suggest I handle these missing values?

    Also, I tried square-root, log, and Box-Cox transformations, and none of them managed to transform the data into a Gaussian distribution. Is there anything else that I can do?

    ------------------------------
    Abhinay Kattingeri
    ------------------------------



  • 12.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 08:42 AM
    If one column has 90% outliers, it clearly means that the variable does not contribute much to the prediction of the target variable.
    Hence, it would be prudent to remove it from the model.
    Jojo Jacob




  • 13.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 08:42 AM
    Hi Abhinay,
    That is definitely extremely right-skewed data. If you have a large number of explanatory variables, maybe you could drop these variables. With so many outliers, the effect on prediction will be bad.
    If you have to use these variables, transforming is the only way. You may use a log transformation to do it.
    Also try replacing the most extreme 10% (a heuristic figure) of outliers with the median/mean and see the results.

    Jojo Jacob

    ------------------------------
    Jojo Jacob
    ------------------------------
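
The last suggestion above can be tried with a few lines of code. This is a minimal sketch, assuming "most extreme" means the largest values of a right-skewed column and using the median as the replacement; the data and the helper name `replace_extremes_with_median` are hypothetical.

```python
import statistics

def replace_extremes_with_median(xs, share=0.10):
    """Replace the largest `share` of values with the median of the column."""
    k = max(1, round(len(xs) * share))
    med = statistics.median(xs)
    # Indices of the k largest values.
    top = set(sorted(range(len(xs)), key=lambda i: xs[i])[-k:])
    return [med if i in top else x for i, x in enumerate(xs)]

# Hypothetical right-skewed column with 20 values.
column = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 15, 40, 120]

cleaned = replace_extremes_with_median(column)

print("max before:", max(column), " max after:", max(cleaned))
print("mean before:", round(statistics.mean(column), 2),
      " mean after:", round(statistics.mean(cleaned), 2))
```

Comparing model results with and without the replacement is the "see the results" step; the 10% share is a parameter to vary, not a fixed rule.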



  • 14.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 09:54 AM
    Hello Jojo, 

    Thanks for your advice. I came to realize that only 11% of the column's values are outliers (I had miscalculated this as 90% earlier). In this case, I have replaced the outliers with the median.

    However, the distribution of these columns still tends to be extremely positively skewed even after square-root, log, and Box-Cox transformations. Hence, I'm looking to apply non-parametric ML models to this data. As Meni suggested, I will try the C5.0 decision tree model.

    That being said, I have another question, if you could help clarify: is there a difference between data cleaning and feature engineering? As far as I know, data cleaning falls under feature engineering, which includes one-hot encoding, feature selection, etc. Is my interpretation correct?

    ------------------------------
    Abhinay Kattingeri
    ------------------------------



  • 15.  RE: What if a column has about 90% of data as outliers?

    Posted Wed July 14, 2021 12:20 PM
    Yes, there is a clear difference between data cleaning and feature engineering.
    Data cleaning is the part of data processing where we remove anomalies in the data. Common data cleaning steps include removing or imputing missing values, renaming column headers to a single pattern (e.g. all lowercase letters, with spaces replaced by "_"), and checking and ensuring correct data types for variables (e.g. changing a date field from string to datetime), depending on the data set.

    Feature engineering refers to creating a new feature or variable out of the existing ones. Examples include binning a continuous age variable into age groups, creating a total-number-of-family-members field in the toy Titanic data set, or computing the difference between a start date and an end date if it is not given.

    Hope you find this helpful
    Jojo

    ------------------------------
    Jojo Jacob
    ------------------------------
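
The distinction drawn above can be sketched in code. The rows below are hypothetical; the column names `SibSp` and `Parch` follow the Titanic data set mentioned in the post, while the date columns, the age cut-offs and the function names `clean` and `engineer` are assumptions made for illustration.

```python
from datetime import date

# Hypothetical raw rows with inconsistent headers and string-typed values.
raw_rows = [
    {"Passenger Age": "34", "SibSp": "1", "Parch": "2",
     "Start Date": "2021-07-01", "End Date": "2021-07-15"},
    {"Passenger Age": "", "SibSp": "0", "Parch": "0",
     "Start Date": "2021-07-03", "End Date": "2021-07-04"},
]

def clean(row):
    """Data cleaning: normalize headers, fix data types, mark missing values."""
    out = {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
    out["passenger_age"] = int(out["passenger_age"]) if out["passenger_age"] else None
    out["sibsp"], out["parch"] = int(out["sibsp"]), int(out["parch"])
    out["start_date"] = date.fromisoformat(out["start_date"])
    out["end_date"] = date.fromisoformat(out["end_date"])
    return out

def engineer(row):
    """Feature engineering: derive new variables from the existing ones."""
    age = row["passenger_age"]
    features = dict(row)
    if age is None:
        features["age_group"] = None
    else:
        features["age_group"] = "child" if age < 18 else "adult" if age < 65 else "senior"
    features["family_size"] = row["sibsp"] + row["parch"] + 1  # +1 for the passenger
    features["duration_days"] = (row["end_date"] - row["start_date"]).days
    return features

rows = [engineer(clean(r)) for r in raw_rows]
for r in rows:
    print(r["age_group"], r["family_size"], r["duration_days"])
```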



  • 16.  RE: What if a column has about 90% of data as outliers?

    Posted Thu July 15, 2021 05:59 AM
    Thanks for the clarification, Jojo.

    ------------------------------
    Abhinay Kattingeri
    ------------------------------



  • 17.  RE: What if a column has about 90% of data as outliers?

    Posted Mon July 19, 2021 04:10 PM
    If 90% are outliers, they're not outliers.

    ------------------------------
    Scott Terry
    ------------------------------



  • 18.  RE: What if a column has about 90% of data as outliers?

    Posted Tue July 20, 2021 02:24 AM
    I agree with Scott. If 90% of a column's values are outliers, they are not outliers. This suggests that your assumption or understanding differs from the actual data available.
    If you had said 90% is missing, that would be understandable. When your assumption is that 90% are outliers, you need to validate your data source or change your assumption.

    Thanks,
    Rajkumar Rajasekaran

    ------------------------------
    Rajkumar Rajasekaran
    ------------------------------



  • 19.  RE: What if a column has about 90% of data as outliers?

    Posted Mon August 09, 2021 09:17 AM
    Thank you for your suggestions, Meni. There are a few other columns in this data that behave similarly; I'll try both square-root and log transformations to see whether they make a difference. This brings me to another question: suppose I've performed log transformations and they worked.

    ------------------------------
    Theresa Chaney
    ------------------------------



  • 20.  RE: What if a column has about 90% of data as outliers?

    Posted Mon August 23, 2021 02:35 AM
    You are welcome @Theresa Chaney.​

    ------------------------------
    Meni Berger
    ------------------------------



  • 21.  RE: What if a column has about 90% of data as outliers?

    Posted Thu August 19, 2021 11:29 AM
    That is definitely extremely right skewed data. 

    ------------------------------
    Stephen Crenshaw
    ------------------------------