Global AI and Data Science

Tukey Fences for Outliers

By Moloy De posted Thu March 25, 2021 10:32 PM

  
An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution. In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.

Deletion of outlier data is a controversial practice frowned upon by many scientists and science instructors; while mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.

There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. There are various methods of outlier detection. Some are graphical, such as normal probability plots. Others are model-based. Box plots are a hybrid of the two.

Model-based methods which are commonly used for identification assume that the data are from a normal distribution, and identify observations which are deemed "unlikely" based on mean and standard deviation:
1. Chauvenet's criterion
2. Grubbs's test for outliers
3. Dixon's Q test
4. ASTM E178 Standard Practice for Dealing With Outlying Observations
5. Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models.
6. Subspace and correlation based techniques for high-dimensional numerical data.
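
As a rough illustration of the "unlikely given the mean and standard deviation" idea, the sketch below flags values that lie more than a chosen number of sample standard deviations from the sample mean. It is a simplified stand-in, not an implementation of Chauvenet's, Grubbs's or Dixon's procedures; the function name, the threshold of 3 and the sample readings are assumptions made for the example.

```python
from statistics import mean, stdev

def flag_by_z_score(data, threshold=3.0):
    """Return values lying more than `threshold` sample standard
    deviations from the sample mean (a simplified normal-theory rule)."""
    m = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 denominator)
    return [x for x in data if abs(x - m) > threshold * s]

# Hypothetical readings: 19 values close to 10 and one suspicious value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
            10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.0, 10.1, 9.9, 25.0]

flagged = flag_by_z_score(readings)
print(flagged)
```

Note that the suspicious value itself inflates both the mean and the standard deviation, so in small samples a rule of this kind can fail to flag anything at all.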

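The Tukey fence is a nonparametric outlier detection method based on the interquartile range (IQR). If Q1 and Q3 are the lower and upper quartiles respectively, an observation is considered an outlier when it falls outside the range

[Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)]

for some nonnegative constant k. John Tukey proposed this test, where k = 1.5 indicates an "outlier" and k = 3 indicates data that are "far out". With k = 1.5, the "fences" lie a distance of 1.5 IQR below the 1st quartile and 1.5 IQR above the 3rd quartile. Because the fences are built from quartiles rather than from the mean and standard deviation, Tukey fences are a robust way of detecting outliers.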
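
The fence calculation can be sketched in a few lines of Python. This is a minimal sketch that uses only the standard library; the function names and the sample data are assumptions made for the example. Quartile conventions differ between software packages, and the "inclusive" method of `statistics.quantiles` is used here as one possible choice, so the fences may differ slightly from those of other tools.

```python
from statistics import quantiles

def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outside_fences(data, k=1.5):
    """Return the observations that fall outside the Tukey fences."""
    low, high = tukey_fences(data, k)
    return [x for x in data if x < low or x > high]

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(tukey_fences(sample))           # fences for k = 1.5 ("outlier")
print(outside_fences(sample))         # values beyond the k = 1.5 fences
print(outside_fences(sample, k=3.0))  # values beyond the k = 3 fences ("far out")
```

Here the value 100 lies beyond both the k = 1.5 and the k = 3 fences, so it would be labelled "far out" in Tukey's terminology.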

QUESTION I: Could we apply Tukey fences to ranked observations?
QUESTION II: Could we find the probability of an observation falling outside the Tukey fences?
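
One way to start exploring QUESTION II is to assume a specific distribution and use its exact quartiles. The sketch below does this for a normal distribution with Python's `statistics.NormalDist`; the function name is an assumption made for the example. It uses population quartiles rather than quartiles estimated from a sample, so it only approximates what happens with real data, and it says nothing about non-normal distributions.

```python
from statistics import NormalDist

def escape_probability(k=1.5, dist=NormalDist()):
    """Probability that a draw from `dist` falls outside the Tukey fences
    computed from the distribution's own (population) quartiles."""
    q1 = dist.inv_cdf(0.25)
    q3 = dist.inv_cdf(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Mass below the lower fence plus mass above the upper fence.
    return dist.cdf(low) + (1.0 - dist.cdf(high))

p_outlier = escape_probability(1.5)
p_far_out = escape_probability(3.0)
print(f"k = 1.5: {p_outlier:.4%}")
print(f"k = 3.0: {p_far_out:.6%}")
```

Under these assumptions the k = 1.5 fences sit at about 2.7 standard deviations from the mean, so roughly 0.7% of normally distributed observations fall outside them by chance alone.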


REFERENCE: Wikipedia
#GlobalAIandDataScience
#GlobalDataScience