Global AI & Data Science
Tukey Fences for Outliers
By Moloy De, posted Thu March 25, 2021 10:32 PM
An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.
Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution. In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.
Deletion of outlier data is a controversial practice frowned upon by many scientists and science instructors; while mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. There are various methods of outlier detection. Some are graphical such as normal probability plots. Others are model-based. Box plots are a hybrid.
Model-based methods which are commonly used for identification assume that the data are from a normal distribution, and identify observations which are deemed "unlikely" based on mean and standard deviation:
1. Chauvenet's criterion
2. Grubbs's test for outliers
3. Dixon's Q test
4. ASTM E178 Standard Practice for Dealing With Outlying Observations
5. Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models.
6. Subspace and correlation based techniques for high-dimensional numerical data.
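As an illustration of the mean-and-standard-deviation approach, here is a minimal sketch of Chauvenet's criterion (item 1), assuming roughly normal data. The function name `chauvenet_outliers` and the sample `readings` are hypothetical and chosen only for this example. A point is flagged when the expected number of observations at least as extreme, n times the two-sided normal tail probability, falls below 0.5.

```python
from statistics import NormalDist, mean, stdev


def chauvenet_outliers(values):
    """Return the values flagged by Chauvenet's criterion.

    Assumption: the data come from a roughly normal population.
    A value is flagged when n * P(|Z| >= |z|) < 0.5.
    """
    n = len(values)
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    standard_normal = NormalDist()
    flagged = []
    for x in values:
        z = abs(x - mu) / sigma
        tail_prob = 2.0 * (1.0 - standard_normal.cdf(z))  # two-sided tail
        if n * tail_prob < 0.5:
            flagged.append(x)
    return flagged


# Hypothetical measurements with one suspicious reading.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 10.3, 14.5]
print(chauvenet_outliers(readings))
```

Note that the suspicious reading itself inflates the mean and standard deviation used to judge it, which is one reason such tests are sensitive to the normality assumption.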
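Tukey fences are a nonparametric outlier detection method. If Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1 is the interquartile range, an observation is considered an outlier when it falls outside the range
$$[\,Q_1 - k\,(Q_3 - Q_1),\; Q_3 + k\,(Q_3 - Q_1)\,]$$
for some nonnegative constant k. John Tukey proposed this test, where k = 1.5 indicates an "outlier", and k = 3 indicates data that is "far out". With k = 1.5, the "fence" boundaries lie a distance of 1.5 IQR beyond the 1st and 3rd quartiles. Because the fences are built from quartiles rather than the mean and standard deviation, they do not assume a normal distribution and are not distorted much by the extreme values they are meant to detect.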
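The fence rule can be sketched in a few lines of Python. This is only an illustration: the function names `tukey_fences` and `tukey_outliers` and the sample data are hypothetical, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which is one of several conventions. Other quartile conventions can move the fences slightly.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles computed with the "inclusive" method.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Return the values lying strictly outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [x for x in values if x < lower or x > upper]


# Hypothetical measurements with one suspicious reading.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 10.3, 14.5]
print(tukey_fences(data))           # fences for k = 1.5 ("outlier")
print(tukey_outliers(data))         # values beyond the k = 1.5 fences
print(tukey_outliers(data, k=3.0))  # values beyond the k = 3 fences ("far out")
```

Passing `k` as a parameter keeps the "outlier" and "far out" thresholds in one function instead of hard-coding 1.5.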
QUESTION I: Could we apply Tukey fences to ranked observations?
QUESTION II: Could we find the probability of an observation escaping the Tukey fences?
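One way to start exploring Question II is to assume a known population distribution and compute the probability mass outside the fences directly. The sketch below reads "escaping" as falling outside the fences, uses population quartiles rather than sample quartiles, and assumes a normal population. All three are assumptions for illustration, and `escape_probability` is a hypothetical helper name.

```python
from statistics import NormalDist


def escape_probability(dist, k=1.5):
    """Probability that one draw from `dist` lies outside the Tukey fences.

    Assumption: the fences are built from the population quartiles of
    `dist`, not from quartiles estimated on a finite sample.
    """
    q1 = dist.inv_cdf(0.25)
    q3 = dist.inv_cdf(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return dist.cdf(lower) + (1.0 - dist.cdf(upper))


p_outlier = escape_probability(NormalDist(), k=1.5)
p_far_out = escape_probability(NormalDist(), k=3.0)
print(f"k = 1.5: {p_outlier:.5f}")
print(f"k = 3.0: {p_far_out:.2e}")
```

Under these assumptions, roughly 0.7% of normal observations fall beyond the k = 1.5 fences, and the result does not depend on the mean or standard deviation, because the fences scale with the distribution. For heavy-tailed or skewed populations the probability can be much larger.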
REFERENCE:
Wikipedia