Mahalanobis Distance

By Moloy De posted Fri July 28, 2023 10:37 PM

  
Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the centroid or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.
 
However, we also need to know if the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely that the test point should not be classified as belonging to the set.
 
This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be (test point - sample mean) / standard deviation. By plugging this into the normal distribution we can derive the probability of the test point belonging to the set.
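As a minimal sketch of this univariate idea (assuming Python with NumPy and SciPy; the sample values and variable names below are made up for illustration), the normalized distance can be converted into a tail probability under a normal model:

import numpy as np
from scipy.stats import norm

# One-dimensional sample of points known to belong to the set (made-up values)
sample = np.array([9.8, 10.1, 10.4, 9.9, 10.2, 10.0])
test_point = 11.0

# Normalized distance: (test point - sample mean) / standard deviation
z = (test_point - sample.mean()) / sample.std(ddof=1)

# Two-sided tail probability: how unusual a point this far from the center is
p_tail = 2 * norm.sf(abs(z))
print(z, p_tail)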
 
The drawback of the above approach is that it assumes the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.
 
Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.
 
For a population with mean mu and covariance matrix Sigma, the Mahalanobis distance of an observation x from the mean is defined as

D(x) = sqrt( (x - mu)' Sigma^(-1) (x - mu) ).

When mu and Sigma are replaced by their sample estimates, a suitably scaled version of D^2 follows an F distribution with p and n - p degrees of freedom under the normality assumption, where n is the number of observations and p is the number of attributes or dimensions.
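A minimal sketch of the computation (using NumPy and SciPy; the data here is synthetic and the names are illustrative, not from the original post):

import numpy as np
from scipy.spatial.distance import mahalanobis

# Made-up sample of n observations in p dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

mu = X.mean(axis=0)                  # sample mean
Sigma = np.cov(X, rowvar=False)      # sample covariance matrix (p x p)
Sigma_inv = np.linalg.inv(Sigma)     # assumes Sigma is non-singular (see Question II below)

x = np.array([1.5, -0.5, 2.0])       # test point

# Mahalanobis distance of the test point from the sample mean
d = mahalanobis(x, mu, Sigma_inv)
# Same quantity written out explicitly
d_explicit = np.sqrt((x - mu) @ Sigma_inv @ (x - mu))
print(d, d_explicit)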
 
Mahalanobis distance is widely used in cluster analysis and classification techniques. It is closely related to Hotelling's T-squared distribution, which is used for multivariate statistical testing, and to Fisher's Linear Discriminant Analysis, which is used for supervised classification.
 
In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal.
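A hedged sketch of such a minimum-Mahalanobis-distance classifier (class means and covariances are estimated from labelled samples; the function names and synthetic data are illustrative only):

import numpy as np

def fit_classes(X, y):
    # Estimate the mean and inverse covariance matrix of each class from labelled samples
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        mu = Xc.mean(axis=0)
        Sigma_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
        params[label] = (mu, Sigma_inv)
    return params

def classify(x, params):
    # Assign x to the class whose mean is closest in Mahalanobis distance
    def squared_distance(label):
        mu, Sigma_inv = params[label]
        diff = x - mu
        return diff @ Sigma_inv @ diff
    return min(params, key=squared_distance)

# Example with two made-up Gaussian classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(classify(np.array([2.5, 2.5]), fit_classes(X, y)))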
 
Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage, since it has a greater influence on the slope or coefficients of the regression equation. Mahalanobis distance is also used to determine multivariate outliers. Regression techniques can be used to determine whether a specific case within a sample population is an outlier via the combination of two or more variable scores. Even for normal distributions, a point can be a multivariate outlier without being a univariate outlier on any single variable, making Mahalanobis distance a more sensitive measure than checking dimensions individually.
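One common way to operationalize this screening (a sketch with synthetic data, using the standard fact that under multivariate normality the squared Mahalanobis distance is approximately chi-squared with p degrees of freedom):

import numpy as np
from scipy.stats import chi2

# Made-up bivariate normal sample with correlated variables
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Squared Mahalanobis distance of every observation from the sample mean
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)

# Flag observations beyond an extreme chi-squared quantile as candidate multivariate outliers
cutoff = chi2.ppf(0.999, df=X.shape[1])
print(np.where(d2 > cutoff)[0])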
 
Mahalanobis distance has also been used in ecological niche modelling, as the convex elliptical shape of the distances relates well to the concept of the fundamental niche.
 
Another example of usage is in finance, where Mahalanobis distance has been used to compute an indicator called the "turbulence index", a statistical measure of abnormal behavior in financial markets. An implementation of this indicator is available online as a Web API.
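As an illustrative sketch (not the cited Web API; the function name and the assumption that returns are arranged as periods by assets are mine), the turbulence index for a period can be taken as the squared Mahalanobis distance of that period's asset returns from their historical mean:

import numpy as np

def turbulence_index(returns):
    # returns: (n_periods, n_assets) array of asset returns (hypothetical layout for illustration)
    mu = returns.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(returns, rowvar=False))
    diff = returns - mu
    # Squared Mahalanobis distance of each period's returns from the historical mean
    return np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)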


Here is the original paper by Mahalanobis.

QUESTION I: What is the distribution of the Mahalanobis distance in the univariate setup?
QUESTION II: How can the Mahalanobis distance be modified when the variance-covariance matrix is singular?

REFERENCE: Wikipedia, Mahalanobis's Original Paper, My Previous Blog
