Scenario: While deputed to work for Johnson and Johnson, I was given 3 GB of data in CSV format extracted from J&J’s molding machine sensors. The data covered twelve weeks and more than 120 variables, collected every one fifth of a second. I was asked to identify the outliers, if any, and to perform a root cause analysis on them. I was using R to analyze the data. I first tried the ‘mvoutlier’ package, but it failed to run on such a big dataset of over one million records. I had previously used Mahalanobis Distance in a similar scenario, credit card fraud detection, so I tried the same methodology here.
Mahalanobis Distance in R: P. C. Mahalanobis founded the Indian Statistical Institute, where I earned my graduate degrees, so let me confess that I was inclined to use Mahalanobis Distance in my analysis. The mahalanobis() function ships with base R (in the stats package) and is called as mahalanobis(x, colMeans(x), cov(x)), which returns the squared distance of each row of x from the central value. With proper scaling (see the embedded presentation below for the details), the squared Mahalanobis Distance follows an F distribution under multivariate normality, so one can generate p-values; rows with p-values below 0.05 are identified as outliers.
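As a rough sketch of the call described above, the snippet below runs mahalanobis() on simulated data and converts the squared distances into p-values. The data, the planted outlier, and the variable names are all hypothetical. The F scaling used here is one common textbook form and is an assumption; the scaling in the original presentation may differ. Note that colMeans(x) is used as the center, because mean(x) would collapse a matrix to a single number.

```r
# Simulated stand-in for the sensor data (hypothetical, not the original dataset)
set.seed(1)
n <- 500
p <- 4
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
x[1, ] <- 8  # plant one obvious outlier for illustration

# mahalanobis() returns the SQUARED distance of each row from the center
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Assumed scaling: under multivariate normality,
# d2 * n * (n - p) / (p * (n - 1) * (n + 1)) is approximately F(p, n - p)
f_stat <- d2 * n * (n - p) / (p * (n - 1) * (n + 1))
p_val  <- pf(f_stat, df1 = p, df2 = n - p, lower.tail = FALSE)

# Rows flagged at the 0.05 level
outliers <- which(p_val < 0.05)
```

With a 0.05 cutoff, roughly 5% of perfectly ordinary rows will also be flagged by chance, so on a million-row dataset a stricter threshold may be worth considering.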
Singular cov(x): When I ran the mahalanobis() function, it failed with the error ‘The variance covariance matrix is singular’.
First I dropped the constant columns, which have no effect on generating outliers. The same error still came up in R.
Then I dropped every column that showed high correlation (above 0.95 or below -0.95) with an earlier column in the dataset. The same error still came up in R.
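The two filtering steps above can be sketched as follows, on a small simulated matrix (the column names and data are hypothetical). The example also suggests one plausible reason the error can persist: a column that is an exact linear combination of others need not be highly correlated with any single one of them, so a pairwise correlation filter does not catch it.

```r
set.seed(2)
n <- 200
a <- rnorm(n)
b <- rnorm(n)
x <- cbind(a  = a,
           b  = b,
           k  = 5,                         # constant column
           a2 = a + rnorm(n, sd = 0.01),   # near-duplicate of a
           s  = a + b)                     # exact linear combination of a and b

# Step 1: drop constant (zero-variance) columns
x1 <- x[, apply(x, 2, var) > 0, drop = FALSE]

# Step 2: drop any column whose |correlation| with an earlier column exceeds 0.95
r <- abs(cor(x1))
r[lower.tri(r, diag = TRUE)] <- 0   # keep only (earlier row, later column) pairs
x2 <- x1[, apply(r, 2, max) <= 0.95, drop = FALSE]

kept   <- colnames(x2)       # columns that survive both filters
rank_S <- qr(cov(x2))$rank   # numerical rank of the remaining covariance matrix
```

Here the column s survives both filters, yet the covariance matrix of the remaining columns is still rank deficient, which is the situation that calls for a rank check on cov(x) itself.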
Then I looked at the variance covariance matrix cov(x), which turned out to be a 44 by 44 matrix with rank only 29. It is therefore singular and cannot be used to calculate the distances, because the calculation needs the inverse of cov(x). The remaining option was a canonical reduction (in effect, a projection onto principal components) that reduces the number of variables to 29. A = eigen(cov(x))$vectors[, 1:rankMatrix(cov(x))], with rankMatrix() taken from the Matrix package, gave me the linear transformation that, when applied to x as x %*% A, generated 29 columns with a non-singular variance covariance matrix. Note that because the columns of A are orthonormal, the transformation does not change the Mahalanobis Distance within the subspace that actually carries variance. This time R ran well and I could generate the required p-values.
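A minimal sketch of this reduction, on simulated data with a deliberately rank-deficient covariance matrix, is given below. The data and names are hypothetical. The rank is counted from the eigenvalues with an assumed relative tolerance, rather than with rankMatrix(), only to keep the example free of extra packages. The last lines check that the distances on the reduced columns match the distances computed directly from the retained eigenvalues and eigenvectors.

```r
set.seed(3)
n  <- 300
a  <- rnorm(n)
b  <- rnorm(n)
c3 <- rnorm(n)
x  <- cbind(a, b, c3, a + b, b - c3)  # 5 columns, only 3 linearly independent

S <- cov(x)
e <- eigen(S, symmetric = TRUE)

# Numerical rank: eigenvalues above a relative tolerance (the tolerance is an assumption)
k <- sum(e$values > max(e$values) * 1e-8)

# Orthonormal basis for the directions of cov(x) that carry variance
A <- e$vectors[, 1:k, drop = FALSE]

# Reduced data: k columns whose covariance matrix is non-singular
z  <- x %*% A
d2 <- mahalanobis(z, center = colMeans(z), cov = cov(z))

# Cross-check: centered scores squared, divided by the retained eigenvalues
xc       <- sweep(x, 2, colMeans(x))
d2_check <- rowSums(sweep((xc %*% A)^2, 2, e$values[1:k], "/"))
same     <- isTRUE(all.equal(d2, d2_check))
```

The p-values would then be computed from d2 with k, not the original column count, as the number of variables.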
Later I was able to do a root cause analysis that produced the best predictive rule for the system outliers.
Happy to share with you Mahalanobis’ original 1936 paper on Mahalanobis Distance.
Question I: How to find the most significant attribute causing outliers at the system level?
Question II: What is the relation between the outliers at attribute level and at system level?