To answer your questions clearly, I need to clarify a few things. The coefficients in a correlation matrix represent the strength and direction of the linear relationship between pairs of variables. However, logistic regression does not rely on Pearson's pairwise correlation coefficient. Instead, it uses a measure called the "odds ratio" to quantify the relationship between a predictor variable and the likelihood of a specific outcome.
The coefficients in logistic regression are estimated by maximum likelihood, and the odds ratio is derived from them. For a binary logistic regression, the odds ratio for each predictor variable can be obtained as follows:
- Fit a logistic regression model to the data.
- Obtain the estimated coefficients (β) for each predictor variable.
- Exponentiate each coefficient; the result is the odds ratio (OR) for that predictor.
The formula for calculating the odds ratio (OR) is:
OR = exp(β)
Where:
- β: Estimated coefficient obtained from the logistic regression model.
The odds ratio indicates how the odds of the outcome change with a one-unit increase in the predictor variable, holding the other predictors constant. For example, an odds ratio of 2 means that a one-unit increase in the predictor doubles the odds of the outcome.
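To make this concrete, here is a minimal sketch in Python using statsmodels. The data are simulated, and the column names "amplitude", "sH", and "outcome" are placeholders borrowed from your example, not your actual dataset:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Simulated stand-in data; replace with your own dataframe.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "amplitude": rng.normal(size=200),
        "sH": rng.normal(size=200),
    })
    true_logits = 0.8 * df["amplitude"] - 0.5 * df["sH"]
    df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-true_logits)))

    # Fit by maximum likelihood, then exponentiate the coefficients: OR = exp(beta)
    X = sm.add_constant(df[["amplitude", "sH"]])
    model = sm.Logit(df["outcome"], X).fit(disp=0)
    print(np.exp(model.params))

Each printed value is the multiplicative change in the odds of the outcome for a one-unit increase in that predictor.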
Regarding your second question about multicollinearity in logistic regression: it is important to check for and address multicollinearity when performing regression analysis. Multicollinearity occurs when predictor variables in the model are highly correlated with each other, and it can lead to unstable coefficient estimates and inflated standard errors.
There are several ways to handle multicollinearity:
- Variable selection: Remove one or more highly correlated variables from the model.
- Combining variables: If there are multiple correlated variables representing similar information, consider creating composite variables or using dimensionality reduction techniques like principal component analysis (PCA).
- Ridge regression: Use regularization techniques like ridge regression, which mitigate multicollinearity by adding a penalty on the size of the coefficients to the loss function (a regularization sketch follows after the VIF example below).
- Variance Inflation Factor (VIF): Calculate the VIF for each predictor variable to quantify the degree of multicollinearity. Values above roughly 5 to 10 indicate strong multicollinearity that may need to be addressed (see the sketch right after this list).
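As a rough sketch, the VIF check could look like this in Python with statsmodels; the predictor names are again hypothetical, and "sH" is deliberately constructed to be collinear with "amplitude":

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Hypothetical predictors; "sH" is nearly a linear function of "amplitude".
    rng = np.random.default_rng(1)
    df = pd.DataFrame({"amplitude": rng.normal(size=200)})
    df["sH"] = 0.9 * df["amplitude"] + 0.1 * rng.normal(size=200)
    df["other"] = rng.normal(size=200)

    # Include the intercept so each VIF is computed against the right baseline.
    X = add_constant(df)
    vifs = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vifs.drop("const"))  # values above ~5-10 flag strong multicollinearity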
Addressing multicollinearity is not included by default in the logistic regression algorithm itself. It requires additional steps and considerations during the model-building process.
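If you go the regularization route mentioned above, here is a minimal sketch of an L2-penalized (ridge-style) logistic regression using scikit-learn; the simulated data and the penalty strength C=0.5 are illustrative assumptions, not recommendations:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Two strongly correlated predictors: x2 is a near-duplicate of x1.
    rng = np.random.default_rng(2)
    x1 = rng.normal(size=300)
    x2 = x1 + 0.05 * rng.normal(size=300)
    X = np.column_stack([x1, x2])
    y = rng.binomial(1, 1 / (1 + np.exp(-(x1 + x2))))

    # L2 penalty; smaller C means stronger shrinkage, which stabilizes
    # the coefficients when predictors are collinear.
    clf = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", C=0.5))
    clf.fit(X, y)
    print(clf.named_steps["logisticregression"].coef_)

Note that penalized coefficients are shrunk toward zero, so odds ratios derived from them should be interpreted with care.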
------------------------------
Youssef Sbai Idrissi
Software Engineer
------------------------------
Original Message:
Sent: Thu June 22, 2023 05:08 AM
From: Jack Hunter
Subject: Correlation matrix question
Hello.
After performing a logistic regression on the data, a correlation matrix is displayed. Whichever method of including variables in the model is chosen, correlation coefficients with different values appear at each step (unlike Pearson's pairwise correlation coefficients). What are these coefficients in the matrix and how are they calculated? Please provide the formulas for their calculation. As an example, in the attached picture, the correlation coefficients between the variables "amplitude" and "sH" are marked in yellow; they differ at every step.
The second question is how to take into account the multicollinearity of variables when performing logistic regression? Or is this procedure included by default in the logistic regression algorithm?
------------------------------
Jack Hunter
------------------------------