Multicollinearity is a phenomenon unique to multiple regression. It occurs when two predictor variables that are supposed to be independent are in fact highly correlated and overlap in what they measure.
In other words, each variable doesn’t give you entirely new information.
To picture what multicollinearity is, let’s start by picturing what it is not.
In the below diagram, we have a model with three predictors—X1, X2, and X3—and one response variable, Y. The colored sections are where each X explains some of the variance in Y.
As you can see, the X’s don’t overlap at all—each is distinctly measured. If you checked correlations, each X would have a mild correlation with Y, but the X’s would not correlate at all with each other. In other words, each one measures unique information. Knowing the value of one X variable tells you nothing about the value of another.
Multicollinearity can also be moderate or extreme. A moderate overlap among predictors is the more common case, but the overlap can become so extreme that the model becomes unstable. Let's look at both situations.
On the left is the situation where there is mild overlap among the predictors. We can still measure the unique effect of each predictor on Y—those are the yellow, red, and blue sections.
The orange and purple sections will not be included in the Type III regression coefficients. This means the coefficients themselves are not telling you the full picture of the effect of each predictor on Y.
As long as you keep that in mind as you interpret coefficients, you’re not violating assumptions and your model is reliable.
But you can’t tell if that orange section is attributable to X1 or X2 or if that distinction even has any meaning. They just cannot be distinguished.
But in the situation on the right, the overlap between X1 and X2 becomes so extreme that it can cause the model to have estimation problems. This is usually what we mean when we say we have multicollinearity. The model is trying to estimate the unique effect of each predictor on Y, but there just isn't enough unique information about X1 and X2 to calculate it.
Multicollinearity occurs when two variables that should be independent are moving along the same linear trend. If you have two variables that are supposed to be used to predict the outcome of another variable, but simultaneously can be used to predict each other, multicollinearity is present.
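To see what "can be used to predict each other" looks like numerically, here is a minimal sketch using synthetic data (the variables and noise level are invented for illustration): we build x2 as a nearly linear function of x1, then regress x2 on x1 and find an R-squared close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)  # x2 is nearly a linear function of x1

# Regress x2 on x1 (with an intercept). If R^2 is near 1, the two
# supposedly independent predictors can predict each other.
A = np.column_stack([np.ones_like(x1), x1])
coef, *_ = np.linalg.lstsq(A, x2, rcond=None)
resid = x2 - A @ coef
r2 = 1 - resid.var() / x2.var()
print(round(r2, 3))
```

If you used both x1 and x2 as predictors in the same model, they would carry almost the same information, which is exactly the multicollinearity problem described above.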
Example: Take the housing dataset from King County in Washington State, where we can see multicollinearity among variables.
Situation 1: In this dataset, there are four features in particular that tell us different, potentially useful things but are all highly correlated: square footage of living space, square footage of the lot, square footage above ground (excluding the basement), and square footage of the basement. We can assume that as the square footage of the basement increases, so does the square footage of the areas above the basement. Likewise, we assume that the overall square footage of the house will increase in tandem with these variables.
Situation 2: Another example from the housing data is bedrooms and bathrooms. We assume that as the number of bedrooms increases, so will the number of bathrooms. It would be strange to have a 6-bedroom house with 1 bathroom. On the other end of the spectrum, a 1-bedroom home with 6 bathrooms is even more bizarre.
How do we measure the presence of multicollinearity in a regression problem? A correlation matrix and the variance inflation factor are two of the most useful tools for checking.
When there are only two predictors (explanatory variables) it’s easy to check, by plotting the two variables. With more variables, it’s harder to detect.
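With more than two predictors, a correlation matrix is a quick first check. Below is a sketch on synthetic data loosely mimicking the housing features above (the names and numbers are invented, not taken from the King County dataset): two of the three predictors are built to be correlated, and the matrix exposes that pair.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
sqft_above = rng.normal(1800, 400, n)
sqft_basement = 0.4 * sqft_above + rng.normal(0, 80, n)  # tied to sqft_above
sqft_lot = rng.normal(8000, 2000, n)                     # independent of both

# Pairwise Pearson correlations among the three predictors
corr = np.corrcoef([sqft_above, sqft_basement, sqft_lot])
print(np.round(corr, 2))
```

The off-diagonal entry for sqft_above vs. sqft_basement will be high, while the entries involving sqft_lot will hover near zero—flagging the first pair as a multicollinearity candidate.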
Multicollinearity can be detected via various methods. In this article, we will focus on the most common one: the Variance Inflation Factor (VIF).
Variance Inflation Factor (VIF)
VIF identifies correlation between independent variables and measures the strength of that correlation.
Step 1. Choose the predictor variable whose VIF you want to calculate and run a regression for it. In other words, instead of predicting Y (the target), predict Xi (the chosen predictor) using the other predictor variables in a linear model.
Step 2. Use the resulting R-squared value in the VIF formula. The formula is quite simple: VIF_i = 1 / (1 − R_i²), where R_i² comes from the regression in Step 1.
Step 3. Evaluate the magnitude of collinearity.
A rule of thumb for interpreting the variance inflation factor:
- 1 = not correlated.
- Between 1 and 5 = moderately correlated.
- Greater than 5 = highly correlated.
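The three steps above can be sketched from scratch with numpy (the data is synthetic; `vif` is a hypothetical helper written for this illustration, not a library function): for each column, regress it on the remaining columns, take R-squared, and apply 1 / (1 − R²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    Step 1: regress column i on the other columns (with intercept).
    Step 2: plug the resulting R^2 into VIF_i = 1 / (1 - R_i^2).
    """
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.3, size=300)  # strongly tied to x1
x3 = rng.normal(size=300)                  # independent
X = np.column_stack([x1, x2, x3])
print(np.round(vif(X), 2))
```

Step 3 is then reading the output against the rule of thumb: the first two predictors will show VIFs well above 5 (highly correlated), while the independent third predictor sits near 1. In practice you could use `variance_inflation_factor` from statsmodels instead of rolling your own.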
For more information on detecting multicollinearity in a dataset to give you a flavor of what can go wrong, you can refer to this: https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/