# Dealing with multicollinearity using VIFs

Besides normality of residuals and homogeneity of variance, one of the biggest assumptions of linear modeling is independence of predictors. If two or more predictors in a model are correlated, the model may produce unstable parameter estimates with highly inflated standard errors, sometimes yielding an overall significant model with no individually significant predictors. In other words, bad news if your goal is to determine the contribution of each predictor in explaining the response. But there is hope!

There are several methods for dealing with multicollinearity. The simplest is to construct a correlation matrix and corresponding scatterplots. If the correlations between predictors approach 1, then multicollinearity might be a problem. In that case, one can make some educated guesses about which predictors to retain in the analysis (based on biological significance, ease of measurement, etc.).

Another way to identify collinear predictors is by calculating a variance inflation factor (VIF) for each predictor. The VIF for a predictor is obtained by regressing it on all the other predictors in the model and computing VIF = 1 / (1 − R²); it measures how much the variance of that predictor’s coefficient estimate is inflated by collinearity. A VIF of 1 indicates no collinearity, whereas increasingly higher values suggest increasing multicollinearity. The approach suggested by Zuur et al. (2010, reference below) is to calculate VIFs for each predictor in the model and, if any are larger than some cutoff, sequentially drop the predictor with the largest VIF, recalculate, and repeat until all values are below the cutoff (they suggest a cutoff of 2). VIFs are especially nice for dealing with collinearity of interaction terms.
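To make the definition concrete, here is a minimal base-R sketch (using simulated, made-up data) that computes a VIF by hand from the R² of a predictor-on-predictors regression, and shows it blowing up when one predictor is nearly a copy of another:

```
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.3) # x3 is nearly a duplicate of x1

# VIF for x1: regress it on the other predictors and take 1 / (1 - R^2)
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2) # well above the usual cutoffs, flagging x3
```

With independent predictors, R² would be near 0 and the VIF near 1; here the near-duplicate x3 drives it far above Zuur et al.’s suggested cutoff of 2.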

The vif function in the “car” package in R will calculate VIFs for a linear model. I’ve written a quick function that checks whether any VIF exceeds a cutoff, removes the predictor with the largest VIF, recalculates, and repeats until all VIFs are below the cutoff. It produces a final model with the same name as the original model. I’ve also included a function for calculating VIFs for linear mixed effects models built using the lmer function in the “lme4” package (from: http://hlplab.wordpress.com/2011/02/24/diagnosing-collinearity-in-lme4/).

A WORD OF CAUTION! The following code is a brute-force method for removing variables. It’s always important to consider the identity of the variables being removed, and whether certain variables are more biologically meaningful, or better representations of the underlying processes. I recommend going through the sequential deletion step-by-step after running this method to see whether you agree with what is happening mathematically. To that end, the function also outputs a table showing the VIF values and the deleted variable at each step in the sequence.

```
# Load packages
library(psych) # Calls: pairs.panels
library(car)   # Calls: vif
library(plyr)  # Calls: rbind.fill

# Create a fake data frame with two pairs of highly collinear variables
df <- data.frame(y = rnorm(500, 0, 20),
                 x1 = rnorm(500, 50, 100),
                 x2 = rnorm(500, 10, 40))
df$x3 <- df$x1 + runif(500, -50, 50)
df$x4 <- df$x2 + runif(500, -5, 5)

# Fit a linear model to the data
mod <- lm(y ~ ., data = df)
plot(mod, which = 1:2)

# Generate scatterplots with Pearson correlations
pairs.panels(df)

# Use function `vif` to calculate variance inflation factors for each variable
# in the model
vif(mod)

# Choose a VIF cutoff under which a variable is retained
# (Zuur et al. 2010 MEE recommend 2)
cutoff <- 2

# Sequentially drop the variable with the largest VIF until
# all remaining variables have VIF < cutoff
flag <- TRUE
viftable <- data.frame()
while (flag) {
  vfit <- vif(mod)
  viftable <- rbind.fill(viftable, as.data.frame(t(vfit)))
  if (max(vfit) > cutoff) {
    mod <- update(mod, as.formula(paste(". ~ . -", names(which.max(vfit)))))
  } else {
    flag <- FALSE
  }
}

# Look at the final model
print(mod)
# And associated VIFs
print(vfit)
# And show the order in which variables were dropped
print(viftable)
```

And the function for lmer (from the hlplab post linked above):

```
vif.mer <- function(fit) {
  v <- vcov(fit)
  nam <- names(fixef(fit))

  ## exclude intercepts
  ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
  if (ns > 0) {
    v <- v[-(1:ns), -(1:ns), drop = FALSE]
    nam <- nam[-(1:ns)]
  }

  d <- diag(v)^0.5
  v <- diag(solve(v / (d %o% d)))
  names(v) <- nam
  v
}
```


References:
Zuur, A.F., Ieno, E.N. & Elphick, C.S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1: 3-14.


## 12 thoughts on “Dealing with multicollinearity using VIFs”

1. jebyrnes says:

Note, often collinearities will be generated with interaction terms. By calculating a centered interaction, you can remove these, rather than worrying about removing the interaction. You just have to be careful about interpretation of main effects afterwards!
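The centering trick described above can be sketched in base R with simulated, made-up data: the raw product term is strongly correlated with its main effect, but after centering the predictors first, that correlation largely disappears.

```
set.seed(1)
x <- rnorm(200, 50, 10)
z <- rnorm(200, 20, 2)

# The raw interaction term is highly correlated with its main effect
cor(x, x * z)

# Center the predictors first, then form the interaction
xc <- x - mean(x)
zc <- z - mean(z)
cor(xc, xc * zc) # much closer to zero
```

As the comment notes, the trade-off is interpretive: after centering, each main-effect coefficient describes the effect of that predictor when the other predictor is at its mean.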

• jslefche says:

Good point, Jarrett! Centering is yet another good way to deal with collinearity. I will have to ponder on situations where certain techniques are more or less appropriate…

2. itfeature.com says:

Centering and scaling is a good way to deal with multicollinearity. Thanks for R code and ideas about programming.

3. Martí says:

Hello, if I fit a Logistic Mixed Model with lme4 and I want to take into account the collinearity and correlation between discrete covariates, what is your opinion? I shouldn’t use “vif.mer”, should I?

Thanks.
Great post!!

Martí

• Jon Lefcheck says:

Hmm, good question. You could calculate variance inflation factors based on a pseudo-R^2 including the variance of the random effects (i.e., the conditional R^2 from this post: https://jonlefcheck.net/2013/03/13/r2-for-linear-mixed-effects-models/). I’m not sure how legitimate that is or whether you would want to take a different approach to assessing collinearity (like scatterplots, etc.). Good luck! Cheers, Jon

• Martí says:

Thank you so much. I’ll try it. I think the conditional R^2 is useful to select the “best” model, such as with AIC or BIC selection. However, in the new version of lme4 we can fit the model with the script “print(model, corr=FALSE)”. I don’t know if it takes the correlation into account.

Thank you again.

Martí

4. Matteo says:

There is an error in the function to test the collinearity! After ## exclude intercepts
It’s missing this part:
ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
Cheers,
Matteo

• Jon Lefcheck says:

Hi Matteo, Thanks for the heads up but I’m afraid I don’t follow your comment. Can you be more specific? Cheers, Jon

• Jon Lefcheck says:

Nevermind, didn’t scroll down far enough! Error corrected (as well as a few other minor errors). Thanks again, Jon