It has been quiet here during the last few months - the main reason is that I'm working on my master's thesis. I have already prepared some more examples from 'Quantitative Ecotoxicology', but I haven't gotten around to publishing them here.
This post is about collinearity and its implications for linear models. The best way to explore this is by simulation - I create data with known properties and look at what happens.
Create correlated random variables
The first problem for this simulation was: How can we create correlated random variables?
This simulation is similar to the one in Dormann (2013), where it is also mentioned that one could use a Cholesky decomposition to create correlated variables:
What this function does:
- Create two normal random variables (X1, X2 ~ N(0, 1))
- Create desired correlation matrix
- Compute the Cholesky factorization of the correlation matrix (R)
- Apply the resulting matrix on the two variables (this rotates, shears and scales the variables so that they are correlated)
- Create a dependent variable y following the model:
  y = 5 + 7*X1 + 7*X2 + e, with e ~ N(0, 1)
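The steps above can be sketched as follows. The original post presumably uses R; this is a Python/numpy version, and the function name and defaults are my own choices:

```python
import numpy as np

def make_correlated_data(n, rho, seed=0):
    """Generate two correlated predictors and a response
    y = 5 + 7*X1 + 7*X2 + e, with e ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    # 1. two independent standard-normal variables
    X = rng.standard_normal((n, 2))
    # 2. desired correlation matrix
    R = np.array([[1.0, rho],
                  [rho, 1.0]])
    # 3. Cholesky factor of R (numpy returns the lower-triangular factor)
    L = np.linalg.cholesky(R)
    # 4. rotate, shear and scale the variables so they are correlated
    X = X @ L.T
    # 5. dependent variable following the specified model
    y = 5 + 7 * X[:, 0] + 7 * X[:, 1] + rng.standard_normal(n)
    return X, y

X, y = make_correlated_data(1000, rho=0.8)
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(round(r, 2))   # close to 0.8
```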
Let’s see if it works. This creates two variables with 1000 observations each, a correlation of 0.8 between them, and a dependent variable.
The correlation between X1 and X2 is, as desired, close to 0.8.
And the data follows the specified model.
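That the data follows the specified model can be checked by fitting an ordinary least-squares regression and comparing the coefficients to the true values. A numpy sketch (the original post presumably uses R's lm()):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 1000, 0.8

# correlated predictors via the Cholesky factor of the correlation matrix
X = rng.standard_normal((n, 2)) @ np.linalg.cholesky(
    np.array([[1.0, rho], [rho, 1.0]])).T
y = 5 + 7 * X[:, 0] + 7 * X[:, 1] + rng.standard_normal(n)

# design matrix with an intercept column, then OLS
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef.round(1))   # close to the true values (5, 7, 7)
```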
Methods to spot collinearity
Dormann lists eight methods to spot collinearity (see their Table 1). I will only show how to calculate two of those (but see the Appendix of the Dormann paper for code to all methods):
Absolute value of correlation coefficients r
Dormann (2013) found that ‘coefficients between predictor variables of r > 0.7 was an appropriate indicator for when collinearity begins to severely distort model estimation’.
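A quick screen along these lines is to compute the absolute pairwise correlations and flag any pair above the 0.7 threshold. A minimal sketch (the flagging helper is my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
# two strongly correlated predictors, as in the example above
X = rng.standard_normal((1000, 2)) @ np.linalg.cholesky(
    np.array([[1.0, rho], [rho, 1.0]])).T

R = np.abs(np.corrcoef(X, rowvar=False))
# check off-diagonal entries against the |r| > 0.7 rule of thumb
i, j = np.triu_indices_from(R, k=1)
flagged = [(int(a), int(b)) for a, b in zip(i, j) if R[a, b] > 0.7]
print(flagged)
```

Here the single pair (0, 1) is flagged, since its sample correlation is close to 0.8.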
Variance Inflation Factors
The VIF for a predictor X_j is defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 from regressing X_j on all other predictors. For variable X1 this is equivalent to regressing X1 on X2 and computing 1 / (1 - R^2).
Unfortunately there are many ‘rules of thumb’ associated with VIF: > 10, > 4, …
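The VIFs can be computed by hand from this definition (a sketch; statistics packages also provide ready-made functions for this):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 from
    regressing column j on all other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
rho = 0.8
X = rng.standard_normal((1000, 2)) @ np.linalg.cholesky(
    np.array([[1.0, rho], [rho, 1.0]])).T
print(vif(X).round(2))   # both near 1 / (1 - 0.8^2) = 2.78
```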
Simulation 1: How does collinearity affect the precision of estimates?
Here we simulate datasets with correlations ranging from -0.8 to 0.8:
Next we fit a linear model to each of the datasets and extract the coefficients table:
Finally we extract the standard errors and plot them against the degree of correlation:
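The three steps of this simulation can be sketched as follows (a Python version with the plotting omitted; the standard errors come from the usual OLS formula):

```python
import numpy as np

def fit_se(X, y):
    """OLS coefficients and their standard errors."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])      # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, se

rng = np.random.default_rng(4)
n = 1000
for rho in np.arange(-0.8, 0.81, 0.4):
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    X = rng.standard_normal((n, 2)) @ L.T
    y = 5 + 7 * X[:, 0] + 7 * X[:, 1] + rng.standard_normal(n)
    beta, se = fit_se(X, y)
    print(f"rho={rho:+.1f}  SE(X1)={se[1]:.3f}  SE(X2)={se[2]:.3f}")
```

The standard errors of the slopes are smallest at rho = 0 and grow as |rho| approaches 0.8.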
It can be clearly seen that collinearity inflates the standard errors of the correlated variables. The intercept is not affected.
With large standard errors, the parameter estimates themselves also become variable, which is demonstrated in the next simulation:
Simulation 2: Are estimated parameters stable under collinearity?
Here I simulate 100 datasets for each of several correlations, in order to see how stable the estimates are.
The code creates the datasets, fits a linear model to each dataset and extracts the estimates.
Then we create a plot of the estimates for the three coefficients; each boxplot represents 100 datasets.
The red line indicates the true coefficients used to generate the data (7 for X1 and 7 for X2). We see that the spread of the estimates increases as the correlation increases.
This is confirmed by looking at the standard deviation of the estimates:
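This second simulation can be sketched like this (a Python version; 100 datasets per correlation, reporting the standard deviation of the X1 slope estimates):

```python
import numpy as np

def simulate_sd(rho, n=1000, n_datasets=100, seed=5):
    """Standard deviation of the estimated X1 slope across datasets,
    all generated from y = 5 + 7*X1 + 7*X2 + e with correlation rho."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    slopes = []
    for _ in range(n_datasets):
        X = rng.standard_normal((n, 2)) @ L.T
        y = 5 + 7 * X[:, 0] + 7 * X[:, 1] + rng.standard_normal(n)
        A = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        slopes.append(beta[1])
    return np.std(slopes)

for rho in (0.0, 0.4, 0.8):
    print(f"rho={rho:.1f}  sd of X1 estimates = {simulate_sd(rho):.3f}")
```

The spread of the estimates grows with the correlation, matching the widening boxplots.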
If the standard errors are large enough, the parameter estimates can become so variable that even their sign changes.