Thursday, March 22, 2012

Multicollinearity and the solutions


In his book, Rudolf Freund describes a confounding phenomenon that arises when fitting a linear regression. The small data set below has three variables: a dependent variable (y) and two independent variables (x1 and x2). When x2 alone is used to fit y, its estimated coefficient is positive (0.78). When x1 and x2 are used together to fit y, the coefficient of x2 becomes -1.29, which is hard to explain since x2 and y are clearly positively correlated.

data raw;
input y x1 x2;
cards;
2 0 2
3 2 6
2 2 7
7 2 5
6 4 9
8 4 8
10 4 7
7 6 10
8 6 11
12 6 9
11 8 15
14 8 13
;;;
run;

ods graphics on / border = off;
proc sgplot data = raw;
   /* overall fit of y on x2 */
   reg x = x2 y = y;
   /* separate fits of y on x2 within each level of x1 */
   reg x = x2 y = y / group = x1 datalabel = x1;
run;


The reason is that x1 and x2 are strongly correlated with each other. The diagnostics look fine when x2 alone is used to fit y. However, putting x1 and x2 together into the regression model introduces multicollinearity, which makes the individual coefficient estimates unstable. The diagnostics for this model also show severe heteroskedasticity and a skewed distribution of the residuals, which violate the assumptions of OLS regression. As the top scatter plot shows, 0.78 is the slope of the regression line for y ~ x2 (the longest straight line), while -1.29 is the slope of the partial regression y ~ x2|x1, which the four short within-group segments illustrate.
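The correlation between x1 and x2 can be checked directly. The following is a minimal sketch, not part of the original analysis, assuming the raw data set above: PROC CORR reports the pairwise correlations, and the VIF, TOL and COLLIN options on the MODEL statement of PROC REG request the usual collinearity diagnostics.

```sas
/* pairwise Pearson correlations among the three variables */
proc corr data = raw;
   var y x1 x2;
run;

/* variance inflation factors, tolerances and condition indices */
proc reg data = raw;
   model y = x1 x2 / vif tol collin;
   ods select parameterestimates collindiag;
run;
quit;
```

With only two predictors, both share the same VIF, 1 / (1 - r^2), where r is the correlation between x1 and x2. For this data r is about 0.96, so the VIF is roughly 11, above the common rule-of-thumb threshold of 10.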

proc reg data = raw;
   model y = x2;
   ods select parameterestimates diagnosticspanel;
run;

proc reg data = raw;
   model y = x1 x2;
   ods select parameterestimates diagnosticspanel;
run;

Solutions:

1. Drop a variable
Standing alone, x1 seems like a better predictor (higher R-square and lower MSE) than x2. The easiest way to remove this multicollinearity is to keep only x1 in the model.

proc reg data = raw;
   model y = x1;
   ods select parameterestimates diagnosticspanel;
run;
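The comparison between the two single-predictor models can be reproduced side by side. This sketch is an addition that assumes the same raw data set; the labels x1_only and x2_only are illustrative names, and FitStatistics is the ODS table in which PROC REG reports R-square and root MSE.

```sas
proc reg data = raw;
   /* each labeled MODEL statement is fit separately */
   x1_only: model y = x1;
   x2_only: model y = x2;
   /* keep only the R-square / root MSE summary for each model */
   ods select fitstatistics;
run;
quit;
```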

2. Principal component regression
If we want to keep both variables to avoid losing information, principal component regression is a good option. PCA transforms the correlated variables into orthogonal components. In this case, the first principal component explains 97.77% of the total variance, which is enough for the subsequent regression. SAS's PLS procedure can also perform principal component regression.
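The PROC PLS route can be sketched as follows. This is an illustrative addition rather than the analysis used in this post: METHOD=PCR requests principal component regression, NFAC=1 keeps a single component, and the SOLUTION option prints the coefficients expressed in terms of x1 and x2. By default PROC PLS centers and scales the variables, which is comparable to PROC PRINCOMP working on the correlation matrix.

```sas
proc pls data = raw method = pcr nfac = 1;
   /* regress y on the first principal component of x1 and x2 */
   model y = x1 x2 / solution;
run;
```

Because the single component loads positively on both predictors, the coefficients of x1 and x2 should come back with the same sign.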


proc princomp data = raw out = pca;
   ods select screeplot corr eigenvalues eigenvectors;
   var x1 x2;
run;

proc reg data = pca;
   model y = prin1;
   ods select parameterestimates diagnosticspanel;
run;

3. Repeated measures
We may apply a mixed model and estimate the R matrix (here with an AR(1) structure), using x1 as the subject that groups the observations.

proc mixed data = raw plots(only) = ResidualPanel method = ml;
   model y = x2 / s;
   repeated / type=ar(1) subject=x1 r;
run;
