In his book, Rudolf Freund described a confounding phenomenon that arises when fitting a linear regression. The small data set below has three variables: a dependent variable (y) and two independent variables (x1 and x2). When y is regressed on x2 alone, the estimated coefficient of x2 is positive, 0.78. When x1 and x2 are used together to fit y, the coefficient of x2 becomes -1.29, which is hard to explain, since x2 and y clearly have a positive correlation.
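/* Read the 12 observations: response y and predictors x1, x2 */
data raw;
   input y x1 x2;
   cards;
2 0 2
3 2 6
2 2 7
7 2 5
6 4 9
8 4 8
10 4 7
7 6 10
8 6 11
12 6 9
11 8 15
14 8 13
;
run;

/* Overlay the overall fit of y on x2 with separate fits within each x1 level */
ods graphics on / border = off;
proc sgplot data = raw;
   reg x = x2 y = y;
   reg x = x2 y = y / group = x1 datalabel = x1;
run;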
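The cause is multicollinearity: x1 and x2 are strongly correlated with each other, which conflicts with the assumptions for OLS regressions. As shown in the top scatter plot, 0.78 is the slope of the regression line for y ~ x2 (the longest straight line), while -1.29 corresponds to the downward-sloping partial regression lines for y ~ x2|x1 (the four short segments).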
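The two slopes can be reproduced by hand from the corrected sums of squares and cross-products of the 12 observations. The following is a sketch of that arithmetic; the notation S_{jk} is introduced here for illustration only and does not come from the original text.

```latex
% S_{jk} = \sum_i (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k); the index y denotes the response.
% From the data: S_{11} = 212/3, S_{22} = 137, S_{12} = 94, S_{1y} = 92, S_{2y} = 107.
\begin{align*}
b_{2} &= \frac{S_{2y}}{S_{22}} = \frac{107}{137} \approx 0.78
  && \text{(y on x2 alone)} \\
b_{2\mid 1} &= \frac{S_{11}S_{2y} - S_{12}S_{1y}}{S_{11}S_{22} - S_{12}^{2}}
  = \frac{\tfrac{212}{3}\cdot 107 - 94\cdot 92}{\tfrac{212}{3}\cdot 137 - 94^{2}}
  = \frac{-3260/3}{2536/3} \approx -1.29
  && \text{(y on x1 and x2)}
\end{align*}
```

The numerator turns negative because S_{12}S_{1y} = 8648 exceeds S_{11}S_{2y} (about 7561): once the part of y carried by x1 is accounted for, what remains of the association between x2 and y is negative. The denominator is small relative to S_{11}S_{22} (about 9681), which reflects a correlation of roughly 0.955 between x1 and x2.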
/* Simple regression: y on x2 alone */
proc reg data = raw;
   model y = x2;
   ods select parameterestimates diagnosticspanel;
run;

/* Multiple regression: y on x1 and x2 together */
proc reg data = raw;
   model y = x1 x2;
   ods select parameterestimates diagnosticspanel;
run;
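One way to check whether collinearity is behind the sign flip is to look at the correlation between the predictors and at the variance inflation factors. The sketch below assumes the raw data set created earlier; the choice of diagnostics (VIF, TOL, COLLIN) is an illustration, not something the original analysis reports.

```sas
/* Pairwise correlations among the response and both predictors */
proc corr data = raw nosimple;
   var y x1 x2;
run;

/* Variance inflation factors, tolerances, and collinearity diagnostics */
proc reg data = raw;
   model y = x1 x2 / vif tol collin;
run;
```

For this data the correlation between x1 and x2 is about 0.955, which corresponds to a VIF of roughly 11.5 for both predictors. A common rule of thumb treats values above 10 as a sign of problematic collinearity.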
1. Drop a variable
Standing alone, x1 seems like a better predictor (higher R-square and lower MSE) than x2. The easiest way to remove this multicollinearity is to keep only x1 in the model.
proc reg data = raw;
   model y = x1;
   ods select parameterestimates diagnosticspanel;
run;
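The comparison between the single-predictor models can also be read from one table. A minimal sketch, assuming the same raw data set, uses the all-subsets option of PROC REG to list R-square, adjusted R-square, and MSE for every subset of x1 and x2:

```sas
/* List fit statistics for each subset of the candidate predictors */
proc reg data = raw;
   model y = x1 x2 / selection = rsquare adjrsq mse;
run;
```

With only two candidates the table has three rows. The one-variable rows show an R-square of roughly 0.73 for x1 against 0.51 for x2.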
2. Principal component regression
If we want to keep both variables to avoid information loss, principal component regression is a good option. PCA transforms the correlated variables into orthogonal components. In this case, the first principal component explains 97.77% of the total variance, which is enough for the subsequent regression. SAS's PLS procedure can also perform principal component regression.
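The PROC PLS route might look like the following. This is a sketch under the assumption that one component is retained, matching the single component kept in the PCA approach; the SOLUTION option is added here only to print the coefficients on the original predictors.

```sas
/* Principal component regression with a single extracted component */
proc pls data = raw method = pcr nfac = 1;
   model y = x1 x2 / solution;
run;
```

By default PROC PLS centers and scales the variables, so the extracted component corresponds to a PCA on the correlation matrix of x1 and x2.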
/* Extract principal components of x1 and x2, then regress y on the first one */
proc princomp data = raw out = pca;
   ods select screeplot corr eigenvalues eigenvectors;
   var x1 x2;
run;

proc reg data = pca;
   model y = prin1;
   ods select parameterestimates diagnosticspanel;
run;

3. Repeated measure
/* Treat observations sharing an x1 value as repeated measures with AR(1) errors */
proc mixed data = raw plots(only) = ResidualPanel method = ml;
   model y = x2 / s;
   repeated / type = ar(1) subject = x1 r;
run;