Friday, November 29, 2013

A cheat sheet for linear regression validation

The link of the cheat sheet is here.
I have benefited a lot from the UCLA SAS tutorial, especially the chapter of regression diagnostics. However, the content on the webpage seems to be outdated. The great thing for PROC REG is that it creates a beautiful and concise 3X3 plot panel for residual analysis.
I created a cheat sheet to try to interpret the diagnosis panel. Thanks to the ESS project, the BASEBALL data set used by SAS is available for public. Hereby I borrowed this data set as an example, and the cheat sheet also contains the data and the SAS program. The regression model attempts to predict the baseball players’ salary by their performance statistics. The plot panel can be partitioned into four functional zones:
  1. OLS assumption check
    The three OLS assumption is essential to linear regression for BLUE estimators. However, the residual plot above on the left-top panel has a funnel-like shape, which is usually corrected by a log transformation in practice.
  2. Normality check
    In reality the normality is not required for linear regression. However, most people like to see t-test, F-test or P value which needs the normality of residual. The histogram and Q-Q plot on the left-bottom are good reference.
  3. Outlier and influential points check
    The three top-right plots can be used to rule out some extraordinary data points by leverage, Cook’s D and R-studentized residues.
  4. Non-linearity check
    Rick Wicklin has a thorough introduction about the fit-mean plot. We can also look at r-square in the most bottom-right plot . If the linearity is not very satisfied, SAS/STAT has a few powerful procedures to correct non-linearity and increase the fitting performance, such as the latest ADAPTIVEREG procedure (see a diagram in my previous post).
There are still a few other concerns that need to be addressed for linear regressio such as multicolinearity (diagnosed by the VIF and other options in PROC REG) and overfitting (PROC GLMSELECT now weights in).
The PROC procedure in SAS solves the parameters by the normal equation instead of the gradient descent, which makes it always an ideal tool for linear regression diagnosis.
/*I. Grab the football data set from web */
filename myfile url 'https://svn.r-project.org/ESS/trunk/fontlock-test/baseball.sas';
%include myfile;

proc contents data=baseball   position;
   ods output position = pos;
run;

/*II. Diagnose the multiple linear regression for the players’ salaries*/
proc sql;   
   select variable into: regressors separated by ' '
   from pos
   where num between 5 and 20;
quit;
%put &regressors;

proc reg data=baseball;
   model salary = &regressors;
run;

/*III. Deal with heteroscedasticity*/
data baseball_t;
   set baseball;
   logsalary = 
      log10(salary);
run;

proc reg data=baseball_t;
   model logsalary =    
      &regressors;
run;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...