Friday, November 29, 2013

A cheat sheet for linear regression validation

The link to the cheat sheet is here.
I have benefited a lot from the UCLA SAS tutorial, especially the chapter on regression diagnostics. However, the content on that webpage seems to be outdated. The great thing about PROC REG is that it creates a beautiful and concise 3x3 plot panel for residual analysis.
I created a cheat sheet to help interpret the diagnostics panel. Thanks to the ESS project, the BASEBALL data set used by SAS is publicly available. I borrowed this data set as an example, and the cheat sheet also contains the data and the SAS program. The regression model attempts to predict the baseball players’ salaries from their performance statistics. The plot panel can be partitioned into four functional zones:
  1. OLS assumption check
    The three OLS assumptions are essential for linear regression to yield BLUE estimators. However, the residual plot in the top-left panel has a funnel-like shape, which in practice is usually corrected by a log transformation.
  2. Normality check
    In reality, normality is not required for linear regression. However, most people like to see the t-test, F-test or p-values, which rely on the normality of the residuals. The histogram and Q-Q plot at the bottom-left are good references.
  3. Outlier and influential points check
    The three top-right plots can be used to rule out extraordinary data points by leverage, Cook’s D and studentized residuals.
  4. Non-linearity check
    Rick Wicklin has a thorough introduction to the fit-mean plot. We can also look at the R-square in the bottom-right plot. If linearity is not well satisfied, SAS/STAT has a few powerful procedures to correct non-linearity and improve the fit, such as the latest ADAPTIVEREG procedure (see a diagram in my previous post).
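The zone-3 diagnostics above (leverage, studentized residuals and Cook's D) have simple closed forms for a one-predictor regression. Here is a minimal pure-Python sketch of them (an illustration only, not what PROC REG runs internally); the toy data and the 2p/n and 4/n cutoffs are common rules of thumb, not SAS defaults:

```python
import math

def influence(x, y):
    """Leverage, studentized residuals and Cook's D for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
    s2 = sum(ei ** 2 for ei in e) / (n - 2)             # residual variance
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverage
    r = [ei / math.sqrt(s2 * (1 - hi)) for ei, hi in zip(e, h)]
    p = 2                                               # parameters: b0, b1
    d = [ri ** 2 / p * hi / (1 - hi) for ri, hi in zip(r, h)]
    return h, r, d

# Toy data: the last point sits far out in x and off the trend line
x = [1, 2, 3, 4, 5, 15]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 40.0]
h, r, d = influence(x, y)
n, p = len(x), 2
high_leverage = [i for i, hi in enumerate(h) if hi > 2 * p / n]  # rule of thumb
influential = [i for i, di in enumerate(d) if di > 4 / n]        # rule of thumb
print(high_leverage, influential)  # [5] [5]
```

Only the last observation is flagged by both criteria, which is exactly the kind of point the three top-right plots are meant to surface.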
There are still a few other concerns that need to be addressed for linear regression, such as multicollinearity (diagnosed by the VIF and other options in PROC REG) and overfitting (where PROC GLMSELECT now weighs in).
PROC REG in SAS solves for the parameters by the normal equations instead of gradient descent, which makes it an ideal tool for linear regression diagnostics.
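For the one-predictor case, the normal equations mentioned above can be solved by hand. A minimal pure-Python sketch (an illustration of the idea, not PROC REG's actual solver):

```python
# Minimal sketch: solve the normal equations for simple linear
# regression y = b0 + b1*x (illustration only, not PROC REG itself).

def ols_normal_equations(x, y):
    """Closed-form OLS for one predictor via the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Solve [n  sx ] [b0]   [sy ]
    #       [sx sxx] [b1] = [sxy]
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    return b0, b1

# Tiny example: points lie exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_normal_equations(x, y)
print(b0, b1)  # 1.0 2.0
```

Because the solution is exact rather than iterative, the fitted coefficients are reproducible to machine precision, which is one reason the normal-equation approach suits diagnostics.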
/*I. Grab the baseball data set from the web */
filename myfile url 'https://svn.r-project.org/ESS/trunk/fontlock-test/baseball.sas';
%include myfile;

proc contents data=baseball   position;
   ods output position = pos;
run;

/*II. Diagnose the multiple linear regression for the players’ salaries*/
proc sql;   
   select variable into: regressors separated by ' '
   from pos
   where num between 5 and 20;
quit;
%put &regressors;

proc reg data=baseball;
   model salary = &regressors;
run;

/*III. Deal with heteroscedasticity*/
data baseball_t;
   set baseball;
   logsalary = 
      log10(salary);
run;

proc reg data=baseball_t;
   model logsalary =    
      &regressors;
run;

Thursday, November 21, 2013

Kernel selection in PROC SVM

The support vector machine (SVM) is a flexible classification or regression method thanks to its many kernels. To apply an SVM, we possibly need to specify a kernel, a regularization parameter c and some kernel parameters such as gamma. Following the selection of the regularization parameter c in my previous post, the SVM procedure and the iris flower data set are used here to discuss kernel selection in SAS.

Exploration of the iris flower data

The iris data is a classic for classification exercises. If we use the first two components from Principal Component Analysis (PCA) to compress the four predictors, petal length, petal width, sepal length and sepal width, into 2D space, then two linear boundaries seem barely able to separate the three species: Setosa, Versicolor and Virginica. In general, SASHELP.IRIS is a well-separated data set with respect to the response variable.
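The first principal component used above is just the leading eigenvector of the covariance matrix. A tiny pure-Python sketch of extracting it by power iteration, on a toy 2x2 covariance matrix rather than the real iris covariance (PROC PRINCOMP handles the real thing):

```python
# Toy sketch: first principal component via power iteration on a
# small covariance matrix (illustration only; the post uses
# PROC PRINCOMP on the four iris measurements).
import math

def power_iteration(cov, iters=100):
    n = len(cov)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient = variance explained by this component
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n))
                 for i in range(n))
    return v, eigval

cov = [[2.0, 1.0], [1.0, 2.0]]   # toy covariance: two correlated variables
v, lam = power_iteration(cov)
print(v, lam)
```

For this matrix the dominant direction is the diagonal (equal loadings on both variables) with eigenvalue 3, i.e. the component that captures the shared variation, which is the same role prin1 plays for the four iris measurements.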
****(1). Data exploration of iris flower data set****;
data iris;
   set sashelp.iris;
run;

proc contents data=iris position;
run;

proc princomp data=iris out=iris_pca;
   var Sepal: Petal:;
run;

proc sgplot data=iris_pca;
   scatter x = prin1 y = prin2 / group = species;
run;

PROC SVM with four different kernels

Kernel method    Option in SAS    Formula                        Parameter in SAS
linear           LINEAR           u'*v                           NA
polynomial       POLYNOM          (gamma*u'*v + coef)^degree     K_PAR
radial basis     RBF              exp(-gamma*|u-v|^2)            K_PAR
sigmoid          SIGMOID          tanh(gamma*u'*v + coef)        K_PAR; K_PAR2
PROC SVM in SAS provides a range of kernels for selection, including ERBF, FOURIER, LINEAR, POLYNOM, RBF, RBFREC, SIGMOID and TANH. Another great thing is that it supports cross-validation, including leave-one-out cross-validation (the loo option in PROC SVM) and k-fold cross-validation (the split option in PROC SVM).
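The four kernel formulas in the table above can be written out directly. A pure-Python sketch (the gamma/coef/degree arguments loosely mirror the K_PAR options; this is an illustration, not PROC SVM's implementation):

```python
# The four kernel formulas from the table, as plain Python functions.
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def k_linear(u, v):
    return dot(u, v)                                  # u'*v

def k_polynom(u, v, gamma=1.0, coef=0.0, degree=3):
    return (gamma * dot(u, v) + coef) ** degree       # (gamma*u'*v + coef)^degree

def k_rbf(u, v, gamma=1.0):
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * d2)                      # exp(-gamma*|u-v|^2)

def k_sigmoid(u, v, gamma=1.0, coef=0.0):
    return math.tanh(gamma * dot(u, v) + coef)        # tanh(gamma*u'*v + coef)

u, v = [1.0, 0.0], [1.0, 0.0]
# Identical unit vectors: linear kernel gives 1, RBF gives exp(0) = 1
print(k_linear(u, v), k_rbf(u, v))
```

Note how the RBF kernel depends only on the distance between u and v, while the other three depend on their inner product; this is why RBF can carve out local, non-linear boundaries.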
Here the error rates of leave-one-out cross-validation are used to compare the performance of the four common kernels: linear, radial basis function, polynomial and sigmoid. In this experiment the parameters such as c and gamma are arbitrarily set to 1. As the bar plot shows, the RBF and linear kernels give good results, with RBF slightly better than linear. On the contrary, the polynomial and sigmoid kernels perform very badly. In conclusion, the selection of a kernel for SVM depends on the reality of the data set. A non-linear or complicated kernel is simply not necessary for an easily-classified example like the iris flower data set.
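Leave-one-out error estimation itself is simple bookkeeping: refit on n-1 observations, predict the held-out one, repeat. A generic pure-Python sketch, using a nearest-centroid stand-in classifier rather than an SVM (only the hold-one-out loop mirrors the cv=loo option):

```python
# Generic leave-one-out cross-validation sketch. The classifier is a
# simple nearest-centroid stand-in, NOT PROC SVM; only the error-rate
# bookkeeping mirrors the cv=loo option.

def nearest_centroid_predict(train, test_x):
    """train: list of (features, label); predict label of nearest centroid."""
    sums, counts = {}, {}
    for feats, lab in train:
        if lab not in sums:
            sums[lab] = [0.0] * len(feats)
            counts[lab] = 0
        sums[lab] = [s + f for s, f in zip(sums[lab], feats)]
        counts[lab] += 1
    best, best_d = None, float("inf")
    for lab, s in sums.items():
        c = [si / counts[lab] for si in s]
        d = sum((ci - xi) ** 2 for ci, xi in zip(c, test_x))
        if d < best_d:
            best, best_d = lab, d
    return best

def loo_error(data):
    """Fraction of observations misclassified when held out one at a time."""
    errors = 0
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        feats, lab = data[i]
        if nearest_centroid_predict(train, feats) != lab:
            errors += 1
    return errors / len(data)

# Two well-separated toy classes, like the iris species in PCA space
data = [([0.0, 0.0], "a"), ([0.1, 0.0], "a"), ([0.0, 0.1], "a"),
        ([5.0, 5.0], "b"), ([5.1, 5.0], "b"), ([5.0, 5.1], "b")]
print(loo_error(data))  # 0.0 -- perfectly separable
```

With n observations this refits the model n times, which is why the loo option is practical for a 150-row data set like iris but expensive for large data.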

****(2). Cross validation error comparison of 4 kernels****;
proc dmdb batch data=iris dmdbcat=_cat out=_iris;
   var Sepal: Petal:;
   class species;
run;

%let knl = linear;
proc svm data=_iris dmdbcat=_cat kernel=&knl c=1 cv =loo;
   title "The kernel is &knl";
   ods output restab = &knl;
   var Sepal: Petal:;
   target species;
run;

%let knl = rbf;
proc svm data=_iris dmdbcat=_cat kernel=&knl c=1 K_PAR=1 cv=loo;
   title "The kernel is &knl";
   ods output restab = &knl;
   var Sepal: Petal:;
   target species;
run;

%let knl = polynom;
proc svm data=_iris dmdbcat=_cat kernel=&knl c=1 K_PAR =3 cv=loo;
   title "The kernel is &knl";
   ods output restab = &knl;
   var Sepal: Petal:;
   target species;
run;

%let knl = sigmoid;
proc svm data=_iris dmdbcat=_cat kernel=&knl c=1 K_PAR=1 K_PAR2=1 cv=loo;
   title "The kernel is &knl";
   ods output restab = &knl;
   var Sepal: Petal:;
   target species;
run;

data total;   
   set linear rbf polynom sigmoid;
   where label1 in ('Kernel Function','Classification Error (Loo)');
   cValue1 = lag(cValue1);
   if missing(nValue1) = 0;
run;

proc sgplot data=total;
   title " ";
   vbar cValue1 / response = nValue1;
   xaxis label = "Selection of kernel";
   yaxis label = "Classification Error by Leave-one-out Cross Validation"; 
run;

Wednesday, November 13, 2013

When ROC fails logistic regression for rare-event data


ROC or AUC is widely used in logistic regression and other classification methods for model comparison and feature selection; it measures the trade-off between sensitivity and specificity. The paper by Gary King warns of the dangers of using logistic regression for rare events and proposes a penalized likelihood estimator. In PROC LOGISTIC, the FIRTH option implements this penalty concept.
When the event in the response variable is rare, the ROC curve is dominated by the majority class and thus insensitive to changes in the true positive rate, which provides little information for model diagnosis. For example, I construct a subset of SASHELP.CARS with the response variable Type including 3 hybrid cars and 262 sedan cars, and hope to use the regressors Weight, Wheelbase and Invoice to predict whether a car’s type is hybrid or sedan. After the logistic regression, the AUC turns out to be 0.9109, which is a pretty high value. However, the model is still ill-fitted and needs tuning, since the classification table shows the sensitivity is zero.
data rare;
    set sashelp.cars;
    where type in ("Sedan", "Hybrid");
run;

proc freq data = rare;
    tables type;
run;

proc logistic data = rare;
    model Type(event='Hybrid') = Weight Wheelbase Invoice 
       / pprob = 0.01 0.05 pevent = 0.5 0.05 ctable; 
    roc;
run;
Prob Event  Prob Level  Correct Event  Correct Non-Event  Incorrect Event  Incorrect Non-Event  Accuracy  Sensitivity  Specificity  False POS  False NEG
0.500       0.010       2              211                51               1                    73.6      66.7         80.5         22.6       29.3
0.500       0.500       0              262                0                3                    50.0      0.0          100.0        .          50.0
0.050       0.010       2              211                51               1                    79.8      66.7         80.5         84.7       2.1
0.050       0.500       0              262                0                3                    95.0      0.0          100.0        .          5.0
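The sensitivity and specificity columns in the classification table follow directly from the four counts. A quick sketch recomputing them for the 0.010 cutoff row:

```python
# Recompute the table's metrics from the four counts at cutoff 0.010:
# 2 correct events, 211 correct non-events, 51 incorrect events
# (false positives), 1 incorrect non-event (false negative).
correct_event, correct_nonevent = 2, 211
incorrect_event, incorrect_nonevent = 51, 1

# Sensitivity: share of the 3 actual events that were caught
sensitivity = correct_event / (correct_event + incorrect_nonevent)
# Specificity: share of the 262 actual non-events correctly rejected
specificity = correct_nonevent / (correct_nonevent + incorrect_event)
print(round(100 * sensitivity, 1), round(100 * specificity, 1))  # 66.7 80.5
```

At the default 0.500 cutoff the model predicts no events at all, so sensitivity collapses to 0/3 = 0 even though specificity is perfect, which is exactly the failure the table exposes.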

Solution

Since ROC won’t help PROC LOGISTIC any more, there seem to be three ways that may increase the desired sensitivity or boost the ROC curve.
  1. Lower the cut-off probability
    In the example above, moving the cut-off probability to an alternative value such as 0.01 will significantly increase the sensitivity. However, the result comes at the cost of a drastic loss of specificity.
  2. Up-sampling or down-sampling
    Imbalanced classes in the response variable can be adjusted by unequal weighting, such as up-sampling or down-sampling. Down-sampling would be easy to implement with a stratified sample from PROC SURVEYSELECT. Up-sampling is more appropriate for this case, but may need over-sampling techniques in SAS.
  3. Use different criteria such as the F1 score
    For modeling rare-event classification, the most important factors should be sensitivity and precision, instead of accuracy, which combines sensitivity and specificity. The F1 score is the harmonic mean of precision and sensitivity, which makes it a better candidate to replace AUC.
    \text{F1 score} = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}
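Applying the formula above to the rare-event example makes its value concrete. At the 0.01 cutoff, 53 cars are predicted hybrid but only 2 truly are, so precision is tiny even though sensitivity looks decent:

```python
# F1 score for the rare-event example at cutoff 0.01:
# precision = 2 correct events out of 2 + 51 predicted events,
# sensitivity = 2 correct events out of 2 + 1 actual events.
def f1(precision, sensitivity):
    return 2 * precision * sensitivity / (precision + sensitivity)

precision = 2 / (2 + 51)     # correct events / all predicted events
sensitivity = 2 / (2 + 1)    # correct events / all actual events
print(round(f1(precision, sensitivity), 3))  # 0.071
```

An F1 of about 0.07, against an AUC of 0.9109, shows how differently the two criteria judge the same rare-event model.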
