Wednesday, May 18, 2011

A macro calls random forest in SAS




SASHELP.CARS, with 428 observations and 15 variables, is a dataset that ships with SAS, and I use it to try out classification methods. I have always wanted to predict where a given car was manufactured, which the Origin variable records as Asia, Europe or USA. After trying several methods in SAS, including decision trees, logistic regression, k-NN and SVM, I found that random forest, an ensemble classifier built from many decision trees [Ref. 1], cut the overall misclassification rate to around 25%. The SAS code is powered by R’s package ‘randomForest’. In my small experiment, an ensemble of 100 trees seemed to work best.

The concept of random forest was first proposed by Leo Breiman and Adele Cutler [Ref. 2], who also wrote elegant Fortran code for it. Andy Liaw at Merck did a fantastic job of porting that Fortran code to R [Ref. 3]. Now anybody with a computer can use this state-of-the-art classification method for fun or work.

References:
1. Albert Montillo. ‘Random Forest’. http://www.ist.temple.edu/
2. Leo Breiman and Adele Cutler. http://stat-www.berkeley.edu/users/breiman/RandomForests/
3. Andy Liaw. ‘randomForest: Breiman and Cutler's random forests for classification and regression’. http://cran.r-project.org/web/packages/randomForest/index.html

/*******************READ ME*********************************************
* - A macro calls random forest in SAS by R -
*
* SAS VERSION:    9.1.3
* R VERSION:      2.13.0 (library: 'randomForest', 'foreign')
* DATE:           18may2011
* AUTHOR:         hchao8@gmail.com
*
****************END OF READ ME******************************************/

****************(1) MODULE-BUILDING STEP********************************;
%macro rf(train = , validate = , result = , targetvar = , ntree = , 
          tmppath = , rpath = );
   /*****************************************************************
   *  MACRO:      rf()
   *  GOAL:       invoke randomForest in R to perform random forest
   *              classification in SAS 
   *  PARAMETERS: train     = dataset for training
   *              validate  = dataset for validation
   *              result    = dataset after prediction
   *              targetvar = target variable
   *              ntree     = number of trees specified
   *              tmppath   = temporary path for exchange files
   *              rpath     = installation path for R
   *****************************************************************/
   /* Write both datasets as CSV so that R can read them with read.csv */
   proc export data = &train outfile = "&tmppath\sas2r_train.csv"
      dbms = csv replace;
   run;
   proc export data = &validate outfile = "&tmppath\sas2r_validate.csv"
      dbms = csv replace;
   run;
   proc sql;
      create table _tmp0 (string char(200));
      insert into _tmp0  
      set string = 'train=read.csv("sas_path/sas2r_train.csv",header=T)'
      set string = 'validate=read.csv("sas_path/sas2r_validate.csv",header=T)'
      set string = 'sink("sas_path/result.txt", append=F, split=F)'
      set string = 'require(randomForest,quietly=T)'
      set string = 'model=randomForest(sas_targetvar~ .,data=train,'
      set string = 'do.trace=10,ntree=sas_treenumber,importance=T)'
      set string = 'predicted = predict(model,newdata=validate,type="class")'
      set string = 'result=as.data.frame(predicted)' 
      set string = 'importance(model)' 
      set string = 'table(validate$sas_targetvar, predicted)' 
      set string = 'require(foreign, quietly=T)'
      set string = 'write.foreign(result,"sas_path/r2sas_tmp.dat",'
      set string = '"sas_path/r2sas_tmp.sas",package="SAS")';
   quit;
   data _tmp1;
      set _tmp0;
      string = tranwrd(string, "sas_treenumber", "&ntree");
      string = tranwrd(string, "sas_targetvar", propcase("&targetvar"));
      string = tranwrd(string, "sas_path", translate("&tmppath", "/", "\"));
   run;
   data _null_;
      set _tmp1;
      file "&tmppath\sas_r.r";
      put string;
   run;

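   /* noxwait closes the command window without a manual EXIT;  */
   /* xsync makes SAS wait until R has finished                 */
   options xsync noxwait;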
   x "cd &rpath";
   x "R.exe CMD BATCH --vanilla --slave &tmppath\sas_r.r";   

   data _null_;
      infile "&tmppath\result.txt";
      input;
      if _n_ = 1 then put "NOTE: Statistics by R";
      put _infile_;
   run;

   %include "&tmppath\r2sas_tmp.sas";
   data &result;
      set &validate;
      set rdata;
   run;
%mend rf;

****************(2) TESTING STEP****************************************;
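/* A minimal sketch, not from the original post, of one way to build the
   cars_train and cars_validate datasets that the call below expects.
   Assumptions: a roughly 70/30 random split, an arbitrary seed, and only
   numeric predictors. Make and Model are left out because randomForest
   limits the number of levels in a categorical predictor, and the
   DOLLAR formats are removed so that MSRP and Invoice reach R as plain
   numbers in the CSV file. */
data cars_train cars_validate;
   set sashelp.cars(keep = origin msrp invoice enginesize cylinders
                           horsepower mpg_city mpg_highway weight
                           wheelbase length);
   format msrp invoice;
   /* randomForest stops on missing values by default */
   if nmiss(of _numeric_) > 0 then delete;
   if ranuni(20110518) < 0.7 then output cars_train;
   else output cars_validate;
run;
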
%rf(train = cars_train, validate = cars_validate, result = cars_result, 
    targetvar = origin, ntree = 100, tmppath = c:\tmp, 
    rpath = D:\Program Files\R\R-2.13.0\bin);
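
/* A sketch, again an assumption rather than the author's own code, of how
   the result could be checked: cross-tabulate the actual Origin against
   the prediction returned from R. The variable name predicted is assumed
   to come through write.foreign unchanged. The misclassification rate is
   the off-diagonal count divided by the total count. */
proc freq data = cars_result;
   tables origin * predicted / norow nocol nopercent;
run;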

****************END OF ALL CODING***************************************;
