Tuesday, May 3, 2011

Support vector machine in SAS by R


I just recently discovered endless fun to synchronize SAS and R to do something meaningful. Yep, I am a SAS programmer: during the day time, I use SAS for my work; at the evening, I use R for entertainment. It is always exciting to hook up them together. How about a SAS/R module, like SAS/STAT or SAS/BASE, in the future?

Some SAS programmers or SAS ‘developers’ already utilized coding to communicate SAS and R [Ref. 1 and 2] (thanks to Rick Wicklin’s mentioning). Since R can write dataset in SAS code (the ‘foreign’ package) and SAS can use call R in X command to do batch execution, so far I didn’t find much difficulty to use SAS as a GUI for R.

Support vector machine (SVM) is a cool and fancy classification method. It is said that a secret SVM procedure is already running under SAS Enterpriser Miner (I did not get a chance to try it yet). The package ‘e1071’ in R provides the state-of-art protocols for SVM classification. Then I made a small macro to call it in SAS, which performs like a typical SAS procedure. Hope everyone who is doing data could enjoy it.

Reference:
1.Philip R Holland ‘SAS to R to SAS’. Holland Numerics Limited.
2.Phil Rack. ‘A Bridge to R for SAS Users’. www.MineQuest.com

/*******************READ ME*********************************************
* - SUPPORT VECTOR MACHINE FOR CLASSIFICATION IN SAS BY R  -
*
* SAS VERSION:  SAS 9.1.3
* R VERSION:    R 2.13.0 (library: 'e1071', 'foreign')
* DATE:         03may2011
* AUTHOR:       hchao8@gmail.com
*
****************END OF READ ME******************************************/

****************(1) MODULE-BUILDING STEP******************;
%macro svm(train = , validate = , result = , targetvar = , tmppath = , rpath = );
   /*****************************************************************
   *  MACRO:      svm()
   *  GOAL:       invoke e1071 in R to perform support vector machine
   *              classification in SAS 
   *  PARAMETERS: train     = dataset for training
   *              validate  = dataset for validation
   *              result    = dataset after prediction
   *              targetvar = target variable
   *              tmppath   = temporary path for exchagne files
   *              rpath     = installation path for R
   *****************************************************************/
   proc export data = &train outfile = "&tmppath\sas2r_train.csv" replace; 
   run;
   proc export data = &validate outfile = "&tmppath\sas2r_validate.csv" replace; 
   run;
  
   proc sql;
      create table _tmp0 (string char(80));
      insert into _tmp0  
      set string = 'train=read.csv("sas_path/sas2r_train.csv",header=T)'
      set string = 'validate=read.csv("sas_path/sas2r_validate.csv",header=T)'
      set string = 'require(e1071,quietly=T)'
      set string = 'model=svm(sas_targetvar ~ . ,data=train)'
      set string = 'predicted=predict(model,newdata=validate,type="class")'
      set string = 'result=as.data.frame(predicted)' 
      set string = 'require(foreign, quietly=T)' 
      set string = 'write.foreign(result,"sas_path/r2sas_tmp.dat",'
      set string = '"sas_path/r2sas_tmp.sas",package="SAS")';
   quit;
   data _tmp1;
      set _tmp0;
      string = tranwrd(string, "sas_targetvar", propcase("&targetvar"));
      string = tranwrd(string, "sas_path", translate("&tmppath", "/", "\"));
   run;
   data _null_;
      set _tmp1;
      file "&tmppath\sas_r.r";
      put string;
   run;

   options xsync xwait;
   x "cd &rpath";
   x "R.exe CMD BATCH --vanilla --slave &tmppath\sas_r.r";   

   data _null_;
       infile "&tmppath\sas_r.r.rout";
       input;
       if _n_ = 1 then put "NOTE: Time used by R";
       put _infile_;
   run;

   %include "&tmppath\r2sas_tmp.sas";
   data &result;
      set &validate;
      set rdata;
   run;

   proc datasets nolist;
      delete _: rdata;
   quit;
%mend;

****************(2) TESTING STEP******************;
******(2.1) BUILD A PARTITION MACRO TO SEPARATE TESTING DATASET*************;
%macro partbyprop2(data = , targetvar = , samprate = , seed =  , train = , validate = );
   /**************************************************************
   *  MACRO:      partbyprop2()
   *  GOAL:       partition dataset by target variable's proportion
   *              and choose numerical variables for classification
   *  PARAMETERS: data       = input dataset
   *              targetvar  = target variable
   *              samprate   = ratio of train v.s. validate datasets
   *              seed       = random seed for sampling
   **************************************************************/
   ods listing close;
   ods output variables = _varlist;
   proc contents data = &data;
   run;
   proc sql;
      select variable into: num_var separated by ' ' 
      from _varlist
      where lowcase(type) = "num";
   quit;

  proc sort data = &data out = _tmp1;
      by &targetvar;
   run;

   proc surveyselect data = _tmp1 samprate = &samprate  
      out = _tmp2 seed = &seed outall;
      strata &targetvar / alloc = prop;
   run;

   data &train &validate;   
      set _tmp2;
      format _numeric_  best13.;
      keep &num_var &targetvar;
      if selected = 0 then output &train;
      else output &validate;
   run;

   proc datasets;
      delete _:;
   quit;
   ods listing;
%mend;

******(2.2) DIVIDE SASHELP.IRIS DATASET INTO TWO EQUAL PARTS*************;
%partbyprop2(data = sashelp.iris, targetvar = species, samprate = 0.5, seed = 20110503, 
            train = iris_train, validate = iris_validate);

******(2.3) USE THE SVM MACRO*************;
%svm(train = iris_train, validate = iris_validate, result = iris_result, 
     targetvar = species, tmppath = c:\tmp, rpath = c:\Program Files\R\R-2.13.0\bin);

****************(3) VISUALIZATION STEP******************;  
data iris_visual;
  set iris_result;
  length color shape $8.;
  predvalue = put(predicted, predictd.);
  if species = "Setosa" then shape = "club";  
  if species = "Versicolor" then shape = "diamond";  
  if species = "Virginica" then shape = "spade";
  if predvalue = "Setosa" then color = "blue";
  if predvalue = "Versicolor" then color = "red";
  if predvalue = "Virginica" then color = "green";
run;

ods html style = harvest 
proc g3d data = iris_visual;
  scatter PetalLength * PetalWidth = SepalLength /
           color = color shape = shape;
run;quit;
ods html close;

****************END OF ALL CODING***************************************;
 
Link of r2sas_tmp.sas

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...