Tuesday, October 26, 2010

PROC ARBORETUM: a secret weapon for decision trees

Introduction: Decision trees, such as CHAID and CART, are powerful predictive tools in statistical learning and business intelligence. Since SAS® 9.1, the ARBORETUM procedure has provided facilities to build and deploy decision trees interactively. Although it is still an experimental procedure, it has comprehensive features for classification and prediction. It is also the foundation of the Decision Tree node in SAS Enterprise Miner.
Method: A common SAS dataset, sashelp.cars, was divided into three parts of approximately equal size: training, validation and scoring (the scoring part is named test in the code). Two models were built: one with the target variable origin at the nominal level, and one with the target variable MSRP at the interval level.
Result: The code below shows how to use PROC ARBORETUM to train, validate and score datasets with a decision tree. The generated DATA step code was stored in two flat text files.
Conclusion: The ARBORETUM procedure is quick and versatile for applying decision trees to datasets of any size. It really is a secret weapon in the SAS procedure arsenal.

Reference: Xiangxiang Meng. Using the SGSCATTER Procedure to Create High-Quality Scatter Plots. SAS Global Forum 2010.

/*DIVIDE THE ORIGINAL DATA INTO 3 PARTS: 1:1:1*/
data cars;
set sashelp.cars;
_index=_n_;
run;
proc sort data=cars;by origin;run;
proc surveyselect data=cars samprate=0.3333 seed=20101026 out=train; /* seed value is arbitrary; it only makes the split repeatable */
strata origin /alloc=prop  ;
run;
proc sql;
create table cars2 as
select  * from cars
where _index not in ( select _index from train)
;quit;
proc surveyselect data=cars2 samprate=0.5 seed=20101026 out=validation; /* seed value is arbitrary; it only makes the split repeatable */
strata origin /alloc=prop  ;
run;
proc sql;
create table test as
select  * from cars2
where _index not in ( select _index from validation)
;quit;
proc datasets library=work nolist;
delete cars2 cars;
quit;
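
/*CHECK THE SPLIT (ILLUSTRATIVE SKETCH, NOT PART OF THE ORIGINAL PROGRAM)*/
/* The three parts should be disjoint and roughly equal in size within each
   level of origin. The dataset name parts and the variable part are made up
   for this check. */
data parts;
length part $10;
set train(in=a) validation(in=b) test(in=c);
if a then part='train';
else if b then part='validation';
else part='test';
run;
/* Row percentages show how each origin level is spread over the three parts */
proc freq data=parts;
tables origin*part / nocol nopercent;
run;
/* If no row landed in two parts, n_rows equals n_distinct */
proc sql;
select count(*) as n_rows, count(distinct _index) as n_distinct
from parts;
quit;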

/*TARGET VARIABLE: NOMINAL*/
filename code_1 'C:\code_1.txt';
proc arboretum data=train;
target origin / level=nominal;
input MSRP Cylinders Length  Wheelbase MPG_City MPG_Highway Invoice Weight Horsepower/ level=interval;
input EngineSize/level=ordinal;
input  DriveTrain Type /level=nominal;
assess validata=validation;
code file=code_1;
score data=test out=scorecard outfit=scorefit;
save   IMPORTANCE=imp1 MODEL=mymodel  NODESTATS=nodstat1  RULES=rul1 SEQUENCE=seq1 SIM=sim1  STATSBYNODE= statb1 SUM=sum1
;
run;
quit;
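
/*APPLY THE GENERATED SCORING CODE (ILLUSTRATIVE SKETCH)*/
/* The file written by the CODE statement holds DATA step statements, so it
   can be run inside a DATA step through %INCLUDE and the same fileref.
   The output name deployed_1 is made up here. The prediction variables are
   whatever the generated file creates; open C:\code_1.txt to see the exact
   names before relying on them. */
data deployed_1;
set test;
%include code_1;
run;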

/*TARGET VARIABLE: INTERVAL*/
filename code_2 'C:\code_2.txt';
proc arboretum data=train;
target MSRP / level=interval;
input  Cylinders Length  Wheelbase MPG_City MPG_Highway Weight Horsepower/ level=interval;
input EngineSize/level=ordinal;
input  DriveTrain Type origin /level=nominal;
assess validata=validation;
code file=code_2;
score data=test out=scorecard2 outfit=scorefit2;
save   IMPORTANCE=imp2 MODEL=mymodel2  NODESTATS=nodstat2  RULES=rul2 SEQUENCE=seq2 SIM=sim2  STATSBYNODE= statb2 SUM=sum2
;
run;
quit;
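
/*INSPECT THE SAVED OUTPUT (ILLUSTRATIVE SKETCH)*/
/* The datasets named in the SAVE and SCORE statements are ordinary SAS
   datasets, so PROC PRINT is enough for a first look. Only names created by
   the step above are used; the same check works for imp1, seq1 and scorefit
   from the nominal-target model. */
proc print data=imp2;      /* variable importance */
run;
proc print data=seq2;      /* assessment of the subtree sequence */
run;
proc print data=scorefit2; /* fit statistics on the scored data */
run;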
