Wednesday, April 27, 2011

A macro calls R in SAS for paneled 3d plotting


SAS and R could complement each other. SAS is a versatile ETL (extraction, transformation and loading) machine and its statistical procedures based on generalized linear model are impeccable. R would bring cutting-edge data mining and data visualization technologies at low cost (or no cost). Although the two packages dwell in distinctive ecosystems (for example: different OS/ETL/database/reporting layers) [Ref. 1], mixed programming by combining them together would make an analytics shop invincible.

Some SAS programmers like to use SAS/IML to call R’s functions [Ref. 2]. However, it seems that SAS/IML fails to work with the latest versions of R since 2.12 [Ref. 3]. Others tend to play tricks to call R into SAS’s data step to meet their daily needs [Ref. 4]. In this post, the macro below would call the ‘lattice’ package of R in SAS, on a PC platform, to draw paneled three dimension images, since currently SAS’s SG procedures don’t own such an option. The good thing is that there is no need to check the version of R installed before running it. And the modification of this macro can be extended to other applications to call R in SAS.

Reference:
1. ‘Keep an Eye on the emerging Open-Source Analytics Stack’. Revolution R Blog.
2. Zhengping Ma. ‘Data mining in SAS with open source software’. SAS Global 2011.
3. ‘SAS/IML incompatible with latest releases of R’. SAS-L. 11APR2011.
4. Liang Xie. ‘Regularized Discriminant Analysis’. www.sas-programming.com

/*******************READ ME*********************************************
* - A MACRO CALLS R IN SAS FOR PANELED 3D PLOTTING  -
*
* VERSION:     SAS 9.2(ts2m0), windows 64bit
* DATE:        25apr2011
* AUTHOR:      hchao8@gmail.com
*
****************END OF READ ME*********************j********************/

****************(1) MODULE-BUILDING STEP******************;
%macro scatter3dpanel(data = , x = , y = , z = , factor = , 
                      width = , height = , outfile = );
   /***********************************************************
   *  MACRO:      scatter3dpanel()
   *  PARAMETERS: data   = dataset for plotting
   *              x      = x-axis variable
   *              y      = y-axis variable
   *              z      = z-axis variable
   *              factor = partition factor variable
   *              width  = width of output graph
   *              height = height of output graph
   *              outfile= location of output image
   ***********************************************************/
  proc export data = &data outfile = "d:\tmp.csv" replace;
  run;
  
  proc sql;
    create table _tmp0 (string char(80));
    insert into _tmp0 
    set string = 'tmp=read.csv("d:/tmp.csv", header=T)'
    set string = 'attach(tmp)'
    set string = 'library(lattice)'
    set string = 'windows()'
    set string = 'cloud(sas_zvar~sas_xvar+sas_yvar|as.factor(sas_factor), pretty=T)'
    set string = 'dev.print(device=png, width=sas_width, height=sas_height, file="sas_file")';
  quit;
  
  data _tmp1;
    set _tmp0;
    string = tranwrd(string, "sas_xvar", propcase("&x"));
    string = tranwrd(string, "sas_yvar", propcase("&y"));
    string = tranwrd(string, "sas_zvar", propcase("&z"));
    string = tranwrd(string, "sas_factor", propcase("&factor"));
    string = tranwrd(string, "sas_width", "&width");
    string = tranwrd(string, "sas_height", "&height");
    string = tranwrd(string, "sas_file", translate("&outfile", "/", "\"));
  run;

  data _null_;
    set _tmp1;
    file "d:\callRinSAS.r";
    put string;
  run;

  options noxsync noxwait;
  x ' "d:\Program Files\R\R-2.12.1\bin\R.exe" CMD BATCH --vanilla --slave "d:\callRinSAS.r" ';
%mend;

****************(2) TESTING STEP******************;
%scatter3dpanel(data = sashelp.cars, x = length, y = wheelbase, z = horsepower, 
            factor = type, width = 1200, height = 600, outfile = d:\test1.png );

****************END OF ALL CODING***************************************;

Tuesday, April 26, 2011

Some analysis on university ranking by US News


The yearly US News best college ranking is an important tool in comparing schools for students and their eager parents. The latest data is publicly available (paying 20 bucks would get full access) [Ref.1]. And the methodology is easy to find and explain [Ref.2]: a score would be weighted by peer assessment, retention, faculty resources, student selectivity, graduation rate, etc; therefore the final ranking would be based on the scores of a number of colleges.

It is interesting to explore and dissect the ranking process by US News. Still the dirty job of data extraction, transformation and loading occupied 90% of the working time. Data crunching was performed with logistic regression (for private/public), and selective linear regression (for score), by the nice tools from SAS/STAT. Factor analysis and partial least square regression were used to minimize the multicollinearity that is widespread in this data.

The analysis leads to two conclusions. First, the ranking is relatively qualitative instead of quantitative. The ranking heavily depends on the reputation opinion form surveying institutions’ administrators and high schools’ counselors. Other variables just modify the result. Second, the ranking favors private universities. Being a private university would add 3 points to the overall score. The best public university, UC Berkeley, is ranked as 22nd. I didn’t find any reason why it is inferior to some private universities ahead. At the data level, the public universities and private ones are distinguishable. And apparently they target different customer groups. To be fair, the US News may divide the university ranking into two subsystems: public universities and private universities, which could be more helpful in understanding the universities' standing in their sectors.

References:
1. http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data
2. http://collegethrive.com/college-rankings-us-news-world-report-method


****************(1)CLUSTERING STEP******************;
ods listing close;
ods output variables = _varlist;
proc contents data = uscr11;
run;

proc sort data = _varlist;
   by num;
run;

proc sql;
   select variable into: num_vars separated by ' '
   from _varlist
   where lowcase(type) = "num" and num not in (4, 5)
;quit;

proc varclus data = uscr11 summary outtree=tree;   
   var &num_vars;
run;

ods html style = harvest;
ods graphics on;
goptions htext = 4pct ftext = "Albany AMT";
axis1 order = (0.5 to 1 by 0.1);
axis2 label = none;

proc tree horizontal haxis=axis1 vaxis=axis2;
   height _propor_;
   id _label_;
run;

proc sgscatter data = uscr11;
   matrix &num_vars /ellipse=(alpha=0.25) markerattrs=(size=1);
run;

****************(2)IMPUTATION STEP******************;
proc mi data = uscr11 nimpute = 1 round = .01 
        seed = 20110425 out = _tmp0;
   monotone regpmm(donaterate = score ugrepidx gradrate retention);
   var score ugrepidx gradrate retention donaterate;
run;

proc mi data = _tmp0 nimpute = 1 round = .01 
        seed = 20110425 out = imputed;
   monotone reg(top10fresh = score ugrepidx sat25p sat75p acceptrate);
   var score ugrepidx sat25p sat75p acceptrate top10fresh;
run;

****************(3)FACTOR ANALYSIS STEP******************;
proc factor data = imputed nfactors = 3 rotate=promax
            reorder out = factorized plots=(scree);
   var &num_vars;
run;

data _tmp1;
   set factorized;
   if type = "private" then do; shape = "club"; color = "blue"; end;
   else do; shape = "diamond"; color = "red"; end;
   keep shape color factor:;
run;

proc g3d data = _tmp1;
   scatter factor2*factor3 = factor1 / color = color shape = shape;
run;

****************(4)LOGISTIC REGRESSION STEP******************;
proc logistic data = imputed plots = (roc);
   model type = &num_vars / 
         selection = stepwise slentry = 0.3 slstay = 0.3;
run;

proc pls data = imputed plot = (corrloadplot variableimportanceplot);
   model score = &num_vars;
run;

proc sql;
   select variable into: vars separated by ' '
   from _varlist
   where num in (3, 6, 7, 8, 12, 13, 14, 15, 16)
;quit; 

****************(5)VARIABLE SELECTION STEP******************;
proc glmselect data = imputed plot = (coefficientpanel aseplot);
   partition fraction(validate = 0.5);
   class type;
   model score = &vars / 
         selection = stepwise(choose = validate select = sl);
run;
ods graphics off;
ods html close;
ods listing;

****************END OF ALL CODING***************************************;

Sunday, April 17, 2011

A subroutine in SAS to simulate asset pricing paths


For matrix computation in SAS, SAS/IML is the choice. This module has its own syntax, functions and even plotting subsystem. Some statisticians used it to realize the algorithms beyond the reach of SAS’s procedures, for example, boosting [Ref. 1]. However, comparing with other popular matrix-based languages, such as R and Matlab, SAS/IML has no edge. SAS’s most valuable products are still its robust data step and statistical procedures. ‘Porting’ source codes from other languages into SAS has to rely on data step.

Asset prices can be estimated by Monte Carlo simulation. To generate a series of price-evolving paths with several most common parameters, Dr. Brandimarte codes a naive Matlab function to apply the standard Wiener process [Ref. 2]. With random seeds for normal distribution, multiple pricing mechanisms can be demonstrated and compared. In SAS 9.2, the workflow of simulation and visualization would be modularized as a data step subroutine. Later, such a pricing subroutine could be easily invoked under given circumstances.

It is said that SAS 9.3 is going to be released 2011Q3. Hope this time, the data step function compiler, Proc Fcmp, could be more dynamic and with more methods.

References:
1. Dmitrienko, Alex, Christy Chuang-Stein, and Ralph D’Agostino. Pharmaceutical Statistics Using SAS: A Practical Guide. SAS Publishing. 2007
2. Paolo Brandimarte. Numerical methods in finance. John Wiley & Sons. 2002.

/*******************READ ME*********************************************
* - A SUBROUTINE IN SAS TO SIMULATE ASSET PRICING PATHS -
*
* VERSION:     SAS 9.2(ts2m0), windows 64bit
* DATE:        17apr2011
* AUTHOR:      hchao8@gmail.com
*
****************END OF READ ME*****************************************/

****************(1) MODULE-BUILDING STEP******************;
******(1.1) COMPILE FUNCTION-ACCOMPANYING MACRO*************;
option mstored sasmstore = sasuser;
%macro AssetPath_macro() / store source;
   data _tmp1;
      set _tmp;
      day  = _n_ - 1;
   run;

   proc transpose data = _tmp1 out = _tmp2 ;
      by day;
      var path:;
   run;

   data _tmp2;
      set _tmp2;
      label _name_ = 'Simulated paths';
   run;

   ods html style = money;
   proc sgplot data = _tmp2;
      series x = day y = col1 / group = _name_;
      yaxis grid label = 'Asset price';
      xaxis grid label = 'Change by days';
   run;
   ods html close;
%mend;

******(1.2) COMPILE SUBROUTINE FOR ASSET PRICING*************;
proc fcmp outlib = sasuser.astpth.funcs;
   subroutine AssetPath(S0,  mu, sigma, T,  NSteps,  NRepl);
   /*************************************************************
   * SUBROUTINE:    AssetPath()
   * PARAMETERS:    S0    = the initial price
   *                mu    = the drift
   *                sigma = the volatility
   *                T     = the horizontal time
   *                NSteps= the number of time steps
   *                NRepl = the number of replications
   *************************************************************/
      array SPaths[1, 1] / nosymbols;
      array Path[1, 1] / nosymbols;
      call dynamic_array(SPaths, NRepl, NSteps + 1);
      call dynamic_array(Path, NSteps + 1, NRepl);
      call zeromatrix(SPaths);
      do _row = 1 to NRepl;
         SPaths[_row, 1] = S0;
      end;
      dt = T / NSteps;
      nudt = (mu - 0.5*sigma**2) * dt;
      sidt = sigma * sqrt(dt);
      do _row = 1 to NRepl;
         do _col = 1 to Nsteps;
            SPaths[_row, _col + 1] = SPaths[_row, _col] * 
               exp(nudt + sidt*rannor(0));
         end;
      end;   
      call transpose(SPaths, Path);
      rc1 = write_array('_tmp', Path);
      rc2 = run_macro('AssetPath_macro');
    endsub;
quit;
****************END OF STEP (1)******************;

****************(2) TESTING STEP******************;
option cmplib = (sasuser.astpth) mstored sasmstore = sasuser;
data _null_;
   S0 = 50; mu = 0.1; sigma = 0.3; T = 1; NSteps = 365; NRepl = 3;
   call AssetPath(S0,  mu, sigma, T,  NSteps,  NRepl);
run;
****************END OF STEP (2)******************;

****************END OF ALL CODING***************************************;

Saturday, April 9, 2011

Predict unemployment rate for Election 2012 by SAS


Since recently President Obama announced that he is seeking reelection, the unemployment rate on November 2012 would decide the result. The Wall Street Journal averaged 54 economists’ predication and concluded that the number is going to be 7.7%. Apparently, those economists rely on the historical data to forecast the future, together with more or less their subjective judgment. However, the newly released March data is surprisingly good: 8.8%, which means that this predication number has to be adjusted downwardly to be below 7.7%. Then what is the real-time prediction of the unemployment rate for this ‘big’ time?

SAS has one of the finest time-series packages in the world: SAS/ETS which includes a few predictive procedures such as the ARIMA procedure and the FORECAST procedure[Ref. 2]. And the economic data is updated by Federal Reserve and well accessible on their website. To predict unemployment rate like a real professional is possible with a notebook computer and SAS. Of course SAS’s procedures have tons of methods and parameters to tune. To simply this problem, in the SAS macro below, I chose a conservative method and an aggressive one, to give a rough estimation about the unemployment range. Just like what the WSJ said, the trend matters. The predication will be more approaching to the real number as time goes forward. Right now, my prediction for the unemployment rate on November 2012 is from 7.1% to 7.4%.

References:
1."Jobless Rate at 2012 Presidential Vote Forecast at 7.7%, Highest Since Carter-Ford, but the Trend May Matter Most". The Wall Street Journal, 13Mar2011.
2.SAS/ETS 9.2 User Guide. SAS Publishing, 2008.

/*******************READ ME*********************************************
* -- PREDICT UNEMPLOYMENT FOR ELECTION 2012 LIKE A PRO --
*
* VERSION:     SAS 9.2(ts2m0), windows 64bit
* DATE:        09apr2011
* AUTHOR:      hchao8@gmail.com
*
****************END OF READ ME*****************************************/

****************(1) MODULE-BUILDING STEP******************;
%macro unemrate(predtime = );
   /***********************************************************
   *  MACRO:      unemrate()
   *  GOAL:       use time series based on latest FED data to
   *              predict unemployment rate in US and plot
   *  PARAMETERS: predtime = the time when unemployement rate 
   *                         is to be predicted
   *
   ***********************************************************/
   filename _infile url 
      "http://research.stlouisfed.org/fred2/data/UNRATE.txt" 
      debug lrecl=100;

   data raw;
      infile _infile missover firstobs = 22;
      format date date9.;
      input @1 date yymmdd10. @13 value 4.1;
   run;

   data _null_;
      set raw end = eof;
      if eof then do;
         interval = intck('month', date, input("&predtime", monyy7.));
         call symput('interval', interval);
         call symput('eodate', date);
         call symput('insert', 'Lastest data:' || put(value, 4.1) || 
                   '% on ' || put(date, monyy7.));
      end;
   run;

   %if %eval(&interval) le 0 %then %do; 
      %put ERROR: Predicted time must be greater than latest time FED posts data; 
      %goto finish; 
   %end;

   ods select none;
   proc forecast data = raw out = _predbyfc outfull
      method = stepar lead = &interval interval = month;
      id date;
      var value;
   run;

   proc arima data = raw;
      identify var = value;
      estimate p = 1 q = 12;
      forecast lead = &interval interval = month 
              id = date out = _predbyar;
   quit;
   ods select all; 

   proc sql;
      create table predicted0 as
      select a.date, a.value label = 'Real unemployment rate', 
            a.forecast as predbyar label = 'ARIMA model', 
            b.value as predbyfc label = 'STEPAR model'
      from _predbyar as a,
           _predbyfc (where = (lowcase(_type_) = 'forecast')) as b
      where a.date = b.date
   ;quit;

   data predicted1;
      set predicted0 end = eof;
      if date lt &eodate then call missing(predbyar, predbyfc);
      else if date eq &eodate then do;
         predbyar = value; 
         predbyfc = value;
      end;
      if eof then do;
         call symput('arlast', put(predbyar, 4.2));
         call symput('fclast', put(predbyfc, 4.2));
      end;
   run;
    
   ods html style = harvest;
   proc sgplot data = predicted1;
      where date ge '01jan2006'd;
      series x = date y = value;
      series x = date y = predbyar;
      series x = date y = predbyfc;
      refline &arlast / axis = y labelloc = inside
         label = "&arlast" transparency = 1;
      refline &fclast / axis = y labelloc = inside
         label = "&fclast" transparency = 1;
      xaxis grid label = ' ';
      yaxis grid label = 'Unemployment percentage %'
         values = (4 to 11 by 0.2);
      inset "Prediction ends on &predtime" / position = topright border;
      inset "&insert" / position = bottomright;
   run;
   ods html close;

   proc datasets nolist;
      delete _:;
   quit;

   %finish: ; 
%mend;

****************(2) TESTING STEP******************;
%unemrate(predtime = NOV2012);

****************END OF ALL CODING***************************************;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...