Monday, December 13, 2010

A SAS data miner without Enterprise Miner



SAS Enterprise Miner (EM) is indeed a fancy tool for a SAS programmer who wants to switch to the field of data mining. It is like the point-and-click camera: you drag several nodes onto the diagram, run it and everything is settled. And I was quite impressed by the resourceful reports and figures automatically generated by SAS EM. Then I spent several months to learn it, passed a SAS certified exam, and finished some pet projects on it. You don’t even need to know anything about statistics to be a data miner. Does that sound too cool to be true?

Some barriers exist for implementing SAS EM in reality. First consideration is the price. SAS charges about 30k for a retail license (I am not sure about the exact price). Only those fat-cat companies, like AT&T, could boast their UNIX-TERADATA-SAS EM complex. Many mid-size or small-size companies can hardly to afford it: they have to do data mining in their own ways with or without SAS. Second is about the installation and maintenance. Since I was taking a training class, I can purchase a license of all SAS products during limited time with a reasonable price. Then I tried numerous times to install SAS EM on my computer and failed. None of my classmates succeeded neither. SAS customer service told me that the standard strategy is to hire a $50k-a-year system engineer, trained by SAS, to install SAS Enterprise Miner properly. Of course, it adds more to the total cost of SAS EM. Third, the running process is contained in a black box. Each time SAS EM would invoke a number of procedures from SAS foundation, but no extra information is available in the log except whether it is a success or failure. For those SAS programmers who tend to check log to find ‘note/warning/error’ message, it is painful experience. I used to run the same process in two computers installed with SAS EM and had distinctive results. Everything was identical, and therefore I could not figure out why I had different results. Fourth, SAS EM is not that efficient. SAS EM applies J2EE to realize the cross-platform feature, which lags the execution time. Data and statistics have to pass through SAS foundation, J2EE until the XML layer of SAS EM. The time to run big data, say more than 100MB data, in a mainstream computer with SAS EM is going to be very long, sometimes intolerable.

Actually, SAS foundation, mainly SAS/BASE and SAS/STAT, is good enough for routine data mining jobs (some procedures may need the license of SAS Enterprise Miner). For example, to emulate the cluster node in SAS EM, we probably have a number of options, such as Proc Cluster, Proc Fastclus, Proc Aceclus, Proc Distance and Proc Tree; to do dimension reduction by PCA or FA, Proc Prinqual, Proc Princomp and Proc Factor are ready to use; to build a decision tree, using Proc Arboretum directly may provide more flexibility. Almost all functionality in SAS EM could be rendered in SAS 9.2. As for the graphics, the SG procedures are handy tools to customize the output images. Macro is a good choice for companies which may need to encapsulate and reuse their codes. If a GUI is desired, Dr. Fernandez in his book suggested using the compiled macros. Besides, we have chance to see what the interim datasets look like.

Still a SAS data miner should be a person who knows statistics. She/he should understand the underlying theory or algorithm; she/he should know how to pick and combine the appropriate procedures corresponding to specific tasks; she/he also should know how to do the dirty work to code and get job done. Overall, the data mining needs to generate not only fun but also meaningful results.


References: George Fernandez. Statistical Data Mining Using SAS Applications, Second Edition (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). CRC Press.
/*CALL A GUL TO INPUT THE VARIABLES*/
%window mycluster irow=3 rows=33 icolumn=2 columns=60  color=YELLOW
#1    @15 " Cluster Analysis" color=red  attr=underline
#4    @3  "1. Input the SAS Data set name?"  color=blue
#5    @7  "EX: SASHELP.CARS "   attr=rev_video
#6    @7   data 30  required=yes attr=underline
#4    @40  "2. Exploratory graphs ?"  color=blue
#5    @44  " options: YES "      attr=rev_video
#6    @44  explor 20    attr=underline
#7    @3  "3. Input number of cluster   ?"   color=blue
#8    @7   " EX: 5 "    attr=rev_video
#9   @7   num 20    required=yes         attr=underline
#7    @40  "4. Checking for assumptions ?"    color=blue
#8    @44  "options: YES  "  attr=rev_video
#9    @44  assump 10       attr=underline
#10    @3  "5. Input clustering variable(s) ?"   color=blue
#11    @7   "EX: make MSRP "   attr=rev_video
#12    @7   mpred1 80     required=yes           attr=underline
#13    @7   mpred2 80     attr=underline
#13    @90 "contin var" color= blue
#14   @3  "6. Any optional PROC CLUSTER  options?"   color=blue
#15   @7  " EX: method=ave method=cen method=war  k=5 trim=5  "   attr=rev_video
#16   @7  clusopt 80           attr=underline
#17   @3  "7. Input  ID variable ?"      color=blue
#18   @7  " EX: id "       attr=rev_video
#19   @7  ID 30           attr=underline
#17   @40  "8. Variable standardization method ? "     color=blue
#18   @44  prin 30                  attr=underline
#20    @3  "9. Folder to save output?"          color=blue
#21    @7  "EX:  c:\  "        attr=rev_video
#22    @7  output 80              attr=underline
#23    @3 "10. Folder to save graphics ?"    color=blue
#24    @7  "EX: c:\  "       attr=rev_video
#25    @7  dir 80       attr=underline
#29   @3   "11. Display or save SAS graphs?"    color=blue
#30   @7   graph 20      required=yes attr=underline
#33   @10 " USE THE ENTER KEY TO RUN THIS MACRO(don't hit the submit key)  
"color=red attr=rev_video;
%display mycluster;

/*INVOKE THE MACRO USED IN THE GUI*/
libname mylib base "d:\macroproject\cluster";
options sasmstore=mylib mstored;
%mycluster

Sunday, December 12, 2010

5D visualiztion: from SAS to Google Motion Chart



Three dimensions are usually regarded as the maximum for data presentation. With the opening of ODS from SAS 9.2 and its graph template language, 3D graphing is no longer a perplexing problem for SAS programmers. However, nowadays magnificent amount of data with multi-dimension structure needs more vivid and simpler way to be displayed.

The emerging of Google Motion Chart now provides a sound solution to visualize data in a more than three dimensions scenario. This web-based analytical technology originated from Dr. Hans Rosling’s innovation. Dr. Rosling and his Gapminder foundation invented a technology to demonstrate the relationship among multiple dimensions by animated bubbles. They developed a lot of bubble plots in Gapminder’s website to discover knowledge form a bulk of public information, especially for regional/national comparison. It soon attracted Google’s attention. In 2008 after an agreement between Dr. Rosling and Google’s two founders, Google launched its Motion Chart gadget. People could create motion chart by using Google Docs, an online alternative to Microsoft’s Office.

The combination between SAS and Google Motion Chart shows a handy and cheap way for up-to-five-dimension data visualization. For Motion Chart, it supports five variables all together in a plot. Commonly the data structure requires time(animation), var1(X axis), var2(Y axis), var3(color) and var4(bubble size). The correlation from var1 to var4 is expected: usually the bubbles with changing color and size tend to move along the diagonal line. Overall 5d visualization can be rendered within such a single plot. In this example, a SAS help dataset ‘SASHELP.SHOES’ is used. The data set has several regions to compare each other. Logged return money is Y-axis, while logged sale money is X-axis. A series of virtual time is given to each region, with inventory as bubble size and the store number as color. By SAS, the data structure in Motion Chart can be prepared quickly. Thus, once the CSV file is uploaded to Google Docs, a motion chart is ready to be published in any webpage. OK, it's time to sit and discover some interesting tendency...

Reference:
1.'Show me--New ways of visualising data’. The Economist. Feb 25th 2010.
2.‘Making data dance’. The Economist. Dec 11st 2010.
3. Google Docs online help center. 2010.

*********(1) Extract data from SASHELP.SHOES***********;
proc sql;
create table test as
select region, Sales, Inventory, Returns, Stores
from sashelp.shoes
order by region , sales desc
;quit;
********(2) Create a random variable for time************;
data test1;
do i=1 by 1 until (last.region);
set test;
by region;
time=today()-i+1;
mytime=put(time, mmddyy8.);
drop i;
output;
end;
run;
********(3) Transform some variables with log**********;
proc sql;
create table test2 as
select region, mytime, log(sales) as logsales, log(returns) as logreturn, Stores as storenum, Inventory
from test1
order by region, mytime
;quit;
********(4) Export data as CSV***************;
proc export data=test2 outfile='C:\Users\Yanyi\Desktop\test.csv' replace;
run;
*******(5) Upload CSV to Google Docs************;
******(6)  Create Google Motion Chart manually**********;

**********END*********TEST PASSED 12DEC2010****************************;

Monday, December 6, 2010

Proc Fcmp(3): brute force for a distribution's pdf


Proc Fcmp can produce arbitrary distribution formulas. In this example, suppose that I don’t know too much statistics, what if I want to evaluate the pdf of the absolute value of the subtraction between two independent random variable from the uniform distributions? In this example, probably we can follow such procedures: (1) use Proc Kde to find the kernel distribution and make a guess about what this distribution is; (2) use Proc Fcmp to make the pdf of the distribution; (3) use Proc Model to fit the test dataset and find the parameters; (4) use the simulation to validate the hypothesis. Thus, such a method may be a useable solution to evaluate distribution of a random variable, if without stringent mathematical proof.
*************(1) The distribution of the absolute value of between two uniform distribution*****;
data distance;
do i=1 to 10000;
x1=ranuni(0);
x2=ranuni(0);
x=abs(x1-x2);
drop i;
output;
end;
run;
proc sgplot;
histogram x1/transparency=0.3;
histogram x2/transparency=0.7;
histogram x/transparency=0.8;
run;

************(2) Observe the kernel distribution ************;
ods graphics on;
proc kde data=distance;
univar x /plots=all out=myden;
run;
ods graphics off;

***********(3) Fit with Beta distribution*****************;
proc fcmp outlib = sasuser.myfunc.math;
function betadist(a,b,x);
density=(beta(a,b))**(-1)*x**(a-1)*(1-x)**(b-1);
return(density);
endsub;
run;

options cmplib = sasuser.myfunc nodate ls=80;
proc model data=myden(rename=(value=x));
density=betadist(a,b,x);
fit density;
quit;

************(4) Look at the distribution again***************;
data newdata;
do i=1 to 1E5;
y=rand('beta', 1, 2);
output;
end;
run;

proc sgplot data=newdata;
histogram y/fillattrs=(color=green) ;
run;
********************END********************;

Saturday, December 4, 2010

Proc Fcmp(2): a subroutine for Binomial-CRR model

Problems: Quote for six-month American style euro currency options on plain vanilla, Max[S-K,0]and 〖Max[S-K,0]〗^0.5. Exchange rate S_0=$1.3721 /euro
Six-month continuously compounded inter-bank rates: r=0.4472%,r_f=1.2840%.
Assumptions:The exchange rate for euro follows an iid log normal price changes and volatility is constant.
Methodology:Binomial Model is used to price American currency options on euros.
We calculate the payoffs at time T and discount payoffs to the prior time step. Under the risk neutral probability measure,
c_(t-1)=(q×c_u+(1-q)×c_d)/R
Since these two options are American styles, we need to check for optimal early exercise at each node of the binomial tree. For these two currency options, we check Max[S-K,c_(t-1),0] and Max[〖Max[S-K,0]〗^0.5,c_(t-1) ] . Matlab is the software used to implement the binomial model.
--Parameters:1. Time steps n
As the number of time steps n increases, we can apply CRR model and the binomial model approaches real world price changes. We choose n=80,h=T/n=0.5/80=0.00625.
2. u and dWe choose CRR model to define u and d for binomial model.
u=e^(σ√h) ; d=1/u=e^(-σ√h)
Where h is the length of the binomial times step and σ is the annualized log price change volatility.
3. Annualized log price change volatility σThe daily log changes and daily squared log changes for two year exchange rates from 11/5/2008 – 11/5/2010 are as follows. We consider the volatility to be constant since 05/01/2009. Thus, we choose the historical prices from 05/01/2009-11/5/2010 and apply the volatility of the daily log changes as an estimate. Then the annualized log price change volatility equals the square root of the trading days in one year (252 days) times the daily log price change volatility.
σ=√252×σ_daily
4. Risk Neutral Measure QOptions on currencies can be regarded as an asset providing a yield at the foreign risk-free rate of interest. Thus, the risk neutral probability measure Q :
q=(e^((r-r_f ) )-d)/(u-d) ;
1-q= 〖u- e〗^((r-r_f ) )/(u-d)
5. Strike Price KSet strike price K from $1.3000/euro to $1.5000/euro with $0.005 per euro increments.
Reference: 1. John C. Hull.Options, Futures and Other Derivatives, 7th edition. Prentice Hall. 2008.
2. Base SAS 9.2 Procedures Guide. SAS Publishing. 2009

**********************AUTHOR(DAPANGMAO)----HCHAO8@GMAIL.COM***********************************;
****************(1) CONSTRUCT € TO $ EXCHANGE RATIO VECTOR******************;
data vector;
attrib value informat=dollar10.4 format=dollar10.4 ;
StrikeS=1.3000;  
StrikeE=1.5000;
Increment=0.005;
do value=StrikeS to StrikeE by  Increment;
drop StrikeS StrikeE Increment;
output;
end;
run;

**************(2) BUILD THE FUNCTION TO EVALUATE OPTION PRICES********;
proc fcmp outlib = myfunc.finance.subrout;
subroutine mysubr(T, n, r, rf, s0, sigma, c1[*], c2[*]);
/*Two vectors are output arguments*/
outargs c1, c2;
/*Inside calculation from input arguments*/
dt=T/n;
length=dim(c1);
u=exp(sigma*sqrt(dt));
d=1/u;
q=(exp((r-rf)*dt)-d)/(u-d);
/*Announce 4 arrays -- 1 vector and 3 matrixes */
array k[1]/ nosymbols;
array s[1] /nosymbols;
array x1[1] /nosymbols;
array x2[1] /nosymbols;
n=n+1;
/*The sizes of the arrays are specified*/
call dynamic_array(s, n, n);
call dynamic_array(x1, n, n);
call dynamic_array(x2, n, n);
call dynamic_array(k, length);
/*Read the exchange ratio into function*/
rc=read_array('vector', k);
/*Assign values to S matrix */
call zeromatrix(s);
S[1,1]=S0;
do _col=2 to n ;
do _row=1 to n;
S[_row,_col]=S0*u**(_col-_row)*d**(_row-1);
if _row gt _col then S[_row, _col]=0;
end;
end;
/*Generate final option vectors */
do i=1 to length;
x=k[i];
call zeromatrix (x1);
call zeromatrix (x2);
do j=1 to n;
x1[j,n]=max(S[j,n]-x,0);   
x2[j,n]=max(S[j,n]-x,0)**0.5;     
end;
do _col=(n-1) to 1 by -1;
do _row=1 to (n-1) by 1;
h=exp(-r*dt)*(q*x1[_row,_col+1]+(1-q)*x1[_row+1,_col+1]);
h2=exp(-r*dt)*(q*x2[_row,_col+1]+(1-q)*x2[_row+1,_col+1]);
x1[_row,_col]=max(S[_row,_col]-X,h,0);    
x2[_row,_col]=max( max(S[_row,_col]-x ,0) **0.5, h2);  
end;           
c1[i]=x1[1,1];
c2[i]=x2[1,1];
end;
end;
endsub;
run;
quit;

***********SOME INITIAL VALUES*****************;
/*T=1/2;*/
/*n=80;*/
/*r=0.004472;    */
/*rf=0.01284;*/
/*S0=1.3721;*/
/*Sigma=0.1074*/

**************(3) MEASURE THE LENGTH OF PRICING VECTOR**************;
proc sql;
select count(*) into: vecnum from vector;
quit;

**************(4) USE THE SUBROUTINE TO GENERATE TWO VECTORS********************;
options cmplib = (myfunc.finance);
data final;
array c1[&vecnum] _temporary_;
array c2[&vecnum] _temporary_;
call mysubr(0.5, 80, 0.004472, 0.01284, 1.3721, 0.1074, c1, c2);   /*subroutine mysubr(T, n, r, rf, s0, sigma, c1[*], c2[*]);*/
do i=1 to dim(c1);
c1value= c1[i];
c2value=c2[i];
output;
end;
run;
****************TEST PASSED ----------- END****************************;

Thursday, December 2, 2010

Find the 'right' SAS functions


How many functions SAS has? Well, it sounds like a job interview question. For SAS 9.2, by querying the system dictionary (sashelp.vfunc or dictionary.functions), the exact answer is 946, including all functions and call routines. There are two types - unicode/bit based on input argument, while three types –numeric/character/bitwise based on output argument. Again according to their usage[1], the common SAS functionscan be categorized into several types: array(3), bitwise logical operation(3), PERL regular expression(11), character(91), time(38), descriptive statistics(32), random number(22), probability(18) , mathematics(36), finance(32), etc.

Some functions have evolved for several generations Since SAS has development history of more than 40 years. For example, there are 6 functions from SAS dictionary for random number generator from normal distribution, including ‘normal’, ‘rannor’ and its call routine, ‘rannorm’ and its call routine, and ‘rand’. All functions need seeds to produce random numbers and the random number queue can be replicated if with the same seed. The difference is that the latest one ‘rand’ has astronomical possibilities of seeds, while the older types only contains merely 2 trillion seeds that can cause dependence among various random number queues.

SAS handles data with rows as units (SAS calls row as observation), which is a unique characteristics, while most other software packages tend to process data with columns or vectors. Thus, many summarization functions in Data step only work on the ‘right’ by combining all variables in a row. As for vertical summarization, the SAS procedures are more appropriate, such as Proc Summary, Proc Means or Proc report. In a word, if we prefer SAS Data step, a transposition may be necessary.

Reference: 1. SAS 9.2 Language Reference: Dictionary, Third Edition. SAS Publishing. 2009.
2. Ron Cody. SAS Functions by Example, Second Edition. 2010.

/**********************AUTHOR(DAPANGMAO)----HCHAO8@GMAIL.COM***********************************;
*(1)CALCULATE THE COMPOSITION OF SAS FUNCTIONS*/
proc sgplot data=sashelp.vfunc;
vbar fnctype/barwidth=0.5
transparency=0.2 ; 
run;

proc sgplot data=sashelp.vfunc;
vbar source;
run;
proc sgplot data=sashelp.vfunc;
scatter x= source y=fnctype;
run;
proc sql;
select *
from sashelp.vfunc
where lowcase(fncname) in ('rannorm', 'rannor', 'normal', 'rand');
quit;
proc freq data=sashelp.vfunc;
tables fnctype*source/nopercent nocum norow nocol;
run;

/*(2)COMPARE SERVAL RANDOM FUNCTIONS*/
data one;
call streaminit(1234);
do i=1 to 10000;
x1=rannorm(1234);
x2=rannor(1234);
x3=normal(1234);
x4=rand('normal');
output;
end;
run;
data two;
seed1 = 1;
seed2 = 3;
seed3 = 5;
seed4=7;
do i = 1 to 10000;
call rannor(seed1, x1);
call rannor(seed2, x2);
call rannor(seed3, x3);
call rannor(seed4, x4);
output;
end;
run;
%macro test(input);
proc sgscatter data = &input;
title 'Independence test';
plot x1*x2 x1*x3 x3*x2 x1*x4 x2*x4 x3*x4 / markerattrs = (size = 1);
run;
%mend test;

/*(3)PICK OUT THE LARGEST THREE FROM VARIOUS TRANSACTIONS*/
data test;
attrib amt informat=dollar10.2 format=dollar10.2;
do id=1 to 10;
times=ceil(100*ranuni(12345));
do i=1 to times;
amt=10000*ranuni(123);
output;
end;
end;
drop i times;
run;
proc sort data=test out=test1; 
by id descending amt;
run;
data test2;
do _n_=1 by 1 until(last.id);
set test1 ;
by id;
if _n_<4 then output;
end;
run;
proc transpose data=test out=test3(drop=_name_);
by id notsorted;
var amt;
run;
proc sql;
select name into: vname separated by ', '
from sashelp.vcolumn
where libname='WORK'
and memname='TEST3'
and name contains 'COL'
;quit;
%put &vname;
data test4;
set test3;
call sortn(of &vname);  
vstd=std(of &vname);
vmax=max(of &vname);
vmean=mean(of &vname);
vmedian=median(of &vname);
vrange=range(of &vname);
vnum=n(of &vname);
vmissing=nmiss(of &vname);
no1=largest(1, of &vname);
no2=largest(2, of &vname);
no3=largest(3, of &vname);
run;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...