Thursday, April 26, 2012

10 keywords taken out from SAS Global Forum 2012


1. In-memory
SAS is famous for hitting hard disk at every operation, which is a proved strategy to save memory.  To speed up the processing of ‘Big Data’, SAS at the server side will aggregate memories, load data into memory and then deal with data there, which is 1000 times faster than the hard disk based operation.

2. Hadoop
Informationweek described that Dr. Goodnight, CEO of SAS, loathes Hadoop, the distributed open source platform. However, this time SAS presented its DI Studio and SAS/ACCESS interface, which now allows data access by Hive and Pig. It looks like a challenge for SAS to run its statistical procedures on HDFS.

3. Web application
SAS’s applications obviously start to move toward web, like SAS Visual Analytics, which also fits various mobile devices. There is a way to distinguish a desktop application and a web application in SAS: the former’s default background color is white and the latter is black.

4. High performance procedures
SAS is vigorously developing the procedures with HP as prefix, mostly for the servers. Currently 10-20 procedures in SAS/BASE and SAS/STAT can find their counterpart, such as PROC HPSummary and PROC HPLogistic. Those procedures can also run locally but won't not improve the efficiency significantly.

5. JMP
JMP remains as a lean desktop analysis package while SAS evolves toward gigantic enterprise solution platforms. One interesting thing -- you can always find the motion chart (which JMP can do and SAS can’t) and John Sall at the demo area.

6. SAS 12.1?
Next release of SAS is not 9.4, but 12.1. SAS version 9, including 9.1, 9.2 and 9.3, dwells for almost a decade. Thus, the version update from 9.3 to 12.1 is quite a great leap forward.
Correction - thanks to Chris Hemedinger, 'the new release numbering applies only to the analytical products (STAT, ETS, and so on)'.

7. Data Step 2?
To support data management along with the high performance procedures at the servers, a language called DS2 is under development. It is a strong typing language more like Java or C++ more than Data Step. However, SAS has a macro which can transform Data Step codes to Data Step 2.

Thanks to the corrections by Jason Secosky, who is the development manager for DS2 --
"While DS2 is based on the DATA step, its name is just "DS2" not "DATA step 2". DS2 is statically typed, not strongly typed. Ok, ok, there is no implicit type conversion between some types, like double and timestamp, yet there are functions to explicitly convert these values.
And, there is a PROC that can be used to translate DATA steps generated by SAS Enterprise Miner to DS2. The PROC isn't intended to convert *any* DATA step to DS2."

8. In-database
SAS’s in-database technology now supports all database systems beyond Teradata and Greenplum. To avoid compilation error, it is better to apply ANSI SQL functions instead of SAS’s own function. As for me, it is still not very clear how SAS passes its statistical procedures into the relation database systems.

9. Risk management
SAS’s risk management platform is quite mature and implemented the latest procedures like PROC COPULA. It seems that the end users have to own Bloomberg or other vendor’s license to update the market data.

10. Statistical graph
More SAS’s procedures and solution plans integrated layer-based statistical graph technology to visualize results. Still SAS’s Windowing Environment still doesn’t support the syntax highlighting for  Graph Template Language and the SG procedures, since it always shows red fonts warning syntax errors.

Thursday, April 19, 2012

Stored Processes: SAS's voice on Business Intelligence

Everyday I write SAS scripts to extract, transform and load data from various sources, which is a step before the database, and also pull out data to do analysis such as aggregation and regression in SAS, which is a step after the database. According to Norman Nie, a data shop has a four-layer structure: ETL layer, database layer, analytics layer and BI layer. It seems that recently the database layer and the analytics layer look more and more identical. The relational databases start to fiercely create their statistics arms, such as the SQL Server Analysis Services and Oracle R Enterprise. Tricia and Angela’s new book, The 50 Keys to Learning SAS Stored Processes, discloses SAS’s revenge on Business intelligence layer with its stored processes, which are similiar to the stored procedures usually carried out by the database systems.

This book has 12 chapters, including how to create stored processes, optimize the stored processes and implement the stored processes. The book is easy to read and contains detailed illustrations with lots of colorful graphics. Following the examples, I may create my own stored processes in SAS, for example, a simple query to count numbers of a small data set SASHELP.CLASS by gender. It is also interesting to compare SAS’s stored process with the databases' 
stored procedure such as SQL Server -- they have the same functionality. Therefore, on a Windows server, a developer will be able to choose either SQL Server or SAS to build the web applications.

Now with the stored processes, a SAS programmer may evolve toward a SAS developer. Tricia and Angela’s book will be a good reference for this role transition.

/*********(1) LOAD DATA(SASHELP.CLASS) TO SQL SERVER OR SAS *************/
create table class (name char(8), sex char(1), 
age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

/**********************(2) A STORED PROCEDURE IN SQL SERVER **************/
IF OBJECT_ID('CntBySex') IS NOT NULL
   drop proc CntBySex;
go

create proc CntBySex @sex char(1) = null, @Cnt int output
as
   select @Cnt = COUNT(*)
   from class
   where sex = ISNULL(@sex, sex)
go

declare @OutCnt int
exec CntBySex @Cnt = @OutCnt output
select @OutCnt as AllCnt

exec CntBySex @sex = 'F', @Cnt = @OutCnt output
select @OutCnt as FemaleCnt

exec CntBySex @sex = 'M', @Cnt = @OutCnt output
select @OutCnt as MaleCnt
go


/*************(3) A STORED PROCESS IN SAS *********************/
%macro query;
proc sql;
   select count(*) 
   from class
   %if %length(&sex) > 0 %then %do;
      where sex = &sex;
   %end; 
;
quit;
%mend query;

%stpbegin; 
   %query;
%stpend;

Thursday, April 12, 2012

Correlations of three variables

Question
There is an interesting question in statistics --
“There are 3 random variables X, Y and Z. The correlation between X and Y is 0.8 and the
correlation between X and Z is 0.8. What is the maximum and minimum correlation between Y and Z?”

Solutions

1. Geometric illustration
The value of corr(Y, Z) is the COS function of the angle between Y and Z. We already know the corr(X, Y) and corr(X, Z). In this particular case, the angle can be zero, which suggests Y and Z are identical and the max value of corr(Y, Z) is 1. The min value of corr(Y, Z) is caused by the biggest angle between Y and Z, which is 0.28.

2. Positive semi-definiteness property of the correlation matrix
Due to this feature, the determinant of the correlation matrix is greater than or equal to zero. Thus we will be able to construct a quadratic inequality to evaluate the boundaries, which is from 0.28 to 1.

proc fcmp outlib=work.funcs.test1;
  function corrdet(x, a, b);
    return(-x**2 + 2*a*b*x - a**2 -b**2 +1);
  endsub;
  function solvecorr(ini, a, b);
    array solvopts[5] initial abconv relconv
          maxiter solvstat (.5 .001 1.0e-6 100);
    initial = ini;
    x = solve('corrdet', solvopts, 0, ., a, b);
    return(x);
  endsub;
quit;

options cmplib = work.funcs;
data one;
  * Max value;
  upper = solvecorr(1, 0.8, 0.8);
  upper_check = corrdet(upper,0.8,0.8);
  * Min value;
  lower = solvecorr(-1, 0.8, 0.8);
  lower_check = corrdet(lower,0.8,0.8);
run;




Generalization
We can generalize the question to all possibilities for corr(X, Y) and corr(X, Z). First we need to create two user-defined functions to solve the maximum and the minimum values. Then we will be able to draw the max values and min values in the same plot. It is very interesting to see that only four points the upper surface and lower surface converge together, which are (1, 1, 1), (-1, 1, -1), (1, -1, 1) and (-1, -1, -1).

A lot other phenomenon can be summarized from this plot, such as that when corr(X, Y) = corr(X, Z) the max value of corr(Y, Z) is always equal to 1.


proc fcmp outlib = work.funcs.test2;
  function upper(a, b);
     x = 4*(a**2)*(b**2) - 4*(a**2+b**2-1);
     if x ge 0 then y = -0.5*(sqrt(x) - 2*a*b);
     else y = .;
     return(y);
  endsub;
  function lower(a, b);
     x = 4*(a**2)*(b**2) - 4*(a**2+b**2-1);
     if x ge 0 then y = -0.5*(-sqrt(x) - 2*a*b);
     else y = .;
     return(y);
  endsub;
quit;

data two;
   do xy = -.99 to .99 by 0.01;
      do xz = -.99 to .99 by 0.01;
         upper = upper(xy, xz);
         lower = lower(xy, xz);
         output;
      end;
   end;
run;

proc template;
   define statgraph surface001;
   begingraph;
      layout overlay3d / cube = false rotate = 150 tilt = 30
         xaxisopts = (label="Correlation between X and Y") 
         yaxisopts = (label="Correlation between X and Z") 
         zaxisopts = (label="Boundaries of correlation between Y and Z") ;
      surfaceplotparm x = xy y = xz z = upper; 
      surfaceplotparm x = xy y = xz z = lower; 
      endlayout;
   endgraph;
   end;
run;

proc sgrender data =  two template = surface001;
run;

Monday, April 2, 2012

The years to get green card for Indian and Chinese

The path toward a green card is especially difficult for Indian and Chinese who are working in the US and named as EB2 workers, since they have to wait years to submit an i485 form after being approved with their i140 form submission. Once the cutoff date set by USCIS each month catches the priority date that is a date of filling labor certification and decides the position in the long line, the applicants holding their priority date will be able to file the i485 form. However, the cutoff dates are unpredictable for the public, and “it is impossible to accurately estimate how long that may take”.

Situation
The recent cut-off dates data can be copied and pasted from Wiki. Then I adjusted the cut-off dates within the retrogression period. In the past decade, the waiting years range from about 0 year to 4.25 years. July 2007 looks like an amnesty for those who have priory date before that, otherwise people have to wait at least 1.85 years. Applicants from the adjacent 3 or 4 years usually wait in line for the door to be open. The door is recently shut off on June 2012 and hopefully will be open again one day in 2014.

data pd;
   input @1 _cmon $3. @5 _cyear $4. @13 _pdmon $3. @17 _pddate : @21 _pdyear $2.;
   format c_time pd_time date9.;
   c_time = input(cats('01', _cmon, _cyear), date9.);
   pd_time = input(cats(_pddate, _pdmon,_pdyear), date9.);
   dif_day = c_time - pd_time;
   dif_year = dif_day/365;
   drop _:;
cards;
May 2012   Aug 15  07   
Apr 2012   May 1   10   
Mar 2012   May 1   10   
Feb 2012   Jan 1   10   
Jan 2012   Jan 1   09   
Dec 2011   Mar 15  08
Nov 2011   Nov 1   07   
Oct 2011   Jul 15  07
Sep 2011   Apr 15  07
Aug 2011   Apr 15  07
Jul 2011   Mar 8   07   
Jun 2011   Oct 15  06
May 2011   Jul 1   06   
Apr 2011   May 8   06   
Mar 2011   May 8   06   
Feb 2011   May 8   06   
Jan 2011   May 8   06   
Dec 2010   May 8   06   
Nov 2010   May 8   06   
Oct 2010   May 8   06   
Sep 2010   May 8   06   
Aug 2010   Mar 1   06   
Jul 2010   Oct 1   05   
Jun 2010   Feb 1   05   
May 2010   Feb 1   05   
Apr 2010   Feb 1   05   
Mar 2010   Feb 1   05   
Feb 2010   Jan 22  05
Jan 2010   Jan 22  05
Dec 2009   Jan 22  05
Nov 2009   Jan 22  05
Oct 2009   Jan 22  05
Sep 2009   Jan 8   05   
Aug 2009   Oct 1   03   
Jul 2009   Jan 1   00   
Jun 2009   Jan 1   00   
May 2009   Feb 15  04
Apr 2009   Feb 15  04
Mar 2009   Feb 15  04
Feb 2009   Jan 1   04   
Jan 2009   Jul 1   03   
Dec 2008   Jun 1   03   
Nov 2008   Jun 1   03   
Oct 2008   Apr 1   03   
Sep 2008   Aug 1   06   
Aug 2008   Jun 1   06   
Jul 2008   Apr 1   04   
Jun 2008   Apr 1   04   
May 2008   Jan 1   04   
Jul 2007   Jul 1   07
Jun 2007   Apr 1   04   
May 2007   Jan 8   03   
Apr 2007   Jan 8   03      
Mar 2007   Jan 8   03      
Feb 2007   Jan 8   03      
Jan 2007   Jan 8   03      
Dec 2006   Jan 8   03      
Nov 2006   Jan 8   03   
;;;
run;

proc sql;
   create table pd1 as
   select a.pd_time 'priority date', a.c_time, (select min(c_time) from pd as b 
      where a.pd_time le b.c_time and a.pd_time le b.pd_time) as i140_submit_date format date9.,
      (calculated i140_submit_date - a.pd_time) / 365 as i140_waiting_year 
   from pd as a 
   where year(pd_time) gt 2002
   order by a.pd_time
;quit;

proc sgplot data = pd1;
   series x = pd_time y = i140_submit_date;
   xaxis grid;
run;

proc sgplot data = pd1;
   series x = pd_time y = i140_waiting_year;
   yaxis grid;
run;


Possible change
The cut-off dates much depend on the changes of law and policy which reflects the economic environment. Therefore the US unemployment rate is possibly helpful in predicting the fluctuation of the cut-off dates. First I imported such data, and then transformed the difference between the priority date and the cut-off date to 6-month average. Then the two curves match well. I can even fit the relationship with a simple linear regression. As a conclusion, one percent of unemployment rate decrease may shorten 1/3 year of the waiting time from i140 to i485.

filename _infile url "http://research.stlouisfed.org/fred2/data/UNRATE.txt" debug lrecl=100;
data unempl;
   infile _infile missover firstobs = 22;
   format date date9.;
   input @1 date yymmdd10. @13 unemployement_rate 4.1;
run;

proc sql;
   create table combine as
   select a.dif_year, b.*
   from pd as a, unempl as b
   where a.c_time = b.date
;quit;

proc expand data = combine out = combine1 method=none;
   convert dif_year = ma_dif_year / transform = (movave 6);
run;

proc sgplot data = combine1;
   series x = date y = ma_dif_year;
   series x = date y = unemployement_rate;
   xaxis grid; yaxis grid label = ' ';
run;

proc reg data = combine1;
   model ma_dif_year = unemployement_rate;
   ods select ParameterEstimates FitPlot;
run;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...