Friday, March 18, 2011

Array 2.0: matrix-friendly array in Proc Fcmp


Array is probably the only number-indexed data type in SAS. Interestingly, SAS and R, the major statistical softwares, use 1 instead of 0 to specify the first element. SAS programmers adopt array mostly for multiple-variable batch-processing. For example, longitudinal summation can be achieved by specifying a one-dimensional array and then adding all array elements together. On the contrary, general-purpose programming languages, like Java, Python, C++,  widely use array or list to store and index data. The programmers who newly switch into SAS often feel confused about the limitation of SAS’ array, which is transient and need output to the parental data step to save result. Since data step does not memorize the array’s original structure, multiple dimensional arrays are almost useless in creating more complex data structures. In SAS, those multiple dimensional arrays are occasionally implemented to reshape data [Ref. 1]. If a matrix, such as a big one with random numbers of 5000 rows by 5000 columns [Example 1], is needed, Other programming languages would use int[][] to declare a matrix. In SAS, 2D array, such as a [5000, 5000] array, may be intuitively chosen to perform this task. However, the result is not desired, and the right way is still to apply a one-dimensional array. 

Array in Proc Fcmp is an upgraded array data type other than data step array. This new version of array allows not only creating matrices but also indexing matrix. In addition, the communication between data step and Proc Fcmp is fast and convenient: this procedure supplies the READ_ARRAY() and WRITE_ARRAY() functions, which transform a dataset to an array and vice versa. For example, to test an interviewee’s knowledge on SAS’s data step array, a typical interview question is to ask her/him to write a 9X9 multiplication table [Example 2]. The expected solution is to place the OUTPUT statement between the inner Do-loop and outer Do-loop. Placing OUTPUT into the inner layer of Do-loops builds an 81*9 monster, while ignoring OUPTUT would only generate the last row of multiplication table. This task is much simpler in Proc Fcmp: just produce a matrix and write it to a dataset. Proc Fcmp is a nice alternative to SAS/IML, a specialized matrices language module in SAS. For instance, a typical SAS/IML kind of job, such as filling missing cells in a matrix (or a dataset) with its diagonal elements, can be fulfilled by arrays in Proc Fcmp [Example 3]. Proc Fcmp is shipped in SAS/BASE, which means that no extra out-of-pocket money is needed for another license. Another concern is the efficiency of SAS/IML module. It is reported that frequently calling of SAS/IML's functions would decrease system speed dramatically[Ref. 3].

The matrix feature of Proc Fcmp’s array benefits other SAS programming areas. For example, Proc Transpose and data step array usually reshape data structure from either long to wide or wide to long. However, in many cases that positions between observation and variable in a dataset have to be exchanged, the two methods may require several try-and-error steps. The TRANSPOSE() subroutine in Proc Fcmp solves such a problem easily [Example 4]. Currently there are 13 functions or subroutines available in Proc Fcmp for matrices operation[Ref. 2]. Some may complain they are still not enough for their particular need. Don’t forget: Proc Fcmp makes user-defined functions and subroutines! For example, to simulate set() function in Python, a deldup_array() function, based on encapsulating Proc SQL in a macro by RUN_MACRO(), can delete duplicate elements in array[Example 5]. Therefore, users of Proc Fcmp’s array can always construct and accumulate their tools to suit their purpose.

References: 1. UCLA Statistics Course. http://www.ats.ucla.edu/stat/sas/library/multidimensional_arrays.htm
2. SAS 9.2 Online Help. http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003193719.htm
3. Wang, Songfeng; Zhang, Jiajia. Developing User-Defined Functions in SAS®: A Summary and Comparison. SAS Global 2011.

******(1) EXAMPLE 1: CREATE A 5000X5000 RANDOM MATRIX****;
****(1.1) WRONG 2D ARRAY SOLUTION******;
data matrix1;
   array v[5000, 5000];
   do i = 1 to 5000;
     do j = 1 to 5000;
       v[i, j] = ranuni(0);
     end;
     output;
   end;
run;  

****(1.2) CORRECT 1D ARRAY SOLUTION*****;
data matrix2;
   array x[5000];
   do i = 1 to 5000;
     do j = 1 to 5000;
       x[j] = ranuni(0);
     end;
     output;
   end;
run;

******(2) EXAMPLE 2: CREATE A MULTIPLICATION TABLE******;
****(2.1) DATA STEP ARRAY SOLUTION ********;
data mt1;
   array a[9] a1-a9;   
   do row = 1 to 9;           
      do col= 1 to 9;    
         if row ge col then a[col]=row*col;  
      end;
     output;        
   end;
   drop row col;
run;

****(2.2) PROC FCMP ARRAY SOLUTION*****;
proc fcmp;
   array a[9, 9] / nosymbols;
   do row =1 to 9;
      do col = 1 to 9;
         if row ge col  then a[row,  col] = row*col;
      end;
   end;
   rc1 = write_array('mt2', a);
quit;

****(3) EXAMPLE 3: FILL MISSING CELL WITH DIAGONAL ELEMENT******;
****(3.0) INPUT RAW DATA****;
data have;
   input x1-x4;
   datalines;
   1 . . .
   2 1 . . 
   3 4 1 .
   7 6 5 1
;
run;

****(3.1) PROC FCMP TRANSPOSITION******;
proc fcmp;
   array a[4, 4] / nosymbols;
   rc1 = read_array('have', a);
   do i = 1 to 4;
      do j = 1 to 4;
         if missing(a[i, j]) = 1 then a[i, j] = a[j, i];
      end;
   end;
   rc2 = write_array('want', a);
quit;

*****(4) EXAMPLE 4: RESHAPE SQUARE-SHAPE DATA********;
****(4.0) INPUT RAW DATA********;
data have1;
   input x1-x5;
   cards;
   1 . 0 1 1
   0 1 . 0 0
   .  . 1 1 1
   0 0 0 1 .
   . 0 0 1 1
;
run;

****(4.1) PROC TRANSPOSE SOLUTION******;
data trps1;
   set have1;
   obs = _n_;
run;

proc transpose data = trps1 out = trps2;
   by obs;
   var x1-x5;
run;

proc sort data = trps2 out = trps3;
   by _name_;
run;

proc transpose data = trps3 out = want1_1;
   by _name_;
   var col1;
run;

****(4.2) PROC FCMP SOLUTION********;
proc fcmp;
   array a[5, 5] / nosymbols;
   rc1 = read_array('have1', a);
   array b[5, 5] ;
   call transpose(a, b);
   rc2 = write_array('want1_2', b);
quit;

******(5) EXAMPLE 5: FCMP DEDUPLICATION FUNCTION FOR FCMP ARRAY*******;
****(5.1) ENCAPSULATE DEDUPLICATIONA UTILITY OF PROC SQL IN A MACRO ******;
%macro deldup_array;
   %let dsname = %sysfunc(dequote(&dsname));
   %let arrayname = %sysfunc(dequote(&arrayname));
   /*(5.1.1) GENERATE UNIQUE DATASET AND ITS OBSERVATION NUMBER*/
   proc sql noprint;
      select count(unique(&arrayname.1)) into: obs_num
      from &dsname;
      create table _temp as
      select distinct *
      from &dsname;
   quit;
   /*(5.1.2) USE TEMP DATASET TO REPLACE RAW DATASET*/
   data &dsname;
      set _temp;
   run;
   /*(5.1.3) DELETE TEMP DATASET*/ 
   proc datasets;
      delete _temp;
   run;
%mend deldup_array; 

****(5.2) ENCAPSULATE MACRO ABOVE IN A FUNCTION*****;
proc fcmp outlib=work.func.practice;
   function deldup_array(dsname $, arrayname $);
      rc = run_macro('deldup_array', dsname, arrayname, obs_num);
      if rc eq 0 then return(obs_num);
      else return(.);
   endsub;
run;

****(5.3) USE THIS FUNCION TO DELETE DUPLICATES IN ARRAY*****;
option cmplib = (work.func) mlogic mprint symbolgen; 
proc fcmp;
   array a[1000] /nosymbols;
   do j = 1 to 1000;
      a[j] = ceil((ranuni(12345)*100) + rannor(12345));
   end;
   
   dsname = 'numbers'; 
   rc1 = write_array(dsname, a);
   
   n = deldup_array(dsname,  %sysfunc(quote(a)));
   
   call dynamic_array(a, n);
   rc2 = read_array(dsname, a);
   put a = ;
quit;

*****************END OF ALL CODING****************************;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...