Friday, January 21, 2011

A two-step transpose approach to reshape data



New in SAS 9.2, the TRANSPOSE procedure accepts multiple IDs in its ID statement. More than one IDs would automatically concatenate together as the new variable names. Previously, Proc Transpose usually only allows one ID. As the result, the concatenation of variable names has to be done by DATA step array in SAS 9.1 or earlier versions. This change would bring more flexibility to reshape data to any desired structure. In this case, a small file with date, gender and 3 credit records is transformed to a more flat data structure, only corresponding to the date. Gender would be moved from row name to column name, and consequently several new variables would be created to combine old variables: credit and gender.

A variety of methods in SAS can realize the reshaping purpose. The coding can follow the principle: first accumulate the numeric values in a single column; second expand them with its accompanying IDs as new variable names. As usual, Proc SQL is always the first choice to aggregate data by its Group By statement. In this example, splitting and combining did the trick but needs some more coding. To increase efficiency, building a macro may be useful. If the programming intention is to report, Proc Report has the magic power to display sub-categories with least code. However, variable names have to be re-defined in the following steps.

Two-step transpose by Proc Transpose is intuitive to change data structure horizontally. And it’s pretty extensible to even more complex data structure.
*******(0) INPUT RAW DATA***********;
data raw;
format date mmddyy10.;
input sex: $1. Date: mmddyy10. Credit1 Credit2 Credit3;
cards;
M       01/01/2011    600     610  650
M       01/02/2011    500     510   730
F       01/01/2011    700     710    820
F       01/02/2011    400     410   500
;
run;

*********(1)TWO-STEP TRANSPOSE***************;
******NOTE: only works in SAS 9.2*********;
proc sort data=raw out=raw_s; 
   by date sex;
run;

proc transpose data=raw_s out=raw_t;
   var credit:;
   by date sex;
run;

proc sort data=raw_t out=interim; 
   by date; 
run;

proc transpose data=interim out=final1(drop=_name_);
   var col1;
   by Date;
   id sex _name_ ;
run;

********(2)DATA STEP ARRAY: AN ALTERNATIVE FOR TWO-STEP TRANSPOSE******;
data arrout;
   set raw;
   array cr[*] credit:;
   do i=1 to dim(cr);
      cred=cr[i];
      cred_name=cats(sex,vname(cr[i]));
      output;
   end;
   keep date cred cred_name;
run;

proc sort data=arrout out=arrout_s; by date;run;
/*NOTE: a following transpose would be better than another data step array*/
proc transpose data=arrout_s out=final2(drop=_name_);
   by date;
   var cred;
   id cred_name;
run;

*********(3)PROC SQL AND MACRO********;
%macro reshape(max);
   %do i=1 %to &max;
      proc sql;
      create table out&i as
      select a.date,  a.credit&i  as Mcredit&i , b.credit&i  as    Fcredit&i
      from raw(where=(sex='M')) as a , raw(where=(sex='F'))  as b
      where a.date=b.date
      ;quit;
   %end;
   data final3;
   %do j=1 %to &max;
      set out&j;
   %end;
run;
%mend reshape;
%reshape(3);

******(4)PROC REPORT*********;
proc report data=raw nowd out=final4 ;
   column date sex,(credit:);
   define date/group;
   define sex/across;
run;

*****(5) ANOTHER EXAMPLE******;
data have;
    input id $ num;
    cards;
    A 1
    A 2
    A 3
    A 4
    B 1
    B 2
    B 3
    B 4
    B 5
    C 1
    ;
run;

proc transpose data=have out=temp;
    by id;
    var num;
run;

proc transpose data=temp out=want(drop=_name_);
    by _name_;
    id id;
    var col:;
run;
*********End of the program****Tested on 21Jan2011********;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...