Monday, September 19, 2011

A test to count missing values for large data

This morning Rick introduced how to count the missing frequencies for all variables in a dataset, including character and numeric variables. He provided two solutions by either PROC FREQ or PROC IML. I have a petty macro based on PROC SQL’s nmiss() function to do the same job. In this big data era, I am interested in those SAS codes’ efficiencies to check large data.

Then I simulated a 1e8 size dataset (5 numerical and 5 character variables) and tested the 3 methods. According to the results, PROC FREQ is slightly better than PROC SQL in both memory usage and speed. However, a macro may be needed to integrate a number of output tables by PROC FREQ for reporting (the demo macro by SAS Knowledge Base is an option and can be improved). Sadly, PROC IML used out all the 250MB memories the server allocated to me and can’t finish this test. Similarly, the memory volume should be considered for other matrix languages like R and Matlab.

P.S.
Thanks to Rick’s new codes, missing values were counted successfully by PROC IML finally. The real time spent by PROC IML was 150 seconds, which is almost equal to running the macro by PROC SQL. It is interesting to compare the functions from these two procedures: COUNT() and NMISS() functions in PROC SQL and COUNTN() and COUNTMISS() functions in PROC IML can work for both numeric and character variables.


**********************(1) SIMULATION STEP**************************************;
data test;
   array _c{*} $ c1-c5;
   array _n{*} n1-n5;
    do i = 1 to 1e8;
       do j = 1 to 5;
          if ranuni(1234)>0.2 then _c[j]= 'A';
          else call missing(_c[j]);
          if ranuni(4321)>0.3 then _n[j] = rannor(0);
          else call missing(_n[j]);
       end;
       output;
       drop i j;
    end;
run;

**********************(2) MODULE-BUILDING STEP*********************************;
%macro countmiss(data = );
   ods listing close;
   ods output variables = _varlist;
   proc contents data = &data;
   run;
   proc sql;
      alter table _varlist
      add nmiss num
      ;
      select count(*) into :nvar
      from _varlist
      ;
      select count(*) into :nobs
      from &data
      ;
   quit;
   %do i = 1 %to &nvar;
      proc sql;
         select cats('nmiss(', variable, ')') into :miss_var
         from _varlist
         where num = &i
         ;
         select &miss_var into: miss_num
         from &data
         ;
         update _varlist
         set nmiss = &miss_num
         where num = &i
         ;
      quit;
   %end;
   proc sort data = _varlist;
     by num;
   run;
   ods listing;
   proc report data = _varlist nowd split = '*';
      columns  type num variable label nmiss pcnmiss;
      define type / group;
      define variable / width = 15;
      define nmiss / display 'Missing*Number';
      define pcnmiss / computed format = percent8.3
                  'Missing*Percentage' width = 10;
      compute pcnmiss;
         pcnmiss = nmiss / &nobs;
      endcomp;
   run;
%mend countmiss();

proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;

**********************(3) TESTING STEP***************************************;
options fullstimer;
*******************METHOD(1): PROC SQL***************************************;
%countmiss(data = test);
 
*******************METHOD(2): PROC FREQ*************************************;
proc freq data = TEST; 
   format _CHAR_ $missfmt.; 
   tables _CHAR_ / missing missprint nocum nopercent;
   format _NUMERIC_ missfmt.;
   tables _NUMERIC_ / missing missprint nocum nopercent;
run;
*******************METHOD(3): PROC IML***************************************;
proc iml;
   use test;
   read all var _NUM_ into x[colname=nNames]; 
   n = countn(x,"col");
   nmiss = countmiss(x,"col");
   read all var _CHAR_ into x[colname=cNames]; 
   close one;
   c = countn(x,"col");
   cmiss = countmiss(x,"col");
   Names = cNames || nNames;
   rNames = {"    Missing", "Not Missing"};
   cnt = (cmiss // c) || (nmiss // n);
   print cnt[r=rNames c=Names label=""];
quit;
********************END OF ALL CODING*****************************************;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...