Wednesday, January 11, 2012

Benchmarking of the CUSUM function in SAS/IML


Cumulative sums can be obtained in SAS’s DATA step by the RETAIN statement. As the codes below, a new variable of the cumulative values will be returned by DATA step’s implicit DO loop.

data one;
   do i = 1 to 1e6;
      z = ranuni(0);
      output;
   end;
   drop i;
run;

data two;
   set one;
   retain y 0;
   y + z;
run;

The same logic can be realized by an explicit DO loop in SAS/IML. However, SAS/IML has a function CUSUM which is specially made for cumulative sums. To compare the efficiency of the two methods, I tried an experiment: calculating the cumulative sums for a random vector with incremental sizes from 1 million to 20 million by 1 million. The result shows that the CUSUM function is always 100 times faster than a raw DO loop. With the increase of the vector size, the gap gets wider. When the number of elements in the vector reaches 20 million, a Do loop would ask almost 40 seconds, while the CUSUM function requires only 0.35 seconds. Therefore, the vectorized CUSUM function is going to be a good candidate to do such jobs in SAS/IML.

proc iml;
   a = t(do(1e6, 2e7, 1e6));
   timer =  j(nrow(a), 2);
   do p = 1 to nrow(a);
      n = a[p];
      z = t(ranuni(1:n));
      
      t0 = time(); 
          x = cusum(z);
      timer[p, 1] = time() - t0;

      y = j(n, 1, .);
      t0 = time(); 
         do i = 1 to n;
            if i = 1 then y[i] = z[i];
            else y[i] = y[i-1] + z[i];
         end;
      timer[p, 2] = time() - t0;
   end;

   t =  a||timer;
   create _1 from t;
      append from t;
   close _1;
quit;

data _2;
   set _1;
   length test $100.;
   label col1 = "Number of observations" 
      time = "Time by seconds for cumulative summation";
   test = "The CUSUM function"; time = col2; output;
   test = "DO loop"; time = col3; output;
   keep test time col1;
run;

proc sgplot data = _2;
   series x = col1 y = time / curvelabel group = test;
   yaxis grid;
run;
P.S.
Thanks to Rick’s correction, I included his improved DO loop. Comparing with my previous DO loop with a conditional statement, this new DO loop is obviously more efficient. The conclusion is still the same: the CUSUM function outperforms the DO loops in this competition. And with more data involved, the contrast gets even more striking.


proc iml;
   a = t(do(1e6, 2e7, 1e6));
   timer =  j(nrow(a), 3);
   do p = 1 to nrow(a);
      n = a[p];
      z = t(ranuni(1:n));
      /* 1 -- The CUSUM function */
      t0 = time(); 
          x = cusum(z);
      timer[p, 1] = time() - t0;
      /* 2 -- Bad DO loop*/
      y = j(n, 1, .);
      t0 = time(); 
         do i = 1 to n;
            if i = 1 then y[i] = z[i];
            else y[i] = y[i-1] + z[i];
         end;
      timer[p, 2] = time() - t0;
      /* 3 -- Good DO loop */
      w = z;
      t0 = time(); 
         do i = 2 to n;
            w[i] = w[i-1] + z[i];
         end;
      timer[p, 3] = time() - t0;
   end;

   t =  a||timer;
   create _1 from t;
      append from t;
   close _1;
quit;

data _2;
   set _1;
   length test $100.;
   label col1 = "Number of observations" 
      time = "Time by seconds for cumulative summation";
   test = "The CUSUM function"; time = col2; output;
   test = "bad DO loop"; time = col3; output;
   test = "good DO loop"; time = col4; output;
   keep test time col1;
run;

proc sgplot data = _2;
   series x = col1 y = time / curvelabel group = test;
   yaxis grid;
run;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...