Thursday, March 31, 2011

Optimize many-to-one mapping by user-defined functions


In many occasions, fast access into a lookup table to find desired value is necessary. In computer science, linked list, associative array, and hash table are widely used to construct the relationship between values and keys. Hash function, like value <-- index = Function(key), is essential to build such a hash table. Improving the hash function’s performance is pretty challenging and rewarding [Ref. 1]. In SAS, macro may be utilized to substitute function. However, macro would be failed in front of some cases, such as f(x1) + g(x2) or f(g(x)). Function or functional programming is still a better choice. With the user-defined function complier, Proc Fcmp, building reliable functions in SAS to map many keys to a single value looks promising.

However, SAS’s hash object is not applicable for coding interactive or reusable hash functions. The concept of hash object is introduced since SAS version 9, and it is ‘the first truly runtime dynamic, memory-resident DATA step structure’ [Ref. 2]. Evidence shows that it is robust, and most importantly, faster than other hard-disk based lookup solutions. And this technology is vigorously growing: SAS’s hash object has more methods, and can even accept duplicate keys [Ref. 3]. The biggest problem for SAS’s hash object is that it is transient and not savable. Hash object has to be declared within a data step to load data. If there is any other query, it has to be loaded again. If with frequently invocation, the loading factor will be formidable. As the result, SAS’s hash object is particularly useful for batch processing of lookup tasks, say, finding many key-value pairs simultaneously. That is probably why SAS programmers like to use hash object in merging datasets.

Thus, mapping many keys to one value by user-defined functions has to rely on other methods. Senior SAS programmers may prefer SAS arrays with help of the POINT option. However, SAS’s array is still transient. SQL and Proc Format are the options available. SQL is interactive and does not rely on any particular data type. Format is a unique data type in SAS created by Proc Format, which is permanent and exchangeable with a common SAS dataset. To test the efficiency of many-to-one mapping by user-defined functions, first a dataset of a million records with 2 keys and 1 value was simulated, and three functions were defined and complied by Proc Fcmp. Then those functions were repeatedly called for 10000 times. For SQL-based function, invoking each time is equivalent to querying the raw dataset once. On my notebook, the total time cost of running SQL-based function is more than 3 minutes, which is intolerable. Adding indexes to both keys of the lookup table will largely decrease the time expense to 1/6 of the original time. In addition, loading the lookup table into the memory, by the SASFILE statement, would improve the efficiency of the function a little further. Since Proc Format only allows one key, the two keys need to be concatenated as a composite key before finalizing the lookup format. Amazingly, calling of Format-based function 10000 times only spends less than 2 seconds. As the result, the Format-based function is the champion of this test. Understandably, the memory consumption is proportional to the functions’ efficiency: faster speed means more memory requirement. Much more than other methods, the Format-based function used 200 MB memories. I guess that the ‘format’ data type may be loaded into memory and stayed there during the processing time, which slashed the loading factor. In conclusion, Proc Format produced ‘format’ is not only an alternative solution to merge large datasets, but also is a reliable foundation to define key-value pairs by many-to-one mapping functions.

References:
1. Hash table on Wikimedia. http://en.wikipedia.org/wiki/Hash_table
2. Elena Muriel. Hashing Performance Time with Hash Tables. SAS Global 2007.
3. Robert Ray and Jason Secosky. Better Hashing in SAS 9.2. SAS Global 2008.

/*******************READ ME****************************************
* --- OPTIMIZE MANY-TO-ONE MAPPING BY USER-DEFINED FUNCTIONS ----
*
* HARDWARE:    a notebook with amd64 cpu 2.0g, 3g ram 
* SAS:         SAS 9.2(TS2M0), pc 64bit
* 
* TEST PASSED: 31mar2011
* AUTHOR:      hchao8@gmail.com
*
****************END OF READ ME************************************/

************(1) GENERATE TEST DATASET*********************;
******(1.1) SIMULATE A DATASET OF 1M RECORDS WITH 2 KEYS*********;
data raw;
  do key1 =1 to 1000;
    do key2 = 1 to 1000;
      value = ranuni(20100322) + rannor(20100322);
      output;
    end;
  end;
run;

******(1.2) DERIVE INDEXED DATASET*********;
proc sql;
  create table indexed as select * from raw;
  create index key1 on indexed(key1);
  create index key2 on indexed(key2);
quit;

******(1.3) TRANSFORM DATASET INTO FORMAT*********;
****(1.3.1) CAST VALUE FROM NUMBER TO CHARACTER*****;
proc format;
  picture key low-high = '0000'(fill = '0');
run;
****(1.3.2) COMBINE 2 KEYS TO 1 COMPOSITE KEY******;
data keycombined;
  set raw;
  length keystr $8;
  keystr = cats(put(key1, key.), put(key2, key.));
  drop key1 key2;
run;
****(1.3.3) DERIVE FORMAT DATASET*****;
data fmtds;
  set keycombined;
  rename keystr = start 
          value = label;
  fmtname = '$myfmt';
  type = 'C';
run;
****(1.3.4) INCORPORATE FORMAT DATASET TO BUILD FORMAT *****;
proc format cntlin = fmtds;
run;

**************END OF STEP (1)*****************;

**************(2) CREATE USER-DEFINED FUNCTIONS***********;
*******(2.1) BUILD THE FIRST FUNCTION************;
****(2.1.1) THE EMBEDDED MACRO************; 
option mstored sasmstore = sasuser;
%macro myfunc1_macro / store source;
  %let key1 = %sysfunc(dequote(&key1));
  %let key2 = %sysfunc(dequote(&key2));
   proc sql noprint;
      select value into: value
      from raw
      where key1 = &key1
          and key2 = &key2
  ;quit;
%mend;
****(2.1.2) CREATE THE FUNCTION AND OUTPUT***********;
proc fcmp outlib = sasuser.keyvalue.funcs;
   function myfunc1(key1, key2);
      rc = run_macro('myfunc1_macro', key1, key2, value);
      if rc eq 0 then return(value);
      else return(.);
   endsub;
run;

*******(2.2) BUILD THE SECOND FUNCTION************;
****(2.2.1) THE EMBEDDED MACRO************; 
option mstored sasmstore = sasuser;
%macro myfunc2_macro / store source;
  %let key1 = %sysfunc(dequote(&key1));
  %let key2 = %sysfunc(dequote(&key2));
   proc sql noprint;
      select value into: value
      from indexed
      where key1 = &key1
          and key2 = &key2
  ;quit;
%mend;
****(2.2.2) CREATE THE FUNCTION AND OUTPUT***********;
proc fcmp outlib = sasuser.keyvalue.funcs;
   function myfunc2(key1, key2);
      rc = run_macro('myfunc2_macro', key1, key2, value);
      if rc eq 0 then return(value);
      else return(.);
   endsub;
run;

*******(2.3) BUILD THE THIRD FUNCTION************;
proc fcmp outlib = sasuser.keyvalue.funcs;
  function myfunc3(key1, key2);
    key = put(cats(put(key1, key.), put(key2, key.)), $8.);
    value = put(key, $myfmt.);
    return(value);
   endsub;
run;

**************END OF STEP (2)*****************;

**************(3) TEST USER-DEFINED FUNCTIONS***********;
****(3.1) CREATE A TEST MACRO TO RUN FUNCTIONS 10000 TIMES******;
%macro test(num);
  option cmplib = (sasuser.keyvalue) mstored sasmstore = sasuser;
  data test;
     do x = 101 to 200;
        do y = 301 to 400;  
           z = myfunc&num(x, y);
           output;
        end;
     end;
  run;
%mend;

****(3.2) TEST THE FIRST FUNCTION********;
%test(1);

****(3.3) TEST THE SECOND FUNCTION********;
%test(2);

****(3.4) TEST THE SECOND FUNCTION WITH IN-MEMORY LOOKUP TABLE********;
sasfile indexed open;
%test(2);
sasfile indexed close;

****(3.5) TEST THE THIRD FUNCTION********;
%test(3);

**************END OF STEP (3)*****************;

*************************END OF ALL CODING***************************;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...