Monday, June 4, 2012

Index tuning in SAS for high-volume transactional data



Why use indexes in SAS?
A page is the smallest I/O unit that SAS can read or write, including data set page and index file page. Index files in SAS are sorting and searching structures made by B-trees. “When an index is used to process a request, such as a WHERE expression, SAS does a binary search on the index file and positions the index to the first entry that contains a qualified value. SAS then uses the value’s RID or RIDs to read the observations containing the value. The entire index file is not loaded to memory; one index page is accessed at a time. The most recent index page is kept in memory”. Thus, by scanning the index file pages first, SAS may significantly reduce the number for logical reading and physical reading, and therefore improve query performance by orders of magnitude, say from O(N) to O(log(N)) for large data set.

In this example, I simulated a data set of 10 million transaction records starting from January 1st, 2005 with 7 variables including PRIMARY_KEY, HASH_KEY, FIRST_NAME, LAST_NAME , ORDER_AMOUNT, ORDER_DATE , SHIPPING_ADDRESS. This fake transactional data set is about 600 MB big on disk.

data transaction;
   retain primary_key hash_key first_name 
      last_name order_amount order_date;
   do primary_key = 1 to 1e7;
      first_name = md5(ranuni(1));
      last_name = md5(ranuni(2));
      shipping_address = md5(ranuni(3));
      hash_key = put(md5(primary_key), hex16.);
      order_amount = put(ranuni(4)*1000, dollar8.2);
      order_date = put('01jan2005'd + floor(ranuni(5)*2500), mmddyy9.);
      output;
   end;
run;

Index strategies: 
1. Create clustered index for primary key 
Clustered index gives data the physical row order on hard disk. Not like other relational database systems, SAS doesn’t have a special statement by either PROC SQL Or Data Step, to specify a clustered index. However, I found the easiest way is to use PROC SORT to sort the data set itself.

PROC SORT is also the prerequisite of the merge join at the DATA Step, while PROC SQL mostly uses the hash join. One thing to note is that the sorting should happen before the creation of other indexes. Otherwise, after sorting the existing index files may be lost.

proc sort data = transaction;
   by primary_key;
run;

2. Create unique index for hash value 
To avoid misleading identifiers for later joining or hide sensitive information, it is quite popular nowadays to transform the primary key to the hash value for future references. SAS has a MD5 function which can generate distinguishable 128-bit hash values. To display them, I choose a valid hex16 value format.

proc sql;
   create unique index hash_key on transaction(hash_key);
quit;

3. Create simple indexes 
Individual indexes are created for each of the variables such as ORDER_DATE, ORDER_AMOUNT and SHIPPING_ADDRESS.

proc sql;
   create index order_date on transaction(order_date);
   create index order_amount on transaction(order_amount);
   create index shipping_address on transaction(shipping_address);
quit;

4. Create composite indexes 
Then another composite index is generated to include the first name and the last name, in response to full name search.

proc sql;
   create index full_name on transaction(first_name, last_name);
quit;

proc contents data = transaction position;
run;


Eventually this clustered (or sorted) data set contains 5 indexes, which are stored in a same-name index file data set that occupies 700 MB disk separately. In a conclusion, although building a few indexes for a large transactional data set is time-consuming and disk-costly, a query in PROC SQL or DATA Step by the WHERE statement is right now much faster (slash processing time up to 80%-90%), which is quite rewarding for many read-heavy jobs.

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...