Tuesday, July 9, 2013

Why sometimes use DATA Step instead of PROC SQL

A question
There is an interesting thread about the efficiency of PROC SQL on Mitbbs.com.
Both datasets have1 and have2 have few variables, but many observations. Very simple program, but without any result after running it for a whole afternoon.
proc sql;
    create table want as 
    select *
    from have1 as a left have2 as b
    on b.serie_beg <= a.serie <= b.serie_end;
quit;
If I replace left join with inner join, the result is the same. Why it takes so long?
Four types of joins in PROC SQL
To understand this question, we may have to go back into PROC SQL. A technical paper introduces that like other more generalized database systems, there are four kinds of joins to combine two data sets (tables) in the SQL procedure of SAS: Cartesian join, Hash join, Merge join and Indexed join. Using the command _method at the proc sql statement can let the log of SAS tell which join the current join is based on. For example, sqxjsl stands for Cartesian join; sqxjhsh is for Hash join; sqxjm is for Merge join; sqxjndx is for Indexed join.
  • Cartesian join
Cartesian join uses two nested loops to join two tables. In other SQL systems, Cartesian join is often referred as Nested loop join. Suppose one table has N elements and the other has M elements, the complexity of this algorithm is O(N*M ), and is a bilateral merge that costs more resources. It can be shown in an example below.
************(0)Separate as 2 datasets***************;
data male female;
    set sashelp.class;
    if sex = 'F' then output female;
    else output male;
run;

************(1) Cartesian join ***************;
proc sql _method;
    create table friends1 as
    select a.name as name1, b.name as name2, a.age as age1, b.age as age2
    from male as a left join female as b on a.age > b.age
;quit;
The _method option outlines the execution plan of the join. Here this Cartesian join can't be replaced by more advanced join, and SAS honestly indicates that it is the only approach available.
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjsl
              sqxsrc( WORK.FEMALE(alias = B) )
              sqxsrc( WORK.MALE(alias = A) )
  • Merge join
Merge join implements a sorting for each of the tables and merges them afterward. The time spent is the summation between the sorting and the scanning, and therefor the complexity is O(Nlog(N) + Mlog(M)).
************(2) Merge join ***************;
proc sql _method;
    create table friends2 as
    select a.name as name1, b.name as name2, a.age as age1, b.age as age2
    from male as a left join female as b on a.age = b.age
;quit;
Clearly this join contains two sqxsort steps before the following sqxjm step. The logic here is quite similar to the traditional DATA Step's sort-sort-merge.
NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjm
              sqxsort
                  sqxsrc( WORK.FEMALE(alias = B) )
              sqxsort
                  sqxsrc( WORK.MALE(alias = A) )
  • Hash join
Hash join loads the tables (if they are small enough to fit memory) into memory as hashs table and then do the scanning. The in-memory process is fast and the big O is O(N+M ).
************(3) Hash join ***************;
proc sql _method;
    create table friends3 as
    select a.name as name1, b.name as name2, a.age as age1, b.age as age2
    from male as a inner join female as b on a.age = b.age
;quit;
Usually such an inner join initiates the Hash join.
NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjhsh
              sqxsrc( WORK.MALE(alias = A) )
              sqxsrc( WORK.FEMALE(alias = B) )
  • Indexed join
For relative large data sets that are frequently read, building an index is a good idea. An index is a basically a B-tree structure that translates to O(log(N)). If we have indexes for both lookup table and base table, the complexity will be like Max[O(Mlog(N)), O(Nlog(M))]. In the demo below, I create two indexes for a lookup table and a simulated data set.
************(4) Indexed join ***************;
data simu;    
    do number = 1 to 1e6;
        output;
    end;
run;

proc sql _method;
    create index number on simu(number)
    ;
    create index age on male(age)
    ;
    create table friends4 as
    select a.name as name1, a.age as age1, b.number
    from male as a inner join simu as b on a.age = b.number
;quit;
Hereby the Indexed join has been automatically activated.
NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjndx
              sqxsrc( WORK.MALE(alias = A) )
              sqxsrc( WORK.SIMU(alias = B) )
Alternative solutions to the question
The good thing for PROC SQL is that the execution plan manager always chooses the optimized join plan. The bad thing is that the user is not allowed to specify any particular join. As for the question at the beginning, the Cartesian joins are obviously chosen to realize the left joins. If the data sets are quite large, the waiting seems to be painful. However, I can imagine some other ways in SAS to apply faster algorithms and avoid the Cartesian joins if not necessary.
  • DATA Step Merge
The sort-sort-merge forces a merge join, which decreases the complexity from O(N*M ) to O(Nlog(N) + Mlog(M)).
  • PROC FORMAT
If the lookup table is not very large, a user defined format with PROC FORMATis a unique way in SAS to construct the hash table in memory to perform lookup or scanning, which further decreases the complexity from O(N*M ) to O(N+M ).

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...