There is an interesting thread about the efficiency of PROC SQL on Mitbbs.com.
Both datasets have1 and have2 have few variables, but many observations. Very simple program, but without any result after running it for a whole afternoon.
proc sql; create table want as select * from have1 as a left have2 as b on b.serie_beg <= a.serie <= b.serie_end; quit;If I replace left join with inner join, the result is the same. Why it takes so long?
Four types of joins in PROC SQL
To understand this question, we may have to go back into PROC SQL. A technical paper introduces that like other more generalized database systems, there are four kinds of joins to combine two data sets (tables) in the SQL procedure of SAS: Cartesian join, Hash join, Merge join and Indexed join. Using the command
proc sqlstatement can let the log of SAS tell which join the current join is based on. For example,
sqxjslstands for Cartesian join;
sqxjhshis for Hash join;
sqxjmis for Merge join;
sqxjndxis for Indexed join.
Cartesian join uses two nested loops to join two tables. In other SQL systems, Cartesian join is often referred as Nested loop join. Suppose one table has N elements and the other has M elements, the complexity of this algorithm is , and is a bilateral merge that costs more resources. It can be shown in an example below.
************(0)Separate as 2 datasets***************; data male female; set sashelp.class; if sex = 'F' then output female; else output male; run; ************(1) Cartesian join ***************; proc sql _method; create table friends1 as select a.name as name1, b.name as name2, a.age as age1, b.age as age2 from male as a left join female as b on a.age > b.age ;quit;
_methodoption outlines the execution plan of the join. Here this Cartesian join can't be replaced by more advanced join, and SAS honestly indicates that it is the only approach available.
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: SQL execution methods chosen are: sqxcrta sqxjsl sqxsrc( WORK.FEMALE(alias = B) ) sqxsrc( WORK.MALE(alias = A) )
Merge join implements a sorting for each of the tables and merges them afterward. The time spent is the summation between the sorting and the scanning, and therefor the complexity is .
************(2) Merge join ***************; proc sql _method; create table friends2 as select a.name as name1, b.name as name2, a.age as age1, b.age as age2 from male as a left join female as b on a.age = b.age ;quit;
Clearly this join contains two
sqxsortsteps before the following
sqxjmstep. The logic here is quite similar to the traditional DATA Step's
NOTE: SQL execution methods chosen are: sqxcrta sqxjm sqxsort sqxsrc( WORK.FEMALE(alias = B) ) sqxsort sqxsrc( WORK.MALE(alias = A) )
Hash join loads the tables (if they are small enough to fit memory) into memory as hashs table and then do the scanning. The in-memory process is fast and the big O is .
************(3) Hash join ***************; proc sql _method; create table friends3 as select a.name as name1, b.name as name2, a.age as age1, b.age as age2 from male as a inner join female as b on a.age = b.age ;quit;
Usually such an inner join initiates the Hash join.
NOTE: SQL execution methods chosen are: sqxcrta sqxjhsh sqxsrc( WORK.MALE(alias = A) ) sqxsrc( WORK.FEMALE(alias = B) )
For relative large data sets that are frequently read, building an index is a good idea. An index is a basically a B-tree structure that translates to . If we have indexes for both lookup table and base table, the complexity will be like . In the demo below, I create two indexes for a lookup table and a simulated data set.
************(4) Indexed join ***************; data simu; do number = 1 to 1e6; output; end; run; proc sql _method; create index number on simu(number) ; create index age on male(age) ; create table friends4 as select a.name as name1, a.age as age1, b.number from male as a inner join simu as b on a.age = b.number ;quit;
Hereby the Indexed join has been automatically activated.
NOTE: SQL execution methods chosen are: sqxcrta sqxjndx sqxsrc( WORK.MALE(alias = A) ) sqxsrc( WORK.SIMU(alias = B) )
Alternative solutions to the question
The good thing for PROC SQL is that the execution plan manager always chooses the optimized join plan. The bad thing is that the user is not allowed to specify any particular join. As for the question at the beginning, the Cartesian joins are obviously chosen to realize the left joins. If the data sets are quite large, the waiting seems to be painful. However, I can imagine some other ways in SAS to apply faster algorithms and avoid the Cartesian joins if not necessary.
DATA Step Merge
sort-sort-mergeforces a merge join, which decreases the complexity from to .
If the lookup table is not very large, a user defined format with
PROC FORMATis a unique way in SAS to construct the hash table in memory to perform lookup or scanning, which further decreases the complexity from to .