Friday, February 4, 2011

Self-matching and its applications


Programming is all about data structure and algorithm. For example, value comparison needs to find right data structure and iteration method. To fulfill this purpose, the first thing is to load the variable with a key-value like data structure, followed by key-driven iterations. In SAS, the typical key-value data types are array and hash table. Proc Format powered format can be another option. For data merging or equivalent values searching, those data types are pretty adequate. If exact numerical difference is desired, those SAS data structures may have some obstacles. First of all, array and hash table have to be created in a data step, which means they cannot be saved and reused, unless they are unidirectional translated to another data structure - dataset. Second, data loading is somewhat cumbersome: as for array, data has to be transposed to be assigned to an array; as for hash table, data step inherent iteration has to be carefully avoided and hash object still has limited methods.

Value comparison can utilize either data step's inherent iteration or SQL's Cartesian-product iteration. Explicit loops could be largely bypassed through these means. For instance, to remove unrelated adjacent numbers in a grouped data set, a strategy called ‘look-ahead and look-back’ is suggested by one blog post on SAS Community[1]. Several SAS-L users kindly provided answers for this specific question [2]. Within the four workable solutions, using data step, Arhur Tabachneck and BBSER2009 implemented the lagged or temporary variables derived from the same variable to compare values; Joe Matise and Bolotin Yevgeniy realized the same goal by Proc SQL (Example 1).

The strategy can be further extended to the cases requiring more iterations. In one real-world sample, to search values with fixed difference in a variable, a conditional self-matching SQL command can achieve n-choosing-2 comparisons simply and quickly(Example 2). Finding longest repeated substring is a famous algorithm question in computer science. And the common answer is to construct Suffix tree and look for the deepest node. In another example below, self-matching by Proc SQL can be a short-cut solution without considering many algorithm details (Example 3).

Self-matching is not the most efficient algorithm. Proc SQL is demanding on memory, while intensive data I/O can cause data step freezing. Arhur Tabachneck metioned that the application on large dataset may lead into SQL infinite loop [2], which I think as a stack overflow due to calling too much memory. However, in many scenarios of values comparison, using self-matching technique could dramatically decrease coding complexity and save time.

References: 1. 'Look-Ahead and Look-Back'. http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back.
2. 'Removing numbers'. http://www.listserv.uga.edu/cgi-bin/wa?A1=ind1101d&L=sas-l#44.

*********(1)EXAMPLE1 --  REMOVE UNRELATED NUMBERS WITHIN GROUPS*****;
******(1.0) INPUT OF THE RAW DATA******;
data one;
 input id:$1. var1 @@;
 cards;
a  4 b  2 b  3 b  4 b  8 b  9 b  10 b  11
b  13  b  15 b  16 c  1 c  2  c  5
;
run;

*******(1.1) ARTHUR TABACHNECK'S PLAN************;
data AT_RESULT (drop=prev_: next_:);
    set one;
    by id;
    set one ( firstobs = 2 keep = var1
            rename = (var1 = next_var1) )
    one ( obs = 1 drop = _all_ );
    prev_var1 = ifn( first.id, (.), lag(var1) );
    next_var1 = ifn(  last.id, (.), next_var1 );
    if abs(var1-prev_var1) eq 1 or
    abs(var1-next_var1) eq 1 then output;
run;

******(1.2) BBSER2009'S PLAN****;
data BBSER2009_RESULT(rename=(temp_id=id temp_var1=var1)) ;
      length pre_id $1;
      retain pre_id; retain pre_var1 .; retain pre_op 0;
      set one;
      if var1-pre_var1=1 then  do;
         if pre_op=0 then do;
    temp_id=pre_id;
    temp_var1=pre_var1;
    output;
    end;
        temp_id=id;
      temp_var1=var1;
    output;
        pre_op=1;
      end;
      else pre_op=0;
      pre_id=id; pre_var1=var1;
   keep temp_id temp_var1;
run;

********(1.3) JOE MATISE'S PLAN*************;
proc sql;
create table JM_RESULT  as
  select * from one h
  where exists (
      select 1 from one v
      where h.id=v.id
      and (h.var1 = v.var1-1
        or h.var1 = v.var1+1));
quit;

*******(1.4) BOLOTIN YEVGENIY'S PLAN*****;
proc sql;
  create table BY_RESULT as
       select distinct a.*
       from    one a
               inner join one b
       on a.id = b.id
       where abs(a.var1 - b.var1)=1;
quit;

*********END OF EXAMPLE (1)**********;

*******(2) EXAMPEL 2 -- FIND ANY VALUES WITH FIXED DIFFERENCES****** ;
*****(2.1) INPUT RAW DATA******;
data timeseries;
    input  date1 date9.  var1;
    cards;
    11jan2008  11
    12jan2008 10
    13jan2008 21
    14jan2008 20
    15jan2008 30
    16jan2008 30
;
run;

*****(2.2) FIND ALL DATES WITH UNTI DIFFERENC EQUALS 10****;
proc sql;
    create table timeseries1 as
    select a.date1 as date1 format=date9., a.var1 as unit1, b.date1 as date2 format=date9., b.var1 as unit2
    from timeseries as a, timeseries as b
    where a.var1-b.var1=10
;quit;

****END OF EXAMPLE (2)**********;

****(3) EXAMPLE3 --  FIND LONGEST DUPLICATED SUBSTRING SEARCH****;
***(3.0) INPUTRAW DATA***;
data test(where=(missing(char)=0));
   input char $1.@@;
    id=_n_;
   cards;
abcaabcabd
;
run;

***(3.1) SINGLE CHARACTER MATCHING****;
proc sql;
    create table test1 as
    select  a.id as id1, b.id as id2, a.char as char1, b.char as char2, 
            abs(id1-id2) as pos
    from test as a, test as b
    where a.char=b.char and id1 lt id2
    order by pos, id1
;quit;

****(3.2) REMOVE ONE-CHARACTER DUPLICATES****;
data test2;
  set test1;
  by pos;
  dif = dif(id1);
  if first.pos then dif=1;
  if dif eq 1;
  if first.pos and last.pos then delete;
run;

****(3.3) CONCATENATING DUPLICATES****;
proc transpose data=test2 out=test3;
    by pos ; 
    var char1;
    run;
    
data test4;
    set test3;
    array allvar[*] col:;
    repeat = cats(of allvar[*]);
run;

****(3.4) SHOW THE LONGEST DUPLICATED SUBSTRING****;
proc sql;
    select max(repeat) 'The longest duplicated substring'
    from test4
;quit;

******END OF EXAMPLE(3)****************;

*****(4) EXAMPLE 4 -- FIND THE PRODUCT FROM ANY TWO NUMBERS IN AN ARRAY***

data have;    
    input var1;
    cards;
    1
    -3 
    3
    9
    12
    23
    34
    5
    8
    9
    -5
    ;
run;

data temp;
    set have;
    idx = _n_;
run;

proc sql;
    create table want as
    select a.var1 * b.var1 as product
    from temp as a, temp as b
    where a.idx ne b.idx
;quit;
**************END OF EXAMPLE 4**********;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...