Friday, March 11, 2011

Effectiveness of two mailing lists: SAS-L and R-help


A software package is only as strong as the community behind it. A commercial package comes with a technical support guarantee, but telephone-based customer service may be too slow for fast-changing programming needs. Statistical packages such as SAS and R typically handle many small extraction, transformation, loading and analysis tasks, so a quick, short answer to a tricky question is what users want most. A community mailing list is a fast way to get a question posted and solved. With Google’s Gmail, the huge volume of email generated by such lists can be collected and sorted conveniently. As far as I can tell, R-help and SAS-L are probably the most popular user groups for R and SAS, respectively. As a learner, I constantly receive kind help from the SAS and R gurus in the two communities. To compare the two user groups, I gathered the threads in my Gmail up to March 7th this year and parsed them into digestible data.

R’s ever-increasing popularity is reflected in the enormous number of topics under discussion. On average, 37.4 questions are posted on R-help every day, compared with a meager 8.8 a day on SAS-L. R-help users tend to discuss a wide range of questions, most often about modeling and visualization with a specific package, while SAS-L users are mainly interested in data integration and management, such as Proc SQL, the data step and macros. This difference in usage may reflect the distinct roles SAS and R play in daily practice. SAS-L users seem more familiar with the background of the questions others ask: a question gets 5.0 follow-ups on average. By contrast, R-help users can expect 2.8 follow-ups for a question they post, and the chance of getting no reply at all is also high. On SAS-L, information is concentrated in several senior SAS experts who are experienced and ready to provide clues. R-help users are more diverse, possibly because R has more than 3000 packages and it is difficult for any one R user to be an all-around player.

SAS and R also have other active mailing lists and forums for technical discussion. For those of us who love the SAS-L and R-help mailing lists and wish to become two-way statistical programmers, getting to know some aspects of the two user groups can help us use them effectively.

****(0) EXPORT THE THREAD HEADERS FROM MY GMAIL TO FLAT TEXT FILES, ONE FILE PER GROUP TAG****;
********(1) PREPARE A MACRO TO EXTRACT KEY WORDS FROM TEXTS***************;
%macro extract(group);
   /*(1.1) INTEGRATE LINES FROM RAW TEXT*/
   data &group.ug;
      infile "H:\Regular_expression\&group._group.txt" truncover dsd  lrecl=400;
       input string $400.;
   run;
   /*(1.2) REMOVE REDUNDANT LINES CAUSED BY GMAIL TAGS AND INDEX EACH LINE BY THREAD*/
   data &group.ug_c;
      set &group.ug;
      where string is not missing
              and string not in ('LinkedIn', 'R-Group', 'Inbox', 'SAS(noreply)', 'SAS-L');
      thread = ceil(_n_/3);
   run;
   /*(1.3) TRANSPOSE TOPIC|POST AUTHORS|TIME*/
   proc transpose data=&group.ug_c out=&group.ug_s(rename=(col1 = writer col2 = topic col3 = time));
      by thread;
      var string;
   run;
   /*(1.4) PARSE TARGET INFORMATION FOR TOPIC|POST AUTHORS|TIME*/
   data &group.ug1;
      set &group.ug_s(drop=thread _name_);
      /*(1.4.1) PARSE POST AUTHOR   */
       position = prxmatch("/\(\d+/",writer);
       if position ne  0 then  
            rep_num = input(compress(substr(writer, position+1, 2), ')'), 2.) ;
       else rep_num = 0; 
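      /*(1.4.2) PARSE THREAD TOPIC: KEEP THE SUBJECT UP TO ITS QUESTION MARK*/
      length topic_str $200;
      idx = index(topic, '? -');
      /* GUARD AGAINST IDX = 0, WHICH WOULD PASS AN INVALID LENGTH TO SUBSTR */
      if idx > 0 then topic_str = substr(topic, 1, idx);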
      /*(1.4.3) PARSE TIME*/
      length time1 $6.;
      if sum(index(time, 'pm'), index(time, 'am')) > 0 then time1 = 'Mar 7';
      else time1 = strip(time);
      date = input(cats(substr(time1, 5, 2), substr(time1, 1, 3), '2011'), date9.);
      format date date9.;
      drop idx  writer topic position time:;
   run;
   /*(1.5) SUMMARIZE BY DATE*/
   proc sql;
      create table &group.ug2 as
      select mean(rep_num) as &group._mean_reply format=3.1,
            sum(ifn(rep_num=0, 1, 0)) as &group._zero_reply,
            max(rep_num) as &group._max_reply, count(topic_str) as &group._topicnum, 
            calculated &group._zero_reply / calculated &group._topicnum   
                  as &group._zero_reply_ratio format=3.1,
            date
      from &group.ug1
      group by date
   ;quit;
%mend extract;

**********(2) EXECUTE MACRO TO BUILD DATASETS FOR 2 GROUPS*****;
options mprint symbolgen;
/* GROUP IS A POSITIONAL PARAMETER, SO IT IS PASSED WITHOUT A KEYWORD */
%extract(sas);
%extract(R);

*********(3) CONCATENATE DATA AND REPORT****;
****(3.1) CONCATENATE 2 DATASETS FROM SAS AND R******; 
proc sql;
   create table reportdata as
   select mean(rep_num) as avg_rep_num, count(topic_str) as topicnum, date, 'SAS' as ug length=3
   from SASug1
   group by date
   union 
   select mean(rep_num) as avg_rep_num, count(topic_str) as topicnum, date, 'R' as ug length=3
   from Rug1
   group by date
   order by date
;quit;

****(3.2) REPORT MEAN AND STD OF THE DAILY TOPIC AND REPLY NUMBERS****;
proc report data=reportdata nowd headline split='|';
   column ug topicnum,(mean std) avg_rep_num,(mean std);
   define ug/group 'USER|GROUP' width=6;
   define topicnum/'AVERAGE|TOPIC NUMBER';
   define avg_rep_num/'AVERAGE|REPLY NUMBER';
   define mean/format=5.1 'MEAN';
   define std/format=5.2 'STD';
run;

***********(4) VISUALIZE THE COMPARISON RESULT IN TIME SERIES*******;
****(4.1) PREPARE DATASET FOR PLOTTING******;
proc sql;
   create table combine as
   select a.date, a.*, b.*
   from rug2 as a left join sasug2  as b
   on a.date = b.date
;quit;

****(4.2) BUILD PLOTTING TEMPLATE*******; 
proc template;
  define statgraph compplot;
     begingraph / designwidth=1000px designheight=800px;
       entrytitle "Comparison of user groups between SAS and R";
          layout lattice / columns=1 columndatarange=union rowweights=(0.3 0.3 0.4);
            columnaxes;
               columnaxis / offsetmin=0.02 griddisplay=on;
            endcolumnaxes;
         /*(4.2.1)DEFINITION OF TOP PANEL*/
            layout overlay / cycleattrs=true yaxisopts=(griddisplay=on label=" " 
                             display=(line) displaysecondary=all);                                       
               seriesplot x=date y=R_topicnum / lineattrs=(thickness=2px) 
                   legendlabel="R's topics" name="d25"; 
               seriesplot x=date y=SAS_topicnum / lineattrs=(pattern=solid thickness=2px) 
                     legendlabel="SAS's topics" name="d50"; 
               discretelegend "d25" "d50" / across=1 border=on valign=top halign=left 
                     location=inside opaque=true;                            
            endlayout;
         /*(4.2.2)DEFINITION OF MIDDLE PANEL*/
            layout overlay / cycleattrs=true yaxisopts=(griddisplay=on label=" " 
                        display=(line) displaysecondary=all);                                       
                    seriesplot x=date y=R_zero_reply_ratio/ lineattrs=(thickness=2px) 
                     legendlabel="R's zero reply ratio" name="d2"; 
               seriesplot x=date y=SAS_zero_reply_ratio/ lineattrs=(pattern=solid thickness=2px) 
                     legendlabel="SAS's zero reply ratio" name="d5"; 
               discretelegend "d2" "d5" / across=1 border=on valign=top halign=left 
                     location=inside opaque=true;                
            endlayout;
         /*(4.2.3)DEFINITION OF BOTTOM PANEL*/
           layout overlay / yaxisopts=(griddisplay=on display=(line) label=" " displaysecondary=all)  
                        cycleattrs=true xaxisopts=(griddisplay=on);                      
               seriesplot x=date y=R_mean_reply / lineattrs=(thickness=2px) 
                    legendlabel="R's mean reply" name="ser1"; 
               seriesplot x=date y=sas_mean_reply / lineattrs=(pattern=solid thickness=2px) 
                    legendlabel="SAS's mean reply" name="ser2"; 
               seriesplot x=date y=R_max_reply / lineattrs=(thickness=2px) 
                    legendlabel="R's max reply" name="ser3"; 
               seriesplot x=date y=sas_max_reply / lineattrs=(pattern=solid thickness=2px) 
                    legendlabel="SAS's max reply" name="ser4"; 
               discretelegend "ser1" "ser2" "ser3" "ser4"/ across=1 border=on valign=top halign=left 
                    location=inside opaque=true;
           endlayout;
        endlayout;
     endgraph;
  end;
run;

****(4.3) RENDER IMAGES BY USING TEMPLATE******; 
proc sgrender data=combine template=compplot; 
run;

*************************END OF ALL CODING*****************************************;
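
The parsing rules in step (1.4) are easier to follow on a few rows that can be run without a Gmail export. The block below is only a minimal sketch: the three sample rows, the dataset names `demo` and `demo_parsed`, and the pipe-delimited layout are made up for illustration and are not taken from the real data. It also uses `scan` rather than fixed-width `substr` to pick out the reply count and the date parts, which is a small variation on the macro above, not the original method.

```sas
/* Hypothetical sample rows: author line | topic line | time stamp */
data demo;
   infile datalines truncover dsd dlm='|';
   length writer $60 topic $120 time $10;
   input writer $ topic $ time $;
   datalines;
Alice, Bob (5)|How to merge two tables? - I have two data sets|Mar 3
Carol|Proc SQL join question? - Hello all|Feb 28
Dave, Erin (12)|Macro quoting - I am trying|9:15 am
;
run;

data demo_parsed;
   set demo;
   length topic_str $120 time1 $6;
   /* Reply count: the digits right after an opening parenthesis, else 0 */
   position = prxmatch('/\(\d+/', writer);
   if position > 0 then
      rep_num = input(scan(substr(writer, position + 1), 1, ')'), 8.);
   else rep_num = 0;
   /* Topic: keep the subject only when it ends with a question mark */
   idx = index(topic, '? -');
   if idx > 0 then topic_str = substr(topic, 1, idx);
   /* Time: a clock time means the thread arrived on the export day (Mar 7) */
   if index(time, 'am') or index(time, 'pm') then time1 = 'Mar 7';
   else time1 = strip(time);
   date = input(cats(scan(time1, 2, ' '), scan(time1, 1, ' '), '2011'), date9.);
   format date date9.;
   keep rep_num topic_str date;
run;

proc print data=demo_parsed noobs;
run;
```

With these sample rows, the third thread has no question mark in its subject, so its `topic_str` stays blank and it would not be counted by `count(topic_str)` in step (1.5), although its reply count would still enter the mean.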
