Thursday, October 20, 2011

Rick Wicklin’s 195th blog post

Today I ran a SAS routine to check the KPIs for a few websites I am interested in, and I happened to notice that the total number of posts on Rick Wicklin’s blog will soon reach 200. I have followed his blog since its creation, and that is an amazing number for a little more than one year. Rick is a unique blogger: he is a statistician who programs, a programmer who plots data, and a data analyst who writes well. For me, it is worthwhile to summarize what I have learned from his blog.
Data extracted from The Do Loop
The official SAS blogs were restructured this summer. Since I can no longer find the XML button on the website, I wrote a new program that extracts the data directly from the HTML to drive the KPIs. Jiangtang Hu also created a program to extract data from The Do Loop, and mentioned that Rick is an incredibly productive writer.

%macro extract(page = );
   options mlogic mprint;
   %do index = 1 %to &page;
      filename raw url "";
      /* Keep only the HTML lines that carry post information */
      data _tmp01;
        infile raw lrecl= 550 pad ;
        input record $550. ;
        if find(record, 'id="post') gt 0 or find(record, 'class="post') gt 0;
      run;
      /* Group every three consecutive lines into one post */
      data _tmp02;
         set _tmp01;
         _n + 1;
         _j = int((_n+2) / 3);
      run;
      proc transpose data=_tmp02 out=_tmp03;
         by _j;
         var record;
      run;
      /* Pull title, time and page views from between their HTML markers */
      data _&index;
         set _tmp03;
         array out[3] $100. title time pageview;
         array in[3] col1-col3;
         do i = 1 to 3;
            if i = 1 then do; _str1 = 'rel="bookmark">'; _str2 = "</a></"; end;
            if i = 2 then do; _str1 = '+0000">'; _str2 = '</abbr>'; end;
            if i = 3 then do; _str1 = '="postviews">'; _str2 = "</span>"; end;
            _len = length(compress(_str1));
            _start = find(in[i], compress(_str1)) + _len ;
            _end = find(in[i], _str2, _start);
            out[i] = substr(in[i] , _start  , _end - _start);
         end;
         drop _: col: i;
      run;
   %end;
   /* Stack the per-page data sets, then delete the temporary ones */
   data out;
       set %do n = 1 %to &page; _&n %end; ;
   run;
   proc datasets nolist;
      delete _:;
   quit;
%mend extract;
%extract(page = 20);
data out1;
   set out nobs = nobs;
   /* Posts are listed newest first, so count down to get the running total */
   j + 1; 
   n = nobs - j + 1;
   length level $20.;
   label pageview1 = 'PAGEVIEW' time1 = 'TIME' n = 'TOTAL POSTS';
   pageview1 = input(pageview, 5.);
   /* Rebuild the date text, e.g. "October 20, 2011", as 20Oct2011 */
   _month = scan(time, 1);
   _date = scan(time, 2);
   _year = scan(time, 3);
   time1 = input(cats(_date, substr(_month, 1, 3), _year), date9. );
   weekday = weekday(time1);
   drop _:;
   format time1 date9.;
run;

/* PROC GKPI needs the JAVAIMG device */
goptions device = javaimg;
ods html style = htmlbluecml; 
proc sql noprint;
   select count(*), sum(pageview1) into :nopost, :noview
   from out1;
quit;
proc gkpi mode=basic;
   dial actual = &nopost bounds = (0 100 200 300 400) /
   target=200 nolowbound  
   afont=(f="Garamond" height=.6cm)
   bfont=(f="Garamond" height=.7cm) ;
run;
quit;
proc gkpi mode=basic;
   dial actual = &noview bounds = (0 2e4 4e4 6e4 8e4) /
   afont=(f="Garamond" height=.6cm)
   bfont=(f="Garamond" height=.7cm) ;
run;
quit;
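
To see how the marker-based extraction and the date conversion above behave, here is a minimal sketch that runs both on a single hard-coded line. The HTML fragment is hypothetical (the real page markup may differ), and the data set name demo is only for illustration.

```sas
/* Hypothetical HTML fragment; the real page markup may differ */
data demo;
   length record $200 time $40;
   record = '<abbr class="published" title="2011-10-19T09:31:00+0000">October 19, 2011</abbr>';
   _str1 = '+0000">';
   _str2 = '</abbr>';
   /* Position just past the opening marker */
   _start = find(record, _str1) + length(_str1);
   /* Position of the closing marker, searching from _start */
   _end = find(record, _str2, _start);
   time = substr(record, _start, _end - _start);
   /* "October 19, 2011" -> "19Oct2011" -> SAS date value */
   time1 = input(cats(scan(time, 2), substr(scan(time, 1), 1, 3), scan(time, 3)), date9.);
   /* 1 = Sunday, ..., 7 = Saturday */
   weekday = weekday(time1);
   format time1 date9.;
   drop _:;
run;

proc print data=demo;
run;
```

October 19, 2011 was a Wednesday, so weekday should come out as 4, which is the value used later to label a post as intermediate.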
What I learned
I collected all 195 titles, replaced or removed some words, and processed them with Wordle. As I expected, Rick’s blog is mainly about ‘Matrix’, ‘Statistics’ and ‘Data’. It is interesting to learn how to create a ‘Function’ in SAS/IML, which involves a lot of programming skill. I also enjoyed his topics about ‘Simulating’ and ‘Computing’ with ‘Random’ numbers. He also has exciting articles about how to deal with ‘Missing’ values and ‘Curve’s.

/* Words to strip from the titles before building the word cloud */
data word_remove;
   input word : $15. @@;
   datalines;
sas iml using use creating create proc blog versus
;

proc sql noprint;
   select quote(upcase(compress(word))) into :wordlist separated by ',' 
   from word_remove;
quit;

data _null_;
   set out(keep=title);
   /* Merge plural and related forms into a single word */
   title =tranwrd(upcase(title), 'MATRICES', 'MATRIX');
   title =tranwrd(upcase(title), 'FUNCTIONS', 'FUNCTION');
   title =tranwrd(upcase(title), 'STATISTICAL', 'STATISTICS');
   length i $8.;
   /* Blank out every word in the removal list */
   do i = &wordlist;
      title =tranwrd(upcase(title), compress(i), ' ');
   end;
   file 'c:\tmp\output1.txt';
   put title;
run;
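
One thing to keep in mind is that TRANWRD replaces substrings, so a short word such as SAS is also removed when it sits inside a longer word. As a hedged alternative (not what the program above does), a title can be split into words and filtered one word at a time. The sample title, the delimiter list and the names clean, w and k are assumptions for illustration.

```sas
data _null_;
   length title clean $200 w $40;
   /* Hypothetical title, already in upper case */
   title = 'USING SAS/IML FUNCTIONS TO CREATE A MATRIX';
   clean = '';
   /* Walk through the words, splitting on blanks and slashes */
   do k = 1 to countw(title, ' /');
      w = scan(title, k, ' /');
      /* Keep a word only if it is not in the removal list */
      if w not in ('SAS', 'IML', 'USING', 'CREATE') then clean = catx(' ', clean, w);
   end;
   /* Writes clean=FUNCTIONS TO A MATRIX to the log */
   put clean=;
run;
```

Because whole words are compared, a removal word can no longer cut a hole in the middle of another word.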
When the number reaches 200
Except for the holidays (the gaps in the needle plot), Rick keeps a constant rate of writing: approximately 3 posts a week.

No doubt the OLS regression gives a straight line. It seems that the total number will hit the 200 target pretty soon: the week after next, I believe.

proc sgplot data=out1;
   needle x = time1 y = n;
   yaxis grid max = 300;
run;

proc sgplot data = out1;
   reg x =time1 y = n;
   refline 200/ axis=y ;
   yaxis max = 300;
run;
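
The date at which the fitted line crosses 200 can also be computed rather than read off the plot. This is a minimal sketch on toy data (one hypothetical post every two days, not the real counts). It assumes PROC REG from SAS/STAT is available, and the names toy, est, hit, slope and target_date are made up for illustration.

```sas
/* Toy data (hypothetical): one post every 2 days from 01SEP2010 */
data toy;
   do n = 1 to 150;
      time1 = '01SEP2010'd + 2 * (n - 1);
      output;
   end;
run;

/* Fit n = intercept + slope * time1 and save the coefficients */
proc reg data=toy outest=est noprint;
   model n = time1;
run;
quit;

/* Solve intercept + slope * t = 200 for t */
data hit;
   set est;
   /* OUTEST= stores the coefficient of time1 in a variable named time1 */
   slope = time1;
   target_date = (200 - intercept) / slope;
   format target_date date9.;
run;

proc print data=hit;
   var intercept slope target_date;
run;
```

Replacing toy with out1 would give the estimate for the real blog, under the assumption that the posting rate stays linear.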

What a SAS user likes to know
In my experience, clicks in a web browser mostly originate from search engines, while a regular reader prefers to use feeds. The page views recorded on The Do Loop website can therefore reflect what SAS users are trying to find. Rick follows a pattern: introductory tips on Monday, intermediate techniques on Wednesday, and topics for experienced programmers on Friday. If we separate the page view trends into the three levels, we can see that the intermediate and advanced posts attract more page views than the basic ones.

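data out2;
   /* First copy of each post: labeled by the weekday it was published */
   set out1; 
      if weekday = 2 then level = '1-Basic';
      else if weekday in (3, 4) then level = '2-Intermediate';
      else level = '3-Advanced';
      output;
   /* Second copy of each post: feeds the overall panel */
   set out1; 
      level = '4-Overall'; 
      output;
run;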

proc sgpanel data = out2;
   panelby level / spacing=5 columns = 2 rows = 2 novarname;
   series x = time1 y = pageview1;
   rowaxis grid; colaxis grid;
run;
I agree with what Rick Wicklin said: blogging helps us become more aware of what we know and what we don't know. I have benefited a lot from his book and his resource-rich blog over the past year. Cheers to Rick’s upcoming 200th post!
