Saturday, December 31, 2011

Using SQL for data management


Every weekday I pull out data from IBM DB2 database and do analysis in SAS. I use SQLite to record my personal chores, such as credit card transactions. And I am taking courses of database development by Microsoft SQL Server. Then I have pretty enough time to compare the 4 softwares. Recently I have the rough feeling -- the relational database end and the analytics end look more and more similar, while the databases brings more analysis procedures and every statically packages is eager to introduce SQL feature as enhancement. Thus, Structured Query Language (SQL) tends to be the universal language in the data world.

SQLs in DB2, Microsoft SQL Server (T-SQL), SQLite and SAS (PROC SQL) follow the same logic but contain different characteristics. We can treat them as the branches originated from the ANSI SQL standard. I play around their own SQL dialects and try to learn how to manage data in different ways.

Data
Using the SQL syntax below,  the SASHELP.CLASS data set, which contains the gender, name, age, weight and height information of 18 fake teenagers, will be able to be transferred to any database system.
create table class (name char(8), sex char(1), 
   age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

Random sampling
I randomly select 8 out of the 18 persons. The realization of this purpose mainly relies on the software's' random function and output control.

/********************** a. SQL Server **********************/
select top (8) *
from class
order by NEWID();

/********************** b. DB2 **********************/
select *
from class
order by rand() 
fetch first 8 rows only;

/********************** c. SQLite **********************/
select *
from class
order by random() 
limit 8;

/********************** d. SAS **********************/
proc sql outobs = 8;
select *
from class
order by ranuni(0);

Report subtotal of a variable as the last observation
It is a quite common task to make a table with subtotal for reporting. DB2 and SQL Server both have ROLLUP operator/function, while SAS may utilize its macro facility.

/********************** a. SQL Server **********************/
select coalesce(name, 'Total') as name, SUM(weight) as weight
from class
group by name with rollup;

/********************** b. DB2 **********************/
select case grouping(name) 
      when 0 then name
      else 'Total'
   end name,
sum(weight) as weight
from class
group by rollup(name);

/********************** c. SQLite **********************/
select name, sum(weight) as weight
from class
group by name
union all
select 'Total', sum(weight) 
from class;

/********************** d. SAS **********************/
select sum(weight) into: sum_weight from class;
create table class2 as select name, weight from class;
insert into class2 values('Total', &sum_weight);

Retrieve column information of a table
Each database tries to store the metadata somewhere, so that we are able to retrieve the information of the columns/variables, such as their name, data type and position at the table. Only SQLite, a portable database, has no designated table to store this information. However, running the command line shell of SQLite has the same effect.

/********************** a. SQL Server **********************/
select column_name, data_type, ordinal_position
from information_schema.columns
where table_name = 'class';

/********************** b. DB2 **********************/
select colname, typename, colno
from syscat.columns
where tabname = 'class';

/********************** c. SQLite **********************/
pragma table_info(class);

/********************** d. SAS **********************/
select name, type, varnum
from sashelp.vcolumn
where libname = 'WORK' and memname = 'CLASS';

Obtain the running total of a variable
It seems like a difficult job for native SQL syntax. SQL Server and DB2 evolve with their ranking functions -- DB2 has sum() over() and SQL Server has row_number() over(), which can be used to solve this problem. In SAS, the better way is to use DATA step instead.

/********************** a. SQL Server **********************/
select name, weight, sum(weight) over (order by weight) as running_total
from class
order by weight

/********************** b. DB2 **********************/
create view v_class as
select ROW_NUMBER() OVER(ORDER BY name) as rownum, *
from class;
select a.name, a.weight, 
   (select SUM(b.weight) from v_class b
   where b.rownum <= a.rownum) as running_total
from v_class as a

/********************** d. SAS **********************/
data class2;
   class;
   retain running_total;
   running_total + weight;
   keep name weight running_total; 
run;

Sunday, December 25, 2011

Four ways to add row number in SQL Server



It is easy to sort any data in SQL Server, while it is not a trivial job to add a new variable of row number to an existent table. Here I come with four ways to realize the purpose.


/********** 0. Input data source *********************/ 
use tempdb
if object_id('class') is not null 
   drop table class;
go

-- Use a table with 5 variables from 18 teenagers
create table class (name char(8), sex char(1), 
   age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

/********** 1. Primary key with auto-increment *********************/ 
create table class2 (row_num smallint identity(1,1) primary key clustered, 
   name char(8), sex char(1), age numeric, height numeric, weight numeric );
insert into class2(name, sex, age, height, weight) 
select * from class
order by weight, name;

select * from class2;
drop table class2;

/********** 2. Subquery *********************/ 
select name, weight, (
   select count(*) from class as b
   where b.weight < a.weight or (b.weight = a.weight and b.name <= a.name) 
   ) as row_num
from class as a
order by weight , name

/********** 3. Cursor *********************/ 
create table classcursor(name char(8), weight int, row_num int);
declare @name as char(8), @weight as int, @row_num as int

begin tran
   declare rncursor cursor fast_forward for
      select name, weight from class order by weight , Name;
      
   open rncursor;

   -- Set the initial value
   set @row_num = 0;
   fetch next from rncursor into @name, @weight;

   -- Here come a loop 
   while @@fetch_status = 0
   begin 
      set @row_num = @row_num + 1;
      insert into classcursor(name, weight, row_num)
         values(@name, @weight, @row_num)
      fetch next from rncursor into @name, @weight
   end
   close rncursor;
   deallocate rncursor;
commit tran

select * from classcursor
drop table classcursor

/********** 4. Row_number() function *********************/ 
select name, weight, row_number() over(order by weight, name) as rownum
from class

Friday, December 16, 2011

Merry Christmas!

Just for fun.

data cars;
   set sashelp.cars;
   length label $10.;
   keep invoice weight enginesize type label;
   if _n_ = 1 then label = 'Merry';
   if _n_ = 50 then label = 'Christmas';
run;

proc sgplot data = cars noautolegend;
   bubble x = invoice y = weight size = enginesize / group = type datalabel = label datalabelattrs=(color=red 
      family="garamond" style=italic size=45 weight= bold) transparency=0.4;
   xaxis max = 80000 label = ' '; yaxis label = ' ' max = 6000;
run;

Monday, December 12, 2011

10 toughest interview questions for R developers (1)

I recently discovered 10 questions most daunting in R development, and I am trying to find the answers below.

1. Data structure -- How many data structures R has? How do you build a binary search tree in R?
2. Sorting -- How many sorting algorithms are available? Show me an example in R.
3. Low level -- How do you build a R function powered by C?
4. String -- How do you implement string operation in R?
5. Vectorization -- If you want to do Monte Carlo simulation by R, how do you improve the efficiency?
6. Function -- How do you take function as argument of another function? What is the apply() function family?
7. Threading -- How do you do multi-threading in R?
8. Memory limit and database -- What is the memory limit of R? How do you avoid it? How do you use SQL in R?
9. Testing -- How do you do testing and debugging in R?
10. Software development -- How do you develop a package? How do you do version control?

Q1. Data structure -- How many data structures R has? How do you build a binary search tree in R?
My answer: R mainly has 5 data structures.
Homogeneous(contain the same type of objects): vector --> matrix --> array
Heterogeneous(allow different type of objects): list --> data frame

A binary search tree requires several actions: implement a tree, insert nodes and delete nodes. We should create individual routines in R.

In R, a matrix is the ideal data structure to contain the linked elements. Then a list should be used to wrap the matrix and pass the arguments.

For insert-node routine, there is the pseudocode for it. The key point here is to use recursion in R’s function. Norman Matloff gave a complete example at page 180 of his book. 

insert(X, node){
 if(node = NULL)
 node = new binaryNode(X,NULL,NULL)
 return
 }
 if(X = node:data)
 return
 else if(X < node:data)
 insert(X, node:leftChild)
 else // X > node:data
 insert(X, node:rightChild)
}

Q2. Sorting -- How many sorting algorithms are available? Show me an example in R.
My answer: there are mainly 5 kinds of sorting algorithms(with their complexity):
Bubble Sort - O(n^2);
Selection Sort - O(n^2)
Merge Sort - O(n log n)
Quick Sort - from O(n log n) to O(n^2)
Bucket Sort - O(n + m)

R has a native sort function sort(), which is written by C and uses Quick Sort.

There is an example of Quick Sort in R by recursion

qs <- function(x) {
 if (length(x) <= 1) return(x)
 seed <- x[1]
 rest <- x[-1]
 sv1 <- rest[rest < seed]
 sv2 <- rest[rest >= seed]
 sv1 <- qs(sv1)
 sv2 <- qs(sv2)
 return(c(sv1,seed,sv2))
}

Those most productive R developers



The number of R packages on CRAN is 3,483 on 2011-12-12. The growth of R package in the past years can be fitted by a quadratic regression perfectly.

I am always interested in who are maintaining those packages. Then I wrote an R script to extract the package head information from CRAN’s website and stored them in a SQLite database. Most R developers are maintaining 1-3 R packages. Some of them are really productive. By the correspondence addresses (Email), the top 50 R developers are listed below:

developer package
1 Kurt Hornik 23
2 Martin Maechler 23
3 Hadley Wickham 21
4 Rmetrics Core Team 19
5 Achim Zeileis 17
6 Henrik Bengtsson 17
7 Paul Gilbert 17
8 Brian Ripley 14
9 Roger D. Peng 13
10 Torsten Hothorn 13
11 Karline Soetaert 12
12 Philippe Grosjean 12
13 Robin K. S. Hankin 12
14 Charles J. Geyer 11
15 Matthias Kohl 11
16 Charlotte Maia 10
17 Mikis Stasinopoulos 10
18 Simon Urbanek (1) 10
19 Thomas Lumley 10
20 Arne Henningsen 9
21 Gregory R. Warnes 9
22 Jonathan M. Lees 9
23 Michael Hahsler 9
24 Peter Ruckdeschel 9
25 A.I. McLeod 8
26 Brian Lee Yung Rowe 8
27 Dirk Eddelbuettel 8
28 John Fox 8
29 Kaspar Rufibach 8
30 Korbinian Strimmer 8
31 Michael Friendly 8
32 Peter Solymos 8
33 Roger Bivand 8
34 Simon Urbanek (2) 8
35 Christopher Brown 7
36 David Meyer 7
37 ORPHANED 7
38 Revolution Analytics 7
39 Rob J Hyndman 7
40 Romain Francois 7
41 Ulrike Groemping 7
42 Christophe Genolini 6
43 Frank Schaarschmidt 6
44 G. Grothendieck 6
45 Hana Sevcikova 6
46 Jeffrey A. Ryan 6
47 Kjetil Halvorsen 6
48 Pei Wang 6
49 Trevor Hastie 6
50 Yihui Xie 6


### A script of R to extract R package information and 
### build a SQLite databse by hchao8@gmail.com 
library(ggplot2)
library(XML)
library(RSQLite)

# Create and connect a SQLite database
conn <- dbConnect("SQLite", dbname = "c:/Rpackage.db")

# Extract names of R packages available from web
allPackageURL <-
  "http://cran.r-project.org/web/packages/available_packages_by_name.html"
allPackage <- na.omit(melt(readHTMLTable(allPackageURL))[, c("V1")])

# Extract individual package information from web and store data in SQLite 
for (i in 1:length(allPackage)){
  packageName <- allPackage[i]
  packageURL <- paste("http://cran.r-project.org/web/packages/",packageName,
                      "/index.html", sep="")
  y <- melt(readHTMLTable(packageURL))
  y$L1 <- packageName
  if(dbExistsTable(conn, "Rpackage")) {
     dbWriteTable(conn, "Rpackage", y, append = TRUE)
  } else {
     dbWriteTable(conn, "Rpackage", y)
  }
} 
# Pull out maintainer information from SQLite database
all <- fetch(dbSendQuery(conn, "
          select v2 as author, count(v2) as package
          from rpackage
          where v1 = 'Maintainer:'
          group by v2
          order by package desc
          ;"))

# Disconnect SQLite database
dbDisconnect(conn)

# Draw a histogram
qplot(package, data = all, binwidth = 1, ylab = "Frequency",
      xlab = "R packages maintained by individual developer")
ggsave("c:/Rlist.png")

# Find 50 most productive developers
head(all, 50)

Tuesday, December 6, 2011

A new way to draw maps in SAS


SAS’s ODS Graphics technology brought the concept of layer into data visualization. We can use those SG procedures to do many tricks. Previously in SAS, a map has to be drawn from its GMAP procedure. Now we can simply use 3-4 lines of codes to sketch some maps by the scatter statement in PROC SGPLOT, such as North America or Asia.

ods html style = money;
proc sgplot data = maps.namerica noautolegend;
 scatter x = x y = y / group = id  markerattrs=(size=1);
 xaxis grid label = ' '; yaxis grid label = ' ';
run;
proc sgplot data = maps.china ;
 scatter x = x y = y /markerattrs=(size=2);
 xaxis grid label = ' '; yaxis grid label = ' ';
run;



We can apply it to single countries, like China or India. For India, the aspect has to be modified a little. It can be aslo done by PROC SGSCATTER. My friend Xiangxiang has a great tutorial for this procedure.

ods html style = harvest;
proc sgplot data = maps.asia noautolegend;
 scatter x = x y = y / group = id  markerattrs=(size=1);
 xaxis grid label = ' '; yaxis grid label = ' ';
run;

ods html style = htmlbluecml;
ods graphics on / width=6in height = 6in;
proc sgplot data = maps.india;
 scatter x = x y = y /markerattrs=(size=2);
 xaxis grid label = ' '; yaxis grid label = ' ';
run;
ods graphics  / reset;
Thanks to SAS’s ODS group, those SG procedures always give me limitless fun

Wednesday, November 30, 2011

When Google Analytics meets SAS

Thanks to Tricia’s introduction, I recently realized that Google Analytics is such a powerful tool for web analytics or business intelligence. It will fit the special needs if we use SAS to analyze the well-structure users’ data accumulated in Google Analytics. The challenge is that Google Analytics API and SAS hardly meet each other: Google Analytics often serves web/Linux, and SAS dwells in the ecosystems of Windows/UNIX/Mainframe. On a Windows-equipped computer, I tried three methods to pull out this blog’s data from Google Analytics to SAS: they have their own pros and cons.

Method 1: CliendLogin + HTTP protocol
The Data Export API of Google Analytics has 3 types of authorization, and ClientLogin is one of them. After downloading a token, the information from Google Analytics can be received through the HTTP protocol. William Roehl has a wonderful paper to describe how to pass the authorization step and then parse the XML data by applying two SAS macros. R’s RGoogleAnalytics package and Python’s GA library are also based the similar principles.
Pros: simple and effective. A client user can choose SAS, R or Python to download data. The codes are all open-sourced and easy to get modified for any particular need.
Cons: they all need cURL to set up SSL connection while downloading data. Since cURL is not built for Windows, it’s really awkward to use cURL on a PC which could fail many attempts.

Method 2: Data Feed Query Explorer
Google Analytics API has a web portal to supply data. It uses a browser to realize the operations in the first method above.
Pros: the easiest solution. The portal provides all options for the metrics, dimensions and segments.
Cons: the data has to be re-structured in SAS. It is getting slow when displaying a lot of results.



Method 3: OAuth + Google’s Python client library
OAuth is another authorization method. To obtain the necessary client’s key and secret for this approach, Google Analytics API has to be activated from Google API Console.
Pros: this authorization method is recommended by Google. The official Python library is very fast. Data downloaded in Python can be saved as CSV and then incorporated by SAS.

Cons: a little complicated. A SAS user has to learn some Python to tweak the codes.

Google Analytics now kicks off a new web interface which has a pretty high learning curve. In my opinion, using it's Data Export API as a front-end database and SAS as a back-end analytics platform will help generate customized models.

Tuesday, October 25, 2011

NCAA football and computer rankings

I am a big fan of NCAA football. I found that in the past weeks the cold-blooded computer rankings are more accurate than the poll rankings(BCS, Harris Poll and USA Today). And they are pretty good in predicting the game results, such as the fall of Oklahoma last week.

Data and plotting
Those ranking data are available on ESPN’s website (and they are well structured data and easy to grab). I subtracted the 6 computer rankings by the overall ranking and drew those differences on a scatter plot.
-Alabama seems to have more chance to take LSU’s place.
-Although Michigan State beat Wisconsin last week, the computers still don’t favor it.
-Auburn looks very promising.
-Oklahoma State is highly possible to become #2.
One complaint about the computer ranking
I don’t like the averaging method of the 6 computer rankings used by BCS. Transformation and factor analysis may make full use of the information - SAS’s user guide provided a detailed solution by PROC PRINQUAL and PROC FACTOR.

data week9;
   input @1 RK: 20. @5 TEAM: $40. AVG_bcs: PVS_bcs: $2. RK_hp: $2. PTS_hp pct_hp RK_usa: $2. 
      PTS_usa pct_usa AVG_computer AH   RB   CM   KM   JS   PW;
cards;   
/* COPY AND PASTE DATA FROM http://espn.go.com/college-football/bcs
*/
;;;
run;

data _tmp01;
   set week9;
   array rank[6] ah--pw;
   do i = 1 to 6;
      if rank[i]= 0 then rank[i] = 25;
      rank[i] = rk - rank[i];
      drop i;
   end;
run;
proc sort data=_tmp01;
   by team;
run;
proc transpose data=_tmp01 out=_tmp02;
   var ah--pw;
   by team;
run;

proc template;
  define Style HeatMapStyle; 
    parent = styles.htmlblue;
    style GraphFonts from GraphFonts /                                                       
      'GraphLabelFont' = (", ",6pt) 
      'GraphValueFont' = (", ",6pt)
      'GraphDataFont' = (", ",6pt);    
   end; 
run;
proc template;
   define statgraph HeatMap.Grid;
   begingraph;
   layout overlay / border=true xaxisopts=(label='TEAM') yaxisopts=(label='COMPUTER ALGORITHM');
      scatterplot x=team y=_name_ / markercolorgradient=col1 markerattrs=(symbol=squarefilled size=32)
                                   colormodel=threecolorramp name='s';
      continuouslegend 's' / orient=vertical location=outside valign=center halign=right;
   endlayout;
   endgraph;
   end;
run;
ods html style=HeatMapStyle image_dpi=300 ;
proc sgrender data=_tmp02 template=HeatMap.grid;
run;

Thursday, October 20, 2011

Rick Wicklin’s 195th blog post

Today I ran a SAS routine to check the KPIs for a few websites I am interested in. I accidentally found the total number of posts on Rick Wicklin’s blog is going to approach 200 pretty soon. I followed his blog since its creation. It is an amazing number in a little more than one year. Rick is a unique blogger: he is a statistician who does programming; he is a programmer who plots data; he is a data analyst who is a good writer. As for me, it’s meaningful to summarize what I have learned from his blog.
Data extracted from The Do Loop
SAS official blogs have been restructured this summer. Since I can’t find the previous XML button on the website, I rewrote a program to directly extract HTML data to drive the KPI. Jiangtang Hu also created a program to extract data from The Do Loop, and mentioned that Rick is an incredibly productive writer.

%macro extract(page = );
   options mlogic mprint;
   %do index = 1 %to &page;
      filename raw url "http://blogs.sas.com/content/iml/page/&index/";
      data _tmp01;
        infile raw lrecl= 550 pad ;
        input record $550. ;
        if find(record, 'id="post') gt 0 or find(record, 'class="post') gt 0;
      run;
      data _tmp02;
         set _tmp01;
         _n + 1;
         _j = int((_n+2) / 3);
      run;
      proc transpose data=_tmp02 out=_tmp03;
         by _j;
         var record;
      run;
      data _&index;
         set _tmp03;
         array out[3] $100. title time pageview;
         array in[3] col1-col3;
         do i = 1 to 3;
            if i = 1 then do; _str1 = 'rel="bookmark">'; _str2 = "</a></"; end;
            if i = 2 then do; _str1 = '+0000">'; _str2 = '</abbr>'; end;
            if i = 3 then do; _str1 = '="postviews">'; _str2 = "</span>"; end;
            _len = length(compress(_str1));
            _start = find(in[i], compress(_str1)) + _len ;
            _end = find(in[i], _str2, _start);
            out[i] = substr(in[i] , _start  , _end - _start);
         end;
         drop _: col: i;
      run;
   %end;
   data out;
       set %do n = 1 %to &page;
               _&n
           %end;;
   run;
   proc datasets nolist;
      delete _:;
   quit;
%mend;
%extract(page = 20);
data out1;
   set out nobs = nobs;
   j + 1; 
   n = nobs - j + 1;
   length level $20.;
   label pageview1 = 'PAGEVIEW' time1 = 'TIME' n = 'TOTAL POSTS';
   pageview1 = input(pageview, 5.);
   _month = scan(time, 1);
   _date = scan(time, 2);
   _year = scan(time, 3);
   time1 = input(cats(_date, substr(_month, 1, 3), _year), date9. );
   weekday = weekday(time1);
   drop _:;
   format time1 date9.;
run;

ods html style = htmlbluecml; 
proc sql noprint;
   select count(*), sum(pageview1) into: nopost, :noview
   from out1
;quit;
proc gkpi mode=basic;
   dial actual = &nopost bounds = (0 100 200 300 400) /
   target=200 nolowbound  
   afont=(f="Garamond" height=.6cm)
   bfont=(f="Garamond" height=.7cm) ;
proc gkpi mode=basic;
   dial actual = &noview bounds = (0 2e4 4e4 6e4 8e4) /
   afont=(f="Garamond" height=.6cm)
   bfont=(f="Garamond" height=.7cm) ;
quit;
What I learned
I accumulated all the 195 titles, replaced/removed some words and processed them with Wordle. As I expected, Rick’s blog is mainly about ‘Matrix’, ‘Statistics’ and ‘Data’. It is interesting to learn how to create ‘Function’ in SAS/IML, which involves a lot of programming skills. I also enjoyed his topics about ‘Simulating’ and ‘Computing’ with ‘Random’ numbers. He also has exciting articles about how to deal with ‘Missing’ values and ‘Curve’.


data word_remove;
   input word : $15. @@;
   cards;
sas iml using use creating create proc blog vesus
;;;

proc sql noprint;
   select quote(upcase(compress(word))) into :wordlist separated by ',' 
   from word_remove
;quit;

data _null_;
   set out(keep=title);
   title =tranwrd(upcase(title), 'MATRICES', 'MATRIX');
   title =tranwrd(upcase(title), 'FUNCTIONS', 'FUNCTION');
   title =tranwrd(upcase(title), 'STATISTICAL', 'STATISTICS');
   length i $8.;
   do i = &wordlist;
      title =tranwrd(upcase(title), compress(i), ' ');
   end;
   file 'c:\tmp\output1.txt';
   put title;
run;
When the number reaches 200
Except the holidays (those gaps in the finger plot above), Rick keeps a constant rate in writing articles (approximately 3 posts a week).

No double the OLS regression gives a straight line. It seems that the total number will hit the 200 target pretty soon: next next week I believe.

proc sgplot data=out1;
   needle x = time1 y = n;
   yaxis grid max = 300;
run;

proc sgplot data = out1;
   reg x =time1 y = n;
   refline 200/ axis=y ;
   yaxis max = 300;
run;


What a SAS user likes to know
From my experience, clicks in a web browser are mostly originated form search engines, while a regular reader would like to use feeds instead. The page views recorded on the website of The Do Loop can reflect what SAS users try to find. Rick follows his pattern -- introductory tips on Monday, intermediate techniques for Wednesday, and topics for experienced programmers Friday. If we separate the page view trends at the three levels, we can see that the intermediate and advanced posts attract more page views than the basic ones.

data out2;
   set out1; 
      if weekday = 2 then level = '1-Basic';
      else if weekday in (3, 4) then level = '2-Intermediate';
      else level = '3-Advanced';
   output;
   set out1; 
      level = '4-Overall'; 
   output;
run;

proc sgpanel data = out2;
   panelby level / spacing=5 columns = 2 rows = 2 novarname;
   series x = time1 y = pageview1;
   rowaxis grid; colaxis grid;
run;
Conclusion
I agree with what Rick Wicklin said: blogging helps us to become more aware of what we know and what we don't know. I benefited a lot from his book and his resourceful blog in the past year. Cheers on Rick’s incoming 200th post!

Friday, October 14, 2011

What are those SAS jobs around Cary, NC?


SAS Institute is located in Cary, NC. In this job-scarce economy, an interesting question is: what job opportunities are available for a SAS user around this great company which created SAS, say, in an area of 150-mile radius. Fortunately, I found that the returned values from the omnipotent job search engine, Indeed.com, are highly digestible, although this website doesn’t provide analytics service to general public. To integrate the data from Indeed.com, I designed a macro to extract essential variables from the returned HTML pages. Then I set the time limit for the opening as the past 30 days, ‘SAS’ as keyword to search, and 100 openings to show on each returned page. I extracted 5 such pages by running this macro and eventually I obtained 500 job openings to conduct this experiment.

%macro extract(city =, state = , radius = , page = );
   options mlogic mprint;
   %do i = 1 %to &page;
       %let j = %eval((&i-1)*100);
       filename raw url  
       "http://www.indeed.com/jobs?q=sas%nrstr(&l)=&city+,&state%nrstr(&radius)=&radius%nrstr(&limit)=100%nrstr(&fromage)=30%nrstr(&start)=&j";
       data _tmp01;
          infile raw lrecl= 500 pad ;
          input record $500. ;
          if find(record, 'jobmap[') gt 0 and find(record, 'srcname') gt 0;
       run;
       data _&i;
           set _tmp01;
           array id[5] $50. indeed_id srcname corpname location title;
           length _str $8.;
           do _k = 1 to 5;
               if _k = 1 then _str = "jk:";
               else if _k = 2 then _str = "srcname:";
               else if _k = 3 then _str = "cmpesc:";
               else if _k = 4 then _str = "loc:";
               else _str = "title:";
               _len = length(compress(_str));
               _start = find(record, compress(_str)) + _len ;
               _end = find(record, ",", _start) ;
               id[_k] = compress(substr(record , _start  , _end - _start), "'");
           end;
           extract_time = datetime(); format extract_time datetime.;
           drop record _:;
       run;
   %end;
   data &city._&state;
       set %do n = 1 %to &page;
               _&n
           %end;;
   run;
   data &city._&state;
       retain obs extract_time corpname location title indeed_id srcname;
       set &city._&state;
       obs + 1;
   run;
   proc datasets nolist;
       delete _:;
   quit;
%mend;
%extract(city = cary, state = nc, radius = 150, page = 5);



Popular job titles
I used Wordle, a text cloud website to summarize all job titles. ‘Analyst’, ‘Developer’ and ‘Programmer’ seem to be quite common titles for a job related to SAS. Obviously many openings ask for experience, since they frequently mentioned ‘Senior’, ‘Manager’, or‘Management’. You can also predict the daily routines of those jobs by ‘Data’, ‘Marketing’, ‘Statistical’, ‘Risk’, ‘Database’, ‘Research’ and ‘Clinical’.

data _null_;
   set cary_nc(keep=title);
   string =tranwrd(upcase(title), 'SAS', ' ');
   string =tranwrd(upcase(string), 'SR', 'SENIOR ');
   file 'c:\tmp\output.txt';
   put string;
run;
Top 10 job providers
As I expected, SAS is the biggest job provider with 18 openings in the past month, followed by Bank of America (14), Ettain Group(12) and other companies. The top job providers don’t include insurance companies, CRO or pharmaceuticals, which suggests that there are fewer local companies or they may transfer the recruiting task to staffing companies.

proc sql outobs=10;
   create table _1 as
   select upcase(corpname) as name, count(*) as freq
   from cary_nc
   where corpname is not missing
   group by name
   order by freq desc    
;quit;

data _2;
   set _1;
   n + 1;
   length category $10.;
   if n = 1 then category = 'SAS';
   else if n in (2, 6) then category = 'Bank';
   else if n in(4, 9, 10) then category = 'Consulting';
   else category = 'Staffing';
run;

ods html gpath = 'c:\tmp\' style = harvest;
goptions device=javaimg ftitle="arial/bold" ftext="arial" 
htitle=.15in htext=.2in xpixels=600 ypixels=500;
proc gtile data = _2;
    tile freq tileby = (name, freq) / colorvar = category;
run;


Location, location, location
Job accumulated in five cities for those talents who have SAS skills: Charlotte, Raleigh, Cary, Durham and Richmond. Except working for SAS Institute, Charlotte and Raleigh are the best cities for a job seeker in Center North Carolina to consider.

proc gchart data= cary_nc;
   pie location;
run;


Source of job posts
Nowadays most company websites require the applicants to disclose where they get the information. In this experiment, the top 3 sites are DICE.com, Careebuilder.com and Corp-to-corp.com.

proc sql outobs=10;
   create table _3 as
   select upcase(srcname) as name, count(*) as freq
   from cary_nc
   where corpname is not missing
   group by name
   order by freq desc    
;quit;
proc sgplot data = _3;
   waterfall category = name response = freq ;
   xaxis  label= ' '; yaxis label=' ';
run;

Lesson learned from this weekly project
1. PROC GMAP in SAS 9.3 now supports alpha-transparency feature to draw a map -- a significant improvement.
2. Parsing online data doesn’t need to use the awesome Regular Expression syntax, although SAS has a few RE functions. SAS’s character functions are pretty robust.
3. Analyzing the online job information is quite entertaining for me. I believe that mining of those texts deeper would lead to more discoveries of secrets.

Friday, October 7, 2011

Create Nelson-Siegel function for yield curve

U.S. Treasury bonds with maturity ranging from 1 year to 30 years are daily updated on the Treasury’s website. However, some yields, such as from 4 years maturity bond, have to be inferred. The Nelson-Siegel function is probably one of the most important formulas to construct the yield curve. Many people use EXCEL’s Solver to find the nonlinear estimates of the four parameters (a1, a2, a3 and beta) for the Nelson-Siegel function. SAS’s PROC NLIN can also do this job smoothly. I copied today’s yield data and imported into SAS. Then I assigned some initial values to the unknown parameters and solved those parameters. With the PROC FCMP, I was able to build a function to predict the yields at any given year. Finally I plotted the yield curve (I also fitted the real yields with a quadratic regression line). The results showed that the nonlinear Nelson-Siegel function behaves better at longer maturities than OLS regression.


**********************(1) IMPORT STEP******************************************;
data _1;
   input ytm : $4. @@;
   cards;
1mo   3mo   6mo   1yr   2yr   3yr   5yr   7yr   10yr 20yr   30yr
;;;

data _2;
   input yield : 3.2 @@;
   cards;
0.01 0.01   0.03   0.09 0.29 0.46   1.01   1.52   2.01   2.71   2.96
;;;

data _3(keep=yield years);
   set _1;
   set _2;
   if find(ytm, 'yr') gt 0;
   years = input(compress(ytm, 'yr'),3.);
   yield = yield / 100;
run;

**********************(2) SOLVE PARAMETERS************************************;
proc nlin data=_3 method=newton;
   parms a1 = 0.05 a2 = 0.05
      a3 = 0.05 beta = 1;
   model yield=a1+(a2+a3)*(beta/years)*(1-exp(-years/beta))-a3*exp(-years/beta);
   ods output parameterestimates = parmds;
run;

data _null_;
   set parmds;
   call symput(parameter, estimate);
run;

**********************(3) BUILD FUNCTION AND APPLY IT****************************;
proc fcmp outlib = work.func.finance;
   function nelson_siegel(years);
      return(  
         &a1 + (&a2+&a3)*(&beta/years)*(1-exp(-years/&beta))
         - &a3*exp(-years/&beta)
   );
   endsub;
run;
options cmplib=work.func;
data _4;
   do years = 1 to 30;
      predyield = nelson_siegel(years);
      output;
   end;
run;

data _5;
   merge _3(rename=(yield=realyield)) _4(in=a);
   by years;
   if a;
run;

ods html style = money;
proc sgplot data = _5;
   scatter x = years y = predyield;
   series x = years y = realyield ;
   reg x = years y = realyield / degree = 2;
   xaxis grid; yaxis grid label = ' ';
   format realyield percent8.2;
run;
ods html style = htmlbluecml;
********************END OF ALL CODING*****************************************;

Friday, September 30, 2011

Modeling loss given default (LGD) by finite mixture model

The 'highly skewed' and 'highly irregular' loss data from the insurance and banking world is routinely fitted by a simple beta/ lognormal/gamma/Pareto distribution. While looking at the distribution plot, I bet that many people don’t want to buy this story and are willing to explore better ways. Finite mixture model that incorporates multiple distributions can be a good option in the radar map. For example, Matt Flynn will present how to use PROC NLMIXED to realize finite mixture model for insurance loss data in the incoming SAS ANALYTICS 2011 conference. Finally the revolutionary FMM procedure shipped with SAS 9.3 makes building finite mixture model easy.



For example, I have a sample loss given default dataset with 317 observations: lgd(real loss given default) is the dependent variable; lgd_a(mean default rate by industry), lev(leverage coefficient by firm) and i_def( mean default rate by year) are independent variables. The kernel distribution is plotted and difficult to be estimated by naked eyes.

data lgddata;
   informat lgd  lev  12.9 lgd_a 6.4  i_def 4.3;
   input lgd lev lgd_a i_def;
   label lgd   = 'Real loss given default'
      lev   = 'Leverage coefficient by firm'
      lgd_a = 'Mean default rate by year'
      i_def = 'Mean default rate by industry';
   cards;
0.747573451   0.413989786   0.6261   1.415
/* Other data*/
0.748255544   0.607452819   0.3645   3.783
;
run;

proc kde data = lgddata;
   univar lgd / plots = all;
run;

data _lgddata01;
   set lgddata;
   id + 1;
run;
proc transpose data = _lgddata01 out = _lgddata02 ;
   by id;
run;
proc sgplot data = _lgddata02;
   hbox col1 / category = _LABEL_;
   xaxis label = ' ';
run;
What I need PROC FMM to do is to estimate: 1. which distribution is the best from beta, lognormal, and gamma distributions; 2. how many components (ranging from 1 to 10) are the best for each distribution. To automate and visualize the process, I designed a macro. From the plots above, all penalized criterions (AIC, BIC, etc.) indicate that beta distribution is better than the other two. Also the beta distribution has higher Pearson statistic value and less parameter numbers.

ods html style = money;
%macro modselect(data = , depvar = , kmin= , kmax = , modlist = );
   %let modcnt=%eval(%sysfunc(count(%cmpres(&modlist),%str( )))+1);
   %do i = 1 %to &modcnt;
      %let modelnow = %scan(&modlist, &i);
      ods output  fitstatistics = &modelnow(rename=(value=&modelnow));
      ods select densityplot fitstatistics;
      proc fmm data = &data;
         model  &depvar = / kmin=&kmin kmax= &kmax dist=&modelnow;
      run;
   %end;
   data _final;
      %do i = 1 %to &modcnt;
         set %scan(&modlist, &i);
      %end;
   run;
   proc sgplot data = _tmp01;
      %do i = 1 %to &modcnt;
         %let modelnow = %scan(&modlist, &i);
         series x =  descr y = &modelnow;
         where descr ne :'E' and descr ne :'P';
      %end;
      yaxis label = ' ' grid;
   run;
   proc transpose data = _tmp01 out = _tmp02;
      where descr = :'E' or descr = :'P';
      id descr;
   run;
   proc sgplot data = _tmp02;
      bubble x = effective_parameters y = effective_components 
         size = pearson_statistic / datalabel = _name_;
      xaxis grid;  yaxis grid;
   run;
%mend;
%modselect(data = lgddata, depvar = lgd, kmin= 1, 
      kmax = 10, modlist = beta lognormal gamma);


The optimized component number for the beta distribution is 5 – beautiful matching curve. Lognormal distribution exhausted the maximum 10 components and fits the kernel distribution very awkwardly. Gamma distribution used 9 components and fits relatively well.


Then I chose the 5-compenent Homogeneous beta distribution to model the LGD data. PROC FMM provided all parameter estimates for these 5 components. From the plot above, the intercepts and the scale parameter s are different as expected. Interestingly, the parameters of lgd_a(mean default rate by industry) present big diversity, while the parameters of i_def( mean default rate by year) tend to converge at the zero point.

ods output parameterestimates = parmds;
proc fmm data = lgddata;
   model  lgd = lev lgd_a i_def / k = 5 dist=beta;
run;

proc sgplot data = parmds;
   series x = Effect y = Estimate / group = Component;
   xaxis grid label = ' '; yaxis grid;
run;
ods html style = htmlbluecml;
In conclusion, although PROC FMM is still an experimental procedure, its powerful model selection features would significantly change the way how people feel and use the loss data in the risk management industry.

Thursday, September 29, 2011

An easy solution for Multi-Sheet EXCEL reporting

Currently the only way to output SAS datasets as a multi-sheet EXCEL workbook for reporting is to use ExcelXP ODS tagset. I like this method a lot, because it can generate stylish multiple EXCEL sheets and is highly customizable. However, in practice it has some weaknesses. 1 - Running this tagset is resource-costly, since it depends on an 8k lines SAS codes - ExcelXP.sas. While dealing with a large SAS dataset, it always gets jammed. 2- It only allows one grouping variable by the BY statement inside the output procedures (PROC REPORT, PROC PRINT, etc.). 3 - The user often has to estimate the width for each column in EXCEL.

Actually we can use SAS macro and VBA macro together to obtain high-quality multi-sheet EXCEL workbook. The workflow is pretty simple: first a SAS macro splits a SAS dataset into many XLS files in a folder through ODS HTML targset. Second a VBA macro merges those single XLS files as sheets in to a workbook. For example, SAS shipped with a sample dataset SASHELP.PRDSAL2 with 23040 observations and 11 variables. If we want to generate a multi-sheet EXCEL workbook grouped by two variables such as ‘state’ and ‘year’, we can set up an empty directory in the hard disk and run a macro like below. As a result, we will have a number of small XLS files.

%macro split(data = , folder = , clsvar1 = , clsvar2 = );
   options nocenter nodate nonumber ps = 9000; 
   title; footnote;
   ods listing close;
   proc sql noprint;
      create table _tmp01 as 
      select &clsvar1, &clsvar2, count(*) as number
      from &data
      group by &clsvar1, &clsvar2
      order by &clsvar1, &clsvar2
   ;quit;
   data _tmp02;
      set _tmp01 nobs = nobs;
      where number gt 0;
      index = _n_;
      call symput('nobs', nobs);
   run;
   %do i = 1 %to &nobs;
      proc sql noprint;
         select &clsvar1, &clsvar2
            into:clsvar1name,:clsvar2name
         from _tmp02
         where index = &i
      ;quit;
      %let filepath = &folder\%sysfunc(dequote(&clsvar1name))_%sysfunc(dequote(&clsvar2name)).xls;
      ods html file = "&filepath " style = minimal; 
      proc print data = &data noobs label;
         where &clsvar1 = "&clsvar1name" and &clsvar2 = &clsvar2name;
      run;
   %end;
   ods listing; 
   ods html close;
%mend;
%split(data = sashelp.PRDSAL2, folder = C:\test1, clsvar1 = state , clsvar2 = year)
Then we can open EXCEL, press ALT+F11, paste the VBA code below and run it. Then we will be able to have a decent multi-sheet EXCEL workbook. The biggest strength for this method is that it is very fast – the overall process (running SAS macro and VBA macro) only takes less than a minute for this relatively large dataset SASHELP.PRDSAL2. And it can be expanded to many grouping variables by modifying the SAS macro a little. In conclusion, for big data EXCEL reporting, combining SAS macro and VBA macro together is a good alternative other than ExcelXP ODS tagset.
VBA Draft 3.1 Blogpost

Monday, September 19, 2011

A test to count missing values for large data

This morning Rick introduced how to count the missing frequencies for all variables in a dataset, including character and numeric variables. He provided two solutions by either PROC FREQ or PROC IML. I have a petty macro based on PROC SQL’s nmiss() function to do the same job. In this big data era, I am interested in those SAS codes’ efficiencies to check large data.

Then I simulated a 1e8 size dataset (5 numerical and 5 character variables) and tested the 3 methods. According to the results, PROC FREQ is slightly better than PROC SQL in both memory usage and speed. However, a macro may be needed to integrate a number of output tables by PROC FREQ for reporting (the demo macro by SAS Knowledge Base is an option and can be improved). Sadly, PROC IML used out all the 250MB memories the server allocated to me and can’t finish this test. Similarly, the memory volume should be considered for other matrix languages like R and Matlab.

P.S.
Thanks to Rick’s new codes, missing values were counted successfully by PROC IML finally. The real time spent by PROC IML was 150 seconds, which is almost equal to running the macro by PROC SQL. It is interesting to compare the functions from these two procedures: COUNT() and NMISS() functions in PROC SQL and COUNTN() and COUNTMISS() functions in PROC IML can work for both numeric and character variables.


**********************(1) SIMULATION STEP**************************************;
data test;
   array _c{*} $ c1-c5;
   array _n{*} n1-n5;
    do i = 1 to 1e8;
       do j = 1 to 5;
          if ranuni(1234)>0.2 then _c[j]= 'A';
          else call missing(_c[j]);
          if ranuni(4321)>0.3 then _n[j] = rannor(0);
          else call missing(_n[j]);
       end;
       output;
       drop i j;
    end;
run;

**********************(2) MODULE-BUILDING STEP*********************************;
%macro countmiss(data = );
   ods listing close;
   ods output variables = _varlist;
   proc contents data = &data;
   run;
   proc sql;
      alter table _varlist
      add nmiss num
      ;
      select count(*) into :nvar
      from _varlist
      ;
      select count(*) into :nobs
      from &data
      ;
   quit;
   %do i = 1 %to &nvar;
      proc sql;
         select cats('nmiss(', variable, ')') into :miss_var
         from _varlist
         where num = &i
         ;
         select &miss_var into: miss_num
         from &data
         ;
         update _varlist
         set nmiss = &miss_num
         where num = &i
         ;
      quit;
   %end;
   proc sort data = _varlist;
     by num;
   run;
   ods listing;
   proc report data = _varlist nowd split = '*';
      columns  type num variable label nmiss pcnmiss;
      define type / group;
      define variable / width = 15;
      define nmiss / display 'Missing*Number';
      define pcnmiss / computed format = percent8.3
                  'Missing*Percentage' width = 10;
      compute pcnmiss;
         pcnmiss = nmiss / &nobs;
      endcomp;
   run;
%mend countmiss();

proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;

**********************(3) TESTING STEP***************************************;
options fullstimer;
*******************METHOD(1): PROC SQL***************************************;
%countmiss(data = test);
 
*******************METHOD(2): PROC FREQ*************************************;
proc freq data = TEST; 
   format _CHAR_ $missfmt.; 
   tables _CHAR_ / missing missprint nocum nopercent;
   format _NUMERIC_ missfmt.;
   tables _NUMERIC_ / missing missprint nocum nopercent;
run;
*******************METHOD(3): PROC IML***************************************;
proc iml;
   use test;
   read all var _NUM_ into x[colname=nNames]; 
   n = countn(x,"col");
   nmiss = countmiss(x,"col");
   read all var _CHAR_ into x[colname=cNames]; 
   close one;
   c = countn(x,"col");
   cmiss = countmiss(x,"col");
   Names = cNames || nNames;
   rNames = {"    Missing", "Not Missing"};
   cnt = (cmiss // c) || (nmiss // n);
   print cnt[r=rNames c=Names label=""];
quit;
********************END OF ALL CODING*****************************************;

Sunday, September 11, 2011

Add 10 buttons to enhance SAS 9.3 environment

One striking change in SAS 9.3 is that everything is HTMLized. Adding user-defined buttons to the toolbar of SAS’s windowing environment would make coding more efficient.
1. Clear log and result
The most used SAS command is now prompted by a button.

output;clear;log;clear;wpgm;
2. Clear files left by results viewer
Lots of trash (mostly HTML files and PNG-formatted images by SAS 9.3's results viewer) would permanently remain in the current folder after each SAS session (the current folde is indicated on the Window Bar at the bottom). To avoid the possibility that someday the hard disk is jammed, clearing them out at times may be a good habit.

gsubmit "options noxsync noxwait;x 'del *.htm';x 'del *.png';"
3. Clear work directory
Everything in the WORK directory can be cleared out during a SAS session by such an approach.

gsubmit "proc catalog c=work.sasmacr kill force;proc datasets kill nolist;quit;"
4. Turn off ODS graphics
This code line closes ODS graphics for the statistics procedures if needed, and then saves resource expense.

gsubmit "ods graphics off;"
5. Separate html
SAS 9.3 stores all results in a very long html file that is what we see in the results viewer new html files. Ken Kleinman mentioned that running this code would generate a new html file.

gsubmit "ods html close; ods html;"
6. Listing on
One click to go back to the listing output if with output-intensive job.

gsubmit "ods _all_ close; ods listing;"
7. Increase DPI
The original image resolution in SAS 9.3 is set at low level for processing efficiency. Actually DPI (dots per inch) in SAS can be raised as high as 600. Graphs with 300 dpi would be decent for regular paperwork.

gsubmit"ods html image_dpi=300;"

8. Change style
The default style in SAS 9.3 is htmlblue, which is an elegant and smooth html template. However, it has weaknesses: while it draws scatter plots, the overlapping dots tend to be indistinguishable. My favorite for scatting plots is 'harvest'. Aligning them together would see the difference. SAS 9.3 has many other styles. I like 'navy' or 'money' for pictures that need high contrast. The styles 'journal/2/3' may be considered for publishing purpose.

gsubmit"ods html style=harvest;"
9. Recover formchar
If you see tables on listing separated by weird charaters, it is time to click this button.

gsubmit "options formchar='|----|+|---+=|-/\<>*';"
10. Community help
Searching the key words in SAS-L is probably the most efficient way to find answers for any SAS question. Another option is Lex Jansen that searches from total 15925 SAS conference proceeding papers.

wbrowse 'http://listserv.uga.edu/archives/sas-l.html'

Those user-defined buttons can be exported and imported by PROC CATALOG.

********EXPORT****************************;
proc catalog;
   copy in=sasuser.profile out=work.backup;
   select sasedit_main / et=toolbox;
quit;
********IMPORT****************************;
proc catalog;
   copy in=work.backup out=sasuser.profile;
   select sasedit_main / et=toolbox;
quit;

Sunday, August 28, 2011

My 10 wishes for SAS

No wonder that SAS 9.3 is one of SAS’s greatest products. Beyond it, I have 10 secret wishes and hope the far-away SAS 9.3.2 or SAS 9.4 might realize them someday.

1. Trigger for SAS dataset
A wonderful thing about SAS is that it can be used as a RDBMS with full functionality. Just one piece from the SQL language is missing in SAS - trigger. Adding triggers would bring more security for SAS datasets or data views, and automate some routine operations. 

2. PROC PIVOT for pivot table
Pivot table is a huge business success, since I found that every boss loves to do point-and-click data aggregation in Excel (why not they just use the simple PROC FREQ of SAS?). I often spend many hours to painfully decorate a pivot table. A procedure that directly exports SAS’ dataset to pivot-table-contained Excel spreadsheet is going to be a big plus.

3. Visible hash objects
Hash object offers an efficient alternative to the hard disk based Data Step programming. Michele Burlew’s new book  this fall would be a milestone for this emerging technology since SAS 9.1. If SAS windowing environment provides the views for the hash objects through the library explorer, matching or lookup on the hash objects would be better perceived.

4. A multi-threading LOGISTIC procedure
Last week Rick introduced how to open multiple workspace instances to make SAS/IML multi-threading. For many SAS users, PROC LOGISTIC is worthy of the half price they paid for SAS. It seems that SAS is developing a multi-threading HPLOGISTIC procedure for Teradata or in-database technology. However, at the age of big data, a multi-threading LOGISTIC procedure is still very much desired in SAS/STAT.

5. A system options for decimal digits
It is well known that to export the tables by ODS and then change the formats would allow displaying more decimal places for SAS’s outputs. However, a system option specifying the number of digits in the result would save coding time.


/*******************OPTION 1 ***********************************************
* From NOTE 37106 at support.sas.com/kb/37/106.html 
* How can I display more or fewer decimal places in procedure results
* -- Change the decimal places for a table 
****************************************************************************/
ods output parameterestimates = pe;
proc reg data = sashelp.class;
   model Weight = Height;
run; 
proc print data = pe label noobs; 
   format _numeric_ 12.8 ;
run;

/*******************OPTION 2 ***********************************************
* From Robin High at http://listserv.uga.edu/cgi-bin/wa?A2=ind0704b&L=sas-l&P=73559
* Re: How to display p value with more dicimal digits
* -- Change  the decimal places for all p-values 
****************************************************************************/
proc template;
   define column Common.PValue;
      format = pvalue12.8; 
   end;
run;
proc reg data = sashelp.class;
   model Weight = Height;
run; 

6. Support vector machine in SAS/STAT
Although SAS Enterprise Miner 7.1 has two procedures: PROC SVM and PROC SVMSCORE, they seem primitive and only apply for bivariate response variable. A procedure for SVM in SAS/STAT alongside the robust GLM-based procedures there would relieve many desperate SAS coders who got projects to do SVM with SAS.

7. A trial version (or learning edition)
One friend of mine is still using SAS 6 and insists that it is the culminating product. If there is a trial edition of SAS 9.3 available he can taste, he probably will change the idea. A learning edition with no cost or little cost can attract more people to start to learn SAS.

8. PROC TEXTCLOUD
Text cloud is a fancy visualization tool, although it does nothing about statistics. For example, Rick summarized his first 100 blog posts by it. R also has a nice package ' wordcloud'. A text cloud procedure would definitely make SAS more fun.

9. Random forest in Enterprise Miner
Random forest is one of the popular classification methods, which has not been included in Enterprise Miner 7.1 yet. Hope in the future it could become one of the modeling nodes.

10. Reasonably priced SAS cloud
Amazon is earning a windfall of money by its cloud services. And many start-ups provide R clouds. A PC SAS or UNIX SAS cloud may be a lucrative business for SAS. And I will be happy to show SAS to my friends on iPad or Android phone.

Thursday, August 25, 2011

A macro design pattern by PROC FCMP


We all heard horrible stories that someone tried to piece together a bunch of nice functioning macros for a big macro, and ended up with a messy and undebuggable system. Part of reasons can be about encapsulation: not all SAS programmers have the good habit to localize the macro variables by %local statement; the leaking macro variables may ruin the attempt to utilize multiple macros written by different SAS programmers. To solve this problem, Mark Tabladillo brought the concepts of encapsulation, polymorphism, and inheritance into nested macros. And he raised several design patterns for macros to emulate the object-oriented languages.

The FCMP procedure, besides its original purpose as a function compiler, could encapsulate macros by its RUN_MACRO function. The macro-based functions seem to be more safe modules than the macros themselves. Erin, Daniel and Himesh in their SAS Global 2011 paper showed an example to build a complicated reporting system for academic performances. Their principle is to construct a few macro-embedded functions by PROC FCMP and then incorporate them with an interface macro. Here I modified their codes a little to increase the number of macros and showed the relationship among the elements in the UML diagram above. The stucture is similar to the adapter pattern, one of the many OOP design patterns, with PROC FCMP as a wrapper.

Overall, functionalizing our macros or our colleagues’ macros by PROC FCMP is an alternative way to integrate them for a ‘big’ purpose.


 
/*******************READ ME*********************************************
* THE CODES BELOW ARE COPIED AND MODIFIED FROM ERIN LYNCH, DANIEL 
* O’CONNOR, HIMESH PATEL OF SAS INSTITUTE 
*
* THE ORIGINAL CODE AND RAW DATA CAN BE FOUND FROM THEIR PAPER 
* MY REPORTING REQUIRES A FULL STAFF—HELP! 
* PAPER 291, SAS GLOBAL FORUM 2011
* support.sas.com/resources/papers/proceedings11/291-2011.pdf
*
****************END OF READ ME******************************************/

%macro tabledata_prep;
   options topmargin=.125in bottommargin=.125in leftmargin=.25in rightmargin=.25in nodate nonumber;  
   title; footnote;
   ods escapechar="~"; 
   %let tabledata=%sysfunc(dequote(&tabledata.)); 
   data tabledata; 
      set &tabledata; 
      district=substr(district,1,8)||' '||substr(district,9,1); 
      school=substr(school,1,6)||' '||substr(school,7,1); 
   run; 
%mend;

%macro linedata_prep;
   %let linedata=%sysfunc(dequote(&linedata.)); 
   ods _all_ close; 
   data linedata; 
      set &linedata; 
      district=substr(district,1,8)||' '||substr(district,9,1); 
   run; 
   proc sort data= linedata out=sorted_linedata; 
      by district year; 
   run; 
   proc sort data= linedata out=districts(keep=district) nodupkey; 
      by district; 
   run; 
%mend;
                                       
proc template; 
   define style Styles.Charter; 
      parent = styles.printer; 
      style Body from Document 
         "Undef margins so we get the margins from the printer or SYS option"  / 
            marginbottom = _undef_ 
            margintop = _undef_ 
            marginright = _undef_ 
            marginleft = _undef_ 
            pagebreakhtml = html('PageBreakLine') 
            backgroundimage="Your.png"; 
   end; 
run; 
 
%macro Newfile; 
  %if &path ne '' %then %let pathopt= path=&path(url=none); 
   %else  %let pathopt=; 

   %if &gpath ne '' %then %let gpathopt= gpath=&gpath(url=none); 
   %else %let gpathopt=; 

   %let path=%sysfunc(dequote(&path.)); 
   %let gpath=%sysfunc(dequote(&gpath.)); 
   %let destination=%sysfunc(dequote(&destination.)); 
   %let file=%sysfunc(translate(%sysfunc(dequote(&file.)), "_", " ")); 
   %let extension=%sysfunc(dequote(&extension));      
 
   %if &styleparm ne '' %then %let styleopt= style=%sysfunc(dequote(&styleparm.)); 
   %else  %let styleopt=; 

    %if ( %upcase(&destination) eq PDF ) %then %do; 
       ods &destination file="&path.&file..&extension" notoc startpage=no 
       &styleopt; 
   %end; 
   %else %if (( %upcase(&destination) eq RTF ) or ( %upcase(&destination) eq TAGSETS.RTF )) %then %do; 
       ods &destination file="&path.&file..&extension" startpage=no &styleopt; 
   %end; 
   %else %if ( %upcase(&destination) eq HTML ) %then %do; 
       ods &destination file="&file..&extension" &pathopt &gpathopt &styleopt; 
   %end; 
%mend; 
                                                 
%macro Enrollment; 
   %let district=%sysfunc(dequote(&district.)); 
   ods text="~{newline 3}"; 
   ods text="~{style [width=100pct font_size=26pt background=CXf4e9c9] &district Enrollment By School Year}"; 
   ods text="~{newline 2}"; 
   ods text="~{style systemtitle [just=center]Enrollment by Year}"; 
   ods graphics / height=3in width=6in; 
   proc sgplot data=sorted_linedata(where=(district="&district")); 
        series x=year y=students / markers 
        lineattrs=(color=CX39828C pattern=SOLID thickness=3) 
        markerattrs=(color=CX0000FF symbol=STARFILLED) name='series'; 
   run; 
%mend; 

%macro District_Makeup; 
   %let district=%sysfunc(dequote(&district.)); 
   ods text="~{newline 6}"; 
   ods text="~{style [width=100pct font_size=26pt background=CXf4e9c9]Current Year Percentage Of Students By School}"; 
   proc report data=tabledata(where=(district="&district")) nowd 
         style(report)={frame=void font_size=12pt rules=none backgroundcolor=CXF4E9C9 
         cellpadding=0 cellspacing=0}; 
      define district / noprint; 
      define students / noprint; 
      define total_enrollment / noprint; 
      define school / '' style(column)={width=5.5in}; 
      define percent / '' style(column)={width=.5in} right; 
   run; 
%mend; 

%macro Closefile; 
   %let destination=%sysfunc(dequote(&destination.)); 
   ods &destination close; 
%mend; 

proc fcmp outlib=work.fncs.submit; 
   function tabledata_prep(tabledata $); 
      rc = run_macro('tabledata_prep', tabledata); 
      return(rc); 
   endsub; 
   function linedata_prep(linedata $); 
      rc = run_macro('linedata_prep', linedata); 
      return(rc); 
   endsub; 
   function Enrollment(district $); 
      rc = run_macro('Enrollment', district ); 
      return(rc); 
   endsub; 
   function District_Makeup(district $); 
      rc = run_macro('District_Makeup', district ); 
      return(rc); 
   endsub; 
   function Newfile( destination $, path $, gpath $, file $, extension $, styleparm $ ); 
      rc = run_macro('Newfile', destination, path, gpath, file, extension, styleparm ); 
      return(rc); 
   endsub; 
   function Closefile( destination $ ); 
      rc = run_macro('CloseFile', destination ); 
      return(rc); 
   endsub; 
run; quit; 
 
%macro Academic_Performance_Report (linedata =, tabledata = , destination=, path=, gpath=, extension=, style= );        
   options mlogic mprint; 
   %if ( "&extension" eq "" ) and ( &destination ne "" ) %then %let extension =&destination; 
   options cmplib=work.fncs;       
   data _null_;
      rc = tabledata_prep(symget('tabledata')); 
      rc = linedata_prep(symget('linedata'));
   run;
   data _null_;
      set districts;                    
      rc = Newfile( symget('destination'), symget('path'), symget('gpath'), 
         cats(district, "_Annual_Performance"), symget('extension'), symget('style'));                                 
      if ( rc eq 0) then do; 
         rc = Enrollment( district ); 
         rc = District_Makeup( district ); 
         rc = Closefile(symget('destination')); 
      end;  
   run; quit;
%mend; 
 
%Academic_Performance_Report(linedata = data1, tabledata = data2, destination=html, path=, gpath=, extension=, style=Charter ); 

Sunday, August 14, 2011

Using PROC COPULA in a more volatile market

The last week witnessed one of the wildest fluctuations in the market. Copula could measure the nonlinear dependence of multiple assets in a portfolio, and most importantly, is pronounced as \`kä-pyə-lə\(Thanks to the tip by Rick). The latest COPULA procedure in SAS 9.3 is one of the emerging tools to implement copulas.

To test it, I used R to download the daily return data for a few stock shares plus S&P500 index prices, since January 01, 2010. The six stocks are Citi group(C), JP Morgan(jpm), Pfizer(pfe), IBM(ibm), Apple(aapl), and Google(goog). I constructed an equally weighted portfolio by them. Until August 12, 2011, there are 406 observations. Therefore, in a composite plot by SAS, the stocks of banks show the greatest volatility, followed by pharmaceutical and high-tech companies.


#*********************(0) DOWNLOAD MARKET DATA***********************;
library("tseries")    
sp500= get.hist.quote(instrument="^gspc",start="2010-01-01",quote="AdjClose")
c    = get.hist.quote(instrument="c",    start="2010-01-01",quote="AdjClose") 
jpm  = get.hist.quote(instrument="jpm",  start="2010-01-01",quote="AdjClose") 
pfe  = get.hist.quote(instrument="pfe",  start="2010-01-01",quote="AdjClose")
ibm  = get.hist.quote(instrument="ibm",  start="2010-01-01",quote="AdjClose")
aapl = get.hist.quote(instrument="aapl", start="2010-01-01",quote="AdjClose")
goog = get.hist.quote(instrument="goog", start="2010-01-01",quote="AdjClose")

result=as.data.frame(diff(log(merge(sp500, c, jpm, pfe, ibm, aapl, goog))))
write.csv(result,file='c:/tmp/r2sas.csv')

**********************(1) INPUT RAW DATA*****************************;
options fullstimer; dm 'output;clear; log;clear;';
data raw;
   infile 'c:/tmp/r2sas.csv' delimiter = ',' missover dsd firstobs=2 ;
   informat date yymmdd10. sp500 c jpm pfe ibm aapl goog best32.; 
   format date date9.;
   input date sp500 c jpm pfe ibm aapl goog;
run;

**********************(2) PLOT STOCK RETURNS*************************;
proc transpose data = raw out = test01;
   by date;
   var c jpm pfe ibm aapl goog;
run;
data test02;
   merge test01 raw(keep=date sp500);
   by date;
run;
ods graphics / antialiasmax=2900;
proc sgpanel data = test02;
   panelby _name_ / spacing=5 columns = 3 rows = 2 novarname;
   series y=sp500 x=date / lineattrs=(color=red);
   series y=col1  x=date / lineattrs=(color=blue);
   refline 0/ axis=y lineattrs=(pattern=shortdash);
   rowaxis label = 'daily returns';
   label col1 = 'individual stock' ;
run;

I followed the online document of this procedure and also chose the t copula. The correlation plots of the fitted data are displayed above. It seems that PROC COPULA could only draw up to 5*5 matrix for scatter plots in my test. I don’t know if there is any parameter to activate since I have 6 stocks.

**********************(3) CALCULATE COPULA***************************;
proc copula data = raw(drop=sp500); 
   var c jpm pfe ibm aapl goog;
   fit t / marginals = empirical
           method = mle
           plots = (data = both matrix);
   simulate / ndraws = 10000
            seed = 20100822
            out  = sm_t;
run;

Then the kernel densities between stocks from the simulated dataset were calculated in SAS and plotted in R.
 
**********************(4) CALCULATE KERNEL DENSITIES*****************;
%macro kernel(var_list = );
   ods select none;
   %do i = 1 %to 5;
      %do j = %eval(&i + 1) %to 6;
         %let var1 = %scan(&var_list, &i);
         %let var2 = %scan(&var_list, &j);
         proc kde data= sm_t ;
            bivar &var1 &var2 / out = _tmp01;
         run;
         %if %eval(&i + &j) = 3 %then %do;
            data comb;
               set _tmp01;
            run;
         %end;
         %else %do;
            data comb;
               set comb _tmp01;
            run;
         %end;
      %end;
   %end;
   ods select all;
%mend;
%kernel(var_list = c jpm pfe ibm aapl goog);

data comb1;
   set comb;
   length gname $15;
   gname = cats('x=', var1, ';', 'y=', var2);
   keep value1 value2 gname density;
run;
proc export data = comb1 outfile = 'c:/tmp/sas2r.csv' replace;
run;

#*********************(5) PLOT DENSITY IN R**************************;
x = read.csv('c:/tmp/sas2r.csv')
library('lattice')
wireframe(density ~ value1 * value2 | gname , x, shade = TRUE,
          screen = list(z = -30, x = -50), lwd = 0.01, 
          xlab = "Stock X", ylab = "Stock Y",  
          zlab = "Density")


The simulated daily portfolio returns are likely to follow a normal distribution.

**********************(6) PLOT RETURN DISTRIBUTION*******************;
data port_ret (drop = i ret);
   set sm_t;
   array returns{6} c jpm pfe ibm aapl goog;
   ret =0;
   do i =1 to 6;
      ret = ret+ (1/6)*exp(returns[i]);
   end;
   port_ret = ret-1;
run;

proc kde data = port_ret;
   univar port_ret / percentiles = 1,2.5,5,10,90,95,99 plots=histdensity;
   ods output  percentiles = pcts;
   format port_ret percent8.3;
   label port_ret = 'portfolio return';
run;

Several predicted portfolio changes at given probabilities are given in a tilt plot. Hope this portfolio’s performance next day (August 15, 2011) would be within the expected ranges.

**********************(7) PLOT PORTFOLIO RETURNS*********************;
data prob;
   set pcts;
   percent = percent / 100;
   if percent > 0.5 then do ;
      percent  = 1 - percent ;
    result = put(percent , percent8.1)||
         'probability portfolio gains'|| put(port_ret, percent8.3);
   end;
    else result = put(percent , percent8.1)||
         'probability portfolio loses'|| put(port_ret, percent8.3);
run;

goptions device=javaimg ftitle="arial/bold" ftext="arial" 
htitle=.15in htext=.2in xpixels=600 ypixels=500;
proc gtile data = prob;
    tile Percent tileby = (result, Percent) / colorvar = port_ret;
   format port_ret 8.4;
   label port_ret = 'portfolio return';
run;

PROC COPULA supports five types of copulas(normal copula, t copula, clayton copula, Gumbel copula and Frank copula). Jan Chvosta described a ranking method to choose the best copula. I can easily apply the author's protocol.

Overall, PROC COPULA has much lower learning curve than the R package ‘copula’. Hope it grows to a dominating tool in analyzing copula.

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...