Friday, January 14, 2011

SAS vs. R in data mining


The past three years have witnessed the rise of R, an open source statistical package. Search for R-related books on Amazon, and plenty of recent titles show up, ranging from graphics to scientific computation. Thanks to recent graduates who received R training in their statistics programs, R is starting to appear in serious business settings. The basic difference is that SAS is licensed by SAS Institute, a company with more than 10,000 employees, while R is free. In their book ‘SAS and R’, Ken Kleinman and Nicholas Horton systematically compare the two packages. Even though they carefully avoid the sensitive question of which one is better, readers can easily conclude that R can do the work as well as SAS. Then the next question is: why not go with the freebie?

R enjoys many cutting-edge features. First, R is a functional language. Writing a function is simple and quick, since a function always returns an object. In SAS, implementing a function is cumbersome, and most SAS programmers use macros instead. Second, R is indispensable in data visualization, owing to a number of creative packages such as ‘lattice’. SAS has recently struck back with its SG procedures, which are also based on the Trellis concept of high-level plotting. Even though R is more versatile, in most cases the results look equally good. Third, in data mining, since ‘R is a leading language for developing new statistical methods’ (as SAS admitted when it announced that SAS can call R functions from its IML module), the packages available in R are more plentiful than the secretive procedures behind SAS Enterprise Miner. To name some of them: rpart for decision trees; randomForest for random forests; nnet for neural networks; e1071 or kernlab for support vector machines; e1071 for Naive Bayes; earth for multivariate adaptive regression splines; RWeka for boosting. No doubt, any emerging data mining technology can find its counterpart in R. In his new book, Dr. Torgo gives four illustrations of using R in data mining, for ecology, the stock market, fraud detection and bioinformatics, respectively. Interestingly, biology people and finance people seem least interested in SAS products.

Big data is a curse for R, and data mining is always data-intensive. The OOP feature actually backfires on R. Everything, even the raw data, is an object in memory. No doubt this speeds up computation, but the side effect is that an R session runs out of memory easily. The R programmer has to be constantly aware of memory usage. I used to be scared by the noise SAS makes reading data from the hard disk after I submit code. But the strategy works even when the data set is larger than physical memory. As long as it does not freeze, let SAS run. After selling SPSS to IBM, Dr. Norman Nie became CEO of Revolution Analytics, a commercial provider of R. Dr. Nie’s innovation for R is to introduce the same old trick: use XDF files on the hard disk to store input data. Another distinction between them is that R reads data by column while SAS reads data by row. In R, the work after reading data includes splitting rows and piecing them back together. In SAS, data integration is handy thanks to the data step’s inherent iteration and its numerous unique informats. Another pitfall for R in data management is that it does not support SQL natively, while Proc SQL makes SAS as capable as other mainstream RDBMSs.
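The disk-based strategy can be sketched in a few lines. This is only a minimal illustration in Python of processing one row at a time, not a description of how SAS or XDF works internally; the function name `streaming_mean` and the sample columns are hypothetical.

```python
import csv
import io


def streaming_mean(rows, column):
    """Return the mean of `column` while holding only one row in memory."""
    total = 0.0
    count = 0
    for row in rows:  # each row is read, used, and then discarded
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Hypothetical sample data; a real input would be a large file on disk,
# opened with open(path, newline="") and passed to csv.DictReader.
sample = io.StringIO("id,amount\n1,10.0\n2,20.0\n3,60.0\n")
mean_amount = streaming_mean(csv.DictReader(sample), "amount")
print(mean_amount)
```

Because the reader yields rows lazily, memory use stays flat no matter how long the file is. The price is that every pass over the data goes back to the disk.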

A platform called PCR could be built from open source software to compensate for R’s shortcomings (P: Python to integrate and manipulate data; C: MySQL or SQLite to store and query data; R: R to model and visualize data). Python has peerless capacity for processing complicated data. A database between Python and R avoids generating CSV files. Calling R functions from Python through RSPython, rpy2 or rpy offers other alternatives for direct communication. The underlying principle of PCR is to subset the data by scripting or SQL queries, and feed R only the piece it can absorb. Dr. Janert uses Python, partly assisted by R, to go through the data analysis process in his thoughtful book. I hope that in his next book he will use R more for modeling and explore data mining further. In my experience, working with Python, SQLite and R together is pleasant and productive.
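The subsetting principle might look like the sketch below, which uses Python's built-in sqlite3 module. The `sales` table, its columns and the `load_subset` helper are hypothetical, and the hand-off to R (for example, by letting R query the same database file) is left out.

```python
import sqlite3


def load_subset(conn, region, limit):
    """Return at most `limit` rows for one region; the rest stays in the database."""
    query = (
        "SELECT id, region, amount FROM sales "
        "WHERE region = ? ORDER BY id LIMIT ?"
    )
    return conn.execute(query, (region, limit)).fetchall()


# Hypothetical table and values; a real project would connect to a file on disk
# instead of an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (id, region, amount) VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 40.0), (4, "east", 5.0)],
)
conn.commit()

subset = load_subset(conn, "east", 2)
print(subset)
```

The WHERE and LIMIT clauses do the filtering inside the database, so only the piece that fits in memory ever reaches the modeling step.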

SAS should open its data mining procedures for coding. Many procedures, such as Proc Arboretum and Proc SVM, are still licensed under Enterprise Miner. It is difficult to code with them as with other SAS procedures. SAS is far better than R in data management, and 80% of the work in data mining usually goes into transforming dirty data into workable data. For the R platform, the cost of hiring a qualified worker experienced in Python, SQL and R may be high. In summary, just as it made successful efforts to redeem its reputation in data visualization, SAS should do more for the fast-growing data mining market.

References:
1. Luis Torgo. Data Mining with R: Learning with Case Studies. Chapman and Hall, 2010.
2. Ken Kleinman, Nicholas J. Horton. SAS and R: Data Management, Statistical Analysis, and Graphics. Chapman and Hall, 2010.
3. Philipp K. Janert. Data Analysis with Open Source Tools. O'Reilly Media, 2010.


Good math, bad engineering

As a former statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...