Thursday, February 24, 2011

SAS and Revolution R for GB size data


R surpassed Matlab and SAS on recent Tiobe Programming index for popularity. I agree with this ranking[1], since R could largely replace Matlab and SAS for MB size data, and it is getting hot. The fast growing sectors, quantitative finance, bioinformatics and web analytics, are all embracing R. In those competitive fields, survival demands the skills of a developer instead of a programmer, which suggests something brings more power or higher AUC. The cross-platform, open-sourced, C-based, function-friendly R is right on top of the tide. SAS programmers have to wait SAS Institute to build tools for them, while an R developer could invent his own gloves and axes. SAS’s licensing policy is also problematic: SAS is divided into modules, such as BASE, STAT, IML, GRAPH, ETS, etc; if you want to use just one procedure in a module, you should buy the whole module and renew the license each year. Norman Nie[2] said that 15 core statistical procedures, such as cluster analysis, factor analysis, PCA and Cox regression, would satisfy 85% of the needs. Thus SAS programmers may have to pay the cost of hundreds of procedures to use the 15 procedures. On the contrary, you can specify and download any package for R. Or you construct brand-new functions or packages.

However, with larger data, R or Matlab users have to painfully jump to low-level languages. For convenience of matrix computation, R or Matlab routinely keeps copies of data in memory and does not let garbage collector frenetically dump them. With GB size data, people would soon hear the outcry of memory shortage in their workstation, and ponder to resort to C or C++. That is possibly why Matlab or R talents are so popular in academia but never welcome in job market. Amazingly, SAS is pretty robust to deal with this scale of data. SAS rarely failed me even though I sometimes complain about the speed. Of course, if memory is not a hurdle, R is another good option. Inspired by this idea, Norman Nie‘s startup company Revolution Analytics reinvigorated the community R. They used an XDF data file on hard disk to contain data before into memory stack. Then their unique package RevoScaleR would perform data analysis inside of those files. This strategy is pretty similar to the In-Database technology by SAS’s high-end enterprise products. Besides, XDF would forcibly partition data to allow multiple-core processing. On a common PC with 3G memory, I compared Revolution R and SAS using a 1.1 GB flat file. SAS integrated it in 20 seconds, while Rev R did that in 5 minutes. Afterward SAS realized some data transformation steps and summarization procedures. Unfortunately in those occasions the R core of Rev R usually crashed with the Rev R’s IDE left as a zombie. Presumably the compatibility between the R core and the XDF file system is a concern. In addition, at this stage the RevoScaleR package in Rev R has limited functions or procedures, such as summary statistics, linear regression, logistic regression, cross-tab, which utilized the advantage of the XDF system. Another worry is that the R core contributors, led by Ross Ihaka and Robert Gentleman, may feel reluctant to further assist Norman Nie's effort to develop his profitable product.

For TB size data, R will roll back. Like the physical world with extraordinary speed or space that Newton's theories expire, the complex, high-volume, high-dimensional data would stun any extraction, transformation, loading and analysis operations base on the concept of RDBMS. MySQL, Oracle and SAS all fade away. Distributed system and function languages are the rescue. Open source software, like Python and R, rule in this domain. The data center would aggregate the total memory from thousands of servers. With gigantic memories all together, data analysis by R could be performed by scheduling a Map/Reduce task. However, those well-known IT giants owning those data centers are not likely to buy many licenses for a commercial package. The community R, not Revolution R, will enjoy a significant share from the rapid growth in this area.

Overall SAS is still much better than Revolution R in handling GB size data now. Revolution R may need more time to get matured for production purpose. Revolution R launched a SAS-to-R challenge and deliberately created a function to transform SAS datasets to its XDF format. I like to see newcomers in the arena, and SAS may benefit from this competition.

Reference: 1. 'TIOBE Programming Community Index for February 2011'. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
2. 'Another Open Source Swipe at IBM and SAS'. http://blogs.forbes.com/quentinhardy/2011/02/01/another-open-source-swipe-at-ibm-and-sas/


************(1) DOWNLOAD AND UNZIP THE DATA ***********************;
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2006us.zip

***********(2)SAS'S CODE TO INTEGRATE DATA*****************;
data mort06;
 infile 'C:\mort06.dat' 
    lrecl=150 missover ;
 input    @20 Resident_status     $1. 
    @83 Plac_of_death_decedent_status   $1.  
    @85 Day_of_week _of_death   $1. 
    @94 Date_receipt    $8.  
    @102 Data_Year     $4. 
    @65 Month_of_Death     $2.
    @60 Sex      $1.  
    @445 Race      $6.
    @70 Age      $13.  
    @84 Marital      $1.  
    @484 Hispanic_origin     $5.  
    @61 Education             $4.  
    @146 ICD10 $4.  
    @150 code358    $3.   
    @154 code113 $3.     
    @157 code130 $3.    
    @160 cause_recode   $2.   
    @145 Place_injury   $1.   
    @106 Injury_work     $1.  
    @108 Method_disposition    $1. 
    @109 Autopsy      $1. 
    @144 Activity    $1. 
;
run;

proc contents data=mort06;
run;

############(3) REV R 'S CODE TO INTEGRATE DATA#####################
colList <- list(  
    "Resident_status"        =list(type="factor",  start=20, width=1 ),
     "Plac_of_death_decedent_status"    =list(type="factor", start=83, width=1 ),
     "Day_of_week _of_death"      =list(type="factor", start=85, width=1 ),     
     "Date_receipt"         =list(type="factor", start=94, width=1 ),         
     "Data_Year"         =list(type="factor",  start=102, width=4  ),
     "Month_of_Death"        =list(type="factor", start=65, width=2  ),
     "Sex"           =list(type="factor", start=60, width=1  ),
     "Race"           =list(type="factor", start=445, width=6 ),
     "Age"           =list(type="factor", start=70, width=13  ),
     "Marital"          =list(type="factor", start=84, width=1 ),
     "Hispanic_origin"        =list(type="factor", start=484, width=5  ),
     "Education"         =list(type="factor", start=61, width=4 ),
     "ICD10"          =list(type="factor", start=146, width=4  ),     
     "code358"          =list(type="factor", start=150, width=3  ), 
     "code113"          =list(type="factor", start=154, width=3  ),
     "code130"          =list(type="factor", start=157, width=3  ),
     "cause_recode "        =list(type="factor", start=160, width=2  ),
     "Place_injury"         =list(type="factor", start=145, width=1  ),
     "Injury_work "         =list(type="factor", start=106, width=1  ),
     "Method_disposition"       =list(type="factor", start=108, width=1  ),
     "Autopsy"          =list(type="factor", start=109, width=1  ),
     "Activity"          =list(type="factor", start=144, width=1  )  
    )
mortFile <- file.path("C:", "MORT06.DAT")
sourceData <- RxTextData(mortFile, colInfo=colList )                 
outputData <- RxXdfData("MORT")
rxImportToXdf(sourceData, outputData,  overwrite = TRUE) 
rxGetInfoXdf("MORT.xdf", getVarInfo=TRUE) 

#################END OF CODING#######################################;  

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...