Friday, January 3, 2014

Test drive for PROC HADOOP and Pig

PROC HADOOP has been available since SAS 9.3M2 and bridges a Windows client and a Hadoop server. The great thing about this procedure is that it supports user-defined functions. Applying it takes several steps.
  1. Download Java SE and Eclipse on Windows
    Java SE and Eclipse are free to download. Installation is also fairly easy.
  2. Write a user-defined function on Windows
    The most basic user-defined function is an upper-case function for a string that wraps Java’s native str.toUpperCase() method. Pig’s manual has a [detailed description][1] of it.
  3. Package the function as a JAR
    There is a wonderful video tutorial on YouTube. Make sure that the version of the [Pig API][2] JAR on Windows, with a name such as pig-0.12.0.jar, is the same as the one running on the Hadoop cluster.
  4. Run PROC HADOOP commands
    -- pig_code.txt
    A = load 'test3.txt' as (f1: chararray, f2: chararray, f3: chararray, f4: chararray, f5: chararray);
    describe A;
    register myudfs.jar;
    B = foreach A generate myudfs.UPPER(f3);
    dump B;
    Then we can run the SAS code with PROC HADOOP. As a result, the field f3 of the text file on HDFS is converted to upper case.
    filename cfg "C:\tmp\config.xml";
    filename code "C:\tmp\pig_code.txt";
    proc hadoop options=cfg username="myname" password="mypwd" verbose;
    pig code=code registerjar="C:\tmp\myudfs.jar";
    run;
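
For reference, here is a minimal sketch of the logic that the upper-case function in step 2 wraps. It is not the author's actual source. The class name UpperLogic is hypothetical, and the Pig wrapper is left out on purpose so that the snippet compiles without the Pig JAR on the classpath. In the real UDF, as described in Pig's manual, the class would live in the package myudfs, extend org.apache.pig.EvalFunc<String>, and apply this logic to input.get(0) inside exec(Tuple input).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the body of myudfs.UPPER.
// Assumption: the real UDF returns null for a missing field rather than failing.
final class UpperLogic {
    private UpperLogic() {}

    /** Returns the upper-cased field, or null when the field is missing. */
    static String toUpper(Object field) {
        if (field == null) {
            return null;
        }
        // Same call the UDF wraps: Java's native String.toUpperCase().
        return field.toString().toUpperCase();
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for column f3 of test3.txt.
        List<String> f3Values = Arrays.asList("alpha", "Beta", null);
        for (String value : f3Values) {
            System.out.println(toUpper(value));
        }
    }
}
```

Returning null for a null field lets the Pig job keep running on incomplete rows instead of stopping with an exception.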

Good math, bad engineering

As a former statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...