Wednesday, July 29, 2015

Deploy edx spark environment to DigitalOcean

This summer I took the Spark courses CS100 and CS190 at edX, and had a wonderful experience.
The two classes use a Vagrant virtual machine that contains Spark and all teaching materials. There are two challenges with the virtual machine —
  1. The labs usually take a long time to finish, say 8-10 hours. If the host machine is shut down, the RDDs will be lost and the pipeline has to be run again.
  2. Some RDD operations, such as groupByKey and distinct, take a lot of computation/communication power. Many of my 50k classmates complained about the waiting time. And my most used laptop is a Chromebook, which doesn't even have the option to install VirtualBox.
Deploying the learning environment to a cloud is an alternative. DigitalOcean is a good choice because it uses mirrors for most packages, and the network speed is amazingly fast, at almost 100 MB/s (thanks to the SSD infrastructure DigitalOcean implements for the cloud; otherwise the hard disk might not stand such rapid IO; see my deployment records at GitHub).

I found that a Linux box with 1 GB memory and 1 CPU at DigitalOcean, which costs 10 dollars a month, handles most labs fairly easily with IPython and Spark. A droplet with 2 GB memory and 2 CPUs is ideal, since that is the minimal requirement for a simulated cluster. It costs 20 dollars a month, but is still much cheaper than the $100 cost of the big data certificates ($50 for each). I just need to write Python scripts to install IPython Notebook with SSL, and to download Spark and the course materials.
  • The DevOps tool is Fabric and the fabfile is at GitHub.
  • The deployment pipeline is also at GitHub.

Friday, July 17, 2015

Transform SAS files to Parquet through Spark

The demo pipeline is at GitHub.
Since version 1.3, Spark has introduced the new data structure DataFrame. A data analyst can now easily scale out existing DataFrame-based code in Python or R to a cluster hosting Hadoop and Spark.
There are quite a few practical scenarios that DataFrame fits well. For example, many data files, including SAS files that are otherwise hard to read, need to be merged into a single data store. Apache Parquet is a popular column store in a distributed environment, and it is especially friendly to structured or semi-structured data. It is an ideal candidate for a universal data destination.
I copy three SAS files called prdsale, prdsal2 and prdsal3, which contain simulated sales records, from the SASHELP library to a Linux directory. Then I launch the SQL context from Spark 1.4.
The three SAS files have a combined size of 4.2 MB. My overall strategy is to build a pipeline: SAS --> Python --> Spark --> Parquet.
try:
    import os
    import pyspark
    import sas7bdat
    import pandas
except ImportError:
    print('try to install the packages first')

print('Spark version is {}'.format(sc.version))

if type(sqlContext) != pyspark.sql.context.HiveContext:
    print('reset the Spark SQL context')


def print_bytes(filename):
    print('{} has {:,} bytes'.format(filename, os.path.getsize(filename)))

for filename in ['prdsale.sas7bdat', 'prdsal2.sas7bdat', 'prdsal3.sas7bdat']:
    print_bytes(filename)

!du -ch --exclude=test_parquet

Spark version is 1.4.0
prdsale.sas7bdat has 148,480 bytes
prdsal2.sas7bdat has 2,790,400 bytes
prdsal3.sas7bdat has 1,401,856 bytes
4.2M    .
4.2M    total

1. Test DataFrame in Python and Spark

First I transform a SAS sas7bdat file to a pandas DataFrame. The great thing about Spark is that a Python/pandas DataFrame can be translated to a Spark DataFrame by the createDataFrame method. Now I have two DataFrames: one is a pandas DataFrame and the other is a Spark DataFrame.
with sas7bdat.SAS7BDAT('prdsale.sas7bdat') as f:
    pandas_df = f.to_data_frame()

print('-----Data in Pandas dataframe-----')
print(pandas_df.head())

print('-----Data in Spark dataframe-----')
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.show(5)

-----Data in Pandas dataframe-----
0     925  CANADA  EDUCATION  12054      850  FURNITURE    SOFA        1   
1     999  CANADA  EDUCATION  12085      297  FURNITURE    SOFA        1   
2     608  CANADA  EDUCATION  12113      846  FURNITURE    SOFA        1   
3     642  CANADA  EDUCATION  12144      533  FURNITURE    SOFA        2   
4     656  CANADA  EDUCATION  12174      646  FURNITURE    SOFA        2   

0   EAST  1993  
1   EAST  1993  
2   EAST  1993  
3   EAST  1993  
4   EAST  1993  
-----Data in Spark dataframe-----
| 925.0| CANADA|EDUCATION|12054.0|  850.0|FURNITURE|   SOFA|    1.0|  EAST|1993.0|
| 999.0| CANADA|EDUCATION|12085.0|  297.0|FURNITURE|   SOFA|    1.0|  EAST|1993.0|
| 608.0| CANADA|EDUCATION|12113.0|  846.0|FURNITURE|   SOFA|    1.0|  EAST|1993.0|
| 642.0| CANADA|EDUCATION|12144.0|  533.0|FURNITURE|   SOFA|    2.0|  EAST|1993.0|
| 656.0| CANADA|EDUCATION|12174.0|  646.0|FURNITURE|   SOFA|    2.0|  EAST|1993.0|
The two should have identical lengths. Here both show 1,440 rows.


2. Automate the transformation

I write a pipeline function to automate the transformation. As a result, all three SAS files are saved to the same directory in Parquet format.
def sas_to_parquet(filelist, destination):
    """Save SAS files to parquet

    Args:
        filelist (list): the list of sas file names
        destination (str): the path for parquet
    """
    rows = 0
    for i, filename in enumerate(filelist):
        with sas7bdat.SAS7BDAT(filename) as f:
            pandas_df = f.to_data_frame()
            rows += len(pandas_df)
        spark_df = sqlContext.createDataFrame(pandas_df)
        spark_df.save("{0}/key={1}".format(destination, i), "parquet")
    print('{0} rows have been transformed'.format(rows))

sasfiles = [x for x in os.listdir('.') if x.endswith('.sas7bdat')]
print(sasfiles)

sas_to_parquet(sasfiles, '/root/playground/test_parquet')

['prdsale.sas7bdat', 'prdsal2.sas7bdat', 'prdsal3.sas7bdat']
36000 rows have been transformed
Then I read from the newly created Parquet data store. The query shows that the data has been saved successfully.
df = sqlContext.load("/root/playground/test_parquet", "parquet")
df.filter(df.key == 0).show(5)

| 925.0| CANADA|  null|null|12054.0|  850.0|FURNITURE|   SOFA|    1.0| null|1993.0| null|EDUCATION|  EAST|  0|
| 999.0| CANADA|  null|null|12085.0|  297.0|FURNITURE|   SOFA|    1.0| null|1993.0| null|EDUCATION|  EAST|  0|
| 608.0| CANADA|  null|null|12113.0|  846.0|FURNITURE|   SOFA|    1.0| null|1993.0| null|EDUCATION|  EAST|  0|
| 642.0| CANADA|  null|null|12144.0|  533.0|FURNITURE|   SOFA|    2.0| null|1993.0| null|EDUCATION|  EAST|  0|
| 656.0| CANADA|  null|null|12174.0|  646.0|FURNITURE|   SOFA|    2.0| null|1993.0| null|EDUCATION|  EAST|  0|

3. Conclusion

There are multiple advantages to transforming data from various sources to Parquet.
  1. It is an open format that can be read and written by major software.
  2. It distributes well on HDFS.
  3. It compresses data.
For example, the original SAS files add up to 4.2 MB. As Parquet, the same data weighs only 292 KB, a 14X compression ratio.
!du -ahc 

4.0K    ./key=2/._metadata.crc
4.0K    ./key=2/._SUCCESS.crc
0    ./key=2/_SUCCESS
4.0K    ./key=2/_common_metadata
4.0K    ./key=2/.part-r-00001.gz.parquet.crc
4.0K    ./key=2/._common_metadata.crc
4.0K    ./key=2/_metadata
60K    ./key=2/part-r-00001.gz.parquet
88K    ./key=2
4.0K    ./key=0/._metadata.crc
4.0K    ./key=0/._SUCCESS.crc
0    ./key=0/_SUCCESS
4.0K    ./key=0/_common_metadata
4.0K    ./key=0/.part-r-00001.gz.parquet.crc
4.0K    ./key=0/._common_metadata.crc
4.0K    ./key=0/_metadata
12K    ./key=0/part-r-00001.gz.parquet
40K    ./key=0
4.0K    ./key=1/._metadata.crc
4.0K    ./key=1/._SUCCESS.crc
0    ./key=1/_SUCCESS
4.0K    ./key=1/_common_metadata
4.0K    ./key=1/.part-r-00001.gz.parquet.crc
4.0K    ./key=1/._common_metadata.crc
4.0K    ./key=1/_metadata
132K    ./key=1/part-r-00001.gz.parquet
160K    ./key=1
292K    .
292K    total
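The quoted ratio is easy to verify from the `du` numbers above:

```python
# sizes reported by du in the outputs above
sas_kb = 4200       # 4.2M of sas7bdat files
parquet_kb = 292    # 292K of gzip-compressed Parquet

ratio = sas_kb / float(parquet_kb)
print('compression ratio: {:.1f}X'.format(ratio))  # compression ratio: 14.4X
```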
A bar plot visualizes the significant size difference between the two formats. It shows an order-of-magnitude space reduction.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

index = np.arange(2)
bar_width = 0.35
data = [4200, 292]
header = ['SAS files', 'Parquet']
plt.bar(index, data, bar_width)
plt.grid(b=True, which='major', axis='y')
plt.ylabel('File Size by KB')
plt.xticks(index + bar_width / 2, header)

Friday, June 19, 2015

Two alternative ways to query large dataset in SAS

I really appreciate the wonderful comments on my SAS posts by the readers (1, 2, 3). They gave me a lot of inspiration. Due to SAS or SQL's inherent limitations, I have recently found it difficult to deal with some extremely large SAS datasets (meaning that I had exhausted all possible traditional ways). Here I conclude with two alternative solutions for these extreme cases, as a follow-up to the comments.
  1. Read Directly
    • Use a scripting language such as Python to read SAS datasets directly
  2. Code Generator
    • Use SAS or other scripting languages to generate SAS/SQL code
The examples still use sashelp.class, which has 19 rows. The target variable is weight.
data class;
    set sashelp.class;
    row = _n_;
run;

Example 1: Find the median

SQL Query

In the comment, Anders Sköllermo wrote:
Hi! About 1. Calculate the median of a variable:
If you look at the details in the SQL code for calculating the median, you find that the intermediate file has N*N obs, where N is the number of obs in the SAS data set.
So this is OK for very small files. But for a file with 10,000 obs, you have an intermediate file of 100 million obs. / Br Anders Sköllermo, Ph.D., Reuma and Neuro Data Analyst
The SQL query below is simple and pure, so it can be ported to any other SQL platform. However, just as Anders said, it is way too expensive.
proc sql;
   select avg(weight) as Median
   from (select e.weight
       from class e, class d
       group by e.weight
       having sum(case when e.weight = d.weight then 1 else 0 end)
          >= abs(sum(sign(e.weight - d.weight))));
quit;
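To see why the self-join blows up, the same median definition can be translated into plain Python. The function below is a sketch of my own, not code from the original post; its nested comparisons mirror the SQL's O(N^2) logic.

```python
def sql_style_median(values):
    """Mirror the self-join logic: a value is a median candidate when its
    number of ties is at least the imbalance between larger and smaller values."""
    def sign(x):
        return (x > 0) - (x < 0)
    # the inner sums scan all values for every candidate: O(N^2), like the SQL
    candidates = [e for e in values
                  if sum(1 for d in values if e == d)
                  >= abs(sum(sign(e - d) for d in values))]
    return sum(candidates) / float(len(candidates))

print(sql_style_median([1, 2, 3, 4]))  # 2.5
```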


In the comment, Anonymous wrote:
I noticed the same thing - we tried this on one of our 'smaller' datasets (~2.9 million records), and it took forever.
Excellent solution, but maybe PROC UNIVARIATE will get you there faster on a large dataset.
Indeed, PROC UNIVARIATE is the best solution in SAS to find the median, since it utilizes SAS's built-in power.
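PROC UNIVARIATE's result can also be cross-checked outside SAS with a plain sort, which costs only O(N log N). This is a minimal reference implementation of my own, not from the post:

```python
def sorted_median(values):
    # O(N log N): sort once, then pick the middle element(s)
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2:
        return float(s[mid])
    return (s[mid - 1] + s[mid]) / 2.0

print(sorted_median([3, 1, 2]))      # 2.0
print(sorted_median([4, 1, 3, 2]))   # 2.5
```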

Read Directly

In extreme cases, when SAS cannot even open the entire dataset, we may have to use a streaming method to read the sas7bdat file line by line. The sas7bdat format has been decoded in Java, R and Python, so theoretically we don't need SAS itself to query a SAS dataset.
A heap is an interesting data structure that easily finds a min or a max. To stream the values, we can build a max heap and a min heap to cut the incoming stream into halves in Python. The algorithm looks like heap sort. The good news is that it only reads one variable at a time and thus saves a lot of space.
#In Python
import heapq
from sas7bdat import SAS7BDAT

class MedianStream(object):
    def __init__(self):
        self.first_half = [] # a max heap (negated); may hold one extra element
        self.second_half = [] # a min heap
        self.N = 0

    def insert(self, x):
        heapq.heappush(self.first_half, -x)
        self.N += 1
        if len(self.first_half) - len(self.second_half) > 1:
            # rebalance sizes: move the max of the first half up
            to_second = heapq.heappop(self.first_half)
            heapq.heappush(self.second_half, -to_second)
        elif self.second_half and -self.first_half[0] > self.second_half[0]:
            # tops are out of order: swap them between the halves
            to_second, to_first = map(heapq.heappop, [self.first_half, self.second_half])
            heapq.heappush(self.second_half, -to_second)
            heapq.heappush(self.first_half, -to_first)

    def show_median(self):
        if self.N == 0:
            raise IOError('please use the insert method first')
        elif self.N % 2 == 0:
            return (-self.first_half[0] + self.second_half[0]) / 2.0
        return -self.first_half[0]

if __name__ == "__main__":
    stream = MedianStream()
    with SAS7BDAT('class.sas7bdat') as infile:
        for i, line in enumerate(infile):
            if i == 0:
                continue # skip the header row
            stream.insert(line[-1])
    print stream.show_median()
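As a sanity check of the two-heap idea on plain numbers, independent of the SAS file, here is a compact function of my own that reports the running median after every insert (balancing details are mine):

```python
import heapq

def stream_median(values):
    lower, upper = [], []   # max-heap (negated) / min-heap
    medians = []
    for x in values:
        heapq.heappush(lower, -x)
        # move the largest of the lower half up, then rebalance sizes
        heapq.heappush(upper, -heapq.heappop(lower))
        if len(upper) > len(lower):
            heapq.heappush(lower, -heapq.heappop(upper))
        if len(lower) > len(upper):
            medians.append(float(-lower[0]))
        else:
            medians.append((-lower[0] + upper[0]) / 2.0)
    return medians

print(stream_median([5, 2, 8, 1, 9]))  # [5.0, 3.5, 5.0, 3.5, 5.0]
```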


Example 2: Find top K by groups

SQL Query

The query below is very expensive: it has an O(N^2) self-join and an O(N log N) sort, so the total time complexity is a terrible O(N^2 + N log N).
* In SAS;
proc sql; 
    select a.name, a.sex, a.weight, (select count(distinct b.weight) 
            from class as b where b.weight >= a.weight and b.sex = a.sex) as rank 
    from class as a
    where calculated rank <= 3
    order by sex, rank;
quit;

Code Generator

The overall idea is divide and conquer. If we synthesize SAS code from a scripting tool such as Python, we essentially get many small SAS code segments. For example, the SQL code below is just about sorting, so the time complexity is largely decreased to O(N log N).
# In Python
def create_sql(k, candidates):
    template = """
    proc sql outobs = {0};
    select *
    from {1}
    where sex = '{2}' 
    order by weight desc
    """
    for x in candidates:
        current = template.format(k, 'class', x)
        print current

if __name__ == "__main__":
    create_sql(3, ['M', 'F'])

    proc sql outobs = 3;
    select *
    from class
    where sex = 'M' 
    order by weight desc

    proc sql outobs = 3;
    select *
    from class
    where sex = 'F' 
    order by weight desc

Read Directly

This time we use the heap data structure again in Python. To find the top k rows for each group, we just need to maintain a min heap of size k per group. As the smaller values are popped out every time, we finally get the top k values for each group. The optimized time complexity is O(N log k).
#In Python
from sas7bdat import SAS7BDAT
from heapq import heappush, heappop

def get_top(k, sasfile):
    minheaps = [[], []]
    sexes = ['M', 'F']
    with SAS7BDAT(sasfile) as infile:
        for i, row in enumerate(infile):
            if i == 0:
                continue # skip the header row
            sex, weight = row[1], row[-1]
            i = sexes.index(sex)
            current = minheaps[i]
            heappush(current, (weight, row))
            if len(current) > k:
                heappop(current) # drop the smallest seen so far
    for x in minheaps:
        for _, y in x:
            print y

if __name__ == "__main__":
    get_top(3, 'class.sas7bdat')

[u'Robert', u'M', 12.0, 64.8, 128.0]
[u'Ronald', u'M', 15.0, 67.0, 133.0]
[u'Philip', u'M', 16.0, 72.0, 150.0]
[u'Carol', u'F', 14.0, 62.8, 102.5]
[u'Mary', u'F', 15.0, 66.5, 112.0]
[u'Janet', u'F', 15.0, 62.5, 112.5]
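The bounded-heap bookkeeping can also be delegated to the standard library: `heapq.nlargest` keeps an internal heap of size k and returns the results in descending order. A sketch on a small in-memory sample — the rows are hand-made to mirror the `[name, sex, age, height, weight]` layout above:

```python
import heapq
from collections import defaultdict

# hand-made sample rows in the [name, sex, age, height, weight] layout
rows = [
    ['Thomas', 'M', 11.0, 57.5,  85.0],
    ['Robert', 'M', 12.0, 64.8, 128.0],
    ['Ronald', 'M', 15.0, 67.0, 133.0],
    ['Philip', 'M', 16.0, 72.0, 150.0],
    ['Jane',   'F', 12.0, 59.8,  84.5],
    ['Carol',  'F', 14.0, 62.8, 102.5],
    ['Mary',   'F', 15.0, 66.5, 112.0],
    ['Janet',  'F', 15.0, 62.5, 112.5],
]

groups = defaultdict(list)
for row in rows:
    groups[row[1]].append(row)

for sex in ['M', 'F']:
    # nlargest keeps only a k-sized heap internally: O(N log k)
    for row in heapq.nlargest(3, groups[sex], key=lambda r: r[-1]):
        print(row)  # Philip, Ronald, Robert, then Janet, Mary, Carol
```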

Example 3: Find Moving Window Maximum

In daily work, I often want three statistics over a moving window: mean, max, and min. The sheer data size poses challenges.
In his blog post, Liang Xie showed three advanced approaches to calculating moving averages: PROC EXPAND, the DATA STEP, and PROC SQL. Apparently PROC EXPAND is the winner throughout the comparison. In conclusion, self-joining is very expensive, always O(N^2), and we should avoid it as much as possible.
Finding the max or the min is somewhat different from finding the mean: for the mean only the running mean needs to be memorized, while for the max/min the locations of past max/min values must also be memorized.
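For the moving mean, by contrast, a single pass with prefix sums is enough; each window mean is a difference of two partial sums. A minimal O(N) sketch of my own:

```python
def moving_average(values, w):
    # prefix sums: sums[i] is the total of the first i values
    sums = [0.0]
    for v in values:
        sums.append(sums[-1] + v)
    # each window mean is (sums[i + w] - sums[i]) / w
    return [(sums[i + w] - sums[i]) / w for i in range(len(values) - w + 1)]

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```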

Code Generator

The strategy is very straightforward: we choose k (here three) consecutive rows from the table at a time and compute their maximum. The time complexity is O(k*N). The generated SAS code is very lengthy, but the machine should feel comfortable reading it.
In addition, if we want to save the results, we could insert the maximums into an empty table.
# In Python
def create_sql(k, N):
    template = """
    select max(weight)
    from class
    where row in ({0})
    """
    SQL = ""
    for x in range(1, N - k + 2):
        current = map(str, range(x, x + k))
        SQL += template.format(','.join(current))
    print "proc sql;" + SQL + "quit;"

if __name__ == "__main__":
    create_sql(3, 19)

proc sql;
    select max(weight)
    from class
    where row in (1,2,3)
    select max(weight)
    from class
    where row in (2,3,4)
    select max(weight)
    from class
    where row in (3,4,5)
    select max(weight)
    from class
    where row in (4,5,6)
    select max(weight)
    from class
    where row in (5,6,7)
    select max(weight)
    from class
    where row in (6,7,8)
    select max(weight)
    from class
    where row in (7,8,9)
    select max(weight)
    from class
    where row in (8,9,10)
    select max(weight)
    from class
    where row in (9,10,11)
    select max(weight)
    from class
    where row in (10,11,12)
    select max(weight)
    from class
    where row in (11,12,13)
    select max(weight)
    from class
    where row in (12,13,14)
    select max(weight)
    from class
    where row in (13,14,15)
    select max(weight)
    from class
    where row in (14,15,16)
    select max(weight)
    from class
    where row in (15,16,17)
    select max(weight)
    from class
    where row in (16,17,18)
    select max(weight)
    from class
    where row in (17,18,19)

Read Directly

Again, to further decrease the time complexity to O(N), we have to use a better data structure: a queue. SAS doesn't have a queue, so we switch to Python. The function below actually has two loops, which add up to O(2N). Still, it is better than the other methods.
# In Python
from sas7bdat import SAS7BDAT
from collections import deque

def maxSlidingWindow(A, w):
    N = len(A)
    ans = [0] * (N - w + 1)
    myqueue = deque() # holds indexes; their values decrease from front to back
    for i in range(w):
        while myqueue and A[i] >= A[myqueue[-1]]:
            myqueue.pop()
        myqueue.append(i)
    for i in range(w, N):
        ans[i - w] = A[myqueue[0]]
        while myqueue and A[i] >= A[myqueue[-1]]:
            myqueue.pop()
        while myqueue and myqueue[0] <= i - w:
            myqueue.popleft()
        myqueue.append(i)
    ans[-1] = A[myqueue[0]]
    return ans

if __name__ == "__main__":
    weights = []
    with SAS7BDAT('class.sas7bdat') as infile:
        for i, row in enumerate(infile):
            if i == 0:
                continue # skip the header row
            weights.append(row[-1])

    print maxSlidingWindow(weights, 3)

[112.5, 102.5, 102.5, 102.5, 102.5, 112.5, 112.5, 112.5, 99.5, 99.5, 90.0, 112.0, 150.0, 150.0, 150.0, 133.0, 133.0]


While data keeps expanding, we should increasingly consider three things —
  • Time complexity: we don't want to run a job for weeks.
  • Space complexity: we don't want the memory to overflow.
  • Clean code: colleagues should be able to easily read and modify the code.

Wednesday, June 3, 2015

saslib: a simple Python tool to lookup SAS metadata

saslib is an HTML report generator that looks up metadata (header information), like PROC CONTENTS in SAS.
  • It reads sas7bdat files directly and quickly, and does not need SAS installed.
  • It emulates PROC CONTENTS with jQuery and DataTables.
  • It extracts the metadata from all sas7bdat files under a specified directory.
  • It supports IE (>= 10), Firefox, Chrome and any other modern browser.


pip install saslib
saslib requires sas7bdat and Jinja2.


The module is very simple to use. For example, the SAS data sets under the SASHELP library can be viewed as follows —
from saslib import PROCcontents

sasdata = PROCcontents('c:/Program Files/SASHome/SASFoundation/9.3/core/sashelp')

The resulting HTML file from the code above is shown here.

Friday, March 20, 2015

Deploy a minimal Spark cluster


Since Spark is rapidly evolving, I need to deploy and maintain a minimal Spark cluster for the purpose of testing and prototyping. A public cloud is the best fit for my current demand.
  1. Intranet speed
    The cluster should easily copy the data from one server to another. MapReduce always shuffles a large chunk of data throughout the HDFS. It’s best that the hard disk is SSD.
  2. Elasticity and scalability
    Before scaling the cluster out to more machines, the cloud should have some elasticity to size up or size down.
  3. Locality of Hadoop
    Most importantly, the Hadoop cluster and the Spark cluster should have one-to-one mapping relationship like below. The computation and the storage should always be on the same machines.
Hadoop      Cluster Manager   Spark      MapReduce
Name Node   Master            Driver     Job Tracker
Data Node   Slave             Executor   Task Tracker

Choice of public cloud:

I simply compare two cloud service providers: AWS and DigitalOcean. Both have nice Python-based monitoring tools (Boto for AWS and python-digitalocean for DigitalOcean).
  1. From storage to computation
    AWS's S3 is a great storage option to keep data and load it into a Spark/EC2 cluster. The Spark cluster on EC2 can even read an S3 bucket directly, such as s3n://file (the speed is still acceptable). On DigitalOcean, I have to upload data from local to the cluster's HDFS.
  2. DevOps tools:
      • AWS: Spark ships with the spark-ec2 script. With the default settings, after running it you will get
        • 2 HDFSs: one persistent and one ephemeral
        • Spark 1.3 or any earlier version
        • Spark's stand-alone cluster manager
      • A minimal cluster with 1 master and 3 slaves will consist of 4 m1.xlarge EC2 instances
        • Pros: large memory, with each node having 15 GB
        • Cons: not SSD; expensive (cost $0.35 * 4 = $1.40 per hour)
      • DigitalOcean: with the default setting, after running the deployment you will get
        • HDFS
        • no Spark
        • Mesos
        • OpenVPN
      • A minimal cluster with 1 master and 3 slaves will consist of 4 2GB/2CPU droplets
        • Pros: as low as $0.12 per hour; Mesos provides fine-grained control over the cluster (down to 0.1 CPU and 16 MB memory); nice to have VPN to guarantee security
        • Cons: small memory (each node has only 2 GB); Spark must be installed manually

Add Spark to DigitalOcean cluster

Tom Faulhaber has a quick bash script for deployment. To install Spark 1.3.0, I turn it into a fabfile for Python's Fabric.
Then the whole deployment onto DigitalOcean is just one command line.
# <master-ip> is the internal IP address of the master
fab -H <master-ip> deploy_spark
The source code above is available at my GitHub.
