Wednesday, April 30, 2014

Count large chunk of data in Python

The line-by-line feature in Python allows it to count hard disk-bound data. The most frequently used data structures in Python are list and dictionary. Many cases the dictionary has advantages since it is a basically a hash table that many realizes O(1) operations.
However, for the tasks of counting values, the two options make no much difference and we can choose any of them for convenience. I listed two examples below.

Use a dictionary as a counter

There is a question to count the strings in Excel.
Count the unique values in one column in EXCEL 2010. The worksheet has 1 million rows and 10 columns.
or numbers.
For example,
A5389579_10
A1543848_6
A5389579_8
Need to cut off the part after (including) underscore such as from A5389579_10 to A5389579
Commonly Excel on a desktop can’t handle this size of data, while Python would easily handle the job.
# Load the Excel file by the xlrd package
import xlrd
book = xlrd.open_workbook("test.xlsx")
sh = book.sheet_by_index(0)
print sh.name, sh.nrows, sh.ncols
print "Cell D30 is", sh.cell_value(rowx=29, colx=3)

# Count the unique values in a dictionary
c = {} 
for rx in range(sh.nrows):
    word = str(sh.row(rx)[1].value)[:-3]
    try:
        c[word] += 1 
    except:
        c[word] = 1

print c

Use a list as a counter

There is a question to count emails.
A 3-column data set includes sender, receiver and timestamp. How to calculate the time between the sender sends the email
and the receiver sends the reply email?
The challenge is to scale up the small sample data to larger size. The solution I have has the complexity of O(nlogn), which is only limited by the sorting step.
raw_data = """
    SENDER|RECEIVER|TIMESTAMP
    A B 56
    A A 7
    A C 5
    C D 9
    B B 12
    B A 8
    F G 12
    B A 18
    G F 2
    A B 20
    """

# Transform the raw data to a nested list
data = raw_data.split()
data.pop(0) # Remove the Head
data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))

# Sort the nested list by the timestamp 
from operator import itemgetter
data.sort(key=itemgetter(2))
for r in data:
    print r

# Count the time difference in a list
c = []
while len(data) != 1:
    y = data.pop(0)
    for x in data:
        if x[0] == y[1] and x[1] == y[0]:
            diff = x[2] - y[2] 
            print y, x, '---->', diff
            c.append(diff)
            break # Only find the quickest time to respond
print c

P.S.

I come up with the O(n) solution below, which utilizes two hash tables to decrease the complexity.
__author__ = 'dapangmao'

def find_duration(data):
    # Construct two hash tables
    h1 = {}
    h2 = {}
    # Find the starting time for each ID pair
    for x in data:
        if x[0] != x[1]:
            key = x[0] + x[1]
            try:
                h1[key] = x[2]
            except:
                h1[key] = min(h1[key], x[2])
    # Find the minimum duration for each ID pair
    for x in data:
        key = x[1] + x[0]
        if h1.has_key(key):
            duration = x[2] - h1[key]
            try:
                h2[key] = duration
            except:
                h2[key] = min(h2[key], duration)
    return h2

if __name__ == "__main__":
    raw_data = """
        SENDER|RECEIVER|TIMESTAMP
        A B 56
        A A 7
        A C 5
        C D 9
        B B 12
        B A 8
        F G 12
        B A 18
        G F 2
        A B 20
        """

    # Transform the raw data to a nested list
    data = raw_data.split()
    data.pop(0) # Remove the Head
    data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))
    # Verify the result
    print find_duration(data)

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...