Monday, May 26, 2014

Steps to deploy Flask's minitwit on Google App Engine

Flask is a lightweight web framework for Python that is well documented and clearly written. Its GitHub repository provides a few examples, including minitwit. The minitwit website offers a few basic features of a social network, such as following and login/logout. The demo site on GAE is http://minitwit-123.appspot.com. The GitHub repo is https://github.com/dapangmao/minitwit.
Google App Engine, or GAE, is a major public cloud service besides Amazon EC2. Among the four languages (Java/Python/Go/PHP) it supports, GAE is friendly to Python users, possibly because Guido van Rossum worked there and personally created the Python Datastore interface. For me, it is a good choice for a Flask app.

Step1: download GAE SDK and GAE Flask skeleton

GAE’s Python SDK tests the staging app and eventually pushes the app to the cloud.
A Flask skeleton can be downloaded from the Google Developer Console. It contains three files:
  • app.yaml: specifies the entry point of the runtime
  • appengine_config.py: adds external libraries such as Flask to the system path
  • main.py: the root Python program
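As a rough illustration of the second file, here is a minimal sketch of a path helper. The function name add_vendor_dir and the folder name lib are assumptions made for this example, not taken from the skeleton.

```python
import os
import sys


def add_vendor_dir(base_dir, folder='lib', path=None):
    """Put base_dir/folder at the front of the import path, once.

    'lib' is a hypothetical folder name that would hold third-party
    packages such as Flask.
    """
    path = sys.path if path is None else path
    vendor_dir = os.path.join(base_dir, folder)
    if vendor_dir not in path:
        path.insert(0, vendor_dir)
    return vendor_dir


# Inside appengine_config.py the call would typically be:
#   add_vendor_dir(os.path.dirname(os.path.abspath(__file__)))
```

Inserting at the front of the path means the bundled copy of a package is found before any other copy with the same name.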

Step2: schema design

The database used for the original minitwit is SQLite. The schema consists of three tables: user, follower and message, which together make a normalized database. GAE has two Datastore APIs: DB and NDB. Since neither of them supports joins (in this case a one-to-many join from user to follower), I move the follower table into the user table as a repeated property, which eliminates the need for joining.
As a result, main.py has two data models: User and Message. They create and maintain two kinds (the Datastore counterpart of tables) with the same names in Datastore.
from google.appengine.ext import ndb

class User(ndb.Model):
  username = ndb.StringProperty(required=True)
  email = ndb.StringProperty(required=True)
  pw_hash = ndb.StringProperty(required=True)
  following = ndb.IntegerProperty(repeated=True)
  start_date = ndb.DateTimeProperty(auto_now_add=True)

class Message(ndb.Model):
  author = ndb.IntegerProperty(required=True)
  text = ndb.TextProperty(required=True)
  pub_date = ndb.DateTimeProperty(auto_now_add=True)
  email = ndb.StringProperty(required=True)
  username = ndb.StringProperty(required=True)
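To see why the denormalized layout needs no join, here is a small sketch that uses plain Python dictionaries as a stand-in for the two kinds. It does not touch Datastore or the NDB API; the sample users, the messages and the timeline function are all made up for illustration.

```python
# Plain-Python stand-in for the two kinds; no Datastore involved.
users = {
    1: {'username': 'alice', 'following': [2]},
    2: {'username': 'bob', 'following': []},
    3: {'username': 'carol', 'following': []},
}
messages = [
    {'author': 1, 'username': 'alice', 'text': 'hello', 'pub_date': 1},
    {'author': 2, 'username': 'bob', 'text': 'hi alice', 'pub_date': 2},
    {'author': 3, 'username': 'carol', 'text': 'anyone here?', 'pub_date': 3},
]


def timeline(user_id, limit=30):
    """Messages by the user and everyone they follow, newest first."""
    # The follow list lives on the user record, so no follower table is joined.
    authors = set(users[user_id]['following']) | {user_id}
    visible = [m for m in messages if m['author'] in authors]
    # username is stored on each message, so rendering needs no user lookup.
    return sorted(visible, key=lambda m: m['pub_date'], reverse=True)[:limit]


for m in timeline(1):
    print('%s: %s' % (m['username'], m['text']))
```

The price of this layout is duplication: username and email are copied onto every message, so a profile change would have to be written to each of that user's messages.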

Step3: replace SQL statements

The next step is to replace the SQL operations in each of the routing functions with NDB’s methods. NDB’s two fundamental operations are get, which retrieves an entity from Datastore as a model instance, and put(), which writes that instance back to Datastore as a single record. In short, data is created and manipulated as individual objects.
For example, when a user follows someone, I first retrieve the user by ID, which returns an entity with the properties username, email, pw_hash, following and start_date, where following is itself a list. Then I append the new ID to following and save the entity back.
u = User.get_by_id(cid)
# A repeated NDB property defaults to an empty list, so no None check is needed
u.following.append(whom_id)
u.put()
People with experience in an ORM such as SQLAlchemy will find these changes comfortable to implement.
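The read, modify and write cycle can be tried without Datastore. The sketch below uses a hypothetical in-memory class, FakeUser, that only imitates the get_by_id() and put() calls; it is not the NDB API. The follow and unfollow helpers, along with the duplicate and missing-user checks, are additions for illustration.

```python
class FakeUser(object):
    """Hypothetical in-memory stand-in for an NDB model; not the NDB API."""
    _store = {}

    def __init__(self, uid, following=None):
        self.uid = uid
        self.following = list(following or [])

    @classmethod
    def get_by_id(cls, uid):
        saved = cls._store.get(uid)
        # Return a copy, so changes are lost unless put() is called.
        return None if saved is None else cls(uid, saved)

    def put(self):
        FakeUser._store[self.uid] = list(self.following)


def follow(cid, whom_id):
    u = FakeUser.get_by_id(cid)
    if u is None:
        return False  # unknown user
    if whom_id not in u.following:
        u.following.append(whom_id)
        u.put()
    return True


def unfollow(cid, whom_id):
    u = FakeUser.get_by_id(cid)
    if u is None:
        return False  # unknown user
    if whom_id in u.following:
        u.following.remove(whom_id)
        u.put()
    return True


FakeUser(1).put()
follow(1, 2)
follow(1, 2)  # the second call changes nothing
print(FakeUser.get_by_id(1).following)
unfollow(1, 2)
print(FakeUser.get_by_id(1).following)
```

Nothing is saved until put() is called, which is the same discipline the routing functions have to follow once the SQL statements are gone.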

Step4: testing and deployment

Without the schema file, minitwit is now a true single-file web app. It’s time to use the GAE SDK to test it locally and eventually push it to the cloud. On GAE, we can check errors and warnings through the Logs tab to find bugs, or view the raw data through the Datastore Viewer tab.
In conclusion, GAE has a few advantages and disadvantages for hosting a Flask web app.
  • Pro:
    • It allows up to 25 free apps (great for exercises)
    • Use of the database is free
    • Automatic caching in Memcache for high I/O
  • Con:
    • Database is NoSQL, which makes data hard to port
    • More expensive for production than EC2

Wednesday, May 21, 2014

Use recursion and gradient ascent to solve logistic regression in Python

In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to implement a recursive alternative. Compared with the iterative method, the recursion costs more space (one stack frame per cycle) but may improve performance.
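For reference, the recursion can be written as the recurrence below. The symbols are introduced here for illustration: X is the data matrix with a leading column of ones, y the vector of class labels, \alpha the step size, w_k the coefficients after k cycles, and \sigma the sigmoid function.

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}} \\
w_0 &= \mathbf{1} \\
w_k &= w_{k-1} + \alpha \, X^{\top} \bigl( y - \sigma(X w_{k-1}) \bigr), \qquad k = 1, \dots, K
\end{aligned}
```

The base case w_0 corresponds to cycle == 0, and each recursive call computes w_k from w_{k-1}.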
# -*- coding: utf-8 -*-
"""
Use recursion and gradient ascent to solve logistic regression in Python
"""

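import pandas as pd
from ggplot import *
# The numeric helpers used below come from NumPy
from numpy import arange, array, exp, mat, ones, shape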

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

def grad_ascent(dataMatrix, labelMat, cycle):
    """
    A function to use gradient ascent to calculate the coefficients
    """
    if not isinstance(cycle, int) or cycle < 0:
        raise ValueError("Must be a valid value for the number of iterations")
    m, n = shape(dataMatrix)
    alpha = 0.001
    if cycle == 0:
        return ones((n, 1))
    else:
        weights = grad_ascent(dataMatrix, labelMat, cycle-1)
        h = sigmoid(dataMatrix * weights)
        errors = (labelMat - h)
        return weights + alpha * dataMatrix.transpose()* errors

def plot(vector):
    """
    A function to use ggplot to visualize the result
    """
    x = arange(-3, 3, 0.1)
    y = (-vector[0]-vector[1]*x) / vector[2]
    new = pd.DataFrame()
    new['x'] = x
    new['y'] = array(y).flatten()
    infile.classlab = infile.classlab.astype(str)
    p = ggplot(aes(x='x', y='y', colour='classlab'), data=infile) + geom_point()
    # Assumption: the installed ggplot version accepts a per-layer data argument
    return p + geom_line(data=new)

# Use pandas to manipulate data
if __name__ == '__main__':
    infile = pd.read_csv("https://raw.githubusercontent.com/pbharrin/machinelearninginaction/master/Ch05/testSet.txt", sep='\t', header=None, names=['x', 'y', 'classlab'])
    infile['one'] = 1
    mat1 = mat(infile[['one', 'x', 'y']])
    mat2 = mat(infile['classlab']).transpose()
    result1 = grad_ascent(mat1, mat2, 500)
    print(plot(result1))
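For comparison, here is a minimal sketch of the same update written as a loop. It uses plain Python lists instead of NumPy matrices so that it runs on its own. The function name grad_ascent_iter and the four-row data set are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_ascent_iter(rows, labels, cycles, alpha=0.001):
    """Same update as the recursive version, written as a loop.

    rows is a list of feature lists, labels a list of 0/1 values.
    """
    if not isinstance(cycles, int) or cycles < 0:
        raise ValueError("Must be a valid value for the number of iterations")
    weights = [1.0] * len(rows[0])
    for _ in range(cycles):
        # errors = labels - sigmoid(X * weights), computed row by row
        errors = [y - sigmoid(sum(w * x for w, x in zip(weights, row)))
                  for row, y in zip(rows, labels)]
        # weights = weights + alpha * X^T * errors
        weights = [w + alpha * sum(e * row[j] for e, row in zip(errors, rows))
                   for j, w in enumerate(weights)]
    return weights


# Tiny made-up data set: an intercept column of ones plus one feature.
rows = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
print(grad_ascent_iter(rows, labels, 500))
```

The loop keeps a single weight vector in memory and does not depend on the interpreter's recursion limit (1000 frames by default in CPython), so the number of cycles can grow freely.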

Good math, bad engineering

As a former statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...