Wednesday, September 24, 2014

One example of test-driven development in Python

Function or method is the most basic unit in Python programming. Test-driven development is a key for a developer to assure the code quality of those units. In his book, Harry Percival illustrated a few great examples about how to use TDD with Python and Django. It seems that for web development, TDD including unit testing and integration testing is the cornerstone for every success. For data analysis, coding mostly relies on built-in packages instead large framework like Django, which makes TDD easier. In my opnion, TDD in data analysis could have three steps.
  • Step 1: requirement analysis
    Before writing any code for data analysis, the programmer should seriously ask the customer or himself about the requirements.
    • What are the input parameter?
    • what if the input data doesn’t fit the assumptions?
    • What is the purpose of this funtction or method? what are the desired outputs?
  • For example, there is a recent coding challenge called Maximum Product Subarray.
      > Find the contiguous subarray within an array (containing at least one number) which has the largest product.
      For example, given the array [2,3,-2,4],
      the contiguous subarray [2,3] has the largest product = 6.
    
  • OK, understanding this question is quite straight-forward. Given a array(or a list in Python), you return the integer that is the maximum product from a continuous subarry out of the input array.
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          pass
    
    • A production version of the codes above should be more like:
      class FunctionInputError(Exception):
        pass
      
      def maxProduct(A):
        """ A function to find the maximum product value for a continuous subarray.
        :param A: an array or list
        :type A: list
        """
        if A is None or not isinstance(A, list):
            raise FunctionInputError('must give a list as input')
        if len(A) == 1:
            return A[0]
        pass
      
  • Step 2: write test cases
    Given not a single line of logic codes has been writen yet, I call the current step as black-box testing, which means that I want this funtion to fail every test cases. Python has a built-in module doctest, which allows embedding the test cases within the docstring. I write six test cases, run the psedu-function below and arbitrarily specify the result to be -1. As expected, it fails all the six test cases with horrible red warnings under the Python shell. That is a good thing: it proves that the testing works.
      import doctest
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
    
          - testcase1
              >>> maxProduct([0, 2, 3,-2,4, 1, -1])
              48
    
          - testcase2
              >>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
              12
    
          - testcase3
              >>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
              30
    
          - testcase4
              >>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
              12000
    
          - testcase5
              >>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
              10000
    
          - testcase6
              >>> maxProduct([0, -2, 0])
              0
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          return -1
    
      doctest.testmod()
    
  • Step 3: implement the logic
    It’s time to tackle the most difficult part: write the real function. Think about time complexity (it is best to use only one iteration around the input array which means O(n)), and space complexity (it is best not to use extra space). Run testmod() again and again to find mistakes and modify the codes accordingly. Finally I come with a solution with a helper shadow function _maxProduct. And it passes the six test cases. Althoug I am not sure that this function does not have any bug, at least it works now.
      import doctest
      from sys import maxint
    
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
    
          - testcase1
              >>> maxProduct([0, 2, 3,-2,4, 1, -1])
              48
    
          - testcase2
              >>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
              12
    
          - testcase3
              >>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
              30
    
          - testcase4
              >>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
              12000
    
          - testcase5
              >>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
              10000
    
          - testcase6
              >>> maxProduct([0, -2, 0])
              0
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          return max(_maxProduct(A), _maxProduct([a for a in reversed(A)]))
    
      def _maxProduct(A):
          max_val_forward = 1
          rst = -maxint
          for a in A:
              if a != 0:
                  max_val_forward *= a
                  rst = max(rst, max_val_forward)
              else:
                  rst = max(0, rst)
                  max_val_forward = 1
          return rst
    
      if __name__ == "__main__":
          doctest.testmod()
    
In conclusion, the most important thing about TDD in data analysis is writing test cases, which really needs a lot of training and exercises.

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...