Monday, October 20, 2014

Automated testing by pytest

The most hard part in testing is to write test cases, which is time-consuming and error-prone. Fortunately, besides Python built-in modules such as doctest, unittest, there are quite a few third-party packages that could help with automated testing. My favorite one is pytest, which enjoys proven record and syntax sugar.

Step 1: test-driven development

For example, there is a coding challenge on Leetcode:
Find Minimum in Rotated Sorted Array
Suppose a sorted array is rotated at some pivot unknown to you beforehand.
(i.e., 0 1 2 4 5 6 7 might become 4 5 6 7 0 1 2).
Find the minimum element.
You may assume no duplicate exists in the array.
The straightforward way to find a minimal element in an array(or list in Python) is sequential searching, which goes through every element and has a time complexity of O(N). If the array is sorted, then the minimal one is the first element that only costs O(1).
However, this question provides a rotated sorted array, which suggests a binary search and reduces the complexity from O(N) to O(logN).
As usual, write the test cases first. The great thing for pytest is that it significantly simplies the effort to code the test cases: in this example, I only use 3 lines to generate 101 test cases to cover all conditions from 0 to 99 and also include an null test.
Next step is to code the function. It is easy to transplant the iterative approach of binary search to this question. If the pointer is between a sorted segment, then return the most left element as minimal. Otherwise, adjust the right boundary and the left boundary.
# test1.py
import pytest

# Prepare 101 test cases
array = list(range(100))
_testdata = [[array[i: ] + array[ :i], 0] for i in range(100)]
_testdata += [pytest.mark.empty(([], None))]

# Code the initial binary search function
def findMinPrev(num):
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid - 1
        else:
            lo = mid + 1

@pytest.mark.parametrize('input, expected', _testdata)
def test_findMinPrev(input, expected):
    assert findMinPrev(input) == expected
After running the py.test -v test1.py command, part of the results shows below. 65 tests passed and 36 failed; the failed cases return the much bigger values that suggests out of boundary during loops, and the selection of the boudaries may be too aggresive.
test1.py:20: AssertionError
_________________________ test_findMinPrev[input98-0] _________________________

input = [98, 99, 0, 1, 2, 3, ...], expected = 0

    @pytest.mark.parametrize('input, expected', _testdata)
    def test_findMinPrev(input, expected):
>       assert findMinPrev(input) == expected
E       assert 98 == 0
E        +  where 98 = findMinPrev([98, 99, 0, 1, 2, 3, ...])

test1.py:20: AssertionError
==================== 36 failed, 65 passed in 0.72 seconds =====================
Now I adjust the right boundary slightly and finally come up with a solution that passes all the tests.
def findMin(num):
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1

Step 2: performance profiling

Besides the right solution, I am also interested in if the binary search method has indeed improved the performance. This step I choose line_profiler given its line-by-line ability of profiling. I take the most basic one (the sequential search) as benchmark, and also include the method that applies the min function since a few functions similar to it in Pyhton implement vectorizaiton to speed up. The test case is a rotated sorted array with 10 million elements.
# test2.py
from line_profiler import LineProfiler
from sys import maxint

@profile
def findMinRaw(num):
    """Sequential searching"""
    if not num:
        return 
    min_val = maxint
    for x in num:
        if x < min_val:
            min_val = x
    return min_val

@profile
def findMinLst(num):
    """Searching by list comprehension"""
    if not num:
        return
    return min(num)

@profile
def findMin(num):
    """"Binary search"""
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1

# Prepare a rotated array
array = list(range(10000000))
_testdata = array[56780: ] + array[ :56780]
# Test the three functions
findMinRaw(_testdata)
findMinLst(_testdata)
findMin(_testdata)
After running kernprof -l -v test2.py, I have the output as below. The sequential search has hit the loops 10000001 times and costs almost 14 seconds. The min function encapsulate all details inside and uses 0.5 seconds which is 28 times faster. On the contrary, the binary search method only takes 20 loops to find the minimal value and spends just 0.0001 seconds. As a result, while dealing with large number, an improved algorithm can really save time.
Total time: 13.8512 s
File: test2.py
Function: findMinRaw at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def findMinRaw(num):
     6         1           13     13.0      0.0      if not num:
     7                                                   return
     8         1            3      3.0      0.0      min_val = maxint
     9  10000001     16031900      1.6     47.5      for x in num:
    10  10000000     17707821      1.8     52.5          if x < min_val:
    11         2            5      2.5      0.0              min_val = x
    12         1            3      3.0      0.0      return min_val

Total time: 0.510298 s
File: test2.py
Function: findMinLst at line 15

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           @profile
    16                                           def findMinLst(num):
    17         1            4      4.0      0.0      if not num:
    18                                                   return
    19         1      1243016 1243016.0    100.0      return min(num)

Total time: 0.000101812 s
File: test2.py
Function: findMin at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def findMin(num):
    24         1           15     15.0      6.0      lo, hi = 0, len(num) - 1
    25        20           40      2.0     16.1      while lo <= hi:
    26        20           48      2.4     19.4          if num[lo] <= num[hi]:
    27         1            2      2.0      0.8              return num[lo]
    28        19           54      2.8     21.8          mid = (lo + hi) / 2
    29        19           50      2.6     20.2          if num[mid] < num[hi]:
    30         5           10      2.0      4.0              hi = mid
    31                                                   else:
    32        14           29      2.1     11.7              lo = mid + 1

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...