Tuesday, June 25, 2013

Use doParallel for rolling regression in R

Rolling regression on a large data set costs lots of resources. There are a few strategies to speed up the process. For example, in R there is the rollapply function in the zoo package; in SAS, PROC FCMP is one of the options for optimization.
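For comparison, here is a minimal sketch of a rollapply-based rolling regression on a small simulated series (the variable names and sizes here are illustrative, not the ones used in the benchmark below):

```r
# Rolling regression via zoo::rollapply, for comparison with foreach below
library(zoo)

set.seed(42)
x <- rnorm(100)
y <- 0.01 + 0.4 * x + rnorm(100, 0, 0.1)

win <- 20
# Roll over the index vector and fit lm() on each 20-observation window;
# each fit returns the intercept and slope, so the result is a matrix
coefs <- rollapply(seq_along(y), width = win,
                   FUN = function(idx) coefficients(lm(y[idx] ~ x[idx])))
dim(coefs)  # 81 windows x 2 coefficients
```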
Revolution Analytics' doParallel package harnesses the foreach package to enable parallel processing. The recent update, version 1.0.3, seems to have fixed a bug on Windows systems. I was interested in whether the doParallel package would allow any efficiency boost for rolling regression on my old Windows computer, which has two cores (Intel Core 2 Duo, 3.06 GHz).
I first created random vectors of length 20,000 for x and y, and set the rolling window size to 20. I tested three scenarios: sequential processing, parallel processing with 2 registered cores, and parallel processing with 4 registered cores. The parameters estimated by the rolling regressions are shown in the plots below, and the coefficient matrices produced by the three methods are identical.
It turned out that the elapsed time improved significantly under the parallel mode. Since this computer actually has no more than 2 cores, registering 4 workers with registerDoParallel(cores=4) performed essentially the same as the cores=2 mode: the extra workers simply share the two physical cores. Since most computers today have multiple cores, the doParallel package has a lot of potential in statistics.
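Rather than hard-coding the worker count, the number of available cores can be queried with detectCores() from base R's parallel package. A minimal sketch (the registerDoParallel call is left commented so the snippet needs only base R):

```r
# detectCores() ships with base R's parallel package
library(parallel)

nCores <- detectCores()
nCores  # e.g. 2 on an Intel Core 2 Duo

# With doParallel loaded, the workers would then be registered as:
# registerDoParallel(cores = nCores)
```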
####(1) Import libraries and set up directory#################
library(doParallel)
library(ggplot2)
setwd('C:/Google Drive/r')
getwd()

####(2) Simulate the data set ################################
# Set a few parameters for regressions
n <- 20000
beta0 <- 0.01; beta1 <- 0.4
mse <- 0.9   # noise standard deviation (scaled down by 100 below)
set.seed(1234)

# Generate x and y variables
x <- rnorm(n)/100
y <- beta0 + beta1*x + rnorm(n, 0, mse)/100

####(3) Run the tests #######################################
# Set the rolling window size to be 20
windowSize <- 20

# Run sequentially
time1 <- system.time(
  z1 <- foreach(i=seq(1, (n-windowSize+1), 1), .combine=rbind) %do% {
    idx <- i:(i+windowSize-1)
    coefficients(lm(y[idx]~x[idx]))
  }
)

# Run with 2 cores registered
registerDoParallel(cores=2)
time2 <- system.time( 
  z2 <- foreach(i=seq(1, (n-windowSize+1), 1), .combine=rbind) %dopar% {
    idx <- i:(i+windowSize-1)
    coefficients(lm(y[idx]~x[idx]))
  }
)

# Run with 4 cores registered
registerDoParallel(cores=4)
time3 <- system.time( 
  z3 <- foreach(i=seq(1, (n-windowSize+1), 1), .combine=rbind) %dopar% {
    idx <- i:(i+windowSize-1)
    coefficients(lm(y[idx]~x[idx]))
  }
)
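Because each window's least-squares fit is deterministic, %do% and %dopar% should return numerically identical coefficient matrices. A small self-contained check of that claim (a shorter series than the benchmark, for speed):

```r
# Verify that sequential and parallel rolling regressions agree
library(doParallel)
registerDoParallel(cores = 2)

set.seed(1234)
n <- 200; win <- 20
x <- rnorm(n)/100
y <- 0.01 + 0.4*x + rnorm(n, 0, 0.9)/100

# Same rolling fit, run sequentially ...
zSeq <- foreach(i = 1:(n - win + 1), .combine = rbind) %do% {
  idx <- i:(i + win - 1)
  coefficients(lm(y[idx] ~ x[idx]))
}
# ... and in parallel
zPar <- foreach(i = 1:(n - win + 1), .combine = rbind) %dopar% {
  idx <- i:(i + win - 1)
  coefficients(lm(y[idx] ~ x[idx]))
}

all.equal(zSeq, zPar)  # TRUE
```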

####(4) Display the results #################################
# Show the parameters from rolling regression
z <- data.frame(cbind(z1, 1:nrow(z1)))
names(z) <- c('beta0', 'beta1', 'exp')
ggplot(data=z, aes(x=exp, y=beta1)) + geom_line(alpha=0.5) + xlab("Number of regressions") +
  geom_hline(yintercept=0.4, color='red3') + theme(legend.position="none")
ggplot(data=z, aes(x=exp, y=beta0)) + geom_line(alpha=0.5) + xlab("Number of regressions") +
  geom_hline(yintercept=0.01, color='red3') + theme(legend.position="none")

# Show the time elapsed
time <- data.frame(rbind(time1, time2, time3))
row.names(time) <- c('No parallel', '2 cores', '4 cores')
ggplot(data=time, aes(x=row.names(time), y=elapsed)) +
  geom_bar(aes(fill=row.names(time)), stat="identity") +
  ylab("Total time elapsed in seconds") + xlab("") + theme(legend.position="none")
