Monday, December 12, 2011

The most productive R developers



As of 2011-12-12, CRAN hosts 3,483 R packages. The growth in the number of R packages over the past years is fitted very closely by a quadratic regression.

I have always been curious about who maintains these packages, so I wrote an R script that extracts the header information of each package from CRAN’s website and stores it in a SQLite database. Most R developers maintain 1-3 R packages, but some of them are remarkably productive. Ranked by correspondence address (email), so that a developer who uses more than one address appears more than once, the top 50 R developers are listed below:

developer package
1 Kurt Hornik 23
2 Martin Maechler 23
3 Hadley Wickham 21
4 Rmetrics Core Team 19
5 Achim Zeileis 17
6 Henrik Bengtsson 17
7 Paul Gilbert 17
8 Brian Ripley 14
9 Roger D. Peng 13
10 Torsten Hothorn 13
11 Karline Soetaert 12
12 Philippe Grosjean 12
13 Robin K. S. Hankin 12
14 Charles J. Geyer 11
15 Matthias Kohl 11
16 Charlotte Maia 10
17 Mikis Stasinopoulos 10
18 Simon Urbanek (1) 10
19 Thomas Lumley 10
20 Arne Henningsen 9
21 Gregory R. Warnes 9
22 Jonathan M. Lees 9
23 Michael Hahsler 9
24 Peter Ruckdeschel 9
25 A.I. McLeod 8
26 Brian Lee Yung Rowe 8
27 Dirk Eddelbuettel 8
28 John Fox 8
29 Kaspar Rufibach 8
30 Korbinian Strimmer 8
31 Michael Friendly 8
32 Peter Solymos 8
33 Roger Bivand 8
34 Simon Urbanek (2) 8
35 Christopher Brown 7
36 David Meyer 7
37 ORPHANED 7
38 Revolution Analytics 7
39 Rob J Hyndman 7
40 Romain Francois 7
41 Ulrike Groemping 7
42 Christophe Genolini 6
43 Frank Schaarschmidt 6
44 G. Grothendieck 6
45 Hana Sevcikova 6
46 Jeffrey A. Ryan 6
47 Kjetil Halvorsen 6
48 Pei Wang 6
49 Trevor Hastie 6
50 Yihui Xie 6


### An R script to extract R package information and
### build a SQLite database, by hchao8@gmail.com
library(ggplot2)
library(reshape2)  # provides melt(); older ggplot2 versions attached it via reshape
library(XML)
library(RSQLite)

# Create and connect a SQLite database
conn <- dbConnect(SQLite(), dbname = "c:/Rpackage.db")

# Extract names of R packages available from web
allPackageURL <-
  "http://cran.r-project.org/web/packages/available_packages_by_name.html"
allPackage <- na.omit(melt(readHTMLTable(allPackageURL))[, c("V1")])

# Extract individual package information from web and store data in SQLite 
for (i in seq_along(allPackage)) {
  packageName <- as.character(allPackage[i])
  packageURL <- paste("http://cran.r-project.org/web/packages/",packageName,
                      "/index.html", sep="")
  y <- melt(readHTMLTable(packageURL))
  y$L1 <- packageName
  if(dbExistsTable(conn, "Rpackage")) {
     dbWriteTable(conn, "Rpackage", y, append = TRUE)
  } else {
     dbWriteTable(conn, "Rpackage", y)
  }
} 
# Pull out maintainer information from SQLite database
# dbGetQuery() returns every row and releases the result set before disconnecting
all <- dbGetQuery(conn, "
          select v2 as author, count(v2) as package
          from rpackage
          where v1 = 'Maintainer:'
          group by v2
          order by package desc
          ;")

# Disconnect SQLite database
dbDisconnect(conn)

# Draw a histogram
qplot(package, data = all, binwidth = 1, ylab = "Frequency",
      xlab = "R packages maintained by individual developer")
ggsave("c:/Rlist.png")

# Find 50 most productive developers
head(all, 50)
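
The ranking above counts packages per correspondence address. As a small illustration of that grouping step outside the database, here is a minimal sketch in base R. The data frame `pkgs`, its column names, and all names and addresses in it are made up for the example and do not come from CRAN. Lower-casing the address before counting is my own assumption, not something the original query does.

```r
# Hypothetical package/maintainer records (not real CRAN data)
pkgs <- data.frame(
  package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  name    = c("Jane Doe", "Jane Doe", "Jane Doe", "John Roe"),
  email   = c("Jane@example.org", "jane@example.org",
              "jdoe@example.com", "john@example.net"),
  stringsAsFactors = FALSE
)

# Assumption: treat addresses that differ only in case or padding as the same
pkgs$email <- tolower(trimws(pkgs$email))

# Count packages per address, then sort from most to fewest packages
counts <- aggregate(package ~ email, data = pkgs, FUN = length)
counts <- counts[order(-counts$package, counts$email), ]
rownames(counts) <- NULL

print(counts)
```

In this toy data the same person shows up under two addresses and is therefore counted in two separate rows, which is the same effect that puts one name in the top 50 list twice.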
