Using parallel processing in R

Lately I’ve been running a lot of complex models with huge datasets, which grinds my computer to a halt for hours. Streamlining code only goes so far, and R is limited because the default session runs on just 1 core. In a time when computers have at least 2 cores, if not more, why not take advantage of that extra computing power? (Heck, even my phone has 2 cores.*)

Luckily, R comes bundled with the “parallel” package, which helps to distribute the workload across multiple cores. It’s a cinch to set up on a local machine:

1) Set the number of cores on which to launch an R session

2) Send the data and whatever packages you need to execute your calculations to each of those cores

3) Run the calculations by supplying a list of operations (see the sketch after this list)
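
Here’s a bare-bones sketch of those three steps (the object `base.num` is a made-up stand-in for real data, and `stats` stands in for whatever package the workers actually need):

library(parallel)
cl=makeCluster(2) # 1) launch an R session on each of 2 cores
base.num=10 # a toy object to send to the workers
clusterExport(cl,varlist=c("base.num")) # 2) send objects to each core
clusterEvalQ(cl,library(stats)) # 2) load packages on each core
out=parLapply(cl,list(1,2,3),function(i) base.num+i) # 3) run over a list
stopCluster(cl) # shut down the workers when finished
out # a list containing 11, 12, and 13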

And that’s it! I’ve been doing this a lot when running models with multiple responses or different sets of predictors. As long as those combinations can be stored in a list, the time spent waiting on models is greatly reduced. Here’s an example using generalized additive models (GAMs), which tend to take longer to run than other kinds of models:

[EDIT 6/13/2013] I forgot to mention that this code was written on my Windows PC. Mac and Linux users can instead use the fork-based function `mclapply`, which is also bundled in the `parallel` package (it originated in the older `multicore` package).
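
For Mac/Linux users, the same fit with `mclapply` would look something like this (a sketch assuming the `data.list` object created below; `mclapply` forks the current session, so no cluster setup, exports, or per-core `library()` calls are needed):

library(parallel)
library(mgcv)
# mc.cores sets how many processes to fork; forking is not available on Windows
mclapply(data.list,function(i) {
  gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i) },mc.cores=detectCores())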

library(parallel) # calls: makeCluster, detectCores, clusterEvalQ, clusterExport, parLapply, stopCluster
#install.packages("mgcv")
library(mgcv) # calls: gamSim, gam

# First, send an instance of R to each core on the local machine.
# detectCores() returns the number of logical cores on the machine; replace it
# with a number to use fewer than the maximum
cl=makeCluster(detectCores()) # Example: cl=makeCluster(2) would use 2 cores

# Load the mgcv package in the R session on each core
clusterEvalQ(cl,library(mgcv))
# Use function clusterExport() to send dataframes or other objects to each core
# clusterExport(cl,varlist=c("exampledata"))
# Create datasets for analysis
data=gamSim(1,n=10000,dist="normal",scale=2) 
data1=gamSim(1,n=10000,dist="normal",scale=2)
data2=gamSim(1,n=10000,dist="normal",scale=2) 
# Bind datasets in a list
data.list=list(data,data1,data2)

# Use parLapply to run GAM
system.time( parLapply(cl,data.list,function(i) {
  gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i) } ) )
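
# parLapply returns a list, so assign the result to keep the fitted models
# (the system.time() wrapper above discards them)
fits=parLapply(cl,data.list,function(i) {
  gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i) } )
lapply(fits,summary) # one model summary per dataset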

# Close the cluster when finished
stopCluster(cl)

# For comparison's sake, how long would this take to run using regular lapply and 
# 1 core?

# Use lapply to run GAM
system.time( lapply(data.list,function(i) {
  gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i) } ) )
# Elapsed time is about twice as long on a single core!


*Except my phone doesn’t have R
