Lately I’ve been running a lot of complex models with huge datasets, which has been grinding my computer to a halt for hours at a time. Streamlining code only goes so far, and R itself is limited because a default session runs on just 1 core. In a time when computers have at least 2 cores, if not more, why not take advantage of that extra computing power? (Heck, even my phone has 2 cores.*)
Luckily, R comes bundled with the “parallel” package, which helps to distribute the workload across multiple cores. It’s a cinch to set up on a local machine:
1) Set the number of cores on which to launch an R session
2) Send the data and whatever packages you need to execute your calculations to each of those cores
3) Run the calculations by supplying a list of operations (a minimal sketch of these steps follows this list)
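For instance, here's a minimal sketch of those three steps using a toy calculation; the object names and the made-up scaling constant are just placeholders:

library(parallel)

# Step 1: launch an R instance on two cores
cl=makeCluster(2)

# Step 2: send any objects (or packages) the workers will need
scale.factor=10
clusterExport(cl,varlist=c("scale.factor"))

# Step 3: run the calculations by supplying a list
toy.list=list(rnorm(1e5),rnorm(1e5),rnorm(1e5))
out=parLapply(cl,toy.list,function(x) mean(x)*scale.factor)

# Shut the workers down when finished
stopCluster(cl)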
And that’s it! I’ve been doing this a lot when running models with multiple responses or different sets of predictors. As long as those combinations can be stored in a list, the system downtime is greatly reduced. Here’s an example using generalized additive models (GAMs), which tend to take longer to run than other kinds of models:
[EDIT 6/13/2013] I forgot to mention that this code was written on my Windows PC. Mac users have access to the function `mclapply` in the `multicore` package.
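For Mac (or Linux) users, here's a rough sketch of the same idea with `mclapply`, which forks the current session rather than launching separate ones; the `mc.cores` argument sets the number of processes, and this approach won't work on Windows:

library(parallel) # mclapply is also bundled here in recent versions of R
library(mgcv)

# Same kind of simulated data as in the Windows example below
data.list=list(gamSim(1,n=10000,dist="normal",scale=2),
               gamSim(1,n=10000,dist="normal",scale=2))

# mc.cores controls how many forked processes handle the list
fits=mclapply(data.list,function(i) {
  gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i)
},mc.cores=2)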
library(parallel) # calls: makeCluster, clusterEvalQ, parLapply
# install.packages("mgcv")
library(mgcv) # calls: gamSim, gam

# First, send an instance of R to each core on the local machine
# The detectCores() function detects the number of physical cores and sends R to
# all of them, but one could replace the function with a number to utilize fewer
# than the maximum number of cores
cl=makeCluster(detectCores()) # Example: cl=makeCluster(2) would use 2 cores

# Load package on all instances of R on all cores
clusterEvalQ(cl,c(library(mgcv)))

# Use function clusterExport() to send dataframes or other objects to each core
# clusterExport(cl,varlist=c("exampledata"))

# Create datasets for analysis
data=gamSim(1,n=10000,dist="normal",scale=2)
data1=gamSim(1,n=10000,dist="normal",scale=2)
data2=gamSim(1,n=10000,dist="normal",scale=2)

# Bind datasets in a list
data.list=list(data,data1,data2)

# Use parLapply to run the GAMs in parallel
system.time(
  parLapply(cl,data.list,function(i) {
    gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i)
  })
)

# Close the cluster
stopCluster(cl)

# For comparison's sake, how long would this take to run using regular lapply
# and 1 core?

# Use lapply to run the GAMs on a single core
system.time(
  lapply(data.list,function(i) {
    gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i)
  })
)
# System downtime is twice as long!
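One side note: the timing calls above throw the fitted models away. To actually keep them, just assign the parLapply result to an object; a quick sketch, reopening the cluster since it was stopped above:

cl=makeCluster(detectCores())
clusterEvalQ(cl,c(library(mgcv)))

# Store the fitted GAMs in an ordinary list
gam.list=parLapply(cl,data.list,function(i) {
  gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=i)
})

stopCluster(cl)

# Inspect the results as usual, e.g. model summaries
lapply(gam.list,summary)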
*Except my phone doesn’t have R