Parallel execution of randomForestSRC

I guess I’m the resident expert on resampling methods at work. I’ve been using bagged predictors and random forests for a while, and have recently been using the randomForestSRC (RF-SRC) package in R (http://cran.r-project.org/web/packages/randomForestSRC). This package merges two earlier random forest implementations: the randomForest package for regression and classification forests, and the randomSurvivalForest package for survival forests.

By default the package is installed to run on a single processor. Random forests are embarrassingly parallel, though, and a major advantage of RF-SRC is that it can be compiled to run on multicore machines. It does take a little tweaking to get that working, and this post documents the process. I assume you have R installed, along with a compiler for building packages (and possibly the R development libraries).

As Larry Wall put it, “There’s More Than One Way to Do It”, and there could well be a smoother path to get this working. I’ll just note what I did, and I’m open to modifications.

First, we do need to compile from source, so download the source package from CRAN at http://cran.r-project.org/web/packages/randomForestSRC and unpack it in your favorite dev directory.
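If you would rather script the download and unpack step, here is a rough shell sketch. It assumes Rscript is on your PATH and that the cloud.r-project.org mirror is reachable. The names fetch_source and unpack_source are ones I made up for illustration; download.packages() is the standard utils function for fetching a source tarball.

```shell
# Download the current source tarball from CRAN into the working directory.
# Assumptions: Rscript is on the PATH and the mirror below is reachable.
fetch_source() {
    Rscript -e 'download.packages("randomForestSRC", destdir = ".", type = "source", repos = "https://cloud.r-project.org")'
}

# Unpack a source tarball and confirm that it looks like an R package
# by checking for the DESCRIPTION file every package must ship.
unpack_source() {
    tar -xzf "$1" && [ -f randomForestSRC/DESCRIPTION ] && echo "unpacked randomForestSRC"
}

# Typical use (left commented out because it needs network access):
# fetch_source && unpack_source randomForestSRC_*.tar.gz
```

The glob in the last line avoids hard-coding a version number, since the tarball name changes with every release.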

For serial execution, you can either install as is

R CMD INSTALL randomForestSRC

or from within R, just use

install.packages("randomForestSRC")

For parallel code, open your terminal for the following commands:

cd randomForestSRC
autoconf

autoconf will create a configure file for compilation of the source code.
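Before going further, it can be worth checking that autoconf really produced the configure script and that your compiler understands OpenMP, which is what the parallel RF-SRC code relies on. The sketch below assumes you are still inside the randomForestSRC directory and that the compiler is reachable as $CC or cc (R may be configured to use a different one; R CMD config CC will tell you). The function names are my own.

```shell
# Report whether a configure script exists in the given directory.
# The directory argument defaults to the current one.
has_configure() {
    if [ -f "${1:-.}/configure" ]; then
        echo "configure found"
    else
        echo "configure missing"
    fi
}

# Report whether the compiler accepts -fopenmp and defines _OPENMP.
# Assumption: the compiler is $CC if set, otherwise cc.
openmp_status() {
    if "${CC:-cc}" -fopenmp -x c -dM -E - </dev/null 2>/dev/null | grep -q '_OPENMP'; then
        echo "OpenMP supported"
    else
        echo "OpenMP not detected"
    fi
}

has_configure .
openmp_status
```

The -dM -E flags only run the preprocessor, so nothing is compiled or linked. Treat the result as a quick hint rather than proof that the OpenMP runtime is installed.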

cd ..
R CMD INSTALL randomForestSRC

This will compile and install the code in your library. If you also want to install the alternate architecture (x86_64 and i386 on Mac OS X), you will also need one of the following:

R32 CMD INSTALL --clean --libs-only randomForestSRC

or

R64 CMD INSTALL --clean --libs-only randomForestSRC

Which one you need depends on which architecture your machine uses by default.

At this point, you can start either architecture (R32 or R64), or simply the default R, and load the package.

library(randomForestSRC)

Then run an example like:

### Survival analysis
### Veteran data
### Randomized trial of two treatment regimens for lung cancer

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

# print and plot the grow object
print(v.obj)
plot(v.obj)

And watch all the processors light up in htop.

You can also control the processor use by either setting the RF_CORES environment variable, or adding

options(rf.cores = x)

to your ~/.Rprofile file, where x is the number of cores you want RF-SRC to use.
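For the environment variable route, something like the following could go in your shell startup file or be run before launching R. It is a sketch under my own assumption that you want to leave one core free for everything else. getconf _NPROCESSORS_ONLN reports the number of online processors on Linux and Mac OS X, and ncores and cores_to_use are just illustrative names.

```shell
# Count the online processors, falling back to 1 if getconf cannot tell us.
ncores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Assumed policy: leave one core free, but never drop below one.
cores_to_use=$(( ncores > 1 ? ncores - 1 : 1 ))

# RF_CORES is the environment variable the package reads.
export RF_CORES="$cores_to_use"
echo "RF_CORES=$RF_CORES"
```

Any R session started from that shell afterwards inherits the variable.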

Happy burying your processors!


One thought on “Parallel execution of randomForestSRC”

  1. jehrlinger February 14, 2013 / 3:26 pm

    Of course, this information is all available in the randomForestSRC documentation. First install the package from CRAN, then load the serial version with the library(randomForestSRC) statement. Then issue:

    package?randomForestSRC

    The documentation also indicates that a better .Rprofile line would be

    options(rf.cores = -1L, mc.cores=-1L)

    That setting will let R use all but one processor with both OpenMP (the parallel RF-SRC methods) and the parallel package.
