ggRandomForests: Visually Exploring Random Forests. v1.1.1 release.

Release early and often.
http://cran.r-project.org/web/packages/ggRandomForests/index.html

I may have been aggressive in numbering the first CRAN release v1.0, but there’s no going back now. The design of the feature set is complete even if the code has some catching up to do.

After the v1.1.0 release we found a bug in gg_partial when handling plots for categorical variables. I’ve also moved the eventtable function to gg_survival. gg_survival will now be used for all Kaplan-Meier and Nelson-Aalen estimates. I still need to extend it for curves other than survival, but it’s a start.

I’ve also made more progress on the vignette, which is driving the TODO list pretty hard. I’m knocking things off as I need them, so it goes both ways.

Parallel execution of randomForestSRC

I guess I’m the resident expert on resampling methods at work. I’ve been using bagged predictors and random forests for a while, and have recently been using the randomForestSRC (RF-SRC) package in R (http://cran.r-project.org/web/packages/randomForestSRC). This package merges two random forest implementations: the randomForest package for regression and classification forests, and the randomSurvivalForest package for survival forests.

By default the package is installed to run on one processor. However, random forests are embarrassingly parallelizable, and a major advantage of RF-SRC is that it can be compiled to run on multicore machines easily. It does take a little tweaking to get it to work though, and this post is intended to document that process. I assume you have R installed, and have a compiler for package installation (possibly the R-dev libraries).

As Larry Wall put it, “There’s More Than One Way to Do It”, and there could certainly be another, smoother path to get this to work. I’ll just note what I did, and am open to modifications.

First, we do need to compile from source, so download the source package from CRAN at http://cran.r-project.org/web/packages/randomForestSRC and unpack it in your favorite dev directory.
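If you would rather not click through the CRAN page, the download and unpack step can be scripted. This is a minimal sketch, not the only way to do it: `fetch_rfsrc_source` is a name I made up, the repository URL is an assumption (any CRAN mirror should do), and it relies on `Rscript` and `tar` being on your path.

```shell
# Sketch only: download the randomForestSRC source tarball and unpack it.
# fetch_rfsrc_source is a hypothetical helper, not part of any package.
# Usage: fetch_rfsrc_source ~/dev
fetch_rfsrc_source() {
  dest="${1:-.}"
  mkdir -p "$dest" || return 1

  # download.packages() with type = "source" fetches the .tar.gz from CRAN.
  # The repos URL is an assumption; substitute your preferred mirror.
  Rscript -e 'args <- commandArgs(trailingOnly = TRUE)
    download.packages("randomForestSRC", destdir = args[1],
                      type = "source", repos = "https://cran.r-project.org")' \
    "$dest" || return 1

  # Unpack whatever version was downloaded into the same directory.
  for tarball in "$dest"/randomForestSRC_*.tar.gz; do
    [ -f "$tarball" ] || { echo "No tarball found in $dest" >&2; return 1; }
    tar -xzf "$tarball" -C "$dest" || return 1
  done
}
```

Calling `fetch_rfsrc_source ~/dev` should leave a `randomForestSRC` directory under `~/dev`, ready for the steps that follow.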

For serial execution, you can either install as is

R CMD INSTALL randomForestSRC

or from within R, just use

install.packages("randomForestSRC")

For parallel code, open your terminal for the following commands:

cd randomForestSRC
autoconf

autoconf will create a configure file for compilation of the source code.

cd ..
R CMD INSTALL randomForestSRC

This will compile and install the code in your library. If you also want to install an alternate binary (x86_64 and i386 on Mac OS X) you will also need the following

R32 CMD INSTALL --clean --libs-only randomForestSRC

or

R64 CMD INSTALL --clean --libs-only randomForestSRC

depending on which architecture your machine uses by default.
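The autoconf and install steps for the default architecture can also be wrapped in a small shell function that stops early when something is missing. This is a sketch under assumptions: `build_rfsrc` is a hypothetical name, and I am assuming the unpacked source ships a `configure.ac` (or `configure.in`) for autoconf to read.

```shell
# Sketch only: wraps the autoconf + install steps with basic checks.
# build_rfsrc is a hypothetical helper.
# Usage: build_rfsrc path/to/randomForestSRC
build_rfsrc() {
  src_dir="${1:-randomForestSRC}"

  # Assumption: the unpacked source ships an autoconf input file.
  if [ ! -f "$src_dir/configure.ac" ] && [ ! -f "$src_dir/configure.in" ]; then
    echo "No configure.ac or configure.in in $src_dir" >&2
    return 1
  fi

  # Generate the configure script inside the package directory.
  (cd "$src_dir" && autoconf) || return 1

  if [ ! -f "$src_dir/configure" ]; then
    echo "autoconf did not produce a configure script" >&2
    return 1
  fi

  # Compile and install into the default R library.
  R CMD INSTALL "$src_dir"
}
```

Run it from the directory that contains the unpacked package, for example `build_rfsrc randomForestSRC`.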

At this point, you can run either architecture (R32 or R64) or simply the default R, and load the package.

library(randomForestSRC)

Then run an example like:

### Survival analysis
### Veteran data
### Randomized trial of two treatment regimens for lung cancer

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

# print and plot the grow object
print(v.obj)
plot(v.obj)

And watch all the processors light up in htop.

You can also control how many cores are used by either setting the RF_CORES environment variable, or adding

options(rf.cores = x)

to your ~/.Rprofile file.
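For the environment variable route, the value can be set in the shell before starting R. A minimal sketch, assuming a POSIX shell and that `getconf _NPROCESSORS_ONLN` works on your system (it does on most Linux and Mac OS X machines). Leaving one core free is my own habit, not something the package requires.

```shell
# Count online processors; fall back to 1 if getconf cannot tell us.
n_cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Leave one core free for the rest of the system, but never go below 1.
if [ "$n_cores" -gt 1 ]; then
  RF_CORES=$((n_cores - 1))
else
  RF_CORES=1
fi
export RF_CORES

echo "RF_CORES=$RF_CORES"
# Any R session started from this shell inherits the setting.
```

Put these lines in your shell startup file if you want the setting to persist across sessions.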

Happy burying your processors!