Another release day: ggRandomForests V1.1.3

Continued progress on the vignettes means bug fixes in the code. Plus, I’m presenting the regression random forest vignette to the stats group here tomorrow.

http://cran.r-project.org/web/packages/ggRandomForests/index.html

I’ve got another blog post percolating that will detail the biggest change in this version (improved code testing), so here’s the change summary.

  • Update “ggRandomForests: Visually Exploring a Random Forest for Regression” vignette.
  • Further development of draft package vignette “Survival with Random Forests”.
  • Rename vignettes to align with randomForestSRC package usage.
  • Add more tests and example functions.
  • Refactor gg_ functions into S3 methods to allow future implementation for other random forest packages.
  • Improved help files.
  • Updated DESCRIPTION file to remove redundant parts.
  • Misc Bug Fixes.

As always, comments and suggestions are welcome at the ggRandomForests package GitHub development site: https://github.com/ehrlinger/ggRandomForests

Christmas release: ggRandomForests V1.1.2

I’ve posted a new release of ggRandomForests: Visually Exploring Random Forests to CRAN (http://cran.r-project.org/package=ggRandomForests).

The biggest news is the inclusion of some holiday reading – a ggRandomForests package vignette!
ggRandomForests: Visually Exploring a Random Forest for Regression

The vignette is a tutorial for using the ggRandomForests package with the randomForestSRC package for building and post-processing a regression random forest. In this tutorial, we explore a random forest model for the Boston Housing Data, available in the MASS package. We grow a random forest for regression and demonstrate how ggRandomForests can be used to determine variable associations and interactions, and how the response depends on the predictor variables within the model.

The tutorial demonstrates the design and usage of many of the ggRandomForests functions and features. We also demonstrate how to modify and customize the resulting ggplot2 graphic objects along the way.

Next up in the package development queue is the completion of the Survival in Random Forests vignette (a preliminary draft version is also available on CRAN). It is a time-to-event analysis of the primary biliary cirrhosis (PBC) of the liver data from Fleming and Harrington (1991), “Counting processes and survival analysis.”

The development version of ggRandomForests is on GitHub at (https://github.com/ehrlinger/ggRandomForests)

ggRandomForests: Visually Exploring random forests. V1.1.1 release.

Release early and often.
http://cran.r-project.org/web/packages/ggRandomForests/index.html

I may have been aggressive numbering the first CRAN release at v1.0, but there’s no going back now. The design of the feature set is complete even if the code has some catching up to do.

After the v1.1.0 release we found a bug in gg_partial when handling plots for categorical variables. I’ve also moved the eventtable function to gg_survival. gg_survival will now be used for all Kaplan-Meier and Nelson-Aalen estimates. I still need to extend it for curves other than survival, but it’s a start.

I’ve also made more progress on the vignette, which is driving the TODO list pretty hard. I’m knocking things off as I need them, so it goes both ways.

Parallel execution of randomForestSRC

I guess I’m the resident expert on resampling methods at work. I’ve been using bagged predictors and random forests for a while, and have recently been using the randomForestSRC (RF-SRC) package in R (http://cran.r-project.org/web/packages/randomForestSRC). This package merges two random forest implementations: the randomForest package for regression and classification forests, and the randomSurvivalForest package for survival forests.

By default the package is installed to run on one processor. However, growing a random forest is embarrassingly parallel, and a major advantage of RF-SRC is that it can easily be compiled to run on multicore machines. It does take a little tweaking to get it to work, though, and this post is intended to document that process. I assume you have R installed and have a compiler for package installation (and possibly the R-dev libraries).

As Larry Wall put it “There’s More Than One Way to Do It”, and there certainly could be another smoother path to get this to work. I’ll just note what I did, and am open to modifications.

First, we do need to compile from source, so download the source package from CRAN at http://cran.r-project.org/web/packages/randomForestSRC and unpack it in your favorite dev directory.

For serial execution, you can either install as is

R CMD INSTALL randomForestSRC

or from within R, just use

install.packages("randomForestSRC")

For parallel code, open your terminal for the following commands:

cd randomForestSRC
autoconf

autoconf will create a configure file for compilation of the source code.
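
Before moving on, it can be worth confirming that the configure file really is there. Here is a minimal sketch, run from inside the randomForestSRC directory. Note that has_configure is a hypothetical helper name of my own, not something the package provides, and it only checks that a non-empty configure file exists.

```shell
# has_configure is a hypothetical helper, not part of randomForestSRC.
# It succeeds if the given directory (default: current directory)
# contains a non-empty configure file.
has_configure() {
    dir="${1:-.}"
    [ -s "$dir/configure" ]
}

if has_configure .; then
    echo "configure found"
else
    echo "configure missing: check the autoconf output"
fi
```

If the second message shows up, the most likely culprits are a missing autoconf install or running the command from the wrong directory.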

cd ..
R CMD INSTALL randomForestSRC

This will compile and install the code in your library. If you also want to install an alternate binary (x86_64 and i386 on Mac OS X) you will also need the following

R32 CMD INSTALL --clean --libs-only randomForestSRC

or

R64 CMD INSTALL --clean --libs-only randomForestSRC

depending on which architecture your machine uses by default.

At this point, you can run either architecture R32/R64 or simply the default R, and load the package.

library(randomForestSRC)

Then run an example like:

### Survival analysis
### Veteran data
### Randomized trial of two treatment regimens for lung cancer

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

# print and plot the grow object
print(v.obj)
plot(v.obj)

And watch all the processors light up in htop.

You can also control processor use by either setting the RF_CORES environment variable or adding

options(rf.cores = x)

to your ~/.Rprofile file.
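
For a one-off session, the environment variable route can be sketched from the shell before launching R. Using half of the available cores is purely my own illustrative assumption, not a package recommendation, and I'm assuming getconf _NPROCESSORS_ONLN is available on your system (it is on Linux and Mac OS X, as far as I know).

```shell
# Count online processors; fall back to 1 if getconf cannot report it.
total_cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Illustrative assumption: leave half the machine free for other work.
cores=$(( total_cores / 2 ))
if [ "$cores" -lt 1 ]; then
    cores=1
fi

# R sessions started from this shell inherit the variable.
export RF_CORES="$cores"
echo "RF_CORES=$RF_CORES"
```

Start R from that same shell, rerun the veteran example, and htop should show correspondingly fewer processors in use.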

Happy burying your processors!