From Markdown to LaTeX output using RMarkdown.

I’ve been working on the ggRandomForests vignettes pretty consistently now. I’m writing the randomForestSRC-Survival vignette in LaTeX with the knitr vignette engine. I wrote the the randomForestSRC-Regression vignette in markdown.

I’ve decided to upload the Regression vignette to arXiv for additional distribution. The arXiv submission process prefers LaTeX files, and since RMarkdown can compile to pdf, using pandoc through a LaTeX document, I was hoping for a simple way to go from Markdown to LaTeX. My idea was to generate the LaTeX source, and do a few cleanup edits before submitting.

I tried a few things, Rstudio tends to remove the intermediate tex file after compile. So I went to the rmarkdown::render command. The intermediate files were still removed.

Then I found the presentation at http://blog.rstudio.org/2014/06/18/r-markdown-v2/. The “Aha!” moment was when Yuhui said that the yaml metadata commands pdf_document, html_document and word_document are commands within the RMarkdown package. A quick help search:

> ?pdf_document

pdf_document(toc = FALSE, toc_depth = 2, number_sections = FALSE,
  fig_width = 6.5, fig_height = 4.5, fig_crop = TRUE,
  fig_caption = FALSE, highlight = "default", template = "default",
  keep_tex = FALSE, latex_engine = "pdflatex", includes = NULL,
  pandoc_args = NULL)

and there is a keep_tex argument. Suddenly, the rest of the yaml markdown syntax also makes sense.

I changed my output syntax from:

output:
  pdf_document: 
    fig_caption: true

to:

output:
  pdf_document: 
    fig_caption: true
    keep_tex: true

The Rstudio knit PDF button still removes the tex file, but using the command line render command works as I need.

Now I just need to add a few edits… and I’m off!

Advertisements

Testing, testing, testing!

R testthat unit tests with GitHub, Travis-CI continuous integration and the covr package for Coveralls code coverage.

I’ve been working pretty hard on getting the ggRandomForests package wrapped up so I can work on some other projects that have as much or more potential impact. This is my second CRAN package, and I’ve learned a lot about R programming, with loads of help from the hadleyverse.

I’ve come to statistics from an engineering background, and R is not my first language. I’m familiar with unit testing, what it is and how it’s supposed to work. The advantages of writing tests first seem to be enormous, but I have not really been able to get to the “nuts and bolts” of it. How do I apply this to my work specifically?

So, I started by starting! I went through my ggRandomForests package and wrote a series of testthat tests to just make sure objects belonged to the correct class. At JSM!2014, I cornered Hadley Wickham got him to help me get the tests to run on R CMD CHECK and with devtools test() function the way I expected. So I was off and running.

Except that I really didn’t get the part about “write a test, then code to the test” part. So my test framework languished.

On Monday, I learned about the covr package. I’ve tried code coverage in the past, also without much luck. Getting things instrumented and then figuring it all out was a huge amount of work in C/C++. I’ve also briefly tried some of the R code coverage tools. The covr package has the advantage of putting a bunch of tools into the mix to make the whole toolchain work. And I am benefiting from the process.

I’ll try to briefly describe what I’ve done to implement code coverage to improve my testthat tests and hopefully make my package more stable now and as I add more features in the future.

Code coverage using the covr will require three web based tools:

  • GitHub will host your R code. Hadley covers version control, and using GitHub in his R Packages book. Look at the Git and GitHub chapter.
  • I was already using Travis-CI for continuous integration. You set up this site to watch your GitHub repository, and test your package at every git commit. I’ve caught some silly dependency bugs with this, because I’m using development versions of R packages that are not widely available yet. I haven’t figured out PackRat yet, and I’m not convinced that is the correct solution to this particular problem.

  • Coveralls is the last piece. You set up Coveralls to watch the Travis-CI build, and it will generate a report showing what lines of code you’ve actually hit… with your testthat tests.

GitHub

If you’re not using GitHub for your package development, I suggest you start. You’ll need this account to start. Create a repository for each package. Then Commit early and often. Sit back and watch the fireworks.

My workflow is to develop and write during the day, and commit changes only when I have a clean R CMD CHECK.

Travis CI

I don’t remember how I found Travis-CI, though I’m going to guess it was through either watching Hadley’s Github traffic or reading a retweet of his. Either way, setup was a breeze when I found this GitHub wiki (https://github.com/craigcitro/r-travis/wiki)

You create an account with your GitHub account. You’ll see a list of all your GitHub repos, and you select which ones you want to test continuously.

For R packages, you add .travis.yml file to your package root that tells Travis-CI how to test your package. This is (mostly) mine from the ggRandomForests package. I basically copy and pasted the default from the wiki page.

# Sample .travis.yml for R projects.
#
# See README.md for instructions, or for more configuration options,
# see the wiki:
#   https://github.com/craigcitro/r-travis/wiki

language: c

# For code coverage
before_install:
  - curl -OL http://raw.github.com/craigcitro/r-travis/master/scripts/travis-tool.sh
  - chmod 755 ./travis-tool.sh
  - ./travis-tool.sh bootstrap
install:
  - ./travis-tool.sh install_deps

script: ./travis-tool.sh run_tests

after_failure:
  - ./travis-tool.sh dump_logs

notifications:
  email:
    on_success: change
    on_failure: change

If everything is in order, you’ll now get an email on git push to GitHub whenever the Travis-CI status changes.

You can also watch as Travis-CI does it’s testing by going to the website. You’ll see why your code doesn’t build as well as other diagnostics. And you can add a nice badge to your README.md file which will be displayed on your repo GitHub landing page. The badge is updated in real time… here’s my ggRandomForests badge:
Build Status

Clicking the badge will take you to the Travis-CI build page for the repo. But come back, there’s more!

Coveralls

I was happy with everything, until Jim Hester posted a devtools issue about a pull request for a use_covr() function (use_coveralls() maybe?).

So I clicked through to the covr GitHub page. The README.md file is really short, with simple instructions to get this up and running. So easy, I had to do it right then.

Basically, repeat the Travis-CI setup I did at the Coveralls website (https://coveralls.io/repos/new) and then add 2 lines into my .travis.yml file.

install:
  - ./travis-tool.sh github_package jimhester/covr


after_success:
  - Rscript -e 'covr::coveralls()'

This instructs Travis-CI to install the latest covr package from GitHub before running the tests, then run the covr::coveralls() function to get the data over to (https://coveralls.io/).

And then you can get a new badge, from coveralls!
Coverage Status

It’s like collecting stickers!

Now what?

OK, so at writing, the ggRandomForests code coverage was at 75%. Two days ago, it was at 43%, so I’m pretty pleased with this. How did I get this improvement? By writing better testthat scripts.

I thought about screen shots, but this is long enough already. So, click on the Coverage badge, and it will take you to my Coveralls stats page for the R package (It will direct you away from this page by default). What you see is a history of all the builds I’ve done since I started the Coveralls process. The coverage badge will have an icon indicating how the coverage changed, and what the current percentage of code coverage is.

It took me a little bit to figure out that if you click on the commit message, you’ll see the report that can really help. This page lists the code coverage for each file in your R code directory. If you click on a file, you can see which lines your testthat tests actually hit.

The hard part is to figure out what test you’ll need to add to get your Coveralls number to improve. My particular case required sending in some bad objects, removing some code that I no longer needed and just thinking about what the code was supposed to be doing. I spent about a day really working on tightening up my tests, and I gained a significant increase in test coverage.

If you look at my pages, you’ll see that I have one file that has 0 coverage. That file is probably taking my coverage down from somewhere close to 90%. I am consciously choosing to not test that particular function for two reasons.

1.The function is time intensive. I think it takes about 20-40 minutes to run on a reasonable machine.
2. The function doesn’t test my package, but could be used for a test of the randomForestSRC package I depend on.

I wrote the function to make my life easier. I distribute it in case users want to create their own cached versions of randomForestSRC objects. If I remove the function, my stat goes up, but for the time being I’m OK with the lower number. At least I know why the number is what it is.

I’ll add one more link to Hadley’s books: To really improve testthat tests, I’m going to be returning to (http://r-pkgs.had.co.nz/tests.html). Just because I’ve gotten this far doesn’t mean I’m really at “best practices.”

Another release day: ggRandomForests V1.1.3

Continuing progress with the vignettes mean bug fixes in the code. Plus I’m presenting the regression random forest vignette to the stats group here tomorrow.

http://cran.r-project.org/web/packages/ggRandomForests/index.html

I’ve got another blog post percolating that will detail the biggest change in this version (improved code testing), so here’s the change summary.

  • Update “ggRandomForests: Visually Exploring a Random Forest for Regression” vignette.
  • Further development of draft package vignette “Survival with Random Forests”.
  • Rename vignettes to align with randomForestSRC package usage.
  • Add more tests and example functions.
  • Refactor gg_ functions into S3 methods to allow future implementation for other random forest packages.
  • Improved help files.
  • Updated DESCRIPTION file to remove redundant parts.
  • Misc Bug Fixes.

As always, comments and suggestions are welcome at the ggRandomForests package GitHub development site: https://github.com/ehrlinger/ggRandomForests

Christmas release: ggRandomForests V1.1.2

I’ve posted a new release of the ggRandomForests: Visually Exploring Random Forests to CRAN at (http://cran.r-project.org/package=ggRandomForests)

The biggest news is the inclusion of some holiday reading – a ggRandomForests package vignette!
ggRandomForests: Visually Exploring a Random Forest for Regression

The vignette is a tutorial for using the ggRandomForests package with the randomForestSRC package for building and post-processing a regression random forest. In this tutorial, we explore a random forest model for the Boston Housing Data, available in the MASS package. We grow a random forest for regression and demonstrate how ggRandomForests can be used to when determining variable associations, interactions and how the response depends on predictive variables within the model.

The tutorial demonstrates the design and usage of many of ggRandomForests functions and features. We also demonstrate how to modify and customize the resulting ggplot2 graphic objects along the way.

Next up in the package development queue is the completion of the Survival in Random Forests vignette (Preliminary draft version is avialable on CRAN also). A time to event analysis of the primary biliary cirrhosis (PBC) of the liver data from Fleming and Harrington (1991) “Counting processes and survival analysis.”

The development version of ggRandomForests is on GitHub at (https://github.com/ehrlinger/ggRandomForests)

ggRandomForests: Visually Exploring random forests. V1.1.1 release.

Release early and often.
http://cran.r-project.org/web/packages/ggRandomForests/index.html

I may have been aggressive numbering the first CRAN release at v1.0, but there’s no going back now. The design of the feature set is complete even if the code has some catching up to do.

After the v1.1.0 release we found a bug in gg_partial when handling plots for categorical variables. I’ve also moved the eventtable function to gg_survival. gg_survival will now be used for all Kaplan-Meier and Nelson-Aalen estimates. I still need to extend it for curves other than survival, but it’s a start.

I’ve also made more progress on the vignette, which is driving the TODO list pretty hard. I’m knocking things off as I need them, so it goes both ways.

Look! It’s your data!

If a picture is worth a thousand words, then how many tables are a single visualization worth? Exploratory data analysis is a great way to see what is and is not in your dataset.

I work in a hospital research group. Most of my colleagues are more comfortable in SAS than in R. One of the first ideas I had to help the workflow here was to create an R script that generated a series of data plots to visualize variables within our analysis dataset. This allowed our statistical programmers to do some data checking for quality, outliers and missing data before handing the data set to our biostatisticians. This moved the error checking step up in the project workflow, hopefully saving some time and making us all more productive.

The problem was this script required some custom edits for every data set. Something that the stat programmers wanted to learn, but in the heat of (deadlines) battle, that time for learning new things rarely happens. Hence it gets put off until “later”.

I’ve been contemplating how to use Shiny apps for a long time, and this finally caused me jumped in. It was pretty simple to get started actually, I just dumped a bunch of code I had been using (read: copy and pasting into lots of jobs) and put it in the server.R file. I created a ui.R and BAM! the xportEDA Shiny app was born!

xportEDA A shiny app for data visualization

xportEDA is a Shiny app that generates graphics used to explore your data set. It started out for SAS xport files only, but we’ve added csv and some rdata file support. Written in [R](http://cran.r-project.org/), this shiny app requires the following packages:

  • shiny
  • foreign (to load SAS xport files)
  • ggplot2 (nice figures)
  • reshape2
  • RColorBrewer (color palettes are my friend)

Description

The xportEDA app makes it easy to visualize your data quickly, without requiring programming effort to get a jump on your data wrangling.

You supply the app with a data file. The app can read in a data.frame from a SAS xpt, csv or rdata file, and generates a set of data visualizations.

The app first classifies the variables as continuous, logical or categorical. Any variable with only 2 unique values is interpreted as logical. An ad-hoc definition for categorical variables is any factor plus any variable with more than 2 and less than 10 unique values. By default, character variables are converted to factors, however if we have more than 20 levels, we will not show a panel figure for that variable.

The app creates a faceted set of histograms for all categorical and logical variables, and another set of scatter plots for all continuous variables. Since we often are working in time-to-event settings, the app searches the variable names for some of our “standard” time related variable names to use for the x-axis. Typically, we use a “date of procedure” for this. However, if your data does not have a “time” variable name, we will select the first continuous variable for the x-axis. This variable can be changed though the Shiny interface.

A separate page is set up for visualizing individual variables, making it easy to export a single figure for use in reports or other communications. Useful for when your collaborators do not believe you are missing large chunks of data in a variable, or there are negative values for strictly positive variables, like height.

We also include a data summary page for further data debugging purposes.

Examples

This example is from our research data.

categorical
Categorical Data

The first figure shows all the categorical data in this dataset. Each of the histograms look pretty uniformly distributed, though there are some missing values shown in the hx_fcad variable.

continuous
Continuous variables

Continuous variables are also informative. Here we see the uniformly distributed data over the time of interest (about 7 years of follow up). However, there are a series of extreme values many variables like bun_pr or BMI. BMI is a variable made from height and weight, so the extremely short or extremely large people can make for very strange BMI measures. This is typically a problem with units of measurement. What should we do with these extreme values?

The bottom panel also shows a simple way to check goodness of follow up for time to event data. We expect the triangular shape as subjects that entered the study early should have the longest follow up. The internal part of the triangle should be mostly red xs, indicating an event occurred (death) and most of the blue circles should be on the hypotenuse of the triangle, as they indicate censored or alive case, occuring hopefully at the end of the study as opposed to lost to follow up cases in the interior.

single
Single variable

A single plot for bun_pr makes it pretty clear there are only 3 values that are suspect. Since there are quite a few observations, we may want to make these missing and use imputation, or return these observations for data correction.

summary
Data summary

This is not the optimal way to view a data summary, but it may help the user understand where issues with the app are coming from.

Running the App

To try it out, you can see it on the shinapps.io site
https://ehrlinger.shinyapps.io/xportEDA/

I have also posted the app code to a GitHub repository where you can download it, and try it out. Let me know how it goes, report bugs or contribute back. I’d love to make this better, and learn more Shiny tricks along the way.

The easiest way is to run the app locally is to download it from the https://github.com/ehrlinger/xportEDA repository.

Then run it from R
R> library(shiny)
R> runApp()

or run it in R directly from the repository:

R> shiny::runGitHub("ehrlinger/xportEDA")

Issues and Caveats

I could put the standard “use at your own risk” disclaimer here. I will also add:

The app is written with my specific problem domain in mind though I am open for suggestions on how to improve it.

We tend to use time to event data (working in a hospital after all).

Our group uses SAS predominantly, hence the “xport” functionality and the app naming structure.

xportEDA will have trouble with large p data sets, as I have not figured out how to make shiny extend the figures indefinitely down the page. I do dynamically set the number of columns in an effort to control how small the panel plots get. But if you get into the 75 categorical or continuous variable range, it may become illegible.

Progress not perfection! The best way to start, is to start.

Parallel execution of randomForestSRC

I guess I’m the resident expert on resampling methods at work. I’ve been using bagged predictors and random forests for a while, and have recently been using the randomForestSRC (RF-SRC) package in R (http://cran.r-project.org/web/packages/randomForestSRC). This package merges the two randomForest implementations, randomForest package for regression and classification forests and the randomSurvivalForest package for survival forests.

By default the package is installed to run on one processor, however, being embarrassingly parallelizable, a major advantage of RF-SRC is that it can be compiled to run on multicore machines easily. It does take a little tweaking to get it to work though, and this post is intended to document that process. I assume you have R installed, and have a compiler for package installation (R-dev libraries possibly).

As Larry Wall put it “There’s More Than One Way to Do It”, and there certainly could be another smoother path to get this to work. I’ll just note what I did, and am open to modifications.

First, we do need to compile from source, so download the source package from CRAN at  http://cran.r-project.org/web/packages/randomForestSRC and unpack it in your favorite dev directory.

For serial execution, you can either install as is

R CMD INSTALL randomForestSRC

or from within R, just use

install.packages("randomForestSRC")

For parallel code, open your terminal for the following commands:

cd randomForestSRC
autoconf

autoconf with create a configure file for compilation of the source code.

cd ..
R CMD INSTALL randomForestSRC

This will compile and install the code in your library. If you also want to install an alternate binary (x86_64 and i386 on Mac OS X) you will also need the following

R32 CMD INSTALL --clean --libs-only randomForestSRC

or

R64 CMD INSTALL --clean --libs-only randomForestSRC

Depending on which architecture your machine reverts to by default.

At this point, you can run either architecture R32/R64 or simply the default R, and load the package.

library(randomForestSRC)

Then run an example like:

### Survival analysis
### Veteran data
### Randomized trial of two treatment regimens for lung cancer

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

# print and plot the grow object
print(v.obj)
plot(v.obj)

And watch all the processors light up in htop.

You can also control the processor use by either setting the RF_CORES environment variable, or adding

options(rf.cores = x)

to your ~/.Rprofile file.

Happy burying your processors!