From Markdown to LaTeX output using RMarkdown.

I’ve been working on the ggRandomForests vignettes pretty consistently now. I’m writing the randomForestSRC-Survival vignette in LaTeX with the knitr vignette engine. I wrote the randomForestSRC-Regression vignette in markdown.

I’ve decided to upload the Regression vignette to arXiv for additional distribution. The arXiv submission process prefers LaTeX files, and since RMarkdown compiles to PDF by having pandoc generate an intermediate LaTeX document, I was hoping for a simple way to go from Markdown to LaTeX. My idea was to generate the LaTeX source and do a few cleanup edits before submitting.

I tried a few things. RStudio tends to remove the intermediate tex file after compiling, so I went to the rmarkdown::render command. The intermediate files were still removed.

Then I found the presentation at http://blog.rstudio.org/2014/06/18/r-markdown-v2/. The “Aha!” moment was when Yihui said that the yaml metadata entries pdf_document, html_document and word_document are functions within the rmarkdown package. A quick help search:

> ?pdf_document

pdf_document(toc = FALSE, toc_depth = 2, number_sections = FALSE,
  fig_width = 6.5, fig_height = 4.5, fig_crop = TRUE,
  fig_caption = FALSE, highlight = "default", template = "default",
  keep_tex = FALSE, latex_engine = "pdflatex", includes = NULL,
  pandoc_args = NULL)

and there is a keep_tex argument. Suddenly, the rest of the yaml markdown syntax also makes sense.

I changed my output syntax from:

output:
  pdf_document: 
    fig_caption: true

to:

output:
  pdf_document: 
    fig_caption: true
    keep_tex: true

The RStudio Knit PDF button still removes the tex file, but calling rmarkdown::render from the command line works as I need.
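A minimal sketch of that render call, assuming the vignette lives in a file named my-vignette.Rmd (a placeholder, not the actual file name). Passing the format object explicitly mirrors the yaml above; calling render() with only the input file should pick up the yaml settings instead.

```r
library(rmarkdown)

# Build the PDF output format with the same options as the yaml header.
# keep_tex = TRUE asks rmarkdown to leave the intermediate .tex file in place.
fmt <- pdf_document(fig_caption = TRUE, keep_tex = TRUE)

# "my-vignette.Rmd" is a hypothetical file name used for illustration only.
input <- "my-vignette.Rmd"

# Guard so the sketch does nothing when the placeholder file is absent.
if (file.exists(input)) {
  render(input, output_format = fmt)
}
```

After a successful run, the tex file should sit next to the generated PDF, ready for cleanup edits.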

Now I just need to add a few edits… and I’m off!


Testing, testing, testing!

R testthat unit tests with GitHub, Travis-CI continuous integration and the covr package for Coveralls code coverage.

I’ve been working pretty hard on getting the ggRandomForests package wrapped up so I can work on some other projects that have as much or more potential impact. This is my second CRAN package, and I’ve learned a lot about R programming, with loads of help from the hadleyverse.

I’ve come to statistics from an engineering background, and R is not my first language. I’m familiar with unit testing, what it is and how it’s supposed to work. The advantages of writing tests first seem to be enormous, but I have not really been able to get to the “nuts and bolts” of it. How do I apply this to my work specifically?

So, I started by starting! I went through my ggRandomForests package and wrote a series of testthat tests to just make sure objects belonged to the correct class. At JSM 2014, I cornered Hadley Wickham and got him to help me get the tests to run under R CMD check and with the devtools test() function the way I expected. So I was off and running.
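A class check of that kind might look roughly like the sketch below. The constructor make_gg_example() and the class name gg_example are made up for illustration and are not functions from ggRandomForests; expect_s3_class() is the current testthat expectation for this sort of check.

```r
library(testthat)

# Hypothetical constructor standing in for a package function
# that returns an S3 object.
make_gg_example <- function(x) {
  structure(list(values = x), class = c("gg_example", "list"))
}

test_that("make_gg_example returns an object of the expected class", {
  obj <- make_gg_example(1:3)

  # The object should carry the S3 class we expect.
  expect_s3_class(obj, "gg_example")

  # And the data we passed in should survive unchanged.
  expect_equal(obj$values, 1:3)
})
```

In a package, a file like this would live under tests/testthat/ so that both devtools::test() and R CMD check pick it up.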

Except that I really didn’t get the “write a test, then code to the test” part. So my test framework languished.

On Monday, I learned about the covr package. I’ve tried code coverage in the past, also without much luck. Getting things instrumented and then figuring it all out was a huge amount of work in C/C++. I’ve also briefly tried some of the R code coverage tools. The covr package has the advantage of putting a bunch of tools into the mix to make the whole toolchain work. And I am benefiting from the process.

I’ll try to briefly describe what I’ve done to implement code coverage to improve my testthat tests and hopefully make my package more stable now and as I add more features in the future.

Code coverage using the covr package requires three web-based tools:

  • GitHub will host your R code. Hadley covers version control and using GitHub in his R Packages book. Look at the Git and GitHub chapter.
  • I was already using Travis-CI for continuous integration. You set up this site to watch your GitHub repository and test your package each time you push commits. I’ve caught some silly dependency bugs with this, because I’m using development versions of R packages that are not widely available yet. I haven’t figured out PackRat yet, and I’m not convinced that is the correct solution to this particular problem.
  • Coveralls is the last piece. You set up Coveralls to watch the Travis-CI build, and it will generate a report showing which lines of code you’ve actually hit with your testthat tests.

GitHub

If you’re not using GitHub for your package development, I suggest you start. You’ll need a GitHub account for everything that follows. Create a repository for each package. Then commit early and often. Sit back and watch the fireworks.

My workflow is to develop and write during the day, and commit changes only when I have a clean R CMD CHECK.

Travis CI

I don’t remember how I found Travis-CI, though I’m going to guess it was through either watching Hadley’s GitHub traffic or reading a retweet of his. Either way, setup was a breeze when I found this GitHub wiki (https://github.com/craigcitro/r-travis/wiki).

You create an account with your GitHub account. You’ll see a list of all your GitHub repos, and you select which ones you want to test continuously.

For R packages, you add a .travis.yml file to your package root that tells Travis-CI how to test your package. This is (mostly) mine from the ggRandomForests package. I basically copied and pasted the default from the wiki page.

# Sample .travis.yml for R projects.
#
# See README.md for instructions, or for more configuration options,
# see the wiki:
#   https://github.com/craigcitro/r-travis/wiki

language: c

# For code coverage
before_install:
  - curl -OL http://raw.github.com/craigcitro/r-travis/master/scripts/travis-tool.sh
  - chmod 755 ./travis-tool.sh
  - ./travis-tool.sh bootstrap
install:
  - ./travis-tool.sh install_deps

script: ./travis-tool.sh run_tests

after_failure:
  - ./travis-tool.sh dump_logs

notifications:
  email:
    on_success: change
    on_failure: change

If everything is in order, you’ll now get an email on git push to GitHub whenever the Travis-CI status changes.

You can also watch as Travis-CI does its testing by going to the website. You’ll see why your code doesn’t build as well as other diagnostics. And you can add a nice badge to your README.md file which will be displayed on your repo GitHub landing page. The badge is updated in real time… here’s my ggRandomForests badge:
Build Status

Clicking the badge will take you to the Travis-CI build page for the repo. But come back, there’s more!

Coveralls

I was happy with everything, until Jim Hester posted a devtools issue about a pull request for a use_covr() function (use_coveralls() maybe?).

So I clicked through to the covr GitHub page. The README.md file is really short, with simple instructions to get this up and running. So easy, I had to do it right then.

Basically, I repeated the Travis-CI setup steps at the Coveralls website (https://coveralls.io/repos/new) and then added two lines to my .travis.yml file: one in the existing install: section and one in a new after_success: section.

install:
  - ./travis-tool.sh github_package jimhester/covr


after_success:
  - Rscript -e 'covr::coveralls()'

This instructs Travis-CI to install the latest covr package from GitHub before running the tests, then run the covr::coveralls() function to get the data over to (https://coveralls.io/).

And then you can get a new badge, from Coveralls!
Coverage Status

It’s like collecting stickers!

Now what?

OK, so at the time of writing, the ggRandomForests code coverage was at 75%. Two days ago, it was at 43%, so I’m pretty pleased with this. How did I get this improvement? By writing better testthat scripts.

I thought about screen shots, but this is long enough already. So, click on the Coverage badge, and it will take you to my Coveralls stats page for the R package (It will direct you away from this page by default). What you see is a history of all the builds I’ve done since I started the Coveralls process. The coverage badge will have an icon indicating how the coverage changed, and what the current percentage of code coverage is.

It took me a little bit to figure out that if you click on the commit message, you’ll see the report that can really help. This page lists the code coverage for each file in your R code directory. If you click on a file, you can see which lines your testthat tests actually hit.
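The same per-file and per-line information can also be inspected locally before pushing. A rough sketch, assuming the covr package is installed and the working directory is the package root; package_coverage() and zero_coverage() are covr functions, while the DESCRIPTION check is only a guard added for this example.

```r
library(covr)

# Only attempt this from a package root (a DESCRIPTION file is present).
if (file.exists("DESCRIPTION")) {
  # Run the package's tests against instrumented code and collect coverage.
  cov <- package_coverage(".")

  # Per-file percentages, similar to the Coveralls file listing.
  print(cov)

  # Lines that no test touched: the candidates for new tests.
  print(zero_coverage(cov))
}
```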

The hard part is to figure out what test you’ll need to add to get your Coveralls number to improve. My particular case required sending in some bad objects, removing some code that I no longer needed and just thinking about what the code was supposed to be doing. I spent about a day really working on tightening up my tests, and I gained a significant increase in test coverage.
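As an illustration of “sending in some bad objects”, a test along these lines exercises the error-handling branch of a function, which is often where uncovered lines hide. The function check_forest() is hypothetical and written only for this sketch; it is not part of ggRandomForests.

```r
library(testthat)

# Hypothetical function that validates its input before doing any work.
check_forest <- function(object) {
  if (!inherits(object, "rfsrc")) {
    stop("object must be of class 'rfsrc'")
  }
  length(object)
}

test_that("check_forest rejects objects of the wrong class", {
  # A plain list and NULL are both "bad objects" here; each call
  # should hit the stop() branch and raise the expected message.
  expect_error(check_forest(list(a = 1)), "must be of class")
  expect_error(check_forest(NULL), "must be of class")
})
```

Without a test like this, the stop() line would show up as unhit in the coverage report even though the happy path is fully tested.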

If you look at my pages, you’ll see that I have one file that has 0 coverage. That file is probably taking my coverage down from somewhere close to 90%. I am consciously choosing to not test that particular function for two reasons.

1. The function is time intensive. I think it takes about 20-40 minutes to run on a reasonable machine.
2. The function doesn’t test my package, but could be used for a test of the randomForestSRC package I depend on.

I wrote the function to make my life easier. I distribute it in case users want to create their own cached versions of randomForestSRC objects. If I remove the function, my stat goes up, but for the time being I’m OK with the lower number. At least I know why the number is what it is.

I’ll add one more link to Hadley’s books: To really improve testthat tests, I’m going to be returning to (http://r-pkgs.had.co.nz/tests.html). Just because I’ve gotten this far doesn’t mean I’m really at “best practices.”

Another release day: ggRandomForests V1.1.3

Continuing progress with the vignettes means bug fixes in the code. Plus, I’m presenting the regression random forest vignette to the stats group here tomorrow.

http://cran.r-project.org/web/packages/ggRandomForests/index.html

I’ve got another blog post percolating that will detail the biggest change in this version (improved code testing), so here’s the change summary.

  • Update “ggRandomForests: Visually Exploring a Random Forest for Regression” vignette.
  • Further development of draft package vignette “Survival with Random Forests”.
  • Rename vignettes to align with randomForestSRC package usage.
  • Add more tests and example functions.
  • Refactor gg_ functions into S3 methods to allow future implementation for other random forest packages.
  • Improved help files.
  • Updated DESCRIPTION file to remove redundant parts.
  • Misc Bug Fixes.

As always, comments and suggestions are welcome at the ggRandomForests package GitHub development site: https://github.com/ehrlinger/ggRandomForests