The vignette is a tutorial for using the ggRandomForests package with the randomForestSRC package to build and post-process a regression random forest. In this tutorial, we explore a random forest model for the Boston Housing Data, available in the MASS package. We grow a random forest for regression and demonstrate how ggRandomForests can be used to determine variable associations, interactions and how the response depends on predictive variables within the model.
The tutorial demonstrates the design and usage of many of the ggRandomForests functions and features. We also demonstrate how to modify and customize the resulting ggplot2 graphic objects along the way.
Next up in the package development queue is the completion of the Survival in Random Forests vignette (a preliminary draft version is also available on CRAN). It is a time to event analysis of the primary biliary cirrhosis (PBC) of the liver data from Fleming and Harrington (1991), “Counting Processes and Survival Analysis.”
I may have been aggressive numbering the first CRAN release at v1.0, but there’s no going back now. The design of the feature set is complete even if the code has some catching up to do.
After the v1.1.0 release we found a bug in gg_partial when handling plots for categorical variables. I’ve also moved the eventtable function to gg_survival. gg_survival will now be used for all Kaplan-Meier and Nelson-Aalen estimates. I still need to extend it for curves other than survival, but it’s a start.
I’ve also made more progress on the vignette, which is driving the TODO list pretty hard. I’m knocking things off as I need them, so it goes both ways.
If a picture is worth a thousand words, then how many tables are a single visualization worth? Exploratory data analysis is a great way to see what is and is not in your dataset.
I work in a hospital research group. Most of my colleagues are more comfortable in SAS than in R. One of the first ideas I had to help the workflow here was to create an R script that generated a series of data plots to visualize variables within our analysis dataset. This allowed our statistical programmers to do some data checking for quality, outliers and missing data before handing the data set to our biostatisticians. This moved the error checking step up in the project workflow, hopefully saving some time and making us all more productive.
The problem was that this script required some custom edits for every data set. Making those edits was something the stat programmers wanted to learn, but in the heat of (deadline) battle, the time for learning new things rarely appears. Hence it gets put off until “later”.
I’ve been contemplating how to use Shiny apps for a long time, and this finally made me jump in. It was actually pretty simple to get started: I took a bunch of code I had been using (read: copying and pasting into lots of jobs) and put it in the server.R file. I created a ui.R and BAM! the xportEDA Shiny app was born!
xportEDA: A Shiny app for data visualization
xportEDA is a Shiny app that generates graphics used to explore your data set. It started out for SAS xport files only, but we’ve added csv and some rdata file support. Written in [R](http://cran.r-project.org/), this Shiny app requires the following packages:
foreign (to load SAS xport files)
ggplot2 (nice figures)
RColorBrewer (color palettes are my friend)
The xportEDA app makes it easy to visualize your data quickly and get a jump on your data wrangling, without requiring any programming effort.
You supply the app with a data file. The app reads a data.frame from a SAS xpt, csv or rdata file and generates a set of data visualizations.
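As a rough sketch of that loading step (not the app's actual code), the reader can be chosen from the file extension. The helper name `read_dataset` is hypothetical. It uses `foreign::read.xport` for xpt files, base `read.csv` for csv files, and `load` for rdata files, where it assumes the first data.frame found in the file is the one you want.

```r
# Hypothetical helper: read a data.frame from an xpt, csv or rdata file,
# choosing the reader from the file extension.
read_dataset <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    # Note: read.xport returns a list when the file holds several data sets.
    xpt = foreign::read.xport(path),
    csv = read.csv(path, stringsAsFactors = TRUE),
    rdata = ,  # falls through to the rda branch
    rda = {
      env <- new.env()
      objs <- load(path, envir = env)
      is_df <- vapply(objs, function(nm) is.data.frame(env[[nm]]), logical(1))
      if (!any(is_df)) stop("No data.frame found in ", path)
      # Assumption: use the first data.frame stored in the file.
      env[[objs[is_df][1]]]
    },
    stop("Unsupported file type: ", ext)
  )
}
```

In a Shiny app, the path would typically come from a file upload input. Here it is just a plain file path.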
The app first classifies the variables as continuous, logical or categorical. Any variable with only 2 unique values is interpreted as logical. Our ad-hoc definition of a categorical variable is any factor, plus any variable with more than 2 and fewer than 10 unique values. By default, character variables are converted to factors; however, if a variable has more than 20 levels, we will not show a panel figure for it.
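The classification rules above can be sketched in a few lines of base R. This is a minimal illustration, not the app's code: the function name `classify_variable` and the label `"categorical_skipped"` are made up here, and ignoring missing values when counting unique values is an assumption. The rules do not say what happens to a constant variable, so this sketch does not treat it specially.

```r
# Hypothetical sketch of the ad-hoc variable classification rules.
classify_variable <- function(x) {
  # Assumption: missing values do not count as a unique value.
  n_unique <- length(unique(x[!is.na(x)]))
  if (n_unique == 2) return("logical")
  if (is.character(x) || is.factor(x)) {
    # Character variables are treated like factors; more than 20 levels
    # means no panel figure is drawn for the variable.
    if (n_unique > 20) return("categorical_skipped")
    return("categorical")
  }
  if (n_unique > 2 && n_unique < 10) return("categorical")
  "continuous"
}

# Toy data, invented for illustration only.
toy <- data.frame(
  sex   = c("F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M"),
  stage = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  age   = c(51, 62, 47, 70, 58, 66, 49, 73, 55, 61, 68, 44),
  stringsAsFactors = FALSE
)
sapply(toy, classify_variable)
```

With this toy data, `sex` comes out as logical, `stage` as categorical and `age` as continuous.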
The app creates a faceted set of histograms for all categorical and logical variables, and another set of scatter plots for all continuous variables. Since we often work in time-to-event settings, the app searches the variable names for some of our “standard” time-related variable names to use for the x-axis. Typically, we use a “date of procedure” for this. However, if your data does not have a “time” variable name, we select the first continuous variable for the x-axis. This variable can be changed through the Shiny interface.
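A minimal sketch of that x-axis selection might look like the following. The function name `pick_x_variable` and the `"date|time"` pattern are assumptions for illustration. The “standard” names our group actually searches for are not listed here.

```r
# Hypothetical sketch: choose the default x-axis variable for the scatter plots.
pick_x_variable <- function(data, continuous_vars,
                            time_pattern = "date|time") {
  # Look for a time-related variable name first (pattern is an assumption).
  hits <- grep(time_pattern, names(data), ignore.case = TRUE, value = TRUE)
  if (length(hits) > 0) return(hits[1])
  # Otherwise fall back to the first continuous variable.
  if (length(continuous_vars) > 0) return(continuous_vars[1])
  NA_character_
}

# Toy column names, invented for illustration only.
with_date    <- data.frame(age = 60, proc_date = 2010.5, bmi = 27)
without_date <- data.frame(age = 60, bmi = 27)

pick_x_variable(with_date, c("age", "proc_date", "bmi"))
pick_x_variable(without_date, c("age", "bmi"))
```

The first call picks `proc_date` because its name matches the pattern; the second falls back to `age`.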
A separate page is set up for visualizing individual variables, making it easy to export a single figure for use in reports or other communications. This is useful when your collaborators do not believe you are missing large chunks of data in a variable, or that there are negative values for strictly positive variables, like height.
We also include a data summary page for further data debugging purposes.
This example is from our research data.
The first figure shows all the categorical data in this dataset. Each of the histograms looks pretty uniformly distributed, though there are some missing values shown in the hx_fcad variable.
Continuous variables are also informative. Here we see the uniformly distributed data over the time of interest (about 7 years of follow up). However, there is a series of extreme values in many variables, like bun_pr or BMI. BMI is a variable made from height and weight, so extremely short or extremely large people can make for very strange BMI measures. This is typically a problem with units of measurement. What should we do with these extreme values?
The bottom panel also shows a simple way to check goodness of follow up for time to event data. We expect the triangular shape because subjects who entered the study early should have the longest follow up. The interior of the triangle should be mostly red xs, indicating that an event (death) occurred. Most of the blue circles, which indicate censored (alive) cases, should lie on the hypotenuse of the triangle, meaning they were censored at the end of the study rather than lost to follow up in the interior.
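To see why the triangle appears, here is a small simulation with made-up data (not our research data). It assumes uniform entry over a 7 year study, exponential times to death with an arbitrary rate, and censoring only at the end of the study, so there are no lost to follow up cases in the interior.

```r
# Simulated illustration of the follow-up triangle; all numbers are invented.
set.seed(1)
n <- 200
study_length  <- 7                        # years of study
entry         <- runif(n, 0, study_length) # time of entry into the study
time_to_death <- rexp(n, rate = 0.15)      # hypothetical event times
max_follow_up <- study_length - entry      # censoring at the end of the study

follow_up <- pmin(time_to_death, max_follow_up)
died      <- time_to_death <= max_follow_up

plot(entry, follow_up,
     col = ifelse(died, "red", "blue"),   # red = event, blue = censored
     pch = ifelse(died, 4, 1),            # x = event, circle = censored
     xlab = "Entry time (years since study start)",
     ylab = "Follow up (years)")
abline(a = study_length, b = -1, lty = 2)  # the hypotenuse of the triangle
```

Every censored subject sits exactly on the dashed line, and every event falls on or below it. Blue circles well inside the triangle in real data would point to subjects lost to follow up.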
A single plot for bun_pr makes it pretty clear there are only 3 values that are suspect. Since there are quite a few observations, we may want to make these missing and use imputation, or return these observations for data correction.
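One way to act on this, sketched with toy numbers rather than our data, is to record which rows fall outside a plausible range and set those values to missing before imputation. The function name `flag_extremes` and the bounds used below are hypothetical. Sensible limits depend on the variable and its units.

```r
# Hypothetical sketch: mark implausible values as missing and remember
# which rows they came from so they can be sent back for correction.
flag_extremes <- function(x, lower = -Inf, upper = Inf) {
  suspect <- !is.na(x) & (x < lower | x > upper)
  x[suspect] <- NA
  list(cleaned = x, suspect_rows = which(suspect))
}

bun <- c(14, 22, 18, 900, 31, NA, 17)  # toy values, not the research data
res <- flag_extremes(bun, lower = 0, upper = 200)
res$suspect_rows  # 4
res$cleaned
```

Keeping the row numbers alongside the cleaned vector supports either choice: impute the new missing values, or return those observations for data correction.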
This is not the optimal way to view a data summary, but it may help the user understand where issues with the app are coming from.
I have also posted the app code to a GitHub repository where you can download it, and try it out. Let me know how it goes, report bugs or contribute back. I’d love to make this better, and learn more Shiny tricks along the way.
I could put the standard “use at your own risk” disclaimer here. I will also add:
The app is written with my specific problem domain in mind, though I am open to suggestions on how to improve it.
We tend to use time to event data (working in a hospital after all).
Our group uses SAS predominantly, hence the “xport” functionality and the app naming structure.
xportEDA will have trouble with large p data sets, as I have not figured out how to make Shiny extend the figures indefinitely down the page. I do dynamically set the number of columns in an effort to control how small the panel plots get. But if you get into the range of 75 categorical or continuous variables, the figure may become illegible.
Progress, not perfection! The best way to start is to start.