archivist: Boost the reproducibility of your research

A few days ago the Journal of Statistical Software published our article (written in collaboration with Marcin Kosiński): archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.

Why should you care? Let’s see.

Starter

[Figure: an example ggplot2 plot stored in the arepo repository]
Would you like to retrieve the ggplot2 object behind this plot?
Just call the following line in your R console.

archivist::aread('pbiecek/Eseje/arepo/65e430c4180e97a704249a56be4a7b88')

Want to check the versions of the packages that were loaded when the plot was created?
Just call

archivist::asession('pbiecek/Eseje/arepo/65e430c4180e97a704249a56be4a7b88')

Wishful Thinking?

When people talk about reproducibility, they usually focus on tools like packrat, MRAN, Docker or RSuite. These are great tools that help manage the execution environment in which analyses are performed. The common belief is that if one is able to replicate the execution environment, then the same R source code will produce the same results.
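As a side note (this is my generic illustration, not part of the article), a minimal snapshot-and-restore workflow with packrat, one of the tools mentioned above, looks like this:

# record the exact package versions a project uses, so the same
# library can be rebuilt later or on another machine
packrat::init(".")      # turn the current directory into a packrat project
packrat::snapshot()     # write the package versions to packrat.lock
packrat::restore()      # reinstall the recorded versions elsewhere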

And it’s true in most cases, maybe even in more than 99% of cases. Except that there are things in the environment that are hard to control or easy to miss: external system libraries, dedicated hardware, user input. No matter what you copy, you never know whether it was enough to recreate exactly the same results in the future. So you can hope that the results will be replicated, but do not bet too high.
And even if some result eventually pops up, how can you check that it is the same result as before?

Literate programming is not enough

There are other great tools like knitr, Sweave, Jupyter and others. Their advantage is that results are rendered as tables or plots in your report. This gives you a chance to verify whether results obtained now and some time ago are identical.
But what about more complicated results, like a random forest with 100k trees built on 100k variables, or some deep neural network? It will be hard to verify by eye that the results are identical.
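One generic workaround (my illustration, not a tool from the article) is to compare fingerprints of objects instead of eyeballing them, e.g. with the digest package:

library(digest)

model <- lm(Sepal.Length ~ ., data = iris)
# an md5 fingerprint of the fitted coefficients; two runs that really
# produced the same model yield the same hash, which is much easier
# to compare than 100k trees
digest(coef(model))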

So, what can I do?

The safest solution would be to store copies of every object ever created during the data analysis. All forks, wrong paths, everything. Along with detailed information about which functions, with which parameters, were used to generate each result. Something like the ultimate Time Machine or GitHub for R objects.

With such detailed information, every analysis would be auditable and replicable.
Right now, full tracking of all created objects is not possible without deep changes in the R interpreter.
The archivist package is a lightweight version of such a solution.

What can you do with archivist?

Use the saveToRepo() function to store selected R objects in an archivist repository (see the sketch below).
Use the addHooksToPrint() function to automatically keep copies of every plot, model or dataset created in a knitr report.
Use the aread() function to share your results with others or with your future self. It’s the easiest way to access objects created by a remote Shiny application.
Use the asearch() function to browse objects that fit specified search criteria, like class, date of creation, variables used, etc.
Use asession() to access session info with detailed information about the versions of packages available when the object was created.
Use ahistory() to trace how a given object was created.
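Here is a minimal sketch tying a few of these together (the repository name is made up; the functions are the ones listed above):

library(archivist)

# create a local repository and make it the default one
createLocalRepo("my_analysis_repo", default = TRUE)

# store a model; saveToRepo() returns the object's md5 hash
model <- lm(Sepal.Length ~ Petal.Length, data = iris)
hash <- saveToRepo(model)

# later: list all lm objects stored in the repository ...
asearch("class:lm")

# ... or inspect the session info recorded along with the object
asession(hash)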

Lots of functions, do you have a cheatsheet?

Yes! It’s here.
If it’s not enough, find more details in the JSS article.

Shiny + archivist = reproducible interactive exploration


Shiny is a great tool for interactive exploration (and not only for that). But, due to its architecture, all generated objects/results live in a separate R process, so you cannot easily access them from your R console.

In some cases you may wish to retrieve a model or a plot that you have just generated. Or maybe you wish to store all R objects (plots, datasets, models) that have ever been generated by your Shiny application. Or to do some further tuning or validation of a selected model or plot. Or to collect and compare all lm() models ever generated by your app. Or to have R code that will recover a given R object in the future.

So, how to do this?
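The full recipe is in the post referenced below; the core idea, sketched minimally here (the repository name and the toy app are made up), is to call saveToRepo() inside the server function, so that every generated object lands in a repository you can also open from your console:

library(shiny)
library(archivist)

# a repository shared between the app and your console
createLocalRepo("shinyRepo", default = TRUE)

ui <- fluidPage(
  sliderInput("degree", "Polynomial degree", min = 1, max = 5, value = 2),
  plotOutput("fit")
)

server <- function(input, output) {
  output$fit <- renderPlot({
    model <- lm(Sepal.Length ~ poly(Petal.Length, input$degree), data = iris)
    saveToRepo(model)  # every fitted model is archived as a side effect
    plot(iris$Petal.Length, iris$Sepal.Length)
    lines(sort(iris$Petal.Length),
          fitted(model)[order(iris$Petal.Length)], col = "red")
  })
}

shinyApp(ui, server)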

Continue reading: Shiny + archivist = reproducible interactive exploration

All your models belong to us: how to combine the archivist package and the trace() function

Let’s see how to collect all linear regression models that you will ever create in R.

It’s easy with the trace() function. A really powerful, yet not that popular, function that allows you to inject any R code at any point in the body of any function.
It is useful for debugging and has other interesting applications.
Below I will show how to use this function to store a copy of every linear model created with lm(). In the same way you may store copies of plots, other models, data frames, anything.

To store a persistent copy of an object one can simply use the save() function. But we are going to use the archivist package instead. It stores objects in a repository and gives you some nice features, like searching within a repository, sharing the repository with other users, checking the session info for a particular object, or restoring packages to versions consistent with a selected object.

To use archivist with the trace() function you just need to call two lines. The first one creates an empty repo, and the second executes saveToRepo() at the end of each call to the lm() function.

library(archivist)
# create an empty repo
createLocalRepo("allModels", default = TRUE)
# add tracing code
trace(lm, exit = quote(saveToRepo(z)))

Now, at the end of every call to lm(), the fitted model will be stored in the repository.
Let’s see this in action.

> lm(Sepal.Length~., data=iris) -> m1
Tracing lm(Sepal.Length ~ ., data = iris) on exit 

> lm(Sepal.Length~ Petal.Length, data=iris) -> m1
Tracing lm(Sepal.Length ~ Petal.Length, data = iris) on exit 

> lm(Sepal.Length~-Species, data=iris) -> m1
Tracing lm(Sepal.Length ~ -Species, data = iris) on exit

All models are stored as rda files in a disk-based repository.
You can load them into R with the asearch() function.
Let’s get all lm objects, apply the AIC function to each of them, and sort by AIC.

> library(magrittr)  # provides the %>% pipe
> asearch("class:lm") %>% 
    sapply(., AIC) %>% 
    sort
4c3ae060f3aaa2509b2faf63d857358e 5c5751e36b31b2251d2767d96993320a 
                        79.11602                        160.04042 
ed2f4d257fd568c5c6f231fadc7aa645 
                       372.07953

The aread() function will retrieve the selected model.

> aread("4c3ae060f3aaa2509b2faf63d857358e")

Call:
lm(formula = Sepal.Length ~ ., data = iris)

Coefficients:
      (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
           2.1713             0.4959             0.8292            -0.3152  
Speciesversicolor   Speciesvirginica  
          -0.7236            -1.0235 

Now you can just create model after model and, if needed, all of them can be restored.
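And if a restored model needs the package versions it was built with, archivist can bring those back too (a sketch; as far as I recall, restoreLibs() reinstalls the versions recorded in the object's session info):

# reinstall the package versions that were loaded when the selected
# model was created
archivist::restoreLibs(md5hash = "4c3ae060f3aaa2509b2faf63d857358e")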

Read more about the archivist here: http://pbiecek.github.io/archivist/.

Why should you back up your R objects?

There is a saying that there are two groups of people: those who already do backups and those who will. So, how is this linked with reproducible research and R?

If your work is to analyze data, then you often face the need to restore/recreate/update results that you generated some time ago.
You may think "I have knitr reports for everything!". That's great! It will save you a lot of trouble. But to have a 100% guarantee of exactly the same results you need to have exactly the same environment and the same versions of packages.

Do you know how many R packages have been updated during the last 12 months?

I took the list of the top 20 R packages from here, scraped the dates of their current and older CRAN releases from here, and generated a plot with the dates of submissions to CRAN, sorted by the date of the last submission.
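As a rough illustration of the scraping step (my sketch, not the code behind the plot; it assumes the layout of the CRAN Archive index pages and uses the rvest package):

library(rvest)

# dates of archived CRAN releases for a single package, read from
# the package's CRAN Archive index page
archive_dates <- function(pkg) {
  url <- paste0("https://cran.r-project.org/src/contrib/Archive/", pkg, "/")
  tab <- html_table(html_node(read_html(url), "table"))
  # the listing mixes in icon and parent-directory rows; keep real dates
  dates <- as.Date(tab[["Last modified"]])
  dates[!is.na(dates)]
}

archive_dates("ggplot2")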

Continue reading: Why should you back up your R objects?