There is a saying that there are two groups of people: those who already do backups and those who will. So, how is this related to reproducible research and R?
If your work is to analyze data, then you often face the need to restore/recreate/update results that you generated some time ago.
You may think, "I have a knitr report for everything!" That's great! It will save you a lot of trouble. But to have a 100% guarantee of exactly the same results, you need exactly the same environment and the same versions of packages.
Do you know how many R packages have been updated during the last 12 months?
I took a list of the top 20 R packages from here, scraped the dates of their current and older CRAN releases from here, and generated a plot with the dates of submissions to CRAN, sorted by the date of the last submission.
Load this plot directly into R: archivist::aread('pbiecek/archivist/scripts/packDev/039745c40ab717f4459c5144343baca1')
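Below is only a minimal sketch (not the exact script used here) of how the dates of archived releases can be scraped with rvest and dplyr from a package's CRAN Archive page; the date of the current release is listed on the package's CRAN page and would need a similar scrape.

library(rvest)   # read_html(), html_table()
library(dplyr)

# Dates of the archived (older) releases of a package, taken from the
# standard CRAN Archive directory listing for that package.
cran_archive_dates <- function(pkg) {
  url  <- paste0("https://cran.r-project.org/src/contrib/Archive/", pkg, "/")
  tabs <- html_table(read_html(url))
  tabs[[1]] %>%
    filter(grepl("\\.tar\\.gz$", Name)) %>%
    transmute(package = pkg,
              version = sub(".*_(.*)\\.tar\\.gz$", "\\1", Name),
              date    = as.Date(`Last modified`))
}

cran_archive_dates("ggplot2")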
How many of the current versions of the selected packages were on CRAN 12 months ago?
The ECDF of the dates of the current releases.
Load this plot directly into R: archivist::aread('pbiecek/archivist/scripts/packDev/923ec99f79cce099408d4973471dd30d')
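Assuming a data frame with one row per package and the date of its current CRAN release (current_releases below is a hypothetical name), such an ECDF can be drawn with ggplot2's stat_ecdf():

library(ggplot2)

# current_releases: hypothetical data frame with columns 'package' and 'date',
# where 'date' is the date of the current CRAN release of each package
ggplot(current_releases, aes(x = date)) +
  stat_ecdf() +
  labs(x = "date of current CRAN release",
       y = "fraction of packages")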
Around 50% of these packages were updated in the last 12 months. And sometimes these changes have a huge impact, like version 2.0 of ggplot2.
In order to recreate exactly the same results, you either need to keep copies of the important (all?) packages or keep copies of the obtained results.
With the current version of archivist (2.0) you can easily (with just one line) archive all created objects and embed hooks to these objects in your report. It is enough to call the addHooksToPrint() function at the beginning of your knitr script.
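A minimal sketch of such a setup chunk (the repository folder, remote repository and user names below are placeholders):

library(archivist)

# Archive every printed object of the listed classes and embed an aread() hook
# for it in the knitted report; repoDir, repo, user and subdir are placeholders.
addHooksToPrint(class = c("ggplot", "data.frame"),
                repoDir = "arepo",
                repo = "myreport", user = "myuser",
                subdir = "arepo")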
How is it better than the plain save() function? Lots of additional features; for example, you can ask for the session info of a given artifact:
R> archivist::asession("pbiecek/archivist/scripts/packDev/923ec99f79cce099408d4973471dd30d")
$packages
     package * version       date
1  archivist *     2.0 2016-02-12
2 assertthat       0.1 2013-12-06
3     bitops     1.0-6 2013-08-17
4 colorspace     1.2-6 2015-03-11
5        DBI     0.3.1 2014-09-24
6   devtools     1.9.1 2015-09-11
7     digest     0.6.9 2016-01-08
8      dplyr *   0.4.3 2015-09-01
9         DT *     0.1 2015-06-09
For more examples, see this knitr+archivist report, which lets you reproduce or retrieve all the results presented here:
https://rawgit.com/pbiecek/archivist/master/scripts/listOfPackages.html
Another solution is to keep copies of the package versions used by a project, thanks to the packrat package (snapshot(), restore()).
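For reference, the basic packrat workflow is just a few calls (the project path is a placeholder):

packrat::init("~/my-analysis")  # placeholder path; turns the project into a packrat project
packrat::snapshot()             # record the exact package versions in packrat.lock
# later, on another machine or after CRAN has moved on:
packrat::restore()              # reinstall exactly the recorded versions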
Thanks! These are great tools, and in 99% of cases they will be sufficient for reproducibility (the remaining 1% is related to the system and system libraries: some packages work on Windows but not on OS X or vice versa, maybe the input data are very, very large, or you need some specialized hardware, e.g. a GPU). And it is much better to have the R source code, because it is easier to understand how the objects were created.
But they do not allow easy access to objects that have already been calculated and presented in the report, so it's a slightly different use case. Here you can immediately reach for the objects that are presented, you have an API to search the repository for the objects you are looking for, and you can check the versions of the packages that were attached at the moment of archiving.
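For example, here is a rough sketch of querying the repository used in this post; the tag pattern "class:gg" is just an example:

library(archivist)

# list md5 hashes of archived objects matching a tag, e.g. all ggplot objects
searchInRemoteRepo(pattern = "class:gg",
                   user = "pbiecek", repo = "archivist",
                   subdir = "scripts/packDev")

# any returned hash can be read back with aread(), e.g. the ECDF plot above
aread("pbiecek/archivist/scripts/packDev/923ec99f79cce099408d4973471dd30d")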
I’d like to add checkpoint.
packrat is extremely slow if the project is on a network share and at least in my case seems to commit suicide after a few weeks (i.e. I have to rebuild the package cache every single time I open the project).
checkpoint has fewer features but also less stuff that can go wrong.
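For completeness, using checkpoint is a single call at the top of the project's scripts (the date below is a placeholder):

library(checkpoint)
# install from and load packages out of the MRAN snapshot taken on this date
checkpoint("2016-02-15")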
Recently I was reading about doing data science in containers: http://jamiehall.cc/post/how-to-get-started-with-data-science-in-containers . I guess it's more useful when working with Python than with R 😉