archivist: Boost the reproducibility of your research

A few days ago Journal of Statistical Software has published our article (in collaboration with Marcin Kosiński) archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.

Why should you care? Let’s see.


Would you want to retrieve a ggplot2 object with the plot on the right?
Just call the following line in your R console.


Want to check versions of packages loaded when the plot was created?
Just call


Wishful Thinking?

When people talk about reproducibility, usually they focus on tools like packrat, MRAN, docker or RSuite. These are great tools, that help to manage the execution environment in which analyses are performed. The common belief is that if one is able to replicate the execution environment then the same R source code will produce same results.

And it’s true in most cases, maybe even more than 99% of cases. Except that there are things in the environment that are hard to control or easy to miss. Things like external system libraries or dedicated hardware or user input. No matter what you will copy, you will never know if it was enough to recreate exactly same results in the future. So you can hope that results will be replicated, but do not bet too high.
Even if some result will pop up eventually, how can you check if it’s the same result as previously?

Literate programming is not enough

There are other great tools like knitr, Sweave, Jupiter or others. The advantage of them is that results are rendered as tables or plots in your report. This gives you chance to verify if results obtained now and some time ago are identical.
But what about more complicated results like a random forest with 100k trees created with 100k variables or some deep neural network. It will be hard to verify by eye that results are identical.

So, what can I do?

The safest solution would be to store copies of every object, ever created during the data analysis. All forks, wrong paths, everything. Along with detailed information which functions with what parameters were used to generate each result. Something like the ultimate TimeMachine or GitHub for R objects.

With such detailed information, every analysis would be auditable and replicable.
Right now the full tracking of all created objects is not possible without deep changes in the R interpreter.
The archivist is the light-weight version of such solution.

What can you do with archivist?

Use the saveToRepo() function to store selected R objects in the archivist repository.
Use the addHooksToPrint() function to automatically keep copies of every plot or model or data created with knitr report.
Use the aread() function to share your results with others or with future you. It’s the easiest way to access objects created by a remote shiny application.
Use the asearch() function to browse objects that fit specified search criteria, like class, date of creation, used variables etc.
Use asession() to access session info with detailed information about versions of packages available during the object creation.
Use ahistory() to trace how given object was created.

Lots of function, do you have a cheatsheet?

Yes! It’s here.
If it’s not enough, find more details in the JSS article.

Explain! Explain! Explain!

Predictive modeling is fun. With random forest, xgboost, lightgbm and other elastic models…
Problems start when someone is asking how predictions are calculated.
Well, some black boxes are hard to explain.
And this is why we need good explainers.

In the June Aleksandra Paluszynska defended her master thesis Structure mining and knowledge extraction from random forest. Find the corresponding package randomForestExplainer and its vignette here.

In the September David Foster published a very interesting package xgboostExplainer. Try it to extract useful information from a xgboost model and create waterfall plots that explain variable contributions in predictions. Read more about this package here.

In the October Albert Cheng published lightgbmExplainer. Package with waterfall plots implemented for lightGBM models. Its usage is very similar to the xgboostExplainer package.

Waterfall plots that explain single predictions are great. They are useful also for linear models. So if you are working with lm() or glm() try the brand new breakDown package (hmm, maybe it should be named glmExplainer). It creates graphical explanations for predictions and has such a nice cheatsheet:


Install the package from

Thanks to RStudio for the cheatsheet’s template.

intsvy: PISA for research and PISA for teaching

The Programme for International Student Assessment (PISA) is a worldwide study of 15-year-old school pupils’ scholastic performance in mathematics, science, and reading. Every three years more than 500 000 pupils from 60+ countries are surveyed along with their parents and school representatives. The study yields in more than 1000 variables concerning performance, attitude and context of the pupils that can be cross-analyzed. A lot of data.

OECD prepared manuals and tools for SAS and SPSS that show how to use and analyze this data. What about R? Just a few days ago Journal of Statistical Software published an article ,,intsvy: An R Package for Analyzing International Large-Scale Assessment Data”. It describes the intsvy package and gives instructions on how to download, analyze and visualize data from various international assessments with R. The package was developed by Daniel Caro and me. Daniel prepared various video tutorials on how to use this package; you may find them here:

PISA is intended not only for researchers. It is a great data set also for teachers who may employ it as an infinite source of ideas for projects for students. In this post I am going to describe one such project that I have implemented in my classes in R programming.

I usually plan two or three projects every semester. The objective of my projects is to show what is possible with R. They are not set to verify knowledge nor practice a particular technique for data analysis. This year the first project for R programming class was designed to experience that ,,With R you can create an automated report that summaries various subsets of data in one-page summaries”.
PISA is a great data source for this. Students were asked to write a markdown file that generates a report in the form of one-page summary for every country. To do this well you need to master loops, knitr, dplyr and friends (we are rather focused on tidyverse). Students had a lot of freedom in trying out different things and approaches and finding out what works and how.

This project has finished just a week ago and the results are amazing.
Here you will find a beamer presentation with one-page summary, smart table of contents on every page, and archivist links that allow you to extract each ggplot2 plots and data directly from the report (click to access full report or the R code).


Here you will find one-pagers related to the link between taking extra math and students’ performance for boys and girls separately (click to access full report or the R code).


And here is a presentation with lots of radar plots (click to access full report or the R code).


Find all projects here:

And if you are willing to use PISA data for your students or if you need any help, just let me know.

DIY – cheat sheets

I found recently, that in addition to a great list of cheatsheets designed by RStudio, one can also download a template for new cheatsheets from RStudio Cheat Sheets webpage.
With this template you can design your own cheatsheet, and submit it to the collection of Contributed Cheatsheets (Garrett Grolemund will help to improve the submission if needed).

Working on a new cheatsheet is pretty enjoying. You need to squeeze selected important stuff into a quite small surface like one or two pages.
Lots of fun.
I did it for eurostat and survminer packages (cheatsheets below).
And would love to see one for caret.

How to weigh a dog with a ruler? (looking for translators)

We are working on a series of comic books that introduce statistical thinking and could be used as activity booklets in primary schools. Stories are built around adventures of siblings: Beta (skilled mathematician) and Bit (data hacker).

What is the connection between these comic books and R? All plots are created with ggplot2.

The first story (How to weigh a dog with a ruler?) is translated to English, Polish and Czech. If you would like to help us to translate this story to your native language, just write to me (przemyslaw.biecek at gmail) or create an issue on GitHub. It’s just 8 pages long, translations are available on Creative Commons BY-ND licence.

Click images below to get the comic book:
In English

In Polish

In Czech

The main point of the first story is to find the relation between Height and Weight of different animals and then assess the weight of dinosaur T-Rex based only on the length of its skeleton. A method called Regression by Eye.


PISA 2015 – how to read/process/plot the data with R

Yesterday OECD has published results and data from PISA 2015 study (Programme for International Student Assessment). It’s a very cool study – over 500 000 pupils (15-years old) are examined every 3 years. Raw data is publicly available and one can easily access detailed information about pupil’s academic performance and detailed data from surveys for studetns, parents and school officials (~2 000 variables). Lots of stories to be found.

You can download the dataset in the SPSS format from this webpage. Then use the foreign package to read sav files and intsvy package to calculate aggregates/averages/tables/regression models (for 2015 data you shall use the GitHub version of the package).

Below you will find a short example, how to read the data, calculate weighted averages for genders/countries and plot these results with ggplot2. Here you will find other use cases for the intsvy package.


ggmail + forecast = how many emails I will get tomorrow?

During the eRum 2016, Adam Zagdański gave a very good tutorial about time series modeling. Among other things I’ve learned that the forecast package (created by Rob Hyndman) got cool new plots based on the ggplot2 package.

Let’s use it to play with mailbox statistics for my gmail account!

1. Get the data

Follow this link to download the data from your gmail account as a single mbox file.
It may be large (15GB in my case), but for further steps it’s enough to keep only headers. grep + cat will do the job.

Czytaj dalej ggmail + forecast = how many emails I will get tomorrow?

Program of the european R users meeting [only 7 days to go]

The european R users meeting [eRum] is going to start in just 7 days.

We expect over 250 participants, 10 invited talks, 47 regular talks, 13 lightning talks and 12 posters. In order to handle that much content we scheduled 18 sessions [+ workshops].

Find the program of the conference here or here. In the second sheet you will find a detailed list of talks and sessions.

As you see the conference is full of very interesting stuff. So, get prepared and see you in Poznań!

MinechaRts #1 (Minecraft + R + Edgar Anderson’s Iris Data)

How to use R to draw 3D scatterplots in Minecraft? Let’s see.

Minecraft is a game about placing blocks and going on adventures (source). Blocks are usually placed by players but there are add-ons that allow to add/modify/remove blocks through external API.
And this feature is being used in educational materials that show how to use Minecraft to learn Python (or how to use Python to modify Minecraft worlds, see this book for example). You need to master loops to build a pyramid or a cube. And you need to touch some math to build an igloo or a fractal. Find a lot of cool examples by googling ‚minecraft python programming’.

So, Python+Minecraft are great, but how to do similar things in R?
You need to do just three things:

  1. Install the Spigot Minecraft Server along with all required dependencies. The detailed instruction how to do this is here.
  2. Create a socket connection to the Minecraft Server port 4711. In R it’s just a single line

  3. Send building instructions through this connection. For example

    will create a cube 11x11x11 made of TNT blocks (id=46 is for TNT, see the full list here) placed between coordinates (0,70,0) (10,80,10). You can add and remove blocks, move players, spawn entities and so on. See a short overview of the server API.

The R code below creates a connection to the minecraft server, builds a flat grassland around the spawning point and plots 3d scatterplot with 150 blocks (surprise surprise, blocks coordinates correspond to Sepal.Length, Sepal.Width, Petal.Length variables from the iris dataset).

If you do not like scatterplots try barcharts 😉

A link that can tell more than dozens of lines of R code – what’s new in archivist?

Can you spot the difference between this plot:

And this one:

You are right! The latter has an embedded piece of R code.
What for?

It’s a call to a function aread from archivist – a package that manages external copies of R objects. This piece of code was added by the function addHooksToPrint(), that enriches knitr reports in links to all objects of a given class, e.g. ggplot.

You can copy this hook to your R session and you will automagically recreate this plot in your local session.

But it’s not all.
Actually here the story is just beginning.

Don’t you think, that this plot is badly annotated? It is not clear what is being presented. Something about terrorism, but for which year, are these results for all countries or there is some filtering? What is on axes? Why the author skip all these important information? Why he does not include the full R code that explains how this plot was created?

Actually, having this single link you can get answers for all these questions.

First, let’s download the plot and extract the data out of it.

This data object is also in the repository so I can download it with the aread function.

But here is the coolest part.
Having an object one can (in some cases) examine the history of this objects, i.e. check how it was created. Here is how to do this:

Now you can see what operations have been used to create data used in this plot. It’s clear how the aggregation has been done, what is the filtering condition and so on.
Also you have hashes to all objects created along the way, co you can download the partial results. This history is being recorded with an operator %a% that is working in a similar fashion to %>%.

We have the plot, now we know what is being presented, let’s change some annotations.

The ahistory() function for remote repositories was introduced to archivist in version 2.1 (on CRAN since yesterday). Other new feature is the support for repositories in shiny applications. Now you can enrich your app in links to copies of R objects generated by shiny.
You can find more information about these and other features in the useR2016 presentation about archivist (html, video).
Or look for Marcin Kosiński talk during the european R users meeting in Poznań.

The data presented in here is just a small fraction of data from National Consortium for the Study of Terrorism and Responses to Terrorism (START) (2016) retrieved from