Explore the landscape of R packages for automated data exploration

Do you spend a lot of time on data exploration? If yes, then you will like today’s post about AutoEDA written by Mateusz Staniak.

If you ever dreamt of automating the first, laborious part of data analysis when you get to know the variables, print descriptive statistics, draw a lot of histograms and scatter plots – you weren’t the only one. Turns out that a lot of R developers and users thought of the same thing. There are over a dozen R packages for automated Exploratory Data Analysis and the interest in them is growing quickly. Let’s just look at this plot of number of downloads from the official CRAN repository.

Replicate this plot with

stats <- archivist::aread("mstaniak/autoEDA-resources/autoEDA-paper/52ec")
stat <- stats %>%
  filter(date > "2014-01-01" ) %>%
  arrange(date) %>%
  group_by(package) %>%
  mutate(cums = cumsum(count),
         packages = paste0(package, " (",max(cums),")"))
stat$packages <- reorder(stat$packages, stat$cums, function(x)-max(x))
ggplot(stat, aes(date, cums, color = packages)) +
  geom_step() +
  scale_x_date(name = "", breaks = as.Date(c("2014-01-01", "2015-01-01",
                                           "2016-01-01", "2017-01-01",
                                           "2018-01-01", "2019-01-01")),
               labels = c(2014:2019)) +
  scale_y_continuous(name = "", labels = comma) + 
  DALEX::theme_drwhy() +
  theme(legend.position = "right", legend.direction = "vertical") +
  scale_color_discrete(name="") +
  ggtitle("Total number of downloads", "Based on CRAN statistics")

New tools arrive each year with a variety of functionalities: creating summary tables, initial visualization of a dataset, finding invalid values, univariate exploration (descriptive and visual) and searching for bivariate relationships.

We compiled a list of R packages dedicated to automated EDA, where we describe twelve packages: their capabilities, their strong aspects and possible extensions. You can read our review paper on arxiv: https://arxiv.org/abs/1904.02101.

Spoiler alert: currently, automated means simply fast. The packages that we describe can perform typical data analysis tasks, like drawing bar plot for each categorical feature, creating a table of summary statistics, plotting correlations, with a single command. While this speeds up the work significantly, it can be problematic for high-dimensional data and it does not take the advantage of AI tools for actual automatization. There is a lot of potential for intelligent data exploration (or model exploration) tools.

More extensive list of software (including Python libraries and web applications) and papers is available on Mateusz’s GitHub. Researches can follow our autoEDA project on ResearchGate.

iBreakDown: faster, prettier and more precise explanations for predictive models (with interactions)

LIME and SHAP are two very popular methods for instance level explanations of machine learning models (XAI).
They work nicely for images and text inputs, but share similar weakness in case of tabular data: explanations are additive while complex models are (sometimes) not. iBreakDown addresses this problem.

iBreakDown is a a successor of the breakDown package. Yesterday it has arrived on CRAN. Key new features are:

– It identifies and shows feature interactions (if there are local interactions in the model).
– It is much faster. For additive explanations the complexity is O(p) instead of O(p^2).
– The plotD3 function creates an interactive D3-based break-down plot (thanks to r2d3).
– iBreakDown has a new design, created by Hanna Dyrcz. We will have a talk about it ,,Machine learning meets design. Design meets machine learning.” at satRdays. Try the new theme `theme_drwhy()`!.
– It shows explanation level uncertainty – how good are explanations?

A methodology behind this package is described in the iBreakDown: Uncertainty of Model Explanations for Non-additive Predictive Models.

A nice titanic-powered use-case is described in the titanic vignette.

An example of the D3 interactive explainer is here.

Some intuition is introduced in the Visual Exploration, Explanation and Debugging (working version, still in progress).

iBreakDown is a part of the DrWhy.AI family of explainers consistent with the DALEX.

Let us know if you like it. Feel free to create a pull request with new features, add issue with new idea or star the github repository if you like this package.

To już 5 lat! Gość specjalny na najbliższym SERze opowie o wyjaśnialnym ML

W najbliższy czwartek o godzinie 18:00 startujemy z 38. spotkaniem Entuzjastów R.
Aż trudno uwierzyć, że minęło już 5 lat od naszego pierwszego spotkania w ICMie. Przez te 5 lat gościliśmy ponad 70 prelegentów, często osoby, które znaliśmy z ciekawych blogów, pakietów czy książek. Większość prelegentów pracuje w Warszawie, ale też byli ciekawi goście z innych miast, krajów czy nawet spoza Europy. Społeczność sympatyków na meetupie przekroczyła niedawno 2000 osób. W kontynentalnej Europie większa jest tylko grupa użytkowników R w Madrycie (niewiele większa, więc kto wie).

Na najbliższym – jubileuszowym – spotkaniu będziemy gościć profesora Marko Robnik-Šikonja z uniwersytetu w Ljubljanie. Autor kilkudziesięciu znanych prac naukowych z obszaru uczenia maszynowego, autor pakietu ExplainPrediction dla programu R. Na SERze opowie o technikach permutacyjnego wyjaśniania złożonych modeli, w szczególności EXPLAIN, IME, LIME czy SHAP.
Super gorący temat opowiedziany przez światowej sławy specjalistę.
Czegóż chcieć więcej na urodziny?

Ach oczywiście.
Będzie też tort!

Zapraszamy wszystkich sympatyków R, SERów czy wyjaśnialnego uczenia maszynowego. Spotykamy się w sali 107 na wydziale MiNI PW (Koszykowa 75, Warszawa).
Najlepiej zaznaczyć obecność przez stronę na meetupe, ułatwi nam to planowanie wielkości tortu. Do zobaczenia!

DALEX has a new skin! Learn how it was designed at gdansk2019.satRdays

DALEX is an R package for visual explanation, exploration, diagnostic and debugging of predictive ML models (aka XAI – eXplainable Artificial Intelligence). It has a bunch of visual explainers for different aspects of predictive models. Some of them are useful during model development some for fine tuning, model diagnostic or model explanations.

Recently Hanna Dyrcz designed a new beautiful theme for these explainers. It’s implemented in the `DALEX::theme_drwhy()` function.
Find some teaser plots below. A nice Interpretable Machine Learning story for the Titanic data is presented here.

Hanna is a very talented designer. So I’m super happy that at the next satRdays @ gdansk2019 we will have a joint talk ,,Machine Learning meets Design. Design meets Machine Learning”.

New plots are available in the GitHub version of DALEX 0.2.8 (please star if you like it/use it. This helps to attract new developers). Will get to the CRAN soon (I hope).

Instance level explainers, like Break Down or SHAP

Instance level profiles, like Ceteris Paribus or Partial Dependency

Global explainers, like Variable Importance Plots

See you at satRdays!

shapper is on CRAN, it’s an R wrapper over SHAP explainer for black-box models

Written by: Alicja Gosiewska

In applied machine learning, there are opinions that we need to choose between interpretability and accuracy. However in field of the Interpretable Machine Learning, there are more and more new ideas for explaining black-box models. One of the best known method for local explanations is SHapley Additive exPlanations (SHAP).

The SHAP method is used to calculate influences of variables on the particular observation. This method is based on Shapley values, a technique borrowed from the game theory. SHAP was introduced by Scott M. Lundberg and Su-In Lee in A Unified Approach to Interpreting Model Predictions NIPS paper. Originally it was implemented in the Python library shap.

The R package shapper is a port of the Python library shap. In this post we show the functionalities of shapper. The examples are provided on titanic_train data set for classification.

While shapper is a port for Python library shap, there are also pure R implementations of the SHAP method, e.g. iml or shapleyR.


The shapper wraps up the Python library, therefore installation requires a bit more effort than installation of an ordinary R package.

Install the R package shapper

First of all we need to install shapper, this may be the stable release from CRAN


or the developer version form GitHub.


Install the Python library shap

Before you run shapper, make sure that you have installed Python.

Python library shap is required to use shapper. The shap can be installed both by Python or R. To install it through R, you an use function install_shap() from the shapper package.


If you experience any problems related to the installation of Python libraries or evaluation of Python code, see the reticulate documentation. The shapper access Python within reticulate, therefore the solution to the problem is likely to be in there ;-).

Would you survive sinking of the RMS Titanic?

The example usage is presented on the titanic_train dataset from the R package titanic. We will predict the Survived status. The other variables used by the model are: Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.

titanic <- titanic_train[,c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]
titanic$Survived <- factor(titanic$Survived)
titanic$Sex <- factor(titanic$Sex)
titanic$Embarked <- factor(titanic$Embarked)
titanic <- na.omit(titanic)
##   Survived Pclass    Sex Age SibSp Parch    Fare Embarked
## 1        0      3   male  22     1     0  7.2500        S
## 2        1      1 female  38     1     0 71.2833        C
## 3        1      3 female  26     0     0  7.9250        S
## 4        1      1 female  35     1     0 53.1000        S
## 5        0      3   male  35     0     0  8.0500        S
## 7        0      1   male  54     0     0 51.8625        S

Let’s build a model

Let’s see what are our chances assessed by the random forest model.

model_rf <- randomForest(Survived ~ . , data = titanic)
## Call:
##  randomForest(formula = Survived ~ ., data = titanic) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##         OOB estimate of  error rate: 18.63%
## Confusion matrix:
##     0   1 class.error
## 0 384  40  0.09433962
## 1  93 197  0.32068966

Prediction to be explained

Let’s assume that we want to explain the prediction of a particular observation (male, 8 years old, traveling 1-st class embarked at C, without parents and siblings.

new_passanger <- data.frame(
            Pclass = 1,
            Sex = factor("male", levels = c("female", "male")),
            Age = 8,
            SibSp = 0,
            Parch = 0,
            Fare = 72,
            Embarked = factor("C", levels = c("","C","Q","S"))

Model prediction for this observation is .558 for survival.

predict(model_rf, new_passanger, type = "prob")
##       0     1
## 1 0.442 0.558
## attr(,"class")
## [1] "matrix" "votes"

Here shapper starts

To use the function shap() function (alias for individual_variable_effect()) we need four elements

  • a model,
  • a data set,
  • a function that calculated scores (predict function),
  • an instance (or instances) to be explained.

The shap() function can be used directly with these four arguments, but for the simplicity here we are using the DALEX package with preimplemented predict functions.

exp_rf <- explain(model_rf, data = titanic[,-1])

The explainer is an object that wraps up a model and meta-data. Meta data consists of, at least, the data set used to fit model and observations to explain.

And now it’s enough to generate SHAP attributions with explainer for RF model.

ive_rf <- shap(exp_rf, new_observation = new_passanger)
##     Pclass  Sex Age SibSp Parch Fare Embarked _id_ _ylevel_ _yhat_
## 1        1 male   8     0     0   72        C    1           0.558
## 1.1      1 male   8     0     0   72        C    1           0.558
## 1.2      1 male   8     0     0   72        C    1           0.558
## 1.3      1 male   8     0     0   72        C    1           0.558
## 1.4      1 male   8     0     0   72        C    1           0.558
## 1.5      1 male   8     0     0   72        C    1           0.558
##     _yhat_mean_ _vname_ _attribution_ _sign_      _label_
## 1     0.3672941  Pclass   0.070047752      + randomForest
## 1.1   0.3672941     Sex  -0.154519708      - randomForest
## 1.2   0.3672941     Age   0.143046212      + randomForest
## 1.3   0.3672941   SibSp   0.003154522      + randomForest
## 1.4   0.3672941   Parch  -0.018111585      - randomForest
## 1.5   0.3672941    Fare   0.086728705      + randomForest

Plotting results

To generate a plot of Shapley values you can simply pass an object of class importance_variable_effect to a plot() function. Since we are interested in the class Survived = 1 we may add additional parameter that filter only selected classes.


Labels on y-axis show values of variables for this particular observation. Black arrows show predictions of model, in this case, probabilities of each status. Other arrows show effect of each variable on this prediction. Effects may be positive or negative and they sum up to the value of prediction.

On this plot we can see that model predicts that the passenger will survive. Changes are higher due to young age and 1st class, only the Sex = male decreases chances of survival for this observation.

More models

It is useful to contrast prediction of two models. Here we will show how to use shapper for such contrastive explanations.

We will compare randomForest with svm implemented in the e1071.

model_svm <- svm(Survived~. , data = titanic, probability = TRUE)
## Call:
## svm(formula = Survived ~ ., data = titanic, probability = TRUE)
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1 
## Number of Support Vectors:  338

This model predict 32.5% chances of survival.

attr(predict(model_svm, newdata = new_passanger, probability = TRUE), "probabilities")
##           0         1
## 1 0.6748768 0.3251232

Again, we create an explainer that wraps model, data and the predict function.

exp_svm <- explain(model_svm, data = titanic[,-1])
ive_svm <- shap(exp_svm, new_passanger)

Shapley values plot may be modified. To show more than one model you can pass more individual_variable_plot objects.

plot(ive_rf, ive_svm)

To see only attributions use option show_predcited = FALSE.

plot(ive_rf, show_predcited = FALSE)


Documentation and more examples are available at https://modeloriented.github.io/shapper/. The stable version of the package is on CRAN, the development version is on GitHub (https://github.com/ModelOriented/shapper). Shapper is a part of the DALEX universe.

x-mas tRees with gganimate, ggplot, plotly and friends

At the last homework before Christmas I asked my students from DataVisTechniques to create a ,,Christmas style” data visualization in R or Python (based on simulated data).

Libaries like rbokeh, ggiraph, vegalite, shiny+ggplot2 or plotly were popular last year. This year there are also some nice submissions that use gganimate.

Find source codes here. Plots created last year are here.
And here are homeworks from this year.

Trees created with gganimate (and gifski)

Trees created with ggplot2 (and sometimes shiny)

Trees created with plotly

Trees created with python

Trees created with rbokeh

Trees created with vegalite

Trees created with ggiraph

Data, movies and ggplot2

Yet another boring barplot?
I’ve asked my students from MiNI WUT to visualize some data about their favorite movies or series.
Results are pretty awesome.
Believe me or not, but charts in these posters are created with ggplot2 (most of them)!

Star Wars

Fan of StaR WaRs? Find out which color is the most popular for lightsabers!
Yes, these lightsabers are created with ggplot2.
Would you guess which characters are overweighed?
Find the R code and the data on the GitHub.

Harry Pixel

Take fames from Harry Potter movies, use k-means to extract dominant colors for each frame, calculate derivative for color changes and here you are.
The R code and the poster are here.
(steep derivatives in color space is a nice proxy for dynamic scenes).

Social Network for Super Heroes

Have you ever wondered how the distribution of super powers looks like among Avengers?
Check put this poster or dive in the data.

Pardon my French, but…

Scrap transcripts from over 100k movies, find out how many curse words you may find in these movies, plot these statistics.
Here are sources and the poster.
(Bonus Question 1: how curse words are related to Obama/Trump presidency?
Bonus Question 2: is the number of hard curse words increasing or not?)

Rick and Morty

Interested in the demography of characters from Rick and Morty?
Here is the R code and the poster.
(Tricky question: what is happening with season 3?)

Twin Peaks

Transcripts from Twin Peaks are full of references to coffee and donuts.
Except the episode in which the Laura’s murdered is revealed (ups, spoiler alert).
Check out this by yourself with these scripts.

The Lion King

Which Disney’s movie is the most popular?
It wasn’t hard to guess.

Box Office

5D scatterplots?
Here you have.

Next time I will ask my students to visualize data about R packages…
Or maybe you have some other ideas?

Ceteris Paribus v0.3 is on CRAN

Ceteris Paribus package is a part of DALEX family of model explainers. Version 0.3 just gets to CRAN. It’s equipped with new functions for very elastic visual exploration of black box models. Its grammar generalizes Partial Dependency Plots, Individual Conditional Expectations, Wangkardu Plots and gives a lot of flexibility in model comparisons, groups comparisons and so on.

See a 100 sec introduction to the ceterisPackage package on YouTube.

Here you will find a one-pager cheat-sheet with selected use cases.

Here is a longer introduction with notation and some theory.

Here there is a vignette with examples for regression (housing prices).

And here for multiclass classification (HR models).

It’s a work in progress. Feel free to contribute!

No worries! Afterthoughts from UseR 2018

This year the UseR conference took place in Brisbane, Australia. UseR is my favorite conference and this one was mine 11th (counting from Dortmund 2008). 
Every UseR is unique. Every UseR is great. But my feelings are that European UseRs are (on average) more about math, statistics and methodology while US UseRs are more about big data, data science, technology and tools. 

So, how was the one in Australia? Was it more similar to Europe or US?

IMHO – neither of them. 
This one was (for me) about welcoming of new users, being open for diversified community, being open for changes, caring about R culture. Some footmarks of these values were present in most keynotes.

Talking about keynotes. All of them were great, but the ,,Teaching R to New Users” given by Roger Peng was outstanding. I will use the video or the essay as the MUST READ material for students in my R programming classes.

Venue, talks, atmosphere were great as well (thanks to the organizing crew led by Di Cook). Lots of people (including myself) spend time around the hex wall looking for their favorite packages (here you will read more about it ). There was an engaging team exercise during the conference diner (how much your table knows about R). The poster sessions was being handled on TV screens, therefore some posters were interactive (Miles McBain had poster related to R and Virtual Reality, cool). 

Last but not least, there was a great mixture of contributed talks and workshops. Everyone could find something for himself. And even too often it was hard to choose between few tempting options (fortunately, talks are recorded). 
Here I would like to mention three talks I found inspiring.

,,The Minard Paradox” given by Paul Murrel was refreshing. 
One may think nowadays we are so good in data vis, with all these shiny tools and interactive widgets. Yet Paul showed how hard it is to reproduce great works like Minard’s Map even in the cutting edge software (i.e. R). God is in the detail. Watch Paul’s talk here.

,,Data Preprocessing using Recipes” given by Max Kuhn touched an important, jet often neglected truth: Columns in the source data are unnecessary final features. Between ‘read the data’ and ‘fit the model’ there is an important process of feature engineering. This process needs to be reproducible, needs to be based on some well planned grammar. The recipes package helps here. Find the recipes talk here (tutorial is also recorded)

,,Glue strings to data in R” given by James Hester shows a package that is doing only one thing (glue strings) but is doing it extremely well. I have not expected 20 minutes of absorbing talk focused only on gluing strings. Yet, this is my third favourite. Watch it here.

David Smith shared his highlights here. You will find there quite a collection of links.

Videos for recorded talks, keynotes and tutorials are on R consortium youtube.

Local Goodness-of-Fit Plots / Wangkardu Explanations – a new DALEX companion

The next DALEX workshop will take place in 4 days at UseR. In the meantime I am working on a new explainer for a single observation.
Something like a diagnostic plot for a single observation. Something that extends Ceteris Paribus Plots. Something similar to Individual Conditional Expectation (ICE) Plots. An experimental version is implemented in ceterisParibus package.

For a single observation, Ceteris Paribus Plots (What-If plots) show how predictions for a model change along a single variable. But they do not tell if the model is well fitted around this observation.

Here is an idea how to fix this:
(1) Take N points from validation dataset, points that are closest to a selected observation (Gower distance is used by default).
(2) Plot N Ceteris Paribus Plots for these points,
(3) Since we know the true y for these points, then we can plot model residuals in these points.

Here we have an example for a random forest model. The validation dataset has 9000 observations. We use N=18 observations closest to the observation of interest to show the model stability and the local goodness-of-fit.

(click to enlarge)

The empty circle in the middle stands for the observation of interest. We may read its surface component (OX axis, around 85 sq meters), and the model prediction (OY axis, around 3260 EUR).
The thick line stands for Ceteris Paribus Plot for the observation of interest.
Grey points stands for 18 closest observations from the validation dataset while grey lines are their Ceteris Paribus Plots. 
Red and blue lines stand for residuals for these neighbours. Corresponding true values of y are marked with red and blue circles. 

Red and blue intervals are short and symmetric so one may say that the model is well fitted around the observation of interest.
Czytaj dalej Local Goodness-of-Fit Plots / Wangkardu Explanations – a new DALEX companion