## DALEX Stories – Warsaw apartments

This Monday we had a machine learning oriented meeting of Warsaw R Users. Prof. Bernd Bischl from LMU gave an excellent overview of the mlr package (machine learning in R), then I introduced DALEX (Descriptive mAchine Learning EXplanations) and Mateusz Staniak introduced the live and breakDown packages.

The meeting pushed me to write down a gentle introduction to the DALEX package. You can find it here: https://pbiecek.github.io/DALEX_docs/.
The introduction is written around a story based on Warsaw apartments data.

The story goes like this:
We have two models fitted to the apartments data: a linear model and a randomForest model. It happens that both models have exactly the same root mean square of errors calculated on a validation dataset.
So an interesting question arises: which model should we choose?

The analysis of variable importance for both models reveals that the variable construction.year is important for the randomForest model but is completely neglected by the linear model.
New observation: something is going on with construction.year.

The analysis of model responses reveals that the relation between construction.year and the price per square meter is nonlinear.
At this point it looks like the random forest model is the better one, since it captures a relation that the linear model does not see.

But (there is always a but) when you audit residuals from the random forest model, it turns out that it heavily underpredicts prices of expensive apartments.
This is a very bad property for a pricing model. This may result in missed opportunities for larger profits.
New observation: do not use this rf model for predictions.

So, what to do?
DALEX shows that, despite having equal root mean squares of residuals, the two models are very different and capture different parts of the signal.
As we increase our understanding of the signal we are able to design a better model. And here we go.
This new linear model has a much lower root mean square of residuals, as it is built on the strengths of both initial models.

All plots were created with DALEX explainers. Find R code and more details here.

## DALEX: which variables are really important? Ask your black box model!

Third post from the short series about black-box explainers implemented in the DALEX package. Learn more about DALEX at SER (Warsaw, April 2018), eRum (Budapest, May 2018), WhyR (Wroclaw, June 2018) or UseR (Brisbane, July 2018).

Two weeks ago I wrote about single variable conditional responses and last week I wrote about decompositions of a single prediction.

Sometimes we would like to know the general structure of a model, or at least know which variables are the most influential. There are many different approaches to this problem proposed in the literature. A nice, simple, and model agnostic approach is described in this article (Fisher, Rudin, and Dominici 2018). To see how important variable X is, let’s permute its values and measure the drop in model accuracy.
This procedure is implemented in the DALEX package in the variable_dropout() function. There are some tweaks (for large datasets you do not need to permute all rows while for small datasets you could consider some oversampling) but the idea is the same.
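The permutation idea can be illustrated with a minimal, language-agnostic sketch. The Python code below is not the DALEX implementation of variable_dropout(); the function names, the toy model and the choice of root mean square error as the loss are all assumptions made for illustration. It records the loss of the model on the original data and the average loss after shuffling each column in turn.

```python
import random


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


def permutation_importance(predict, rows, y, n_repeats=10, seed=0):
    """Return the baseline loss and, per column, the mean loss after permuting it.

    `predict` is any function mapping one row (a list of numbers) to a prediction,
    so the procedure does not depend on the model's internal structure.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    losses = {}
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns untouched.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            total += rmse(y, [predict(r) for r in permuted])
        losses[j] = total / n_repeats
    return baseline, losses


# Hypothetical toy model: uses the first variable and ignores the second.
def toy_model(row):
    return 3.0 * row[0] + 0.0 * row[1]


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * r[0] for r in rows]

baseline, losses = permutation_importance(toy_model, rows, y)
print("baseline loss:", baseline)
print("loss after permuting each column:", losses)
```

Reporting the baseline loss next to the permuted losses, rather than only their difference, keeps the initial performance of each model visible when several models are compared.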

In the figure below you will find variable importances for three models created with the HR dataset. It is easy to spot that the randomForest model performs best and that satisfaction_level is the most important variable in all three models.

There are two things that I like in this explainer.

1) Variable effects for a single model are interesting, but the ability to compare effects for many models is even more interesting. In DALEX you can simply contrast/compare explainers across different models.

2) There is no reason to start variable importance plots at 0, since the initial model performance differs between models. It is much more informative to present both the initial model performance and the drop in performance resulting from the dropout of a variable.

If you want to learn more about the DALEX package and variable importance, consult the following vignette or the DALEX website.

## DALEX: how would you explain this prediction?

Last week I wrote about single variable explainers implemented in the DALEX package. They are useful for plotting the relation between a model output and a single variable.

But sometimes we are more focused on a single model prediction. If our model predicts a possible drug response for a patient, we really need to know which factors drive the model prediction for that particular patient. For linear models it is relatively easy, as the structure of the model is additive. In 2017 we developed the breakDown package for lm/glm models.
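To see why additivity makes this easy, here is a small Python sketch. It shows the general idea only and is not the breakDown implementation; the function name, coefficients and data are made up for illustration. For a linear model, a prediction splits into the prediction at the average observation plus one term per variable, equal to the coefficient times the deviation of that variable from its mean.

```python
def linear_contributions(intercept, coefs, data, x):
    """Split a linear prediction for x into a baseline plus one term per variable.

    baseline        = prediction at the column means of `data`
    contribution j  = coefs[j] * (x[j] - mean of column j)
    """
    n = len(data)
    means = [sum(row[j] for row in data) / n for j in range(len(coefs))]
    baseline = intercept + sum(b * m for b, m in zip(coefs, means))
    contributions = [b * (xj - m) for b, xj, m in zip(coefs, x, means)]
    return baseline, contributions


# Hypothetical model and data, for illustration only.
intercept, coefs = 1.0, [2.0, -0.5]
data = [[1.0, 4.0], [3.0, 8.0], [5.0, 0.0]]
x = [5.0, 0.0]

baseline, contributions = linear_contributions(intercept, coefs, data, x)
prediction = intercept + sum(b * v for b, v in zip(coefs, x))

print("baseline:", baseline)
print("contributions:", contributions)
print("prediction:", prediction)
```

The baseline and the contributions add up exactly to the prediction, which is what makes the decomposition easy to read for additive models.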

But how to explain/decompose/approximate predictions for any black box model?
There are several approaches. Probably the best known is LIME, with great examples for image and text data. There is an R port, lime, developed by Thomas Pedersen. In collaboration with Mateusz Staniak we developed the live package, which takes a similar approach and is easy to use with models created by the mlr package.
Another technique that can be used here is Shapley values, which use game theory to attribute the effects of different variables to a single prediction.
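As a rough illustration of the Shapley idea, the Python sketch below computes exact Shapley values for a tiny model by averaging each variable's marginal effect over all orderings of the variables. This is a sketch under stated assumptions, not the method of any particular package: "absent" variables are set to a fixed reference observation (one of several possible conventions), and the function name, toy model and inputs are hypothetical. Enumerating all orderings is only feasible for a handful of variables.

```python
from itertools import permutations


def shapley_values(predict, x, reference):
    """Exact Shapley values for observation x relative to a reference observation.

    Variables are switched from their reference value to their value in x one at
    a time; each variable is credited with the change in prediction it causes,
    averaged over every possible ordering.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(reference)
        previous = predict(current)
        for j in order:
            current[j] = x[j]
            value = predict(current)
            phi[j] += value - previous
            previous = value
    return [p / len(orderings) for p in phi]


# Hypothetical toy model with an interaction between the 2nd and 3rd variables.
def toy_model(row):
    return 2.0 * row[0] + row[1] * row[2]


x = [1.0, 2.0, 3.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, reference)
print("Shapley values:", phi)
```

In this toy example the interaction effect is shared equally between the two interacting variables, and the values sum to the difference between the prediction for x and the prediction for the reference.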

Recently we developed yet another approach (paper under preparation, implemented in breakDown version 0.4) that works in a model agnostic way (here you can check how to use it for caret models). You can play with it via the single_prediction() function in the DALEX package.
Such a decomposition is useful for many reasons mentioned in the papers listed above (deeper understanding, validation, trust, etc.).
And, what is really extra about the DALEX package, you can compare contributions of different models on the same scale.

Let’s train three models (a glm, a gradient boosting model and a random forest model) to predict the quality of wine. These models are very different in structure. In the figure below, for a single wine, we compare predictions calculated by these models. For this single wine, the most influential variable for all models is the alcohol concentration, as the wine has a much higher concentration than average. Then pH and sulphates take second and third positions in all three models. It looks like the models have some agreement even if their structure is very different.

Find our DALEX workshops at SER (Warsaw, April 2018), eRum (Budapest, May 2018), WhyR (Wroclaw, June 2018) or UseR (Brisbane, July 2018).

## DALEX: understand a black box model – conditional responses for a single variable

Black-box models, like random forests or gradient boosting models, are commonly used in predictive modelling due to their flexibility and high accuracy. The problem is that it is hard to understand how a single variable affects model predictions.

As a remedy one can use excellent tools like pdp package (Brandon Greenwell, pdp: An R Package for Constructing Partial Dependence Plots, The R Journal 9(2017)) or ALEPlot package (Apley, Dan. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models (2016)).
Alternatively, one can now use the DALEX package to not only plot a conditional model response but also superimpose responses from different models to better understand the differences between them.
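One common form of conditional response is the partial dependence curve: for each value on a grid, set the variable of interest to that value in every row and average the model predictions. The Python sketch below illustrates this idea and superimposes two models on the same grid. It is not the DALEX or pdp code; the function name, both toy models and the data are assumptions made for illustration.

```python
def partial_dependence(predict, rows, j, grid):
    """Average prediction when column j is set to each grid value in every row."""
    curve = []
    for value in grid:
        preds = [predict(r[:j] + [value] + r[j + 1:]) for r in rows]
        curve.append(sum(preds) / len(preds))
    return curve


# Two hypothetical models of the same response, for illustration only.
def linear_model(row):
    return 1.0 + 2.0 * row[0] + row[1]


def nonlinear_model(row):
    return (row[0] - 1.0) ** 2 + row[1]


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]

pd_linear = partial_dependence(linear_model, rows, 0, grid)
pd_nonlinear = partial_dependence(nonlinear_model, rows, 0, grid)

# Print both curves side by side so the models can be compared point by point.
for value, a, b in zip(grid, pd_linear, pd_nonlinear):
    print(value, a, b)
```

Because both curves are computed on the same grid and the same data, they can be drawn on one plot, which makes it easy to see where the two models respond differently to the variable.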

Consult the following vignette to learn more about the DALEX package and explainers for a single variable.
