Why should you back up your R objects?

There is a saying that there are two groups of people: those who already do backups and those who will. So how is this linked with reproducible research and R?

If your work is to analyze data, then you often face the need to restore/recreate/update results that you generated some time ago.
You may think "I have knitr reports for everything!". That's great! It will save you a lot of trouble. But to have a 100% guarantee of exactly the same results, you need exactly the same environment and the same versions of packages.

Do you know how many R packages have been updated during the last 12 months?

I took the list of the top 20 R packages from here, scraped the dates of their current and older CRAN releases from here, and generated a plot with the dates of submissions to CRAN, sorted by the date of the last submission.
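A minimal sketch of the plotting step described above. The release dates below are synthetic stand-ins and the package names are examples, not the actual top-20 list; in the post the dates were scraped from the CRAN archive pages.

```r
library(ggplot2)

# synthetic stand-in for the scraped CRAN submission dates
releases <- data.frame(
  package = rep(c("pkgA", "pkgB", "pkgC"), times = c(4, 2, 3)),
  date    = as.Date(c("2015-01-10", "2015-04-02", "2015-07-15", "2015-11-20",
                      "2014-05-01", "2015-02-11",
                      "2015-03-03", "2015-06-18", "2015-09-30"))
)

# sort packages by the date of their last CRAN submission
releases$package <- reorder(releases$package, releases$date, FUN = max)

ggplot(releases, aes(x = date, y = package)) +
  geom_point() +
  labs(x = "CRAN submission date", y = "")
```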

Continue reading Why should you back up your R objects?

geom_christmas_tree(): a new geom for ggplot2 v2.0

Version 2.0 of the ggplot2 package (on GitHub) has a very nice mechanism for adding new geoms and stats (more about it here).
Christmas is coming, so maybe you would like to make your plots more tree-ish?
Below you will find the definition of the geom_christmas_tree() geom. It supports the following aesthetics: size (number of segments), fill, color, x and y.

With mpg data you can plot a colourful forest.
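A simplified stand-in for the idea, assuming only stock ggplot2: each observation in mpg is drawn as a small green triangle with geom_polygon(). The real geom_christmas_tree() in the post uses ggplot2 2.0's ggproto extension mechanism; this sketch only mimics the visual effect, and tree_coords() is a hypothetical helper.

```r
library(ggplot2)

# build the three corners of a triangular "tree" around a point (x, y)
tree_coords <- function(x, y, id, w = 0.4, h = 1.5) {
  data.frame(
    x  = c(x - w, x + w, x),   # left base, right base, top
    y  = c(y, y, y + h),
    id = id
  )
}

trees <- do.call(rbind, Map(tree_coords,
                            mpg$displ, mpg$hwy, seq_len(nrow(mpg))))

ggplot(trees, aes(x, y, group = id)) +
  geom_polygon(fill = "darkgreen", colour = "darkgreen")
```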


Continue reading geom_christmas_tree(): a new geom for ggplot2 v2.0

Hack the Proton. A data-crunching game from the Beta and Bit series

I’ve prepared a short console-based, data-driven R game named "The Proton Game". The player’s goal is to infiltrate Slawomir Pietraszko’s account on a Proton server. To do this, you have to solve four data-based puzzles.

The game can be played by beginners as well as heavy users of R. A survey completed by players of the beta version shows that the game gives around 15 minutes of fun to people experienced in R and up to around 60 minutes to people who are just starting to program in R. More details on the beta-version results are presented in the plot at the bottom.


Continue reading Hack the Proton. A data-crunching game from the Beta and Bit series

R in Insurance – the November meetup of the Warsaw R User Group

Inspired by the "R in Insurance" conference held in Amsterdam, we would like to dedicate the November meetup of the Warsaw R Users Group to insurance. The presentations will cover practical aspects of insurance and, more specifically, applications of R in insurance.
Join us on Thursday, November 26, 2015, 6:00 PM, Koszykowa 75, Warsaw, Room 329 MINI PW. The meetup will be in English.

18.00-18.05 Welcome
18.05-18.40 "Experience vs. Data", Markus Gesmann (Lloyd’s, London)
18.40-19.00 Pizza break
19.00-19.35 "Non-life insurance in R", Emilia Kalarus (Triple A – Risk Finance)
19.35-20.10 "Stochastic mortality modelling", Adam Wróbel (NN)
20.15 – Afterparty

This time our agenda is quite tight, since we have three very interesting presentations. We invite R programmers and data analysts as well as actuaries and risk professionals.

Czytaj dalej R in Insurance – the November meetup of the Warsaw R User Group

Warsaw R-Users Group Meeting #12


After the summer holidays we are back with two talks:
6:00-6:30 pm – Adolfo Álvarez, PhD,
"5 lessons I have learned at Analyx".
7:00-7:30 pm – Piotr Migdał, PhD,
"Jupyter – the environment for learning and doing data analysis".

See you tomorrow (22/10/2015) at 6 pm, Department of Mathematics, Warsaw University of Technology, Koszykowa 75 room 329.
You will find more details here (meetup) and more materials here (github).

Warsaw Meetings of R Users / Warszawskie Spotkania Entuzjastów R


With the summer holiday season coming to an end, we are back with Warsaw Meetings of R Users (Warszawskie Spotkania Entuzjastów R).

Three meetings ahead:

  • September 26th (this Saturday) – let’s start with a data-hack-day (DHD). Using data from the Polish Sejm (votes and transcripts), we are going to prepare some nice summaries of the last term. With elections ahead, it is a good time for such statistics. MaszPrawoWiedzieć will support us in this effort. Be prepared for a lot of data cleaning and nice data exploration.
  • October 22nd (Thursday) – we will be talking about R and education. Two excellent speakers on the roster: Adolfo Álvarez (Advanced Customer Analyst at Analyx) and dr hab. Michał Ramsza (SGH).
  • November 26th (Thursday) – the topic for this meeting is "R in insurance". One of our special guests: Markus Gesmann (Lloyd’s, London). More to come.

You will find more information on our meetup page: http://www.meetup.com/Spotkania-Entuzjastow-R-Warsaw-R-Users-Group-Meetup/.

Thanks go to our partners and sponsors: Revolution Analytics/Microsoft, MINI PW, WLOG Solutions and SmarterPoland.

Incredible Adventures of Beta and Bit

I am working on a project that introduces data-driven reasoning (and of course R) to secondary schools through the fictional adventures of two teenagers, Beta and Bit.

Beta is a level-headed girl who has a passion for maths, logic and the art of deduction.
Bit is a hot-headed hacker and self-educated robotics engineer.

The first story from the series, called Pietraszko’s Cave, is available at this website (in English, Polish and Russian).

In each story in the series, strange adventures introduce Beta and Bit to concepts like randomness, probability distributions, correlation, linear regression, hypothesis testing, and some tools used by data analysts (so-called data scientists nowadays).

Continue reading Incredible Adventures of Beta and Bit

Interview with SERATRON – Lego EV3 robot driven by R


The next meeting of the Warsaw R Enthusiasts (SER = Spotkania Entuzjastów R) will take place on December 8. We are going to start with Roger Bivand’s talk about spatial statistics (R Foundation / NHH; author of many R packages). The second talk, by Bartosz Meglicki (IPI PAN), will introduce the SERATRON – a fusion of R and Lego Mindstorms.

Below we publish an exclusive SER interview with SERATRON (we are sorry, but SERATRON is a naughty one):

SER: Hi, could you introduce yourself?
SERATRON: I am a robot coded in pure R with ev3dev.R bindings. The bindings run on the ev3dev Linux distribution, and the operating system runs on Lego EV3 hardware. We are all under heavy development.

Continue reading Interview with SERATRON – Lego EV3 robot driven by R

Variability of weather forecasts


Have you ever wondered how stable weather forecasts are?

curiosity + R = fun,

let’s do a little test.

I’ve used the function getWeatherForecast {SmarterPoland} (github release) to download weather forecasts from the Dark Sky API. Hourly forecasts are downloaded every 10 minutes and stored in this repository based on the archivist package (it’s easier to manage larger collections of objects that way). Having a bundle of forecasts, I’ve plotted them with the ggplot2 package and created an animation with the animation package.
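A sketch of one collection step, assuming the GitHub version of SmarterPoland with getWeatherForecast() and a local archivist repository. The argument to getWeatherForecast() and the repository path are assumptions, not the actual call from the post.

```r
library(SmarterPoland)
library(archivist)

# local archivist repository for the collected forecasts
repoDir <- file.path(tempdir(), "forecastRepo")
dir.create(repoDir)
createEmptyRepo(repoDir)

# run this part every 10 minutes (e.g. from cron):
# fetch the current hourly forecast and archive it
forecast <- getWeatherForecast("Warszawa")   # argument is an assumption
saveToRepo(forecast, repoDir = repoDir)
```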

And here it is (if you do not see the movie above, click here): a short movie with the evolution of the temperature forecast for All Saints’ Day in Warsaw.
Red curve – current forecast; grey curves – old forecasts; blue curve – history.

It’s funny that colder forecasts for the first evening go with warmer forecasts for the next morning.
And changes in weather forecasts are not as smooth as I expected.

Lazy load with archivist

Version 1.1 of the archivist package reached CRAN a few days ago.
This package supports operations on a disk-based repository of R objects. It makes storing, restoring and searching for R objects easy (searching uses the objects’ meta-information). Want to share your objects with article reviewers or collaborators? This package should help.
We’ve created some vignettes to present the core use cases. Here is one of them.

Lazy load with archivist

by Marcin Kosiński

Is a too-big .Rdata file causing problems? Interested in only a few objects from a huge .Rdata file? Does a regular load() into the global environment take too long, or crash your R session? Want to load or copy an object with an unknown name? Has maintaining an environment with thousands of objects become perplexing and troublesome?

If any of the above applies to you, this use case is a must-read.

The archivist package is a great solution that helps administer, archive and restore artifacts created in R.

Combining archivist and lazy load may be miraculous

If your .RData file is too big and you do not need or do not want to load the whole of it, you can simply convert the .RData file into a lazy-load database which serializes each entry separately and creates an index. The nice thing is that the loading will be on-demand.

Loading the database then only loads the index but not the contents. The contents are loaded as they are used.
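The conversion described above can be sketched with base R alone. "Huge.RData" stands in for your own big file, and the objects saved here are just examples; note that tools:::makeLazyLoadDB is an unexported base-R helper, so this relies on R internals.

```r
# create a stand-in "Huge.RData" with two example objects
e <- new.env()
e$big_iris <- iris
e$big_cars <- cars
fb <- file.path(tempdir(), "Huge")
save(list = ls(e), envir = e, file = paste0(fb, ".RData"))

# convert to a lazy-load database: Huge.rdb (contents) + Huge.rdx (index)
db_env <- new.env()
load(paste0(fb, ".RData"), envir = db_env)
tools:::makeLazyLoadDB(db_env, fb)

# loading the database only reads the index; each object is
# deserialized on first use
lazy_env <- new.env()
lazyLoad(fb, envir = lazy_env)
head(lazy_env$big_iris)   # big_iris is fetched from disk only now
```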

Now you can create your own local archivist-like Repository, which will make maintaining artifacts as easy as possible.

Then objects from the Huge.RData file may be archived into the Repository created in the DIRectory directory. A tags attribute (see Tags) specifying the realName is added to every artifact before the saveToRepo() call, in order to remember its name in the Repository.
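A sketch of this archiving step, assuming the archivist 1.1 API (createEmptyRepo(), saveToRepo()). The example objects stand in for the contents of Huge.RData, and the exact "realName: " tag format follows the vignette's description but should be treated as an assumption.

```r
library(archivist)

DIRectory <- file.path(tempdir(), "allRepo")
dir.create(DIRectory)
createEmptyRepo(DIRectory)

# stand-ins for the objects loaded from Huge.RData
env <- new.env()
env$irisSubset <- iris[1:6, ]
env$carsModel  <- lm(dist ~ speed, data = cars)

for (name in ls(env)) {
  artifact <- get(name, envir = env)
  # remember the object's original name as a realName tag
  attr(artifact, "tags") <- paste0("realName: ", name)
  saveToRepo(artifact, repoDir = DIRectory)
}
```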

Now, if you are interested in a specific artifact but the only things you remember about it are that its class was data.frame and its name started with the letters ir, then it is possible to search for that artifact using the searchInLocalRepo() function.

As a result we get the md5hashes of artifacts whose class was data.frame (hashes1) and the md5hashes of artifacts whose names (stored in the tag named realName) start with ir. Now we can intersect those two character vectors.
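A self-contained search sketch (archivist 1.1 API assumed; the setup lines stand in for the Repository built earlier). The "class:..." tag format and the fixed= argument follow the vignette's description, but their exact semantics may differ between archivist versions.

```r
library(archivist)

# minimal stand-in Repository with one tagged data.frame
DIRectory <- file.path(tempdir(), "searchRepo")
dir.create(DIRectory)
createEmptyRepo(DIRectory)
df <- iris
attr(df, "tags") <- "realName: iris"
hash_iris <- saveToRepo(df, repoDir = DIRectory)

# md5hashes of artifacts of class data.frame
hashes1 <- searchInLocalRepo(pattern = "class:data.frame",
                             repoDir = DIRectory)
# md5hashes of artifacts whose realName tag starts with "ir"
# (partial match via fixed = FALSE is an assumption)
hashes2 <- searchInLocalRepo(pattern = "realName: ir",
                             repoDir = DIRectory, fixed = FALSE)

intersect(hashes1, hashes2)
```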

After we get the one md5hash corresponding to the artifact we are interested in, we can load it using the loadFromLocalRepo() function.

One can always check the realName tag of that artifact with the getTagsLocal() function.
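Loading and tag inspection can be sketched as follows (archivist 1.1 API assumed; the repo setup lines stand in for the Repository built earlier, and the tag= argument of getTagsLocal() is an assumption).

```r
library(archivist)

# minimal stand-in Repository with one tagged data.frame
DIRectory <- file.path(tempdir(), "loadRepo")
dir.create(DIRectory)
createEmptyRepo(DIRectory)
df <- iris
attr(df, "tags") <- "realName: iris"
hash <- saveToRepo(df, repoDir = DIRectory)

# load the artifact into the current session under its archived name
loadFromLocalRepo(hash, repoDir = DIRectory)

# or get it as a value and bind it to any name you like
irisCopy <- loadFromLocalRepo(hash, repoDir = DIRectory, value = TRUE)

# check the realName tag of that artifact
getTagsLocal(hash, repoDir = DIRectory, tag = "realName")
```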

If only specific artifacts from the previously created Repository in the DIRectory directory will be needed in the future, they may be copied to a new Repository in a new directory. A new, smaller Repository will use less memory and may be easier to send to contributors when working in groups. Here is an example of copying artifacts only from selected classes. Because the DIRectory2 directory did not exist, the parameter force=TRUE was needed to force the creation of an empty Repository. The vector hashesR contains md5hashes of artifacts that are related to other artifacts, which means they are datasets used to compute other artifacts. The special parameter fixed = TRUE specifies a search in tags that start with the letters relation.
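A copy sketch under the same assumptions (archivist 1.1 API; copyLocalRepo() and the force= argument follow the vignette's description, and the stand-in artifacts replace the vignette's coxph models).

```r
library(archivist)

# stand-in source Repository with a data.frame and a model
DIRectory  <- file.path(tempdir(), "bigRepo")
DIRectory2 <- file.path(tempdir(), "smallRepo")
dir.create(DIRectory)
createEmptyRepo(DIRectory)
saveToRepo(iris, repoDir = DIRectory)
saveToRepo(lm(dist ~ speed, data = cars), repoDir = DIRectory)

# DIRectory2 does not exist yet, hence force = TRUE
createEmptyRepo(DIRectory2, force = TRUE)

# copy only the artifacts of a selected class
hashes <- searchInLocalRepo("class:data.frame", repoDir = DIRectory)
copyLocalRepo(repoFrom = DIRectory, repoTo = DIRectory2,
              md5hashes = hashes)
```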

You can even tar your Repository with the tarLocalRepo() function and send it to anybody you want.
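For completeness, the tar step (archivist 1.1 API assumed; the setup lines stand in for the Repository built earlier, and the location of the resulting tar file is not asserted here).

```r
library(archivist)

# minimal stand-in Repository
DIRectory <- file.path(tempdir(), "tarRepo")
dir.create(DIRectory)
createEmptyRepo(DIRectory)
saveToRepo(iris, repoDir = DIRectory)

# pack the whole Repository into a tar archive to share it
tarLocalRepo(repoDir = DIRectory)
```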

You can check the summary of the Repository using the summaryLocalRepo() function. As you can see, some of the coxph artifacts have an additional class. There are also 8 datasets. Those are artifacts needed to compute other artifacts, archived additionally in the saveToRepo() call with the default parameter archiveData = TRUE.

When the Repository is no longer necessary, we may simply delete it with the deleteRepo() function.
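The summary and cleanup steps can be sketched together (archivist 1.1 API assumed; the stand-in Repository replaces the one built earlier in the vignette).

```r
library(archivist)

# minimal stand-in Repository
DIRectory <- file.path(tempdir(), "tmpRepo")
dir.create(DIRectory)
createEmptyRepo(DIRectory)
saveToRepo(iris, repoDir = DIRectory)

# print counts of artifacts, tags and saves in the Repository
summaryLocalRepo(repoDir = DIRectory)

# delete the Repository once it is no longer needed
deleteRepo(DIRectory)
```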