Variability of weather forecasts


Have you ever wondered how stable weather forecasts are?

curiosity + R = fun,

let’s do a little test.

I’ve used the function getWeatherForecast {SmarterPoland} (GitHub release) to download weather forecasts from the Dark Sky API. Hourly forecasts are downloaded every 10 minutes and stored in this repository based on the archivist package (it makes managing a larger collection of objects easier). Having a bundle of forecasts, I’ve plotted them with the ggplot2 package and created an animation with the animation package.

And here it is (if you do not see the movie above, click here).
A short movie with the evolution of the temperature forecast for All Saints’ Day in Warsaw.
Red curve – the current forecast, grey curves – old forecasts, blue curve – history.

It’s funny that colder forecasts for the first evening go with warmer forecasts for the next morning.
And the changes in weather forecasts are not as smooth as I expected.

Lazy load with archivist

Version 1.1 of the archivist package reached CRAN a few days ago.
This package supports operations on a disk-based repository of R objects. It makes storing, restoring and searching for R objects easy (searching uses meta-information). Want to share your objects with article reviewers or collaborators? This package should help.
We’ve created some vignettes to present core use-cases. Here is one of them.

Lazy load with archivist

by Marcin Kosiński

Is a too-big .RData file causing problems? Interested in just a few objects from a huge .RData file? Does a regular load() into the Global Environment take too long, or crash your R session? Want to load or copy an object with an unknown name? Has maintaining an environment with thousands of objects become perplexing and troublesome?

If any of the above applies to you, this use case is a must-read.

The archivist package is a great solution that helps administer, archive and restore artifacts created in R.

Combining archivist and lazy load may be miraculous

If your .RData file is too big and you do not need or do not want to load the whole of it, you can simply convert the .RData file into a lazy-load database which serializes each entry separately and creates an index. The nice thing is that the loading will be on-demand.

Loading the database then only loads the index but not the contents. The contents are loaded as they are used.
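For instance, the conversion can be sketched in base R as below. Note that makeLazyLoadDB is an unexported internal of the tools package, so this is a sketch rather than a supported API, and the Huge.RData file is simulated here:

```r
# Simulate the big .RData file (stand-in for a real Huge.RData)
x <- 1:10
y <- letters
save(x, y, file = "Huge.RData")

# One-time full load into a scratch environment, then convert it
# into a lazy-load database (writes Huge.rdb + Huge.rdx)
e <- new.env()
load("Huge.RData", envir = e)
tools:::makeLazyLoadDB(e, "Huge")

# Loading the database only loads the index ...
rm(x, y)
lazyLoad("Huge")
# ... the contents are fetched from disk the first time they are used
x
```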

Now you can create your own local archivist-like Repository, which will make maintaining artifacts as easy as possible.

Then the objects from the Huge.RData file may be archived into the Repository created in the DIRectory directory. The attribute tags (see Tags), specified as realName, is added to every artifact before the saveToRepo() call, in order to remember the artifact’s name in the Repository.

Now, if you are interested in a specific artifact but the only things you remember about it are that its class was data.frame and that its name started with the letters ir, it is still possible to find that artifact using the searchInLocalRepo() function.

As a result we get the md5hashes of artifacts whose class was data.frame (hashes1) and the md5hashes of artifacts whose names (stored in the tag named realName) start with ir. Now we can intersect those two character vectors.

Once we have the one md5hash corresponding to the artifact we are interested in, we can load the artifact using the loadFromLocalRepo() function.
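Put together, the archive–search–load cycle might look roughly like this. The function names come from the text, but the exact signatures and the "name: value" tag format follow my reading of the archivist 1.1 vignette and may differ in other package versions; the block is skipped if archivist is not installed:

```r
# Sketch of the archivist workflow (archivist 1.1-style API assumed)
if (requireNamespace("archivist", quietly = TRUE)) {
  library(archivist)
  DIRectory <- file.path(tempdir(), "arepo")
  createEmptyRepo(DIRectory, force = TRUE)

  myData <- iris
  attr(myData, "tags") <- "realName: iris"       # remember the name as a tag
  saveToRepo(myData, repoDir = DIRectory)

  # search by class, then load the artifact back by its md5hash
  hashes1  <- searchInLocalRepo("class:data.frame", repoDir = DIRectory)
  restored <- loadFromLocalRepo(hashes1[1], repoDir = DIRectory, value = TRUE)
}
```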

One can always check the realName tag of that artifact with the getTagsLocal() function.

If only specific artifacts from the previously created Repository in the DIRectory directory will be needed in the future, they may be copied to a new Repository in a new directory. The new, smaller Repository will use less disk space and may be easier to send to contributors when working in groups. Here is an example of copying only artifacts of selected classes. Because the DIRectory2 directory did not exist, the parameter force = TRUE was needed to force the creation of an empty Repository. The vector hashesR contains md5hashes of artifacts that are related to other artifacts, which means they are datasets used to compute other artifacts. The special parameter fixed = TRUE specifies a search within tags that start with the letters relation.

You can even tar your Repository with the tarLocalRepo() function and send it to anybody you want.

You can check a summary of the Repository using the summaryLocalRepo() function. As you can see, some of the coxph artifacts have an additional class. There are also 8 datasets. Those are artifacts needed to compute other artifacts, archived additionally by the saveToRepo() call with the default parameter archiveData = TRUE.

When the Repository is no longer needed, we may simply delete it with the deleteRepo() function.

Parameterized SQL queries

Mateusz Żółtak asked me to spread the word about his new R package for parameterized SQL queries. Below you can find a copy of the package vignette. If you work with SQL in R you may find it useful.

The package RODBCext is an extension of the RODBC database connectivity package. It provides support for parameterized queries. This document describes what parameterized queries are and when they should be used. In addition, some examples of RODBCext usage are shown.

It is assumed that you already know the RODBC package and the basics of SQL and ODBC. If not, please read about them first, e.g. see the ODBC Connectivity vignette of the RODBC package.

1 What are parameterized queries for?

Parameterized queries (also known as prepared statements) are a technique of query execution which separates the query string from the query parameter values. There are two main reasons for using this technique:

  • avoiding SQL injections,
  • speeding up query execution in some scenarios.

Both are discussed below.

2 SQL injections

SQL injection is an attack against your code which uses SQL queries. Malicious query parameter values are passed in order to modify and execute a query. If you use SQL data sources, it is highly likely that sooner or later your R code will experience a problem similar to an SQL injection (or an SQL injection itself). Consider the following:

  • Even data from trusted data sources (even SQL ones) can cause problems in SQL queries if improper programming techniques are used.
  • Are you sure that your data came from a really trusted source?
  • All Shiny applications which process data from SQL data sources can be a target of an SQL injection attack.

2.1 Example – an apostrophe sign in data

Let us begin with a simple example illustrating how your own data can lead to problems similar to an SQL injection attack.

Imagine a simple SQL database with a table called cakes:

cake                  price
Chocolate cake        10
Strawberry cupcake    3
Kevin’s Cherry Tart   12.3

We receive a CSV data file containing the same database but with new prices. You are asked to update the database. So you write your R code as below:

Such code will fail on Kevin’s Cherry Tart because the name contains an apostrophe. The resulting query would be UPDATE cakes SET price = 12.3 WHERE cake = 'Kevin's Cherry Tart'; which is not valid SQL. To deal with Kevin’s Cherry Tart we need to escape the apostrophe in the cake’s name, so that the database knows it doesn’t denote the end of the string value.
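A toy base-R illustration of the problem and of the repeat-style fix (no real database involved; the escape() helper is made up for this example):

```r
cake  <- "Kevin's Cherry Tart"
price <- 12.3

# Naive query building: the apostrophe in the name breaks the SQL string
naive <- sprintf("UPDATE cakes SET price = %.1f WHERE cake = '%s';", price, cake)
# naive now ends with ... WHERE cake = 'Kevin's Cherry Tart'; -- malformed SQL

# Repeat-style escaping: double every apostrophe inside the value
escape <- function(x) gsub("'", "''", x)
safe <- sprintf("UPDATE cakes SET price = %.1f WHERE cake = '%s';",
                price, escape(cake))
# safe contains ... WHERE cake = 'Kevin''s Cherry Tart'; -- valid SQL
```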

2.2 Example – simple SQL injection

There is a nice XKCD about that – see here. Let’s translate it into an example in R.

We have got a database of students with a table students

last_name   first_name
Smith       John
Lee         Andrew
Wilson      Linda

A new school year has begun and new students have come. We have just received a CSV file with the same structure as the table students and we are asked to add it to the database. So we prepare a simple script:

Unfortunately, one of our new students’ names is:

last_name   first_name
Smith       Robert'); DROP TABLE students; --

For this student our query would be:

These are in fact two SQL queries and one SQL comment:

  • INSERT INTO students (last_name, first_name) VALUES ('Smith', 'Robert');
  • DROP TABLE students;
  • --')

Executing such a query can lead to serious data loss (hopefully we have made a backup copy, or do not have sufficient rights to drop the students table). To avoid such problems we should properly escape parameter values in our SQL queries.
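A toy base-R illustration of how pasting the malicious name into the query string splits it into the two statements plus the comment listed above (no real database involved):

```r
last_name  <- "Smith"
first_name <- "Robert'); DROP TABLE students; --"

# Naive query building pastes the value straight into the SQL string
query <- sprintf(
  "INSERT INTO students (last_name, first_name) VALUES ('%s', '%s');",
  last_name, first_name)
cat(query)
# INSERT INTO students (last_name, first_name) VALUES ('Smith', 'Robert'); DROP TABLE students; --');
```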

2.3 How to escape values in SQL queries?

At this point we already know that we should properly escape parameter values in our SQL queries. There are many techniques for doing that:

  • Manually checking the data types.
  • Using parameterized queries.
  • Using high-level functions which escape values for us.

2.3.1 Manually checking data types

You can escape your data manually, e.g.

  • cast numeric columns to numbers using as.numeric(column) or sprintf("%d %f", integerColumn, realColumn),
  • cast dates using as.character(as.Date(column)),
  • escape strings using gsub("'", "''", column),
  • etc.

This is possible, but it is also very error-prone, especially when escaping string values. Everyone knows that apostrophes have to be escaped, but:

  • Different database systems may use different escape sequences (e.g. C-style with a backslash or repeat-style with a double apostrophe).
  • Our database system may handle HTML/XML entities or characters inserted by a Unicode value (or many, many other strange ways of data input), so e.g. my'value or my\u0027value may be converted into my'value and then lead to errors in your query.

It is almost impossible to remember all caveats by yourself, so it is strongly advised not to use this method.

2.3.2 Using parameterized queries

Another solution is to separate the query string from its parameters (data). In that case, query execution is divided into two steps:

  • query parsing and planning,
  • passing parameter values to query and query execution.

As the query parameters are passed separately, parameter values cannot modify (and break) the query string. To indicate the places in the query where the parameters will be placed, a special character is used, typically a question mark.

Let us rewrite our cakes example using the sqlExecute(connHandle, queryString, data) function from the RODBCext package:

We replaced the parameter values in the query with question marks and passed the query and the data as separate function parameters. This makes our code not only resistant to SQL injection, but also easier to read.

Moreover, the sqlExecute() function supports vectorized data, so we can make the code even simpler:
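For completeness, a sketch of the vectorized cakes update; the data-source name "cakesDSN" is a placeholder assumption, and the database calls are wrapped in try() because they need a live ODBC connection:

```r
# New prices for the cakes table; with vectorized data, sqlExecute()
# runs the one parameterized query once per row of this data frame
newPrices <- data.frame(
  price = c(10, 3, 12.3),
  cake  = c("Chocolate cake", "Strawberry cupcake", "Kevin's Cherry Tart"),
  stringsAsFactors = FALSE)

if (requireNamespace("RODBCext", quietly = TRUE)) {
  library(RODBCext)                      # also attaches RODBC
  try({
    conn <- odbcConnect("cakesDSN")      # placeholder DSN -- adjust to your setup
    sqlExecute(conn, "UPDATE cakes SET price = ? WHERE cake = ?",
               data = newPrices)
    odbcClose(conn)
  })
}
```

Note how the apostrophe in Kevin's Cherry Tart needs no escaping at all: it travels as data, never as part of the query string.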

2.3.3 Using high-level functions which deal with escaping values for us

This would be the most straightforward solution.

An excellent example is dplyr, which provides a complete R-to-SQL mapper and allows us to forget about SQL entirely. Other examples are the sqlSave(), sqlUpdate(), sqlCopy() and sqlCopyTable() functions from the RODBC package, which deal with escaping values for us.

The problem is that:

  • dplyr escapes values rather naively. For strings, only a simple ' to '' escape is performed, which is enough to prevent silly errors but will fail against more advanced SQL injections.
  • RODBC’s high-level functions escape values in a safe way (by internally using parameterized queries), but have very limited functionality. Interestingly, judging from the comments in the source code, parameterized queries were introduced to them not for safety but to improve speed.

2.4 Summary

When using SQL we must pay attention to escaping query parameter values properly. The existing R database connectivity packages do not provide a completely reliable way of doing that. The SQL-injection-safe functions provide very limited functionality, while the more flexible functions use naive escaping methods. That is why RODBCext is a preferred way to make your R code safe from SQL injections.

I hope the dplyr developers will switch to using parameterized queries internally at some point. This would provide the R community with a brilliant and safe R-to-SQL mapper and let us forget about manual preparation of SQL queries.

3 Speeding up query execution using parameterized queries

SQL query execution is performed in a few steps. The first two steps are:

  • Parsing the query string into internal database query data structures.
  • Planning the query, e.g. deciding the order of joining the tables, indexes which should be used to execute a query, etc.

If we repeat the same query many times and only the values of the query parameters change, it will be faster to perform these steps only once and then reuse the already parsed and planned query. This can be achieved by using parameterized queries.

3.1 Example – big insert

A typical scenario is an insert of many rows to a table:

3.2 Example – speeding up a SELECT query

Repeated execution of a SELECT query can also benefit from the parameterized variant:

The longer the query string, the more complicated the query planning, and the more repetitions of the query, the more time can be saved.

4 Parameterized SQL queries in R

Unfortunately, all R packages known to me that provide support for SQL databases lack support for parameterized queries. Even the R DBI interface doesn’t define any methods which would allow one to implement parameterized queries. The main reason is probably that R package developers used to see SQL databases as just another storage backend for data frames rather than as powerful data processing engines (which modern SQL databases already are).

4.1 RODBCext

The RODBCext package tries to fill this gap by introducing support for parameterized queries on top of the RODBC package. RODBCext provides only two functions, both of them using database connection handles from RODBC:

  • sqlPrepare(connHandle, SQLquery, errors = TRUE)
  • sqlExecute(connHandle, SQLquery, data, fetch = FALSE, errors = TRUE, ...)

4.1.1 sqlExecute()

Allows execution of SQL queries separated from the query parameter values, e.g.:

The nice thing is that sqlExecute() (in contrast to sqlQuery()) supports vectorization. In the example below, data will contain the results of all five queries bound by rows.

Results can also be fetched separately using RODBC’s sqlGetResults(). This also provides a way to fetch results in parts:

As sqlExecute() internally uses sqlGetResults() to fetch the results of a query, it also accepts all parameters of sqlGetResults():

4.1.2 sqlPrepare()

Parses a query string and plans the query. The query can be executed later using sqlExecute() with the query parameter set to NULL. This can provide some performance gain when executing the same query multiple times (see the chapter Speeding up query execution using parameterized queries). Usage example:

Gender gap and visualisation challenge @ useR!2014

7 days to go for submissions in the DataVis contest at useR!2014 (see contest webpage).
Note that the contest is open for all R users, not only conference participants.
Submit your solution soon!

The PISA dataset allows one to challenge some “common opinions”, such as whether boys or girls are better at math or reading. But how should we compare the differences between genders? Averages are not fancy enough 😉 Let’s use weighted quantiles calculated with the wtd.quantile() function from the Hmisc package.

Below we present quantile-quantile plots for results in math and reading. Each point represents a single centile. The X coordinate stands for the centiles of male scores, while the Y coordinate corresponds to the centiles of female scores. The black line is the diagonal; points on this line correspond to equal centiles in both distributions.
Points below the black line correspond to centiles at which boys do better than girls; points above the black line correspond to centiles at which girls do better. For reading the story is simple: girls do better at all centiles, the whole distribution is shifted. But for math the story differs from country to country.
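The centile-by-centile comparison can be sketched on simulated scores with base R (the real analysis uses wtd.quantile() from Hmisc on the weighted PISA data, but unweighted quantile() shows the same idea):

```r
# Simulated scores: same mean, boys with a wider spread (a Finland-like story)
set.seed(1)
boys  <- rnorm(10000, mean = 500, sd = 105)
girls <- rnorm(10000, mean = 500, sd =  95)

probs <- seq(0.01, 0.99, by = 0.01)   # the 99 centiles
qb <- quantile(boys,  probs)
qg <- quantile(girls, probs)

# Quantile-quantile plot: points below the diagonal = boys do better there
plot(qb, qg, xlab = "boys' centiles", ylab = "girls' centiles")
abline(0, 1)                          # the diagonal
```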

[UK – the distribution of boys’ math scores is shifted relative to girls’]

[Finland – the distribution of boys’ math scores is wider; weak boys are weaker than girls, but strong boys are stronger than the corresponding girls’ centiles]

[USA – somewhere between Finland and UK]

What is the story for your country?

And the R code.

Proficiency levels @ PISA and visualisation challenge @ useR!2014

16 days to go for submissions in the DataVis contest at useR!2014 (see contest webpage).
The contest is focused on PISA data and students’ skills. The main variables that reflect pupils’ skills in math / reading / science are plausible values, e.g. the columns PV1MATH, PV1READ, PV1SCIE in the dataset.
But these values are normalized to have mean 500 and sd 100. It is not that easy to understand what skill level 600 means, or whether a 12-point difference in averages is large. To overcome this, PISA has introduced seven proficiency levels (from 0 to 6) based on the plausible values with cutoffs 358, 420, 482, 545, 607 and 669.
It is assumed that, for example, at level 6 “students can conceptualize, generalize, and utilize information based on their investigations and modeling of complex problem situations, and can use their knowledge in relatively non-standard contexts”.
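The mapping from plausible values to proficiency levels can be done with cut(); the cutoffs below are the ones listed above, and I assume scores below 358 fall into level 0:

```r
# Proficiency-level cutoffs; -Inf and Inf close the outermost intervals
cutoffs <- c(-Inf, 358, 420, 482, 545, 607, 669, Inf)

pv <- c(300, 400, 500, 600, 700)                    # example plausible values
level <- cut(pv, breaks = cutoffs, labels = 0:6, right = FALSE)
level
# factor with values: 0 1 3 4 6
```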

So, instead of looking at means, we can now look at the fraction of students at a given proficiency level. To have some fun we use the sp, rworldmap and RColorBrewer packages to draw country shapes instead of bars, with dots that represent the pupils taking part in the study. The downside is that area does not correspond to height, so it might be confusing. We add horizontal lines to expose the heights.

And here is the R code

Are they happy at school? PISA data and visualisation challenge @ useR!2014

I hope you have already heard about DataVis contest at useR!2014 (in case you did not, here is more information:
As a jury member I am not going to submit any work, but I am going to encourage you to play with the data, crunch out a finding and submit an interesting plot.

How easy is it to download the data and create some graphs?

Let’s take a look at students’ answers to the questions: ‘Do you feel happy at school?’ and ‘How familiar are you with Declarative Fractions?’.

Are boys more familiar with ‘Declarative Fractions’ than girls?
(funny fact: ‘Declarative Fraction’ is a fake concept that is used in PISA to measure overconfidence)
(funny fact 2: plot familiarity with ‘Declarative Fractions’ across countries; the results are very interesting)

In which countries is the fraction of kids who declare that they are happy at school the highest?

There are a lot of ‘kind of Likert’-scale questions in the PISA study, so for sure there are a lot of nice stories as well.

Data visualisation challenge at the useR! 2014 conference

Familiar with R and DataVis? Then try the PISA data visualisation contest for useR!2014 participants.
There are two contest tracks. In each you can win $700.

More information about the contest can be found here:

The prizes are funded by The Organisation for Economic Co-operation and Development (OECD) and tracks are related to data from The Programme for International Student Assessment (PISA).

Note that the deadline is Sunday, 29 June 2014, Pacific Daylight Time.

Below you can find a simple example of how to read the PISA data and plot some basic graphs.

Let’s see in which countries pupils have a high average reading + math score, and in which countries pupils are on average better at math than reading.

Let’s add country sizes to the plot.

I am very eager to see winning submissions.

PISA 2012, occupations and the shiny app

OECD has just released a new PISA in Focus titled “Do parents’ occupations have an impact on student performance?”.

A shiny app is an add-on to this article, you can open it from this link:

The app allows comparisons of the average performance in mathematics/reading/science of 15-year-old pupils in the PISA study across different groups of parental occupations.
Are kids of Professionals ‘on average’ performing better than kids from other occupational groups?

You can also compare countries or regions to check if spread of occupational averages is larger or smaller here or there.

The nice thing about this article and its results is that they were created entirely in R (the data, results and R code are open).

R sources of this app are available on github:

The PISA2003lite package is released. Let’s explore!

Today I’m going to show how to install PISA2003lite, what is inside and how to use this R package. Datasets from this package will be used to compare student performance in four math sub-areas across different countries.
At the end of the day we will find out in which areas top performers from different countries are stronger and in which they are weaker.

In the post “The PISA2009lite is released” an R package PISA2009lite with data from PISA 2009 was introduced. The same approach was applied to convert data from PISA 2003 into an R package. PISA (Programme for International Student Assessment) is a worldwide study focused on measuring the scholastic performance in mathematics, science and reading of 15-year-old school pupils.
The study performed in 2003 was focused on mathematics. Note that PISA 2012 was focused on mathematics as well, so it will be very interesting to compare results between the two studies [once the PISA 2012 data become public].

The package PISA2003lite is on GitHub, so to install it you just need:

You will find three datasets in this package: data from the student questionnaire, the school questionnaire, and the cognitive items.

Let's plot something! What about strong and weak sides in particular MATH sub-areas?
In this dataset the overall performance is represented by five plausible values: PV1MATH, PV2MATH, PV3MATH, PV4MATH and PV5MATH. But for each student, performance in four sub-scales is also measured. These sub-scales are: Space and Shape, Change and Relationships, Uncertainty, and Quantity (plausible values: PVxMATHy, x=1..5, y=1..4).

Let's find out how good the top performers in each country are in the different sub-scales.
For every country, let's calculate the 95% quantile of performance in every sub-scale.
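With simulated data, the per-country quantiles might be computed like this. The real code would use the student dataset from PISA2003lite; the country codes and the PV1MATHy column names here are assumptions for illustration:

```r
# Simulated stand-in for the PISA2003lite student data
set.seed(1)
df <- data.frame(CNT      = rep(c("POL", "FIN", "USA"), each = 200),
                 PV1MATH1 = rnorm(600, 500, 100),   # Space and Shape
                 PV1MATH2 = rnorm(600, 500, 100))   # Change and Relationships

# 95% quantile of each sub-scale, per country
q95 <- aggregate(df[, c("PV1MATH1", "PV1MATH2")],
                 by = list(CNT = df$CNT),
                 FUN = quantile, probs = 0.95)
q95
```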

Just a few more lines to add the proper row and column names.

And here are the results. The table looks nice, but there are so many numbers.
Let's use PCA to reduce the dimensionality of the data.
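A minimal PCA sketch on a simulated country-by-subscale table (the names are made up; the real input would be the quantile table described above):

```r
# Simulated table: 8 countries x 4 math sub-scale quantiles
set.seed(1)
tab <- matrix(rnorm(8 * 4, mean = 600, sd = 30), nrow = 8,
              dimnames = list(paste0("country", 1:8),
                              c("SpaceShape", "ChangeRel",
                                "Uncertainty", "Quantity")))

# PCA on standardized columns; pca$x holds the country coordinates
pca <- prcomp(tab, scale. = TRUE)
head(pca$x[, 1:2])   # each country's position on the first two components
biplot(pca)          # countries and sub-scale loadings on one plot
```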


Quite interesting!
It looks like the first PCA coordinate is an average over all sub-scales. Thus in the plot above, the further left a country is, the better its top performers are in all sub-scales.
But the second PCA coordinate differentiates between countries whose top performers are better in 'Space and Shape' [top] versus 'Uncertainty' [bottom]. Compare this plot with the table above, and you will see that for the Czech Republic, the Slovak Republic and the Russian Federation, 'Space and Shape' is the strongest side of the top performers.
On the other side, Sweden, the USA, Norway and Ireland score higher in 'Uncertainty'.

As you may easily check, the results are pretty similar if you focus on averages, or on performers from the bottom tail.

Which direction is better? Of course it depends. But since “change is the only sure thing”, understanding uncertainty might be useful.