Gender gap and visualisation challenge @ useR!2014

7 days to go for submissions in the DataVis contest at useR!2014 (see contest webpage).
Note that the contest is open for all R users, not only conference participants.
Submit your solution soon!

PISA dataset allows to challenge some ,,common opinions”, like are boys or girls better in math / reading. But, how to compare differences between genders? Averages are not fancy enough 😉 Let’s use weighted quantiles calculated with function wtd.quantile() from the Hmisc package.

Below we present quantile-quantile plots for results in math and reading. Each point represents a single centile. X coordinate stands for centiles of male scores while Y coordinate correspond to female scores respectively. Black line is the diagonal, points on this line correspond to equal centiles in both distributions.
Points below black line correspond to centile for which boys are doing better than girls. Points above black line correspond to centiles for which girls are doing better. For reading the story is simple, girls are doing better for all centiles, whole distribution is shifted. But for math the story is different for different countries.

[UK – distribution of math scores for boys is shifted in comparison to females]

[Finland – distribution of math scores for boys is wider, weak boys are weaker than girls but strong one are stronger than corresponding girl’s centile]

[USA – somewhere between Finland and UK]

What is the story for your country?

And the R code.

# read students data from PISA 2012
# directly from URL
con <- url("")
# plot quantiles for a country
plotCountry <- function(cnt) {
  cutoffs <- seq(0,1,0.01)
  selected <- student2012[student2012$CNT == cnt, ]
  getQuants <- function(group) {
    selectedG <- selected[group, ]
    wtd.quantile(selectedG$PV1MATH, weights=selectedG$W_FSTUWT, probs=cutoffs)
  ecdf1 <- getQuants(selected$ST04Q01 == "Male")
  ecdf2 <- getQuants(selected$ST04Q01 == "Female")
  df1 <- data.frame(cutoffs, Male = ecdf1, Female = ecdf2, subject="MATH")
  getQuants <- function(group) {
    selectedG <- selected[group, ]
    wtd.quantile(selectedG$PV1READ, weights=selectedG$W_FSTUWT, probs=cutoffs)
  ecdf1 <- getQuants(selected$ST04Q01 == "Male")
  ecdf2 <- getQuants(selected$ST04Q01 == "Female")
  df2 <- data.frame(cutoffs, Male = ecdf1, Female = ecdf2, subject="READ")
  df <- rbind(df1, df2)
  ggplot(df, aes(x = Male, y = Female, col = subject, shape = subject)) + 
    geom_point() +
    xlim(350,650) + ylim(350,650) + 
    geom_abline(intercept=0, slope=1) + ggtitle(cnt)
# plot results for selected countries  
for (cnt in c("Canada", "United Kingdom","United States of America", "Finland"))
  ggsave(plotCountry(cnt), filename=paste0("Quant", cnt, ".png"), width=600/72, height=600/72)

Proficiency levels @ PISA and visualisation challenge @ useR!2014

16 days to go for submissions in the DataVis contest at useR!2014 (see contest webpage).
The contest is focused on PISA data and students' skills. The main variables that reflect pupil skills in math / reading / science are plausible values e.g. columns PV1MATH, PV1READ, PV1SCIE in the dataset.
But, these values are normalized to have mean 500 and sd 100. And it is not that easy to understand what the skill level 600 means and is 12 points in average a big difference. To overcome this PISA has introduced seven proficiency levels (from 0 to 6, see that base on plausible values with cutoffs 358, 420, 482, 545, 607, 669.
It is assumed that, for example, at level 6 ,,students can conceptualize, generalize, and utilize information based on their investigations and modeling of complex problem situations, and can use their knowledge in relatively non-standard contexts”.

So, instead of looking at means we can now take a look at fractions of students at given proficiency level. To have some fun we use sp and rworldmap and RColorBrewer packages to have country shapes instead of bars and dots that are supposed to represent pupils that take part in the study. The down side is that area does not correspond to height so it might be confusing. We add horizontal lines to expose the height.

And here is the R code

library(RColorBrewer) <- map_data(map = "world")
cols <- brewer.pal(n=7, "PiYG")
# read students data from PISA 2012
# directly from URL
con <- url("")
prof.scores <- c(0, 358, 420, 482, 545, 607, 669, 1000)
prof.levels <- cut(student2012$PV1MATH, prof.scores, paste("level", 1:7))
plotCountry <- function(cntname = "Poland", cntname2 = cntname) {
  props <- prop.table(tapply(student2012$W_FSTUWT[student2012$CNT == cntname],
         prof.levels[student2012$CNT == cntname], 
  cntlevels <- rep(1:7, times=round(props*5000))
  cntcontour <-[$region == cntname2,]
  cntcontour <- cntcontour[cntcontour$group == names(which.max(table(cntcontour$group))), ]
  wspx <- range(cntcontour[,1])
  wspy <- range(cntcontour[,2])
  N <- length(cntlevels)
  px <- runif(N) * diff(wspx) + wspx[1]
  py <- sort(runif(N) * diff(wspy) + wspy[1])
  sel <- which(, py, cntcontour[,1], cntcontour[,2], mode.checked=FALSE) == 1)
  df <- data.frame(long = px[sel], lat = py[sel], level=cntlevels[sel])  
  par(pty="s", mar=c(0,0,4,0))
  plot(df$long, df$lat, col=cols[df$level], pch=19, cex=3,
       bty="n", xaxt="n", yaxt="n", xlab="", ylab="")
# PISA and World maps are using differnt country names,
# thus in some cases we need to give two names
plotCountry(cntname = "Korea", cntname2 = "South Korea")
plotCountry(cntname = "Japan", cntname2 = "Japan")
plotCountry(cntname = "Finland")
plotCountry(cntname = "Poland")
plotCountry(cntname = "France", cntname2 = "France")
plotCountry(cntname = "Italy", cntname2 = "Italy")
plotCountry(cntname = "United States of America", cntname2 = "USA")

Are they happy at school? PISA data and visualisation challange @ useR!2014

I hope you have already heard about DataVis contest at useR!2014 (in case you did not, here is more information:
As a jury member I am not going to submit any work, but I am going to encourage you to play with the data, crunch a finding and submit interesting plot.

How easy is to download the data and create some graph?

Let’s take a look at students' answers for questions: ’Do you feel happy at school?' and ’How familiar are you with Declarative Fractions?'.

Are boys more familiar with 'Declarative Fractions' than girls?
(funny fact: 'Declarative Fraction' is a fake concept that is used in PISA to measure overconfidence)
(funny fact 2: plot familiarity with 'Declarative Fractions' across countries, results are very interesting)

# read students data from PISA 2012
# directly from URL
con <- url("")
# variable ST62Q13 codes familiarity with 'Declarative Fraction'
tab <- pisa.table(variable="ST62Q13", by="ST04Q01", data=student2012)
ptab <- acast(tab, ST04Q01~ST62Q13, value.var="Percentage")
ddat <- data.frame(Item=rownames(ptab), ptab)
# plot it with the likert package, center=2) + ggtitle("Declarative Fraction") + theme_bw()

In which countries the fraction of kids that declare that they are happy at school is the highest?

# variable ST87Q07 codes 'Sense of Belonging - Feel Happy at School'
# pisa.table calculates weighted fractions and it's standard errors
tab <- pisa.table(variable="ST87Q07", by="CNT", data=student2012)
ptab <- acast(tab, CNT~ST87Q07, value.var="Percentage")
# need to add fake column with 0, otherwise will fail
ddat <- data.frame(Item=rownames(ptab), cbind(ptab[,4:3], 0, ptab[,2:1]))
ddat <- ddat[-grep(rownames(ddat),pattern="(", fixed=TRUE),]
# plot it with the likert package,  plot.percent.neutral=FALSE) + ggtitle("Sense of Belonging - Feel Happy at School")

There is a lot of 'kind of Likert' scale questions in the PISA study, for sure there is a lot of nice stories as well.

Data visualisation challange at useR 2014 conference

Familiar with R and DataVis? Take a try in PISA data visualisation contest for useR2014 participants.
There are two contest tracks. In each you can win 700$.

More information about the contest can be found here:

The prizes are funded by The Organisation for Economic Co-operation and Development (OECD) and tracks are related to data from The Programme for International Student Assessment (PISA).

Note that the deadline is Sunday 29 june 2014 Pacific Daylight Time.

Below you can find an simple example how to read PISA data and plot some basic graphs.

Let’s see in which countries pupils have high average reading + math score and in which countries pupils are on average better in math than reading.

# read students data from PISA 2012
# directly from URL
con <- url("")
# calculate weighted mean from math / reading scores
# W_FSTUWT stands for final weights
mathScores <- unclass(by(student2012[,c("PV1MATH", "W_FSTUWT")], 
                 function(x) weighted.mean(x[,1], x[,2])) )
readScores <- unclass(by(student2012[,c("PV1READ", "W_FSTUWT")], 
                 function(x) weighted.mean(x[,1], x[,2])) )
# create a data.frame with scores and country names
# remove names with ( in name)
readMathScores <- data.frame(Country=names(readScores), readScores, mathScores)
readMathScores <- readMathScores[-grep(readMathScores$Country, pattern="(", fixed=TRUE),]
ggplot(readMathScores, aes(x=mathScores + readScores, y = mathScores - readScores, label = Country)) + 
  geom_text() +

Let’s add country sizes to the plot.

sizeScores <- unclass(by(student2012[,"W_FSTUWT"], 
                         sum) )
# create a data.frame with scores, sizes and country names
# remove names with ( in name)
readMathScores <- data.frame(Country=names(readScores), readScores, mathScores, sizeScores)
readMathScores <- readMathScores[-grep(readMathScores$Country, pattern="(", fixed=TRUE),]
ggplot(readMathScores, aes(x=mathScores + readScores, y = mathScores - readScores, label = Country, size = sqrt(sizeScores))) + 
  geom_text() +
  theme_bw() + theme(legend.position="none")

I am very eager to see winning submissions.

PISA 2012, occupations and the shiny app

OECD has just released a new PISA in Focus titled „Do parents’ occupations have an impact on student performance?”.

A shiny app is an add-on to this article, you can open it from this link:

The app allows for comparisons of average performance in mathematics/reading/science of 15-years old pupils in PISA study across different groups of parental occupations.
Are kids of Professionals ‘on average’ performing better than kids of other professions?

You can also compare countries or regions to check if spread of occupational averages is larger or smaller here or there.

The nice thing about this article and results is that they are created entirely in R (data, results and R codes are open).

R sources of this app are available on github:

The PISA2003lite package is released. Let’s explore!

Today I’m going to show how to install PISA2003lite, what is inside and how to use this R package. Datasets from this package will be used to compare student performance in four math sub-areas across different countries.
At the end of the day we will find out in which areas top performers from different countries are stronger and in which they are weaker.

In the post ,,The PISA2009lite is released” an R package PISA2009lite with data from PISA 2009 was introduced. The same approach was applied to convert data from PISA 2003 into an R package. PISA (Programme for International Student Assessment) is a worldwide study focused on measuring scholastic performance on mathematics, science and reading of 15-year-old school pupils.
The study performed in 2003 was focused on mathematics. Note, that PISA 2012 was focused on mathematics as well, so it will be very interesting to compare results between both studies [when only date from PISA 2012 data become public].

The package PISA2003lite is on github, so to install it you just need

# don't download 120MB of compressed data if the package is already installed
if (length(find.package("PISA2003lite", quiet = TRUE)) == 0) 
     install_github("PISA2003lite", "pbiecek")
# and load the package

You will find three data sets in this package. These are: data from student questionnaire, school questionnaire and cognitive items.

## [1] 276165    407
## [1] 10274   192
## [1] 276165    178

Let's plot something! What about strong and weak sides in particular MATH sub-areas?
In this dataset the overall performance is represented by five plausible values: PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH. But for each student also performance in four sub-scales is measured. These sub-scales are: Space and Shape, Change and Relationships, Uncertainty and Quantity (plausible values: PVxMATHx, x=1..5, y=1..4).

Let's find out how good are top performers in each country in different sub-scales.
For every country, let's calculate the 95% quantile of performance in every subscale.

res <- by(student2003[,c("PV1MATH1", "PV1MATH2", "PV1MATH3", "PV1MATH4")], student2003$COUNTRY, function(x) {
  apply(x, 2, quantile, 0.95)
res <- t(simplify2array(res))

Just few more lines to add the proper row- and col- names.

subAreasNames <- sapply(strsplit(student2003dict[colnames(res)], split = " *- *"), `[`, 2)
colnames(res) <- subAreasNames

And here are results. Table looks nice, but there is so many numbers.
Let's use PCA to reduce dimensionality of the data.

##                    Space and Shape Change and Relationships Uncertainty
## Australia                    687.5                    679.8       687.2
## Austria                      706.3                    669.0       652.5
## Belgium                      704.8                    712.2       692.1
## Brazil                       513.7                    546.0       523.2
## Canada                       664.8                    674.7       672.0
## Czech Republic               744.4                    702.3       673.4
## Denmark                      674.2                    666.2       661.4
## Finland                      685.7                    694.6       679.1
## France                       675.9                    677.9       653.9
## Germany                      684.4                    679.0       649.0
## Greece                       595.7                    604.9       601.7
## Hong Kong-China              733.4                    702.0       712.8
## Hungary                      660.3                    656.3       631.4
## Iceland                      655.1                    663.7       678.9
## Indonesia                    508.1                    506.8       497.5
## Ireland                      632.8                    647.1       664.8
## Italy                        675.8                    644.5       641.2
## Japan                        724.9                    707.6       682.5
## Korea                        743.2                    705.4       680.9
## Latvia                       656.1                    651.9       613.7
## Liechtenstein                701.3                    708.5       670.5
## Luxembourg                   653.9                    652.5       649.7
## Macao-China                  688.7                    674.3       674.8
## Mexico                       535.8                    534.8       535.0
## Netherlands                  676.9                    704.1       692.2
## New Zealand                  697.1                    693.5       700.1
## Norway                       651.4                    646.2       675.9
## Poland                       668.4                    654.0       632.2
## Portugal                     609.4                    623.7       604.8
## Russian Federation           666.2                    644.6       593.7
## Slovak Republic              702.8                    667.7       622.8
## Spain                        628.9                    642.3       632.1
## Sweden                       657.9                    680.9       672.8
## Switzerland                  701.7                    684.5       663.8
## Thailand                     590.9                    583.9       561.1
## Tunisia                      513.5                    510.8       485.6
## Turkey                       596.9                    629.8       613.0
## United Kingdom               661.7                    671.3       672.8
## United States                631.3                    639.3       649.8
## Uruguay                      579.3                    601.8       582.8
##                    Quantity
## Australia             671.3
## Austria               655.2
## Belgium               692.2
## Brazil                542.8
## Canada                668.2
## Czech Republic        700.9
## Denmark               660.5
## Finland               681.4
## France                657.1
## Germany               675.1
## Greece                605.8
## Hong Kong-China       699.2
## Hungary               645.5
## Iceland               666.5
## Indonesia             509.8
## Ireland               645.5
## Italy                 662.3
## Japan                 683.5
## Korea                 679.3
## Latvia                619.4
## Liechtenstein         675.9
## Luxembourg            647.2
## Macao-China           669.7
## Mexico                562.0
## Netherlands           681.7
## New Zealand           670.9
## Norway                646.2
## Poland                636.2
## Portugal              619.3
## Russian Federation    625.8
## Slovak Republic       663.3
## Spain                 652.3
## Sweden                655.5
## Switzerland           669.2
## Thailand              591.4
## Tunisia               520.1
## Turkey                611.3
## United Kingdom        664.2
## United States         640.7
## Uruguay               603.4
par(xpd = NA)

plot of chunk unnamed-chunk-5

Quite interesting!
It looks like first PCA coordinate is an average over all sub-scales. Thus on the plot above the further left, the better are top performers in all sub-scales.
But the second PCA coordinate differentiates between countries in which top performers are better in 'Space and Shape' [top] versus 'Uncertainty' [bottom]. Compare this plot with the table above, and you will see that for Czech Republic, Slovak Republic, Russian Federation the 'Space and Shape' is the strongest side of top performers.
On the other side Sweden, USA, Norway, Ireland score higher in 'Uncertainty'.

As you may easily check, that results are pretty similar if you focus on averages or performers from the bottom tail.

Which direction is better? Of course it depends. But since ,,change is the only sure thing'', understanding of uncertainty might be useful.

The PISA2009lite package is released

This post introduces a new R package named PISA2009lite. I will show how to install this package, what is inside and how to use it.


PISA (Programme for International Student Assessment) is a worldwide study focused on measuring performance of 15-year-old school pupils. More precisely, scholastic performance on mathematics, science and reading is measured in more than 500 000 pupils from 65 countries.

First PISA study was performed in 2000, second in 2003, and then in 2006, 2009 and the last one in 2012. Data from the last study will be made public on December 2013. Data from previous studies are accessible through the PISA website
Note that this data set is quite large. This is why the PISA2009lite package will use more than 220MB of your disk [data sets are compressed on disk] and much more RAM [data sets will be decompressed after [lazy]loading to R].

Let's see some numbers. In PISA 2009 study, number of examined pupils: 515 958 (437 variables for each pupil), number of examined parents: 106 287 (no, their questions are not related to scholastic performance), number of schools from which pupils were sampled: 18 641. Pretty large, complex and interesting dataset!

On the official PISA webpage there are instructions how to read data from 2000-2009 studies into SAS and SPSS statistical packages. I am transforming these dataset to R packags, to make them easier to use for R users.
Right now, PISA2009lite is mature enough to share it. There are still many things to correct/improve/add. Fell free to point them [or fix them].

This is not the first attempt to get the PISA data available for R users. On the github you can find 'pisa' package maintained by Jason Bryer ( with data from PISA 2009 study.
But since I need data from all PISA editions, namely 2000, 2003, 2006, 2009 and 2012 I've decided to create few new packages, that will support consistent way to access data from different PISA studies.

Open the R session

The package is on github, so in order to install it you need just

# dont download 220MB of compressed data if the package is already
# installed
if (length(find.package("PISA2009lite", quiet = TRUE)) == 0) install_github("PISA2009lite", 

now PISA2009lite is ready to be loaded


You will find five data sets in this package [actually ten, I will explain this later]. These are: data from student questionnaire, school questionnaire, parent questionnaire, cognitive items and scored cognitive items.

## [1] 515958    437
## [1] 106287     90
## [1] 18641   247
## [1] 515958    273
## [1] 515958    227

You can do a lot of things with these data sets. And I am going to show some examples in next posts.

Country ranking in just few lines of code

But as a warmer let's use it to calculate average performance in mathematics for each country.

Note that student2009$W_FSTUWT stands for sampling weights, student2009$PV1MATH stands for first plausible value from MATH scale while student2009$CNT stands for country

means <- unclass(by(student2009[, c("PV1MATH", "W_FSTUWT")], student2009$CNT, 
                        function(x) weighted.mean(x[, 1], x[, 2])))
# sort them
means <- sort(means)

And plot it.

dotchart(means, pch = 19)
abline(v = seq(350, 600, 50), lty = 3, col = "grey")

plot of chunk unnamed-chunk-6