Today I’m going to show how to install PISA2003lite, what is inside and how to use this R package. Datasets from this package will be used to compare student performance in four math sub-areas across different countries.
At the end of the day we will find out in which areas top performers from different countries are stronger and in which they are weaker.
In the post "The PISA2009lite is released", an R package PISA2009lite with data from PISA 2009 was introduced. The same approach was applied to convert data from PISA 2003 into an R package. PISA (Programme for International Student Assessment) is a worldwide study focused on measuring the scholastic performance of 15-year-old school pupils in mathematics, science and reading.
The study performed in 2003 was focused on mathematics. Note that PISA 2012 was focused on mathematics as well, so it will be very interesting to compare results between both studies [once data from PISA 2012 become public].
The package PISA2003lite is on GitHub, so to install it you just need:
library(devtools)
# don't download 120MB of compressed data if the package is already installed
if (length(find.package("PISA2003lite", quiet = TRUE)) == 0)
  install_github("PISA2003lite", "pbiecek")  # newer devtools versions expect install_github("pbiecek/PISA2003lite")
# and load the package
library(PISA2003lite)
You will find three data sets in this package: data from the student questionnaire, the school questionnaire and the cognitive items.
dim(student2003)
## [1] 276165    407
dim(school2003)
## [1] 10274   192
dim(scoredItem2003)
## [1] 276165    178
Let's plot something! What about strong and weak sides in particular MATH sub-areas?
In this dataset the overall performance is represented by five plausible values: PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH. But for each student, performance in four sub-scales is measured as well. These sub-scales are: Space and Shape, Change and Relationships, Uncertainty, and Quantity (plausible values: PVxMATHy, x=1..5, y=1..4).
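To make the PVxMATHy naming concrete, here is a small sketch that builds all 20 sub-scale column names. The order of the sub-scales (1 to 4) is taken from the column labels shown in the results of this post; `pv_names` is just an illustrative name, not something from the package.

```r
# Sub-scale order as it appears in the results of this post (y = 1..4)
sub_scales <- c("Space and Shape", "Change and Relationships",
                "Uncertainty", "Quantity")

# Rows: plausible value x = 1..5, columns: sub-scale y = 1..4
pv_names <- outer(1:5, 1:4, function(x, y) paste0("PV", x, "MATH", y))
dimnames(pv_names) <- list(paste0("PV", 1:5), sub_scales)

pv_names["PV1", ]        # the four columns used in the analysis
pv_names[, "Quantity"]   # all five plausible values for one sub-scale
```

The first row gives exactly the four columns used in this post; the remaining rows are the other plausible values for the same sub-scales.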
Let's find out how good top performers in each country are in the different sub-scales.
For every country, let's calculate the 95% quantile of performance in every subscale.
res <- by(student2003[, c("PV1MATH1", "PV1MATH2", "PV1MATH3", "PV1MATH4")],
          student2003$COUNTRY,
          function(x) {
            apply(x, 2, quantile, 0.95)
          })
res <- t(simplify2array(res))
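If you want to see what this step does without loading the full dataset, here is a minimal sketch on made-up scores. It uses split() and sapply() instead of by(), which is a different base R route that should give a matrix of the same shape (countries in rows, sub-scales in columns). The data frame `scores` and its values are invented for illustration.

```r
# Toy data: two hypothetical countries, five students each
scores <- data.frame(
  COUNTRY  = rep(c("A", "B"), each = 5),
  PV1MATH1 = c(400, 450, 500, 550, 600, 500, 520, 540, 560, 580),
  PV1MATH2 = c(410, 430, 450, 470, 490, 600, 610, 620, 630, 640)
)

# Split the score columns by country, take the 95% quantile of each column
per_country <- split(scores[, c("PV1MATH1", "PV1MATH2")], scores$COUNTRY)
top <- t(sapply(per_country, function(x) apply(x, 2, quantile, 0.95)))

top
```

With R's default quantile definition the 95% quantile of five values is interpolated between the two largest ones, so country A gets 590 on PV1MATH1. Changing 0.95 to 0.05, or replacing quantile with mean, is a one-line change.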
Just a few more lines to add proper row and column names.
subAreasNames <- sapply(strsplit(student2003dict[colnames(res)], split = " *- *"), `[`, 2)
colnames(res) <- subAreasNames
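The renaming relies on the dictionary being a named character vector whose labels contain the sub-area after a hyphen. Here is a sketch with a made-up dictionary; the label wording in `toy_dict` is an assumption, and the real student2003dict entries may be phrased differently.

```r
# Hypothetical labels, only to show the mechanics of the split
toy_dict <- c(
  PV1MATH1 = "Plausible value in math - Space and Shape",
  PV1MATH2 = "Plausible value in math - Change and Relationships"
)

# Split each label on a hyphen with optional surrounding spaces,
# then keep the second piece (the sub-area name)
parts    <- strsplit(toy_dict[c("PV1MATH1", "PV1MATH2")], split = " *- *")
sub_area <- sapply(parts, `[`, 2)

sub_area
```

Note that the pattern splits on every hyphen, so a label with a hyphen inside the sub-area name would be cut short.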
And here are the results. The table looks nice, but there are so many numbers.
Let's use PCA to reduce dimensionality of the data.
res
##                    Space and Shape Change and Relationships Uncertainty Quantity
## Australia                    687.5                    679.8       687.2    671.3
## Austria                      706.3                    669.0       652.5    655.2
## Belgium                      704.8                    712.2       692.1    692.2
## Brazil                       513.7                    546.0       523.2    542.8
## Canada                       664.8                    674.7       672.0    668.2
## Czech Republic               744.4                    702.3       673.4    700.9
## Denmark                      674.2                    666.2       661.4    660.5
## Finland                      685.7                    694.6       679.1    681.4
## France                       675.9                    677.9       653.9    657.1
## Germany                      684.4                    679.0       649.0    675.1
## Greece                       595.7                    604.9       601.7    605.8
## Hong Kong-China              733.4                    702.0       712.8    699.2
## Hungary                      660.3                    656.3       631.4    645.5
## Iceland                      655.1                    663.7       678.9    666.5
## Indonesia                    508.1                    506.8       497.5    509.8
## Ireland                      632.8                    647.1       664.8    645.5
## Italy                        675.8                    644.5       641.2    662.3
## Japan                        724.9                    707.6       682.5    683.5
## Korea                        743.2                    705.4       680.9    679.3
## Latvia                       656.1                    651.9       613.7    619.4
## Liechtenstein                701.3                    708.5       670.5    675.9
## Luxembourg                   653.9                    652.5       649.7    647.2
## Macao-China                  688.7                    674.3       674.8    669.7
## Mexico                       535.8                    534.8       535.0    562.0
## Netherlands                  676.9                    704.1       692.2    681.7
## New Zealand                  697.1                    693.5       700.1    670.9
## Norway                       651.4                    646.2       675.9    646.2
## Poland                       668.4                    654.0       632.2    636.2
## Portugal                     609.4                    623.7       604.8    619.3
## Russian Federation           666.2                    644.6       593.7    625.8
## Slovak Republic              702.8                    667.7       622.8    663.3
## Spain                        628.9                    642.3       632.1    652.3
## Sweden                       657.9                    680.9       672.8    655.5
## Switzerland                  701.7                    684.5       663.8    669.2
## Thailand                     590.9                    583.9       561.1    591.4
## Tunisia                      513.5                    510.8       485.6    520.1
## Turkey                       596.9                    629.8       613.0    611.3
## United Kingdom               661.7                    671.3       672.8    664.2
## United States                631.3                    639.3       649.8    640.7
## Uruguay                      579.3                    601.8       582.8    603.4
par(xpd = NA)
biplot(prcomp(res))
Quite interesting!
It looks like the first PCA coordinate is an average over all sub-scales. Thus, on the plot above, the further left a country is, the better its top performers are in all sub-scales.
But the second PCA coordinate differentiates between countries in which top performers are better in 'Space and Shape' [top] versus 'Uncertainty' [bottom]. Compare this plot with the table above, and you will see that for the Czech Republic, the Slovak Republic and the Russian Federation, 'Space and Shape' is the strongest side of top performers.
On the other hand, Sweden, the USA, Norway and Ireland score higher in 'Uncertainty'.
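The reading of the two coordinates can be checked by looking at the loadings directly rather than at the plot. Here is a sketch on a dozen rows copied from the table above, so it runs without the package; the object names `top12` and `pc` are only illustrative, and the loadings on this subset will not match those of the full table exactly.

```r
# A subset of the 95% quantiles from the table above
top12 <- matrix(c(
  513.7, 546.0, 523.2, 542.8,   # Brazil
  744.4, 702.3, 673.4, 700.9,   # Czech Republic
  733.4, 702.0, 712.8, 699.2,   # Hong Kong-China
  508.1, 506.8, 497.5, 509.8,   # Indonesia
  632.8, 647.1, 664.8, 645.5,   # Ireland
  743.2, 705.4, 680.9, 679.3,   # Korea
  651.4, 646.2, 675.9, 646.2,   # Norway
  666.2, 644.6, 593.7, 625.8,   # Russian Federation
  702.8, 667.7, 622.8, 663.3,   # Slovak Republic
  657.9, 680.9, 672.8, 655.5,   # Sweden
  513.5, 510.8, 485.6, 520.1,   # Tunisia
  631.3, 639.3, 649.8, 640.7    # United States
), ncol = 4, byrow = TRUE, dimnames = list(
  c("Brazil", "Czech Republic", "Hong Kong-China", "Indonesia", "Ireland",
    "Korea", "Norway", "Russian Federation", "Slovak Republic", "Sweden",
    "Tunisia", "United States"),
  c("Space and Shape", "Change and Relationships", "Uncertainty", "Quantity")
))

pc <- prcomp(top12)

# Loadings: how each sub-scale contributes to each PCA coordinate
round(pc$rotation, 2)

# Share of variance captured by each coordinate
summary(pc)$importance["Proportion of Variance", ]
```

If the first coordinate really is an overall level, all four loadings in the PC1 column share one sign. For the second coordinate, look at whether 'Space and Shape' and 'Uncertainty' get loadings of opposite sign. The sign of a whole column is arbitrary, which is why "left" on the biplot can mean "better".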
As you may easily check, the results are pretty similar if you focus on averages or on performers from the bottom tail.
Which direction is better? Of course it depends. But since "change is the only sure thing", an understanding of uncertainty might be useful.
Thank you, this is interesting and potentially useful, but I run into trouble with the reference to PISA2009lite:
> library(PISA2009lite)
> rownames(res)
> if (length(find.package("PISA2009lite", quiet = TRUE)) == 0) install_github("PISA2009lite", "pbiecek")
# * installing *source* package 'PISA2009lite' ...
# ** data
# *** moving datasets to lazyload DB
# Error : cannot allocate buffer
# ERROR: lazydata failed for package 'PISA2009lite'
# * removing 'C:/Users/Mark/Documents/R/win-library/3.0/PISA2009lite'
And on 2nd attempt:
# Installing github repo(s) PISA2009lite/master from pbiecek
# Downloading PISA2009lite.zip from https://github.com/pbiecek/PISA2009lite/archive/master.zip
# Error in function (type, msg, asError = TRUE) :
# SSL read: error:00000000:lib(0):func(0):reason(0), errno 10054
Having installed PISA2003lite to the lazyload DB, it seems PISA2009lite is too large to be allocated.
I've tried to load student2009lite having closed the PISA2003lite project, but even that won't work with 12GB RAM (on a new Windows 8, 3.4GHz i7 machine).
Perhaps using student2009dict is a shortcut, but is a student2003dict really required to avoid reliance on student2009lite? Or is there another approach to select subsets of data from each edition of PISA?
@mark: sorry for the confusion.
The entry was published two months ago and since then the PISA2003lite package has changed a bit. I hope this is what caused the problem.
I've updated the blog entry.
The presented examples should be much faster and [most important] should work with the current version of PISA2003lite.
Please let me know if there are any further problems.
i’m almost done writing code to analyze pisa in monetdb + r. this will prevent ram from overloading and correctly calculate standard errors (as published by oecd). here’s my (unfinished) code..
https://github.com/ajdamico/usgsd/tree/master/Program%20for%20International%20Student%20Assessment
note the first script will download the 2000, 2003, 2006, and 2009 data sets directly into monetdb..no user intervention required. after that, you should be able to analyze it like any other multiply-imputed object in the survey package 🙂
send me a note offline? ajdamico@gmail.com
@Anthony: I was considering SQLite but found out that it is easier for the end user to download just one big rda file once. But monetdb looks like an interesting alternative to speed up the processing.
Sending the letter in a few minutes.
And asdfree.com looks great!
thank you, I’ve sent an e-mail.
I see a couple of changes, so now it does run without error, or dependence on PISA2009lite. 🙂
Is the best approach to install MonetDB, then run through your scripts on github?
https://github.com/ajdamico/usgsd/tree/master/Program%20for%20International%20Student%20Assessment
Dear SmarterPoland,
I got a very similar error when I was installing PISA2009lite:
————————————————————–
* installing *source* package 'PISA2009lite' ...
** data
** moving datasets to lazyload DB
Error : cannot allocate buffer
ERROR: lazydata failed for package 'PISA2009lite'
* removing 'C:/root/R/R-2.15.3/library/PISA2009lite'
Error: Command failed (1)
In addition: Warning message:
package ‘’ is not available (for R version 2.15.3)
——————————————————————-
What do you think is the cause of the buffer error? How much memory does your laptop have, and which R version are you using? Thank you!
@lightest
With 4GB of RAM I have similar problems on Windows, while it works on OSX/Linux.
So I am rewriting it to make the loading process lighter.
The workaround is to download the datasets as rda files from this website:
https://github.com/pbiecek/PISA2009lite/tree/master/data
and load them directly into R with the load() function.
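A minimal sketch of that workaround, assuming the rda file has already been downloaded by hand. To keep the example runnable, a small stand-in file is written to a temporary directory first; `toy` and `toy2009.rda` are made-up names, and the real files under data/ will be named differently.

```r
# Stand-in for a downloaded file (hypothetical name and content)
toy <- data.frame(id = 1:3, score = c(500, 520, 480))
rda_path <- file.path(tempdir(), "toy2009.rda")
save(toy, file = rda_path)
rm(toy)

# Load into a separate environment so nothing in the workspace is overwritten;
# load() returns the names of the objects it restored
pisa_env <- new.env()
loaded <- load(rda_path, envir = pisa_env)

loaded
dim(pisa_env[[loaded[1]]])
```

Loading into its own environment also makes it easy to drop one dataset with rm() before loading the next one, which may help on machines that are short on memory.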