Is Pisa fundamentally flawed?

News | Published in TES magazine on 26 July 2013 | By William Stewart

They are the world’s most trusted education league tables. But academics say the Programme for International Student Assessment rankings are based on a ‘profound conceptual error’. So should countries be basing reforms on them?

In less than five months, the most influential set of education test results the world has ever seen will be published. Leaders of the most powerful nations on Earth will be waiting anxiously to find out how they have fared in the latest Programme for International Student Assessment (Pisa).

In today’s increasingly interconnected world, where knowledge is supplanting traditional industry as the key to future prosperity, education has become the main event in the “global race”. And Pisa, the assessment carried out by the Organisation for Economic Co-operation and Development (OECD) every three years, has come to be seen as education’s most recognised and trustworthy measure.

Politicians worldwide, such as England’s education secretary Michael Gove, have based their case for sweeping, controversial reforms on the fact that their countries’ Pisa rankings have “plummeted”. Meanwhile, top-ranked success stories such as Finland have become international bywords for educational excellence, with other ambitious countries queuing up to see how they have managed it.

Pisa 2012 - due to be published on 3 December 2013 - will create more stars, cause even more angst and persuade more governments to spend further billions on whatever reforms the survey suggests have been most successful.

But what if there are “serious problems” with the Pisa data? What if the statistical techniques used to compile it are “utterly wrong” and based on a “profound conceptual error”? Suppose the whole idea of being able to accurately rank such diverse education systems is “meaningless”, “madness”?

What if you learned that Pisa’s comparisons are not based on a common test, but on different students answering different questions? And what if switching these questions around leads to huge variations in the all-important Pisa rankings, with the UK finishing anywhere between 14th and 30th and Denmark between fifth and 37th? What if these rankings - that so many reputations and billions of pounds depend on, that have so much impact on students and teachers around the world - are in fact “useless”?

This is the worrying reality of Pisa, according to several academics who are independently reaching some damning conclusions about the world’s favourite education league tables. As far as they are concerned, the emperor has no clothes.

Perhaps just as worrying are the responses provided when TES put the academics’ arguments to the OECD. On the key issue of whether different countries’ education systems are correctly ranked by Pisa, and whether these rankings are reliable, the answer is less than reassuring.

The sample data used mean that there is “uncertainty”, the OECD admits. As a result, “large variation in single (country) ranking positions is likely”, it adds.

The organisation has always argued that Pisa provides much deeper and more nuanced analysis than mere rankings, offering insights into which education policies work best. But the truth is that, for many, the headline rankings are the start and finish of Pisa and that is where much of its influence lies.

On other important questions, such as whether there is a “fundamental, insoluble mathematical error” at the heart of the statistical model used for Pisa, the OECD has offered no response.

Concerns about Pisa have been raised publicly before. In England, Gove’s repeated references to the country plunging down the Pisa maths, reading and science rankings between 2000 and 2009 have led to close scrutiny of the study’s findings.

Last year, the education secretary received an official reprimand from the UK Statistics Authority for citing the “problematic” UK figures from 2000, which the OECD itself had already admitted were statistically invalid because not enough schools took part.

The statistics watchdog also highlighted further criticisms made by Dr John Jerrim, a lecturer in economics and social statistics at the Institute of Education, University of London. He notes that England’s Pisa fall from grace was contradicted by the country’s scores in the rival Trends in International Mathematics and Science Study, which rose between 1999 and 2007. Jerrim’s paper also suggests that variations in the time of year that students in England took the Pisa tests between 2000 and 2009 could have skewed its results.

The OECD tells TES that this amounts to speculation, and that Jerrim looked only at tests within the UK and did not address Pisa’s main objective, which is to provide a “snapshot comparison” between 15-year-olds in different countries.

But it is the “snapshot” nature of Pisa - the fact that it looks at a different cohort of 15-year-olds every three years - that is one of the chief criticisms levelled at the programme by Harvey Goldstein, professor of social statistics at the University of Bristol in the South West of England.

“I was recommending 10 years ago to the OECD that it should try to incorporate longitudinal data, and it simply hasn’t done that,” Goldstein tells TES. He would like to see Pisa follow a cohort over time, as the government pupil database in England does, so that more causal relationships could be studied.

Delving deeper

Criticisms of Pisa’s scope are important. But the deeper methodological challenges now being made are probably even more significant, although harder to penetrate.

It should be no surprise that they have arisen. A fair comparison between more than 50 different education systems operating in a huge variety of cultures, which allows them to be accurately ranked on simple common measures, was always going to be enormously difficult to deliver in a way that everyone agrees with.

Goldstein notes concerns that questions used in Pisa tests have been culturally or linguistically biased towards certain countries. But he explains that when the OECD has tried to tackle this problem by ruling out questions suspected of bias, it can have the effect of “smoothing out” key differences between countries.

“That is leaving out many of the important things,” he warns. “They simply don’t get commented on. What you are looking at is something that happens to be common. But (is it) worth looking at? Pisa results are taken at face value as providing some sort of common standard across countries. But as soon as you begin to unpick it, I think that all falls apart.”

For Goldstein, the questions that Pisa ends up with to ensure comparability can tend towards the lowest common denominator.

“There is a risk to that,” admits Michael Davidson, head of the OECD’s schools division. “In a way, you can’t win, can you?” Nevertheless, Pisa still finds “big differences” between countries, he says, despite “weeding out what we suspect are culturally biased questions”.

Davidson also concedes that some of the surviving questions still “behave” differently in different countries. And, as we shall see, this issue lies at the heart of some of the biggest claims against Pisa.

Simple imperfection

The Pisa rankings are like any education league tables: they wield enormous influence but, because of their necessary simplicity, are imperfect. Where they arguably differ is in the lack of public awareness of these limitations.

The stakes are also high in England’s domestic performance tables, with the very existence of some schools resting on their outcomes. But in England the chosen headline measure - the proportion of students achieving five A*-C GCSEs including English and maths - is easily understandable. Its shortcomings - it comes nowhere near to capturing everything that schools do and encourages a disproportionate focus on students on the C-D grade borderline - are also widely known.

Pisa, by comparison, is effectively a black box, with little public knowledge about what goes on inside. Countries are ranked separately in reading, maths and science, according to scores based on their students’ achievements in special Pisa tests. These are representative rather than actual scores because they have been adjusted to fit a common scale, where the OECD average is always 500. So in the 2009 Pisa assessment, for example, Shanghai finished top in reading with 556, the US matched the OECD average with 500 and Kyrgyzstan finished bottom with 314.
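
How such a rescaling works can be sketched in a few lines. The snippet below assumes a simple linear transformation onto a mean-500 scale (with, by convention, a standard deviation of 100) and uses invented ability estimates contrived to echo the 2009 figures; the OECD’s actual scaling procedure is considerably more involved.

```python
# Minimal sketch of mapping raw ability estimates onto a reporting scale
# whose OECD average is 500. The linear transformation and all numbers
# are illustrative only, not the OECD's actual procedure.

def to_pisa_scale(theta, oecd_mean, oecd_sd, target_mean=500.0, target_sd=100.0):
    """Linearly rescale a raw ability estimate onto the reporting scale."""
    return target_mean + target_sd * (theta - oecd_mean) / oecd_sd

# Hypothetical raw ability estimates (in logits), contrived to land on
# the 2009 headline numbers quoted above.
for name, theta in [("top performer", 0.56), ("average performer", 0.0),
                    ("bottom performer", -1.86)]:
    print(name, round(to_pisa_scale(theta, oecd_mean=0.0, oecd_sd=1.0)))
# prints 556, 500 and 314 respectively
```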

You might think that to achieve a fair comparison - and bearing in mind that culturally biased questions have been “weeded out” - all students participating in Pisa would have been asked to respond to exactly the same questions.

But you would be wrong. For example, in Pisa 2006, about half the participating students were not asked any questions on reading and half were not tested at all on maths, although full rankings were produced for both subjects. Science, the main focus of Pisa that year, was the only subject that all participating students were tested on.

Professor Svend Kreiner of the University of Copenhagen, Denmark, has looked at the reading results for 2006 in detail and notes that another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions.

“This in itself is ridiculous,” Kreiner tells TES. “Most people don’t know that half of the students taking part in Pisa (2006) do not respond to any reading item at all. Despite that, Pisa assigns reading scores to these children.”
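
The rotated test design that produces this situation is easy to sketch. Below is a toy version with invented item clusters and booklets; the real Pisa design uses far more of both.

```python
# Toy rotated-booklet design: the item pool is split into clusters and
# each student sits only one booklet, i.e. a subset of clusters. All
# cluster and booklet definitions here are invented.
clusters = {
    "Reading-1": ["r1", "r2"], "Reading-2": ["r3", "r4"],
    "Maths-1": ["m1", "m2"], "Science-1": ["s1", "s2"],
}
booklets = [
    ["Reading-1", "Maths-1"],
    ["Reading-2", "Science-1"],
    ["Maths-1", "Science-1"],   # a booklet with no reading items at all
    ["Reading-1", "Reading-2"],
]

for i, booklet in enumerate(booklets, start=1):
    items = [q for cluster in booklet for q in clusters[cluster]]
    print(f"booklet {i}: {items}")
# Only booklet 4 contains every reading item, so most students answer
# some or none of them - yet every student ends up with a reading score.
```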

People may also be unaware that the variation in questions isn’t merely between students within the same country. There is also between-country variation.

For example, eight of the 28 reading questions used in Pisa 2006 were deleted from the final analysis in some countries. The OECD says that this was because they were considered to be “dodgy” and “had poor psychometric properties in a particular country”. However, in other countries the data from these questions did contribute to their Pisa scores.

In short, the test questions used vary between students and between countries participating in exactly the same Pisa assessment.

The OECD offered TES the following explanation for this seemingly unlikely scenario: “It is important to recognise that Pisa is a system-level assessment and the test design is created with that goal in mind. The Pisa assessment does not generate scores for individuals but instead calculates plausible values for each student in order to provide system aggregates.”

It then referred to an explanation in a Pisa technical report, which notes: “It is very important to recognise that plausible values are not test scores and should not be treated as such. They are random numbers drawn from the distribution of scores that could be reasonably assigned to each individual.” In other words, a large portion of the Pisa rankings is not based on actual student performance at all, but on “random numbers”.

To calculate these “plausible values”, the OECD uses something called the Rasch model. By feeding actual student scores into this statistical “scaling model”, Pisa’s administrators aim to work out a plausible version of what the scores would have been if all students in all countries had answered the same questions.
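
Those two ingredients can be sketched in a few lines of Python. This is an illustration rather than the OECD’s implementation: the Rasch response function below is the standard one, but the “posterior” is faked as a normal distribution, whereas the real machinery conditions on students’ actual responses and background data.

```python
import math
import random

def p_correct(theta, b):
    """Rasch model: probability that a student of ability theta answers
    an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def plausible_values(posterior_mean, posterior_sd, n=5):
    """Crude stand-in for Pisa's plausible values: random draws from a
    distribution of abilities consistent with what is known about the
    student. The normal posterior here is assumed for illustration."""
    return [random.gauss(posterior_mean, posterior_sd) for _ in range(n)]

print(round(p_correct(theta=1.0, b=0.0), 2))        # ~0.73
print(plausible_values(posterior_mean=0.4, posterior_sd=0.3))
# Five random numbers, none of which is a test score - exactly the point
# the technical report makes.
```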

Inside the black box

The Rasch model is at the heart of some of the strongest criticisms being made of Pisa. It is also the black box within Pisa’s black box: exactly how the model works is something that few people fully understand.

But Kreiner does. He was a student of Georg Rasch, the Danish statistician who gave his name to the model, and has personally worked with it for 40 years. “I know that model well,” Kreiner tells TES. “I know exactly what goes on there.” And that is why he is worried about Pisa.

He says that for the Rasch model to work for Pisa, all the questions used in the study would have to function in exactly the same way - be equally difficult - in all participating countries. According to Kreiner, if the questions have “different degrees of difficulty in different countries” - if, in technical terms, there is differential item functioning (DIF) - Rasch should not be used.

“That was the first thing that I looked for, and I found extremely strong evidence of DIF,” he says. “That means that (Pisa) comparisons between countries are meaningless.”
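
What a check for DIF looks for can be illustrated crudely: estimate each item’s difficulty separately in each country and see how far the estimates disagree. The sketch below uses invented responses and a raw-percentage proxy; real DIF analysis conditions on student ability.

```python
import math

def difficulty_proxy(responses):
    """Crude difficulty proxy: the log-odds of answering incorrectly.
    Real DIF analysis conditions on ability; this is only a sketch."""
    p = sum(responses) / len(responses)
    p = min(max(p, 0.01), 0.99)          # guard against log(0)
    return math.log((1.0 - p) / p)

# Hypothetical responses (1 = correct) to the same item in two countries.
country_a = [1, 1, 1, 0, 1, 1, 0, 1]     # the item looks easy here
country_b = [0, 1, 0, 0, 1, 0, 0, 1]     # the same item looks hard here

gap = difficulty_proxy(country_b) - difficulty_proxy(country_a)
print(f"difficulty gap: {gap:.2f} logits")
# A persistent gap like this is differential item functioning: the item
# does not behave the same way in both countries.
```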

Of course, as already stated, the OECD does seek to “weed out” questions that are biased towards particular countries. But, as the OECD’s Davidson admits, “there is some variation” in the way that questions work in different countries even after that weeding out. “Of course there is,” he says. “It would be ridiculous to expect every item (question) to behave exactly the same. What we work to do is to minimise that variation.”

But Kreiner’s research suggests that the variation is still too much to allow the Rasch model to work properly. In 2010, he took the Pisa 2006 reading test data and fed them through the Rasch model himself. He said that the OECD’s claims did not stand up because countries’ rankings varied widely depending on the questions used. That meant the data were unsuitable for Rasch and therefore Pisa was “not reliable at all”.

“I am not actually able to find two items in Pisa’s tests that function in exactly the same way in different countries,” Kreiner said in 2011. “There is not one single item that is the same across all 56 countries. Therefore, you cannot use this model.”

The OECD hit back with a paper the same year written by one of its technical advisers, Ray Adams, who argued that Kreiner’s work was based only on analysis of questions in small groups selected to show the most variation. The organisation suggested that when a large pool of questions was used, any variations in rankings would be evened out.

But Kreiner responded with a new paper this summer that broke the 2006 reading questions down in the same groupings used by Pisa. It did not include the eight questions that the OECD admits were “dodgy” for some countries, but it still found huge variations in countries’ rankings depending on which groups of questions were used.

In addition to the UK and Denmark variations already mentioned, the different questions meant that Canada could have finished anywhere between second and 25th and Japan between eighth and 40th. It is, Kreiner says, more evidence that the Rasch model is not suitable for Pisa and that “the best we can say about Pisa rankings is that they are useless”.
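
The effect Kreiner describes is easy to reproduce in miniature: score a handful of invented “countries” on every possible subset of a small question pool and watch the rankings move. All numbers below are made up; the point is only that the choice of questions by itself can swing ranks.

```python
import itertools
import random

# Invented per-question scores for five "countries" on an 8-question pool.
random.seed(1)
countries = {c: [random.random() for _ in range(8)] for c in "ABCDE"}

def ranking(items):
    """Rank countries by total score on the chosen questions."""
    totals = {c: sum(scores[i] for i in items) for c, scores in countries.items()}
    ordered = sorted(totals, key=totals.get, reverse=True)
    return {c: pos for pos, c in enumerate(ordered, start=1)}

ranks = {c: [] for c in countries}
for items in itertools.combinations(range(8), 4):   # every 4-question subset
    for c, pos in ranking(items).items():
        ranks[c].append(pos)

for c in sorted(countries):
    print(f"{c}: best rank {min(ranks[c])}, worst rank {max(ranks[c])}")
# Even with made-up data, ranks can swing by several places depending
# purely on which questions are counted.
```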

According to the OECD, not all Rasch experts agree with Kreiner’s position that the model cannot be used when DIF is present. It also argues that no statistical model will exactly fit the data anyway, to which Kreiner responds: “It is true that all statistical models are wrong. But some models are more wrong than other models and there is no reason to use the worst model.” Kreiner further accuses the OECD of failing to produce any evidence to prove that its rankings are robust.

The organisation says that researchers can request the data. But, more significantly, it has now admitted that there is “uncertainty” surrounding Pisa country rankings and that “large variation in single ranking positions is likely”.

It attributes this to “sample data” rather than the unsuitability of its statistical model. But the variation in possible rankings quoted by the OECD, although smaller than Kreiner’s, may still raise eyebrows. In 2009, the organisation said the UK’s ranking was between 19th and 27th for reading, between 23rd and 31st for maths, and between 14th and 19th for science.

Serious concerns about the Rasch model have also been raised by Dr Hugh Morrison from Queen’s University Belfast in Northern Ireland. The mathematician doesn’t just think that the model is unsuitable for Pisa - he is convinced that the model itself is “utterly wrong”.

Morrison argues that at the heart of Rasch, and other similar statistical models, lies a fundamental, insoluble mathematical error that renders Pisa rankings “valueless” and means that the programme “will never work”.

He says the model insists that when students of the same ability answer the same question - in perfect conditions, when everything else is equal - some students will always answer correctly and some incorrectly. But Morrison argues that, in those circumstances, the students would by definition all give a correct answer or would all give an incorrect one, because they all have the same ability.
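
Stated in the model’s own terms (this is the standard Rasch item response function, not Morrison’s wording), the behaviour he objects to is the following:

```latex
% Rasch item response function: the probability that a student of
% ability \theta answers an item of difficulty b correctly.
\[
  P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}
\]
% When \theta = b this gives P = 1/2: of many students whose ability
% exactly matches the question's difficulty, half are expected to answer
% correctly and half incorrectly. It is this irreducibly probabilistic
% reading of ability that Morrison rejects.
```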

“This is my fundamental problem, which no one can solve as far as I can see because there is no way of solving it,” says the academic, who wants to give others the chance to refute his argument mathematically. “I am a fairly cautious mathematician and I am certain this cannot be answered.”

Morrison also contests Rasch because he says the model makes the “impossible” claim of being able to measure ability independently of the questions that students answer. “Consider a GCSE candidate who scores 100 per cent in GCSE foundation mathematics,” he tells TES. “If Einstein were to take the same paper, it seems likely that he, too, would score 100 per cent. Are we to assume that Einstein and the pupil have the same mathematical ability? Wouldn’t the following, more conservative claim, be closer to the truth: that Einstein and the pupil have the same mathematical ability relative to the foundation mathematics test?”

When TES put Morrison’s concerns to the OECD, it replied by repeating its rebuttal of Kreiner’s arguments. But Kreiner makes a different point and argues against the suitability of Rasch for Pisa, whereas Morrison claims that the model itself is “completely incoherent”.

After being forwarded a brief summary of Morrison’s case, Goldstein says that he highlights “an important technical issue”, rather than the “profound conceptual error” claimed by the Belfast academic. But the two do agree in their opinion of the OECD’s response to criticism. Morrison has put his points to several senior people in the organisation, and says that they were greeted with “absolute silence”. “I was amazed at how unforthcoming they were,” he says. “That makes me suspicious.”

Goldstein first published his criticisms of Pisa in 2004, but he says they have yet to be addressed. “Pisa steadfastly ignored many of these issues,” he says. “I am still concerned.”

The OECD tells TES: “The technical material on Pisa that is produced each cycle seeks to be transparent about Pisa methods. Pisa will always be ready to improve on that, filling gaps that are perceived to exist or to remove ambiguities and to discuss these methodologies publicly.”

But Kreiner is unconvinced: “One of the problems that everybody has with Pisa is that they don’t want to discuss things with people criticising or asking questions concerning the results. They didn’t want to talk to me at all. I am sure it is because they can’t defend themselves.”

What’s the problem?

“(Pisa) has been used inappropriately and some of the blame for that lies with Pisa itself. I think it tends to say too much for what it can do and it tends not to publicise the negative or the weaker aspects.” Professor Harvey Goldstein, University of Bristol, England

“We are as transparent as I think we can be.” Michael Davidson, head of the OECD’s schools division

“The main point is, it is not up to the rest of the world to show they (the OECD) are wrong. It is up to Pisa to show they are right. They are claiming they are doing something and they are getting a lot of money to do it, and they should support their claims.” Professor Svend Kreiner, University of Copenhagen, Denmark

“There are very few things you can summarise with a number and yet Pisa claims to be able to capture a country’s entire education system in just three of them. It can’t be possible. It is madness.” Dr Hugh Morrison, Queen’s University Belfast, Northern Ireland


Comments (11)

  • Having spent time in Finland on an Interskola (http://www.interskola.net/) conference delving into their education system and why Finland so often tops the PISA table, it was surprising to find that in so many ways they did not appear to be anywhere near the level of UK schools. We looked at primary, secondary and tertiary educational establishments and had lectures about the system from Finnish educational luminaries.

    Our homespun conclusion is that Finland does well in the PISA tests as their education system is geared towards doing well in the PISA tests. Ask most UK teachers, parents or just ordinary citizens what PISA is and they will say it is an Italian dough-based meal with tomato and cheese as a topping!

    Mr Gove needs to listen to UK educators like us before taking the PISA league table so much to heart - we are doing a very, very good job in the UK and we do NOT want to be dragged down to the level of scrapping for places in a test-driven league! Ask Finland or the other 'table-topping' European countries to sit some standard UK GCSE exams and see how they fare...

    markjohnnewman, 17:40, 26 July 2013

  • In PISA 2015, the proposed way of assessing collaborative problem solving is by testing a group of one pupil and one software agent. This is likely to become another area where the results of the PISA scores will be irrelevant and unrealistic.

    http://www.oecd.org/pisa/pisaproducts/Draft%20PISA%202015%20Collaborative%20Problem%20Solving%20Framework%20.pdf


  • Your site design is flawed and discourages participation. I started by typing my comment, and when I had registered and was redirected to this page, my comment was gone. A waste of time.


  • Very interesting post and I see John Jerrim has responded with his own post on the IoE blog. Here's a link to one of mine which references PISA: http://behrfacts.com/2013/06/12/while-students-sit-their-exams-politicians-tell-them-they-arent-hard-enough-whoisthecleverone/ .

    nico_ursman, 12:04, 31 July 2013

  • It is revealing that the TES persist in posing a rhetorical question over whether or not OECD PISA is fundamentally flawed. Those in a position to counter Dr Morrison's paper, in which the term "fundamental flaw" arose in relation to PISA's use of the Rasch model, have been silent on the mathematical problem. The matter has therefore been settled - PISA is flawed. What benefit can accrue to the TES in their inexplicable efforts to rescue OECD PISA from a problem of their own making?

    Stevemayman, 15:17, 11 August 2013

  • markjohnnewman writes above that "Finland does well in the PISA tests as their education system is geared towards doing well in the PISA tests".

    As a Finn I can say that this claim is nonsense, unless what is meant is that there is a strong focus on problem solving and problem-based teaching in mathematics and science in Finland, which helps Finnish students to perform well in the PISA tests. The Finnish education system was built without trying to be the best one. Furthermore, competition between schools was not part of the plan. Instead the focus has been on creating good schools for all children. However, paradoxically, Finnish students perform well in international studies, and the 2010 McKinsey report rated the Finnish education system as one of the best in the world.
    Compared with some Asian education systems, which also do very well in the comparisons, the Finnish system is very different: it has none of the extensive cramming, rote learning, long school days and heavy homework load found in those systems. Furthermore, Finnish children start school at the age of seven, years later than in some other countries.

    The Finnish education system was rebuilt and modernized in the 1970s based on the Nordic strategy for achieving equality and excellence in education based on constructing a publicly funded comprehensive school system without selecting, tracking, or streaming students during their common basic education (source Wikipedia).

    The Finnish Ministry of Education attributes its success to "the education system (uniform basic education for the whole age group), highly competent teachers, and the autonomy given to schools."

    If you are interested in why the Finnish education system performs well, there is a book and articles (in English) about it written by a Finnish professional in education, Pasi Sahlberg (or google “Finnish pisa results”). However, if you are looking for a “silver bullet” then you will likely be disappointed. It is a “total package”, not just a bag of separate tricks.

    Sahlberg mentions that generally two-thirds of the success (or failure) is due to matters outside the education system (the teachers and the schools), like family background and motivation to learn. This doesn’t mean that the teachers or the schools have no effect - they do - but it might be less than you would think at the outset. He also thinks that if, for instance, all the teachers in Indiana, USA, were replaced by the great Finnish teachers (for five years, and assuming they would become fluent in English) and all other things were left as they are, the improvement in the results would be marginal. Furthermore, by the end of that period many of the Finnish teachers would be doing something other than teaching (like their US counterparts, too).

    He also compares schools to sports teams, where the team can be either stronger or weaker than the sum of the individual players (the teachers in schools) depending on how well they work together. In his opinion teachers in schools should be given the opportunity, and be encouraged, to work as teams in order to get the best results.

    The OECD report mentions the following points about the Finnish success:
    “political consensus to educate all children together in a common school system; an expectation that all children can achieve at high levels, regardless of family background or regional circumstance; single-minded pursuit of teaching excellence; collective school responsibility for learners who are struggling; modest financial resources that are tightly focused on the classroom and a climate of trust between educators and the community”

    I would like to mention just a couple of additional personal thoughts concerning reading performance:

    The Finnish language is very regular in spelling (phonetic), unlike English, which is “approaching Chinese” in this respect. This makes it difficult for English schoolchildren to learn to spell correctly and consequently also difficult to learn to read English text (relative to Finnish). As an example, it would be possible to build a speech synthesizer (a device which converts text to speech) with unlimited vocabulary for Finnish, but not for English. For English, the pronunciation of every single word needs to be taught to the synthesizer (but not for Finnish, due to the regularity of spelling). Another example is that spelling contests for schoolchildren would be a joke in Finland, for the same reason.
    Another thing is that foreign TV programmes are not dubbed but have subtitles, which gives children regular reading practice.

    The Interskola reference cited says the following (among others):

    The Interskola Network seeks, by stimulating international discussion, to explore and promote:
    • education in rural and sparsely populated areas

    FYI, most Finns don’t live in rural areas; about 84% live in urban areas. Furthermore, when a foreigner observes the Finnish education system, then in addition to observing what features it has, he/she should be able to see what it doesn’t have compared with other systems. In fact the latter might be as important for the success as the former.

    Finland has not adopted the standardised testing systems (like the UK GCSE exams) that are common in many other countries. The danger in that kind of system is that teaching is focused on getting good scores in the test instead of on learning well (teaching to the test), in addition to the reported test-cheating scandals worldwide (USA, Indonesia). Sahlberg sees these as symptoms of GERM (Global Education Reform Movement) infection.
    Bringing market-style competition into the education system, with the idea that market mechanisms in education would allow equal access to high-quality schooling for all, is another symptom.

    I also happened to read some BBC news about how some UK schools cheat in the UK GCSE exams. This kind of cheating would be unheard-of in Finland (very counter-productive practice geared at getting some “good” test results instead of learning).

    Note that there are tests (exams) in the Finnish schools, too. The difference is that each school and teacher is allowed to create their own tests.

    There was a report in the New York Times about cheating which concluded: “Never before have so many had so much reason to cheat. Students’ scores are now used to determine whether teachers and principals are good or bad, whether teachers should get a bonus or be fired, whether a school is a success or failure.”

    Finally, complacency, not to mention arrogance, is not a good starting point for improvement in any organization/system.

    Pisara, 13:49, 29 September 2013

  • Let’s try to analyze the complaints made by the professors using common sense.

    Wikipedia mentions the following about the Rasch model:
    “The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between (a) the respondent's abilities, attitudes or personality traits and (b) the item difficulty. For example, they may be used to estimate a student's reading ability.”

    So at least the writer of the Wikipedia text seems to disagree with Morrison about whether the model can be used for things like estimating students’ reading abilities.

    Morrison argues that if a student has (just enough of) the ability to solve a problem, then he will always solve it. Thus, if we take 1,000 such students, they will all solve the problem, contrary to the Rasch model, which assumes that only 50% will solve it. Morrison may be correct for very simple tasks, like adding or multiplying two numbers: you either know how to do it or you don’t. However, for more complex tasks the Rasch model assumption appears to be more correct.

    As an example, let’s assume that I have a problem to solve, which I can under normal circumstances solve in one minute. Let’s then assume that I am a bit tired, so that my ability to solve the problem has decreased a bit. Thus it may take two minutes for me to solve the problem due to the decreased ability. Morrison claims that I cannot solve the problem at all because my ability has fallen below the threshold needed. IMO he is clearly wrong.
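
    The contrast between these two readings - a hard threshold versus the Rasch model’s smooth curve - can be sketched with invented numbers:

    ```python
    import math

    def rasch(theta, b):
        """Rasch reading: success probability rises smoothly with ability."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def threshold(theta, b):
        """Morrison-style reading: you either can solve it or you can't."""
        return 1.0 if theta >= b else 0.0

    # Invented abilities around an item of difficulty b = 0.
    for theta in (-1.0, -0.1, 0.0, 0.1, 1.0):
        print(f"theta={theta:+.1f}  rasch={rasch(theta, 0.0):.2f}  "
              f"threshold={threshold(theta, 0.0):.0f}")
    # The Rasch curve lets a slightly tired solver still succeed some of
    # the time; the threshold reading flips abruptly from never to always.
    ```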

    Furthermore, IMO Morrison is wrong in claiming that mathematics should be used for proving or disproving whether the Rasch model is applicable in cognitive/behavioural science. Cognitive/behavioural science is not mathematics, so trying to solve this kind of problem only by mathematical means is “useless”, to use his own word. Mathematics cannot even solve all problems in its own field (see for example “List of unsolved problems in mathematics” or Gödel's incompleteness theorems in Wikipedia).

    It would be interesting to hear Kreiner’s opinion about Morrison’s claims.

    Kreiner says that it is “ridiculous” that only half of the students took the reading test and only 40% answered all the 28 questions. This may be true if there is no understandable reason for it, like lack of resources, time or funding. However, the fact that fewer students participated doesn’t make the results unusable; it simply increases the uncertainty. This is similar to an opinion poll with 2,500 or 5,000 participants. The latter gives more reliable results, but the former is not useless either.

    IMO Kreiner seems to be a bit obsessed with trying to find the “absolute” truth by insisting that the questions should have “equal difficulty” in different countries. It is clear that there is no absolute truth, only the truth within the Pisa framework. For instance, Finland was ranked near the top (3rd) in one of two different competitiveness studies and 20th in the other. So what’s the absolute truth in this case? Different criteria produce different outcomes; it’s as simple as that. Similarly, if the TIMSS is used instead of the PISA test, the outcome may be different because these tests may measure different things, like problem solving in new situations versus rote learning. There are countries that perform well in the TIMSS but not so well in the PISA, for instance Russia.

    Furthermore, it is clear that by including only some subset of the questions, the uncertainty will increase. In addition, it seems to me that if only very easy questions were included, then even if two countries originally had a similar ranking, the country with fewer students in the lowest performance category would get a better ranking as a result. A similar change would occur by using only very difficult questions, or questions in any other category for that matter. Thus I find it highly suspicious to radically reduce the number of included questions and then use the “result” as “proof” of some opinions about the Pisa methodology.

    I find it a bit puzzling that Kreiner (or TES?) claims that only 10% of the students answered all the 28 reading questions. According to the text above, about 50% answered reading questions and of those 40% answered only 14 questions: this means 60% of those answered all 28 questions. Thus 0.6 times 0.5 = 0.3, or 30% of all students answered all the reading questions. So the distribution was:
    - 20% answered only 14 questions
    - 30% answered all the 28 questions
    - 50% didn’t answer any question
    - Which makes 100% in total

    Based on what I wrote above, I think that the decision to reduce the number of reading questions to 14 for 20% of the students is a somewhat questionable idea.

    Thus it is probable that Pisa is not perfect, and it could be made better if the money and resources for that exist.

    It is interesting to consider some alternative method, like the (national) Matriculation Examination used in Finland. There the questions are assigned different numbers of points based on difficulty. For instance, for answering an easy question correctly a student gets 1 point, for a slightly more difficult question 2 points are assigned, etc., until finally a very difficult question might be worth 10 points. So would this kind of method be fairer? I don’t think so. Some professors could argue that, say, a 5-point question should actually be worth 6 or 7 points in some countries and 3 or 4 points in others in order to have “absolute” fairness (the same difficulty in all countries). So probably the “order of the questions” (in difficulty) should be changed in some countries, or, if that is not possible, some scaling should be used. However, then some others would argue that this is nonsense and would only produce very artificial results. IMO the Rasch method produces an “average” fairness which is better than some “artificial” fairness set up by a committee.

    I would have expected Kreiner to indicate what method he considers the most suitable. It is much easier to criticise the current method than to come up with a better alternative.

    Pisara, 13:58, 29 September 2013

  • In setting out a case against my attack on PISA, Pisara (unwittingly) makes two very interesting points which undermine the PISA measurement methodology.

    Pisara focuses on my account of the profound conceptual paradox at the heart of the PISA model. In making this argument I posit a large sample of testees of the same ability, all providing the correct response to an item or all providing the incorrect response. It is suggested by Pisara that the paradox might have implications for my approach to ability measurement. But this isn’t so, for a very simple reason: the very idea of ascribing a definite ability to a single testee, let alone a sample of 1000, is meaningless because in my account, unmeasured ability cannot take definite values. Measurement, as I see it, renders the indefinite, definite.

    Secondly, Pisara invokes Gödel’s incompleteness theorem to argue that mathematical approaches have their limitations. It’s important to remember that the incompleteness theorem applies to formal systems and deals with the relationship between proof and truth in mathematics. It surely isn’t being suggested that humanity eschew mathematics but retain PISA? As it happens, I wrote about one of the implications of Gödel’s reasoning for PISA in a letter - see “PISA still has questions to answer” - to the TES of Friday 16 August 2013. The incompleteness theorem shows the difficulties which arise when one forgets the limitations of reasoning. One can only reason (even in mathematics) against a fiduciary framework, namely the extant customs and conventions of mathematical practice. Mathematical practice is the “background” against which the notion of mathematical truth gets its sense. PISA misconstrues measurement because it is founded (wittingly or unwittingly) on background-independence. The static background-independent measurement adopted by PISA isn’t measurement at all. (Einstein taught us that nothing acts without being acted on.) No advance in IRT or in PISA software can take account of practice. PISA involves an utterly nonsensical approach to measurement. The world would be a better place if PISA would simply admit that it cannot do what it claims to do.

    Dr H G Morrison

    hgmorrison, 4:05, 9 October 2013

    It appears that Pisara has lost interest in contributing to the commentary almost as suddenly as their delayed (one-month) response to the fundamental flaw article in the TES. Foolishly (s)he offered to solve the fundamental flaw using their own version of common sense. Perhaps (s)he can't cope with the reality that actually "PISA involves an utterly nonsensical approach to measurement."

    Stevemayman, 16:06, 27 October 2013

  • This article and the comments are so hilarious.
    To summarise:

    1) The OECD has commissioned a statistical survey of student abilities using standard published approaches.

    2) A Belfast mathematician, HG Morrison, claims to have discovered that the original statistical model from the 1960s is broken, and that therefore all use over the last 50 years is bogus.

    3) We don't know if Dr. Morrison is correct, an "eccentric", or is making an obscure but practically irrelevant point.

    4) We can't find (or understand) any peer reviewed published literature on the subject.

    5) We can't be bothered to ask any mathematicians who would know one way or another. Gosh this maths stuff is hard.

    6) Let's publish the article anyway.

    TES. Oh dear.

    The science in the Daily Mail is better sourced (and more "creative", to boot).

    buntsai, 19:29, 3 December 2013

  • Why don't you attack Professor David Spiegelhalter too? Read this http://understandinguncertainty.org/pisa-statistical-methods-more-detailed-comments.
    Remember that Spiegelhalter is a statistician. He describes Pisa thus:

    The statistical model used to generate the ‘plausible scores’ is demonstrably inadequate.

    Analysis using imputed ('plausible') data is not inherently unsound, provided (as PISA do) the extra sampling error is taken into account. But the vital issue is that the adjustment for imputation is only valid if the model used to generate the plausible values can be considered 'true', in the sense that the generated values are reasonably 'plausible' assessments of what that student would have scored had they answered the questions.

    A simple Rasch model is assumed by PISA, in which questions are assumed to have a common level of difficulty across all countries - questions with clear differences are weeded out as “dodgy”. But in a paper in Psychometrika, Kreiner has shown the existence of substantial “Differential Item Functioning” (DIF) - i.e. questions have different difficulty in different countries - and concludes that “the evidence against the Rasch model is overwhelming.”

    I agree with Svend Kreiner’s view that it is not possible to predict the effect of basing subsequent detailed analysis on plausible values from a flawed model.

    Stevemayman, 7:20, 4 December 2013
