A question of difficulty?

TES (Times Educational Supplement) - TALKING POINT

In just a few days’ time, we will finally know. After two years of waiting, two years of worrying and two years of wondering, the results of the new English and maths GCSEs will be revealed and teachers will finally have some clarity. The transition to these new specifications – and the new grading system – has been much criticised by teachers. In one of the most-read blogs on the Tes website this year, Chris Curtis, a head of English, explained the problems caused by teachers having to guess what a 4, 5, 6 or any other grade would look like. How could you decide whether to put a student in for a higher or foundation paper in maths? How could you, in good faith, judge where a student was in terms of their progress? And how could you report progress accurately to senior leadership teams? (See “‘All aboard the Titanic catastrophe of the new GCSEs’ – an English teacher’s warning from the frontline”, bit.ly/titanicgcses).

“Consistency” was “missing in action”, he wrote, and teachers flocked to the comment section to add their own woes to the chorus. Like Curtis, they stressed that the uncertainty around the new “more difficult” exams was hampering their ability to teach and disadvantaging students.

Pushing the boundaries

Of course, teachers never do have advance clarity over grade boundaries or exam difficulty. But, in the past, years of experience of the exams meant that teachers knew what, for example, a C-grade piece of work looked like. This year, though, the combination of different exam specifications and a completely new grading system has complicated matters.

Could things have been any different? In particular, could the exam boards have published grade boundaries in advance to give some certainty at a time of great change?

I would argue not. Grade boundaries can never be accurately revealed. And the reason why reveals a lot about exams and how they work – but also a surprising amount about how we learn, too.

Put simply, the reason why examiners can’t tell you grade boundaries in advance is that they can’t tell you how difficult an exam is in advance.

At first, this might sound a bit ridiculous. After all, as one teacher said to me, “Isn’t that what examiners are paid to do?”

But it turns out that predicting the difficulty of individual exam questions, or of whole exam papers, is impossible to do with a high degree of precision. So predicting grade boundaries is, as a result, impossible, too. This is because relatively small changes in the surface structure of a question – how it’s worded, for example – can have a huge impact on how difficult pupils find it. Imagine trying to write a series of questions that all test whether pupils can add two-digit numbers, and that are all of equal difficulty.

Is “10+10” as difficult as “80+80”? What about “83+12” or “87+18”?

If you gave all of those questions to the same group of pupils, would you expect the same success rate for each question? Probably not.

And even smaller changes than that can have a significant impact: is “11+3” as difficult as “3+11”? If you use your fingers to do addition, perhaps not.

Word problems are equally tricky. Consider the following two problems:

A. Joe had three marbles. Then Tom gave him five more marbles. How many marbles does Joe have now?

B. Joe has three marbles. He has five marbles fewer than Tom. How many marbles does Tom have?

One study showed that 97 per cent of pupils got question A right, but only 38 per cent answered question B correctly (detailed in Kevin Durkin and Beatrice Shire’s 1991 book, Language in Mathematical Education).

Of course, there is a lot of research on why pupils find certain questions harder than others, and we can use this research to make broad predictions about difficulty. But that still doesn’t solve our problem. Even if we are fairly certain that one question is more difficult than another, it’s hard to predict how much more difficult it is. For example, most people would predict that pupils would find the word “cat” easier to spell than the word “definitely”. But by how much?

Similarly, look at the following questions:

A. Which is bigger: 3/7 or 5/7?

B. Which is bigger: 5/7 or 5/9?

Most teachers predict, correctly, that more pupils will get question A right than will get question B right. But very few can exactly predict the percentage of pupils getting each right. In one study, 90 per cent of 14-year-olds got the first question right, but only 15 per cent got the second one right (as quoted by educationalist Dylan Wiliam in his 2014 publication, Principled Assessment Design).
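In practice, then, difficulty is something examiners measure after the event rather than predict: once pupils have answered a question, its “facility” – the proportion who got it right – can be calculated directly. The snippet below is a minimal sketch of that calculation in Python; the data are invented, shaped to echo the fraction-comparison figures above, and the function is illustrative rather than anything an exam board actually uses.

```python
# Illustrative only: measure the "facility" (proportion correct) of each
# question from pupils' actual responses. The data are invented to echo
# the fraction-comparison example above (roughly 90% vs 15% correct).

def facility(responses):
    """Return the proportion of correct answers in a list of True/False."""
    return sum(responses) / len(responses)

# Hypothetical responses from 20 pupils (True = answered correctly)
question_a = [True] * 18 + [False] * 2    # "Which is bigger: 3/7 or 5/7?"
question_b = [True] * 3 + [False] * 17    # "Which is bigger: 5/7 or 5/9?"

print(f"Question A facility: {facility(question_a):.0%}")  # 90%
print(f"Question B facility: {facility(question_b):.0%}")  # 15%
```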

Most of these examples are maths questions, but this problem is, if anything, even more acute in other subjects. After all, maths is typically thought of as a fairly objective subject where answers can be marked as either right or wrong. Judging the difficulty of questions is even trickier when you have questions that attempt to assess the originality or the creativity of a pupil’s writing.

For example, the difficulty of unseen reading tests depends to a large extent on the vocabulary and background knowledge required for comprehension. Most English teachers will have stories to tell about how one tricky word in an unseen text can leave pupils completely flummoxed. I can remember two classes struggling with a past GCSE paper for which knowing the meaning of the word “glacier” was vital to understanding the text. When they took a past paper in which the text was of “equivalent” reading difficulty but about a more familiar topic, they did much better.

Small change, big impact

Why are there such differences in success rates between questions that are supposed to test the same thing? Why do small surface changes have a big impact?

It is likely to be because we think and reason in concrete ways. All of us, not just young children, find it hard to transfer knowledge and skills to new problems. Even if the “deep structure” of a problem stays the same, if enough of the surface features change then we will find that problem more or less challenging.

In a low-stakes exam, this issue is not quite so significant, because you can keep the questions exactly the same from one sitting to the next.

The pupils taking the exam in 2014 take the same exam as the pupils in 2013, and so their scores can be compared directly. You can, therefore, set grade boundaries that are consistent across time.

With low-stakes tests, you can also trial different versions of tests with the same pupils, to see just how comparable they are. A group of pupils might score 55 per cent on one version, but 60 per cent or 65 per cent on another version.

But this approach clearly won’t work for high-stakes exams, which have to be changed from one year to the next. Examiners who create high-stakes tests, such as GCSEs and A levels, are caught in something of a bind.

They have to change the questions from year to year, but doing so changes the difficulty in unpredictable ways. At its simplest, that is why grade boundaries have to change: because the questions change.

So how do we know that a grade 4 this year will be comparable with a grade 4 next year? Or, indeed, that a “pass” from last year will be comparable with a “pass” this year?

The big challenge for examiners is to come up with an accurate and precise way of measuring exactly how difficult different papers are relative to each other, so they can create grade boundaries that represent a consistent standard from year to year. This is a perennial challenge, but when exam specifications change as well, as they have done this year, that adds more complexity.

Statistics may hold the answers. Although GCSE examiners can’t trial tests in advance and see how pupils do on them, they can use statistics in other ways.

They can wait until pupils have taken the exam, see how they perform, and then adjust accordingly. They can also use prior attainment information about the pupils taking the exam, so they can compare how similar pupils from different year groups perform on different tests.

Or they could set standards just by trying to judge how hard the exam is, with no help from statistics. The history of that approach, though, does not make such a move appealing.

New Zealand tried such a system in the early part of the century and found that the number of pupils achieving “excellent” in a maths exam varied from 5,000 one year to 70 the next.

For this first year of the new GCSEs, exam regulator Ofqual has come up with a very specific use of statistics that it will employ to ensure comparability between the last year group sitting the old exams and the first year group sitting the new exams. The results of these two year groups will be statistically linked at key grading points.

Broadly, the proportion of pupils getting a 4 and above on the new GCSEs will match the proportion who got a C and above on the old ones. Similarly, the proportion getting a 7 and above will match the proportion who got an A and above.

The statistical link is based on the prior attainment of the cohort at key stage 2. So if this year’s cohort has similar prior attainment to that of last year’s cohort, then about 70 per cent of 16-year-olds will get a 4 or above in English language and maths. About 16 per cent will get a 7 or above in English language, and 20 per cent will get a 7 or above in maths.

If the prior attainment of the cohorts is not the same, then the statistical link will still remain. But the headline pass rate might, therefore, rise or fall depending on the change in the profile of the cohorts.
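The mechanics of that link can be pictured with a short calculation. The sketch below is a simplified illustration in Python, not Ofqual’s actual procedure (which also adjusts the target proportions for differences in key stage 2 prior attainment): it takes a made-up distribution of raw marks for this year’s cohort and finds the marks at which last year’s cumulative proportions at grade C and grade A are reproduced, treating those marks as the grade 4 and grade 7 boundaries.

```python
# Simplified sketch of "comparable outcomes" boundary-setting.
# Not Ofqual's actual method: real awarding also adjusts the target
# proportions for changes in the cohort's key stage 2 prior attainment,
# and the mark distribution here is invented.

import random

random.seed(0)

# Hypothetical raw marks for this year's cohort on an 80-mark paper
raw_marks = [min(80, max(0, round(random.gauss(40, 14)))) for _ in range(50_000)]

# Proportions who reached each key grade (or better) last year
targets = {"grade 4 (old C and above)": 0.70, "grade 7 (old A and above)": 0.20}

def boundary_for(marks, proportion):
    """Highest raw mark reached by at least `proportion` of pupils."""
    ranked = sorted(marks, reverse=True)
    cutoff_index = int(len(ranked) * proportion) - 1  # last pupil inside the target
    return ranked[cutoff_index]

for grade, proportion in targets.items():
    boundary = boundary_for(raw_marks, proportion)
    achieved = sum(mark >= boundary for mark in raw_marks) / len(raw_marks)
    print(f"{grade}: boundary {boundary} marks "
          f"({achieved:.1%} of pupils at or above)")
```

If the cohort’s prior attainment profile changed, the 0.70 and 0.20 targets would be adjusted first, which is exactly why the headline pass rate can rise or fall.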

So, despite all the talk of uncertainty, we do actually know something about how the new grades will work this year. And we can predict something else with some confidence, too: if the exam is harder but grades have been set using a link to last year, then it is possible that quite low raw marks could lead to quite good grades.

Dumbing down?

If this happens, it is entirely likely that some newspapers will leap on this as evidence of “dumbing down”. It won’t be. Because it is so hard to know in advance the precise difficulty of a question, we cannot rely on the number of raw marks needed to pass as a sign that a test is easy or hard. On a very hard test, a low mark may be very impressive. On a very easy test, a high mark may not be nearly as impressive.

Another useful way that statistics can contribute to standard-setting is with reference tests. As we’ve seen, high-stakes examiners can’t trial questions to find out their difficulty. But they can look at how pupils perform on low-stakes reference tests where the questions stay exactly the same. If, over time, pupils with the same prior attainment start to do better on such questions, it’s evidence that pupils really are learning more at school – and that the proportion of pupils receiving good grades should increase.
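To see the kind of comparison a reference test allows (a rough sketch only, with invented figures and no claim about how Ofqual actually weights this evidence), one can track the average score of pupils with matched prior attainment on the same fixed questions year by year:

```python
# Sketch: track reference-test performance of successive cohorts that have
# been matched on prior attainment. All figures are invented.

# Mean score (%) on the fixed reference questions, by year
reference_scores = {2017: 54.0, 2018: 54.6, 2019: 56.1}

baseline_year = min(reference_scores)
baseline = reference_scores[baseline_year]

for year, score in sorted(reference_scores.items()):
    print(f"{year}: mean score {score:.1f}% "
          f"({score - baseline:+.1f} points vs {baseline_year})")

# A sustained rise for matched pupils would be evidence of genuine improvement,
# and grounds for letting the proportion of good grades rise with it.
```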

England’s first national reference test was held in March this year.

So why do we need all these statistics, and why can’t we rely on human judgement? Readers familiar with the work of Daniel Kahneman and other behavioural psychologists will know the answers to those questions: attempting to set consistent exam standards using human judgement is fiendishly hard, and an approach that uses statistics is more reliable. This is because, as Kahneman and other researchers have found, human judgement is prone to all kinds of biases and inconsistencies.

This is why, instead of being a reason to berate the exam boards, the lack of grade boundaries for the new GCSEs should actually be the catalyst for a much-needed discussion about where the domains of human judgement and statistics are best matched in education.

We assume that human judgement will inevitably be superior to an algorithm or statistics. This was certainly the feeling when Ofqual held its consultation in 2014 on how grades should be set. The great majority of the awarding organisations and subject associations that responded to the consultation recommended an approach that used statistics. But the majority of schools and teachers that responded preferred an approach based on judgement.

The disconnect between teachers and assessment experts here is not helpful for anyone. The risk with the current changes is that they will lead to further misunderstanding and confusion. The opportunity is that they will lead to schools and assessment organisations seeking to bridge this gap with better training and dialogue. And in doing so, there is potential for both groups to discover more not just about how we measure learning, but about how we learn, too.

Daisy Christodoulou is director of education at No More Marking and the author of Making Good Progress? and Seven Myths about Education. She tweets @daisychristo
