Tuesday, February 10, 2009

The reliability of marking on a psychology degree. Christopher Dracup. British Journal of Psychology, London: Nov 1997, Vol. 88, Part 4, p. 691, 18 pgs.

The reliability of marking for the final cohort of students to graduate from the psychology degree scheme in place at the University of Northumbria at Newcastle between 1985 and 1993 was investigated. Inter-marker correlations for some course components were low, but the correlation between students' overall first marks and their overall second marks was .93, a value in keeping with those typically reported for national school examinations. The reliability of a student's overall agreed mark was estimated to be .96 and the standard error of measurement to be about 1 per cent. Further analyses went on to consider the influence of question and option choice on reliability, the representativeness of the cohort studied and the effects of agreeing marks rather than simply averaging first and second marks. Cronbach's alpha was proposed as a means of estimating reliability in the absence of second marking and was used to compare the reliability of first and second markers. The possibility of second marking the work only of those students who were classified as borderline on the basis of their first marks was discussed. The paper concludes with a reminder that reliability does not guarantee validity.
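Since the abstract proposes Cronbach's alpha as a way of estimating reliability in the absence of second marking, a minimal sketch of the standard alpha formula may help. The function and the marks below are purely illustrative; nothing here is data from the paper:

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a set of examinees' marks.

    scores: one inner list per examinee, holding that examinee's
    mark on each of k course components (hypothetical data only).
    alpha = k/(k-1) * (1 - sum of component variances / variance of totals)
    """
    k = len(scores[0])
    item_vars = [variance(row[i] for row in scores) for i in range(k)]
    total_var = variance(sum(row) for row in scores)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Perfectly consistent components give alpha = 1.0:
print(cronbach_alpha([[10, 10], [12, 12], [14, 14]]))
```

In practice the inner lists would hold each student's marks across the components of the degree scheme, so alpha can be computed from first marks alone, without any second marking.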

Reliability is a fundamental requirement of any assessment procedure. The greater the reliability of an assessment, the more certain we can be that observed differences between individuals on the assessment reflect real differences on whatever the assessment is measuring, rather than random error. Reliability does not guarantee validity: the fact that differences on an assessment do result from real differences between individuals does not guarantee that what it is measuring is what we want it to measure. However, the reliability of a test does set an upper bound on its possible validity. Classical test theory tells us that the correlation between individuals' observed scores on a measurement and their true scores on the underlying variable is equal to the square root of the reliability coefficient. It follows that the correlation between the observed scores and any criterion variable cannot be greater than this value. Hence the criterion validity of a measurement cannot exceed the square root of the reliability coefficient (Gulliksen, 1987, p. 33).
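The square-root bound above is easy to check numerically. A minimal sketch, using the paper's reported agreed-mark reliability of .96 (the function name is ours, not the paper's):

```python
import math

def max_criterion_validity(reliability: float) -> float:
    """Classical test theory: criterion validity cannot exceed
    the square root of the reliability coefficient."""
    return math.sqrt(reliability)

# With the reported agreed-mark reliability of .96, criterion
# validity can be at most about .98:
print(round(max_criterion_validity(0.96), 2))  # 0.98
```

Note that the bound is generous: a reliability of .96 still permits validities anywhere between 0 and .98, which is why the paper closes with the reminder that reliability does not guarantee validity.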

Many factors can contribute to the unreliability of an assessment: the particular sample of questions asked; the timing of the assessment, etc. One contributor, of particular concern to those interested in measuring educational attainment, is marker unreliability. If a marker is inconsistent in the way in which he or she allocates marks to examination answers, then some of the observed differences in the scores of those sitting the examination will not be due to real differences in the quality of their answers, but to the marker's inconsistencies. These issues become particularly important at grade boundaries where a small change in an examinee's score could lead to the award of a lower or higher grade (see Cresswell, 1986a, 1988). If such changes are likely to occur as a result of marker unreliability then we have cause for concern.

The issue of marker reliability has prompted a good deal of research into the marking of national school examinations (e.g. Murphy, 1978, 1979, 1982). Murphy (1982) reported inter-marker correlations as high as 1.00 for O-level mathematics and physics examinations, but as low as .80 for O-level English literature. The median inter-marker correlation of the 24 O- and A-level examinations studied was .93. Rather little research has been carried out by psychologists into the reliability of marking on degree schemes. Two notable exceptions are Laming (1990) and Newstead & Dennis (1994). Laming estimated the reliability of the assessment of the overall performance of two cohorts of students on 'a certain university degree' from inter-marker reliabilities calculated for each of five pairs of markers (each pair assessed one of the five sections into which the scheme was divided). His analysis used the methods of classical test theory and drew comparisons with the findings of research into the precision of absolute judgments. Newstead & Dennis attacked the issue from a different direction. Rather than studying an actual degree scheme, they asked a number of examiners to assess the answers of six students to a single question: 'Is there a language module in the mind?' From the range of the scores awarded to each student's answer, they were able to estimate the standard error of measurement for that question and extrapolate that estimate to the overall performance of students on a degree scheme. The two studies came to rather different conclusions. Laming, whose data relate more closely to an actual scheme, concluded that for one of the two years he considered:
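The standard error of measurement that Newstead & Dennis extrapolate can be sketched with the classical formula SEM = s·√(1 − r). Only the .96 reliability below comes from the paper; the standard deviation of 5 percentage points is an assumed figure chosen purely for illustration:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: s * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

# Illustrative only: an assumed mark standard deviation of 5
# percentage points (not a figure from the paper), combined with
# the reported reliability of .96, gives an SEM of about 1%,
# consistent with the abstract's estimate:
print(round(sem(5.0, 0.96), 2))
```

An SEM of about 1 per cent is what makes the grade-boundary concern concrete: a student whose agreed mark sits within a point or so of a classification boundary could plausibly fall on either side of it through measurement error alone.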

