From The Concise Encyclopedia of Applied Linguistics, by Carol A. Chapelle
Raters
In order for raters to achieve a common understanding and application of a scale, training is necessary. Because the accepted standard for speaking assessment procedures involving high‐stakes decisions is an inter‐rater reliability coefficient of 0.80, rather than perfect agreement, some variability among raters is expected and tolerated. Under optimal conditions, the sources of error associated with the use of a scale are expected to be random rather than systematic.
One type of systematic error results from a rater's tendency to assign either harsh or lenient scores. When such a pattern emerges in comparison with the other raters in a pool, the rater may be flagged as negatively or positively biased. Systematic effects on score assignment have been found in association with rater experience and native language background, and also with examinee native language background (Ross, 1979; Brown, 1995; Chalhoub‐Deville, 1995; Chalhoub‐Deville & Wigglesworth, 2005; Winke, Gass, & Myford, 2011; Yan, 2014; Yan, Cheng, & Ginther, 2019). Every effort should be made to identify and remove biased raters, as their presence negatively affects the accuracy, utility, interpretability, and fairness of the scores we report (see Wind & Peterson, 2018).
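The studies cited above typically detect severity and leniency with many‐facet Rasch measurement; as a much cruder sketch of the underlying idea, the hypothetical example below flags raters whose mean scores across a shared set of performances depart markedly from the pool. All names, values, and the 1.5‐SD cutoff are illustrative assumptions, not a recommended operational procedure.

```python
from statistics import mean, pstdev

# Hypothetical mean scores six raters assigned across the same set of
# performances (illustrative values only).
rater_means = {
    "R1": 3.4, "R2": 3.5, "R3": 3.6,
    "R4": 3.5, "R5": 2.6, "R6": 4.4,
}

pool_mean = mean(rater_means.values())
pool_sd = pstdev(rater_means.values())

# Flag raters whose average departs from the pool by more than 1.5 SD
# (an arbitrary cutoff chosen for this sketch): a mean well below the
# pool suggests harshness, one well above it suggests leniency.
flags = {
    rater: ("harsh" if m < pool_mean else "lenient")
    for rater, m in rater_means.items()
    if abs(m - pool_mean) > 1.5 * pool_sd
}
print(flags)
```

A comparison like this only makes sense when raters score the same (or statistically equivalent) sets of performances; otherwise a low mean may reflect a harder batch of examinees rather than a harsh rater.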
While these findings underscore the importance of rater training (see Yan & Ginther, 2017), its positive effects may be short‐lived (Lumley & McNamara, 1995). Raters drift over time, so the practice of certifying raters once and for all is problematic. The most effective rater training procedures include calibration and regularly scheduled training sessions.
A more frequent concern raised by studies of rater variability—one that can only be partially addressed by rater training—is whose standard is the most appropriate to apply when developing assessments and scales. Ginther and McIntosh (2018) summarize:
Work on World Englishes (WE)… has challenged the notion of an ideal native‐speaker, long promoted in theoretical and applied linguistics, and helped to legitimize varieties other than standard British or American English. Meanwhile, English as a lingua franca (ELF) scholars like Seidlhofer (2001) and Jenkins (2006) have advocated for a more flexible contact language that could serve as a communicative resource for so‐called “non‐native” and “native” speakers alike. Both traditions have criticized language tests, especially large‐scale ones like TOEFL, that continue to use native English speaker (NES) norms as the basis for items and assessment, despite the fact that non‐native English speakers (NNES) are now the majority (Davidson, 2006). (p. 860)
Dimova (2017) discusses the implications for the inclusion of a broader variety of speakers in relation to both ELF and WE. Ockey, Papageorgiou, and French (2015) and Ockey and French (2016) discuss performance effects on listeners and speakers. Harding (2017) argues, in a review of validity concerns for speaking assessments, that the time has come for listener variables to be considered in construct definitions.