
Scoring Features

Two important considerations in scoring a writing assessment are (a) designing or selecting a rating scale or scoring rubric and (b) selecting and training people—or, increasingly, machines—to score the written responses. Scoring rubrics can generally be divided into two types: holistic, where raters give a single score based on their overall impression of the writing, and analytic, in which raters give separate scores for different aspects of the writing, such as content, organization, and use of language. A well‐known example of a holistic writing scale is the scale used for the TOEFL iBT® writing test (Educational Testing Service, 2004).
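
To make the structural contrast concrete, the sketch below models the two rubric types as simple data records. This is a minimal illustration only; the band range and the analytic categories are hypothetical and do not reproduce any particular operational scale.

```python
from dataclasses import dataclass

@dataclass
class HolisticRating:
    """One impression-based score for the whole response (hypothetical 1-5 band)."""
    rater_id: str
    overall: int

@dataclass
class AnalyticRating:
    """Separate scores for distinct aspects of the writing (hypothetical categories)."""
    rater_id: str
    content: int
    organization: int
    language_use: int

# The same essay rated under each rubric type by the same rater
holistic = HolisticRating(rater_id="R1", overall=4)
analytic = AnalyticRating(rater_id="R1", content=4, organization=3, language_use=4)
```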

While both scale types have advantages and disadvantages, a holistic scale is generally preferred in situations where a large number of tests need to be scored in a short time, such as in placement testing. On the other hand, for classroom purposes, an analytic scale can provide more useful information to students. Thorough discussions of different types of rating scales can be found in Weigle (2002, chap. 5) and Shaw and Weir (2007, chap. 5).

There is no consensus about which aspects of writing analytic scales should score. Most analytic scales have at least one subscale for content/ideas, one for organization or rhetorical features, and one or more for aspects of language use. For example, the IELTS has scales for grammatical range and accuracy, lexical range and accuracy, arrangement of ideas, and communicative quality (Shaw & Falvey, 2008). The scale devised by Jacobs, Zinkgraf, Wormuth, Hartfiel, and Hughey (1981), one of the first well‐publicized analytic scales for second language writing, includes the categories of content, organization, vocabulary, language use, and mechanics. The rating scale for the Diagnostic English Language Needs Assessment (DELNA), used to identify the English language needs of students at the University of Auckland, includes three main categories: fluency, content, and form, each with three subcategories (Knoch, 2009).
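
Analytic subscores are often combined into a composite, sometimes with unequal weights reflecting the relative importance attached to each category. The sketch below shows one way such a composite might be computed; the category names and weights are illustrative only and are not those of Jacobs et al. (1981), DELNA, or any other operational scale.

```python
def composite_score(subscores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of analytic subscores; the weights are assumed to sum to 1."""
    assert set(subscores) == set(weights), "every category needs a weight"
    return sum(subscores[c] * weights[c] for c in subscores)

# Hypothetical subscores on a 1-5 band scale and hypothetical weights
essay = {"content": 4, "organization": 3, "vocabulary": 4, "language_use": 3, "mechanics": 5}
weights = {"content": 0.30, "organization": 0.20, "vocabulary": 0.20,
           "language_use": 0.25, "mechanics": 0.05}

print(composite_score(essay, weights))  # -> 3.6 on the same hypothetical band scale
```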

Selecting, training, and monitoring raters is a central aspect of writing assessment. Useful procedures for rater training and monitoring can be found in White (1994), Weigle (2002), and Shaw and Weir (2007). Recent research has focused on the effects of rater background and training on scores; see Lumley (2005), Barkaoui (2007), and Eckes (2008) for summaries. This research suggests that training can mitigate but not eliminate differences in rater severity and consistency due to background variables. For this reason, many programs have begun using test analysis tools such as multifaceted Rasch measurement (MFRM) to adjust scores for differences between raters (see McNamara, 1996, for an introduction to MFRM, and Schaefer, 2008, for a review of studies in writing assessment that have used this approach).
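
In its commonly used rating-scale formulation (see McNamara, 1996), the many-facet Rasch model treats each observed score as the joint outcome of a writer facet, a task facet, and a rater facet:

$$\ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k$$

where P_nijk is the probability of writer n receiving category k on task i from rater j, B_n is the writer's ability, D_i the difficulty of the task, C_j the severity of the rater, and F_k the difficulty of the step from category k-1 to k. Estimating a severity parameter C_j for each rater is what allows programs to report scores adjusted for which rater happened to score a given script.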

Another important development with regard to scoring is the use of automated essay scoring (AES) systems, such as e‐rater®, developed by Educational Testing Service (Attali & Burstein, 2006), and IntelliMetric™ and MY Access!®, developed by Vantage Learning Systems (Elliott, 2003), in part to contain the costs and time involved in scoring large‐scale writing assessments. Research demonstrates that automated systems are at least as reliable as humans in scoring standard essay tests (see Shermis & Burstein, 2003; Dikli, 2006; and Shermis, 2014, for overviews of automated essay scoring). However, the use of AES systems is controversial: many writing instructors, in particular, are opposed to any machine scoring of writing, while others highlight the speed and reliability of AES systems as advantages. A recent review of the arguments in this area can be found in Ockey (2009).
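
Comparisons between automated and human scoring are typically reported through agreement indices such as exact agreement, adjacent agreement, and score correlations. The sketch below, using invented scores on a hypothetical 1-5 band scale, shows how such indices might be computed; it does not reproduce the evaluation procedure of e-rater, IntelliMetric, or any other system.

```python
from statistics import mean

def exact_agreement(a, b):
    """Proportion of essays on which the two scorers award the identical band."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a, b, tolerance=1):
    """Proportion of essays on which the scores differ by at most `tolerance` bands."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

def pearson_r(a, b):
    """Pearson correlation between two sets of scores."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

human = [4, 3, 5, 2, 4, 3, 5, 4]      # hypothetical human ratings
machine = [4, 3, 4, 2, 4, 4, 5, 4]    # hypothetical automated scores for the same essays

print(exact_agreement(human, machine))
print(adjacent_agreement(human, machine))
print(pearson_r(human, machine))
```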
