The Concise Encyclopedia of Applied Linguistics — Carol A. Chapelle
Challenges in Testing L2 Pragmatics
Fundamentally, tests of L2 pragmatics have the same requirements and pose the same development challenges as other language tests. They must be standardized to allow comparisons between test takers, they must be reliable to ensure precise measurement, they must be practical so that they do not overtax resources, and, above all, they must allow defensible inferences to be drawn from scores that can inform real‐world decisions (Messick, 1989; Kane, 2006). Some of these requirements are particularly difficult to meet for tests of pragmatics, which probably accounts for their very limited uptake.
Most importantly, practicality is a serious challenge for testing pragmatics. While some instruments in the speech act tradition were designed to be administered online and to allow automatic scoring (Roever, 2005; Itomitsu, 2009; Roever et al., 2014), tests under the interactional competence construct by their very nature include interaction and therefore currently require the time- and resource-intensive involvement of a live interlocutor and scoring by raters. Work is underway to assess interaction through the use of intelligent agents backed by automatic speech recognition engines (Suendermann‐Oeft et al., 2017; Litman, Strik, & Lim, 2018), but this work is still in its infancy and requires nothing short of modeling language users' commonsense members' knowledge (Garfinkel, 1967), which is a daunting prospect. While other aspects of pragmatics, especially some pragmalinguistic abilities, are more easily measurable, it would be a case of serious construct underrepresentation to include only them and then argue that “pragmatics” as a whole is being measured. However, it would be much more feasible for tests that already include face‐to‐face speaking components, such as the International English Language Testing System (IELTS) or the American Council on the Teaching of Foreign Languages (ACTFL) oral proficiency interviews (OPI), to alter their tasks, procedures, and rating scales to measure interactional aspects of pragmatics.
The issue of practicality is further complicated by different types of interactional activities making different abilities visible. For example, two test takers discussing a set topic, as in Galaczi's (2014) study, will by necessity demonstrate their management of topical talk and allow conclusions as to relevant abilities, such as extending interlocutor contributions and managing topic changes. However, these abilities are much less transparent in role plays such as Youn's (2013, 2015), which are more suitable for making test takers' ability to do preference organization visible. This raises the specter that a test would need to involve several different interactional activities, compounding the practicality problem, though research will need to show whether conducting separate measurements of different interactional abilities is necessary.
However, even if the practicality issue can be resolved, measuring interactional aspects of pragmatic competence is not an easy endeavor. Two related challenges are the co‐constructed nature of interaction (Jacoby & Ochs, 1995) and the standardization of the test. While tests need to be standardized to allow comparison between test taker performances, this is chronically difficult for spoken interactions, which have their own dynamic (Heritage, 1984; Kasper, 2006) and can unfold in unpredictable ways. So far, only Youn (2013, 2015) has tried to address this problem, by providing both the interlocutor and the test taker with an outline of the conversation. This makes the interaction somewhat more predictable and allows better comparison between different test takers, but it arguably distorts the construct since real‐world interactions are not usually scripted.
A significant amount of research is still necessary to understand how generalizable specific instances of role play performances in testing situations are across all possible performances, and to what extent they can be extrapolated to real‐world performances (Kane, 2006; Chapelle, Enright, & Jamieson, 2010). Findings like Ikeda's (2017) about the large degree of overlap between dialogic role play performances and monologue tasks are promising, and so is Okada's (2010) argument that abilities elicited through role plays are also relevant in real‐world interaction (though see Ewald, 2012, and Stokoe, 2013, for differences between role plays and real‐world talk). Still, comprehensive measurement of a complex construct such as interactional competence is one of the big challenges facing testing of L2 pragmatics.
From a test design perspective, it is also important to know what makes items difficult so they can be targeted at test takers at different ability levels. This is a challenge for many pragmatics tests, which tend not to have sufficient numbers of difficult items, and it holds both for tests in the speech act tradition and for those assessing interactional competence. For example, Roever et al.'s (2014) battery was overall easy for test takers, and so were Youn's (2013, 2015) and Ikeda's (2017) instruments. We know relatively little about what makes items or tasks difficult, though Roever (2004) put forward some suggestions for pragmalinguistically oriented tests. For measures of interactional competence, it might be worth trying interactional tasks that require orientation to conflicting social norms, for example, managing status-incongruent talk as a student interacting with a professor under institutional expectations of initiative (Bardovi‐Harlig & Hartford, 1993), or persuading one's boss in a workplace situation to remove his son from one's project team (Ross, 2017). However, much more research is needed here as well.
A challenge specific to tests using sociopragmatic judgment is establishing a baseline. Put simply, testers need a reliable way to determine correct and incorrect test taker responses. The usual way to do so is to use a native‐speaker standard, and this has been shown to work well for binary judgments of correct/incorrect, appropriate/inappropriate, and so on (Bardovi‐Harlig & Dörnyei, 1998; Schauer, 2006). However, native‐speaker benchmarking is much more problematic when it comes to preference judgments. For example, in Matsumura's (2001) benchmarking of his multiple‐choice items on the appropriateness of advice, there was not a single item where 70% of a native‐speaker benchmarking group (N = 71) agreed on the correct response, and only 2 items (out of a pretest and posttest total of 24) where more than 60% of native speakers agreed. On 10 items, the most popular response option was chosen by less than half the native‐speaker group. Roever et al. (2014) found stronger NS agreement, with all their items showing at least 50% agreement among NS, and they assigned 2 points for test taker responses that were chosen by the largest NS group and 1 point for responses chosen by the next 2 largest groups, provided they were at least 10% of the NS sample. This scoring approach tried to take NS preference into account, but the point distribution is essentially a tester decision with little empirical basis.
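The partial-credit rule attributed to Roever et al. (2014) above can be made concrete with a short sketch. This is only an illustration of the scoring logic as described, not their actual implementation; the function name, the `min_share` parameter, and the native-speaker response counts are hypothetical.

```python
def score_response(ns_counts, response, min_share=0.10):
    """Score a multiple-choice response against native-speaker (NS)
    benchmarking counts: 2 points for the option chosen by the largest
    NS group, 1 point for an option in the next 2 largest groups if it
    was chosen by at least `min_share` of the NS sample, 0 otherwise."""
    total = sum(ns_counts.values())
    # Rank options by NS popularity, most popular first
    ranked = sorted(ns_counts, key=ns_counts.get, reverse=True)
    if response == ranked[0]:
        return 2
    for option in ranked[1:3]:
        if response == option and ns_counts[option] / total >= min_share:
            return 1
    return 0

# Hypothetical NS benchmarking data for one item (71 native speakers)
ns_counts = {"A": 38, "B": 20, "C": 9, "D": 4}
print(score_response(ns_counts, "A"))  # 2: largest NS group
print(score_response(ns_counts, "C"))  # 1: third largest, >= 10% of NS
print(score_response(ns_counts, "D"))  # 0: below the 10% threshold
```

Note how the 10% floor is doing real work here: option D is excluded from partial credit even though it was a genuine NS choice, which is exactly the kind of cutoff the passage characterizes as a tester decision with little empirical basis.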
Finally, tests of sociopragmatics have often been designed contrastively for a pair of languages, for example, native Japanese speakers learning English (Hudson et al., 1995), native English speakers learning Japanese (Yamashita, 1996), native English speakers learning Korean (Ahn, 2005), or native Chinese speakers learning English (Liu, 2006). This necessarily lowers the practicality of tests, as well as the likelihood that they will eventually become part of large‐scale international test batteries (like TOEFL or IELTS). Roever (2005) did not limit his test taker population to a specific L1, and used differential item functioning to show that there were some L1 effects but that they were generally minor (Roever, 2007), indicating that limiting pragmatics tests to a specific population is not a necessity.