Making Classroom Assessments Reliable and Valid - Robert J. Marzano
Chapter 1
Discussing the Classroom Assessment Paradigm for Validity
Validity is certainly the first order of business when researchers or educators design CAs. The concept of validity has evolved over the years into a multifaceted construct. As mentioned previously, the initial conception of a test’s validity was that it measures what it purports to measure. As Henry E. Garrett (1937) notes, “the fidelity with which [a test] measures what it purports to measure” (p. 324) is the hallmark of its validity. By the 1950s, though, important distinctions emerged about the nature and function of validity. Samuel Messick (1993) explains that since the early 1950s, validity has been thought of as involving three major types: (1) criterion-related validity, (2) construct validity, and (3) content validity.
While the three types of validity have unique qualities, these distinctions are made more complex by the fact that one can examine validity from two perspectives. John D. Hathcoat (2013) explains that these perspectives are (1) the instrumental perspective and (2) the argument-based perspective. Validity in general, and each of the three types in particular, looks quite different depending on the perspective one adopts. This is a central theme of this chapter, and I make a case for the argument-based perspective as superior, particularly as it relates to CAs. The chapter also covers the following topics.
■ Standards as the basis of CA validity
■ Dimensionality
■ Measurement topics and proficiency scales
■ The rise of learning progressions
■ The structure of proficiency scales
■ The school’s role in criterion-related validity
■ The nature of parallel assessments
■ The measurement process
I begin by discussing the instrumental perspective and its treatment of the three types of validity.
The Instrumental Perspective
The instrumental perspective focuses on the test itself. According to Hathcoat (2013), this has been the traditional perspective in measurement theory: a specific test is deemed valid to one degree or another. All three types of validity, then, are considered aspects of a specific test that has been or is being developed within the instrumental perspective. A test possesses certain degrees of the three types of validity.
For quite some time, measurement experts have warned that the instrumental perspective invites misinterpretations of assessments. For example, in his article "Measurement 101: Some Fundamentals Revisited," David A. Frisbie (2005) provides concrete examples of the dangers of a literal adherence to an instrumental perspective. About validity, he notes, "Validity is not about instruments themselves, but it is about score interpretations and uses" (p. 22). In effect, Frisbie argues that it is technically inaccurate to refer to the validity of a particular test. Instead, discussion should focus on the valid use or interpretation of the scores from a particular test. To illustrate how widely this principle is ignored, he offers examples of inaccurate statements about validity from published testing materials:
1. “You can help ensure that the test will be valid and equitable for all students.” (From an examiner’s manual for a statewide assessment program, 2005)
2. “Evidence of test validity … should be made publicly available.” (From a major publication of a prominent testing organization, 2002)
3. “In the assessment realm, this is referred to as the validity of the test.” (From an introductory assessment textbook, 2005)
4. “[Test name] has proven itself in use for more than 50 years as a … valid test.” (From the website of a prominent test publisher, 2005)
5. “Such efforts represent the cornerstone of test validity.” (From the technical manual of a prominent achievement test, 2003). (Frisbie, 2005, p. 22)
Challenges like Frisbie's notwithstanding, the instrumental perspective still dominates.
From the instrumental perspective, each of the three types of validity carries specific requirements that a test must meet to be deemed valid. To establish criterion-related validity for an assessment, a researcher typically computes a correlation coefficient between the newly developed test and some other assessment already considered a valid measure of the topic. This second assessment is referred to as the criterion measure; hence the term criterion-related validity. A test is considered valid for any criterion it predicts accurately (Guilford, 1946).
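To make the correlational procedure concrete, consider the following minimal sketch. The scores and the pairing of tests are invented for illustration; they do not come from any study cited in this chapter.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores: each pair is one student's score on the
# newly developed test and on an established criterion measure.
new_test  = [72, 85, 60, 90, 78, 66]
criterion = [70, 88, 58, 93, 75, 70]

r = pearson_r(new_test, criterion)
print(f"criterion-related validity coefficient: r = {r:.2f}")
```

A coefficient near 1.0 would be taken as evidence that the new test measures the same thing as the criterion; how high is "high enough" is a judgment call, not a fixed standard.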
The major problem with criterion-related validity is that it is difficult in some cases to identify an appropriate criterion measure. Citing the work of Roderick M. Chisholm (1973), Hathcoat (2013) exemplifies the criterion problem using the ability to determine the quality of apples:
If we wish to identify apple quality then we need a criterion to distinguish “good” apples from “bad” apples. We may choose to sort apples into different piles based upon their color, though any criterion is adequate for this example. The problem arises whenever we ask whether our criterion worked in that color actually separated good apples from bad apples. How can we investigate our criterion without already knowing something about which apples are good and bad? (pp. 2–3)
In effect, the difficulty of identifying an appropriate criterion measure makes criterion-related validity hard to establish for test designers in general and for classroom teachers in particular.
Construct validity became prominent about halfway through the 1900s. According to Hathcoat (2013), a seminal article in 1955 by Lee J. Cronbach and Paul E. Meehl led to a focus on construct validity. Hathcoat (2013) notes that “Cronbach and Meehl were concerned about situations wherein a target domain and/or a relevant criterion remained ambiguous” (p. 3).
Cronbach and Meehl (1955) argued that construct validity must be established for any type of content for which it is difficult to find a criterion measure (as cited in Hathcoat, 2013). For example, while it is rather easy to find a criterion measure for content like fifth-grade geometry, it is quite difficult to find one for content like students' ability to apply knowledge in unique situations or to make good decisions. Any instrument designed to measure such topics must be supported by construct validity evidence of what the ability entails.
In the middle of the 20th century, around the same time Cronbach and Meehl (1955) established the need for construct validity, statistical procedures became readily available that allowed psychometricians to induce the nature of an otherwise ambiguous construct. One such statistical procedure is factor analysis, which mathematically provides evidence that specific items on a test measure the same construct. (For a technical discussion of factor analysis, see Kline, 1994.) This type of analysis is also beyond the resources of the typical classroom teacher. In fact, from the perspective of the classroom teacher, construct validity is probably more a function of the standard that he or she is trying to assess than the assessment he or she is designing. For example, assume a teacher is trying to design an assessment for the following standard: “Students will be able to work effectively in cooperative teams.” Construct validity would address the extent to which this standard represents a definable set of knowledge and skill—something that could actually be taught and measured.
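A full factor analysis is well beyond a short illustration, but the core evidence it supplies, that items correlate with one another because they measure a single construct, can be sketched with invented item scores. Everything below (the items, the students, the scores) is hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical item scores (one list per item, one entry per student).
# If the items tap a single construct, every pair should correlate.
items = {
    "item1": [2, 4, 3, 5, 1, 4],
    "item2": [1, 4, 3, 5, 2, 4],
    "item3": [2, 5, 2, 4, 1, 5],
}

pairs = {(a, b): pearson_r(items[a], items[b])
         for a, b in combinations(items, 2)}
avg_r = sum(pairs.values()) / len(pairs)
print(f"average inter-item correlation: {avg_r:.2f}")
```

High inter-item correlations are only crude evidence of unidimensionality; actual factor analysis decomposes such a correlation matrix to estimate how many latent traits the items share (see Kline, 1994).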
The classroom teacher, then, has few, if any, resources to address criterion-related validity or construct validity. However, the classroom teacher can address content validity, which basically reaffirms the early definition of validity: the test measures what it purports to measure. For the classroom teacher, this simply involves ensuring that the CA addresses the content in the standard that is the focus of instruction and assessment.
In summary, from the instrumental perspective, the classroom teacher has limited or no control over two of the three types of validity associated with CAs he or she is designing. However, from the argument-based perspective, the teacher has some control over all three types of validity.
The Argument-Based Perspective
The argument-based perspective of validity is relatively new. Although it can be traced back to work in the 1970s and 1980s around the importance of test interpretation articulated by Messick (1975, 1993), the argument-based approach became popular because of a series of works by Michael T. Kane (1992, 2001, 2009). At its core, argument-based validity involves an interpretive argument that “lays out the network of inferences leading from the test scores to the conclusions to be drawn and any decisions to be based on these conclusions” (Kane, 2001, p. 329).
From the instrumental perspective, it is the assessment itself that possesses a specific type of validity (criterion, construct, or content). In contrast, from the argument-based perspective, validity is a function of how the data generated from the assessment are used to craft an argument regarding a particular student’s knowledge or skill. This type of validity applies nicely to the classroom teacher.
From the argument-based perspective, then, criterion-related validity for a CA is determined by a teacher’s ability to use data from the assessment to predict students’ performance on interim assessments and end-of-course assessments. If students do well on the CAs for a particular topic, they should also do well on the more formal assessments on that topic designed outside of the classroom.
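One way to make such a prediction concrete, offered here as a sketch with invented scores rather than a procedure from the text, is to fit a least-squares line relating CA averages to end-of-course scores from a prior class and use it to project a current student's likely performance:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical data from a prior class: average CA score on a
# topic paired with the later end-of-course score on that topic.
ca_avg        = [2.0, 2.5, 3.0, 3.5, 4.0]
end_of_course = [61, 68, 75, 82, 89]

slope, intercept = fit_line(ca_avg, end_of_course)
predicted = slope * 3.2 + intercept  # project a current student's CA average
print(f"predicted end-of-course score: {predicted:.1f}")  # 77.8
```

The point is not the statistics but the inference: if CA data support defensible predictions like this one, the CAs have criterion-related validity in the argument-based sense.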
Construct validity for CAs is determined by the extent to which a teacher can use data from these assessments to identify specific knowledge and skills that should be directly taught. If a teacher can translate scores on the CAs into specific types of instruction for specific students on specific content, then the information generated from the CAs is judged to have construct validity.
From the argument-based perspective, content validity for CAs is determined by a teacher’s ability to use the information generated from CAs as evidence regarding students’ current knowledge and skill on a specific topic. If the teacher can use the scores on the CAs to determine what content students know and what content they don’t know on a specific progression of knowledge, then the information generated from the CAs is judged to have content validity.
The distinction between the instrumental and argument-based perspectives is critical to establishing the validity of CAs. Table 1.1 summarizes these differences.
Table 1.1: CA Validity From the Instrumental and Argument-Based Perspectives
| Validity Type | Instrumental Perspective | Argument-Based Perspective |
| --- | --- | --- |
| Criterion-Related Validity | Scores on a specific CA correlate highly with scores on some external assessment of the content already established as valid. | The information provided by a set of CAs can be interpreted in terms of how well students might perform on interim and end-of-year assessments. |
| Construct Validity | Based on statistical analysis, the items on a particular CA are highly correlated for a particular topic. | The information provided by a set of CAs can be interpreted in terms of specific knowledge or skill that can be directly taught. |
| Content Validity | The scores on a specific CA clearly measure specific content. | The information provided by a set of CAs can be interpreted in terms of students' status on an explicit progression of knowledge. |
The argument-based perspective is well suited to classroom teachers, who are ideally positioned to generate Kane's (1992, 2001, 2009) network of inferences leading from the scores generated by CAs to conclusions and decisions. To do this effectively, though, teachers must utilize standards as the basis for designing tests.
Standards as the Basis of CA Validity
As the discussion in the introduction illustrates, CAs have an advantage over traditional assessments in that they typically have a narrow focus. Also, within K–12 education, the topics on which CAs should focus have been articulated in content standards. This would seem to make the various types of validity relatively easy for classroom teachers to establish, and it does, provided standards are used wisely. Unfortunately, state standards usually require a great deal of interpretation and adaptation to be used effectively in guiding the development of CAs. How educators interpret them makes all the difference: as Schneider et al. (2013) note, the way educators interpret state standards plays a major role in assessment development.
Now we consider the standards movement, as well as the problem with standards.
The Standards Movement
The K–12 standards movement in the United States has a long, intriguing history. (For a detailed discussion, see Marzano & Haystead, 2008; Marzano et al., 2013). Arguably, the standards movement started in 1989 at the first education summit when the National Education Goals Panel (1991, 1993) set national goals for the year 2000. Millions of dollars were made available to develop sample national standards in all the major subject areas. States took these national-level documents and created state-level versions. Probably the most famous attempts to influence state standards at the national level came in the form of the CCSS and the Next Generation Science Standards (NGSS). States have continued to adapt national-level documents such as these to meet the needs and values of their constituents.
The Problem With Standards
While it might appear that standards help teachers design valid CAs (at least in terms of content validity), this is not necessarily the case. In fact, in many situations, state standards make validity of all types problematic to achieve. To illustrate, consider the following Common Core State Standard for eighth-grade reading: “Determine the meaning of words and phrases as they are used in a text, including figurative, connotative, and technical meanings; analyze the impact of specific word choices on meaning and tone, including analogies or allusions to other texts” (RI.8.4; NGA & CCSSO, 2010a, p. 39).
While this standard provides some direction for assessment development, it contains a great deal of content. Specifically, this standard includes the following information and skills.
■ Students will understand what figurative, connotative, and technical meanings are.
■ Students will be able to identify specific word choices an author made.
■ Students will be able to analyze the impact of specific word choices.
■ Students will understand what tone is.
■ Students will understand what an analogy is.
■ Students will understand what an allusion is.
■ Students will be able to analyze analogies and allusions.
The volume of discrete pieces of content in this one standard creates an obvious problem: there is simply too much content. As mentioned previously, in their analysis of the CCSS, Marzano et al. (2013) identify seventy-three standard statements for eighth-grade English language arts. If one makes the conservative assumption that each of those statements contains about five component skills (the standard above unpacks into seven), an eighth-grade teacher would be expected to assess some 365 specific pieces of content for ELA alone in a 180-day school year. According to Marzano and colleagues (2013), the same pattern can be observed in many state standards documents.
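The arithmetic behind this overload claim is easy to verify. The statement count comes from Marzano et al. (2013); treating each statement as containing about five component skills is the rough estimate that yields the figure of 365.

```python
standard_statements = 73  # eighth-grade ELA statements in the CCSS
components_each = 5       # rough estimate of component skills per statement
school_days = 180

pieces_of_content = standard_statements * components_each
print(pieces_of_content)                # 365 discrete pieces of content
print(pieces_of_content / school_days)  # roughly 2 pieces to assess per school day
```

Even under this conservative estimate, a teacher would need to assess about two distinct pieces of content every single school day, which is why unpacking and prioritizing standards is unavoidable.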
Given the fact that it is virtually impossible to teach all the content embedded in national or state standards for a given subject area, a teacher must unpack standards to identify what will be assessed within a system of CAs. Ideally, the district or school does this unpacking. Tammy Heflebower, Jan K. Hoegh, and Phil Warrick (2014) explain how a school or district can lead a systematic effort to identify between fifteen and twenty-five essential topics that should be the focus of CAs. Briefly, the process involves prioritizing standards and the elements within those standards that are absolutely essential to assess. When identifying and articulating essential topics in standards, schools and districts must be cognizant of their dimensionality.
Dimensionality
In general, a CA should address one topic. Parkes (2013) explains that this is a foundational concept in measurement theory: "any single score from a measurement is to represent a single quality" (p. 107). This is technically referred to as making a CA unidimensional (formally, a unidimensional test "measures only one dimension or only one latent trait" [AERA et al., 2014, p. 224]). The notion that unidimensionality is central to test theory can be traced back to the middle of the 1900s. For example, in a seminal 1959 article on measurement theory, Frederic M. Lord notes that a test is a "collection of tasks; the examinee's performance on these tasks is taken as an index of [a student's] standing along some psychological dimension" (p. 473). Over forty years later, David Thissen and Howard Wainer (2001) explain:
Before the responses to any set of items are combined into a single score that is taken to be, in some sense, representative of the responses to all of the items, we must ascertain the extent to which the items “measure the same thing.” (p. 10)
Without unidimensionality, a score on a test is difficult to interpret. For example, assume that two students receive a score of 70 on the same test, but that test measures two dimensions. This is depicted in figure 1.1.
Note: Black = patterns; gray = data analysis. Total possible points for black (patterns) = sixty; total possible points for gray (data analysis) = forty.
Figure 1.1: Two students’ scores on a two-dimensional test.
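The point of figure 1.1 can be illustrated numerically. The subscores below are invented to fit the figure's constraints (sixty possible points on patterns, forty on data analysis): two students reach the same total of 70 by very different routes, so the single score hides which dimension each student has actually mastered.

```python
MAX_PATTERNS, MAX_DATA = 60, 40  # possible points per dimension

# Invented subscores consistent with the figure's setup.
students = {
    "Student A": {"patterns": 55, "data_analysis": 15},
    "Student B": {"patterns": 30, "data_analysis": 40},
}

for name, sub in students.items():
    total = sub["patterns"] + sub["data_analysis"]
    print(f"{name}: total = {total}, "
          f"patterns {sub['patterns']}/{MAX_PATTERNS}, "
          f"data analysis {sub['data_analysis']}/{MAX_DATA}")
# Both totals are 70, yet the two profiles call for different instruction.
```

A score of 70 reported alone cannot distinguish these students, which is precisely why a CA should be built to measure a single dimension.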