Introduction
The Role of Classroom Assessment
Classroom assessment has been largely ignored in the research and practice of assessment theory. This is not to say that it has been inconsequential to classroom practice. To the contrary, the topic of classroom assessment has become more and more popular in the practitioner literature. For example, the book Classroom Assessment: What Teachers Need to Know is in its eighth edition (Popham, 2017). Many other publishers continue to release books on the topic. This trend notwithstanding, technical literature in the 20th century rarely mentioned classroom assessment. As James McMillan (2013b) notes:
Throughout most of the 20th century, the research on assessment in education focused on the role of standardized testing …. It was clear that the professional educational measurement community was concerned with the role of standardized testing, both from a large-scale assessment perspective as well as with how teachers used test data for instruction in their own classrooms. (p. 4)
As evidence, McMillan (2013b) notes that an entire issue of the Journal of Educational Measurement that purported to focus on state-of-the-art testing and instruction did not address teacher-made tests. Additionally, the first three editions of Educational Measurement (Lindquist, 1951; Linn, 1993; Thorndike, 1971)—which are designed to summarize the state of the art in measurement research, theory, and practice—paid little if any attention to classroom assessment. Finally, both editions of The Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, 2014)—which, as their titles indicate, are designed to set standards for testing in both psychology and education—made little explicit reference to classroom assessment. It wasn't until the fourth edition of Educational Measurement (Brennan, 2006), published in the first decade of the 21st century, that a chapter addressing classroom assessment was included.
Most recently, the SAGE Handbook of Research on Classroom Assessment made a stand for the rightful place of classroom assessment: “This book is based on a single assertion: Classroom assessment (CA) is the most powerful type of measurement in education that influences student learning” (McMillan, 2013a, p. xxiii). Throughout this text, I take the same perspective. I also use the convention of referring to classroom assessment as CA. Since the publication of the SAGE Handbook, this abbreviation has become the norm in many technical discussions of classroom assessment theory. My intent is for this book to be both technical and practical.
What, then, is the place of CAs in the current K–12 system of assessment, and what is their future? This resource attempts to lay out a future for CA that will render it the primary source of evidence regarding student learning; this would stand in stark contrast to the current situation in which formal measurements of students are left to interim assessments, end-of-course assessments, and state assessments. In this introduction, I will discuss several topics with regard to CAs.
■ The curious history of large-scale assessments
■ The place of classroom assessment
■ Reliability and validity at the heart of the matter
■ The need for new paradigms
■ The large-scale assessment paradigm for reliability
■ The new CA paradigm for reliability
■ The large-scale assessment paradigm for validity
■ The new CA paradigm for validity
Before delving directly into the future of CA, it is useful to consider the general history of large-scale assessments in U.S. education since it is the foundation of current practices in CA.
The Curious History of Large-Scale Assessments
The present and future of CA are intimately tied to the past and present of large-scale assessments. In 2001, educational measurement expert Robert Linn published “A Century of Standardized Testing: Controversies and Pendulum Swings.” Linn notes that large-scale assessment began in the 19th century and that its original purpose was comparison.
Educators commonly refer to J. M. Rice as the inventor of the comparative large-scale assessment. This attribution is based on his 1895 assessment of the spelling ability of some thirty-three thousand students in grades 4 through 12, for which comparative results were reported (Engelhart & Thomas, 1966). However, assessments that educators administered to several hundred students in seventeen schools in Boston and one school in Roxbury in 1845 predated this comparative large-scale assessment. Because of this, Horace Mann (who initiated the effort) deserves credit as the first to administer large-scale tests. Lorrie A. Shepard (2008) elaborates on the contribution of Horace Mann, noting:
In 1845, Massachusetts State Superintendent of Instruction, Horace Mann, pressured Boston school trustees to adopt written examinations because large increases in enrollments made oral exams unfeasible. Long before IQ tests, these examinations were used to classify pupils … and to put comparative information about how schools were doing in the hands of state-level authority. (p. 25)
Educators designed these early large-scale assessments to help solve perceived problems within the K–12 system. For example, in 1909, Leonard P. Ayres published the book Laggards in Our Schools: A Study of Retardation and Elimination in City School Systems. Despite the book's insensitivity in labeling large groups of students in unflattering ways, it brought attention to the problems associated with repeated retention of students in grade levels. This helped buttress the goal of reformers who wanted to develop programs that would mitigate failure.
The first half of the 20th century was not a flattering era for large-scale assessments. They focused on natural intelligence, and educators used them to classify examinees. To say the least, this era did not represent the initial or current intent of large-scale assessment. I address this period in more detail shortly.
By the second half of the 20th century, educators began to use large-scale assessments more effectively. Such assessments were a central component of James Bryant Conant’s (1953) vision of schools designed to provide students with guidance as to appropriate career paths and support in realizing related careers.
The use of large-scale assessment increased dramatically in the 1960s. According to Shepard (2008), the modern era of large-scale assessment started in the mid-1960s: “Title I of the Elementary and Secondary Education Act (ESEA) of 1965 launched the development of the field of educational evaluation and the school accountability movement” (p. 26). Shepard (2008) explains that it was the ESEA mandate for data with which to scrutinize the reform efforts that compelled the research community to develop more finely tuned evaluation tools: “The American Educational Research Association began a monograph series in 1967 to disseminate the latest thinking in evaluation theory, and several educational evaluation organizations and journals date from this period” (p. 26).
The National Assessment of Educational Progress (NAEP) began in 1969 and “was part of the same general trend toward large-scale data gathering” (Shepard, 2008, p. 27). However, researchers and policymakers designed NAEP for program evaluation as opposed to individual student performance evaluation.
The need to gather and use data about individual students gave rise to minimum competency testing in the United States. The practice spread quickly, and by 1980 “all states had a minimum competency testing program or a state testing program of some kind” (Shepard, 2008, p. 31). But this, too, ran aground because of the amount of time and resources necessary for large-scale competency tests.
The next wave of school reform was the “excellence movement” spawned by the high-visibility report A Nation at Risk (National Commission on Excellence in Education, 1983). It cited low standards and a watered-down curriculum as reasons for the lackluster performance of U.S. schools. It also faulted the minimum competency movement, noting that focusing on minimum requirements distracted educators from the more noble and appropriate goal of maximizing students’ competencies.
Fueled by these criticisms, researchers and policymakers focused on the identification of rigorous and challenging standards for all students in the core subject areas. Standards work in mathematics set the tone for the reform:
Leading the way, the National Council of Teachers of Mathematics report on Curriculum and Evaluation Standards for School Mathematics (1989) expanded the purview of elementary school mathematics to include geometry and spatial sense, measurement, statistics and probability, and patterns and relationships, and at the same time emphasized problem solving, communication, mathematical reasoning, and mathematical connections rather than computation and rote activities. (Shepard, 2008, p. 35)
By the early 1990s, virtually every major academic subject area had sample standards for K–12 education.
Shepard (2008) notes that standards-based reform, begun in the 1990s, “is the most enduring of test-based accountability reforms” (p. 37). However, she also cautioned that the version of this reform enacted in No Child Left Behind (NCLB) “contradicts core principles of the standards movement” mostly because the assessments associated with NCLB did not place ample focus on the application and use of knowledge reflected in the standards researchers developed (Shepard, 2008, p. 37). Also, the accountability system that accompanied NCLB focused on rewards and punishments.
The beginning of the new century saw an emphasis on testing that was highly focused on standards. In 2009, the National Governors Association Center for Best Practices (NGA) and the Council of Chief State School Officers (CCSSO) partnered in “a state-led process that [drew] evidence and [led] to development and adoption of a common core of state standards … in English language arts and mathematics for grades K–12” (as cited in Rothman, 2011, p. 62). This effort, referred to as the Common Core State Standards (CCSS), resulted in the establishment of two state consortia that were tasked with designing new assessments aligned to the standards. One consortium was the Partnership for Assessment of Readiness for College and Careers (PARCC); the other was the Smarter Balanced Assessment Consortium (SBAC):
Each consortium planned to offer several different kinds of assessments aligned to the CCSS, including year-end summative assessments, interim or benchmark assessments (used throughout the school year), and resources that teachers could use for formative assessment in the classroom. In addition to being computer-administered, these new assessments would include performance tasks, which require students to demonstrate a skill or procedure or create a product. (Marzano, Yanoski, Hoegh, & Simms, 2013, p. 7)
These efforts are still under way, although the assessments are less widely used than when they were initiated.
Next, I discuss abuses of large-scale assessments that occurred in the first half of the 20th century (Houts, 1977). To illustrate the nature and extent of these abuses, consider the first usable intelligence test, which Alfred Binet developed in 1905. It was grounded in the theory that intelligence was not a fixed entity. Rather, educators could remediate low intelligence if they identified it. As Leon J. Kamin (1977) notes in his book on the nature and use of IQ tests, Binet included a chapter, “The Training of Intelligence,” in which he outlined educational interventions for those who scored low on his test. There was clearly an implied focus on helping low-performing students. It wasn't until the Americanized version of Binet's test, the Stanford-Binet developed by Lewis M. Terman (1916), that the concept of IQ solidified as a fixed entity with little or no chance of improvement. Consequently, educators would use the IQ test to identify students with low intelligence so they could monitor and deal with them accordingly. Terman (1916) notes:
In the near future intelligence tests will bring tens of thousands of these high-grade defectives under the surveillance and protection of society. This will ultimately result in curtailing the reproduction of feeble-mindedness and in the elimination of an enormous amount of crime, pauperism, and industrial inefficiency. It is hardly necessary to emphasize that the high-grade cases, of the type now so frequently overlooked, are precisely the ones whose guardianship it is most important for the State to assume. (pp. 6–7)
The perspective that Lewis Terman articulated became widespread in the United States and led to the Army Alpha test developed by Arthur Otis (one of Terman's students). According to Kamin (1977), performance scores for 125,000 draftees were analyzed by the National Academy of Sciences and published in 1921 under the title Memoirs of the National Academy of Sciences: Psychological Examining in the United States Army (Yerkes, 1921). The report contains the chapter “Relation of Intelligence Ratings to Nativity,” which focuses on an analysis of about twelve thousand draftees who reported that they were born outside of the United States. Each of these draftees was assigned a letter grade from A to E, and the distribution of these letter grades was analyzed for each country. The report notes:
The range of differences between the countries is a very wide one …. In general, the Scandinavian and English speaking countries stand high in the list, while the Slavic and Latin countries stand low … the countries tend to fall into two groups: Canada, Great Britain, the Scandinavian and Teutonic countries … [as opposed to] the Latin and Slavic countries. (Yerkes, 1921, p. 699)
Clearly, perspectives on intelligence have changed dramatically, and large-scale assessments have come a long way in how test scores are used since the early part of the 20th century. Yet even now, the mere mention of the terms large-scale assessment or standardized assessment prompts criticisms to which assessment experts must respond (see Phelps, 2009).
The Place of Classroom Assessment
An obvious question is, What is the rightful place of CA? Discussions regarding current uses of CA typically emphasize their inherent value and the advantages they provide over large-scale assessments. For example, McMillan (2013b) notes:
It is more than mere measurement or quantification of student performance. CA connects learning targets to effective assessment practices teachers use in their classrooms to monitor and improve student learning. When CA is integrated with and related to learning, motivation, and curriculum it both educates students and improves their learning. (p. 4)
Bruce Randel and Tedra Clark (2013) explain that CAs “play a key role in the classroom instruction and learning” (p. 161). Susan M. Brookhart (2013) explains that CAs can be a strong motivational tool when used appropriately. M. Christina Schneider, Karla L. Egan, and Marc W. Julian (2013) identify CA as one of three components of a comprehensive assessment system. Figure I.1 depicts the relationship among these three systems.
Figure I.1: The three systems of assessment.
As depicted in figure I.1, CAs are the first line of data about students. They provide ongoing evidence about students’ current status on specific topics derived from standards. Additionally, according to figure I.1, CAs should be the most frequently used form of assessment.
Next are interim assessments. Schneider and colleagues (2013) describe them as follows: “Interim assessments (sometimes referred to as benchmark assessments) are standardized, periodic assessments of students throughout a school year or subject course” (p. 58).
Year-end assessments are the least frequent type of assessments employed in schools. Schneider and colleagues (2013) describe them in the following way:
States administer year-end assessments to gauge how well schools and districts are performing with respect to the state standards. These tests are broad in scope because test content is cumulative and sampled across the state-level content standards to support inferences regarding how much a student can do in relation to all of the state standards. Simply stated, these are summative tests. The term year-end assessment can be a misnomer because these assessments are sometimes administered toward the end of a school year, usually in March or April and sometimes during the first semester of the school year. (p. 59)
While CAs have a prominent place in discussions about comprehensive assessments, they have continually exhibited weaknesses that limit their use or, at least, the confidence in their interpretation. For example, Cynthia Campbell (2013) notes the “research investigating evaluation practices of classroom teachers has consistently reported concerns about the adequacy of their assessment knowledge and skill” (p. 71). Campbell (2013) lists a variety of concerns about teachers’ design and use of CAs, including the following.
■ Teachers have little or no preparation for designing and using classroom assessments.
■ Teachers’ grading practices are idiosyncratic and erratic.
■ Teachers have erroneous beliefs about effective assessment.
■ Teachers make little use of the variety of assessment practices available.
■ Teachers don’t spend adequate time preparing and vetting classroom assessments.
■ Teachers’ evaluative judgments are generally imprecise.
Clearly, CAs are important, and researchers widely acknowledge their potential role in the overall assessment scheme. But there are many issues that must be addressed before CAs can assume their rightful role in the education process.
Reliability and Validity at the Heart of the Matter
Almost all problems associated with CAs find their ultimate source in the concepts of reliability and validity. Reliability is generally described as the accuracy of a measurement. Validity is generally thought of as the extent to which an assessment measures what it purports to measure.
Reliability and validity are related in a variety of ways (discussed in depth in subsequent chapters). Even on the surface, though, it makes intuitive sense that validity is probably the first order of business when designing an assessment; if a test doesn’t measure what it is supposed to measure, it is of little use. However, even if a test is designed with great attention to its validity, its reliability can render validity a moot point.
An assessment’s validity can be limited or mediated by its reliability (Bonner, 2013; Parkes, 2013). For example, imagine you were trying to develop an instrument that measures weight. This is a pretty straightforward construct, in that weight is defined as the amount of gravitational pull on an object or the force on an object due to gravity. With this clear goal in mind, you create your own version of a scale, but unfortunately, it gives different measurements each time an object is placed on it. You put an object on it, and it indicates that the object weighs one pound. You take it off and put it on again, and it reads one and a half pounds. The third time, it reads three-quarters of a pound, and so on. Even though the measurement device was focused on weight, the score derived from the measurement process is so inaccurate (imprecise or unreliable) that it cannot be a true measure of weight. Hence, your scale cannot produce valid measures of weight even though you designed it for that specific purpose. Its reliability has limited its validity. This is probably the reason that reliability seems to receive the majority of the attention in discussions of CA. If a test is not reliable, its validity is negated.
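To make the scale example concrete, here is a minimal simulation (the object's true weight and the noise levels are invented for illustration and are not from the book). It shows why readings that scatter widely around the true value cannot serve as valid measures of weight, while readings with little random error can.

```python
import random

random.seed(1)

TRUE_WEIGHT = 1.0  # hypothetical object that truly weighs one pound

def unreliable_scale(true_weight, noise=0.5):
    """Return a reading contaminated by a large random error."""
    return true_weight + random.uniform(-noise, noise)

def reliable_scale(true_weight, noise=0.02):
    """Return a reading with only a small random error."""
    return true_weight + random.uniform(-noise, noise)

# Five repeated weighings of the same object on each scale.
unreliable_readings = [round(unreliable_scale(TRUE_WEIGHT), 2) for _ in range(5)]
reliable_readings = [round(reliable_scale(TRUE_WEIGHT), 2) for _ in range(5)]

print("Unreliable scale:", unreliable_readings)  # readings scatter widely around 1.0
print("Reliable scale:  ", reliable_readings)    # readings cluster tightly around 1.0
```

No single reading from the unreliable scale can be trusted as a measure of weight, even though weight is exactly what the device was built to measure; its unreliability has undercut its validity.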
The Need for New Paradigms
For CAs to take their rightful place in the assessment triad depicted in figure I.1, they must be both valid and reliable. This is not a new or shocking idea. However, reliability and validity for CAs must be thought of differently from how they are thought of for large-scale assessments.
Large-scale assessments are so different from CAs in structure and function that the paradigms for validity and reliability developed for large-scale assessments do not apply well to CAs. Indeed, some argue that CAs are so different that they should be held to an entirely different standard. For example, Jay Parkes (2013) notes, “There have also been those who argue that CAs … have such strong validity that we should tolerate low reliability” (p. 113).
While I believe this is a defensible perspective, in this book, I take the position that we should not simply ignore psychometric concepts related to validity and reliability. Rather, we should hold CAs accountable to high standards relative to both validity and reliability, but educators should reconceptualize the standards and psychometric constructs on which these standards are based in order to fit the unique environment of the classroom. I also believe that technical advances in CA have been hindered because of the unquestioned adherence to the measurement paradigms developed for large-scale assessments.
The Large-Scale Assessment Paradigm for Reliability
Even though validity is the first order of business when designing an assessment, I begin with a discussion of reliability because of the emphasis it receives in the literature on CAs. At its core, reliability refers to the accuracy of a measurement, where accuracy refers to how much or how little error exists in an individual score from an assessment. In practice, though, large-scale assessments represent reliability in terms of scores for groups of students as opposed to individual students. (For ease of discussion, I will use the terms large-scale and traditional as synonyms throughout the text.) As we shall see in chapter 4 (page 83), the conceptual formula for reliability in the large-scale assessment paradigm is based on differences in scores across multiple administrations of a test. Consider table I.1 to illustrate the traditional concept of reliability.
The column Initial Administration reports the scores of ten students for the first administration of a specific test. (For ease of discussion, the scores are listed in rank order.) Together, the next column, Second Administration (A), and the first represent a pattern of scores that indicates relatively high reliability for the test in question.
Table I.1: Three Administrations of the Same Test
To understand this pattern, one must imagine that the second administration happened right after the initial administration, but somehow students forgot how they answered the items the first time. In fact, it’s best to imagine that students forgot they took the test in the first place. Although this is impossible in real life, it is a basic theoretical underpinning of the traditional concept of reliability—the pattern of scores that would occur across students over multiple replications of the same assessment. Lee J. Cronbach and Richard J. Shavelson (2004) explain this unusual assumption in the following way:
If, hypothetically, we could apply the instrument twice and on the second occasion have the person unchanged and without memory of his first experience, then the consistency of the two identical measurements would indicate the uncertainty due to measurement error. (p. 394)
If a test is reliable, one would expect students to get close to the same scores on the second administration of the test as they did on the first. As depicted in Second Administration (A), this is basically the case. Even though only two students received exactly the same score, all scores in the second administration were very close to their counterparts in the first.
If a test is unreliable, however, one would expect students to receive scores on the second administration that are substantially different from those they received on the first. This is depicted in the column Second Administration (B). Notice that students’ scores on this hypothetical administration vary greatly from their first scores.
Table I.1 demonstrates the general process at a conceptual level of determining reliability from a traditional perspective. If the pattern of variation in scores among students is the same from one administration of a test to another, then the test is deemed reliable. If the pattern of variation changes from administration to administration, the test is not considered reliable. Of course, administrations of the same tests to the same students without students remembering their previous answers don’t occur in real life. Consequently, measurement experts (called psychometricians) have developed formulas that provide reliability estimates from a single administration of a test. I discuss this in chapter 3 (page 59).
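Because the values in table I.1 are not reproduced here, the sketch below uses hypothetical scores to illustrate the idea at a conceptual level: correlating a readministration with the initial administration is one conventional way to express test-retest reliability, and that correlation is high when the pattern of variation among students is preserved (as in administration A) and much lower when it is not (as in administration B).

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical scores for ten students (not the actual values in table I.1).
initial  = [95, 91, 88, 85, 82, 79, 76, 72, 68, 63]
second_a = [94, 92, 87, 86, 81, 80, 75, 73, 67, 64]  # pattern of variation preserved
second_b = [78, 64, 93, 70, 88, 61, 95, 83, 72, 90]  # pattern of variation scrambled

# Correlating each readministration with the initial one approximates the
# traditional, group-based notion of reliability.
print(round(correlation(initial, second_a), 2))  # close to 1.0 -> high reliability
print(round(correlation(initial, second_b), 2))  # far from 1.0 -> low reliability
```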
Next, we consider the equation for a single score, as well as the reliability coefficient.
The Equation for a Single Score
While the large-scale paradigm considers reliability from the perspective of a pattern of scores for groups of students across multiple test administrations, it is also based on the assumption that scores for individual students contain some amount of error. Error may be due to careless mistakes on the part of students, on the part of those administering and scoring the test, or both. Such error is referred to as random measurement error, and it is an anticipated part of any assessment (Frisbie, 1988). Random error can either increase or decrease the score a student receives (referred to as the observed score). To represent this, the conceptual equation for an individual score within the traditional paradigm is:
Observed score = true score + error score
The true score is the score a test taker would receive if there were no random errors from the test or the test taker. In effect, the equation implies that when anyone receives a score on any type of assessment, there is no guarantee that the score the test taker receives (that is, the observed score) is the true score. The true score might be slightly or greatly higher or lower than the observed score.
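In the standard notation of classical test theory (the symbols below are conventional and are not notation used in this book), this conceptual equation and the reliability coefficient discussed in the next section can be written as:

```latex
X = T + E, \qquad
\text{reliability} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Here X is the observed score, T the true score, E the random error score, and each σ² a variance across examinees. Because error is assumed to be random and unrelated to true scores, the coefficient moves toward 1.00 as error variance shrinks and toward 0.00 as error overwhelms true-score differences, which matches the interpretation of the coefficient given below.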
The Reliability Coefficient
The reliability of an assessment from the traditional perspective is commonly expressed as an index of reliability—also referred to as the reliability coefficient (Kelley, 1942). Such a coefficient ranges from 0.00 to 1.00, with 1.00 meaning there is no random error operating in an assessment, and 0.00 indicating that the test scores consist entirely of random error. While there are no published tests with a reliability of 1.00 (simply because it’s impossible to construct such a test), there are also none published with a reliability even remotely close to 0.00. Indeed, David A. Frisbie (1988) notes that most published tests have reliabilities of about 0.90, but most teacher-designed tests have much lower reliabilities of about 0.50. Others have reported higher reliabilities for teacher-designed assessments (for example, Kinyua & Okunya, 2014). Leonard S. Feldt and Robert L. Brennan (1993) add a cautionary note to the practice of judging an assessment from its reliability coefficient:
Although all such standards are arbitrary, most users believe, with considerable support from textbook authors, that instruments with coefficients lower than 0.70 are not well suited to individual student evaluations. Although one may quarrel with any standard of this sort, many knowledgeable test users adjust their level of confidence in measurement data as a hazy function of the magnitude of the reliability coefficient. (p. 106)
As discussed earlier, the reliability coefficient tells us how much a set of scores for the same students would differ from administration to administration, but it tells us very little about the scores for individual students. The only way to examine the precision of individual scores is to calculate a confidence interval around the observed scores. Confidence intervals are described in detail in technical note I.1 (page 110), but conceptually they can be illustrated rather easily. To do so, table I.2 depicts the 95 percent confidence interval around an observed score of seventy-five out of one hundred points for tests with reliabilities ranging from 0.55 to 0.85.
Table I.2: Ninety-Five Percent Confidence Intervals for Observed Score of 75
Note: The standard deviation of this test was 8.33 and the upper and lower limits have been rounded.
Table I.2 depicts a rather disappointing situation. Even when a test has a reliability of 0.85, an observed score of 75 has a 95 percent confidence interval of 69 to 81. When the reliability is as low as 0.55, that confidence interval stretches from 64 to 86. From this perspective, CAs appear almost useless because they carry so much random error. Fortunately, there is another perspective on reliability that can render CAs more precise and, therefore, more useful.
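The intervals in table I.2 follow from the standard error of measurement, SEM = SD × √(1 − reliability), with a 95 percent interval of roughly ±1.96 SEM around the observed score. The sketch below, which uses the standard deviation of 8.33 from the table's note, reproduces the two intervals cited above; it is a reconstruction of the arithmetic, not code from the book.

```python
from math import sqrt

SD = 8.33        # standard deviation reported in the note to table I.2
OBSERVED = 75    # observed score
Z = 1.96         # z value for a 95 percent confidence interval

def confidence_interval(reliability):
    sem = SD * sqrt(1 - reliability)   # standard error of measurement
    half_width = Z * sem
    return round(OBSERVED - half_width), round(OBSERVED + half_width)

for r in (0.55, 0.85):
    print(r, confidence_interval(r))
# 0.55 -> (64, 86) and 0.85 -> (69, 81), matching table I.2
```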
The New CA Paradigm for Reliability
As long as the reliabilities of CAs are determined using coefficients of reliability based on formulas that examine the difference in patterns of scores between students, there is little chance of teachers being able to demonstrate the precision of their assessments for individual students. These traditional formulas typically require a great many items and a great many examinees to yield meaningful results. Classroom teachers usually have relatively few items on their tests (which are administered to relatively few students).
This problem is solved, however, if we consider CAs in sets administered over time. The perspective of reliability calculated from sets of assessments administered over time has been in the literature for decades (see Rogosa, Brandt, & Zimowski, 1982; Willett, 1985, 1988). Specifically, a central tenet of this book is that one should examine reliability of CAs from the perspective of groups of assessments on the same topic administered over time (as opposed to a single assessment at one point in time). To illustrate, consider the following five scores, each from a separate assessment, on the same topic, and administered to a specific student over time (such as a grading period): 71, 75, 81, 79, 84.
We must analyze the pattern that these scores exemplify to determine the reliability or precision of the student’s scores across the set. This requires a foundational equation different from the one used in traditional assessment. That new equation must account for the timing of an assessment. The basic equation for analyzing student learning over time is:
Observed score = true score (at time of assessment) + error score
What this equation adds to the basic equation from traditional assessment is that the true score for a particular student on a particular test is tied to a particular time. A student’s true score, then, changes from assessment to assessment. Time is now a factor in any analysis of the reliability of CAs, and there is no need to assume that students have not changed from assessment to assessment.
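One compact way to formalize this time-indexed equation, in the spirit of the growth-modeling literature cited above (the subscript notation and the linear form are illustrative assumptions rather than the book's own notation), is:

```latex
X_{it} = T_i(t) + E_{it}, \qquad T_i(t) = \beta_{0i} + \beta_{1i}\, t
```

Here X_it is student i's observed score on the assessment given at time t, T_i(t) is that student's true score at that time, and E_it is random error. The second expression assumes a simple linear growth pattern, such as the trend line in figure I.2, in which β0i is the student's starting true score and β1i is the gain per assessment.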
As we administer more CAs to a student on the same topic, we have more evidence about the student’s increasing true score. Additionally, we can track the student’s growth over time. Finally, using this time-based approach, the pattern of scores for an individual student can be analyzed mathematically to compile the best estimates of the student’s true scores on each of the tests in the set. Consider figure I.2.
Figure I.2: Linear trend for five scores over time from an individual student.
Note that there are five bars and a line cutting across those bars. The five vertical bars represent the individual student’s observed scores on five assessments administered on one topic over a given period of time (let’s say a nine-week grading period).
Normally, an average of these five scores is computed to represent the student’s final score for the grading period. In this case, the average of the five scores is 78. This doesn’t seem to reflect the student’s learning, however, because three of the observed scores were higher than this average. Alternatively, the first four scores might be thought of as formative practice only. In this case, the last score of 84 is considered the summative score, and it would be the only one reported. But if we consider this single final assessment in isolation, we also must consider the error associated with it. As shown in table I.2, even if the assessment had a reliability coefficient of 0.85, we would have to add and subtract six points to have reasonable confidence about the student’s true score. That range of scores, the 95 percent confidence interval, would be 78 to 90.
Using the new paradigm for CAs and the new time-based equation, estimates of the true score on each assessment can be made. This is what the line cutting through the five bars represents. The student’s observed score on the first test was 71, but the estimated true score was 72. The second observed score was 75, as was the estimated true score, and so on.
We consider how this line and others are computed in depth in chapter 4 (page 83), but here the point is that analyzing sets of scores for the same student on the same topic over time allows us to make estimations of the student’s true scores as opposed to using the observed scores only. When we report a final summative score for the student, we can do so with much more assurance. In this case, the observed final score of 84 is the same as the predicted score, but now we have the evidence of the previous four assessments to support the precision of that summative score.
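As a check on the numbers in this example, the estimated true scores described above behave like an ordinary least-squares trend fitted to the five observed scores. The sketch below is one simple way to compute such a trend (it is not necessarily the exact estimation procedure presented in chapter 4); it reproduces the estimates of 72 and 75 for the first two assessments and 84 for the last.

```python
# Observed scores on five parallel assessments of one topic for one student.
scores = [71, 75, 81, 79, 84]
times = list(range(len(scores)))   # 0, 1, 2, 3, 4

# Ordinary least-squares line: estimated_true = intercept + slope * time
n = len(scores)
mean_t = sum(times) / n
mean_s = sum(scores) / n
slope = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, scores)) / sum(
    (t - mean_t) ** 2 for t in times
)
intercept = mean_s - slope * mean_t

estimated_true = [round(intercept + slope * t) for t in times]
print(estimated_true)  # [72, 75, 78, 81, 84]
```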
This approach also allows us to see how much a student has learned. In this case, the student’s first score was 71, and his last score was 84, for a gain of thirteen points. Finally, chapter 3 (page 59) presents ways that do not rely on complex mathematical calculations to make estimates of students’ true scores across a set of assessments. I address the issue of measuring student growth in chapters 3 and 4. This book also presents formulas that allow educators to program readily available tools like Excel to perform all calculations.
The Large-Scale Assessment Paradigm for Validity
The general definition for the validity of an assessment is that it measures what it is designed to measure. For large-scale assessments, this tends to create a problem from the outset since most large-scale assessments are designed to measure entire subject areas for a particular grade level. For example, a state test in English language arts (ELA) at the eighth-grade level is designed to measure all the content taught at that level. A quick analysis of the content in eighth-grade ELA demonstrates the problem.
According to Robert J. Marzano, David C. Yanoski, Jan K. Hoegh, and Julia A. Simms (2013), there are seventy-three eighth-grade topics for ELA in the CCSS. Researchers and educators refer to these as elements. Each of these elements contains multiple embedded topics, which means that a large-scale assessment must have multiple sections to be considered a valid measure of those topics.
Of course, sampling techniques would allow large-scale test designers to address a smaller subset of the seventy-three elements. However, validity is still a concern. To cover even a representative sample of the important content would require a test that is too long to be of practical use. As an example, assume that a test was designed to measure thirty-five (about half) of the seventy-three ELA elements for grade 8. Even if each element had only five items, the test would still contain 175 items, rendering it impractical for classroom use.
The New CA Paradigm for Validity
Relative to validity, CAs have an advantage over large-scale assessments in that they can and should be focused on a single topic (technically referred to as a single dimension). In fact, making assessments highly focused in terms of the content they address is a long-standing recommendation from the assessment community to increase validity (see Kane, 2011; Reckase, 1995). This makes intuitive sense. Since CAs will generally focus on one topic or dimension over a relatively short period, teachers can more easily ensure that they have acceptable levels of validity. Indeed, recall from the previous discussion that some measurement experts contend that CAs have such high levels of validity that we should not be concerned about their seemingly poor reliability.
The aspect of CA validity that is more difficult to address is that all tests within a set must measure precisely the same topic and contain items at the same levels of difficulty. This requirement is obvious if one examines the scores depicted in figure I.2. If these scores are to truly depict a given student’s increase in his or her true score for the topic being measured, then educators must design the tests to be as identical as possible. If, for example, the fourth test in figure I.2 is much more difficult than the third test, a given student’s observed score on that fourth test will be lower than the score on the third test even though the student’s true score has increased (the student has learned relative to the topic of the tests).
Sets of tests designed to be close to one another in the topic measured and the levels of difficulty of the items are referred to as parallel tests. In more technical terms, parallel tests measure the same topic and have the same types of items both in format and difficulty levels. I address how to design parallel tests in depth in chapters 2 and 3 (pages 39 and 59, respectively). Briefly, though, the more specific teachers are regarding the content students are to master and the various levels of difficulty, the easier it is for them to design parallel tests. To do this, a teacher designing a test must describe in adequate detail not only the content that demonstrates proficiency for a specific standard but also simpler content that will be directly taught and is foundational to demonstrating proficiency. Additionally, it is important to articulate what a student needs to know and do to demonstrate competence beyond the target level of proficiency. To illustrate, consider the following topic that might be the target for third-grade science.
Students will understand how magnetic forces can affect two objects not in contact with one another.
To make this topic clear enough that teachers can design multiple assessments that are basically the same in terms of the content and its levels of difficulty, it is necessary to expand it to the level of detail depicted in table I.3, which provides three levels of content for the topic. The target level clearly describes what students must do to demonstrate proficiency. The basic level identifies important, directly taught vocabulary and basic processes. Finally, the advanced level describes a task that demonstrates students’ ability to apply the target content.
Table I.3: Three Levels of Difficulty for Topic
Level of Content | Content |
Advanced | Students will design a device that uses magnets to solve a problem. For example, students will be asked to identify a problem that could be solved using the attracting and repelling qualities of magnets, and create a prototype of the design. |
Target | Students will learn how magnetic forces can affect two objects not in contact with one another. For example, students will determine how magnets interact with other objects (including different and similar poles of other magnets), and experiment with variables that affect these interactions (such as orientation of magnets and distance between materials or objects). |
Basic | Students will recognize or recall specific vocabulary, such as attraction, bar magnet, horseshoe magnet, magnetic field, magnetic, nonmagnetic, north pole, or south pole. Students will perform basic processes, such as: • Explain that magnets create areas of magnetic force around them • Explain that magnets always have north and south poles • Provide examples of magnetic and nonmagnetic materials • Explain how two opposite poles interact (attracting) and how two like poles interact (repelling) • Identify variables that affect the strength of magnetic force (for example, distance between objects or magnet size) |
Source: Adapted from Simms, 2016.
The teacher now has three levels of content, all on the same topic, that provide specific directions on how to create classroom assessments on the same topic and the same levels of difficulty. I discuss how classroom teachers can do this in chapter 2 (page 39).
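To make the notion of parallel forms concrete, here is a hypothetical sketch of how the three levels in table I.3 might be encoded as a blueprint from which equivalent forms are drawn. The item stems, the item counts per level, and the build_parallel_form helper are illustrative assumptions, not a procedure from the book.

```python
import random

# Hypothetical blueprint for the magnetic-forces topic in table I.3. Each level
# holds a pool of item stems; every parallel form draws the same number of items
# from each level, keeping content and difficulty aligned across forms.
blueprint = {
    "basic": {
        "items_per_form": 3,
        "pool": [
            "Define 'magnetic field'.",
            "Name the two poles of a bar magnet.",
            "Give one example of a nonmagnetic material.",
            "Describe what happens when two like poles are brought together.",
        ],
    },
    "target": {
        "items_per_form": 2,
        "pool": [
            "Predict how distance affects the pull between two magnets.",
            "Explain how a magnet can move an object it is not touching.",
            "Explain how orientation changes whether two magnets attract or repel.",
        ],
    },
    "advanced": {
        "items_per_form": 1,
        "pool": [
            "Design a device that uses magnets to solve a problem you identify.",
            "Sketch a prototype that uses attraction and repulsion to move an object.",
        ],
    },
}

def build_parallel_form(blueprint, seed):
    """Draw a form with the same number of items per level as every other form."""
    rng = random.Random(seed)
    form = []
    for level, spec in blueprint.items():
        form.extend((level, item) for item in rng.sample(spec["pool"], spec["items_per_form"]))
    return form

# Two forms with matched structure, suitable for administration at different points in time.
form_a = build_parallel_form(blueprint, seed=1)
form_b = build_parallel_form(blueprint, seed=2)
```

Because each form draws the same number of items from each level of the same blueprint, the forms stay matched in content and difficulty, which is the defining requirement for parallel tests described above.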
What to Expect in This Book
Teachers and administrators for grades K–12 will learn how to revamp the concepts of validity and reliability so they match the technical advances made in CA, instead of matching large-scale assessment’s traditional paradigms for validity and reliability. This introduction lays the foundation. It introduces the new validity and reliability paradigms constructed for CAs. Chapters 1–5 describe these paradigms in detail. Chapter 1 covers the new CA paradigm for validity, noting the qualities of three major types of validity and two perspectives teachers can take regarding classroom assessments. Chapter 2 then conveys the variety of CAs that teachers can use to construct parallel assessments, which measure students’ individual growth. Chapter 3 addresses the new CA paradigm for reliability and how it shifts from the traditional conception of reliability; it presents three mathematical models of reliability. Then, chapter 4 expresses how to measure groups of students’ comparative growth and what purposes this serves. Finally, chapter 5 considers helpful changes to report cards and teacher evaluations based on the new paradigms for CAs. The appendix features formulas that teachers, schools, and districts can use to compute the reliability of CAs in a manner that is comparable to the level of precision offered by large-scale assessments.