The New Art and Science of Classroom Assessment - Robert J Marzano
INTRODUCTION
The New Paradigm for Classroom Assessment
This book is about a paradigm shift in the way teachers use and interpret assessments in the classroom. It is also about increasing the rigor and utility of classroom assessments to a point where educators view them as a vital part of a system of assessments that they can use to judge the status and growth of individual students. This is a critical point. If we are to assess students in the most accurate and useful ways, then we must think in terms of merging the information from classroom assessments with other types of assessments. Figure I.1 shows the complete system of assessments a school should use.
Source: Marzano, 2018, p. 6.
Figure I.1: The three systems of assessment.
Perhaps the most visible of the three types of assessments in figure I.1 is year-end assessments. M. Christina Schneider, Karla L. Egan, and Marc W. Julian (2013) describe year-end assessments as follows:
States administer year-end assessments to gauge how well schools and districts are performing with respect to the state standards. These tests are broad in scope because test content is cumulative and sampled across the state-level content standards to support inferences regarding how much a student can do in relation to all of the state standards. Simply stated, these are summative tests. The term year-end assessment can be a misnomer because these assessments are sometimes administered toward the end of the year, usually March or April and sometimes during the first semester of the school year. (p. 59)
The next level of assessment in the model in figure I.1 is interim assessments. Schneider and colleagues (2013) describe them as follows: “Interim assessments (sometimes referred to as benchmark assessments) are standardized, periodic assessments of students throughout a school year or subject course” (p. 58).
Professional test makers typically design both types of assessments, which are built to exhibit the psychometric properties educators associate with high reliability and validity, as defined in large-scale assessment theory. As its name indicates, large-scale assessment theory focuses on tests that are administered to large groups of students, like year-end state tests. As we indicate in figure I.1, the most frequent type of assessment is classroom assessment. Unfortunately, some educators assume they can’t use classroom assessments to make decisions about individual students because the assessments do not exhibit the same psychometric properties as the externally designed assessments. While this assumption has intuitive appeal, it is misleading; in this book we assert that classroom assessments can actually be more precise than external assessments when it comes to examining the performance of individual students.
This chapter outlines the facts supporting our position. In the remaining chapters, we fill in the details about how educators can design and use classroom assessments to fulfill their considerable promise.
It is important to remember that all three types of assessment we depict in figure I.1 have important roles in the overall process of assessing students. To be clear, we are not arguing that educators should discontinue or discount year-end and interim assessments in favor of classroom assessments. We are asserting that of the three types of assessment, classroom assessments should be the most important source of information regarding the status and growth of individual students.
We begin by discussing the precision of externally designed assessments.
The Precision of Externally Designed Assessments
Externally designed assessments, like year-end and interim assessments, typically follow the tenets of classical test theory (CTT), which dates back at least to the early 1900s (see Thorndike, 1904). At its core, CTT proposes that all assessments contain a certain degree of error, as the following equation shows.
Observed Score = True Score + Error Score
This equation indicates that the score a test taker receives (the observed score) on any type of assessment comprises two components—a true component and an error component. The true component (the true score) is what a test taker would receive under ideal conditions—the test is perfectly designed and the situation in which students take the test is optimal. The error component (the error score) is also a part of the observed score. This component represents factors that can artificially inflate or deflate the observed score. For example, the test taker might guess correctly on a number of items that would artificially inflate the observed score, or the test taker might misinterpret a few items for which he or she actually knows the correct answers, which would artificially deflate the observed score.
Probably the most important aspect of error is that it makes observed scores imprecise to at least some degree. Stated differently, the scores that any assessment generates will always contain some amount of error. Test makers report the amount of error one can expect in the scores on a specific test as a reliability coefficient. Such coefficients range from a low of 0.00 to a high of 1.00. A reliability of 0.00 indicates that the scores that a specific assessment produces are nothing but error. If students took the same test again sometime after they had completed it the first time, they would receive completely different scores. A reliability of 1.00 indicates that scores on the test are perfectly accurate for each student. This means that the scores contain no errors. If students took the same test right after they had completed it the first time, they would receive precisely the same scores. Fortunately, no externally designed assessments have a reliability of 0.00. Unfortunately, no externally designed assessments have a reliability of 1.00—simply because it is impossible to construct such a test.
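The link between the error component and the reliability coefficient can be made concrete with a small simulation. The sketch below is illustrative only. It assumes normally distributed true scores and errors, and it uses the conventional CTT reading of reliability as the share of observed-score variance that comes from true scores. The function names, the sample size, the mean of 70, and the standard deviation of 15 are assumptions chosen for the example, not values taken from any real test.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_retest(reliability, sd=15.0, mean=70.0, n_students=20000, seed=0):
    """Simulate two administrations of one test and correlate the scores.

    Assumption: observed = true + error, with independent normal errors,
    and reliability = true-score variance / observed-score variance.
    """
    rng = random.Random(seed)
    true_sd = sd * reliability ** 0.5          # spread of true scores
    error_sd = sd * (1.0 - reliability) ** 0.5  # spread of the error component
    true_scores = [rng.gauss(mean, true_sd) for _ in range(n_students)]
    # Each administration adds fresh, independent error to the same true score.
    first = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    second = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson(first, second)


for r in (0.85, 0.55):
    print(f"target reliability {r:.2f}, simulated correlation {simulate_retest(r):.3f}")
```

Under these assumptions, the correlation between the two simulated administrations lands close to the target reliability, which is one way to see why the coefficient describes the consistency of a whole set of scores.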
Most externally designed assessments have reliabilities of about 0.85 or higher. Unfortunately, even with a relatively high reliability, the information a test provides about individuals has a great deal of error in it, as figure I.2 shows.
Note: The standard deviation of this test was 15, and the upper and lower limits have been rounded.
Figure I.2: Reliabilities and 95 percent confidence intervals.
Figure I.2 depicts the degree of precision of individual students’ scores across five levels of reliability: 0.45, 0.55, 0.65, 0.75, and 0.85. These levels represent the range of reliabilities one can expect for assessments students will see in K–12 classrooms. At the low end are assessments with reliabilities of 0.45. These might be hastily designed assessments that teachers create. At the high end are externally designed assessments with reliabilities of 0.85 or even higher. The second column represents the observed score, which is 70 in all situations. The third and fourth columns represent the lower limit and upper limit of a band of scores into which we can be 95 percent sure that the true score falls. The range represents the size of the 95 percent confidence interval.
The pattern of scores in figure I.2 indicates that as reliability goes down, one has less and less confidence in the accuracy of the observed score for an individual student. For example, if the reliability of an assessment is 0.85, we can be 95 percent sure that the student’s true score is somewhere between eleven points lower than the observed score and eleven points higher than the observed score, for a range of twenty-two points. However, if the reliability of an assessment is 0.55, we can be 95 percent sure that the true score is anywhere between twenty points lower than the observed score and twenty points higher than the observed score.
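The limits described above can be approximated with the conventional standard error of measurement, SEM = SD × √(1 − reliability), and a 95 percent band of ±1.96 × SEM around the observed score. The sketch below assumes that this is how the limits were obtained; the function name and the rounding of each limit to a whole number are illustrative choices. The standard deviation of 15 and the observed score of 70 come from the note and the description of figure I.2.

```python
import math


def confidence_interval(observed, reliability, sd=15.0, z=1.96):
    """Return (lower, upper, range) of an approximate 95 percent band.

    Assumption: SEM = sd * sqrt(1 - reliability); limits are rounded
    to whole points before the range is computed.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    half_width = z * sem
    lower = round(observed - half_width)
    upper = round(observed + half_width)
    return lower, upper, upper - lower


for reliability in (0.85, 0.75, 0.65, 0.55, 0.45):
    lower, upper, width = confidence_interval(70, reliability)
    print(f"reliability {reliability:.2f}: {lower} to {upper} (range {width})")
```

With a reliability of 0.85 this yields limits of 59 and 81, a range of twenty-two points, and with 0.55 it yields 50 and 90, matching the eleven-point and twenty-point bands discussed in the text.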
These facts have massive implications for how we design and interpret assessments. Consider the practice of using one test to determine if a student is competent in a specific topic. If the test has a reliability of 0.85, an individual student’s true score could be eleven points higher or lower than the observed score. If the test has a reliability of 0.55, an individual student’s true score could be twenty points higher or lower than the observed score. Making the situation worse, in both cases we are only 95 percent sure the true score is within the identified lower and upper limits. We cannot overstate the importance of this point. All too often and in the name of summative assessment, teachers use a single test to determine if a student is proficient in a specific topic. If a student’s observed score is equal to or greater than a set cut score, teachers consider the student to be proficient. If a student’s score is below the set cut score, even by a single point, teachers consider the student not to be proficient.
Examining figure I.2 commonly prompts the question, Why are assessments so imprecise regarding the scores for individual students even if they have relatively high reliabilities? The answer to this question is simple. Test makers designed and developed CTT with the purpose of scoring groups of students as opposed to scoring individual students. Reliability coefficients, then, tell us how similar or different groups of scores would be if students retook a test. They cannot tell us about the variation in scores for individuals. Lee J. Cronbach (the creator of coefficient alpha, one of the most popular reliability indices) and his colleague Richard J. Shavelson (2004) strongly emphasize this point when they refer to reliability coefficients as “crude devices” (p. 394) that really don’t tell us much about individual test takers.
To illustrate what reliability coefficients tell us, consider figure I.3.
Source: Marzano, 2018, p. 62.
Figure I.3: Three administrations of the same test.
Figure I.3 illustrates precisely what a traditional reliability coefficient means. The first column, Initial Administration, reports the scores of ten students on a specific test. The second column, Second Administration (A), represents the scores from the same students after they have taken the test again. But before students took the test the second time, they forgot that they had taken it the first time, so the items appear new to them. While this cannot occur in real life and seems like a preposterous notion, it is, in fact, a basic assumption underlying the reliability coefficient. As Cronbach and Shavelson (2004) note:
If, hypothetically, we could apply the instrument twice and on the second occasion have the person unchanged and without memory of his first experience, then the consistency of the two identical measurements would indicate the uncertainty due to measurement error. (p. 394)
The traditional reliability coefficient simply tells how similar the score set is between the first and second test administrations. In figure I.3, the scores on the first administration and the second administration (A) are quite similar. Student 1 receives a 97 on the first administration and a 98 on the second administration; student 2 receives a 92 and a 90 respectively, and so on. There are some differences between the scores, but they are small. The last row of the table shows the correlation between the initial administration and the second administration. That correlation (0.96) is, in fact, the reliability coefficient, and it is quite high.
But let’s now consider another scenario, as we depict in the last column of figure I.3, Second Administration (B). In this scenario, students receive very different scores on the second administration. Student 1 receives a score of 97 on the first administration and a score of 82 on the second; student 2 receives a 92 and 84 respectively. If the second administration of the test produces a vastly different pattern of scores, we would expect the correlation between the two administrations (or the reliability coefficient) to be quite low, which it is. The last row of the table indicates that the reliability coefficient is 0.32.
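The two scenarios can be sketched with a plain Pearson correlation. In the example below, only the scores for students 1 and 2 come from the text; the scores for the other eight students are hypothetical values invented for illustration, so the resulting coefficients are not expected to reproduce the 0.96 and 0.32 reported in figure I.3.

```python
def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# First two entries are students 1 and 2 from the text; the rest are hypothetical.
initial = [97, 92, 88, 85, 80, 76, 73, 70, 65, 60]
second_a = [98, 90, 89, 83, 82, 75, 74, 68, 66, 61]  # scores barely move
second_b = [82, 84, 70, 90, 65, 88, 72, 79, 85, 68]  # scores reshuffle

r_a = pearson(initial, second_a)
r_b = pearson(initial, second_b)
print(f"initial vs. second (A): {r_a:.2f}")
print(f"initial vs. second (B): {r_b:.2f}")
```

When the rank order and spacing of scores are preserved, as in scenario A, the coefficient is close to 1.00; when students trade places, as in scenario B, it drops sharply. In both cases the coefficient summarizes the whole score set and says nothing about how far any one student's score moved.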
So how can educators obtain precise scores for individual students using classroom assessments? The answer to this question is that they can design multiple assessments and administer them over time.
Multiple Assessments
The preceding discussion indicates that as long as we think of tests as independent events whose scores educators must interpret in isolation, there is little hope for precision at the individual student level. However, if one changes the perspective from a single assessment to multiple assessments administered and interpreted over time, then it becomes not only possible but straightforward to generate a relatively precise summary score for individuals.
To illustrate, consider the following five scores for an individual student on a specific topic gathered over the course of a grading period.
70, 72, 75, 77, 81
We have already discussed that any one of these scores in isolation probably does not provide a great deal of accuracy. Recall from figure I.2 that even if all test reliabilities were 0.85, we would have to add and subtract about eleven points to compute an interval score into which we are 95 percent sure the true score actually falls. But if we consider the pattern of these scores, we can have a relatively high degree of confidence in the scores, particularly as more time passes and we collect more scores.
The pattern makes clear that over time, the student’s scores have been gradually increasing. This makes intuitive sense. If the student is learning and the assessments are accurate, we would expect to see the scores continually go up. The more scores that precede any given score, the more one can judge the accuracy of that score. In the previous series, the first score is 70. In judging its accuracy, we would have to treat it like an individual assessment, so we wouldn’t have much confidence in its accuracy. But with the second score of 72, we now have two data points. Since we can reasonably assume that the student is learning, it makes sense that his or her score would increase. We now have more confidence in the score of 72 than we did with the single score of 70. By the time we have the fifth score of 81, we have amassed a good deal of antecedent information with which to judge its accuracy. Although we can’t say that 81 is precisely accurate, we can say the student’s true score is probably close to it. In subsequent chapters, we present techniques for specifying the accuracy of this final score of 81.
It’s important to note that some data patterns would indicate a lack of accuracy in the test scores. To illustrate, consider the following pattern of scores.
70, 76, 65, 82, 71
Assuming that the student who exhibited these scores is learning over time, the pattern doesn’t make much sense. The student began and ended the grading period with about the same score. In between, the student exhibited some scores that were significantly higher and some scores that were significantly lower. This pattern implies that there was probably a great deal of error in the assessments. (Again, we discuss how to interpret such aberrant patterns in subsequent chapters.) This scenario illustrates the need for a new view of summative scores.
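One simple way to compare the two score patterns is to fit a straight line to each and look at how far the scores stray from it. The sketch below uses an ordinary least-squares line as a stand-in; it is an assumption made for illustration, not necessarily the technique the later chapters present. The function name and the use of assessment order (0, 1, 2, ...) as the time axis are also illustrative.

```python
def linear_trend(scores):
    """Fit a least-squares line to scores taken in order.

    Returns (slope, intercept, rms_residual), where rms_residual is the
    root-mean-square distance of the scores from the fitted line.
    """
    n = len(scores)
    xs = range(n)  # assessment order stands in for time
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    rms_residual = (sum(r * r for r in residuals) / n) ** 0.5
    return slope, intercept, rms_residual


steady = [70, 72, 75, 77, 81]
erratic = [70, 76, 65, 82, 71]

for label, scores in (("steady", steady), ("erratic", erratic)):
    slope, intercept, rms = linear_trend(scores)
    print(f"{label}: slope {slope:.1f} points per assessment, "
          f"typical distance from trend {rms:.1f} points")
```

For the first pattern the scores rise by about 2.7 points per assessment and sit within roughly half a point of the line. For the second, the slope is under one point per assessment and the scores scatter more than five points from the line, which is consistent with a large error component.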
The New View of Summative Scores
The practice of examining the mounting evidence that multiple assessments provide is a veritable sea change in the way we think of summative assessments for individual students. More specifically, we have seen school leaders initiate policies in which they make a sharp distinction between formative assessments and summative assessments. Within these policies, educators consider formative assessments as practice only, and they do not record scores from these assessments. They consider summative tests as the “real” assessments, and the scores from them play a substantive role in a student’s final grade.
As the previous discussion illustrates, this makes little sense for at least two reasons. First, the single score educators derive from the summative assessment is not precise enough to support absolute decisions about individual students. Second, not recording formative scores is tantamount to ignoring all the historical assessment information that teachers can use to estimate a student’s current status. We take the position that educators should use the terms formative and summative scores, as opposed to formative and summative assessments, to meld the two types of assessments into a unified continuum.
Also, teachers should periodically estimate students’ current summative scores by examining the pattern of the antecedent scores. We describe this process in depth in chapter 5. Briefly, though, consider the pattern of five scores we described previously: 70, 72, 75, 77, 81. A teacher could use this pattern to assign a current summative score without administering another assessment. The pattern clearly indicates steady growth for the student and makes the last score of 81 appear quite reasonable.
The process of estimating a summative score as opposed to relying only on the score from a single summative test works best if the teacher uses a scale that automatically communicates what students already know and what they still have to learn. A single score of 81 (or 77 or pretty much any score on a one hundred–point scale) doesn’t communicate much about a student’s knowledge of specific content. However, a score on a proficiency scale does and greatly increases the precision with which a teacher can estimate an individual student’s summative score.
The Need for Proficiency Scales
We discuss the nature and function of proficiency scales in depth in chapter 2. For now, figure I.4 provides an example of a proficiency scale.
Figure I.4: Sample proficiency scale for fourth-grade science.
Notice that the proficiency scale in figure I.4 has three levels of explicit content. It is easiest to understand the nature of a proficiency scale if we start with the content at the score 3.0 level. It reads, The student will explain how vision (sight) is a product of light reflecting off objects and entering the eye. This is the desired level of expertise for students. When students can demonstrate this level of competence, teachers consider them to be proficient.
Understanding the score 2.0 content is necessary to demonstrate competency on the score 3.0 content, which teachers will directly teach to students. In the proficiency scale, score 2.0 content reads, The student will recognize or recall specific vocabulary (for example, brain, cone, cornea, image, iris, lens, light, optic nerve, perpendicular angle, pupil, reflection, retina, rod, sight, dilate) and perform basic processes, such as:
• Describe physical changes that happen in the eye as a reaction to light (for example, the pupil dilates and contracts)
• Trace the movement of light as it moves from a source, reflects off an object, and enters the eye
• Diagram the human eye and label its parts (cornea, iris, pupil, lens, retina, optic nerve)
• Describe the function of rods and cones in the eye
• Recognize that the optic nerve carries information from both eyes to the brain, which processes the information to create an image
The score 4.0 content requires students to make inferences and applications that go above and beyond the score 3.0 content. In the proficiency scale, it reads, In addition to score 3.0 performance, the student will demonstrate in-depth inferences and applications that go beyond what was taught. For example, the student will explain how distorted light impacts vision (for example, explain why a fish in clear water appears distorted due to light refraction). The example provides one way in which the student might demonstrate score 4.0 performance.
The other scores in the scale do not contain new content but do represent different levels of understanding relative to the content. For example, score 1.0 means that with help, the student has partial understanding of some of the simpler details and processes and some of the more complex ideas and processes. And score 0.0 means that even with help, the student demonstrates no understanding or skill. The scale also contains half-point scores, which signify achievement between two whole-point scores. Again, we address proficiency scales in depth in chapter 2.
With a series of scores on a proficiency scale as opposed to a one hundred–point scale, a teacher can more accurately estimate a summative score using antecedent formative scores. This is because we can reference a score on a proficiency scale to a continuum of knowledge, regardless of the test format. A score of 3.0 on a test means that the student has demonstrated competence regardless of the type of test. This is not the case with the one hundred–point scale. For example, a teacher can only interpret a score of 85 in terms of levels of knowledge if he or she examines the items on the test. This characteristic of proficiency scales suits them well for examining trends in learning. To illustrate, consider the following pattern of proficiency scale scores for a student on a specific topic.
1.0, 2.0, 2.0, 3.0, 2.5
The first score of 1.0 indicates that at the beginning of the grading period, the student demonstrates little knowledge of the topic on his or her own but, with help, has partial understanding of the score 2.0 and 3.0 content. By the time the next assessment occurs, the student seems to have a solid knowledge of the score 2.0 content, which carries on into the third assessment. Such content involves basic information the teacher directly teaches. The fourth assessment sees a big jump in understanding, indicating that the student knows the score 2.0 and 3.0 content. However, on the final assessment, the student score of 2.5 indicates a solid understanding of the score 2.0 content but only partial understanding of the score 3.0 content. Even though this student’s pattern does not show growth across every assessment, it still provides enough evidence for the teacher to assign a summative score of at least 2.5. Proficiency scales make the new paradigm for classroom assessments concrete and viable.
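Because every score on a proficiency scale points to a described level of knowledge, the scale can be represented as a simple lookup. The sketch below is one hypothetical representation: the dictionary name, the `describe` function, and the abbreviated wording of each level are illustrative, with the level descriptions condensed from the sample scale discussed above.

```python
import math

# Whole-point levels, condensed from the sample fourth-grade science scale.
PROFICIENCY_SCALE = {
    4.0: "In addition to score 3.0 performance, in-depth inferences and "
         "applications that go beyond what was taught",
    3.0: "The student will explain how vision (sight) is a product of light "
         "reflecting off objects and entering the eye",
    2.0: "The student will recognize or recall specific vocabulary and "
         "perform basic processes",
    1.0: "With help, partial understanding of some of the simpler details and "
         "processes and some of the more complex ideas and processes",
    0.0: "Even with help, no understanding or skill demonstrated",
}


def describe(score):
    """Return what a whole-point or half-point score communicates."""
    if score in PROFICIENCY_SCALE:
        return PROFICIENCY_SCALE[score]
    lower = math.floor(score)
    upper = lower + 1
    # Half-point scores signify achievement between two whole-point scores.
    if score - lower == 0.5 and lower in PROFICIENCY_SCALE and upper in PROFICIENCY_SCALE:
        return f"Achievement between score {lower:.1f} and score {upper:.1f}"
    raise ValueError(f"{score} is not a score on this scale")


for score in (1.0, 2.0, 2.0, 3.0, 2.5):
    print(f"{score:.1f}: {describe(score)}")
```

Unlike a score of 81 on a one hundred-point scale, each entry in the printed pattern can be read directly as a statement about what the student knows, regardless of the format of the test that produced it.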
This Book
Chapter 1, “The Assessment-Friendly Curriculum,” provides evidence for the claim that virtually every state’s standards simply contain too much content to effectively assess, let alone teach. Consequently, classroom educators must identify the critical content within the standards to explicitly teach and measure in order to determine students’ current status as well as their growth. Chapter 2, “Proficiency Scales,” points out that it’s not enough to identify specific learning targets for students relative to each topic. To measure student growth, teachers must develop well-defined continua of knowledge for each topic. These continua form the basis for designing scales teachers can use to develop assessments and plan instruction. Chapter 3, “Parallel Assessments,” not only describes the defining characteristics of parallel assessments in detail but also provides specific guidelines about how to create such assessments. In addition, it describes how to score parallel assessments. Chapter 4, “The Measurement Process and Different Types of Assessments,” presents a way of viewing classroom assessment and scoring as a seamless and united endeavor that represents the new paradigm of classroom assessment. Chapter 5, “Summative Scores,” describes techniques that allow teachers to determine the level of precision they can assign to scores for individual students. Some of these techniques require the aid of technology, and some do not. Chapter 6, “Non-Subject-Specific Skills,” addresses skills such as cognitive and metacognitive skills, which standards documents commonly mention but which do not fit into any one subject area. Chapter 7, “Record Keeping and Reporting,” addresses not only how teachers can efficiently keep records of scores from classroom assessments but also how to transform those scores into report cards that demonstrate each student’s status and growth.
Finally, note that this book does not address the technical and psychometric issues that accompany the recommendations that we make. For a thorough discussion of these matters, the reader should consult Making Classroom Assessments Reliable and Valid (Marzano, 2018).
If classroom assessments are to fulfill their bright promise, educators must recognize that large-scale assessment theory is not the appropriate tool for designing and administering teacher-designed assessments. Rather, educators must employ a new theory base specific to the classroom. This book presents that theory.