Chapter 1

RESEARCH AND THEORY

Assessment and grading are two of the most talked about and sometimes misunderstood aspects of K–12 education. Formative Assessment and Standards-Based Grading seeks to bring some clarity to one particular type of assessment—formative—and explore through recommendations how it interacts with traditional and nontraditional grading practices. In this chapter, we review the research and theory that underpin these recommendations. We begin by discussing feedback, the practice in which both assessment and grading have their roots.

Feedback

The topic of feedback and its effect on student achievement is of great interest to researchers and practitioners. In fact, studies on the relationship between the two are plentiful and span about three decades. In an effort to operationally define feedback, researchers John Hattie and Helen Timperley (2007) explained that its purpose is “to reduce discrepancies between current understandings and performance and a goal” (p. 86). Researcher Valerie Shute (2008) said feedback is “information communicated to the learner that is intended to modify his or her thinking or behavior for the purpose of improving learning” (p. 154).

Feedback can be given formally or informally in group or one-on-one settings. It can take a variety of forms. As the preceding definitions illustrate, its most important and dominant characteristic is that it informs the student, the teacher, and all other interested parties about how to best enhance student learning.

Table 1.1 (page 4) presents the results from a variety of studies on feedback. The first column lists the major studies that have been conducted since 1976. The last three columns are related. Critical to understanding exactly how they are related are the concepts of meta-analysis and effect size (ES). Appendix B (page 153) explains the concepts of meta-analysis and effect size in some depth. Briefly though, meta-analysis is a research technique for quantitatively synthesizing a series of studies on the same topic. For example, as table 1.1 indicates, Kluger and DeNisi (1996) synthesized findings from 607 studies on the effects of feedback interventions. Typically, meta-analytic studies report their findings in terms of average ESs (see the ES column in table 1.1). In the Kluger and DeNisi meta-analysis, the average ES is 0.41. An effect size tells you how many standard deviations larger (or smaller) the average score for a group of students who were exposed to a given strategy (in this case, feedback) is than the average score for a group of students who were not exposed to a given strategy (in this case, no feedback). In short, an ES tells you how powerful a strategy is; the larger the ES, the more the strategy increases student learning.
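Stated as a formula (the standard definition implied by this description, though not written out in the text), the effect size is simply the difference between the two group averages expressed in standard deviation units:

Effect size = (average score with the strategy − average score without the strategy) ÷ standard deviation

Thus an ES of 0.41 means that the average student who received feedback scored 0.41 standard deviations higher than the average student who did not.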

Table 1.1 Research Results for Feedback


a Reported in Fraser, Walberg, Welch, & Hattie, 1987.

b Reported in Hattie & Timperley, 2007.

c Feedback was embedded in general metacognitive strategies.

d The dependent variable was engagement.

e Reported in Hattie, 2009.

ESs are typically small numbers. However, small ESs can translate into big percentile gains. For example, the average ES of 0.41 calculated by Kluger and DeNisi (1996) translates into a 16 percentile point gain (see appendix B, page 153, for a detailed description of ESs and a chart that translates ES numbers into percentile gains). Another way of saying this is that a student at the 50th percentile in a class where feedback was not provided (an average student in that class) would be predicted to rise to the 66th percentile if he or she were provided with feedback.
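The translation from an ES to a percentile gain relies on the normal curve: an average student starts at the 50th percentile, and moving up by ES standard deviations places him or her at the percentile given by the normal cumulative distribution. The short Python sketch below is our illustration, not part of the original text; it reproduces the conversions cited in this chapter.

from statistics import NormalDist

def percentile_gain(effect_size):
    # Predicted percentile-point gain for a student starting at the 50th percentile.
    return (NormalDist().cdf(effect_size) - 0.50) * 100

# Effect sizes reported in this chapter
for es in (0.41, 0.70, 0.79, 0.92, -0.14):
    print(f"ES {es:+.2f} -> {percentile_gain(es):+.0f} percentile points")
# Prints +16, +26, +29, +32, and -6 percentile points, matching the figures
# reported in this chapter for ESs of 0.41, 0.70, 0.79, 0.92, and -0.14.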

Hattie and Timperley (2007) synthesized the most current and comprehensive research in feedback and summarized findings from twelve previous meta-analyses, incorporating 196 studies and 6,972 ESs. They calculated an overall average ES of 0.79 for feedback (translating to a 29 percentile point gain). As shown by Hattie (2009), this is twice the average ES of typical educational innovations. One study by Stuart Yeh (2008) revealed that students who received feedback completed more work with greater accuracy than students who did not receive feedback. Furthermore, when feedback was withdrawn from students who were receiving it, rates of accuracy and completion dropped.

Interestingly, though the evidence for the effectiveness of feedback has been quite strong, it has also been highly variable. For example, in their analysis of more than six hundred experimental/control studies, Kluger and DeNisi (1996) found that in 38 percent of the studies they examined, feedback had a negative effect on student achievement. This, of course, raises the critically important questions, What are the characteristics of feedback that produce positive effects on student achievement, and what are the characteristics of feedback that produce negative effects? In partial answer to these questions, Kluger and DeNisi found that negative feedback has an ES of negative 0.14. This translates into a predicted decrease in student achievement of 6 percentile points. In general, negative feedback is feedback that does not let students know how they can get better.

Hattie and Timperley (2007) calculated small ESs for feedback containing little task-focused information (punishment = 0.20; praise = 0.14) but large ESs for feedback that focused on information (cues = 1.10, reinforcement = 0.94). They argued that feedback regarding the task, the process, and self-regulation is often effective, whereas feedback regarding the self (often delivered as praise) typically does not enhance learning and achievement. Operationally, this means that feedback to students regarding how well a task is going (task), the process they are using to complete the task (process), or how well they are managing their own behavior (self-regulation) is often effective, but feedback that simply involves statements like “You’re doing a good job” has little influence on student achievement. Hattie and Timperley’s ultimate conclusion was:

Learning can be enhanced to the degree that students share the challenging goals of learning, adopt self-assessment and evaluation strategies, and develop error detection procedures and heightened self-efficacy to tackle more challenging tasks leading to mastery and understanding of lessons. (p. 103)

Assessment

In K–12 classrooms, the most common form of feedback is an assessment. While the research and theory on feedback and assessment overlap to a great extent, in this section we consider the research and theory that is specific to assessment.

Research on Assessment

The research on the effects of assessments on student learning paints a positive picture. To illustrate, table 1.2 (page 6) provides a synthesis of a number of meta-analytic studies on the effects of assessment as reported by Hattie (2009).

Table 1.2 Meta-Analytic Studies on Assessment as Reported by Hattie (2009)


a Two effect sizes are listed because of the differences in variables as reported by Hattie (2009). Readers should consult that study for more details.

Notice that table 1.2 is subdivided into three categories: frequency of assessment, general effects of assessment, and providing assessment feedback to teachers. The first category speaks to how frequently assessments are given. In general, student achievement benefits when assessments are given relatively frequently as opposed to infrequently. The study by Robert Bangert-Drowns, James Kulik, and Chen-Lin Kulik (1991) depicted in table 1.3 adds some interesting details to this generalization.

Note that in table 1.3, the effect of even one assessment in a fifteen-week period of time is substantial (0.34). Also note that there is a gradual increase in the size of the effect as the number of assessments increases. This trend should not be misconstrued as indicating that the more tests a teacher gives, the more students will achieve. As we shall see in subsequent chapters, a test is only one of many ways to obtain assessment data.

Table 1.3 Achievement Gain Associated With Number of Assessments Over Fifteen Weeks

Number of Assessments    Effect Size    Percentile Point Gain
0                        0              0
1                        0.34           13.5
5                        0.53           20
10                       0.60           22.5
15                       0.66           24.5
20                       0.71           26
25                       0.78           28.5
30                       0.82           29

Note: Effect sizes computed using data reported by Bangert-Drowns, Kulik, and Kulik (1991).

The second category in table 1.2, general effects of assessment, is the broadest and incorporates a variety of perspectives on assessment. Again, many of the specific findings in these studies manifest as the recommendations in subsequent chapters. Here it suffices to note that in the aggregate, these studies attest to the fact that properly executed assessments can be an effective tool for enhancing student learning.

The third category in table 1.2 deals with providing assessment feedback to teachers. Lynn Fuchs and Douglas Fuchs (1986) found that providing teachers with graphic representations of student progress was associated with an ES of 0.70, which translates into a 26 percentile point gain. This is quite consistent with a set of studies conducted at Marzano Research (since renamed Marzano Resources) in which teachers had students chart their progress on specific learning goals (Marzano Resources, 2009). The results are depicted in table 1.4.

Table 1.4 Studies on Students Tracking Their Progress

Study      Effect Size    Percentile Gain
1          2.44           49
2          3.66           49
3          1.50           43
4          -0.39          -15
5          0.75           27
6          1.00           34
7          0.07           3
8          1.68           45
9          0.07           3
10         1.20           38
11         -0.32          -13
12         0.43           17
13         0.84           30
14         0.63           24
Average    0.92           32

Table 1.4 reports the results of fourteen studies conducted by K–12 teachers on the effects of tracking student progress. The average ES of these fourteen studies was 0.92, which translates into a 32 percentile point gain. Taking these findings at face value, one would conclude that learning is enhanced when students track their own progress.

Note that in studies 4 and 11, tracking student progress had a negative effect on student achievement (indicated by the negative ES). As is the case with all assessment (and instructional) strategies, this strategy does not work equally well in all situations. Effective assessment requires ascertaining the correct way to use a strategy. In subsequent chapters, we make recommendations as to the correct way to track student progress.

Formative Assessments

Formative assessment has become very popular in the last decade. It is typically contrasted with summative assessment in that summative assessments are employed at the end of an instructional episode while formative assessments are used while instruction is occurring. As Susan Brookhart (2004, p. 45) explained, “Formative assessment means information gathered and reported for use in the development of knowledge and skills, and summative assessment means information gathered and reported for use in judging the outcome of that development.”

Formative assessments became popular after Paul Black and Dylan Wiliam (1998a) summarized the findings from more than 250 studies on formative assessment. They saw ESs in those studies that ranged from 0.4 to 0.7 and drew the following conclusion:

The research reported here shows conclusively that formative assessment does improve learning. The gains in achievement appear to be quite considerable, and as noted earlier, among the largest ever reported for educational interventions. As an illustration of just how big these gains are, an effect size of 0.7, if it could be achieved on a nationwide scale, would be equivalent to raising the mathematics attainment score of an “average” country like England, New Zealand, or the United States into the “top five” after the Pacific rim countries of Singapore, Korea, Japan, and Hong Kong. (p. 61)

In effect, Black and Wiliam were saying that an ES of 0.70 (the largest ES reported in the studies they summarized), when sustained for an entire nation, would dramatically enhance student achievement. Indeed, consulting the table in appendix B (page 155), we see that an ES of 0.70 is associated with a 26 percentile point gain in student achievement. The reporting of these findings captured the attention of U.S. educators.

The Black and Wiliam study is sometimes referenced as a meta-analysis of some 250 studies on formative assessment. As described in appendix B of this book, a meta-analysis is a quantitative synthesis of research in a specific area. When performing a meta-analysis, a researcher attempts to compute an average ES of a particular innovation (in this case, formative assessment) by examining all of the available studies. While Black and Wiliam certainly performed a rigorous analysis of the studies they examined, they did not conduct a traditional meta-analysis. In fact, in a section of their article titled “No Meta-Analysis,” they explain, “It might seem desirable, and indeed might be anticipated as conventional, for a review of this type to attempt a meta-analysis of the quantitative studies that have been reported” (1998a, p. 52). They go on to note, however, that the 250 studies they examined were simply too different to compute an average ES.

It is important to keep two things in mind when considering the practice of formative assessment. The first is that, by definition, formative assessment is intimately tied to the formal and informal processes in classrooms. Stated differently, it would be a contradiction in terms to use “off the shelf” formative assessment designed by test makers. James Popham (2006) has harshly criticized the unquestioning use of commercially prepared formative assessments. He noted:

As news of Black and Wiliam’s conclusions gradually spread into faculty lounges, test publishers suddenly began to relabel many of their tests as “formative.” This name-switching sales ploy was spurred on by the growing perception among educators that formative assessments could improve their students’ test scores and help schools dodge the many accountability bullets being aimed their way. (p. 86)

To paraphrase Popham (2006), externally developed assessments simply do not meet the defining characteristics of formative assessment. Lorrie Shepard (2006) made the same point:

The research-based concept of formative assessment, closely grounded in classroom instructional processes, has been taken over—hijacked—by commercial test publishers and is used instead to refer to formal testing systems called “benchmark” or “interim assessment systems.” (as cited in Popham, 2006, p. 86)

A similar criticism might be leveled at many district-made “benchmark” assessments in that they frequently violate many of the basic assumptions underlying good formative assessment. As James McMillan (2007) explained:

These tests, which are typically provided by the district or commercial test publishers, are administered on a regular basis to compare student achievement to “benchmarks” that indicate where student performance should be in relation to what is needed to do well on end-of-year high stakes tests…. Although the term benchmark is often used interchangeably with formative in the commercial testing market, there are important differences. Benchmark assessments are formal, structured tests that typically do not provide the level of detail needed for appropriate instructional correctives. (pp. 2–3)

The second thing to keep in mind is that while there is a good deal of agreement about its potential as a tool to enhance student achievement, the specifics of formative assessment are somewhat elusive. In fact, most descriptions of formative assessment are very general in nature. To illustrate, in their original study, Black and Wiliam (1998a) noted that “formative assessment does not have a tightly defined and widely accepted meaning” (p. 7). Dylan Wiliam and Siobhan Leahy (2007) described formative assessment as follows:

The qualifier formative will refer not to an assessment or even to the purpose of an assessment, but rather to the function it actually serves. An assessment is formative to the extent that information from the assessment is fed back within the system and actually used to improve the performance of the system in some way (i.e., that the assessment forms the direction of improvement). (p. 31)

Rick Stiggins, Judith Arter, Jan Chappuis, and Stephen Chappuis (2006) described formative assessment as assessment for learning rather than assessment of learning:

Assessments for learning happen while learning is still underway. These are the assessments that we conduct throughout teaching and learning to diagnose student needs, plan our next steps in instruction, provide students with feedback they can use to improve the quality of their work, and help students see and feel in control of their journey to success…. This is not about accountability—those are assessments of learning. This is about getting better. (p. 31)

Susan Brookhart and Anthony Nitko (2007) explained that “formative assessment is a loop: Students and teachers focus on a learning target, evaluate current student work against the target, act to move the work closer to the target, and repeat” (p. 116).

Along with these general descriptions, specifics regarding the practice of formative assessment have been offered. Unfortunately, there is no clear pattern of agreement regarding the specifics. For example, some advocates stress that formative assessments should not be recorded, whereas others believe they should. Some assert that formative assessments should not be considered when designing grades, whereas others see a place for them in determining a student’s true final status (see O’Connor, 2002; Welsh & D’Agostino, 2009; Marzano, 2006). To a great extent, the purpose of this book is to articulate a well-crafted set of specifics regarding the practice of formative assessment.

Learning Progressions and Clear Goals

The development of learning progressions has become a prominent focus in the field of formative assessment. Margaret Heritage (2008) explained the link between learning progressions and formative assessment as follows:

The purpose of formative assessment is to provide feedback to teachers and students during the course of learning about the gap between students’ current and desired performance so that action can be taken to close the gap. To do this effectively, teachers need to have in mind a continuum of how learning develops in any particular knowledge domain so that they are able to locate students’ current learning status and decide on pedagogical action to move students’ learning forward. Learning progressions that clearly articulate a progression of learning in a domain can provide the big picture of what is to be learned, support instructional planning, and act as a touchstone for formative assessment. (p. 2)

One might think that learning progressions have already been articulated within the many state and national standards documents. This is not the case. Again, Heritage noted:

Yet despite a plethora of standards and curricula, many teachers are unclear about how learning progresses in specific domains. This is an undesirable situation for teaching and learning, and one that particularly affects teachers’ ability to engage in formative assessment. (p. 2)

The reason state and national standards are not good proxies for learning progressions is that they were not designed with learning progressions in mind. To illustrate, consider the following standard for grade 3 mathematics from the state of Washington (Washington Office of Superintendent of Public Instruction, 2008):

Students will be able to round whole numbers through 10,000 to the nearest ten, hundred, and thousand. (p. 33)

This sample provides a fairly clear target of what students should know by grade 3, but it does not provide any guidance regarding the building blocks necessary to attain that goal. In contrast, Joan Herman and Kilchan Choi (2008, p. 7) provided a detailed picture of the nature of a learning progression relative to the concept of buoyancy. They identified the following levels (from highest to lowest) of understanding regarding the concept:

• Student knows that floating depends on having less density than the medium.

• Student knows that floating depends on having a small density.

• Student knows that floating depends on having a small mass and a large volume.

• Student knows that floating depends on having a small mass, or that floating depends on having a large volume.

• Student thinks that floating depends on having a small size, heft, or amount, or that it depends on being made out of a particular material.

• Student thinks that floating depends on being flat, hollow, filled with air, or having holes.

Obviously, with a well-articulated sequence of knowledge and skills like this, it is much easier to provide students with feedback as to their current status regarding a specific learning goal and what they must do to progress.

While one might characterize the work on learning progressions as relatively new and therefore relatively untested, it is related to a well-established and heavily researched area of curriculum design—learning goals. One might think of learning progressions as a series of related learning goals that culminate in the attainment of a more complex learning goal. Learning progressions can also be used to track student progress. The research on learning goals is quite extensive. Some of the more prominent studies are reported in table 1.5 (page 12).

Table 1.5 Research Results for Establishing Learning Goals


a Two effect sizes are listed because of the manner in which effect sizes were reported. Readers should consult the study for more details.

b As reported in Hattie (2009).

c Both Tubbs (1986) and Locke and Latham (1990) report results from organizational as well as educational settings.

d As reported in Locke and Latham (2002).

e The review includes a wide variety of ways and contexts in which goals might be used.

Careful scrutiny of the studies reported in table 1.5 yields a number of useful generalizations about learning goals and, by extrapolation, about learning progressions. First, setting goals appears to have a notable effect on student achievement in its own right. This is evidenced by the substantial ESs reported in table 1.5 for the general effects of goal setting. For example, Kevin Wise and James Okey (1983) reported an ES of 1.37, Mark Lipsey and David Wilson (1993) reported an ES of 0.55, and Herbert Walberg (1999) reported an ES of 0.40. Second, specific goals have more of an impact than do general goals. Witness Mark Tubbs’s (1986) ES of 0.50 associated with setting specific goals as opposed to general goals. Edwin Locke and Gary Latham (1990) reported ESs that range from 0.42 to 0.82 regarding specific versus general goals, and Steve Graham and Dolores Perin (2007) reported an ES of 0.70 (for translations of ESs into percentile gains, see appendix B). Third, goals must be at the right level of difficulty for maximum effect on student achievement. This is evidenced in the findings reported by Tubbs (1986), Anthony Mento, Robert Steel, and Ronald Karren (1987), Locke and Latham (1990), Kluger and DeNisi (1996), and Matthew Burns (2004). Specifically, goals must be challenging enough to interest students but not so difficult as to frustrate them (for a detailed discussion of learning goals, see Marzano, 2009).

The Imprecision of Assessments

One fact that must be kept in mind in any discussion of assessment—formative or otherwise—is that all assessments are imprecise to one degree or another. This is explicit in a fundamental equation of classical test theory that can be represented as follows:

Observed score = true score + error score

Marzano (2006) explained:

This equation indicates that a student’s observed score on an assessment (the final score assigned by the teacher) consists of two components—the student’s true score and the student’s error score. The student’s true score is that which represents the student’s true level of understanding or skill regarding the topic being measured. The error score is the part of an observed score that is due to factors other than the student’s level of understanding or skill. (pp. 36–37)

In technical terms, every score assigned to a student on every assessment probably contains some part that is error. To illustrate the consequences of error in the interpretation of assessment scores, consider table 1.6.

Table 1.6 Ranges of Possible “True Scores” for Differing Levels of Reliability


Note: 95% confidence interval based on the assumption of a standard deviation of 12 points.

Table 1.6 shows what can be expected in terms of the amount of error that surrounds a score of 70 when an assessment has reliabilities that range from 0.85 to 0.45. In all cases, the student is assumed to have received a score of 70 on the assessment. That is, the student’s observed score is 70.

First, let us consider the precision of an observed score of 70 when the reliability of the assessment is 0.85. This is the typical reliability one would expect from a standardized test or a state test (Lou et al., 1996). Using statistical formulas, it is possible to compute a range of scores in which you are 95 percent sure the true score actually falls. Columns three, four, and five of table 1.6 report that range. In the first row of table 1.6, we see that for an assessment with a reliability of 0.85 and an observed score of 70, one would be 95 percent sure the student’s true score is anywhere between a score of 60 and 80. That is, the student’s true score might really be as low as 60 or as high as 80 even though he or she receives a score of 70. This is a range of 20 points. But this assumes the reliability of the assessment to be 0.85, which, again, is what you would expect from a state test or a standardized test.

Next, let us consider the range with classroom assessments. To do so, consider the second row of table 1.6, which pertains to a reliability of 0.75. This is probably the highest reliability you could expect from an assessment designed by a teacher, school, or district (see Lou et al., 1996). Now the low score is 58 and the high score is 82—a range of 24 points. To appreciate the full impact of the information presented in table 1.6, consider the last row, which depicts the range of possible true scores when the reliability is 0.45. This reliability is, in fact, probably more typical of what you could expect from a teacher-designed classroom assessment (Marzano, 2002). The lowest possible true score is 52 and the highest possible true score is 88—a range of 36 points.
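The statistical formulas referred to above are, in classical test theory, the standard error of measurement (SEM) and a 95 percent confidence interval around the observed score: because reliability is the proportion of observed-score variance attributable to true scores, the SEM equals the standard deviation multiplied by the square root of one minus the reliability. The Python sketch below is our illustration, assuming the conditions stated in the note to table 1.6 (an observed score of 70 and a standard deviation of 12); it closely approximates the ranges discussed here, which the table rounds outward to whole points.

import math

def true_score_interval(observed, sd, reliability, z=1.96):
    # 95% confidence band for the true score: observed +/- z * SEM,
    # where SEM = sd * sqrt(1 - reliability).
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

for r in (0.85, 0.75, 0.45):
    low, high = true_score_interval(70, 12, r)
    print(f"reliability {r:.2f}: about {low:.0f} to {high:.0f}")
# reliability 0.85: about 61 to 79  (reported as 60 to 80)
# reliability 0.75: about 58 to 82  (reported as 58 to 82)
# reliability 0.45: about 53 to 87  (reported as 52 to 88)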

Quite obviously, no single assessment can ever be relied on as an absolute indicator of a student’s status. Gregory Cizek (2007) added a perspective on the precision of assessments in his discussion on the mathematics section of the state test in a large midwestern state. He explained that the total score reliability for the mathematics portion of the test in that state at the fourth grade is 0.87—certainly an acceptable level of reliability. That test also reports students’ scores in subareas using the National Council of Teachers of Mathematics categories: algebra, data analysis and probability, estimation and mental computation, geometry, and problem-solving strategies. Unfortunately, the reliability of these subscale scores ranges from 0.33 to 0.57 (p. 103). As evidenced by table 1.6, reliabilities this low would translate into a wide range of possible true scores.

Imprecision in assessments can come in many forms. It can be a function of poorly constructed items on a test, or it can come from students’ lack of attention or effort when taking a test. Imprecision can also be a function of teachers’ interpretations of assessments. A study done by Herman and Choi (2008) asked two questions: How accurate are teachers’ judgments of student learning, and how does accuracy of teachers’ judgments relate to student performance? They found that “the study results show that the more accurate teachers are in their knowledge of where students are, the more effective they may be in promoting subsequent subject learning” (p. 18). Unfortunately, they also found that “average accuracy was less than 50%” (p. 19). Margaret Heritage, Jinok Kim, Terry Vendlinski, and Joan Herman (2008) added that “inaccurate analyses or inappropriate inference about students’ learning status can lead to errors in what the next instructional steps will be” (p. 1). They concluded that “using assessment information to plan subsequent instruction tends to be the most difficult task for teachers as compared to other tasks (for example, assessing student responses)” (p. 14).

One very important consideration when interpreting scores from assessments or making inferences about a student based on an assessment is the native language of the student. Christy Kim Boscardin, Barbara Jones, Claire Nishimura, Shannon Madsen, and Jae-Eun Park (2008) conducted a review of performance assessments administered in high school biology courses. They focused their review on English language learners, noting that “the language demand of content assessments may introduce construct-irrelevant components into the testing process for EL students” (p. 3). Specifically, they found that the students with a stronger grasp of the English language would perform better on the tests even though they might not have had any better understanding of the science content. The same concept holds true for standardized tests in content-specific areas. They noted that “the language demand of a content assessment is a potential threat to the validity of the assessment” (p. 3).

Grading

At the classroom level, any discussion of assessment ultimately ends up in a discussion of grading. As its title indicates, Formative Assessment and Standards-Based Grading focuses on grading as well as on formative assessment. Not only are teachers responsible for evaluating a student’s level of knowledge or skill at one point in time through classroom assessments, they are also responsible for translating all of the information from assessments into an overall evaluation of a student’s performance over some fixed period of time (usually a quarter, trimester, or semester). This evaluation takes the form of a single grade commonly referred to as an “omnibus grade.” Unfortunately, grades add a whole new layer of error to the assessment process.

Brookhart (2004) discussed the difficulties associated with grading:

Grades have been used to serve three general purposes simultaneously: ranking (for sorting students into those eligible for higher education and those not eligible); reporting results (accounting to parents the degree to which students learned the lessons prescribed for them); and contributing to learning (providing feedback and motivating students). (p. 23)

While all three purposes are valid, they provide very different perspectives on student achievement.

Since the teachers in many schools and districts have not agreed on any one grading philosophy, they are forced to design their own systems. To illustrate, consider the following grading criteria Thomas Guskey (2009, p. 17) listed as elements frequently included in teachers’ grading practices:

• Major exams or compositions

• Class quizzes

• Reports or projects

• Student portfolios

• Exhibits of students’ work

• Laboratory projects

• Students’ notebooks or journals

• Classroom observations

• Oral presentations

• Homework completion

• Homework quality

• Class participation

• Work habits and neatness

• Effort

• Attendance

• Punctuality of assignments

• Class behavior or attitude

• Progress made

He made the point that because of their different philosophies, different teachers rely on different combinations of these elements to construct an overall grade. For example, one teacher might include major exams, quizzes, class participation, and punctuality of assignments in his or her grading policy while another teacher teaching exactly the same course might include major exams, reports, effort, and attendance. Consequently, the grading schemes for the same course taught by two different teachers might be so different that grades are not comparable from teacher to teacher. In effect, individual teachers’ grades are interpretable only in the context of the grading scheme constructed by that specific teacher.

Norm-Referenced Grading

One approach to grading that has been used over the years is to report how students are performing in relation to one another. This might be called norm-referenced grading. In his doctoral dissertation, Kenneth Haponstall (2009) pointed out that this might have been the impetus for what was referred to as the “grading system” in the mid-nineteenth century, whereby students were grouped by level of knowledge and skill as well as by age so that teachers might provide more focused instruction to these homogeneous groups. He explained that James Baldwin (1884, in Haponstall, 2009) saw problems with the system even then, pointing out that no standard criteria about how students were “graded” or by whom had been established. The decision was subjective and left to anyone from the superintendent to the school secretary or a member of the board of education.

While few, if any, grading schemes currently in place use a strict norm-referenced approach, vestiges of it can be found in the practices of class rankings and grading on a curve.

Class Rankings

Class rankings are related to the concept of norm-referenced grading. Haponstall pointed out that “with districts using differing measures, including grade weighting for advanced placement classes, grade improvement for special education classes, [and] credit recovery for failed courses, there seems to be no standard method for schools to demonstrate those students who are showing academic excellence” (p. 22).

Lawrence Cross and Robert Frary (1999) noted that even though “grading is a hodgepodge of attitude, effort, and achievement at the middle and high school levels, colleges not only accept the grade point and class ranking in determining enrollment, but many are starting to use these measures exclusively” (as cited in Haponstall, 2009, p. 22). The admissions policies of many colleges exacerbate the practice of class ranking. To illustrate, David Lang (2007) pointed out that states such as California, Florida, and Texas guarantee a certain top percentage of each graduating class admission to a state school. This renders class ranks a high-stakes endeavor, particularly for those ranked too low for a guarantee of admission to a state school (as cited in Haponstall, 2009).

Grading on the Curve

When a teacher grades on the curve, he or she gives the highest grade to the student who performed best on an assessment and then gives every other student a grade by ranking his or her performance accordingly. This system essentially grades students in relation to one another. Thus, it has a basis in norm-referencing. Proponents of grading on the curve maintain that it is fair and equitable because most classes will have a normal distribution of achievement scores in any given subject area (for a discussion, see Brookhart, 2004).

Thomas Guskey (2009), however, maintained that “grading ‘on the curve’ communicates nothing about what students have learned or are able to do” (p. 11). Instead of telling teachers what a student has learned, it simply reports how much or how little he or she learned in relation to his or her fellow students. He also pointed to research by Benjamin Bloom (Bloom, 1976; Bloom, Madaus, & Hastings, 1981) indicating that student achievement does not necessarily follow a normal distribution when teachers exhibit a high level of instructional acumen. Grading students only in relation to one another, therefore, may provide information about a student’s rank in class, but it does not speak to the student’s academic achievement.

Self-Referenced Grading

Self-referenced grading compares a student’s performance to his or her own past performances. Proponents say that it may reduce competition in classrooms and serve to motivate students (for a discussion, see Brookhart & Nitko, 2007). At first glance, this kind of grading seems to make intuitive sense: the reference point for each student is his or her personal growth and the extent of active engagement in his or her own learning. But Brookhart and Nitko pointed out that this form of grading tends to be used primarily with low-ability students, and while heavily weighting factors such as effort, behavior, attitude, and participation might seem positive, this emphasis is one of the major criticisms of this form of grading. Mixing nonacademic competencies with academic competencies contaminates the meaning of a grade.

Standards-Based Grading

Grading that references student achievement to specific topics within each subject area is growing in popularity. This is called standards-based grading, and many consider it the most appropriate method of grading (for a discussion, see Brookhart & Nitko, 2007, p. 219). Alongside the interest in this system, however, there is quite a bit of poor practice and considerable confusion about its defining characteristics.

As described in Marzano (2006), the origins of standards-based reporting can be traced to the concept of a performance standard. The term was popularized in a 1993 report commonly referred to as the Malcom Report in deference to Shirley M. Malcom, chair of the planning group. The report defined a “performance standard” as “how good is good enough” (National Education Goals Panel, 1993, pp. ii–iii). Since then, a popular practice has been to define student performance in terms of four categories: advanced, proficient, basic, and below basic. The scheme has its roots in the work of the National Assessment of Educational Progress. As Popham (2003) noted:

Increasingly, U.S. educators are building performance standards along the lines of the descriptive categories used in the National Assessment of Educational Progress (NAEP), a test administered periodically under the auspices of the federal government. NAEP results permit students’ performances in participating states to be compared…. Since 1990, NAEP results have been described in four performance categories: advanced, proficient, basic, and below basic. Most of the 50 states now use those four categories or labels quite similar to them. (p. 39)

The actual practice of standards-based reporting requires the identification of what we have referred to as reporting topics or measurement topics (Marzano, 2006; Marzano & Haystead, 2008). For example, consider the following common measurement topics for language arts at the fourth grade:

Reading

Word recognition and vocabulary

Reading comprehension

Literary analysis

Writing

Spelling

Language mechanics and conventions

Research and technology

Evaluation and revision

Listening and Speaking

Listening comprehension

Analysis and evaluation of oral media

Speaking applications

Here, ten measurement topics are organized under three categories (or strands, as some districts call them): reading, writing, and listening and speaking. For reporting purposes, each student would receive a score of advanced, proficient, basic, or below basic on each of the ten measurement topics. Typically, some type of rubric or scale that describes these levels is constructed for each measurement topic (we discuss this in depth in chapters 3, 5, and 6).
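In data terms, a standards-based report for one student is essentially a mapping from each measurement topic, grouped by strand, to one of the four performance levels. The sketch below is purely a hypothetical illustration using the fourth-grade language arts topics listed above; the topic names and the four levels come from the text, while the levels assigned to this imaginary student are invented for the example.

# Hypothetical record for one student. Topic names and the
# advanced / proficient / basic / below basic scheme come from the text;
# the assigned levels are invented for illustration.
report_card = {
    "Reading": {
        "Word recognition and vocabulary": "proficient",
        "Reading comprehension": "advanced",
        "Literary analysis": "basic",
    },
    "Writing": {
        "Spelling": "proficient",
        "Language mechanics and conventions": "proficient",
        "Research and technology": "basic",
        "Evaluation and revision": "below basic",
    },
    "Listening and Speaking": {
        "Listening comprehension": "advanced",
        "Analysis and evaluation of oral media": "proficient",
        "Speaking applications": "proficient",
    },
}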

While this system seems like good practice, standards-based reporting can be highly inaccurate if teachers are not given guidance and support on how to collect and interpret the assessment data from which scores like advanced, proficient, basic, and below basic are assigned. Indeed, as of the writing of this book, no major study (that we are aware of) has demonstrated that simply grading in a standards-based manner enhances student achievement. However, as the previous discussion illustrates, a fairly strong case can be made that student achievement will be positively affected if standards-based reporting is rooted in a clear-cut system of formative assessments.

Another problem that plagues standards-based reporting is the lack of distinction between standards-referenced systems and standards-based systems. Grant Wiggins (1993, 1996) was perhaps the first modern-day educator to highlight the differences between a standards-based system and a standards-referenced system. In a standards-based system, a student does not move to the next level until he or she can demonstrate competence at the current level. In a standards-referenced system, a student’s status is reported (or referenced) relative to the performance standard for each area of knowledge and skill on the report card; however, even if the student does not meet the performance standard for each topic, he or she moves to the next level. Thus, the vast majority of schools and districts that claim to have standards-based systems in fact have standards-referenced systems. As we shall see in chapter 6, both systems are viable, but they are quite different in their underlying philosophies. Understanding the distinctions between standards-based and standards-referenced systems helps schools and districts design a grading system that meets their needs.

Translating Research Into Classroom Practice

In subsequent chapters, we draw from the research and theory in this chapter and from sources such as Classroom Assessment and Grading That Work (Marzano, 2006) and Designing and Teaching Learning Goals and Objectives (Marzano, 2009) to discuss how formative assessment can be effectively implemented in the classroom. We also outline a system of grading that, when used uniformly and consistently, can yield much more valid and reliable information than that provided by traditional grading systems.

As mentioned in the introduction, as you progress through the remaining chapters, you will encounter exercises that ask you to examine the content presented. Some of these exercises ask you to answer specific questions. Answer these questions and check your answers with those provided in the back of the book. Other exercises are more open-ended and ask you to generate applications of what you have read.
