Similar Documents
20 similar documents retrieved.
1.
There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large‐scale writing assessments. This study compared the effectiveness of two widely used rater training methods—self‐paced and collaborative frame‐of‐reference training—in the context of a large‐scale writing assessment. Sixty‐six raters were randomly assigned to the training methods. After training, all raters scored the same 50 representative essays prescored by a group of expert raters. A series of generalized linear mixed models were then fitted to the rating data. Results suggested that the self‐paced method was equivalent in effectiveness to the more time‐intensive and expensive collaborative method. Implications for large‐scale writing assessments and suggestions for further research are discussed.
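The mixed-model comparison described above can be sketched in a few lines. The following is a hypothetical illustration only, not the study's code: it fits a simpler linear mixed model with a fixed effect for training method and crossed random effects for raters and essays using statsmodels, and the data file and column names (score, method, rater, essay) are assumptions. A full generalized linear mixed model for ordinal rubric scores would need a different estimator.

# Hypothetical sketch (assumed data layout: one row per rater-by-essay score).
import pandas as pd
import statsmodels.formula.api as smf

ratings = pd.read_csv("ratings.csv")   # assumed columns: score, method, rater, essay
ratings["all"] = 1                     # single dummy group so rater and essay enter as crossed variance components

model = smf.mixedlm(
    "score ~ C(method)",               # fixed effect: self-paced vs. collaborative training
    ratings,
    groups="all",
    vc_formula={"rater": "0 + C(rater)",   # random rater effect
                "essay": "0 + C(essay)"},  # random essay effect
)
result = model.fit()
print(result.summary())                # the C(method) term estimates the training-method difference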

2.
Recent work in reading and writing theory, research and pedagogy has raised questions about relationships between fluent reading processes and holistic scoring of essays (e.g., Huot, 1993). In holistic scoring settings, are the raters behaving as normal fluent readers (i.e., readers interacting critically and personally with the text), or are they somehow disconnected from their normal reader responses because they are using reliable scoring guides? Related questions concern the behavior of such holistic raters when they are teachers (e.g., Barritt, Stock, & Clark, 1986), and when those teachers respond to student writing (Connors & Lunsford, 1993). How are teachers/raters behaving, and what are they responding to in judging the writing? Previous research has suggested a role for personality type in the study of the process of writing evaluation (Jensen & DiTiberio, 1984, 1989). Thus, it seems reasonable to ask what role personality types play in the holistic evaluation of writing. This empirical study addressed the general question: What role, if any, do personality types of writers and of raters play in the holistic rating of writing? Moreover, is there a relationship between writers' personalities and raters' personalities? Writers were native English-speaking university freshman composition students; raters were native English-speaking university freshman composition instructors. Results indicate that the personality types of writers affect the ratings their essays receive, and the personality types of raters affect the ratings they give to essays. However, there is no significant relationship between writers' styles and raters' styles. Implications for future research, as well as classroom implications of these results, are discussed.

3.
Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians' patient management skills; one system uses regression-derived weights for components of the performance, the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances.
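The first of the two automated systems above uses regression-derived weights for performance components. As a minimal sketch of that general idea (not the study's actual system; the features and data are invented), component scores can be regressed onto expert ratings and the fitted weights reused to score new performances:

# Hypothetical sketch: derive component weights by regressing onto expert ratings.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
components = rng.normal(size=(200, 5))        # invented: 200 performances, 5 scored components
expert_rating = components @ np.array([0.8, 0.5, 0.3, 0.2, 0.1]) + rng.normal(scale=0.5, size=200)

weights_model = LinearRegression().fit(components, expert_rating)
print("regression-derived weights:", weights_model.coef_)

new_performance = rng.normal(size=(1, 5))     # score an unseen performance with the fitted weights
print("automated score:", weights_model.predict(new_performance))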

4.
Observational assessment is used to study program and teacher effectiveness across large numbers of classrooms, but training a workforce of raters that can assign reliable scores when observations are used in large-scale contexts can be challenging and expensive. Limited data are available to speak to the feasibility of training large numbers of raters to calibrate to an observation tool, or the characteristics of raters associated with calibration. This study reports on the success of rater calibration across 2093 raters trained by the Office of Head Start (OHS) in 2008–2009 on the Classroom Assessment Scoring System (CLASS), and for a subsample of 704 raters, characteristics that predict their calibration. Findings indicate that it is possible to train large numbers of raters to calibrate to an observation tool, and rater beliefs about teachers and children predicted the degree of calibration. Implications for large-scale observational assessments are discussed.

5.
By far, the most frequently used method of validating (the interpretation and use of) automated essay scores has been to compare them with scores awarded by human raters. Although this practice is questionable, human-machine agreement is still often regarded as the “gold standard.” Our objective was to refine this model and apply it to data from a major testing program and one system of automated essay scoring. The refinement capitalizes on the fact that essay raters differ in numerous ways (e.g., training and experience), any of which may affect the quality of ratings. We found that automated scores exhibited different correlations with scores awarded by experienced raters (a more compelling criterion) than with those awarded by untrained raters (a less compelling criterion). The results suggest potential for a refined machine-human agreement model that differentiates raters with respect to experience, expertise, and possibly even more salient characteristics.

6.
The engagement of teachers as raters to score constructed response items on assessments of student learning is widely claimed to be a valuable vehicle for professional development. This paper examines the evidence behind those claims from several sources, including research and reports over the past two decades, information from a dozen state educational agencies regarding past and ongoing involvement of teachers in scoring‐related activities as of 2001, and interviews with educators who served a decade or more ago for one state's innovative performance assessment program. That evidence reveals that the impact of scoring experience on teachers is more provisional and nuanced than has been suggested. The author identifies possible issues and implications associated with attempts to distill meaningful skills and knowledge from hand‐scoring training and practice, along with other forms of teacher involvement in assessment development and implementation. The paper concludes with a series of research questions that—based on current and proposed practice for the coming decade—seem to the author to require the most immediate attention.

7.
This study explored the value of using a guided rubric to enable students participating in a massive open online course in writing to produce more reliable assessments of their fellow students’ writing. To test the assumption that training students to assess will improve their ability to provide quality feedback, a multivariate factorial analysis was used to determine differences in assessments made by students who received guidance on using a rating rubric and those who did not. Although results were mixed, on average students who were provided no guidance in scoring writing samples were less likely to successfully differentiate between novice, intermediate, and advanced writing samples than students who received rubric guidance. Rubric guidance was most beneficial for items that were subjective, technically complex, and likely to be unfamiliar to the student. Items addressing relatively simple and objective constructs were less likely to be improved by rubric guidance.

8.
This study examined the effectiveness of rater training for prospective PETS-1 oral examiners. Many-facet Rasch analysis was used to compare the examiners' scoring before and after training. Results showed that after training, the proportion of examiner scores in exact agreement with expert scores increased, examiners who had been scoring too severely made appropriate adjustments toward the scoring criteria, and all examiners' fit statistics fell within the acceptable range. Overall, the rater training was reasonably effective and improved scoring accuracy. Many-facet Rasch analysis helps identify examiners who score too leniently or too severely, examiners with poor fit, and anomalous scoring behavior, providing a reliable basis for targeted training.

9.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

10.
The purpose of this study was to examine the quality assurance issues of a national English writing assessment in Chinese higher education. Specifically, using generalizability theory and rater interviews, this study examined how the current scoring policy of the TEM-4 (Test for English Majors – Band 4, a high-stakes national standardized EFL assessment in China) writing could impact its score variability and reliability. Eighteen argumentative essays written by nine English major undergraduate students were selected as the writing samples. Ten TEM-4 raters were first invited to use the authentic TEM-4 writing scoring rubric to score these essays holistically and analytically (with time intervals in between). They were then interviewed for their views on how the current scoring policy of the TEM-4 writing assessment could affect its overall quality. The quantitative generalizability theory results of this study suggested that the current scoring policy would not yield acceptable reliability coefficients. The qualitative results supported the generalizability theory findings. Policy implications for quality improvement of the TEM-4 writing assessment in China are discussed.

11.
Martin. Assessing Writing, 2009, 14(2): 88–115
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.

12.
Vahid Aryadoust. Educational Psychology, 2016, 36(10): 1742–1770
This study sought to examine the development of paragraph writing skills of 116 English as a second language university students over the course of 12 weeks and the relationship between the linguistic features of students’ written texts as measured by Coh-Metrix – a computational system for estimating textual features such as cohesion and coherence – and the scores assigned by human raters. The raters’ reliability was investigated using many-facet Rasch measurement (MFRM); the growth of students’ paragraph writing skills was explored using a factor-of-curves latent growth model (LGM); and the relationships between changes in linguistic features and writing scores across time were examined by path modelling. MFRM analysis indicates that despite several misfits, students’ and raters’ performances and the scale’s functionality conformed to the expectations of MFRM, thus providing evidence of psychometric validity for the assessments. LGM shows that students’ paragraph writing skills develop steadily during the course. The Coh-Metrix indices have more predictive power before and after the course than during it, suggesting that Coh-Metrix may struggle to discriminate between some ability levels. Whether a Coh-Metrix index gains or loses predictive power over time is argued to be partly a function of whether raters maintain or lose sensitivity to the linguistic feature measured by that index in their own assessment as the course progresses.

13.
This study examined the use of generalizability theory to evaluate the quality of an alternative assessment (journal writing) in mathematics. Twenty-nine junior college students wrote journal tasks on the given topics and two raters marked the tasks using a scoring rubric, constituting a two-facet G-study design in which students were crossed with tasks and raters. The G coefficient was .76 and index of dependability was .72. The results showed that increasing the number of tasks had a larger effect on the G coefficient and index of dependability than increasing the number of raters. Implications for educational practices are discussed.
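The task-versus-rater trade-off reported above follows directly from the D-study equations for a fully crossed persons × tasks × raters design. The sketch below uses invented variance components (not those of the study) to show how the G coefficient and index of dependability respond to adding tasks or raters:

# Hypothetical D-study sketch; the variance components below are invented for illustration.
vc = {"p": 0.50, "t": 0.05, "r": 0.02, "pt": 0.30, "pr": 0.05, "tr": 0.01, "ptr": 0.25}

def g_and_phi(n_t, n_r, vc):
    """Return (G coefficient, index of dependability) for n_t tasks and n_r raters."""
    rel_err = vc["pt"] / n_t + vc["pr"] / n_r + vc["ptr"] / (n_t * n_r)
    abs_err = rel_err + vc["t"] / n_t + vc["r"] / n_r + vc["tr"] / (n_t * n_r)
    return vc["p"] / (vc["p"] + rel_err), vc["p"] / (vc["p"] + abs_err)

for n_t, n_r in [(2, 2), (4, 2), (2, 4)]:
    g, phi = g_and_phi(n_t, n_r, vc)
    print(f"tasks={n_t}, raters={n_r}: G={g:.2f}, Phi={phi:.2f}")
# When the person-by-task component is large, doubling tasks raises both indices more than doubling raters.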

14.
Applied Measurement in Education, 2013, 26(3): 281–299
The growing use of computers for test delivery, along with increased interest in performance assessments, has motivated test developers to develop automated systems for scoring complex constructed-response assessment formats. In this article, we add to the available information describing the performance of such automated scoring systems by reporting on generalizability analyses of expert ratings and computer-produced scores for a computer-delivered performance assessment of physicians' patient management skills. Two different automated scoring systems were examined. These automated systems produced scores that were approximately as generalizable as those produced by expert raters. Additional analyses also suggested that the traits assessed by the expert raters and the automated scoring systems were highly related (i.e., true correlations between test forms, across scoring methods, were approximately 1.0). In the appendix, we discuss methods for estimating this correlation, using ratings and scores produced by an automated system from a single test form.

15.
This study investigates how experienced and inexperienced raters score essays written by ESL students on two different prompts. The quantitative analysis using multi-faceted Rasch measurement, which provides measurements of rater severity and consistency, showed that the inexperienced raters were more severe than the experienced raters on one prompt but not on the other prompt, and that differences between the two groups of raters were eliminated following rater training. The qualitative analysis, which consisted of analysis of raters' think-aloud protocols while scoring essays, provided insights into reasons for these differences. Differences were related to the ease with which the scoring rubric could be applied to the two prompts and to differences in how the two groups of raters perceived the appropriateness of the prompts.
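The rater severity and consistency measures referred to above come from many-facet Rasch measurement. In its common rating-scale formulation (standard notation, not reproduced from this study), the log-odds that rater j awards writer n category k rather than k-1 on prompt i is

\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the writer's ability, \delta_i the prompt's difficulty, \alpha_j the rater's severity, and \tau_k the threshold between adjacent score categories. A larger \alpha_j indicates a more severe rater, and rater fit statistics index how consistently the scale is applied.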

16.
Rater training is an important part of developing and conducting large‐scale constructed‐response assessments. As part of this process, candidate raters have to pass a certification test to confirm that they are able to score consistently and accurately before they begin scoring operationally. Moreover, many assessment programs require raters to pass a calibration test before every scoring shift. To support the high‐stakes decisions made on the basis of rater certification tests, a psychometric approach for their development, analysis, and use is proposed. The circumstances and uses of these tests suggest that they are expected to have relatively low reliability. This expectation is supported by empirical data. Implications for the development and use of these tests to ensure their quality are discussed.

17.
Persistent gaps between optimistic state and pessimistic national academic performance assessment results are increasingly leading to calls for unified national standards in the US. Critics argue that these gaps reveal vast differences in how proficiency is conceptualized; however, little is known about how conceptualizations compare among large-scale US assessments. To explore this issue, the present study investigated constructs of writing proficiency implicated in 41 US state and national high school direct writing assessments by analyzing the relationships between prompt-genre demands and assessment scoring criteria. Results of this analysis suggest that national writing assessments differ as a group from state assessments in the extent to which they emphasize genre distinctions and present coherent conceptualizations of writing proficiency. The implications of these assessment variations for college preparedness are discussed.

18.
This study investigated the interrater reliability of teachers' and school psychology externs' scoring of protocols for the Developmental Test of Visual-Motor Integration (VMI). Previous studies suggest that the scoring criteria of the VMI are ambiguous, which, when coupled with raters' lack of scoring experience and limited knowledge of testing issues, contributes to low rater reliability. The original manual scoring system was used by four trained teachers with no VMI experience and by four experienced raters. A VMI scoring system, revised to eliminate ambiguous scoring criteria, was used by an additional four teachers inexperienced with the VMI and by four experienced raters. High reliability coefficients (>.90) were found for all raters, regardless of the scoring system employed. The influence on interrater reliability of factors such as training, nature of the training setting, characteristics of the raters, and ambiguity of scoring criteria is discussed.
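As a minimal illustration of the interrater reliability coefficients reported above (not the study's analysis; the data are invented), a two-way intraclass correlation, ICC(2,1), can be computed for a protocols-by-raters score matrix:

# Hypothetical sketch: ICC(2,1) (two-way random effects, single rater) for an
# n-protocols x k-raters matrix of VMI-style scores; the data are simulated.
import numpy as np

def icc_2_1(scores):
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()    # between-protocol sum of squares
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()    # between-rater sum of squares
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols  # residual sum of squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(1)
true_scores = rng.integers(5, 25, size=40)                            # invented protocol scores
ratings = true_scores[:, None] + rng.normal(scale=1.0, size=(40, 4))  # four raters with small random error
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")                           # values above .90 indicate high agreement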

19.
In generalizability theory studies in large-scale testing contexts, sometimes a facet is very sparsely crossed with the object of measurement. For example, when assessments are scored by human raters, it may not be practical to have every rater score all students. Sometimes the scoring is systematically designed such that the raters are consistently grouped throughout the scoring, so that the data can be analyzed as raters nested within teams. Other times, rater pairs are randomly assigned for each student, such that each rater is paired with many other raters at different times. One possibility for this scenario is to treat the data as if raters were nested within students. Because the raters are not truly independent across all students, the resulting variance components could be somewhat biased. This study illustrates how the bias will tend to be small in large-scale studies.
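For reference, in standard generalizability-theory notation (not the study's own derivation): when each student's pair of raters is treated as nested within persons (r:p), the rater main effect is confounded with the person-by-rater interaction, and the generalizability coefficient for n'_r raters per student reduces to

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{r:p,e}/n'_r}

whereas a fully crossed p × r design estimates \sigma^2_r and \sigma^2_{pr} separately. Because the same raters in fact reappear across many students, the independence assumed by the nested model holds only approximately, which is the potential bias in the variance components that the study quantifies.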

20.
Research indicates that instructional aspects of teacher performance are the most difficult to reach consensus on, significantly limiting teacher observation as a way to systematically improve instructional practice. Understanding the rationales that raters provide as they evaluate teacher performance with an observation protocol offers one way to better understand the training efforts required to improve rater accuracy. The purpose of this study was to examine the accuracy of raters evaluating special education teachers' implementation of evidence-based math instruction. A mixed-methods approach was used to investigate: 1) the consistency of the raters' application of the scoring criteria to evaluate teachers' lessons, 2) raters' accuracy on two lessons compared with the scores given by expert raters, and 3) the raters' understanding and application of the scoring criteria through a think-aloud process. The results show that raters had difficulty understanding some of the high-inference items in the rubric and applying them accurately and consistently across the lessons. Implications for rater training are discussed.
