Related Articles
20 similar articles retrieved.
1.
2.
A method of expanding a rating scale 3-fold without the expense of defining additional benchmarks was studied. The authors used an analytic rubric representing 4 domains of writing and composed of 4-point scales to score 120 writing samples from Georgia's 11th-grade Writing Assessment. The raters augmented the scores of papers on which the proficiency levels appeared slightly higher or lower than the benchmark papers at the selected proficiency level by adding a "+" or a "−" to the score. The results of the study indicate that the use of this method of rating augmentation tends to improve most indices of interrater reliability, although the percentage of exact and adjacent agreement decreases because of the increased number of rating possibilities. In addition, there was evidence to suggest that the use of augmentation may produce domain-level scores with sufficient reliability for use with diagnostic feedback to teachers about the performance of students.
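As a rough illustration of how such augmented ratings might be handled numerically, the sketch below converts scores like "3+" and "3−" to values on a continuous scale before computing agreement indices; the ±0.33 offset, the rater vectors, and the function names are assumptions for illustration, not details taken from the study.

```python
# Illustrative sketch only: the +/- offset of 0.33 is an assumed mapping,
# not the conversion (if any) used in the Georgia study.

def augmented_to_numeric(rating: str, offset: float = 0.33) -> float:
    """Map an augmented rating such as '3+' or '2-' to a numeric score."""
    if rating.endswith("+"):
        return int(rating[:-1]) + offset
    if rating.endswith("-"):
        return int(rating[:-1]) - offset
    return float(rating)

def exact_and_adjacent_agreement(scores_a, scores_b, step: float = 0.34):
    """Proportion of papers with identical scores, and with scores within one augmented step."""
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) <= step for a, b in zip(scores_a, scores_b)) / n
    return exact, adjacent

rater_a = [augmented_to_numeric(r) for r in ["2+", "3", "3-", "4"]]
rater_b = [augmented_to_numeric(r) for r in ["3-", "3", "3", "4-"]]
print(exact_and_adjacent_agreement(rater_a, rater_b))
```

With three times as many score points available, exact agreement naturally falls even when raters' judgments are close, which is consistent with the trade-off the abstract describes.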

3.
4.
5.
We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach desirable reliability levels of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks each in the narrative and expository genres, and their written compositions were evaluated with widely used evaluation methods for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students' scores varied largely by task (30.44% and 28.61% of variance) but not by rater. To reach a reliability of .90, multiple tasks and raters were needed; for a reliability of .80, a single rater and multiple tasks were needed. These findings offer important implications for reliably evaluating children's writing skills, given that writing is typically evaluated with a single task and a single rater in classrooms and even in some state accountability systems.
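For readers unfamiliar with how such projections are made, the decision-study logic behind statements like "multiple tasks and raters are needed to reach .90" can be sketched, in standard G-theory notation for a fully crossed person × task × rater design, as

```latex
E\rho^2 \;=\; \frac{\sigma^2_{p}}
{\sigma^2_{p} \;+\; \dfrac{\sigma^2_{pt}}{n_t} \;+\; \dfrac{\sigma^2_{pr}}{n_r} \;+\; \dfrac{\sigma^2_{ptr,e}}{n_t n_r}}
```

where σ²_p is true-person variance, the remaining terms are interaction/error components, and n_t and n_r are the numbers of tasks and raters averaged over. The symbols are generic; variance estimates like those reported in the abstract (roughly 30% task-related variance) would be plugged into such an expression to find the n_t and n_r needed for a target coefficient.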

6.
Using generalizability (G-) theory and rater interviews as research methods, this study examined the impact of the current scoring system of the CET-4 (College English Test Band 4, a high-stakes national standardized EFL assessment in China) writing on its score variability and reliability. One hundred and twenty CET-4 essays written by 60 non-English major undergraduate students at one Chinese university were scored holistically by 35 experienced CET-4 raters using the authentic CET-4 scoring rubric. Ten purposively selected raters were further interviewed for their views on how the current scoring system could impact its score variability and reliability. The G-theory results indicated that the current single-task and single-rater holistic scoring system would not be able to yield acceptable generalizability and dependability coefficients. The rater interview results supported the quantitative findings. Important implications for the CET-4 writing assessment policy in China are discussed.

7.
Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings has been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration—simulated data and data from a large-scale state-wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality; and we investigate the association between rater training procedures and the manifestation of rater effects in the real data.

8.
To address the shortcomings of the current single-rating ("one-rating") method of assigning essay scores in online marking environments, this article proposes applying a "triple-rating" method to essay scoring. The results show that under the single-rating method, agreement between raters is unsatisfactory and significant differences exist among them. The triple-rating method reduces scoring error to some extent and safeguards marking quality. When implementing the method, however, care must be taken to guard against third raters' inclination to play it safe by converging on the earlier scores, so that the method is applied in a scientifically sound way. Whether it can be deployed for large-scale online essay scoring requires further study.

9.
Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater-mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater-mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
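As a pointer for readers new to MSA (the notation is the standard one, not taken from the article), the core numeric indicator is the scalability coefficient H, defined for a pair of items or raters i and j and for the whole set as

```latex
H_{ij} \;=\; 1 - \frac{F_{ij}}{E_{ij}},
\qquad
H \;=\; 1 - \frac{\sum_{i<j} F_{ij}}{\sum_{i<j} E_{ij}}
```

where F_ij is the observed frequency of Guttman errors (score patterns that violate the expected ordering) and E_ij is the frequency expected under marginal independence; values closer to 1 indicate more scalable, invariantly ordered ratings.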

10.
The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.
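The rater-severity estimates discussed here come from a many-facet Rasch specification, which in its standard rating-scale form (generic notation, not the study's estimates) is

```latex
\ln\!\left(\frac{P_{njk}}{P_{nj(k-1)}}\right) \;=\; \theta_n - \lambda_j - \tau_k
```

where θ_n is writer n's ability, λ_j is the severity of rater j, and τ_k is the threshold for moving from score category k − 1 to k; day-by-day calibration amounts to re-estimating λ_j separately for each rating day.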

11.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.

12.
Applied Measurement in Education, 2013, 26(2): 195–208
The consistency between raters over 3 years of a high-stakes performance assessment was examined in 2 studies that involved students in Grades 3, 5, and 8. The students' performance was evaluated in reading, writing, language usage, mathematics, science, and social studies. The results showed that the groups of raters used in different years differed in severity. Their consistency tended to improve over years, but differences between the rater groups remained. It is shown that these differences could affect students' proficiency classifications, indicating the need to adjust for rater effects during the equating process. The Grade 8 raters generally were found to be more consistent than the Grade 3 and Grade 5 raters. Also, the raters in mathematics generally were the most consistent, those in the language arts areas were the least consistent, and the consistency of raters in science and social studies varied over grade levels.

13.
In this study the relationships between writing instruction and functional composition performance were analyzed. The data were obtained in a national assessment of the language proficiency of students in the third year of Dutch secondary education (around age 15). Multivariate multilevel analysis showed that 10 out of 36 instructional characteristics were related to functional composition performance. The effective instructional characteristics included: instruction and exercises in writing functional texts, writing for a specific purpose, tailoring to a particular audience, global rating of writing products by the teacher, and frequent evaluation of Dutch language proficiency through teacher-made tests and written assignments. No effects were found for the rather popular subskill exercises on idiom, syntax, spelling and punctuation, or for pre-writing activities, text revision, and peer review. Furthermore, the effect of an instructional characteristic often differed from one task to another. Finally, there was little differential effectiveness for different groups of students: if one instructional characteristic was more effective than another, this was generally true, to an equal degree, for boys and girls and for promoted and non-promoted students.

14.
With the deepening of the new curriculum reform, teaching philosophies have gradually been updated and students' English proficiency has steadily improved, yet the scoring criteria for the written expression section of the Gaokao (college entrance examination) English test, in use for many years, have not kept pace and can no longer fully meet the requirements of English teaching reform. The author argues that, measured against the Curriculum Standards, the criteria set relatively low demands on students' written expression ability, and that, compared with the writing rubrics of tests such as TOEFL, the holistic scoring approach carries relatively high uncertainty and the analytic descriptors are not entirely reasonable. To address these problems, this paper proposes a "modified holistic scoring method."

15.
Numerous studies have examined performance assessment data using generalizability theory. Typically, these studies have treated raters as randomly sampled from a population, with each rater judging a given performance on a single occasion. This paper presents two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design. The first study makes explicit the "committee" facet, acknowledging that raters often work within groups. The second study makes explicit the "rating-occasion" facet by having each rater judge each performance on two separate occasions. The results of the first study highlight the importance of clearly specifying the relevant facets of the universe of interest. Failing to include the committee facet led to an overly optimistic estimate of the precision of the measurement procedure. By contrast, failing to include the rating-occasion facet, in the second study, had minimal impact on the estimated error variance.
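To see why omitting the committee facet can overstate precision, consider standard G-theory notation (an illustration, not the paper's actual variance components) for a design in which raters are nested within committees, p × (r:c): the relative error variance for person scores is

```latex
\sigma^2_{\delta} \;=\; \frac{\sigma^2_{pc}}{n_c} \;+\; \frac{\sigma^2_{pr:c,\,e}}{n_c n_r}
```

A specification that ignores committees drops the person-by-committee term σ²_pc from the error, so the estimated error variance is too small and the measurement procedure looks more precise than it is.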

16.
This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14-year-olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters' previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters' scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.

17.
This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

18.
There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large-scale writing assessments. This study compared the effectiveness of two widely used rater training methods—self-paced and collaborative frame-of-reference training—in the context of a large-scale writing assessment. Sixty-six raters were randomly assigned to the training methods. After training, all raters scored the same 50 representative essays prescored by a group of expert raters. A series of generalized linear mixed models were then fitted to the rating data. Results suggested that the self-paced method was equivalent in effectiveness to the more time-intensive and expensive collaborative method. Implications for large-scale writing assessments and suggestions for further research are discussed.
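One hypothetical way such a comparison can be framed (a sketch, not the authors' exact model specification) is a logistic mixed model for whether rater r's score on essay e matches the expert score, with a fixed effect for training method and crossed random effects for raters and essays:

```latex
\operatorname{logit} P(y_{re} = 1) \;=\; \beta_0 + \beta_1\,\mathrm{method}_r + u_r + v_e,
\qquad u_r \sim N(0, \sigma^2_u), \; v_e \sim N(0, \sigma^2_v)
```

Equivalence of the two training methods then corresponds to β_1 being indistinguishable from zero.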

19.
Evaluating Rater Accuracy in Performance Assessments
A new method for evaluating rater accuracy within the context of performance assessments is described. Accuracy is defined as the match between ratings obtained from operational raters and those obtained from an expert panel on a set of benchmark, exemplar, or anchor performances. An extended Rasch measurement model called the FACETS model is presented for examining rater accuracy. The FACETS model is illustrated with 373 benchmark papers rated by 20 operational raters and an expert panel. The data are from the 1993 field test of the High School Graduation Writing Test in Georgia. The data suggest that there are statistically significant differences in rater accuracy; the data also suggest that it is easier to be accurate on some benchmark papers than on others. A small example is presented to illustrate how the accuracy ordering of raters may not be invariant over different subsets of benchmarks used to evaluate accuracy.

20.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters' scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM-SDT model is applied to data from a large-scale assessment and is shown to provide a useful summary of various aspects of the raters' performance.
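In broad strokes, the rater-level signal detection model referred to here can be written (standard latent class SDT notation, not the paper's exact parameterization) as

```latex
P(Y_{ij} \ge k \mid \eta_j = c) \;=\; F\!\left(d_i\, c - c_{ik}\right)
```

where η_j is the latent categorical quality of essay j, d_i is rater i's discrimination (precision), c_ik is rater i's criterion for awarding at least score k, and F is a logistic or normal cumulative distribution function; severity or leniency shows up in the placement of the criteria, while precision shows up in d_i.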

