Similar Documents
18 similar documents found
1.
Most large-scale examinations now score essays under a double-rating model. This study applied the many-facet Rasch model (MFRM) to identify the sources of rater error, and their main influencing factors, in large-scale English essay scoring under double rating. Analysis of 2,427 essays scored by 57 raters found that: (1) raters differed significantly in severity; (2) about 22.8% of rater pairs showed poor inter-rater consistency, while about 3.5% agreed suspiciously well; (3) about 90% of raters were highly self-consistent, but 8.8% showed poor intra-rater consistency and about 2% were excessively consistent; (4) overall, raters differed in severity across the scoring criteria (dimensions) and across score bands, while the rater-by-examinee and rater-by-examinee-by-criterion interactions were not significant; (5) raters were equally severe toward male and female examinees.
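A minimal sketch of the kind of many-facet Rasch (rating scale) analysis described, fitted by maximum likelihood with SciPy; the data layout, facet sizes, and starting values are assumptions, and operational studies would use dedicated software such as FACETS rather than this toy estimator.

```python
import numpy as np
from scipy.optimize import minimize

# (person, rater, score) triplets; scores on a 0..K scale (toy data)
data = np.array([[0, 0, 2], [0, 1, 3], [1, 0, 1], [1, 1, 2],
                 [2, 0, 3], [2, 1, 3], [3, 0, 0], [3, 1, 1]])
n_persons, n_raters, K = 4, 2, 3  # K + 1 score categories

def neg_log_lik(params):
    theta = params[:n_persons]                      # person ability
    delta = params[n_persons:n_persons + n_raters]  # rater severity
    tau = params[n_persons + n_raters:]             # category thresholds
    ll = 0.0
    for p, r, x in data:
        # rating-scale model: the log-numerator of category k is
        # sum_{j<=k} (theta - delta - tau_j), with category 0 fixed at 0
        num = np.cumsum(np.concatenate(([0.0], theta[p] - delta[r] - tau)))
        probs = np.exp(num - num.max())
        probs /= probs.sum()
        ll += np.log(probs[x])
    return -ll

# NOTE: the model is identified only up to location; a real analysis would
# anchor or mean-centre one facet (e.g., constrain mean rater severity to 0).
fit = minimize(neg_log_lik, np.zeros(n_persons + n_raters + K), method="BFGS")
print("estimated rater severities:", fit.x[n_persons:n_persons + n_raters])
```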

2.
Computer-automated scoring (CAS) of translation tests in foreign-language courses of the self-taught higher education examinations can effectively improve scoring efficiency and objectivity. This study ran correlation analyses and paired-sample t-tests on the automated and human scores of translation-test responses from 72 self-taught learners and compared the diagnostic results of the two scoring modes. Automated and human scores were highly correlated, and total test scores did not differ significantly between the two modes, so the automated scores for this translation test were on the whole reliable; however, the two modes diagnosed the structure of the learners' translation ability somewhat differently.
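A minimal sketch of the reported comparison, assuming placeholder score vectors: Pearson correlation plus a paired-sample t-test with SciPy.

```python
import numpy as np
from scipy import stats

# illustrative placeholder scores for the same responses under both modes
human = np.array([68, 72, 75, 80, 66, 90, 58, 77])
machine = np.array([70, 71, 74, 82, 64, 88, 60, 79])

r, r_p = stats.pearsonr(human, machine)   # agreement between the two modes
t, t_p = stats.ttest_rel(human, machine)  # paired test on the mean difference
print(f"correlation r={r:.3f} (p={r_p:.3f})")
print(f"paired t-test t={t:.3f} (p={t_p:.3f})")
```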

3.
Subjective examinations depend on human raters, and because rater consistency is often low and reliability therefore lacking, the measurement community has long sought ways to improve the reliability of subjective scoring. This paper applies Longford's method in an empirical check for aberrant raters among those who scored the HSK (Advanced) essay examination. The results demonstrate that the method is indeed effective for detecting rater differences in large-scale standardized subjective examinations.

4.
To address the shortcomings of the current single-rating ("one rating decides the score") method in online essay marking, this article proposes applying a triple-rating method to essay scoring. The results show that under single rating, inter-rater consistency is unsatisfactory and significant differences exist between raters. Triple rating reduces scoring error to some extent and safeguards marking quality, but in practice third raters must be kept from playing it safe (drifting toward the existing scores) if the method is to be applied scientifically and soundly. Whether it can be deployed for large-scale online essay scoring requires further study.
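The paper does not spell out its adjudication rule, so the sketch below shows one common variant, under assumed tolerance and averaging conventions: two ratings stand if they agree within a tolerance, otherwise a third rating is obtained and the closest pair is averaged.

```python
def final_score(r1, r2, r3=None, tolerance=3.0):
    """Resolve an essay score from two ratings, adjudicated by a third."""
    if abs(r1 - r2) <= tolerance:
        return (r1 + r2) / 2                      # the two ratings agree
    if r3 is None:
        raise ValueError("ratings disagree; a third rating is required")
    # average the pair of ratings with the smallest gap
    gap, a, b = min((abs(a - b), a, b)
                    for a, b in ((r1, r2), (r1, r3), (r2, r3)))
    return (a + b) / 2

print(final_score(42, 44))      # 43.0 -- no third rating needed
print(final_score(40, 48, 47))  # 47.5 -- third rating adjudicates
```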

5.
The article first reviews the concepts of reliability and validity and the methods for testing them, then, on that basis, subjects the collected machine scores and expert human scores to correlation analysis, reliability testing, repeated-measures ANOVA, independent-samples t-tests, and qualitative analysis, validating the reliability and validity of the multi-rater scoring system from several angles. The results show that the system has good internal consistency and acceptable reliability, though reliability drops when the proportion of first-round ratings is high; comparison against expert scores shows that the automated system explains writing ability poorly for expository and practical writing.
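One of the reliability checks mentioned, internal consistency, is commonly operationalized as Cronbach's alpha; a minimal sketch follows, with a placeholder essays-by-dimensions score matrix.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = essays, columns = scoring dimensions/items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

scores = np.array([[12, 14, 13], [9, 10, 11], [15, 15, 14],
                   [8, 9, 9], [11, 12, 12]], dtype=float)
print(f"alpha = {cronbach_alpha(scores):.3f}")
```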

6.
On Rating Error in Oral English Examinations
Scoring an oral examination is a cognitive process in which raters apply scoring criteria to candidates' language output, and the purpose of that processing is to account for the score variance among candidates. The variables that explain score variance are either construct-relevant or construct-irrelevant; when construct-irrelevant variables come into play, rating error arises. Test error divides into systematic error and random error, and random error is the chief target of rating-error control. In oral examinations, rating error chiefly shows up as raters' individual differences, regression toward the mean, and spuriously normal score distributions, and its magnitude can be checked with statistics such as the distribution of score variance and regression coefficients. The paper also discusses the goals, principles, and methods of controlling rating error in oral examinations. The goal is to maximize the effect of construct-relevant variables and minimize that of construct-irrelevant ones; this requires raters to observe three basic principles, consistency, completeness, and independence, and the available means of control include administrative, technical, and statistical measures.

7.
In educational evaluations scored by multiple judges, the consistency of the judges' rank orderings affects the credibility of the evaluation. Kendall's coefficient of concordance can be used to test the degree of rating consistency and thereby judge the trustworthiness of the evaluation data and the validity of the evaluation exercise.
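A minimal sketch of Kendall's coefficient of concordance (W); SciPy provides the ranking helper but not W itself, so the coefficient is computed directly (tie correction omitted).

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings: np.ndarray) -> float:
    """ratings: rows = judges, columns = objects being ranked."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank within judge
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

ratings = np.array([[9, 7, 8, 5], [8, 6, 9, 4], [9, 8, 7, 6]], dtype=float)
print(f"W = {kendalls_w(ratings):.3f}")  # 1 = perfect concordance
```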

8.
A Study of Rater Differences in Online Marking of the Adult College Entrance Examination
Online marking under a four-rating model has both strengths and weaknesses. Using a survey, this paper studies rater differences in the online marking of the October 2006 adult college entrance examination in Liaoning Province. Raters were found to differ in marking speed, mean awarded score, score standard deviation, and score-release rate. Over the course of marking, raters' speed and score-release rate gradually rose and their score standard deviations gradually fell, while differences in mean scores remained small. Interviews further yielded effective strategies for reducing rater differences: timely feedback, strict requirements, sound training, and tightened error thresholds.
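A hedged sketch of the kind of per-rater monitoring statistics examined, aggregated with pandas from an assumed marking log; the column names and figures are placeholders.

```python
import pandas as pd

# toy marking log: one row per script marked
log = pd.DataFrame({
    "rater": ["A", "A", "A", "B", "B", "B"],
    "score": [12, 10, 11, 14, 9, 13],
    "seconds_spent": [75, 60, 66, 40, 35, 38],
})
per_rater = log.groupby("rater").agg(
    mean_score=("score", "mean"),
    score_sd=("score", "std"),
    mean_seconds=("seconds_spent", "mean"),
    scripts_marked=("score", "size"),
)
print(per_rater)  # large between-rater gaps would flag retraining candidates
```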

9.
With the development of information technology, automated scoring of subjective (constructed-response) items has become a research hotspot in testing and assessment. Deep-learning approaches to such scoring still have two limitations: they usually need ample training samples to perform well, which real marking scenarios often cannot supply, and the models predict only a total score, offering no scoring detail to support later evaluation of the results. To address these problems, this paper proposes a Siamese-network scoring method based on domain pretraining, explores how candidates' answer texts can be used to raise marking accuracy, and examines the feasibility and implementation of a scoring-point model. Experiments show that the Siamese-network method effectively improves the accuracy of automated subjective-item scoring under small-sample conditions.
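A hedged PyTorch sketch in the spirit of the proposed method: a shared encoder embeds a student answer and a reference answer, and their similarity drives the predicted score. The toy bag-of-embeddings encoder stands in for the domain-pretrained model, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class SiameseScorer(nn.Module):
    def __init__(self, vocab_size: int = 5000, dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # shared encoder
        self.head = nn.Linear(1, 1)                    # similarity -> score

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)                   # mean of embeddings

    def forward(self, answer_ids, reference_ids):
        a = self.encode(answer_ids)
        b = self.encode(reference_ids)
        sim = nn.functional.cosine_similarity(a, b).unsqueeze(-1)
        return self.head(sim).squeeze(-1)              # predicted score

model = SiameseScorer()
answer = torch.randint(0, 5000, (2, 20))     # batch of 2 tokenised answers
reference = torch.randint(0, 5000, (2, 20))  # matching reference answers
print(model(answer, reference))              # two predicted scores
```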

10.
This study used the many-facet Rasch model to compare rater effects under a wide and a narrow rating scale. Compared with the narrow scale, raters working with the wide scale failed to use the full range of score points and tended to award central scores, and inter-rater consistency under the wide scale was poorer. On this basis, the study recommends improving the design of the writing rating scales used in China's examinations and reporting writing scores separately.

11.
Martin. Assessing Writing, 2009, 14(2): 88-115
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.
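A minimal sketch of one standard consistency check for two raters applying the same rubric, quadratic weighted kappa via scikit-learn; the rating vectors are placeholders, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# illustrative placeholder ratings on the same scripts by two raters
rater1 = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]
rater2 = [3, 4, 3, 5, 2, 4, 1, 2, 5, 3]

kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.3f}")  # 1 = perfect agreement
```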

12.
This study investigates how experienced and inexperienced raters score essays written by ESL students on two different prompts. The quantitative analysis using multi-faceted Rasch measurement, which provides measurements of rater severity and consistency, showed that the inexperienced raters were more severe than the experienced raters on one prompt but not on the other prompt, and that differences between the two groups of raters were eliminated following rater training. The qualitative analysis, which consisted of analysis of raters' think-aloud protocols while scoring essays, provided insights into reasons for these differences. Differences were related to the ease with which the scoring rubric could be applied to the two prompts and to differences in how the two groups of raters perceived the appropriateness of the prompts.

13.
In the current study, two pools of 250 essays, all written as a response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to the essay’s true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time, and by calculating an estimate of the true scores using the remaining raters, an independent criterion against which to judge the validity of the human raters and that of the AES system, as well as the interrater reliability was produced. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support a claim that the reliability and validity of AES diverge: although the AES scoring is, naturally, more consistent than the human ratings, it is less valid.
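A hedged sketch of the validation logic described, on synthetic data: hold a rater out, average the remaining raters as a proxy for the essay's true score, and compare how well the held-out rater and the AES scores track that criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
human = rng.normal(3, 1, size=(250, 14))           # 250 essays x 14 raters (toy)
aes = human.mean(axis=1) + rng.normal(0, 0.3, 250) # toy AES scores

held_out = human[:, 0]                  # drop one rater
criterion = human[:, 1:].mean(axis=1)   # proxy true score from the rest

r_human = np.corrcoef(held_out, criterion)[0, 1]
r_aes = np.corrcoef(aes, criterion)[0, 1]
print(f"held-out human vs criterion: r={r_human:.3f}")
print(f"AES vs criterion:            r={r_aes:.3f}")
```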

14.
In generalizability theory studies in large-scale testing contexts, sometimes a facet is very sparsely crossed with the object of measurement. For example, when assessments are scored by human raters, it may not be practical to have every rater score all students. Sometimes the scoring is systematically designed such that the raters are consistently grouped throughout the scoring, so that the data can be analyzed as raters nested within teams. Other times, rater pairs are randomly assigned for each student, such that each rater is paired with many other raters at different times. One possibility for this scenario is to treat the data as if raters were nested within students. Because the raters are not truly independent across all students, the resulting variance components could be somewhat biased. This study illustrates how the bias will tend to be small in large-scale studies.
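A minimal sketch of a fully crossed persons-by-raters G-study, estimating variance components from expected mean squares; the score matrix is a placeholder. Treating randomly paired raters as nested within persons would pool the rater and interaction components, which is the potential bias the study examines.

```python
import numpy as np

# toy fully crossed design: rows = persons, columns = raters
x = np.array([[4., 5., 4.], [2., 3., 2.], [5., 5., 4.], [3., 4., 3.]])
n_p, n_r = x.shape
grand = x.mean()

ms_p = n_r * ((x.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((x.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

var_pr = ms_pr                   # person x rater interaction (with error)
var_p = (ms_p - ms_pr) / n_r     # person (object of measurement)
var_r = (ms_r - ms_pr) / n_p     # rater
g_coef = var_p / (var_p + var_pr / n_r)  # G coefficient, relative decisions
print(f"var_p={var_p:.3f} var_r={var_r:.3f} var_pr={var_pr:.3f} G={g_coef:.3f}")
```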

15.
This study evaluated rater accuracy with rater-monitoring data from high stakes examinations in England. Rater accuracy was estimated with cross-classified multilevel modelling. The data included face-to-face training and monitoring of 567 raters in 110 teams, across 22 examinations, giving a total of 5500 data points. Two rater-monitoring systems (Expert consensus scores and Supervisor judgement of correct scores) were utilised for all raters. Results showed significant group training (table leader) effects upon rater accuracy and these were greater in the expert consensus score monitoring system. When supervisor judgement methods of monitoring were used, differences between training teams (table leader effects) were underestimated. Supervisor-based judgements of raters’ accuracies were more widely dispersed than in the Expert consensus monitoring system. Supervisors not only influenced their teams’ scoring accuracies, they overestimated differences between raters’ accuracies, compared with the Expert consensus system. Systems using supervisor judgements of correct scores and face-to-face rater training are, therefore, likely to underestimate table leader effects and overestimate rater effects.
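A hedged sketch of a cross-classified random-effects analysis of rater accuracy, with crossed variance components for rater and team fitted in statsmodels; the data are synthetic and the column names are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "rater": rng.integers(0, 50, n).astype(str),
    "team": rng.integers(0, 10, n).astype(str),
})
df["accuracy"] = rng.normal(0, 1, n)  # placeholder monitoring outcome
df["all"] = 1                         # single group => fully crossed effects

# crossed random effects expressed as variance components within one group
model = smf.mixedlm(
    "accuracy ~ 1", df, groups="all", re_formula="0",
    vc_formula={"rater": "0 + C(rater)", "team": "0 + C(team)"},
)
fit = model.fit()
print(fit.summary())  # compare rater vs team variance components
```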

16.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

17.
Many states are implementing direct writing assessments to assess student achievement. While much literature has investigated minimizing raters' effects on writing scores, little attention has been given to the type of model used to prepare raters to score direct writing assessments. This study reports on an investigation that occurred in a state-mandated writing program when a scoring anomaly became apparent once assessments were put in operation. The study indicates that using a spiral model for training raters and scoring papers results in higher mean ratings than does using a sequential model for training and scoring. Findings suggest that making decisions about cut-scores based on pilot data has important implications for program implementation.

18.
An assumption that is fundamental to the scoring of student-constructed responses (e.g., essays) is the ability of raters to focus on the response characteristics of interest rather than on other features. A common example, and the focus of this study, is the ability of raters to score a response based on the content achievement it demonstrates independent of the quality with which it is expressed. Previously scored responses from a large-scale assessment in which trained scorers rated exclusively constructed-response formats were altered to enhance or degrade the quality of the writing, and scores that resulted from the altered responses were compared with the original scores. Statistically significant differences in favor of the better-writing condition were found in all six content areas. However, the effect sizes were very small in mathematics, reading, science, and social studies items. They were relatively large for items in writing and language usage (mechanics). It was concluded from the last two content areas that the manipulation was successful and from the first four that trained scorers are reasonably well able to differentiate writing quality from other achievement constructs in rating student responses.
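A minimal sketch of the effect-size comparison reported, Cohen's d between scores awarded under the better-writing and degraded-writing conditions; the score vectors are placeholders.

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference with a pooled standard deviation."""
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

better = np.array([3.2, 3.5, 3.1, 3.6, 3.4, 3.3])
degraded = np.array([3.1, 3.4, 3.0, 3.5, 3.4, 3.2])
# a small d suggests raters separate writing quality from content achievement
print(f"d = {cohens_d(better, degraded):.2f}")
```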
