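Constructed-response items are the main item type in independent-admission examinations; they offer advantages for assessing academic achievement but carry unavoidable scoring error. Through a statistical analysis and quality evaluation of the 2013 "Huayue" (华约) independent-admission mathematics paper, this article explains the causes and effects of scoring error in constructed-response items from six angles: overall scoring, rater severity and leniency, central tendency, restriction of scale range, interaction effects, and differential facet functioning. In light of the demands for consistency, reasonableness, and accuracy in test evaluation, it proposes ways to eliminate and control scoring error at the item-writing, marking, and feedback stages, so as to improve the quality of test evaluation in China's basic education.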

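On Scoring Error in English Speaking Tests   Total citations: 1 (self-citations: 0, citations by others: 1)
Scoring a speaking test is a cognitive process in which raters apply the scoring criteria to language output; the purpose of that process is to account for score variance among candidates. The variables used to explain score variance comprise construct-relevant variables and construct-irrelevant variables. When construct-irrelevant variables come into play, scoring error arises. Test error can be divided into systematic error and random error. Random error is the main focus of scoring-error control. The main manifestations of scoring error in speaking tests are individual differences among raters, regression toward the mean, and pseudo-normal distribution. The size of the scoring error can be checked with statistical tools such as the distribution of score variance and regression coefficients. The article also discusses the goals, principles, and methods of controlling scoring error in speaking tests. The goal is to maximize the effect of construct-relevant variables and minimize the influence of construct-irrelevant variables; this requires raters to adhere to three basic principles during scoring: consistency, completeness, and independence. As for means, error control in speaking-test scoring mainly involves administrative, technical, and statistical measures.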

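With the development of modern educational and psychological measurement theory, and especially the application of computer technology, research on controlling scoring error in gaokao (national college entrance examination) essays has produced a steady stream of results. Since some provinces and municipalities in China adopted online marking in the early 2000s, the theory of error control in gaokao essay scoring has also made considerable progress in practice. It must be admitted, however, that scoring error in gaokao essays still exists (in some places seriously so); we therefore put forward our own proposals, building on earlier research.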

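Scores on comprehensive or essay-type items carry considerable random error, caused by subjective factors on the part of the marking teachers. For example, one province ran an experiment on the marking of the 1984 gaokao essay: all marking teachers at the marking site (438 people) independently scored four essays according to the unified scoring criteria issued by the Ministry of Education. The standard error of the mean scoring error reached 4.4, and the mean maximum range of scores reached 28 (the essay was worth 50 points). The consequences of such startling scoring error are self-evident. Controlling error in scoring is therefore very important work. This article introduces three methods of controlling scoring error. 1. The averaging method: each item is rated by at least two people, and a limit on the scoring discrepancy is specified. If the difference between the two raters' scores does not exceed this limit …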

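Online marking, a method that has emerged in recent years for using modern technology to control scoring error on subjective items, is markedly effective at controlling error in essay scoring. It achieves scoring-error control mainly through five methods: control of inter-rater consistency error, control of intra-rater consistency error, control of the discrepancy between the two ratings in double marking, control of error across scoring points, and spot-check monitoring. In addition, error control is managed through a machine-control system. As research deepens and technology develops, Internet-based marking of gaokao essays and automated computer marking may become feasible, further improving control of marking error.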

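I. Statement of the problem. Scoring of essay-type items contains error, caused by subjective factors on the part of the rater. A common way of examining scoring error is to compute the degree of correlation between sets of scores and thereby estimate the effect of scoring error on reliability; this is called inter-rater reliability. Inter-rater reliability is generally computed in one of two situations. In the first, two raters score many examinees, or one teacher scores many examinees twice, and the correlation coefficient between the two sets of scores is computed. In the second, many raters score many examinees, or one teacher scores many examinees repeatedly, and Kendall's coefficient of concordance is computed. Kendall's coefficient of concordance is in effect a rank correlation coefficient among multiple columns of variables.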
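
As a rough illustration of the second situation (many raters, many examinees), the sketch below computes Kendall's coefficient of concordance W from a raters-by-examinees score table. The function names and the sample scores are hypothetical, and this version applies no tie correction, so it is only a minimal sketch rather than the procedure used in the abstract above.

```python
def average_ranks(scores):
    """Rank scores from 1 (lowest) to n; tied scores share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for ratings[r][s] = score from rater r for examinee s.

    W = 12 * S / (m^2 * (n^3 - n)), with m raters, n examinees, and S the sum
    of squared deviations of the rank sums from their mean. No tie correction.
    """
    m = len(ratings)
    n = len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(ranked[r][s] for r in range(m)) for s in range(n)]
    mean_sum = m * (n + 1) / 2
    s_stat = sum((t - mean_sum) ** 2 for t in rank_sums)
    return 12 * s_stat / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical scores: three raters, four essays.
    sample = [[30, 35, 40, 45], [28, 36, 38, 44], [25, 33, 41, 42]]
    print(kendalls_w(sample))
```

W ranges from 0 (no agreement in rank order) to 1 (identical rank order across raters). For the two-rater situation an ordinary correlation coefficient between the two score columns plays the same role.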

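An Applied Study of Generalizability Theory in Scoring Error for Structured Interviews   Total citations: 1 (self-citations: 0, citations by others: 1)
Generalizability theory was applied to study the control of scoring error in structured interviews. The results show that structured-interview scores reflect examinees' true ability level well and have high reliability. While maintaining a high level of interview scoring reliability (0.80), it is recommended that the number of interviewers be reduced to 9 to improve the economy and efficiency of structured interviews.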
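
The trade-off between reliability and the number of interviewers can be sketched with a simple decision study. The example below assumes a single-facet crossed design (examinees by raters), which the abstract does not confirm, and the variance components are invented for illustration; the function names are likewise hypothetical.

```python
def g_coefficient(var_person, var_residual, n_raters):
    """Generalizability coefficient for a crossed person-by-rater design.

    E(rho^2) = var_person / (var_person + var_residual / n_raters), where
    var_residual is the person-by-rater interaction confounded with error.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return var_person / (var_person + var_residual / n_raters)


def min_raters(var_person, var_residual, target, max_raters=100):
    """Smallest number of raters whose coefficient reaches the target, or None."""
    for n in range(1, max_raters + 1):
        if g_coefficient(var_person, var_residual, n) >= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical variance components, not taken from the study.
    var_person, var_residual = 4.0, 6.5
    for n in (3, 5, 7, 9):
        print(n, round(g_coefficient(var_person, var_residual, n), 3))
    print("raters needed for 0.80:", min_raters(var_person, var_residual, 0.80))
```

Because the error term is divided by the number of raters, each additional rater adds less reliability than the one before, which is why a panel can sometimes be reduced while staying above a target such as 0.80.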

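At present most large-scale examinations score essays by double marking. This study used the many-facet Rasch model (MFRM) to analyze the sources of rater error and the main influencing factors in large-scale English essay scoring under double marking. Analysis of 2,427 essays scored by 57 raters found: (1) rater severity differed significantly; (2) about 22.8% of raters showed poor inter-rater consistency, and about 3.5% showed excessively high inter-rater consistency; (3) about 90% of raters showed high intra-rater consistency, but 8.8% showed very poor intra-rater consistency and about 2% showed excessively high intra-rater consistency; (4) overall, raters differed in how severely they applied different scoring criteria (dimensions) and different score levels, while the rater-by-examinee interaction and the three-way rater-by-examinee-by-criterion interaction were not significant; (5) raters were equally severe toward male and female students.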
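
For orientation, a commonly used three-facet rating-scale form of the many-facet Rasch model is written out below. The abstract does not give the exact specification the study used, so the choice of facets and the symbols are assumptions made for illustration.

```latex
% P_{nijk}: probability that examinee n receives category k (rather than k-1)
%           from rater j on scoring criterion i
% \theta_n: ability of examinee n
% \delta_i: difficulty of scoring criterion i
% \alpha_j: severity of rater j
% \tau_k:   threshold between categories k-1 and k
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
```

Under this form, a more severe rater (larger \alpha_j) lowers the log-odds of the higher category for every examinee, which is what a test of differences in rater severity examines.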

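Statement of the problem. The "ultimate aim" of Chinese language teaching "is that students can read on their own without waiting for the teacher to explain, and can write on their own without waiting for the teacher to correct. Only when the teacher's training achieves these two points is the teaching a success" (Ye Shengtao). Hence every year's gaokao has treated reading and composition as the basic content of the Chinese examination. In the 1986 gaokao, for example, the score ratio of language knowledge and reading (half vernacular, half classical Chinese) to composition was 7:5. In marking gaokao Chinese papers, the scoring of language knowledge and reading has never shown much error, and as item writing moves toward standardization, the scoring of these two parts will more readily approach or achieve scientific rigor. Composition alone is different: not only is its scoring hard to standardize, but the error is considerable, so candidates' writing ability cannot be assessed with reasonable accuracy, which hampers the selection of qualified new students by higher-education institutions. In view of this, and aware of our own limitations, we address the problems in gaokao essay scoring and offer a preliminary proposal for reforming the scoring method, in the hope of prompting reflection and discussion among Chinese-language teachers, so that …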

Although much attention has been given to rater effects in rater‐mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large‐scale, multi‐state summative assessment program. Multilevel models were used to assess the overall extent of rater leniency/severity during scoring and examine the extent to which leniency/severity effects were stable across the three administrations. Model results were then applied to scaled scores to estimate the impact of the stability of leniency/severity effects on students’ scores. Results showed relative scoring stability across administrations in mathematics. In English language arts, short constructed response items showed evidence of slightly increasing severity across administrations, while essays showed mixed results: evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when model results were applied to scaled scores, results revealed rater effects had minimal impact on students’ scores.

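A sentencing recommendation is the basis for explaining the reasonableness of a prosecutorial protest against a sentence, and such a protest presupposes a clearly specified sentencing recommendation; at the same time, the sentencing recommendation is an effective means of limiting sentencing protests. We must therefore construct a reasonable model for the procuratorate's participation in the determination of punishment, in particular the institutional linkage between sentencing recommendations and sentencing protests, and address the revision of the sentencing protest itself.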

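A Study of Scoring Rubrics for Subjective Items   Total citations: 1 (self-citations: 0, citations by others: 1)
Taking the scoring rubric for the essay question in the 2006 Shanghai gaokao politics paper as an example, this article examines how to judge the quality of a scoring rubric for subjective items from three angles: whether each scoring criterion is relatively independent; whether the results on the several criteria allow inference about the candidate's overall ability in written argument; and whether the grade levels within each criterion are reasonably defined. Factor analysis shows that the four scoring criteria are unidimensional, with a single factor interpretable as the candidate's overall argumentation ability. Correlation analysis shows that all four criteria are relatively independent and that each contributes independently to inferring that ability. Analysis with the Rasch rating scale model shows that the grade divisions for each criterion are basically reasonable, although a few levels provide insufficient information; on this basis, several suggestions for improving the rubric are offered.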

This study considered middle school mathematics teachers’ use of rubrics to score non‐traditional tasks. A group of eighth‐grade teachers attended a two‐day workshop where they evaluated assessment tasks and discussed the use of an associated scoring rubric. Scored samples of student work submitted by the teachers indicated that they had difficulty using the rubrics for scoring. When compared to expert ratings, all except one teacher had discrepancies in scoring and some discrepancies indicated major problems. These discrepancies appear to be related to whether the task contained familiar or unfamiliar content and the mix of procedure and explanation the task required. Several other factors related to discrepancies, such as leniency errors, teacher knowledge, and the halo effect, are also discussed. With the expanded use of rubrics in many arenas, these results show the need for more professional development related to rubric use.

Many researchers have suggested that the main cause of item bias is the misspecification of the latent ability space, where items that measure multiple abilities are scored as though they are measuring a single ability. If two different groups of examinees have different underlying multidimensional ability distributions and the test items are capable of discriminating among levels of abilities on these multiple dimensions, then any unidimensional scoring scheme has the potential to produce item bias. It is the purpose of this article to provide the testing practitioner with insight about the difference between item bias and item impact and how they relate to item validity. These concepts will be explained from a multidimensional item response theory (MIRT) perspective. Two detection procedures, the Mantel-Haenszel (as modified by Holland and Thayer, 1988) and Shealy and Stout's Simultaneous Item Bias (SIB; 1991) strategies, will be used to illustrate how practitioners can detect item bias.
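
To make the Mantel-Haenszel idea concrete, the sketch below computes the common odds ratio across matched score levels and converts it to the delta scale often reported for this statistic. The function names and the counts are hypothetical, and the sketch omits the chi-square significance test, so it should be read as an illustration of the core computation only, not as the procedure from the article.

```python
import math


def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.

    Each stratum is one matched total-score level, given as
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    alpha = sum(A*D/N) / sum(B*C/N) over strata.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_ok, ref_bad, foc_ok, foc_bad in strata:
        total = ref_ok + ref_bad + foc_ok + foc_bad
        if total == 0:
            continue  # an empty score level carries no information
        numerator += ref_ok * foc_bad / total
        denominator += ref_bad * foc_ok / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(alpha):
    """Delta-scale index, -2.35 * ln(alpha); negative values favor the reference group."""
    return -2.35 * math.log(alpha)


if __name__ == "__main__":
    # Hypothetical counts for one item at three matched score levels.
    counts = [(20, 30, 10, 40), (40, 20, 25, 35), (45, 5, 35, 15)]
    alpha = mh_odds_ratio(counts)
    print(alpha, mh_delta(alpha))
```

An odds ratio near 1 (delta near 0) means that examinees of the same total score answer the item correctly at similar rates in both groups. Values far from 1 flag the item for review.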

A formal analysis of the effects of item deletion on equating/scaling functions and reported score distributions is presented. There are two components of the present analysis: analytical and empirical. The analytical decomposition demonstrates how the effects of item characteristics, test properties, individual examinee responses, and rounding rules combine to produce the item deletion effect on the equating/scaling function and candidate scores. In addition to demonstrating how the deleted item's psychometric characteristics can affect the equating function, the analytical component of the report examines the effects of not scoring versus scoring all options correct, the effects of re-equating versus not re-equating, and the interaction between the decision to re-equate or to not re-equate and the scoring option chosen for the flawed item. The empirical portion of the report uses data from the May 1982 administration of the SAT, which contained the circles item, to illustrate the effects of item deletion on reported score distributions and equating functions. The empirical data verify what the analytical decomposition predicts.

Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items.
