Similar literature
 20 similar documents found
1.
张媛  张兰芳  朱新华 《文教资料》2009,(23):205-207
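Educational measurement has long offered many methods, of increasing precision, for estimating the reliability coefficient of objective items, but the estimation of reliability for essay tests has improved little, and neglect of inter-rater reliability has introduced error into reliability estimates for essay tests. The article first reviews the methods for measuring the reliability coefficient of essay tests and for calculating the inter-rater reliability coefficient. It then identifies this error, analyzes its causes, and proposes a more complete formula. Finally, it introduces the corresponding reliability estimation methods.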

2.
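This study took first- and second-year junior secondary and first- and second-year senior secondary students in Liaocheng as participants and used their completed essays as the research sample. Applying the random two-facet crossed design of generalizability theory, it examined how to set the essay evaluation criteria and the number of raters. The results show that a suitable increase in the number of raters or of evaluation criteria reduces measurement error and raises test reliability, but as the number of raters or criteria keeps growing, the further reduction in error or gain in reliability becomes very small. The paper provides a scientific basis for choosing a suitable number of evaluation criteria and raters when scoring college entrance examination (gaokao) essays.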
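The diminishing-returns pattern reported above can be illustrated with a small decision-study (D-study) sketch for a random two-facet crossed design (persons × criteria × raters). The variance components below are hypothetical and are not taken from the study; the function name `g_coefficient` is also an illustrative choice. The formula is the standard relative generalizability coefficient for this design.

```python
def g_coefficient(var_p, var_pi, var_pr, var_pir_e, n_i, n_r):
    """Relative G coefficient for a random p x i x r crossed design.

    var_p      : universe-score (person) variance
    var_pi     : person-by-criterion interaction variance
    var_pr     : person-by-rater interaction variance
    var_pir_e  : residual (three-way interaction confounded with error)
    n_i, n_r   : number of criteria and raters in the D-study
    """
    relative_error = var_pi / n_i + var_pr / n_r + var_pir_e / (n_i * n_r)
    return var_p / (var_p + relative_error)


# Hypothetical variance components, chosen only for illustration.
components = dict(var_p=0.50, var_pi=0.20, var_pr=0.10, var_pir_e=0.40)

# Hold the number of criteria at 4 and double the number of raters each step.
for n_r in (1, 2, 4, 8):
    coef = g_coefficient(n_i=4, n_r=n_r, **components)
    print(f"raters={n_r}: G coefficient = {coef:.3f}")
```

With these made-up components each doubling of the rater count still raises the coefficient, but by a smaller amount than the previous doubling, which is the qualitative behaviour the abstract describes.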

3.
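Test length is one of the important factors affecting the reliability and validity of language tests. Using the fixed-facet s×(i:p) nested design of generalizability theory (GT) and the law of diminishing marginal utility, this paper reports an empirical study of the test length of the Chinese Proficiency Test, HSK [Intermediate]. The results show that the 130-item HSK [Intermediate] test has very high reliability, with a generalizability coefficient (Eρ²) of 0.8890. Even when the number of items is reduced to 120 or 110, the generalizability coefficient still reaches 0.8856 and 0.8816 (decreases of 0.38% and 0.83%, respectively). Shortening the test in this way clearly lowers development cost and improves testing efficiency, while fully meeting the strict error-control requirements of a standardized examination and ensuring that test results and score interpretations retain high reliability and validity.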

4.
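A Comprehensive Review of Test Reliability   Total citations: 1, self-citations: 0, citations by others: 1
Reliability is an estimate of the degree of consistency of measurement. It is divided into four types: test-retest reliability, alternate-form reliability, homogeneity (internal consistency) reliability, and inter-rater reliability. The length and difficulty of the test, together with the variability and ability level of the examinee group, are the main factors affecting reliability. The standard error of measurement is an alternative expression of reliability and can be used to interpret individual scores or differences between scores. Estimating the reliability of speeded tests and mastery tests requires special methods.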
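As a brief illustration of how the standard error of measurement (SEM) is used to interpret scores, the sketch below applies the classical test theory formula SEM = SD × √(1 − reliability). The standard deviation, reliability, and observed score are hypothetical values, and the function names are illustrative.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


def se_of_difference(sem_a, sem_b):
    """Standard error of the difference between two scores with independent errors."""
    return math.sqrt(sem_a ** 2 + sem_b ** 2)


# Hypothetical test: score SD of 10 points, reliability of 0.91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)

# Approximate 95% band around a hypothetical observed score of 72.
observed = 72.0
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}, 95% band = [{low:.1f}, {high:.1f}]")
```

The band is what makes the SEM useful for individual scores: two examinees whose bands overlap heavily should not be treated as clearly different.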

5.
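In the Putonghua Proficiency Test, scoring differences among testers are an important factor affecting test reliability. To raise reliability, to bring the assigned grades closer to examinees' actual command of Putonghua, and to make the test more scientific and authoritative, the scoring differences among testers must be narrowed. The paper discusses Putonghua intonation, intonation errors and their scope, and how to score an examinee's reading aloud and free speech qualitatively according to the degree of intonation error. It aims to provide an effective basis for narrowing the gap in qualitative scoring and, on that basis, sets higher requirements for testers' own listening discrimination and their feel for spoken Putonghua.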

6.
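This study revised a test of number processing and calculation ability. The revision comprised three parts: (1) translation of the test; (2) a pilot test; (3) large-sample testing. The split-half reliability of the test was 0.86 and Cronbach's α was 0.93. Students diagnosed by teachers as having number processing and calculation difficulties scored lower than typical students on almost all subtests. These results indicate that the test has high reliability and validity.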
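The two coefficients reported above can be computed as in the following minimal sketch. The score matrix is a small hypothetical example, not data from the study, and the odd/even split used for the split-half estimate is an assumption, since the abstract does not say how the halves were formed.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores is a list of rows (one per examinee) of item scores."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def split_half_reliability(scores):
    """Odd/even split-half correlation, stepped up with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in scores]
    even = [sum(row[1::2]) for row in scores]
    r = pearson(odd, even)
    return 2 * r / (1 + r)


# Hypothetical right/wrong (1/0) responses of 6 examinees to 4 items.
demo = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(f"alpha = {cronbach_alpha(demo):.3f}")
print(f"split-half (Spearman-Brown) = {split_half_reliability(demo):.3f}")
```

The Spearman-Brown step is needed because the raw correlation between halves describes a test of half the length.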

7.
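The aim of this study was to establish a vision screening test for children aged 2 to 4 that is suitable for community health care systems. The study examined the feasibility, reliability, and validity of the children's vision screening test developed by the authors. Some of the test items were selected from the ATYCAR vision test and others were designed by the authors. Toys, picture cards, and small objects served as optotypes; children were asked to match them, and their behavior was observed as the indicator of visual response. The results show that the test has high reliability, validity, and feasibility, and that it is also suitable for children who cannot cooperate with other vision tests. It can be used for children's vision screening in health care systems in urban and rural areas of China, so that children with visual impairment receive further diagnosis and intervention.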

8.
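Interviews are an important means of process evaluation and comprehensive evaluation, particularly when high-level universities select top innovative talent. However, the methods commonly used to evaluate interview quality, such as estimating inter-rater reliability or the generalizability coefficient, cannot quickly assess the performance of each individual rater, which affects interview quality. Drawing on classical test theory and generalizability theory, the authors check, one rater at a time, how the reliability estimate changes when that rater's scores are treated as missing, and from this construct a rater contribution index. They illustrate its use and caveats with examples, providing a new method for monitoring rater performance in real time and for safeguarding and improving interview quality.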

9.
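We applied the principles and methods of educational measurement to analyze recent college entrance examination papers in four respects: reliability, validity, item difficulty, and discrimination. Reliability is an index of a test's stability and dependability, indicating how well the test reflects an examinee's stable level. Abroad, reliability is generally required to be above 0.90 and often reaches 0.95. Among the papers we analyzed, only a few had relatively high reliability, such as the 1979 physics and mathematics papers, the 1980 chemistry paper, and the 1981 English and history papers; the rest were not high. Validity is an index of a test's accuracy and effectiveness, indicating how well it actually measures what it is intended to measure. Abroad, the mean first-year university grades of admitted candidates are generally used as the criterion for the validity of the entrance examination, and the correlation between the two reaches 0.4 to 0.7. Analyzing data from 24 classes at 16 universities, we found the correlations to be very low, with many negative values, which to some extent shows that college entrance examination scores do not effectively predict students' academic performance at university.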

10.
张军 《考试研究》2013,(4):68-75
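Course tests in teaching Chinese as a foreign language are criterion-referenced tests, so item analysis and evaluation should use the technical indices of the criterion-referenced testing framework; traditional methods of analysis (such as discrimination) are not fully applicable to item analysis of course tests. This paper applies that framework to an examination paper from the 汉语进修学院 of Beijing Language and Culture University (北京语言大学), in the hope of offering some useful experience for teaching Chinese as a foreign language. The results show that, for "masters" and "non-masters", item difficulty is acceptable overall and most items discriminate well. Some items are slightly "flawed" but are worth keeping in order to improve the test's coverage of the teaching content and its reliability. Seven items are too difficult or too easy and have almost no discriminating power; they need to be deleted or revised.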

11.
ABSTRACT

The authors address the reliability of scores obtained on the summative performance assessments during the pilot year of our research. Contrary to classical test theory, we discussed the advantages of using generalizability theory for estimating reliability of scores for summative performance assessments. Generalizability theory was used as the framework because of the flexibility this approach provides for examining sources of inconsistency within a complex assessment. Two major sources of inconsistency on scores considered in this study were raters and agencies (teachers' rating vs. researchers' rating). Overall, results showed that the inconsistency in scores attributable to raters and agencies was relatively small. Suggestions regarding improvement of consistency in the subsequent years of our research were provided.

12.
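Application of the Many-Facet Rasch Model in Rater Training for Subjective Items   Total citations: 7, self-citations: 2, citations by others: 7
The scoring of subjective items is affected by many factors, such as raters' knowledge, overall ability, and personal preferences. These rater biases not only produce subjective differences between raters but also make the same rater's scoring unstable over time, and they ultimately lower the scoring reliability of subjective items. This study applied the many-facet Rasch model to rater training for the essay questions of a national examination. By analyzing trial-scoring data from 6 experienced raters on 58 papers, four types of rater bias were identified, and individual feedback was then given to each rater accordingly, improving the objectivity and precision of scoring.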

13.
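Application of Generalizability Theory to Scoring Error in Structured Interviews   Total citations: 1, self-citations: 0, citations by others: 1
Generalizability theory was applied to study the control of scoring error in structured interviews. The results show that structured interview scores reflect examinees' true ability level well and have high reliability. While maintaining a high interview scoring reliability (0.80), it is recommended that the number of examiners be reduced to 9 in order to improve the economy and efficiency of structured interviews.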
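The kind of decision described above, finding the smallest panel that still reaches a target reliability, can be sketched for the simplest case: a single-facet persons × raters design. This is an assumption made for illustration; the abstract does not state the study's design, and the variance components below are hypothetical rather than the study's estimates.

```python
def min_raters_for_target(var_p, var_pr_e, target, max_raters=50):
    """Smallest number of raters whose G coefficient meets the target.

    Assumes a single-facet p x r design, where
    G = var_p / (var_p + var_pr_e / n_r).
    Returns (n_r, coefficient), or None if the target is not reached.
    """
    for n_r in range(1, max_raters + 1):
        coef = var_p / (var_p + var_pr_e / n_r)
        if coef >= target:
            return n_r, coef
    return None


# Hypothetical variance components, not taken from the study.
result = min_raters_for_target(var_p=1.0, var_pr_e=1.3, target=0.80)
print(result)
```

Scanning upward from one rater keeps the logic transparent; with more facets the same search would simply call a richer coefficient function.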

14.
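To evaluate the clinical skills of seven-year-program clinical medicine students scientifically and objectively, we applied a newer educational measurement theory, multivariate generalizability theory, to analyze the results of their pre-graduation internal medicine clinical skills assessment. The overall index of dependability of the assessment was 0.63725 and the absolute signal-to-noise ratio was 1.75668, suggesting that the overall reliability of this assessment meets the requirements. The clinical reasoning and theoretical knowledge components had relatively high reliability and also differentiated candidates' ability most effectively; the reliability of the clinical practice component was lower; raters' competence and scoring ability were relatively good. Multivariate generalizability theory can evaluate the clinical skills assessment of seven-year-program medical graduates objectively and scientifically, and its results are of considerable help in improving the quality of such assessments.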
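The two summary figures above are linked by a simple identity, assuming the absolute signal-to-noise ratio is defined in the usual way as universe-score variance divided by absolute error variance. Under that definition it equals Φ / (1 − Φ), where Φ is the index of dependability. The sketch below checks the reported values against this relation; the function name is illustrative.

```python
def absolute_signal_to_noise(phi):
    """Signal-to-noise ratio implied by an index of dependability phi.

    Since phi = var_universe / (var_universe + var_abs_error),
    var_universe / var_abs_error = phi / (1 - phi).
    """
    if not 0.0 <= phi < 1.0:
        raise ValueError("phi must lie in [0, 1)")
    return phi / (1.0 - phi)


reported_phi = 0.63725   # index of dependability reported in the abstract
reported_snr = 1.75668   # absolute signal-to-noise ratio reported in the abstract

implied = absolute_signal_to_noise(reported_phi)
print(f"implied S/N = {implied:.4f}, reported S/N = {reported_snr}")
```

The implied and reported values agree to about four significant digits; the small remaining gap is what one would expect from rounding Φ to five decimals.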

15.
The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests.

16.
The consensual assessment technique (CAT) is a measurement tool for creativity research in which appropriate experts evaluate creative products [Amabile, T. M. (1996). Creativity in context: Update to the social psychology of creativity. Boulder, CO: Westview]. However, the CAT is hampered by the time-consuming nature of the products (asking participants to write stories or draw pictures) and the ratings (getting appropriate experts). This study examined the reliability of ratings of sentence captions. Specifically, four raters evaluated 12 captions written by 81 undergraduates. The purpose of the study was to see whether the CAT could provide reliable ratings of captions across raters and across multiple captions and, if so, how many such captions would be required to generate reliable scores, and how many judges would be needed? Using generalizability theory, we found that captions appear to be a useful way of measuring creativity with a reasonable level of reliability in the frame of CAT.

17.
This study examined the use of generalizability theory to evaluate the quality of an alternative assessment (journal writing) in mathematics. Twenty-nine junior college students wrote journal tasks on the given topics and two raters marked the tasks using a scoring rubric, constituting a two-facet G-study design in which students were crossed with tasks and raters. The G coefficient was .76 and index of dependability was .72. The results showed that increasing the number of tasks had a larger effect on the G coefficient and index of dependability than increasing the number of raters. Implications for educational practices are discussed.

18.
《教育实用测度》2013,26(4):301-309
The relevance of test content to practice is essential for credentialing examinations and one way to ensure it is to collect ratings of item relevance from job incumbents. This study analyzed ratings of the 132 single-best-answer items and 117 multiple true-false item sets that formed the pretest books in a single administration of a medical certifying examination. Ratings collected from 57 practitioners were high (an average of more than 4 on a 5-point scale) and correlated with item difficulty (r = .31 to .34). The relationship between ratings and item discrimination is less clear (r = -.04 to .31). Application of generalizability theory to the ratings shows that reasonable estimates of item, stem, and total test relevance can be obtained with about 10 raters.

19.
This experimental project investigated the reliability and validity of rubrics in assessment of students’ written responses to a social science “writing prompt”. The participants were asked to grade one of the two samples of writing assuming it was written by a graduate student. In fact both samples were prepared by the authors. The first sample was well written in terms of sentence structure, spelling, grammar, and punctuation; however, the author did not fully answer the question. The second sample fully answered each part of the question, but included multiple errors in structure, spelling, grammar and punctuation. In the first experiment, the first sample was assessed by participants once without a rubric and once with a rubric. In the second experiment, the second sample was assessed by participants once without a rubric and once with a rubric. The results showed that raters were significantly influenced by mechanical characteristics of students’ writing rather than the content even when they used a rubric. Study results also indicated that using rubrics may not improve the reliability or validity of assessment if raters are not well trained on how to design and employ them effectively.

20.
The problem of assessing the content validity (or relevance) of standardized achievement tests is considered within the framework of generalizability theory. Four illustrative designs are described that may be used to assess test-item fit to a curriculum. For each design, appropriate variance components are identified for making relative and absolute item (or test) selection decisions. Special consideration is given to use of these procedures for determining the number of raters and/or schools needed in a content-validation decision-making study. Application of these procedures is illustrated using data from an international assessment of mathematics achievement.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号