Similar literature
 19 similar articles found (search time: 218 ms)
1.
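This paper uses quantitative analysis to examine the construct validity of the computerized oral test (COT) in the Guangdong Province college entrance examination in English (NMET) II, that is, whether the test measures the construct it is meant to measure. Quantitative analyses, including internal correlation, external correlation, and factor analysis, show that the COT measures a single independent construct and that this construct is oral communicative competence. We therefore conclude that the COT has relatively high construct validity.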

2.
张洁 《考试研究》2008,(4):65-78
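Oral tests, as a relatively authentic and direct means of assessment, are increasingly widely used in language testing practice. However, the subjective judgment introduced during testing, together with the design and use of rating criteria and scales, exposes scores to more influences other than examinee ability. Based on data from the 2007 PETS Level 3 oral test at one test site, this study uses the many-facet Rasch model (MFRM) to carry out a post hoc quality-control study of the ratings. MFRM combines the many factors of a language performance test in a single mathematical model. It can measure all facets on a common scale and can also analyze a single facet, or even each individual element, in detail, pinpointing potential "problem raters" and examinees who may have been misjudged. It is an effective means of quality monitoring for subjective rating.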
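For orientation, the many-facet Rasch model is commonly written in the three-facet rating-scale form below. This is the standard textbook formulation, not necessarily the exact specification used in the study above; the choice of facets (examinee, item, rater) and the symbols are assumptions made for illustration.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
% P_{nijk}: probability that examinee n receives category k from rater j on item i
% B_n: ability of examinee n
% D_i: difficulty of item (or task) i
% C_j: severity of rater j
% F_k: difficulty of the step from category k-1 to category k
```

Because every facet enters the same log-odds equation, examinee ability, item difficulty, and rater severity are all expressed on one logit scale, which is what allows them to be compared directly.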

3.
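Drawing on Upshur and Turner's (1999) theoretical model of testing and rating, this paper takes the linguistic features of examinees' spoken discourse as a reference to study the similarities and differences between holistic and analytic rating in oral tests. The experimental results show that, among the features of examinees' spoken discourse, the fluency measure of meaningful syllables per minute significantly affects both rating modes. In both rating processes, raters attend to the fluency of examinees' speech and neglect linguistic accuracy and complexity. The paper analyzes these findings further and, from the perspective of examinee discourse, sheds light on the problem of error control in oral test rating.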

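4.
"Central-tendency rating" refers to the phenomenon in subjective assessment, such as essay rating, in which raters seldom award high or low scores and scores cluster in the middle of the rating scale. Central-tendency rating is a systematic rating error with a considerable effect on test quality. This study analyzes central-tendency rating in the online marking of subjective items, identifies three types of central-tendency rating, and analyzes their causes. It argues that central-tendency rating can be detected by the check-script method and the statistical-index method, and that it can be controlled in several respects during test development and marking through a combination of technical and non-technical means.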

5.
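Online marking is a method that has emerged in recent years for using modern technology to control rating error on subjective items, and its effect on error control in essay rating is very clear. Online marking achieves rating error control mainly through five methods: control of inter-rater consistency error, control of intra-rater consistency error, control of error between the two ratings in double marking, control of error across scoring points, and spot-check monitoring. Rating error control is also managed through a machine control system. As research deepens and technology develops, Internet-based marking of college entrance examination essays and automated computer marking may become possible, further improving the control of marking error.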

6.
A comparative study of writing rating scales in China and abroad   Total citations: 1, self-citations: 0, citations by others: 1
陈睿 《考试研究》2011,(6):59-67
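Writing tasks in testing programs abroad usually use holistic scoring with a small rating scale, whereas those in China use holistic or analytic scoring with a large rating scale. The descriptors in writing rating scales abroad are specific, detailed, and clearly layered, and the differences between score levels are distinguishable, which makes the scales easy for raters to apply. Compared with small rating scales, raters using large rating scales fail to use the full score range, tend to award central scores, and show poorer inter-rater consistency. The study therefore concludes that holistic scoring with a small rating scale that combines an overall description with specific analytic descriptions is more accurate than holistic scoring with a large rating scale, easier for raters to master, more efficient, and less error-prone, and that it effectively safeguards the fairness of the test.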

7.
李燕 《考试周刊》2008,(25):6-8
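This paper uses qualitative analysis to examine the construct validity of the computerized oral test (COT) in the Guangdong Province college entrance examination in English (NMET) II, that is, whether the test measures the construct it is meant to measure. Hedge's (2001) model of communicative language ability serves as the theoretical framework. Qualitative analyses, including a questionnaire survey of COT experts, an analysis of test content, and an analysis of examinee performance, show that the COT measures most of oral communicative competence. The qualitative analysis indicates that the COT has relatively high construct validity.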

8.
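Standardized testing is a testing system that is currently popular internationally. Standardization means setting an objective, uniform set of standards for the testing system so that every stage, including item writing, test administration, marking and scoring, and score calculation, strives to reduce or control error of all kinds and thus measures examinees' performance with reasonable accuracy. Standardized testing comprises standardized item writing, standardized administration, standardized scoring, and standardized score interpretation. (1) Because item writing for standardized tests rests on careful study of scientific theories such as educational measurement, educational statistics, and educational psychology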

9.
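This study uses the many-facet Rasch model to compare rater effects under large and small rating scales. The results show that, compared with a small rating scale, raters using a large rating scale fail to use the full score range and tend to award central scores. Inter-rater consistency is also poorer under a large rating scale. On this basis, the study recommends improving the design of writing rating scales in China's various examinations and reporting writing scores separately.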

10.
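Detailed IELTS overall and component scoring criteria released. The Cultural and Education Section of the British Embassy has released the latest "IELTS Scoring Criteria, Score Reporting and Interpretation", which covers the IELTS scoring criteria, the interpretation of score reports, and the detailed band criteria for speaking, reading, listening, and writing. According to the document, examinee scores are divided into nine bands, from 1 to 9.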

11.
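The study of rater effects and rating scales is central to research on rating error in subjective items. Taking Question 38 (an essay question) of the 2006 college entrance examination in politics (Shanghai paper) as an example, this paper applies the Raters Effect model in ACER ConQuest. The results show that ratings on this question were largely free of rating errors such as ambiguity, central tendency, and restriction of range, and that the raters distinguished examinees' different performance characteristics fairly well. Apart from a few raters whose rating consistency still needs improvement, differences in rater severity were fairly significant. The author therefore proposes a method for adjusting test scores according to rater severity.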
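The abstract does not describe the adjustment method itself. As a rough illustration of the general idea only, the sketch below uses plain mean-centering, which is a much simpler stand-in for a Rasch-based severity estimate. It assumes that raters marked comparable (for example, randomly assigned) sets of scripts; the function name and the data are hypothetical.

```python
from statistics import mean


def adjust_for_severity(scores_by_rater):
    """Shift each rater's scores so every rater's mean equals the grand mean.

    This is simple mean-centering, not the model-based adjustment from the
    study. It is only reasonable if raters scored comparable sets of scripts.
    """
    all_scores = [s for scores in scores_by_rater.values() for s in scores]
    grand_mean = mean(all_scores)
    adjusted = {}
    for rater, scores in scores_by_rater.items():
        # Positive offset: the rater is more lenient than average.
        offset = mean(scores) - grand_mean
        adjusted[rater] = [s - offset for s in scores]
    return adjusted


# Hypothetical raw scores; rater A is lenient and rater B is severe.
raw = {"A": [6, 7, 8], "B": [4, 5, 6]}
adjusted = adjust_for_severity(raw)
print(adjusted)
```

After the shift, both hypothetical raters have the same mean score, so a candidate is no longer advantaged or penalized merely by which rater marked the script.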

12.
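Rating criteria are very important in writing tests, and the rating method used affects rater behavior. Research shows that although both the holistic and the analytic methods of rating English writing are reliable, rater severity and examinees' writing scores differ greatly between the two. Overall, under holistic rating, rater severity tends to converge and is close to the ideal value. Under analytic rating, examinees' writing scores are higher and rater severity differs significantly. Holistic rating is therefore preferred in high-stakes examinations that decide examinees' futures.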

13.
The purpose of the study is to investigate differences in rating behavior between Korean and native English-speaking (NES) raters. Five Korean English teachers and five NES teachers graded 420 essays written by Korean college freshmen and completed survey questionnaires. The grading data were analyzed with the FACETS program. The results revealed Korean raters' inferiority in measuring linguistic components. Furthermore, the Korean raters were more severe in scoring grammar, sentence structure, and organization, whereas the NES raters were stricter on content and overall scores. In addition, analysis of the raters' survey responses showed that the non-native-speaking (NNS) Korean raters were split between content and grammar as the most difficult feature to grade, while all NES raters considered content the most difficult. Based on these findings, suggestions for future research and implications are discussed.

14.
Hundreds of thousands of raters are recruited internationally to score examinations, but little research has been conducted on the selection criteria for these raters. Many countries insist upon teaching experience as a selection criterion, and this has frequently become embedded in the cultural expectations surrounding the tests. Shortages of raters for some of England's national examinations have led to non-teachers being hired to score a small minority of items, and changes in technology have fostered this approach. For a National Curriculum test in English taken at age 14, this study investigated whether teaching experience was a necessary selection criterion for all aspects of the examination. Fifty-seven raters with different backgrounds were trained in the normal manner and scored the same 97 students' work. Accuracy was investigated using a cross-classified multilevel model of absolute score differences with accuracy measures at level 1 and raters crossed with candidates at level 2. By comparing the scoring accuracy of graduates with a degree in English, teacher trainees, experienced teachers and experienced raters, this study found that teaching experience was not a necessary selection criterion. A rudimentary model for allocation of raters to different question types is proposed and further research to investigate the limits of necessary qualifications for scoring is suggested.

15.
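This study examines the effect of rater training on prospective oral examiners for PETS Level 1. Many-facet Rasch analysis was used to compare the examiners' ratings before and after training. After training, the proportion of examiners' ratings that agreed exactly with expert ratings increased, examiners who had rated too severely adjusted their standards appropriately, and the rating fit values of all examiners fell within the acceptable range. Overall, the training was fairly effective and improved rating accuracy. Many-facet Rasch analysis helps to identify oral examiners whose ratings are too lenient, too severe, or poorly fitting, as well as anomalous ratings, and thus provides a reliable basis for targeted training.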

16.
17.
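Using the think-aloud method supplemented by stimulated recall, this study collected verbal report data from four raters with different personality tendencies as they rated a paired oral test. Qualitative analysis shows that, in actual rating, raters differ greatly in how they understand and use the rating scale. Specifically: (1) extroverted raters were more lenient than introverted raters during rating; (2) introverted raters paid more attention to the specific indicators and criteria in the rating scale, whereas extroverted raters emphasized task completion and the comparison, communication, and interaction between examinees; (3) extroverted raters relied less on the rating scale than introverted raters and made more use of non-linguistic features. The findings have implications for revising test rating criteria and for rater training.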

18.
An assumption that is fundamental to the scoring of student-constructed responses (e.g., essays) is the ability of raters to focus on the response characteristics of interest rather than on other features. A common example, and the focus of this study, is the ability of raters to score a response based on the content achievement it demonstrates independent of the quality with which it is expressed. Previously scored responses from a large-scale assessment in which trained scorers rated exclusively constructed-response formats were altered to enhance or degrade the quality of the writing, and scores that resulted from the altered responses were compared with the original scores. Statistically significant differences in favor of the better-writing condition were found in all six content areas. However, the effect sizes were very small in mathematics, reading, science, and social studies items. They were relatively large for items in writing and language usage (mechanics). It was concluded from the last two content areas that the manipulation was successful and from the first four that trained scorers are reasonably well able to differentiate writing quality from other achievement constructs in rating student responses.

19.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
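The central-tendency indicator mentioned above, the share of ratings that fall in the middle categories, is straightforward to compute from raw ratings. The sketch below is only an illustration: the four-category scale, the function name, and the ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def middle_category_share(ratings, middle_categories):
    """Return the fraction of ratings that fall in the given middle categories."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    in_middle = sum(counts[c] for c in middle_categories)
    return in_middle / len(ratings)


# Hypothetical ratings on an assumed four-category scale (1 to 4).
ratings = [2, 3, 2, 3, 1, 2, 3, 4, 2, 3]
share = middle_category_share(ratings, middle_categories={2, 3})
print(f"{share:.0%} of ratings fall in the two middle categories")
```

A high share on its own does not prove rater error, since examinee ability may genuinely cluster in the middle, which is why the study pairs such descriptive figures with model-based estimates.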
