Similar literature
 20 similar documents found (search time: 15 ms)
1.
《Educational Assessment》2013,18(2):129-153
States are increasingly using test scores as part of the requirements for high school graduation or certification. In these circumstances, a battery of tests or, with writing, analytic traits are considered that usually cover different aspects of the state's content standards. Because pass or fail decisions are made affecting students' futures, the validity of standard-setting procedures and strategies is a major concern. Policymakers and legislators must decide which of these 2 standard-setting strategies to use for making pass or fail decisions for students seeking certification or for meeting a high school graduation requirement. The compensatory strategy focuses on total performance, summing scores across all tests in the battery. The conjunctive strategy requires passing performance for each test in the battery. This article reviews and evaluates compensatory and conjunctive standard-setting strategies. The rationales for each type are presented and discussed. Results from a study comparing the compensatory and conjunctive strategies for a state high school certification writing test provide insight into the problem of choosing either strategy. This article concludes with a set of recommendations for those who must decide which type of standard-setting strategy to use.
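
The difference between the two strategies can be illustrated with a minimal sketch. The function names, the scores, and the cut scores below are hypothetical and are not taken from the article; the sketch only assumes that the compensatory rule compares a summed score with one total cut score, while the conjunctive rule compares each test with its own cut score.

```python
from typing import Sequence


def compensatory_pass(scores: Sequence[float], total_cut: float) -> bool:
    """Pass if the summed score across the battery reaches the total cut score."""
    return sum(scores) >= total_cut


def conjunctive_pass(scores: Sequence[float], cuts: Sequence[float]) -> bool:
    """Pass only if every test in the battery reaches its own cut score."""
    if len(scores) != len(cuts):
        raise ValueError("scores and cuts must have the same length")
    return all(score >= cut for score, cut in zip(scores, cuts))


# Hypothetical examinee: strong on three analytic traits, weak on the fourth.
scores = [5, 5, 5, 2]
cuts = [3, 3, 3, 3]

# Compensatory: 17 >= 12, so the weak trait is offset by the strong ones.
print(compensatory_pass(scores, total_cut=sum(cuts)))  # True
# Conjunctive: the fourth trait (2) is below its cut (3), so the examinee fails.
print(conjunctive_pass(scores, cuts))  # False
```

The same examinee passes under one rule and fails under the other, which is the kind of decision difference the choice of strategy has to account for.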

2.
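Subjective items are an important component of language testing. They can compensate for the shortcomings of standardized items, but their scoring depends on raters' subjective impressions, which leads to instability within individual raters and differences between raters. Drawing on the three major measurement theories and on computer-assisted scoring can optimize the quality of subjective-item scoring and improve its precision and validity.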

3.
Psychometric models based on structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests to test measurement invariance of raters. The modeling approach and how it can be used for performance tests with less than optimal rater designs are illustrated using a data set from a performance test designed to measure medical students’ patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.

4.
《Applied Measurement in Education》2013,26(3):231-244
For any testing program intended for licensure, certification, competency, or proficiency, the estimation of content relevant test scores for pass/fail decision making is necessary. This study compares number-correct scoring to empirical option weighting in the context of such tests. The study was conducted under two test design conditions, three test length conditions, and four passing score levels. Two criteria were used to evaluate the effectiveness of empirical option weighting versus number-correct scoring. Empirical option weighting typically produced slightly more reliable domain score estimates and more consistent pass/fail decisions than number-correct scoring, particularly in the lower half of the test score distribution. For many types of testing programs where the passing scores are established in the lower half of the test score distribution, the empirical option weighting method used in this study seems both appropriate and effective in improving the dependability of test scores and the consistency of pass/fail decisions. Test users, however, must weigh the effort required to use option weighting against the small gains obtained with this method. Other problems are discussed that may limit the usefulness of option weighting.

5.
Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians' patient management skills; one system uses regression-derived weights for components of the performance, the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances.

6.
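Scoring criteria are very important in writing tests, and the scoring method used affects raters' scoring behavior. The study shows that although both the holistic and the analytic methods of scoring English writing are reliable, rater severity and examinees' writing scores change considerably between the two methods. Overall, under holistic scoring, rater severity tends to be consistent and close to the ideal value; under analytic scoring, examinees' writing scores are higher, while rater severity also differs significantly. Holistic scoring is therefore preferred in high-stakes examinations that determine examinees' futures.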

7.
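This study took prospective oral examiners for PETS Level 1 as its subjects and examined the effect of rater training for oral examiners. Many-facet Rasch analysis was used to compare the examiners' scoring before and after training. The results showed that, after training, the rate of exact agreement between the examiners' scores and the expert scores increased; examiners who had scored too severely adjusted their scoring standards appropriately; and the scoring fit values of all examiners fell within the acceptable range. Overall, the training was reasonably effective and improved scoring accuracy. Many-facet Rasch analysis helps identify examiners who score too leniently, too severely, or with poor fit, as well as anomalous ratings, and provides a reliable basis for targeted training.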
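
Two of the quantities mentioned above, the rate of exact agreement with expert scores and a rater's tendency toward severity, can be sketched with simple descriptive statistics. This is not a many-facet Rasch analysis, only a rough illustration; the function names and all scores are hypothetical.

```python
from typing import Sequence


def exact_agreement_rate(rater: Sequence[int], expert: Sequence[int]) -> float:
    """Share of performances on which the rater's score equals the expert's score."""
    if len(rater) != len(expert) or not rater:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(r == e for r, e in zip(rater, expert)) / len(rater)


def mean_signed_difference(rater: Sequence[int], expert: Sequence[int]) -> float:
    """Mean of (rater - expert); negative values suggest the rater is more severe."""
    if len(rater) != len(expert) or not rater:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(r - e for r, e in zip(rater, expert)) / len(rater)


# Hypothetical scores for six performances (not data from the study).
expert = [3, 4, 2, 5, 3, 4]
before_training = [2, 3, 2, 4, 2, 4]
after_training = [3, 4, 2, 4, 3, 4]

for label, scores in (("before", before_training), ("after", after_training)):
    rate = exact_agreement_rate(scores, expert)
    bias = mean_signed_difference(scores, expert)
    print(f"{label}: exact agreement = {rate:.2f}, mean difference = {bias:+.2f}")
```

In this made-up example the rater agrees with the expert on 2 of 6 performances before training and on 5 of 6 afterwards, and the negative mean difference shrinks toward zero.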

8.
For the purposes of scoring essays written in a second language, two of the most important considerations are the intelligibility and the structural complexity of the writing.

Various disadvantages are inherent in the use of clauses and/or sentences as a basis for analysing structure in written work; a more satisfactory technique was developed by Kellogg W. Hunt in America, using what he termed a ‘minimal terminable unit’ or ‘T‐unit’.

This technique was applied in the scoring of the NFER open‐ended writing and speaking tests, which formed part of the battery of ‘Tests of English Proficiency for Immigrant Children’.

During the development of these tests, the battery was administered to Asian children, for whom it was found that the average T‐unit length in writing and speech increased with increasing length of stay in Britain. (Average T‐unit length has been found by Hunt and O'Donnell to increase with age for children writing and speaking in their native language.) The results of the NFER testing indicated that much of the development of proficiency in both the speech and writing of the Asian children tested took place after three‐and‐a‐half years in Britain. (These findings were essentially a by‐product of test development, and therefore must be viewed with caution.)

9.
In signal detection rater models for constructed response (CR) scoring, it is assumed that raters discriminate equally well between different latent classes defined by the scoring rubric. An extended model that relaxes this assumption is introduced; the model recognizes that a rater may not discriminate equally well between some of the scoring classes. The extension recognizes a different type of rater effect and is shown to offer useful tests and diagnostic plots of the equal discrimination assumption, along with ways to assess rater accuracy and various rater effects. The approach is illustrated with an application to a large‐scale language test.

10.
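This paper examines intermediate results from scoring carried out over a local area network and finds that the threshold level affects the size of the statistical difference between first and second ratings: in general, the smaller the threshold, the more often the first and second ratings show no statistical difference. However, the threshold is not the most important determinant of scoring consistency; the key is the distribution of the differences between first and second ratings. With a high threshold, the two ratings may still show no statistical difference; with a low threshold, they may still differ significantly. Where every single point of a test score matters, the threshold should be set at 1 point. Within the range allowed by the threshold, a paired-samples t test that shows no significant difference does not necessarily mean that scoring consistency is good, and one that shows a significant difference does not necessarily mean that consistency is poor. The paired-samples t test is not a valid and reliable method for evaluating scoring consistency; other methods are needed.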
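
Why a paired-samples t test can mislead about consistency can be seen in a small sketch. The data and function names are hypothetical and are not taken from the paper; the sketch only relies on the standard paired t statistic (mean difference divided by its standard error) and on counting how many pairs of ratings fall within the threshold.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> float:
    """Paired-samples t statistic for first-rating minus second-rating differences."""
    diffs = [a - b for a, b in zip(first, second)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


def within_threshold_rate(first: Sequence[float], second: Sequence[float],
                          threshold: float) -> float:
    """Share of scripts whose two ratings differ by no more than the threshold."""
    diffs = [abs(a - b) for a, b in zip(first, second)]
    return sum(d <= threshold for d in diffs) / len(diffs)


# Case A (hypothetical): large disagreements that cancel out on average.
first_a, second_a = [10, 4, 9, 5, 8, 6], [7, 7, 6, 8, 5, 9]
# Case B (hypothetical): small but one-directional disagreements.
first_b, second_b = [8, 7, 9, 6, 8, 7], [7, 6, 8, 5, 7, 7]

for name, first, second in (("A", first_a, second_a), ("B", first_b, second_b)):
    t = paired_t_statistic(first, second)
    rate = within_threshold_rate(first, second, threshold=1)
    print(f"case {name}: t = {t:.2f}, within 1 point = {rate:.0%}")
```

In case A the differences are all 3 points but alternate in sign, so t is 0 even though no pair of ratings is within 1 point. In case B every pair is within 1 point, yet t is about 5, which exceeds the two-tailed 5% critical value of 2.571 for 5 degrees of freedom. The t test therefore tracks the mean difference, not the agreement between ratings.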

11.
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated using data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly differently from human-to-human inter-rater agreement for essays, but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.
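
The idea of comparing raters by their average ranking across items can be sketched as follows. This shows only the descriptive step (ranking raters within each item and averaging the ranks); it does not reproduce the study's significance tests or its multiple-comparison adjustment. The rater names and agreement values are hypothetical.

```python
from typing import Dict, List


def rank_descending(values: List[float]) -> List[float]:
    """Rank values so that 1 is the highest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def average_ranks(stats_by_item: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each rater's within-item rank over all items."""
    raters = sorted(next(iter(stats_by_item.values())))
    totals = {rater: 0.0 for rater in raters}
    for item_stats in stats_by_item.values():
        ranks = rank_descending([item_stats[rater] for rater in raters])
        for rater, rank in zip(raters, ranks):
            totals[rater] += rank
    return {rater: totals[rater] / len(stats_by_item) for rater in raters}


# Hypothetical agreement statistics (higher is better) for three raters on three items.
agreement = {
    "item1": {"human": 0.80, "engine_a": 0.78, "engine_b": 0.70},
    "item2": {"human": 0.75, "engine_a": 0.77, "engine_b": 0.70},
    "item3": {"human": 0.82, "engine_a": 0.82, "engine_b": 0.79},
}
print(average_ranks(agreement))
```

Here `human` and `engine_a` both end up with an average rank of 1.5 and `engine_b` with 3.0. Whether such differences are statistically significant is a separate question that needs a test suited to rank data.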

12.
Martin   《Assessing Writing》2009,14(2):88-115
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.

13.
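On Scoring Error in English Speaking Tests   Total citations: 1, self-citations: 0, citations by others: 1
Scoring a speaking test is a cognitive process in which raters evaluate language output against the scoring criteria; the purpose of this process is to account for score variance among examinees. The variables that account for score variance include construct-relevant variables and construct-irrelevant variables. When construct-irrelevant variables come into play, scoring error arises. Test error can be divided into systematic error and random error. Random error is the main focus of scoring error control. The main manifestations of scoring error in speaking tests include individual differences among raters, regression toward the mean, and a spurious normal distribution. The magnitude of scoring error in speaking tests can be verified with statistical tools such as the distribution of score variance and regression coefficients. The paper also discusses the goals, principles and methods of controlling scoring error in speaking tests. The goal of error control is to maximize the effect of construct-relevant variables and minimize the influence of construct-irrelevant variables; this requires raters to adhere to three basic principles during scoring: consistency, completeness and independence. In terms of means, scoring error control in speaking tests mainly comprises administrative, technical and statistical measures.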

14.
Recent work in reading and writing theory, research and pedagogy has raised questions about relationships between fluent reading processes and holistic scoring of essays (e.g., Huot, 1993). In holistic scoring settings, are the raters behaving as normal fluent readers (i.e., readers interacting critically and personally with the text) or are they somehow disconnected from their normal reader responses because they are using reliable scoring guides? Related questions concern the behavior of such holistic raters when they are teachers (e.g., Barritt, Stock, & Clark, 1986), and when those teachers respond to student writing (Connors & Lunsford, 1993). How are teachers/raters behaving, and what are they responding to in judging the writing? Previous research has suggested a role for personality type in the study of the process of writing evaluation (Jensen & DiTiberio, 1984, 1989). Thus, it seems reasonable to ask what role personality types play in the holistic evaluation of writing. This empirical study addressed the general question: What role, if any, do personality types of writers and of raters play in the holistic rating of writing? Moreover, is there a relationship between writers' personalities and raters' personalities? Writers were native English-speaking university freshman composition students; raters were native English-speaking university freshman composition instructors. Results indicate that the personality types of writers affect the ratings their essays receive, and the personality types of raters affect the ratings they give to essays. However, there is no significant relationship between writers' styles and raters' styles. Implications for future research, as well as classroom implications of these results, are discussed.

15.
Hundreds of thousands of raters are recruited internationally to score examinations, but little research has been conducted on the selection criteria for these raters. Many countries insist upon teaching experience as a selection criterion and this has frequently become embedded in the cultural expectations surrounding the tests. Shortages in raters for some of England's national examinations have led to non-teachers being hired to score a small minority of items, and changes in technology have fostered this approach. For a National Curriculum test in English taken at age 14, this study investigated whether teaching experience was a necessary selection criterion for all aspects of the examination. Fifty-seven raters with different backgrounds were trained in the normal manner and scored the same 97 students' work. Accuracy was investigated using a cross-classified multilevel model of absolute score differences with accuracy measures at level 1 and raters crossed with candidates at level 2. By comparing the scoring accuracy of graduates with a degree in English, teacher trainees, experienced teachers and experienced raters, this study found that teaching experience was not a necessary selection criterion. A rudimentary model for allocation of raters to different question types is proposed and further research to investigate the limits of necessary qualifications for scoring is suggested.

16.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

17.
As more secondary students with learning disabilities (LD) enroll in advanced content‐area classes and are expected to pass state exams, they are faced with the challenge of mastering difficult concepts and abstract vocabulary while learning content. Once in these classes, students must learn from lectures that move at a quick pace, record accurate and complete notes, and then demonstrate their mastery of the content on tests. This article provides an overview of the challenges faced by students with LD in content‐area classes and discusses the problems that students have learning from lectures and recording notes. Further, the article discusses theory and research related to note‐taking and presents note‐taking interventions that teachers can use to help students improve their note‐taking skills, and ultimately, improve their achievement in these classes.

18.
This study compared short-form constructed responses evaluated by both human raters and machine scoring algorithms. The context was a public competition in which both public competitors and commercial vendors vied to develop machine scoring algorithms that would match or exceed the performance of operational human raters in a summative high-stakes testing environment. Data (N = 25,683) were drawn from three different states, employed 10 different prompts, and were drawn from two different secondary grade levels. Samples ranging in size from 2,130 to 2,999 were randomly selected from the data sets provided by the states and then randomly divided into three sets: a training set, a test set, and a validation set. Machine performance on all of the agreement measures failed to match that of the human raters. The current study concluded with recommendations on steps that might improve machine-scoring algorithms before they can be used in any operational way.

19.
Although high‐stakes tests play an increasing role in students’ schooling experiences, scholars have not examined these tests as sites for socialisation. Drawing on qualitative data collected at an American urban primary school, this study explores what educators teach students about motivation and effort through high‐stakes testing, how students interpret and internalise these messages, and how student hierarchies develop as a result. I found that teachers located boys’ failure in their poor behavior and attitudes, while arguing that girls simply needed more self‐esteem to pass the test. Most boys accepted their teachers’ diagnosis of the problem. However, the boys who felt that they were already ‘doing their best’ and ‘working hard’ began to doubt that educational success is a function of merit and effort. I conclude that students learn about much more than the three Rs through their experiences with high‐stakes testing, and argue that future research should attend to the social dimensions of these experiences.

20.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
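
The contrast between a cumulative and an unfolding response process can be sketched numerically. The abstract does not give the parameterization it uses, so the forms below are assumptions: the standard dichotomous Rasch model, and a commonly cited dichotomous form of the hyperbolic cosine model with a unit parameter `rho`. All names and values are illustrative.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Dichotomous Rasch model: probability rises steadily as theta exceeds delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def hcm_probability(theta: float, delta: float, rho: float) -> float:
    """Assumed dichotomous hyperbolic cosine model: probability peaks where
    theta equals delta and falls off symmetrically on both sides."""
    return math.cosh(rho) / (math.cosh(rho) + math.cosh(theta - delta))


# Hypothetical location (delta) and unit (rho) values, for illustration only.
delta, rho = 0.0, 1.0
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    cumulative = rasch_probability(theta, delta)
    unfolding = hcm_probability(theta, delta, rho)
    print(f"theta = {theta:+.1f}  Rasch = {cumulative:.3f}  HCM = {unfolding:.3f}")
```

The Rasch curve is monotonic in `theta - delta`, which is what "cumulative" refers to, while the hyperbolic cosine curve is single-peaked at `theta == delta`, which is the signature of an unfolding process.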
