Similar Literature
20 similar documents found
1.
2.
Machine learning has frequently been employed to score constructed-response assessments automatically. However, there is little evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teachers' pedagogical content knowledge (PCK); each task contains three video-based constructed-response questions. A total of 187 in-service science teachers watched the videos, each of which presented a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the human consensus scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater alongside the three human raters, we employed the many-facet Rasch measurement model to examine CIV from three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios affects teachers' performance, but the impact depends significantly on the construct of interest; for each assessment task, the machine is always the most severe rater compared with the three human raters. However, the machine is less sensitive than the human raters to the task scenarios, meaning that machine scoring is more consistent and stable across scenarios within each of the two tasks.
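For orientation, the following is a minimal sketch of a many-facet Rasch (rating scale) specification with a scenario facet and a rater-by-scenario interaction term, of the kind the study above describes; the symbols are generic and are not the authors' notation.

$$ \ln\frac{P_{nsjrk}}{P_{nsjr(k-1)}} = \theta_n - \gamma_s - \delta_j - \lambda_r - \phi_{rs} - \tau_k $$

Here $\theta_n$ is teacher $n$'s latent PCK, $\gamma_s$ is the difficulty contributed by scenario $s$, $\delta_j$ is the difficulty of item $j$, $\lambda_r$ is the severity of rater $r$ (with the machine entering as a fourth rater), $\phi_{rs}$ is a rater-by-scenario bias term capturing differential rater sensitivity to scenarios, and $\tau_k$ is the threshold between rating categories $k-1$ and $k$.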

3.
《Educational Assessment》2013,18(3):257-272
Concern about the education system has increasingly focused on achievement outcomes and the role of assessment in school performance. Our research with fifth and eighth graders in California explored several issues regarding student performance and rater reliability on hands-on tasks that were administered as part of a field test of a statewide assessment program in science. This research found that raters can produce reliable scores for hands-on tests of science performance. However, the reliability of performance test scores per hour of testing time is quite low relative to multiple-choice tests. Reliability can be improved substantially by adding more tasks (and testing time). Using more than one rater per task produces only a very small improvement in the reliability of a student's total score across tasks. These results were consistent across both grade levels, and they echo the findings of past research.
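The trade-off reported above (large reliability gains from adding tasks, small gains from adding raters) is usually projected with the Spearman-Brown formula; a minimal sketch with generic symbols, not the authors' notation:

$$ \rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1} $$

where $\rho_1$ is the reliability of a single task score and $\rho_k$ is the projected reliability of the mean over $k$ parallel tasks. As a purely hypothetical illustration, a single-task reliability of 0.30 projects to about 0.68 for a five-task composite, whereas adding a second rater per task only shrinks the comparatively small rater error component.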

4.
5.
We studied the performance in three genres of Chinese written composition (narration, exposition, and argumentation) of 158 grade 4, 5, and 6 poor Chinese text comprehenders compared with 156 good Chinese text comprehenders, and examined the relationship between text comprehension and written composition. Verbal working memory (verbal span working memory and operation span working memory) and different levels of linguistic tasks, namely morphological sensitivity (morphological compounding and morphological chain), sentence processing (syntax construction and syntax integrity), and text comprehension (narrative and expository texts), were used to predict narrative, expository, and argumentative written composition separately in these students. Grade for grade, the good text comprehenders outperformed the poor text comprehenders in all tasks except morphological chain. Hierarchical multiple regression analyses showed differential contributions of the tasks to different genres of writing. In particular, text comprehension made a unique contribution to argumentation writing in the poor text comprehenders. Future studies should ask students to read and write parallel passages in the same genre for better comparison and should incorporate both instructional and motivational variables.

6.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.

7.
8.
This study of the reliability and validity of scales from the Child's Report of Parental Behavior (CRPBI) presents data on the utility of aggregating the ratings of multiple observers. Subjects were 680 individuals from 170 families. The participants in each family were a college freshman student, the mother, the father, and 1 sibling. The results revealed moderate internal consistency (M = .71) for all rater types on the 18 subscales of the CRPBI, but low interrater agreement (M = .30). The same factor structure was observed across the 4 rater types; however, aggregation within raters across salient scales to form estimated factor scores did not improve rater convergence appreciably (M = .36). Aggregation of factor scores across 2 raters yields much higher convergence (M = .51), and the 4-rater aggregates yielded impressive generalizability coefficients (M = .69). These and other analyses suggested that the responses of each family member contained a small proportion of true variance and a substantial proportion of factor-specific systematic error. The latter can be greatly reduced by aggregating scores across multiple raters.

9.
Taking the women's diving final at an international Olympic Games as an example, this paper applies the three major measurement theories (CTT, GT, and IRT) to analyze rater reliability, revealing between-rater and within-rater differences from different perspectives. The results show that the CTT rater reliabilities were 0.981 and 0.78, respectively; the GT generalizability coefficient and dependability index were 0.8279 and 0.8271, respectively, indicating that the decision to have the seven judges rate each diver's performance over five rounds was a reasonably appropriate one. Under IRT, judge 5 was, relatively speaking, the most severe of the seven judges and judge 2 the most lenient, but the differences in severity among judges were not significant; judges 1 and 4 showed problems with intra-rater consistency; and individual judges showed bias when rating particular divers, dives with particular difficulty coefficients, and particular rounds, although none of these biases reached significance. The analysis illustrates the characteristics and respective strengths of the three approaches to rater reliability and provides useful information for rater training and for improving scoring reliability.
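The generalizability coefficient and dependability index quoted above are variance-component ratios from generalizability theory; a minimal sketch for a simple persons-by-raters design with $n_r$ raters, using generic symbols rather than the paper's notation:

$$ E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pr,e}/n_{r}}, \qquad \Phi = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \left(\sigma^{2}_{r} + \sigma^{2}_{pr,e}\right)/n_{r}} $$

Here $\sigma^{2}_{p}$ is the universe-score (diver) variance, $\sigma^{2}_{r}$ the rater-severity variance, and $\sigma^{2}_{pr,e}$ the residual interaction-plus-error variance; a fully crossed divers-by-judges-by-rounds design such as the one above adds round-related components to the denominator in the same way.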

10.
This study evaluated rater accuracy with rater-monitoring data from high stakes examinations in England. Rater accuracy was estimated with cross-classified multilevel modelling. The data included face-to-face training and monitoring of 567 raters in 110 teams, across 22 examinations, giving a total of 5500 data points. Two rater-monitoring systems (Expert consensus scores and Supervisor judgement of correct scores) were utilised for all raters. Results showed significant group training (table leader) effects upon rater accuracy and these were greater in the expert consensus score monitoring system. When supervisor judgement methods of monitoring were used, differences between training teams (table leader effects) were underestimated. Supervisor-based judgements of raters' accuracies were more widely dispersed than in the Expert consensus monitoring system. Supervisors not only influenced their teams' scoring accuracies, they overestimated differences between raters' accuracies, compared with the Expert consensus system. Systems using supervisor judgements of correct scores and face-to-face rater training are, therefore, likely to underestimate table leader effects and overestimate rater effects.
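A minimal sketch of the kind of cross-classified multilevel decomposition such a monitoring analysis implies, with raters grouped in training teams and crossed with examinations; the notation is generic and is not the authors':

$$ y_{i} = \beta_0 + u^{(team)}_{t(i)} + u^{(rater)}_{r(i)} + u^{(exam)}_{e(i)} + \varepsilon_{i}, \qquad u^{(\cdot)} \sim N\!\left(0, \sigma^{2}_{(\cdot)}\right) $$

where $y_i$ is the accuracy measure for monitoring record $i$ and the estimated variance components $\sigma^{2}_{team}$ and $\sigma^{2}_{rater}$ quantify table-leader and rater effects; fitting the same model separately under the Expert-consensus and Supervisor-judgement systems yields the pattern of over- and underestimation reported above.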

11.
12.
This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14-year-olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters' previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters' scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.

13.
Novice members of a Norwegian national rater panel tasked with assessing Year 8 pupils' written texts were studied during three successive preparation sessions (2011–2012). The purpose was to investigate how the raters successfully make use of different decision-making strategies in an assessment situation where pre-set criteria and standards give a rather strict framework. The data sources were the raters' pair assessment dialogues. The analysis shows that the raters use a 'shared standards strategy', but when reaching agreement on text quality they also seem to make very good use of assessment strategies related to their work as writing teachers. Moreover, asymmetries in knowledge and participation among raters contribute to creating an image of writing assessment as a challenging hermeneutic practice. It is suggested that future rater preparation would gain from being attentive to the internalised assessment practices teachers bring to the fore when working as raters.

14.
Despite their widespread use in identifying and evaluating programs for gifted and talented students, the Torrance Tests of Creative Thinking were standardized on samples that excluded gifted children. The interrater reliability of measures like the TTCT has been questioned repeatedly, yet studies with average students have demonstrated high interrater reliability. This study compares the interrater reliability of the TTCT for groups of gifted and nongifted elementary-school-aged students. Results indicated that most interrater reliability coefficients exceeded .90 for both the gifted and nongifted groups. However, multivariate analysis of variance indicated significant mean differences across the three self-trained raters for both gifted and nongifted groups. Consequently, use of a single scorer to evaluate TTCT protocols is recommended, especially where specific cutoff scores are used to select students.

15.
The purpose of this study is to describe a many-facet Rasch (FACETS) model for measuring writing ability. The FACETS model is a multivariate extension of Rasch measurement models that provides a framework for calibrating raters and writing tasks within writing assessment. This paper shows how the FACETS model can be applied to measurement problems encountered in large-scale writing assessment. A random sample of 1,000 students who took a statewide writing examination is used to illustrate the FACETS model. The data show that, even after intensive training, raters differ significantly in severity. Small but statistically significant differences in writing-task difficulty were also found. The FACETS model offers a promising approach to the measurement problems encountered in large-scale examinations that assess writing ability through written compositions.

16.
《教育实用测度》2013,26(3):171-191
The purpose of this study is to describe a Many-Faceted Rasch (FACETS) model for the measurement of writing ability. The FACETS model is a multivariate extension of Rasch measurement models that can be used to provide a framework for calibrating both raters and writing tasks within the context of writing assessment. The use of the FACETS model for solving measurement problems encountered in the large-scale assessment of writing ability is presented here. A random sample of 1,000 students from a statewide assessment of writing ability is used to illustrate the FACETS model. The data suggest that there are significant differences in rater severity, even after extensive training. Small, but statistically significant, differences in writing-task difficulty were also found. The FACETS model offers a promising approach for addressing measurement problems encountered in the large-scale assessment of writing ability through written compositions.

17.
This article investigates the effect of raters' perception of a given topic on students' writing scores. Three raters, who were teachers in TEFL with backgrounds in teaching essay and letter writing in English, scored the compositions. The means of the three sets of scores from the three raters were compared using a two-way analysis of variance (ANOVA). Concerning the study's hypothesis that "raters' perception has no effect on writers' composition scores," the ANOVA result was nonsignificant, suggesting that other factors, such as students' attitude and cognitive ability, may affect raters' judgments of student writing performance.

18.
This study describes three least squares models to control for rater effects in performance evaluation: ordinary least squares (OLS); weighted least squares (WLS); and ordinary least squares, subsequent to applying a logistic transformation to observed ratings (LOG-OLS). The models were applied to ratings obtained from four administrations of an oral examination required for certification in a medical specialty. For any single administration, there were 40 raters and approximately 115 candidates, and each candidate was rated by four raters. The results indicated that raters exhibited significant amounts of leniency error and that application of the least squares models would change the pass-fail status of approximately 7% to 9% of the candidates. Ratings adjusted by the models demonstrated higher reliability and correlated slightly higher than observed ratings with the scores on a written examination.
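A minimal sketch of the OLS variant under stated assumptions: ratings from a toy, incomplete candidate-by-rater design are regressed on candidate and rater dummy variables, and the estimated rater effects are then removed from the observed ratings. All data, dimensions, and names here are illustrative and not taken from the study; the WLS and LOG-OLS variants would, respectively, weight the observations or logit-transform the bounded ratings before the same fit.

import numpy as np

# Toy incomplete design: (candidate, rater, rating) triples; values are illustrative.
obs = [(0, 0, 6.0), (0, 1, 5.0), (1, 1, 7.0), (1, 2, 9.0), (2, 0, 5.0), (2, 2, 6.0)]
n_cand, n_rater = 3, 3

# Design matrix: intercept, candidate dummies, rater dummies (last level of each dropped).
X = np.zeros((len(obs), 1 + (n_cand - 1) + (n_rater - 1)))
y = np.zeros(len(obs))
for i, (c, r, score) in enumerate(obs):
    X[i, 0] = 1.0
    if c < n_cand - 1:
        X[i, 1 + c] = 1.0          # candidate effect
    if r < n_rater - 1:
        X[i, n_cand + r] = 1.0     # rater effect (leniency/severity)
    y[i] = score

# Ordinary least squares fit of rating = mu + candidate + rater + error.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

rater_effects = beta[n_cand:]                   # estimated rater effects vs. reference rater
adjusted = y - X[:, n_cand:] @ rater_effects    # observed ratings with rater effects removed
print(rater_effects)
print(adjusted)

In a balanced, fully crossed design these OLS rater effects reduce to each rater's mean deviation from the reference rater, which is the quantity removed to correct for leniency.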

19.
20.
In research and development designed to assess the writing skills of third-year college students, the University of Wisconsin Verbal Assessment Project developed and tested procedures for assessing writing portfolios from students in courses representing each college in the university. Following the work of Britton (1970) and the National Assessment of Educational Progress, we defined expository writing as sustained reflection in which the writer focuses and processes information to various degrees. Basing our work on this construct, we assessed writing samples in each portfolio in terms of both degree of reflection and extent of text elaboration. Results of two studies are presented. In Study 1, raters scored each text from a given portfolio before rating texts in the next portfolio. Reliability estimates were low to moderate for both scores. In follow-up Study 2, involving a comparable group of students, several changes were made to improve reliability: (a) Raters scored all texts written in response to a given prompt or assignment within a class before moving to the next set of texts; and (b) each time readers dealt with a new task, they read several examples together, coming to agreement about how various texts were to be rated. Estimates of reliability for both scores were somewhat higher and suggest that the modifications improved reliabilities. Results demonstrate that adequate reliability should be expected if texts are rated by task across portfolios within classes. Based on these findings, we contend that, because writing normally varies by topic, genre, and other variables, writing portfolios are better characterized by scores for each piece than by a single writing-skill score.
