Similar Documents
20 similar documents found.
1.
Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings have been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration—simulated data and data from a large‐scale state‐wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality; and we investigate the association between rater training procedures and the manifestation of rater effects in the real data.

2.
The traditional kappa statistic in assessing interrater agreement is not adequate when multiple raters and multiple attributes are involved. In this article, latent trait models are proposed to assess multirater multiattribute (MRMA) agreement. Data from the Third International Mathematics and Science Study (TIMSS) are used to illustrate the application of the latent trait models. Results showed that among four possible latent trait models, the correlated uniqueness model had the best fit for assessing MRMA agreement. Furthermore, in coding a set of different attributes, the coding accuracy within the same rater may differ across attributes. Likewise, when different raters rate the same attribute, the accuracy in rating varies among the raters. Thus, the latent trait models provide a more refined and accurate assessment of interrater agreement. The application of latent trait models is important in school psychology research and intervention because accurate assessment of children's functioning is fundamental to designing effective intervention strategies. © 2007 Wiley Periodicals, Inc. Psychol Schs 44: 515–525, 2007.
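For reference, the "traditional kappa" that this abstract argues is inadequate can be computed for a single pair of raters on a single attribute; a minimal sketch with invented ratings (not the authors' code or data):

```python
# Minimal sketch: Cohen's kappa for one rater pair on one attribute, the
# baseline the abstract argues breaks down once multiple raters and multiple
# attributes are involved. Ratings below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 2, 1, 1, 0, 2, 2, 1, 0]  # hypothetical codes from rater A
rater_b = [1, 0, 2, 0, 1, 0, 2, 1, 1, 0]  # hypothetical codes from rater B

print(f"Cohen's kappa for this rater pair: {cohen_kappa_score(rater_a, rater_b):.2f}")
```

Extending this pairwise statistic to many raters and many attributes is where the latent trait (e.g., correlated uniqueness) approach takes over.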

3.
Given the wide use of peer assessment, especially in higher education, the relative accuracy of peer ratings compared to teacher ratings is a major concern for both educators and researchers. This concern has grown with the increasing use of peer assessment on digital platforms. In this meta-analysis, using a variance-known hierarchical linear modelling approach, we synthesise findings from studies on peer assessment since 1999, when computer-assisted peer assessment started to proliferate. The estimated average Pearson correlation between peer and teacher ratings is found to be .63, which is moderately strong. This correlation is significantly higher when: (a) the peer assessment is paper-based rather than computer-assisted; (b) the subject area is not medical/clinical; (c) the course is graduate level rather than undergraduate or K-12; (d) individual work instead of group work is assessed; (e) the assessors and assessees are matched at random; (f) the peer assessment is voluntary instead of compulsory; (g) the peer assessment is non-anonymous; (h) peer raters provide both scores and qualitative comments instead of only scores; and (i) peer raters are involved in developing the rating criteria. The findings are expected to inform practitioners regarding peer assessment practices that are more likely to exhibit better agreement with teacher assessment.
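The pooled correlation of .63 is the kind of quantity obtained by inverse-variance pooling of Fisher-z transformed correlations, the building block of variance-known meta-analytic models; a rough sketch with invented study values (not the authors' data or their full hierarchical model):

```python
# Rough sketch of fixed-effect pooling of Pearson correlations via Fisher's z.
# Study-level correlations and sample sizes are invented for illustration.
import numpy as np

r = np.array([0.55, 0.70, 0.62, 0.48])   # peer-teacher correlations from four studies
n = np.array([40, 120, 65, 80])          # study sample sizes

z = np.arctanh(r)        # Fisher's z transform
v = 1.0 / (n - 3)        # known sampling variance of z
w = 1.0 / v              # inverse-variance weights

pooled_r = np.tanh(np.sum(w * z) / np.sum(w))   # back-transform to r
print(f"Pooled peer-teacher correlation: {pooled_r:.2f}")
```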

4.
The use of evidence to guide policy and practice in education (Cooper, Levin, & Campbell, 2009) has included an increased emphasis on constructed-response items, such as essays and portfolios. Because assessments that go beyond selected-response items and incorporate constructed-response items are rater-mediated (Engelhard, 2002, 2013), it is necessary to develop evidence-based indices of quality for the rating processes used to evaluate student performances. This study proposes a set of criteria for evaluating the quality of ratings based on the concepts of measurement invariance and accuracy within the context of a large-scale writing assessment. Two measurement models are used to explore indices of quality for raters and ratings: the first model provides evidence for the invariance of ratings, and the second model provides evidence for rater accuracy. Rating quality is examined within four writing domains from an analytic rubric. Further, this study explores the alignment between indices of rating quality based on these invariance and accuracy models within each of the four domains of writing. Major findings suggest that rating quality varies across analytic rubric domains, and that there is some correspondence between indices of rating quality based on the invariance and accuracy models. Implications for research and practice are discussed.

5.
Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater‐mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of three rater effects (leniency, central tendency, and severity) in combination with different types of incomplete rating designs (systematic links, anchor performances, and spiral). We used the rating scale model and the partial credit model to calculate rater location estimates, standard errors of rater estimates, model–data fit statistics, and the standard deviation of rating scale category thresholds as indicators of rater effects, and we explored the sensitivity of these indicators to rater effects under different conditions. Our results suggest that it is possible to detect rater effects when each of the three types of rating designs is used. However, there are differences in the sensitivity of each indicator related to the type of rater effect, the type of rating design, and the overall proportion of raters exhibiting an effect. We discuss implications for research and practice related to rater‐mediated assessments.
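One common way to write the two measurement models being compared, in a facets formulation with examinee location θ_n and rater severity λ_j (standard notation assumed, not necessarily the authors' exact parameterization), is shown below; the partial credit version frees the category thresholds by rater, which is what makes the spread of thresholds usable as a centrality indicator.

```latex
% Rating scale model: category thresholds \tau_k shared across raters
\ln\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \lambda_j - \tau_k
% Partial credit model: rater-specific thresholds \tau_{jk}
\ln\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \lambda_j - \tau_{jk}
```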

6.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
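A hedged sketch of the two response processes in their standard dichotomous forms (notation assumed, not taken from the article): the cumulative (Rasch) function increases monotonically with the latent trait, whereas the unfolding (hyperbolic cosine) function peaks where the person and stimulus locations coincide.

```latex
% Cumulative (Rasch) response function
P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
% Unfolding (hyperbolic cosine) response function with unit parameter \gamma_i
P(X_{ni}=1) = \frac{\cosh(\gamma_i)}{\cosh(\gamma_i) + \cosh(\theta_n - \delta_i)}
```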

7.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
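The 80%-in-the-middle-categories observation suggests a simple screening step before any Rasch adjustment; a hypothetical sketch (invented ratings, not the Georgia data) that tabulates each rater's use of the two middle categories of a 1-4 analytic scale:

```python
# Hypothetical sketch: flag possible central tendency by each rater's share of
# ratings falling in the two middle categories (2 and 3) of a 1-4 scale.
import pandas as pd

ratings = pd.DataFrame({
    "rater": ["R01", "R01", "R01", "R01", "R02", "R02", "R02", "R02"],
    "score": [2, 3, 2, 3, 1, 4, 2, 4],
})

middle_share = (
    ratings.assign(middle=ratings["score"].isin([2, 3]))
    .groupby("rater")["middle"]
    .mean()
)
print(middle_share)
print("Possible central tendency:", list(middle_share[middle_share >= 0.8].index))
```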

8.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

9.
The effects of rating scale format (behaviorally anchored vs. Likert) and rater training on leniency and halo in student ratings of instruction were investigated. The subjects (N=269) were students enrolled in required courses at a graduate theological seminary in the Southwest United States. A repeated measures design controlling for teacher and course was used. Findings indicated: (a) training was effective in reducing leniency and halo in ratings from both instruments; (b) trained raters exhibited less leniency on two rating dimensions when using behaviorally anchored rating scales (BARSs) than when using the Likert scale; and (c) trained raters exhibited less halo when using the Likert scale than when using the BARS. The findings demonstrate the importance of focusing efforts to improve the quality of ratings on the students rather than on the format of the instrument. Presented at the Twenty-Eighth Annual Forum of the Association for Institutional Research, Phoenix, Ariz., May 1988.

10.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters' ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
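The between-subgroup fit approach described here comes down to averaging squared standardized residuals within subgroups; a schematic sketch (invented numbers; in practice the expected scores and variances would come from the fitted measurement model):

```python
# Schematic sketch: one rater's standardized residuals, summarized as an
# unweighted (outfit-style) mean square within each examinee subgroup.
# Observed ratings, model expectations, and variances are invented.
import numpy as np

observed = np.array([3, 2, 4, 1, 3, 2])
expected = np.array([2.6, 2.1, 3.2, 1.8, 2.9, 2.4])
variance = np.array([0.8, 0.7, 0.9, 0.6, 0.8, 0.7])
subgroup = np.array(["A", "A", "A", "B", "B", "B"])   # examinee subgroup labels

z = (observed - expected) / np.sqrt(variance)          # standardized residuals

for g in np.unique(subgroup):
    outfit_ms = np.mean(z[subgroup == g] ** 2)
    print(f"subgroup {g}: outfit mean square = {outfit_ms:.2f}")
```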

11.
12.
The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to the ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.

13.
With the rapid development of online learning technology, peer assessment can now be conducted online. The current study empirically investigated peer rating accuracy and student learning outcomes in online peer assessment, comparing compulsory and voluntary formats. Section 1 (N = 93) was assigned to the voluntary group and Section 2 (N = 31) was assigned to the compulsory group. The results showed that the voluntary group scored significantly higher than the compulsory group on the final task of the course, while there was no significant difference in the final task score increase. Students in the voluntary group provided more accurate scores (i.e., higher peer rater accuracy) than those in the compulsory group. The peer score leniency/severity ratings, obtained by comparing peer-assigned scores with teacher-assigned scores, were generally consistent with the peer rater accuracy results. The current study offers insights for researchers who are interested in studying the effect of online peer assessment activities. The results are also of interest to instructors who may want to conduct peer assessments in online courses and are choosing between compulsory and voluntary formats.
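Peer rater accuracy and leniency/severity in this design reduce to comparisons against the teacher's scores; a minimal sketch with invented scores (not the study's data or its exact indices):

```python
# Minimal sketch: accuracy as mean absolute deviation from the teacher's score,
# leniency/severity as the mean signed deviation. All scores are invented.
import numpy as np

teacher = np.array([80, 72, 90, 65])
peer = np.array([85, 70, 93, 75])

accuracy = np.mean(np.abs(peer - teacher))   # smaller = more accurate
leniency = np.mean(peer - teacher)           # positive = lenient, negative = severe
print(f"mean absolute deviation: {accuracy:.1f}, mean signed deviation: {leniency:+.1f}")
```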

14.
One of the productive lines of research on self-assessment (SA) and peer assessment (PA) concerns their concurrent validity with respect to a criterion measure. However, similar research has rarely been conducted for spoken-language interpreting. This article therefore reports on a longitudinal study that investigated the validity of self and peer ratings on three performance dimensions of English-Chinese consecutive interpretation (i.e., information completeness, fluency of delivery, and target language quality), taking teachers’ ratings as a yardstick. Major findings include: although the students as a group were unable to replicate teachers’ ratings, they were able to rank-order their performances in a fairly accurate manner and improved their SA and PA accuracy over time. Interpreting directionality seems to moderate the correlational strength of self/teacher ratings and peer/teacher ratings. These results are discussed in relation to previous literature, and pedagogical suggestions are provided to improve SA and PA for bi-directional interpretation.

15.
Despite considerable interest in the topic of instructional quality in research as well as practice, little is known about the quality of its assessment. Using generalizability analysis as well as content analysis, the present study investigates how reliably and validly instructional quality is measured by observer ratings. Twelve trained raters judged 57 videotaped lesson sequences with regard to aspects of domain-independent instructional quality. Additionally, 3 of these sequences were judged by 390 untrained raters (i.e., student teachers and teachers). Depending on scale level and dimension, 16–44% of the variance in ratings could be attributed to instructional quality, whereas rater bias accounted for 12–40% of the variance. Although the trained raters referred more often to aspects considered essential for instructional quality, this was not reflected in the reliability of their ratings. The results indicate that observer ratings should be treated in a more differentiated manner in the future.
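The percentages reported above are variance shares from a generalizability-style decomposition; in standard notation for a lesson (l) by rater (r) design (a sketch of the usual formulas, not necessarily the article's exact design):

```latex
% Variance decomposition for a lesson x rater G-study and the resulting shares
\sigma^2(X_{lr}) = \sigma^2_{l} + \sigma^2_{r} + \sigma^2_{lr,e},
\qquad
\text{quality share} = \frac{\sigma^2_{l}}{\sigma^2(X_{lr})},
\qquad
\text{rater share} = \frac{\sigma^2_{r}}{\sigma^2(X_{lr})}
```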

16.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed-response items. In this approach, raters’ scores are not viewed as direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM‐SDT model is applied to data from a large‐scale assessment and is shown to provide a useful summary of various aspects of the raters’ performance.
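A hedged sketch of the kind of latent class SDT rater model referred to here, in DeCarlo-style notation (the article's exact parameterization may differ): rater j's category for an essay of latent quality class η is governed by rater-specific criteria c_{jk} and a precision (discrimination) parameter d_j.

```latex
% Latent class SDT rater model (logistic link): probability that rater j
% assigns at most category k to an essay in latent quality class \eta
P(Y_j \le k \mid \eta) = \frac{1}{1 + \exp[-(c_{jk} - d_j \eta)]}
```

Larger d_j indicates a more precise rater; shifts in the criteria c_{jk} capture severity or leniency.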

17.
In recent years, there has been increasing use of peer assessment in classrooms and other learning settings. Despite the prevailing view that peer assessment has a positive effect on learning, the results reported across empirical studies are mixed. In this meta-analysis, we synthesised findings based on 134 effect sizes from 58 studies. Compared to students who do not participate in peer assessment, those who participate show a .291 standard deviation unit increase in their performance. Further, we performed a meta-regression analysis to examine the factors that are likely to influence the peer assessment effect. The most critical factor is rater training: when students receive rater training, the effect size of peer assessment is substantially larger than when they do not. Computer-mediated peer assessment is also associated with greater learning gains than paper-based peer assessment. A few other variables (such as rating format, rating criteria and frequency of peer assessment) also show noticeable, although not statistically significant, effects. The results of the meta-analysis can be considered by researchers and teachers as a basis for determining how to make effective use of peer assessment as a learning tool.

18.
The purpose of this study was to build a Random Forest supervised machine learning model in order to predict musical rater‐type classifications based upon a Rasch analysis of raters’ differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9–12) were collected using a 29‐item Likert‐type rating scale embedded within five domains (tone/intonation, n = 6; balance, n = 5; interpretation, n = 6; rhythm, n = 6; and technical accuracy, n = 6). Data were analyzed using a Many Facets Rasch Partial Credit Model. An a priori k‐means cluster analysis of 29 differential rater functioning indices produced a discrete feature vector that classified raters into one of three distinct rater‐types: (a) syntactical rater‐type, (b) expressive rater‐type, or (c) mental representation rater‐type. The initial Random Forest model produced an out‐of‐bag error rate of 5.05%, indicating that approximately 95% of the raters were correctly classified. After tuning a set of three hyperparameters (ntree, mtry, and node size), the optimized model demonstrated an improved out‐of‐bag error rate of 2.02%. Implications for improvements in assessment, research, and rater training in the field of music education are discussed.
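A hypothetical re-creation of this workflow with scikit-learn (feature matrix, labels, and hyperparameter values are invented; n_estimators, max_features, and min_samples_leaf play the roles of ntree, mtry, and node size):

```python
# Hypothetical sketch: classify raters into three rater-types from differential
# rater functioning indices with a Random Forest, judged by out-of-bag error.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(142, 29))        # 142 raters x 29 DRF indices (invented)
y = rng.integers(0, 3, size=142)      # rater-type labels from k-means (invented)

best = None
for n_estimators in (200, 500):               # "ntree"
    for max_features in ("sqrt", 0.5):        # "mtry"
        for min_samples_leaf in (1, 5):       # "node size"
            rf = RandomForestClassifier(
                n_estimators=n_estimators,
                max_features=max_features,
                min_samples_leaf=min_samples_leaf,
                oob_score=True,
                random_state=0,
            ).fit(X, y)
            oob_error = 1.0 - rf.oob_score_
            if best is None or oob_error < best[0]:
                best = (oob_error, n_estimators, max_features, min_samples_leaf)

print(f"best OOB error {best[0]:.3f} at ntree={best[1]}, mtry={best[2]}, node size={best[3]}")
```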

19.
Application of the Many-Facet Rasch Model to Rater Training for Scoring Subjective (Constructed-Response) Items
The scoring of subjective (constructed-response) items is influenced by many factors, such as raters' knowledge, overall ability, and personal preferences. These rater biases not only create subjective differences between raters but also make an individual rater unstable across occasions, ultimately lowering the reliability of subjective item scores. This study applied the many-facet Rasch model to rater training for the essay section of a national examination. By analyzing trial-scoring data from six experienced raters on 58 papers, four types of rater bias were identified, and individualized feedback based on these results was then given to each rater in order to improve the objectivity and precision of scoring.

20.
In an essay rating study, multiple ratings may be obtained by having different raters judge essays or by having the same rater(s) repeat the judging of essays. An important question in the analysis of essay ratings is whether multiple ratings, however obtained, may be assumed to represent the same true scores. When different raters judge the same essays only once, it is impossible to answer this question. In this study, 16 raters judged 105 essays on two occasions; hence, it was possible to test assumptions about true scores within the framework of linear structural equation models. It emerged that the ratings of a given rater on the two occasions represented the same true scores. However, the ratings of different raters did not represent the same true scores. The estimated intercorrelations of the true scores of different raters ranged from .415 to .910. Parameters of the best fitting model were used to compute coefficients of reliability, validity, and invalidity. The implications of these coefficients are discussed.
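The true-score question here is usually framed with a congeneric measurement model; a sketch in standard notation (assumed, not taken from the article): rater j's rating on occasion t loads on a rater-specific true score, and two ratings "represent the same true score" when they load on the same T.

```latex
% Congeneric model for rater j's rating on occasion t, and its reliability
X_{jt} = \mu_{jt} + \lambda_{jt} T_{j} + e_{jt},
\qquad
\mathrm{rel}(X_{jt}) = \frac{\lambda_{jt}^{2}\,\sigma^{2}_{T_j}}{\lambda_{jt}^{2}\,\sigma^{2}_{T_j} + \sigma^{2}_{e_{jt}}}
```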

