Similar Articles
A total of 20 similar articles were found.
1.
In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the United Kingdom. This study investigated rater cognition in the context of the assessment of school-based project work for high-stakes purposes. Thirteen teachers across three subjects were asked to “think aloud” whilst scoring example projects. Teachers also completed an internal standardization exercise. Nine professional raters across the same three subjects standardized a set of project scores whilst thinking aloud. The behaviors and features attended to were coded. The data provided insights into aspects of rater cognition such as reading strategies, emotional and social influences, evaluations of features of student work (which aligned with scoring criteria), and how overall judgments are reached. The findings can be related to existing theories of judgment. Based on the evidence collected, the cognition of teacher raters did not appear to be substantially different from that of professional raters.

2.
Educational Assessment, 2013, 18(3): 257–272
Concern about the education system has increasingly focused on achievement outcomes and the role of assessment in school performance. Our research with fifth and eighth graders in California explored several issues regarding student performance and rater reliability on hands-on tasks that were administered as part of a field test of a statewide assessment program in science. This research found that raters can produce reliable scores for hands-on tests of science performance. However, the reliability of performance test scores per hour of testing time is quite low relative to multiple-choice tests. Reliability can be improved substantially by adding more tasks (and testing time). Using more than one rater per task produces only a very small improvement in the reliability of a student's total score across tasks. These results were consistent across both grade levels, and they echo the findings of past research.

3.
We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach desirable reliability levels of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks each in the narrative and expository genres, and their written compositions were evaluated with methods widely used for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students’ scores varied largely by task (30.44% and 28.61% of the variance) but not by rater. To reach a reliability of .90, multiple tasks and raters were needed; to reach .80, a single rater and multiple tasks were needed. These findings offer important implications for reliably evaluating children’s writing skills, given that writing is typically evaluated with a single task and a single rater in classrooms and even in some state accountability systems.
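A rough sense of how such decision (D-) studies project reliability can be given with a short sketch. The code below computes the relative generalizability coefficient for a fully crossed person x task x rater design while varying the numbers of tasks and raters; the variance components are illustrative placeholders, not the estimates reported in this study.

```python
# Sketch of a G-theory decision (D-) study for a fully crossed person x task x rater design.
# Variance components below are illustrative placeholders, not this study's estimates.

def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative generalizability coefficient when averaging over n_tasks tasks and n_raters raters."""
    relative_error = (var_pt / n_tasks
                      + var_pr / n_raters
                      + var_ptr_e / (n_tasks * n_raters))
    return var_p / (var_p + relative_error)

# Hypothetical components: person, person x task, person x rater, residual.
components = dict(var_p=0.54, var_pt=0.30, var_pr=0.01, var_ptr_e=0.15)

for n_tasks in (1, 2, 3, 6):
    for n_raters in (1, 2):
        g = g_coefficient(n_tasks=n_tasks, n_raters=n_raters, **components)
        print(f"tasks={n_tasks}, raters={n_raters}: E(rho^2) = {g:.2f}")
```

With components of this shape, adding tasks raises the coefficient much faster than adding raters, which mirrors the pattern the abstract describes.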

4.
Student responses to a large number of constructed-response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single-reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple-group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single-reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers, though fewer than the number of items.

5.
Content-based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept-based scoring tool for content-based scoring, c-rater, on four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed-response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept-based automated scoring.

6.
Argumentation is fundamental to science education, both as a prominent feature of scientific reasoning and as an effective mode of learning—a perspective reflected in contemporary frameworks and standards. The successful implementation of argumentation in school science, however, requires a paradigm shift in science assessment from the measurement of knowledge and understanding to the measurement of performance and knowledge in use. Performance tasks requiring argumentation must capture the many ways students can construct and evaluate arguments in science, yet such tasks are both expensive and resource-intensive to score. In this study we explore how machine learning text classification techniques can be applied to develop efficient, valid, and accurate constructed-response measures of students' competency with written scientific argumentation that are aligned with a validated argumentation learning progression. Data come from 933 middle school students in the San Francisco Bay Area and are based on three sets of argumentation items in three different science contexts. The findings demonstrate that we have been able to develop computer scoring models that can achieve substantial to almost perfect agreement between human-assigned and computer-predicted scores. Model performance was slightly weaker for harder items targeting higher levels of the learning progression, largely due to the linguistic complexity of these responses and the sparsity of higher-level responses in the training data set. Comparing the efficacy of different scoring approaches revealed that breaking down students' arguments into multiple components (e.g., the presence of an accurate claim or the provision of sufficient evidence), developing computer models for each component, and combining scores from these analytic components into a holistic score produced better results than holistic scoring approaches. However, this analytic approach was found to be differentially biased on some items when scoring responses from English learner (EL) students as compared to responses from non-EL students. Differences in severity between human and computer scores for EL students under each approach are explored, and potential sources of bias in automated scoring are discussed.
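The authors' scoring models are not specified here, so the following is only a minimal sketch of the analytic-components strategy the abstract describes, assuming a hypothetical data file with one text column, integer scores for each rubric component, and a human holistic score. Each component gets its own simple text classifier, the component predictions are summed into a holistic score, and agreement with human raters is checked with quadratically weighted kappa.

```python
# Minimal sketch of analytic-component text scoring (not the authors' system).
# Assumes a hypothetical CSV with columns: response, claim_score, evidence_score, holistic_score.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("argumentation_responses.csv")  # hypothetical file
train, test = train_test_split(df, test_size=0.25, random_state=0)

components = ["claim_score", "evidence_score"]  # analytic rubric components
predictions = {}
for comp in components:
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit(train["response"], train[comp])
    predictions[comp] = model.predict(test["response"])

# Combine analytic component predictions into a holistic score (here: a simple sum).
holistic_pred = sum(predictions.values())
qwk = cohen_kappa_score(test["holistic_score"], holistic_pred, weights="quadratic")
print(f"Quadratic weighted kappa vs. human holistic scores: {qwk:.2f}")
```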

7.
Classical test theory (CTT), generalizability theory (GT), and multi-faceted Rasch model (MFRM) approaches to detecting and correcting for rater variability were compared. Each of 4,930 students' responses on an English examination was graded on 9 scales by 3 raters drawn from a pool of 70. CTT and MFRM indicated substantial variation among raters; the MFRM analysis identified far more raters as different than the CTT analysis did. In contrast, the GT rater variance component and the Rasch histograms suggested little rater variation. The CTT and MFRM correction procedures both produced different scores for more than 50% of the examinees, but 75% of the examinees received identical results after each correction. The demonstrated value of a correction for systems of multiple well-trained graders has implications for all systems in which subjective scoring is used.
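For orientation, the many-facet (multi-faceted) Rasch model referred to above is commonly written in the standard Linacre form below, which models the log-odds of examinee n receiving category k rather than k−1 on scale i from rater j; this is the textbook formulation, not an equation reproduced from the article.

```latex
% Standard many-facet Rasch model (Linacre form); not reproduced from the article.
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_k
```

Here theta_n is examinee proficiency, delta_i the scale difficulty, lambda_j the rater's severity, and tau_k the category threshold; because rater severity is an explicit parameter, MFRM-adjusted scores can correct each examinee's result for the particular raters encountered, which is the correction compared against the CTT procedure above.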

8.
9.
This study investigated the interrater reliability of teachers' and school psychology externs' scoring of protocols for the Developmental Test of Visual-Motor Integration (VMI). Previous studies suggest that the scoring criteria of the VMI are ambiguous, which, when coupled with raters' lack of scoring experience and limited knowledge of testing issues, contributes to low rater reliability. The original manual scoring system was used by four trained teachers with no VMI experience and by four experienced raters. A VMI scoring system revised to eliminate ambiguous scoring criteria was used by an additional four teachers inexperienced with the VMI and by four experienced raters. High reliability coefficients (>.90) were found for all raters, regardless of the scoring system employed. The influence on interrater reliability of factors such as training, the nature of the training setting, characteristics of the raters, and ambiguity of scoring criteria is discussed.
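As an illustration of the kind of interrater reliability coefficient reported (values above .90), the sketch below computes a Shrout and Fleiss ICC(2,1), a two-way random-effects, single-rater, absolute-agreement coefficient, from a protocols x raters score matrix; the scores are simulated placeholders, not VMI data.

```python
# Two-way random-effects ICC(2,1) (Shrout & Fleiss) from a protocols x raters matrix.
# The scores below are simulated placeholders, not VMI data.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(20, 4, size=(30, 1))            # 30 protocols
scores = true_scores + rng.normal(0, 1, size=(30, 4))    # 4 raters with rating noise

def icc_2_1(x):
    """ICC(2,1): two-way random effects, single rater, absolute agreement."""
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # protocols
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    ss_err = ((x - grand) ** 2).sum() - (n - 1) * ms_rows - (k - 1) * ms_cols
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```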

10.
In generalizability theory studies in large-scale testing contexts, sometimes a facet is very sparsely crossed with the object of measurement. For example, when assessments are scored by human raters, it may not be practical to have every rater score all students. Sometimes the scoring is systematically designed such that the raters are consistently grouped throughout the scoring, so that the data can be analyzed as raters nested within teams. Other times, rater pairs are randomly assigned for each student, such that each rater is paired with many other raters at different times. One possibility for this scenario is to treat the data as if raters were nested within students. Because the raters are not truly independent across all students, the resulting variance components could be somewhat biased. This study illustrates how the bias will tend to be small in large-scale studies.

11.
Using generalizability (G-) theory and rater interviews as research methods, this study examined the impact of the current scoring system of the CET-4 (College English Test Band 4, a high-stakes national standardized EFL assessment in China) writing on its score variability and reliability. One hundred and twenty CET-4 essays written by 60 non-English major undergraduate students at one Chinese university were scored holistically by 35 experienced CET-4 raters using the authentic CET-4 scoring rubric. Ten purposively selected raters were further interviewed for their views on how the current scoring system could impact its score variability and reliability. The G-theory results indicated that the current single-task and single-rater holistic scoring system would not be able to yield acceptable generalizability and dependability coefficients. The rater interview results supported the quantitative findings. Important implications for the CET-4 writing assessment policy in China are discussed.

12.
This study compared short-form constructed responses evaluated by both human raters and machine scoring algorithms. The context was a public competition in which both public competitors and commercial vendors vied to develop machine scoring algorithms that would match or exceed the performance of operational human raters in a summative high-stakes testing environment. Data (N = 25,683) were drawn from three different states, employed 10 different prompts, and came from two different secondary grade levels. Samples ranging in size from 2,130 to 2,999 were randomly selected from the data sets provided by the states and then randomly divided into three sets: a training set, a test set, and a validation set. Machine performance on all of the agreement measures failed to match that of the human raters. The study concluded with recommendations on steps that might improve machine scoring algorithms before they can be used in any operational way.

13.
Psychometric models based on the structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests for testing the measurement invariance of raters. The modeling approach, and how it can be used for performance tests with less-than-optimal rater designs, is illustrated using a data set from a performance test designed to measure medical students’ patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.

14.
Novice members of a Norwegian national rater panel tasked with assessing Year 8 pupils’ written texts were studied during three successive preparation sessions (2011–2012). The purpose was to investigate how the raters successfully make use of different decision-making strategies in an assessment situation where pre-set criteria and standards give a rather strict framework. The data sources were the raters’ pair assessment dialogues. The analysis shows that the raters use a ‘shared standards strategy’, but when reaching agreement on text quality they also seem to make very good use of assessment strategies related to their work as writing teachers. Moreover, asymmetries in knowledge and participation among raters contribute to creating an image of writing assessment as a challenging hermeneutic practice. It is suggested that future rater preparation would gain from being attentive to the internalised assessment practices teachers bring to the fore when working as raters.

15.
Although much attention has been given to rater effects in rater-mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large-scale, multi-state summative assessment program. Multilevel models were used to assess the overall extent of rater leniency/severity during scoring and examine the extent to which leniency/severity effects were stable across the three administrations. Model results were then applied to scaled scores to estimate the impact of the stability of leniency/severity effects on students’ scores. Results showed relative scoring stability across administrations in mathematics. In English language arts, short constructed response items showed evidence of slightly increasing severity across administrations, while essays showed mixed results: evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when model results were applied to scaled scores, results revealed rater effects had minimal impact on students’ scores.
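The multilevel models in this study are not fully specified in the abstract; the sketch below shows one plausible way to set up such a model in Python with statsmodels, with random intercepts for raters (overall leniency/severity) and random slopes for administration (drift in leniency/severity over time). The data file and column names are hypothetical.

```python
# Sketch of a multilevel model for rater leniency/severity across administrations.
# Not the study's exact specification; the file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# Expected columns: score (awarded score), administration (0, 1, 2), rater_id.
df = pd.read_csv("rater_scores.csv")  # hypothetical file

# Random intercepts capture each rater's overall leniency/severity; random slopes
# for administration capture how stable that leniency/severity is over time.
model = smf.mixedlm("score ~ administration",
                    data=df,
                    groups=df["rater_id"],
                    re_formula="~administration")
result = model.fit()
print(result.summary())
```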

16.
The purpose of this study was to build a Random Forest supervised machine learning model in order to predict musical rater-type classifications based upon a Rasch analysis of raters’ differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9–12) were collected using a 29-item Likert-type rating scale embedded within five domains (tone/intonation, n = 6; balance, n = 5; interpretation, n = 6; rhythm, n = 6; and technical accuracy, n = 6). Data were analyzed using a Many Facets Rasch Partial Credit Model. An a priori k-means cluster analysis of 29 differential rater functioning indices produced a discrete feature vector that classified raters into one of three distinct rater-types: (a) syntactical rater-type, (b) expressive rater-type, or (c) mental representation rater-type. The initial Random Forest model produced an out-of-bag error rate of 5.05%, indicating that approximately 95% of the raters were correctly classified. After tuning a set of three hyperparameters (ntree, mtry, and node size), the optimized model demonstrated an improved out-of-bag error rate of 2.02%. Implications for improvements in assessment, research, and rater training in the field of music education are discussed.
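The tuning parameters named above (ntree, mtry, node size) are those of R-style Random Forest implementations; a minimal scikit-learn sketch of the same workflow is shown below, where they correspond roughly to n_estimators, max_features, and min_samples_leaf. The differential rater functioning indices and the resulting cluster labels here are random placeholders, not the study's data.

```python
# Sketch: classify raters into k-means-derived rater types with a Random Forest,
# using out-of-bag (OOB) error to gauge accuracy. Data below are random placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(142, 29))  # 142 raters x 29 differential rater functioning indices

# Step 1: a priori k-means clustering into three rater types (placeholder labels).
rater_type = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: Random Forest classification with OOB accuracy.
# ntree -> n_estimators, mtry -> max_features, node size -> min_samples_leaf.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            min_samples_leaf=1, oob_score=True, random_state=0)
rf.fit(X, rater_type)
print(f"OOB error rate: {1 - rf.oob_score_:.4f}")
```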

17.
To meet recent accountability mandates, school districts are implementing assessment frameworks to document teachers’ effectiveness. Observational assessments play a key role in this process, albeit without compelling evidence of their psychometric rigor. Using a sample of kindergarten teachers, we employed generalizability theory to investigate, across teachers, raters, and lessons, the stability of scores obtained with two different observation measures: the CLASS K-3 and the FFT. We conducted a series of decision studies to document, for both measures’ constituent domains, the number of lessons per teacher and raters per lesson that would justify the use of observation scores for high-stakes decisions. Acceptable, stable scores for individual-level decisions about teachers may generally require more raters and lessons than are typically used in practice (1–2 raters and fewer than 3 lessons). The considerable variability of observation-based scores raises concerns about either measure’s appropriateness for making individual or group decisions about teachers’ effectiveness.

18.
This study evaluated the reliability and validity of a performance assessment designed to measure students' thinking and reasoning skills in mathematics. The QUASAR Cognitive Assessment Instrument (QCAI) was administered to over 1,700 sixth- and seventh-grade students of various ethnic backgrounds in six schools participating in the QUASAR project. The consistency of students' responses across tasks and the validity of inferences drawn from the assessment scores to the more broadly defined construct domain were examined. Intertask consistency and the dimensionality of the assessment were assessed through the use of polychoric correlations and confirmatory factor analysis, and the generalizability of the derived scores was examined through the use of generalizability theory. The results from the confirmatory factor analysis indicate that a one-factor model fits the data for each of the four QCAI forms. The major findings from the generalizability studies (person x task and person x rater x task) indicate that, for each of the four forms, the person x task variance component accounts for the largest percentage of the total variability, and the percentage of variance accounted for by the variance components that include the rater effect is negligible. The generalizability and dependability coefficients for the person x task decision studies (n_t = 9) range from .71 to .84. These results indicate that the use of nine tasks may not be adequate for generalizing to the larger domain of mathematics for individual student-level scores. The QUASAR project, however, is interested in assessing mathematics achievement at the program level, not the student level; therefore, these coefficients are not alarmingly low.

19.
Observational assessment is used to study program and teacher effectiveness across large numbers of classrooms, but training a workforce of raters that can assign reliable scores when observations are used in large-scale contexts can be challenging and expensive. Limited data are available to speak to the feasibility of training large numbers of raters to calibrate to an observation tool, or the characteristics of raters associated with calibration. This study reports on the success of rater calibration across 2093 raters trained by the Office of Head Start (OHS) in 2008–2009 on the Classroom Assessment Scoring System (CLASS), and, for a subsample of 704 raters, the characteristics that predict their calibration. Findings indicate that it is possible to train large numbers of raters to calibrate to an observation tool, and that rater beliefs about teachers and children predicted the degree of calibration. Implications for large-scale observational assessments are discussed.

20.
Research indicates that instructional aspects of teacher performance are the most difficult to reach consensus on, significantly limiting teacher observation as a way to systematically improve instructional practice. Understanding the rationales that raters provide as they evaluate teacher performance with an observation protocol offers one way to better understand the training efforts required to improve rater accuracy. The purpose of this study was to examine the accuracy of raters evaluating special education teachers’ implementation of evidence-based math instruction. A mixed-methods approach was used to investigate: 1) the consistency of the raters’ application of the scoring criteria in evaluating teachers’ lessons, 2) the agreement of raters’ scores on two lessons with those given by expert raters, and 3) the raters’ understanding and application of the scoring criteria through a think-aloud process. The results show that raters had difficulty understanding some of the high-inference items in the rubric and applying them accurately and consistently across the lessons. Implications for rater training are discussed.
