Similar Articles
 20 similar articles found (search time: 562 ms)
1.
In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the United Kingdom. This study investigated rater cognition in the context of the assessment of school‐based project work for high‐stakes purposes. Thirteen teachers across three subjects were asked to “think aloud” whilst scoring example projects. Teachers also completed an internal standardization exercise. Nine professional raters across the same three subjects standardized a set of project scores whilst thinking aloud. The behaviors and features attended to were coded. The data provided insights into aspects of rater cognition such as reading strategies, emotional and social influences, evaluations of features of student work (which aligned with scoring criteria), and how overall judgments are reached. The findings can be related to existing theories of judgment. Based on the evidence collected, the cognition of teacher raters did not appear to be substantially different from that of professional raters.

2.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.
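Although the abstract does not reproduce the model, an rm-PRF can be sketched under a many-facet Rasch rating-scale formulation; the LaTeX below is a minimal illustration under that assumption (not necessarily the authors' exact parameterization), with theta_n the test-taker's achievement, delta_i the difficulty of assessment component i, lambda_j the severity of rater j, and tau_h the category thresholds.

% Minimal sketch, assuming a many-facet rating-scale model:
% probability that rater j awards person n score k on component i
\[
P(X_{nij}=k)=
\frac{\exp\sum_{h=0}^{k}\left(\theta_n-\delta_i-\lambda_j-\tau_h\right)}
     {\sum_{m=0}^{M}\exp\sum_{h=0}^{m}\left(\theta_n-\delta_i-\lambda_j-\tau_h\right)},
\qquad \tau_0\equiv 0 .
\]

The rm-PRF then plots the expected rating \(E[X_{nij}]=\sum_k k\,P(X_{nij}=k)\) against the difficulty of the rater-component combinations; systematic departures of a test-taker's observed ratings from this curve are what flag person misfit.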

3.
Machine learning has been frequently employed to automatically score constructed-response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed-response questions. 187 in-service science teachers watched the videos, each embedded in a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the human consensus scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers’ performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.
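As an illustration of the predictive-scoring setup the abstract describes, a minimal pipeline trained on human consensus scores might look like the Python sketch below. The TF-IDF features, the logistic-regression classifier, and all data are hypothetical placeholders; the study's actual algorithms are not specified here.

# Hypothetical sketch of machine scoring trained on human consensus
# ratings; not the study's actual algorithm or data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

responses = [  # toy constructed responses (placeholders)
    "uses student ideas to guide the discussion",
    "explains the concept and moves on",
    "asks probing questions about the student's model",
    "gives the answer directly",
    "connects the scenario to prior lessons",
    "reads the definition from the textbook",
]
consensus = [2, 1, 2, 0, 2, 0]  # toy human consensus scores (0-2)

scorer = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # simple lexical features
    LogisticRegression(max_iter=1000),    # score levels as classes
)
scorer.fit(responses, consensus)

# The fitted model can then be treated as one more "rater" whose
# scores enter the many-facet Rasch analysis alongside the humans'.
print(scorer.predict(["invites students to revise their models"]))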

4.
The purpose of this study is to describe a many-facet Rasch (FACETS) model for measuring writing ability. The FACETS model is a multivariate extension of the Rasch measurement model that provides a framework for calibrating raters and writing tasks in writing assessment. This paper demonstrates how the FACETS model can be applied to measurement problems encountered in large-scale writing assessment. A random sample of 1,000 students who took a statewide writing examination is used to illustrate the FACETS model. The data indicate that rater severity differed significantly even after intensive training. The study also found small but statistically significant differences in the difficulty of the writing tasks. The FACETS model offers a promising approach to the measurement problems encountered in large-scale examinations that assess writing ability through essays.
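For reference, the many-facet Rasch model described here is conventionally written as follows (the standard FACETS formulation; the symbols are the usual ones rather than the article's own): for examinee n responding to writing task i as judged by rater j,

\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)=\theta_n-\delta_i-\lambda_j-\tau_k ,
\]

where theta_n is writing ability, delta_i task difficulty, lambda_j rater severity, and tau_k the difficulty of receiving score k relative to score k-1.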

5.
In writing assessment, the inconsistency of teachers’ scoring is among the most frequently reported concerns regarding the validity and reliability of assessment. The study aimed to find out to what extent participating in a community of assessment practice (CAP) can impact the discrepancies among raters’ scores. Adopting a one-group pretest-posttest design, patterns in the teachers’ scoring judgments were explored based on both quantitative and qualitative data. The results indicate a significant increase in the degree of agreement in the teachers’ scorings, with changes in their severity tendencies for the structural variety, lexical accuracy, organization, and mechanics criteria, while their scoring judgments on the structural accuracy, task achievement, and lexical variety criteria showed low levels of agreement.

6.
The purpose of this study was to examine the quality assurance issues of a national English writing assessment in Chinese higher education. Specifically, using generalizability theory and rater interviews, this study examined how the current scoring policy of the TEM-4 (Test for English Majors – Band 4, a high-stakes national standardized EFL assessment in China) writing could impact its score variability and reliability. Eighteen argumentative essays written by nine English major undergraduate students were selected as the writing samples. Ten TEM-4 raters were first invited to use the authentic TEM-4 writing scoring rubric to score these essays holistically and analytically (with time intervals in between). They were then interviewed for their views on how the current scoring policy of the TEM-4 writing assessment could affect its overall quality. The quantitative generalizability theory results of this study suggested that the current scoring policy would not yield acceptable reliability coefficients. The qualitative results supported the generalizability theory findings. Policy implications for quality improvement of the TEM-4 writing assessment in China are discussed.
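The generalizability-theory machinery behind such an analysis can be sketched as follows, assuming a fully crossed person x rater x task (p x r x t) design; the notation is standard G theory rather than anything taken from the paper. The observed-score variance decomposes as

\[
\sigma^2(X_{prt})=\sigma^2_p+\sigma^2_r+\sigma^2_t+\sigma^2_{pr}+\sigma^2_{pt}+\sigma^2_{rt}+\sigma^2_{prt,e} ,
\]

and the generalizability coefficient for relative decisions with n'_r raters and n'_t tasks is

\[
E\rho^2=\frac{\sigma^2_p}{\sigma^2_p+\sigma^2_{pr}/n'_r+\sigma^2_{pt}/n'_t+\sigma^2_{prt,e}/(n'_r n'_t)} .
\]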

7.
8.
Research indicates that instructional aspects of teacher performance are the most difficult to reach consensus on, significantly limiting teacher observation as a way to systematically improve instructional practice. Understanding the rationales that raters provide as they evaluate teacher performance with an observation protocol offers one way to better understand the training efforts required to improve rater accuracy. The purpose of this study was to examine the accuracy of raters evaluating special education teachers’ implementation of evidence-based math instruction. A mixed-methods approach was used to investigate: 1) the consistency of the raters’ application of the scoring criteria in evaluating teachers’ lessons, 2) the agreement of raters’ scores on two lessons with those given by expert raters, and 3) the raters’ understanding and application of the scoring criteria through a think-aloud process. The results show that raters had difficulty understanding some of the high-inference items in the rubric and applying them accurately and consistently across the lessons. Implications for rater training are discussed.

9.
Recent literature on the use of exemplars in the context of higher education has shown that exemplar-based instruction is implemented in various disciplines; nevertheless, how exemplar-based instruction can be implemented in English-as-a-Second-Language (ESL) writing classrooms in higher education institutions remains under-explored. In this connection, this article reports on a textbook development project which adopts an exemplar-based instruction approach to be used by university English instructors to prepare students for IELTS writing (academic module). The goal of the textbook is to cultivate students’ understanding of the assessment standards of the two IELTS writing tasks through the design and use of exemplar-based dialogic and reflective activities. In this article, theoretical underpinnings of the use of exemplars, namely tacit knowledge, assessment as learning and dialogic feedback, will first be discussed in detail. Then, an overview of an ongoing project which aims to develop an exemplar-based IELTS writing textbook will be given. The last section of this article suggests practical strategies for ESL writing teachers who are interested in using exemplars to develop students’ understanding of assessment standards.

10.
Recognizing the importance of formative assessment, this mixed-methods study investigates how four teachers and 100 students respond to the new emphasis on formative assessment in English as a foreign language (EFL) writing classes in Norway. While previous studies have examined formative assessment in oral classroom interactions and focused on either studying students or teachers, little research has been conducted on formative assessment of writing where both students and teachers are studied. As such, this study provides new insight. The findings mostly indicate that contradictions are prevalent amongst teachers’ and students’ perceptions of formative assessment of writing. The contradictions revolve around feedback, grades, text revision, self-assessment, and student involvement. The identified contradictions suggest the need for developing a mutual understanding of formative assessment in order to make it useful and meaningful.

11.
An Observation Guide, designed to help New Zealand teachers identify areas of teaching strength and aspects for development, was developed as part of a wider project. In the second phase of this project, 18 middle school teachers used the Guide to gather and record evidence as they participated in seven rounds of reciprocal peer observation and feedback during writing lessons with Grades 6–8 students. We report here on data from round 6 observations about the assessment for learning (AfL) strategies reported as evident in teachers’ practices, how these strategies were implemented and potential gaps in practice. AfL has at its heart a core of interdependent strategies that collectively contribute to the development of autonomous, self-regulating learners and quality learning. While the middle school teachers shared goals for learning and communicated what counted as successful achievement to students, they appeared to struggle when articulating goals in terms of literacy learning and conveying the substantive aspects and quality expected in students’ writing. In addition, despite AfL's promotion of learner autonomy, few teachers openly afforded students focused opportunities to take a meaningful role in their learning through the appraisal of their own and peers’ writing and the joint construction of feedback. As such, teachers’ AfL practice in the writing classroom failed to realise its full potential. It is argued that the promise of AfL can only be reached when strategies are enacted in ways that reflect its unitary nature, promote quality outcomes and give students a central role in their learning.

12.
The lament that ‘students can’t write’ remains loud and defiant, even after years of research pointing to the myriad factors that make students’ writing challenging, particularly when they move into university. This paper reports on a longitudinal, ethnographic study which explored students’ writing ‘in transition’, from A-levels to university in the UK, through the critical lens offered by the academic literacies conceptual framing. This paper offers critical analysis of the ways that students, teachers and institutions position writing at A-level and university, exploring the assumptions and beliefs that underpin their understandings and practices using Ivanič’s framework of discourses of writing. The analysis proposes that the centrality of assessment in the treatment of language at both levels creates an ‘assessment discourse of writing’, which originates in school, and becomes a defining and restrictive frame for students’ writing as they move into higher education. The analysis further suggests that assessment is the principal cause of the students’ challenges with adapting to the writing requirements of university. Moreover, assessment is used as a metalanguage for discussing writing at A-levels, and can become an unhelpful ‘anchor of continuance’ for students as they move into university.

13.
Using generalizability (G-) theory and rater interviews as research methods, this study examined the impact of the current scoring system of the CET-4 (College English Test Band 4, a high-stakes national standardized EFL assessment in China) writing on its score variability and reliability. One hundred and twenty CET-4 essays written by 60 non-English major undergraduate students at one Chinese university were scored holistically by 35 experienced CET-4 raters using the authentic CET-4 scoring rubric. Ten purposively selected raters were further interviewed for their views on how the current scoring system could impact its score variability and reliability. The G-theory results indicated that the current single-task and single-rater holistic scoring system would not be able to yield acceptable generalizability and dependability coefficients. The rater interview results supported the quantitative findings. Important implications for the CET-4 writing assessment policy in China are discussed.
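For the absolute (e.g., cut-score) decisions a high-stakes test involves, the dependability coefficient Phi mentioned here also charges the main-effect facet variances to error; in the same crossed p x r x t notation used above (standard G theory, not the paper's own derivation):

\[
\Phi=\frac{\sigma^2_p}{\sigma^2_p+\sigma^2_r/n'_r+\sigma^2_t/n'_t+\sigma^2_{pr}/n'_r+\sigma^2_{pt}/n'_t+\sigma^2_{rt}/(n'_r n'_t)+\sigma^2_{prt,e}/(n'_r n'_t)} .
\]

Under a single-rater, single-task policy, n'_r = n'_t = 1, so every error component enters at full weight, which is consistent with the unacceptable coefficients the study reports.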

14.
This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14‐year‐olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters’ previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters’ scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.
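A multilevel specification consistent with the questions examined here (a sketch, not the authors' exact models) nests ratings within raters and lets each rater's severity vary over time:

\[
y_{ijt}=\mu+\theta_i+u_{0j}+u_{1j}\,t+\varepsilon_{ijt},
\qquad (u_{0j},u_{1j})\sim N(\mathbf{0},\Sigma_u) ,
\]

where y_{ijt} is the score rater j awards essay i at occasion t, u_{0j} captures rater j's severity, and u_{1j} its drift. Instability of severity would appear as occasion-level variance in the severity terms, while a central tendency effect would show up as rater-level compression of score variance toward the scale midpoint.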

15.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters’ scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM‐SDT model is applied to data from a large‐scale assessment and is shown to provide a useful summary of various aspects of the raters’ performance.
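The first-level rater model can be written in latent-class signal detection form (a standard presentation of this approach; notational details may differ from the article). Given that an essay's latent quality falls in category eta, the probability that rater j awards a rating of at least k is

\[
P(Y_j\ge k\mid\eta)=F\!\left(d_j\,\eta-c_{jk}\right),\qquad k=2,\dots,K ,
\]

where F is a logistic or normal CDF, d_j is rater j's precision in discriminating the latent quality categories, and the criteria c_{jk} locate the rater's category boundaries: their overall location reflects severity or leniency, and their spacing reflects effects such as central tendency. The HRM's second level then links eta to examinee proficiency.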

16.
Despite the increasing volume of research in peer assessment for writing, few studies have been conducted to explore teachers’ perceptions of its appropriateness for writing instruction. It is essential to understand teachers’ perceptions of peer assessment as teachers play an important role in whether and how peer assessment is implemented in their instruction. The current study investigated tertiary English writing tutors’ perceptions of the appropriateness of peer assessment for English as a Foreign Language writing in China, where peer assessment has been increasingly discussed and researched but only occasionally used in teaching. The current study scrutinised the reasons behind its limited use via in-depth exploratory interviews with 25 writing tutors with different teaching backgrounds. The interview data showed tutors’ limited knowledge of peer assessment and unanimous hesitation in using it. The former was explained by insufficient instruction and training in peer assessment. The latter was related to the incompatibility of peer assessment with the examinations-oriented education system, learners’ low English language proficiency and learning motivation, and the conflict of peer assessment with the entrenched teacher-driven learning culture. Suggestions are made about training and engaging teachers to effectively use peer assessment in instruction.

17.
A random sample of middle school teachers (grades 6–9) from across the United States was surveyed about their use of writing to support students’ learning. The selection process was stratified so there were an equal number of English language arts, social studies, and science teachers. More than one-half of the teachers reported applying 15 or more writing-to-learn strategies at least once a month. The most commonly used writing-to-learn strategies were writing short answers to questions, note taking for reading, note taking while listening, and completing worksheets. While teachers reported using a variety of writing-to-learn strategies, most of them indicated they received minimal or no formal preparation in college on how to use writing-to-learn strategies to support student learning, less than one-half of teachers directly taught students how to use the writing-to-learn strategies commonly assigned, and the most commonly used writing-to-learn strategies did not require students to think deeply about the material they were learning. We further found that teachers’ reported use of writing-to-learn strategies was related to their preparedness and the composition of their classroom in terms of above and below average writers, English Language Learners, and students with disabilities.

18.
Observational assessment is used to study program and teacher effectiveness across large numbers of classrooms, but training a workforce of raters that can assign reliable scores when observations are used in large-scale contexts can be challenging and expensive. Limited data are available to speak to the feasibility of training large numbers of raters to calibrate to an observation tool, or the characteristics of raters associated with calibration. This study reports on the success of rater calibration across 2093 raters trained by the Office of Head Start (OHS) in 2008–2009 on the Classroom Assessment Scoring System (CLASS), and for a subsample of 704 raters, characteristics that predict their calibration. Findings indicate that it is possible to train large numbers of raters to calibrate to an observation tool, and rater beliefs about teachers and children predicted the degree of calibration. Implications for large-scale observational assessments are discussed.

19.
We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach desirable reliability levels of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks each in the narrative and expository genres, and their written compositions were evaluated with widely used evaluation methods for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students’ scores varied largely by task (30.44% and 28.61% of variance), but not by rater. To reach a reliability of .90, multiple tasks and raters were needed; for a reliability of .80, a single rater and multiple tasks were needed. These findings offer important implications for reliably evaluating children’s writing skills, given that writing is typically evaluated with a single task and a single rater in classrooms and even in some state accountability systems.
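A decision (D-) study of the kind reported here takes estimated variance components and projects the generalizability coefficient for candidate numbers of raters and tasks. The Python sketch below does exactly that; the variance components are hypothetical placeholders (chosen only to echo the abstract's pattern of large task variance and negligible rater variance), not the study's estimates.

# D-study sketch: how reliability changes with more raters and tasks.
# Variance components are HYPOTHETICAL placeholders, not the study's.
VAR = {
    "p": 0.54,     # persons (true writing differences)
    "t": 0.29,     # tasks (main effect; enters absolute error only)
    "r": 0.00,     # raters (main effect; enters absolute error only)
    "pr": 0.04,    # person x rater
    "pt": 0.08,    # person x task
    "prt_e": 0.10  # residual / person x rater x task
}

def g_coefficient(n_raters: int, n_tasks: int, v=VAR) -> float:
    """Generalizability coefficient for a crossed p x r x t design."""
    rel_error = (v["pr"] / n_raters
                 + v["pt"] / n_tasks
                 + v["prt_e"] / (n_raters * n_tasks))
    return v["p"] / (v["p"] + rel_error)

for n_r in (1, 2):
    for n_t in (1, 3, 6):
        print(f"raters={n_r}, tasks={n_t}: "
              f"Ep^2 = {g_coefficient(n_r, n_t):.2f}")

With these placeholder components, one rater with three tasks clears .80 while .90 requires two raters and six tasks, mirroring the pattern the abstract reports.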

20.
A random sample of 482 teachers in grades 3 through 8 from across the United States were surveyed about (a) their perceptions of the version of the Common Core writing and language standards adopted by their state and their state’s writing assessment, (b) their preparation to teach writing, and (c) their self-efficacy beliefs for teaching writing. Regardless of grade, a majority of teachers believed that the adopted standards are more rigorous than prior standards, provide clear expectations for students that can be straightforwardly translated into activities and lessons, and have pushed them to address writing more often. However, many surveyed felt the new writing and language standards are too numerous to cover, omit key aspects of writing development, and may be inappropriate for struggling writers. Moreover, most did not feel that professional development efforts have been sufficient to achieve successful implementation, and nearly one in five respondents was not familiar with the standards. The respondents were generally less sanguine regarding their state’s writing test, with elementary teachers even less positive than middle school teachers on some aspects, though nearly a third were unfamiliar with their state test. A majority believed state writing tests, though more rigorous than prior tests, fail to address important aspects of writing development, do not accommodate the needs of students with diverse abilities, and require more time than is available to prepare students. Additionally, many teachers believed professional development efforts have been insufficient to help them understand measurement properties of the assessments and how to use test data to identify students’ writing needs. Teachers who were better prepared to teach writing and who held more positive personal teaching efficacy beliefs for writing exhibited generally more positive perceptions of their state’s standards. In contrast, only teacher efficacy beliefs made a unique contribution to the survey respondents’ attitudes and beliefs about their state’s writing test.
