Similar Documents
20 similar documents found
1.
Despite the increasing popularity of peer assessment in tertiary-level interpreter education, very little research has been conducted to examine the quality of peer ratings on language interpretation. While previous research on the quality of peer ratings, particularly rating accuracy, mainly relies on correlation and analysis of variance, latent trait modelling emerges as a useful approach to investigate rating accuracy in rater-mediated performance assessment. The present study demonstrates the use of multifaceted Rasch partial credit modelling to explore the accuracy of peer ratings on English-Chinese consecutive interpretation. The analysis shows that there was a relatively wide spread of rater accuracy estimates and that statistically significant differences were found between peer raters regarding rating accuracy. Additionally, it was easier for peer raters to assess some students accurately than others, to peer-assess target language quality accurately than the other rating domains, and to rate English-to-Chinese interpretation accurately than the reverse direction. Through these findings, latent trait modelling demonstrates its capability to produce individual-level indices, measure rater accuracy directly, and accommodate sparse rating designs. It is therefore hoped that substantive inquiries into peer assessment of language interpretation could utilise latent trait modelling to move this line of research forward.

2.
When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to control for differences in rater severity. Although several different linking designs are used in practice to establish connectivity, the implications of design differences have not been fully explored. Research on the impact of model-data fit on the quality of MFR model-based adjustments for rater severity is also limited. This study explores the effects of linking designs and model-data fit for raters on the interpretation of student achievement estimates within the context of performance assessments in music. Results indicate that performances cannot be effectively adjusted for rater effects when inadequate linking or model-data fit is present.
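The MFR model above places examinees, items, and raters on one logit scale: the log-odds of adjacent rating categories is ability minus item difficulty minus rater severity minus a category threshold. A minimal illustrative sketch (function name and all parameter values are hypothetical, not from the study):

```python
import math

def mfrm_category_probs(theta, delta, severity, taus):
    """Category probabilities under a many-facet Rasch (partial credit)
    model with examinee ability theta, item difficulty delta, rater
    severity, and thresholds taus[k] for moving from category k to k+1.
    Categories run 0..len(taus)."""
    # The log-numerator for category k is the cumulative sum of the
    # adjacent-category logits (theta - delta - severity - tau).
    log_num = [0.0]
    for tau in taus:
        log_num.append(log_num[-1] + (theta - delta - severity - tau))
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]
```

A more severe rater (larger `severity`) shifts probability mass toward lower categories, which is precisely the effect an adequately linked MFR analysis adjusts for.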

3.
The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start‐up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model–data fit statistics.

4.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and those based on the standard errors of ratings should be avoided. Suggestions are provided for using these indexes in future research and practice.

5.
Machine learning has been frequently employed to automatically score constructed response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed response questions. 187 in-service science teachers watched the videos, each of which presented a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the consensus human scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers’ performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater, compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.

6.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.

7.
We conducted generalizability studies to examine the extent to which ratings of language arts performance assignments, administered in a large, diverse, urban district to students in second through ninth grades, result in reliable and precise estimates of true student performance. The results highlight three important points when considering the use of performance assessments in large-scale settings: (a) Rater training may significantly impact reliability; (b) simple rater agreement indices do not provide enough information to assess the reliability of inferences about true student achievement; and (c) assessments adequate for relative judgments of student performance do not necessarily provide sufficient precision for absolute criterion-referenced decisions.
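Point (c) above turns on the distinction between relative and absolute generalizability coefficients. A minimal sketch of a persons × raters G-study (hypothetical function and data; with one observation per cell, the residual is confounded with the person-by-rater interaction):

```python
def g_study_crossed(scores):
    """One-facet G-study for a fully crossed persons x raters design.
    scores[p][r] is rater r's rating of person p.  Returns variance
    components and the relative/absolute G coefficients."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))  # residual
    var_pr = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)
    # Relative error ignores rater main effects (rank-ordering only);
    # absolute error includes them, so criterion-referenced decisions
    # face the stricter, never-larger coefficient.
    g_rel = var_p / (var_p + var_pr / n_r)
    g_abs = var_p / (var_p + (var_r + var_pr) / n_r)
    return {"var_p": var_p, "var_r": var_r, "var_pr": var_pr,
            "g_rel": g_rel, "g_abs": g_abs}
```

Because the absolute coefficient adds the rater variance component to the error term, an assessment can be adequate for relative judgments yet too imprecise for absolute ones, as the abstract notes.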

8.
A flexible approach to assessment may promote students’ engagement and academic achievement by allowing them to personalise their learning experience, even in the context of large undergraduate classes. However, studies reporting flexible assessment strategies and their impact are limited. In this paper, I present a feasible and effective approach to flexible assessment, describe choices made by 2,016 students in 12 sections of two different courses using this approach, and explore associations between students’ choices and academic achievement. Students decided at the beginning of the term whether they would use the assessment scheme proposed by the instructor or modify it by selecting (from within ranges provided) which assessments they would complete and the value of each in calculating their final grade. Most students (62%) made some change, but many (38%) opted to use the suggested values. Students did not choose to minimise their workload by selecting the fewest assessments possible. Opting out of a large assignment was the most common change, but the majority (69%) completed the assignment and almost all (95%) included quizzes based on readings. Different choices were not associated with notable differences in achievement: midterm score was the most important predictor of performance on a cumulative final examination.

9.
Assessment options in higher education
This article evaluates an initiative to introduce assessment choice within a taught unit on an undergraduate healthcare programme as a means of addressing poor performance, especially for those students diagnosed with dyslexia. Students’ perceptions of the assessment experience were sought via the use of two focus group interviews (n = 16). The article describes the effect the assessment experience had on students’ stress levels, individual learning styles and achievement. Students’ performance improved and statistical analyses indicated parity between the assessment methods offered with similar performance profiles between students with and without dyslexia. The conclusion reached is that while the introduction of assessment options may be time consuming for staff to develop, the benefits of an enhanced student‐centred approach to assessment may be well worth this investment in time. Although a limited study owing to the small sample size, the results should be of interest to those academics who are concerned with assessment and its impact on students’ achievement.

10.
Focusing on evaluating students’ performance in basic math classrooms, the researchers in this study examined the impact of alternative assessments on learning outcomes in fourth grade Palestinian classrooms. Representing a large sector of education in the Palestinian territories, fourth grade students were randomly selected to participate in the study, in which they took part in various math instructional and assessment activities throughout multiple lessons and cycles. Alternative assessment approaches were used, including student self-assessment, peer assessment, and teacher assessment, in terms of three main achievement levels to measure the extent to which students learned the math concepts by recall and remembrance, ability to apply, and making inferences. Mixed design ANOVA measures were used to analyze and interpret the results, which showed significant trends and correlations across the four cycles of math instruction and indicated that alternative authentic assessment methods had a positive impact on students’ learning and application of math. Implications for integrating authentic assessment measures as well as peer and self-assessments were drawn in light of the promising outcomes of augmenting student motivation and developing critical life-long skills.

11.
Psychometric models based on structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests to test measurement invariance of raters. The modeling approach and how it can be used for performance tests with less than optimal rater designs are illustrated using a data set from a performance test designed to measure medical students’ patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.

12.
Context-based science courses stimulate students to reconstruct the information presented by connecting to their prior knowledge and experiences. However, students need support. Formative assessments inform both teacher and students about students’ knowledge deficiencies and misconceptions and how students can be supported. Research on formative assessments suggests a positive impact on students’ science achievement, although its success depends on how the formative assessment is implemented in class. The aim of this study was to provide insights into the effects of formative assessments on achievement during a context-based chemistry course on lactic acid. In a classroom action research setting, a pre-test/post-test control group design with switching replications was applied. Student achievement was measured in two pre-tests, two post-tests and a retention test. Participants were Grade 9 students from one secondary school in the Netherlands. Repeated-measures analysis showed a significant effect of formative assessments on students’ achievement. During the implementation of the formative assessments, intriguing discussions emerged between students, between students and teacher, and between teachers. Adding formative assessments to context-based approaches reinforces their strength to meet the current challenges of chemistry education. Formative assessments affect students’ achievement positively and stimulate feedback between students and teacher(s).

13.
International large-scale academic assessments have paid growing attention, year by year, to how family factors influence the development of students' academic achievement. This paper synthesizes the principles used to select family-factor indicators in six influential and relatively mature international large-scale assessment programmes, and reviews examples of how their results have been applied at the level of individual students, school instruction, and education reform. It thereby offers a reference for China to focus on students' key family factors, to design a family-factor assessment framework that accounts for the nested structure of the student, class, and school levels, and to make effective use of assessment results.

14.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
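The cumulative-versus-unfolding contrast above can be made concrete: a Rasch-type (dominance) process is monotone in theta minus delta, while a hyperbolic cosine (proximity) process peaks where the person and item locations coincide. A hedged sketch of the two single-item response functions in their simple dichotomous forms (function names are mine, not from the article):

```python
import math

def rasch_prob(theta, delta):
    """Cumulative (dominance) process: the probability of a positive
    response rises monotonically as theta - delta grows."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def hcm_prob(theta, delta, rho):
    """Unfolding (proximity) process in hyperbolic cosine form: the
    probability peaks when theta equals delta and falls off
    symmetrically on either side; rho governs the peak height."""
    return math.cosh(rho) / (math.cosh(rho) + math.cosh(theta - delta))
```

Choosing between the two amounts to deciding whether a rater's agreement with a score point grows without bound in the latent variable or instead reflects closeness to an ideal point.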

15.
Cognitive diagnostic assessment (CDA) allows for diagnosing second language (L2) learners’ strengths and weaknesses of attributes in a specific domain. Exploring the little-known territory of CDA, the current study retrofitted the reading section of the International English Language Testing System (IELTS) with a cognitive diagnostic model (CDM). It aimed to identify the attributes involved in successfully implementing IELTS reading, analyze the overall and individual test-takers’ reading performance, and, finally, explore the IELTS reading differences of Iranian students in engineering and veterinary domains. Based on think-aloud protocols and expert judgement, an initial Q-matrix was developed. Using the R package CDM, the generalized deterministic inputs, noisy “and” gate (G-DINA) model was applied to IELTS reading data to refine and validate the initial Q-matrix and estimate the mastery probabilities of 1025 test-takers on each attribute. The final Q-matrix consisted of 6 attributes assumed to be involved in IELTS reading. Moreover, test-takers, both overall and individually, demonstrated different mastery/non-mastery profiles across the 6 IELTS reading attributes on both macro and micro levels. Further, significant differences were found between the IELTS reading performances of Iranian engineering and veterinary students. The findings supported the assumption that CDA can provide instructors and IELTS candidates with detailed diagnostic feedback to promote test-takers’ IELTS reading performance.

16.
This study examines the relationship between Tongan students’ attitudes and beliefs towards their school experiences and their academic achievement on the high-stakes National Certificate of Educational Achievement (NCEA) assessments in English and mathematics. Data were obtained using previously published self-report inventories on a sample of Tongan senior students in New Zealand secondary schools. Confirmatory factor analysis of students’ conceptions found good-fit measurement models for each domain (teaching, learning, and assessment). Structural equation modelling was used to identify the effect of the various beliefs upon students’ total score in each subject and upon internally and externally assessed performance. It was noted that different beliefs became statistically significant predictors of performance, depending on the subject and type of assessment. Nonetheless, all three constructs played some role in at least one subject. A small-to-moderate proportion of variance in NCEA performance could be attributed to student beliefs, suggesting that efforts to help students adopt adaptive beliefs will have beneficial consequences for those students.

17.
Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings has been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration—simulated data and data from a large‐scale state‐wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality; and we investigate the association between rater training procedures and the manifestation of rater effects in the real data.

18.
Academic self-concept is a prominent construct in educational psychology that predicts future achievement. Similarly, peer ratings of competence predict future achievement as well. Yet do self-concept ratings have predictive value over and above peer ratings of competence? In this study, the interpersonal approach (Kwan, John, Kenny, Bond, & Robins, 2004) was applied to academic self-concept. The interpersonal approach decomposes the variance in self-concept ratings into a “method” part that is due to the student as the rater (perceiver effect), a shared “trait” part that is due to the student’s perceived achievement (target effect), and an idiosyncratic self-view (self-enhancement). In a round-robin design of competence ratings in which each student in a class rated every classmate’s competence, a total of 2,094 school students in 89 classes in two age cohorts rated their own math competence and the math competence of their classmates. Three main results emerged. First, self-concept ratings and peer ratings of competence had a substantial overlap in variance. Second, the shared “trait” part of the competence ratings was highly correlated with achievement and predicted gains in achievement. Third, the idiosyncratic self-view had a small positive association with (future) achievement. Altogether, this study introduces the interpersonal approach as a general framework for studying academic self-concept and peer ratings of competence in an integrated way.
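The perceiver/target/self-enhancement decomposition above can be approximated with plain row and column means of the round-robin matrix. This sketch ignores the small-sample corrections of the full Social Relations Model estimators, so it illustrates the idea rather than reproducing the Kwan et al. procedure (names and data are hypothetical):

```python
def round_robin_effects(ratings):
    """Simplified decomposition of a round-robin competence matrix.

    ratings[i][j] is student i's rating of student j; the diagonal
    holds self-ratings and is excluded from the effect estimates.
    Returns perceiver effects (row rating tendencies), target effects
    (column reputations), and an idiosyncratic self-view per student.
    """
    n = len(ratings)
    off = [(i, j) for i in range(n) for j in range(n) if i != j]
    grand = sum(ratings[i][j] for i, j in off) / len(off)
    perceiver = [sum(ratings[i][j] for j in range(n) if j != i) / (n - 1)
                 - grand for i in range(n)]
    target = [sum(ratings[i][j] for i in range(n) if i != j) / (n - 1)
              - grand for j in range(n)]
    # Idiosyncratic self-view: self-rating minus what the grand mean
    # plus the student's perceiver and target effects would predict.
    self_view = [ratings[i][i] - (grand + perceiver[i] + target[i])
                 for i in range(n)]
    return perceiver, target, self_view
```

Because the effects are mean deviations, the perceiver and target effects each sum to zero across a class, leaving the self-view term to carry each student's over- or under-estimation of their own competence.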

19.
Given the increased use of performance assessments (PAs) in higher education to evaluate achievement of learning outcomes, it is important to address the barriers related to ensuring quality for this type of assessment. This article presents a design-based research (DBR) study that resulted in the development of a Validity Inquiry Process (VIP). The study’s aim was to support faculty in examining the validity and reliability of the interpretation and use of results from locally developed PAs. DBR was determined to be an appropriate method because it is used to study interventions such as an instructional innovation, type of assessment, technology integration, or administrative activity (Anderson & Shattuck, 2012). The VIP provides a collection of instruments and utilizes a reflective practice approach integrating concepts of quality criteria and development of a validity argument as outlined in the literature (M.T. Kane, 2013; Linn, Baker, & Dunbar, 1991; Messick, 1994).

20.
The cognitive diagnostic assessment (CDA) approach has been increasingly applied to non-diagnostic large-scale assessments to extract fine-grained diagnostic feedback about students’ ability in a given domain and meet accountability demands for student achievement. This study aimed to diagnose the reading abilities of 4324 students from 19 European Union (EU) member countries that participated in the 2016 Progress in International Reading Literacy Study (PIRLS), one of the most comprehensive international studies that investigate students’ reading achievement. The PIRLS data were analyzed using the Log-linear Cognitive Diagnosis Model (LCDM), a type of diagnostic classification model (DCM). Students’ weaknesses and strengths were identified based on a four-skill reading ability model. The results revealed that the methodology could provide more fine-grained diagnostic information about students’ reading skills than traditional aggregated-test scoring could. Such information could be utilized by teachers, school administrators, decision-makers, and students for maximizing the learning outcomes of reading programs and instruction.

