Similar Documents
20 similar documents found (search time: 15 ms)
1.
Educational Assessment, 2013, 18(3): 257-272
Concern about the education system has increasingly focused on achievement outcomes and the role of assessment in school performance. Our research with fifth and eighth graders in California explored several issues regarding student performance and rater reliability on hands-on tasks that were administered as part of a field test of a statewide assessment program in science. This research found that raters can produce reliable scores for hands-on tests of science performance. However, the reliability of performance test scores per hour of testing time is quite low relative to multiple-choice tests. Reliability can be improved substantially by adding more tasks (and testing time). Using more than one rater per task produces only a very small improvement in the reliability of a student's total score across tasks. These results were consistent across both grade levels, and they echo the findings of past research.
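
How much adding tasks buys, relative to adding raters, is the familiar Spearman-Brown relationship. As a sketch (the single-task reliability of 0.30 is illustrative, not a figure from the study):

    \rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}, \qquad \rho_1 = 0.30 \;\Rightarrow\; \rho_5 = \frac{5(0.30)}{1 + 4(0.30)} \approx 0.68

Lengthening the test from one task to five more than doubles score reliability, whereas double-scoring the same single task only tightens the comparatively small rater component of error.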

2.
In this article, performance assessments are cast within a sampling framework. More specifically, a performance assessment is viewed as a sample of student performance drawn from a complex universe defined by a combination of all possible tasks, occasions, raters, and measurement methods. Using generalizability theory, we present evidence bearing on the generalizability and convergent validity of performance assessments sampled from a range of measurement facets and measurement methods. Results at both the individual and school level indicate that task-sampling variability is the major source of measurement error. Large numbers of tasks are needed to get a reliable measure of mathematics and science achievement at the elementary level. With respect to convergent validity, results suggest that methods do not converge. Students' performance scores, then, are dependent on both the task and method sampled.
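
In generalizability-theory terms, this conclusion can be read off the generalizability coefficient for a persons x tasks x raters (p × t × r) design; the notation below is standard, and the numbers-free point is qualitative rather than a reproduction of the study's variance components:

    E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt}/n_t + \sigma^2_{pr}/n_r + \sigma^2_{ptr,e}/(n_t n_r)}

When the person-by-task component \sigma^2_{pt} dominates, only increasing the number of tasks n_t shrinks the error term appreciably; adding raters attacks the smaller \sigma^2_{pr} component, which is why large numbers of tasks are needed for a reliable measure.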

3.
Concerns relating to the reliability of teacher and student peer assessments are discussed, and some correlational analyses comparing student and teacher marks described. The benefits of the use of multiple ratings are elaborated. The distinction between gender differences and gender bias is drawn, and some studies which have reported gender bias are reviewed. The issue of ‘blind marking’ is addressed. A technique for detecting gender bias in cases where student raters have awarded marks to same and opposite sex peers is described, and illustrated by data from two case studies. Effect sizes were found to be very small, indicating an absence of gender bias in these two cases. Results are discussed in relation to task and other contextual variables. The authors conclude that the technique described can contribute to the good practice necessary to ensure the success of peer assessment in terms of pedagogical benefits and reliable and fair marking outcomes.
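
The bias-detection technique is described here only in outline. A minimal sketch of the kind of effect size it reports, computed over a hypothetical flat table of (rater sex, ratee sex, mark) records (the data and column layout are assumptions, not the case-study data):

    import statistics

    def cohens_d(group_a, group_b):
        # Standardized mean difference (Cohen's d) between two groups of marks.
        na, nb = len(group_a), len(group_b)
        va, vb = statistics.variance(group_a), statistics.variance(group_b)
        pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
        return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

    # Illustrative records, not data from the case studies: (rater_sex, ratee_sex, mark).
    records = [("F", "F", 62), ("F", "M", 60), ("M", "M", 58), ("M", "F", 61),
               ("F", "F", 65), ("M", "M", 59), ("F", "M", 63), ("M", "F", 60)]
    same_sex = [m for rater, ratee, m in records if rater == ratee]
    opposite_sex = [m for rater, ratee, m in records if rater != ratee]
    print(cohens_d(same_sex, opposite_sex))  # a d near zero suggests no gender bias

A d near zero, as the two case studies report, indicates that marks awarded to same-sex and opposite-sex peers do not differ meaningfully.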

4.
Undergraduate students, and their class teachers, assessed the performance of their peers in three oral and written tasks as part of a group project. The two sets of marks awarded by peers and teachers were subsequently compared to find out whether the students were competent to assess their peers alongside their class teachers and whether this competence, or lack of it, was partly determined by the nature of the task being assessed. A number of statistical tests were run to establish the levels of agreement, the ranges, differences and relationship between peer and teacher assessments. The results have led us to conclude that the peer assessments are not sufficiently reliable to be used to supplement teacher assessments. Students’ competencies in peer assessment do not appear to be dependent on the nature of the task being assessed, but there is some evidence that practical experience of assessing a particular task type can lead to an improvement in students’ assessment skills when they assess a similar task. The paper also discusses possible improvements in peer assessment procedures based on the experiences gained.

5.
The purpose of this study was to compare the effects of two peer assessment methods on university students' academic writing performance and their satisfaction with peer assessment. This study also examined the validity and reliability of student-generated assessment scores. Two hundred and thirty-two predominantly undergraduate students were selected by convenience sampling during the fall semester of 2007. The results indicate that students in the experimental group demonstrated greater improvement in their writing than those in the comparison group, and the findings reveal that students in the experimental group exhibited higher levels of satisfaction with the peer assessment method, in both its structure and its peer feedback, than those in the comparison group. Additionally, the findings indicate that the validity and reliability of student-generated rating scores were extremely high. Using Wiki interactive software to provide an online collaborative learning environment for peer assessment added further value to the process.

6.
Forty science students received training for 12 weeks on delivering effective presentations and on using a tertiary-level English oral presentation scale comprising three subscales (Verbal Communication, Nonverbal Communication, and Content and Organization) measured by 18 items. For their final project, each student was given 10 to 12 min to present on 1 of the 5 compulsory science books for the module and was rated by the tutor, peers, and himself/herself. Many-facet Rasch measurement, correlation, and analysis of variance were performed to analyze the data. The results show that the student raters, tutor, items, and rating scales achieved high psychometric quality, though a small number of assessments exhibited bias. Although all of the biased self-assessments were underestimations of presentation skills, the peer and tutor assessment bias had a mixed pattern. In addition, self-, peer, and tutor assessments had low to medium correlations on the subscales, and a significant difference was found between the assessments. Implications are discussed.

7.
Machine learning has been frequently employed to automatically score constructed-response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed-response questions. 187 in-service science teachers watched the videos, each set in a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the consensus human scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers' performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.
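
For reference, one common form of the many-facet Rasch model for a design like this (teacher n, scenario i, rater j, rating-scale category k; the facet assignment is our reading of the design, not a formula quoted from the study) is:

    \log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k

Here \theta_n is the teacher's PCK measure, \delta_i the scenario difficulty (variability of scenarios), \alpha_j the rater's severity, and \tau_k the difficulty of reaching category k; a rater-by-scenario interaction term can be added to probe rater sensitivity to the scenarios.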

8.
In this study the relationship between domain-specific skills and peer assessment skills as a function of task complexity is investigated. We hypothesised that peer assessment skills are superposed on domain-specific skills and would therefore suffer more when higher cognitive load is induced by increased task complexity. In a mixed factorial design with the between-subjects factor task complexity (simple, n = 51; complex, n = 59) and within-subjects factor task type (domain-specific, peer assessment), secondary school students studied four integrated study tasks, requiring them to learn a domain-specific skill (i.e. identifying the six steps of scientific research) and to learn how to assess a fictitious peer performing the same skill. Additionally, the students performed two domain-specific test tasks and two peer assessment test tasks. The interaction effect found on test performance supports our hypothesis. Implications for the teaching and learning of peer assessment skills are discussed.

9.
The instrument Samples of Teaching Performance (STP) was developed to assess student teachers' capacity to plan, deliver and evaluate a unit of instruction. The current study reports consequential validity data collected from supervisors (n = 20) and student teachers (n = 62) from three elementary and five secondary teacher preparation programs in Chile that participated in the field-testing of the STP. Student teachers described how this assessment had honed their sense of professionalism and promoted learning of the skills assessed. Supervisors reported enlarging the topics discussed with student teachers and making some changes to the supervisory process. These findings are complemented by an analysis of the STP scores obtained by 24 student teachers, which showed better development of instructional skills when compared to pedagogical reasoning and reflection. These results raise questions about the structure of student teaching to support the implementation of standards-based assessments that entail tasks at different levels of cognitive complexity.

10.
To meet recent accountability mandates, school districts are implementing assessment frameworks to document teachers' effectiveness. Observational assessments play a key role in this process, albeit without compelling evidence of their psychometric rigor. Using a sample of kindergarten teachers, we employed Generalizability theory to investigate (across teachers, raters, and lessons) the stability of scores obtained with two different observation measures: the CLASS K-3 and the FFT. We conducted a series of Decision studies to document (for both measures' constituent domains) the number of lessons per teacher and raters per lesson that would justify the use of observation scores for high-stakes decisions. Acceptable, stable scores for individual-level decisions about teachers may generally require more raters and lessons than are typically used in practice (1–2 raters and fewer than 3 lessons). The considerable variability of observation-based scores raises concerns about either measure's appropriateness for making individual or group decisions about teachers' effectiveness.
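
A Decision study projects how the generalizability coefficient would change under hypothetical numbers of raters and lessons, given the estimated variance components. A minimal sketch of that projection, with placeholder variance components rather than the CLASS or FFT estimates:

    # D-study projection for a teachers x raters x lessons design.
    # Variance components below are illustrative placeholders, not the study's estimates.
    var_teacher, var_tr, var_tl, var_resid = 0.40, 0.10, 0.25, 0.25

    def g_coefficient(n_raters: int, n_lessons: int) -> float:
        # Projected generalizability coefficient for relative decisions about teachers.
        rel_error = (var_tr / n_raters + var_tl / n_lessons
                     + var_resid / (n_raters * n_lessons))
        return var_teacher / (var_teacher + rel_error)

    for n_r in (1, 2, 3):
        for n_l in (1, 3, 6):
            print(f"raters={n_r}, lessons={n_l}: G = {g_coefficient(n_r, n_l):.2f}")

With components like these, one rater and three lessons yields G of about 0.60, well short of the 0.80 often demanded for high-stakes decisions, which mirrors the paper's conclusion that typical practice uses too few raters and lessons.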

11.
The psychometric measures of accuracy, reliability and validity of peer assessment are critical qualities for its use as a supplement to instructor grading. In this study, we seek to determine which factors related to peer review are the most influential on these psychometric measures, with a primary focus on the accuracy of peer assessment or how closely peer-given grades match those of an instructor. We examine and rank the correlations of accuracy, reliability and validity with 17 quantitative and qualitative variables for three senior undergraduate courses that used peer assessment on high value written assignments. Based on these analyses, we altered the single most significant variable of one of the courses. We demonstrate that the number of reviews completed per reviewer has the greatest influence on the accuracy of peer assessment out of all the factors analysed. Our calculations suggest that six reviews must be completed per reviewer to achieve peer assessment that is no different from the grading of an instructor. Effective training, previous experience and strong academic abilities in the reviewers may reduce this number.
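
The headline accuracy metric, how closely aggregated peer grades track the instructor's, can be computed directly. A minimal sketch with hypothetical grades (statistics.correlation requires Python 3.10+):

    from statistics import correlation, mean

    # Hypothetical data: instructor grades and the peer grades each assignment received.
    instructor = [72, 65, 80, 58, 90, 75]
    peer_sets = [[70, 74], [60, 66], [78, 85], [55, 62], [88, 93], [70, 77]]

    peer_means = [mean(grades) for grades in peer_sets]
    accuracy = correlation(instructor, peer_means)  # Pearson r
    print(f"peer-instructor agreement: r = {accuracy:.2f}")

The paper's finding is that this agreement climbs as each reviewer completes more reviews, reaching instructor-equivalent grading at around six reviews per reviewer.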

12.
In the context of widening participation, large classes and increased diversity, assessment of student learning is becoming increasingly problematic in that providing formative feedback aimed at developing student writing proves to be particularly laborious. Although the potential value of peer assessment has been well documented in the literature, the associated administrative burden, also in relation to managing anonymity and intellectual ownership, makes this option less attractive, particularly in large classes. A potential solution involves the use of information and communication technologies to automate the logistics associated with peer assessment in a time-efficient way. However, uptake of such systems in the higher education community is limited, and research in this area is only beginning. This case study reports on the use of the Moodle Workshop module for formative peer assessment of students’ individual work in a first-year introductory macro-economics class of over 800 students. Data were collected through an end-of-course evaluation survey of students. The study found that using the feature-rich Workshop module not only addressed many of the practical challenges associated with paper-based peer assessments, but also provided a range of additional options for enhancing validity and reliability of peer assessments that would not be possible with paper-based systems.

13.
Fieldwork training is a key component of several practical disciplines. In this study, students’ peer assessment of fieldwork is explored as a method to improve their practical training. Peer assessment theories are first discussed. A framework for peer assessment of fieldwork is proposed, and the steps taken to prepare students for this task are discussed. A marking, feedback and moderation tool developed for this assessment is presented. Application of peer assessment in the field was investigated over a period of two years in one undergraduate unit in the geospatial discipline as an example. Reliability of peer assessment was estimated by measuring the difference between assessments carried out by groups of peer assessors, and its validity was measured by comparing students’ marks with those given by tutors. Results show that students have gained from the peer assessment process, mainly as a formative form of assessment, by better understanding and endeavouring to achieve the objectives of field tasks. Tutors use the differences between the student groups’ assessments and their own to identify field components whose content and assessment criteria need better explanation.

14.
We report one teacher’s response to a top-down shift from external examinations to internal teacher assessment for summative purposes in the Republic of Ireland. The teacher adopted a comparative judgement approach to the assessment of secondary students’ understanding of a chemistry experiment. The aims of the research were to investigate whether comparative judgement can produce assessment outcomes that are valid and reliable without producing undue workload for the teachers involved. Comparative judgement outcomes correlated as expected both with test marks and with existing student achievement data, supporting the validity of the approach. Further analysis suggested that teacher judgement privileged scientific understanding, whereas marking privileged factual recall. The estimated reliability of the outcome was acceptably high, but comparative judgement was notably more time-consuming than marking. We consider how validity and efficiency might be improved and the contributions that comparative judgement might offer to summative assessment, moderation of teacher assessment and peer assessment.
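
Comparative judgement replaces absolute marking with repeated pairwise decisions, which are typically fitted with a Bradley-Terry model to recover a quality scale for the scripts. A minimal sketch of that estimation, using made-up judgement outcomes (the abstract does not specify the fitting method; the fixed-point update shown is the standard Zermelo/MM iteration):

    import math

    # Each (winner, loser) pair is one pairwise judgement over four hypothetical scripts.
    comparisons = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2), (0, 1), (2, 1)]
    n_scripts = 4
    strength = [1.0] * n_scripts

    for _ in range(200):  # fixed-point iteration for Bradley-Terry strengths
        wins = [0] * n_scripts
        denom = [0.0] * n_scripts
        for winner, loser in comparisons:
            wins[winner] += 1
            shared = 1.0 / (strength[winner] + strength[loser])
            denom[winner] += shared
            denom[loser] += shared
        strength = [wins[i] / denom[i] for i in range(n_scripts)]
        total = sum(strength)
        strength = [s / total for s in strength]  # normalize for identifiability

    quality = [math.log(s) for s in strength]  # logit-scale script quality estimates
    print(quality)

Reliability is then estimated from the separation of these quality parameters, and validity by correlating them with external measures, as the study does with test marks and prior achievement data.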

15.
This study used an online peer assessment activity to help 47 college students to learn biology through writing. Each student submitted a biology writing report to an online system and then experienced three rounds of peer assessment. During the online peer assessment process, self, peer and expert evaluation scores for the writing were gathered across three rounds. It was found that self-assessment scores were not quite consistent with the expert's scores, but the peer assessment scores demonstrated adequate validity against the expert's evaluation. In particular, as the students completed more rounds of peer assessment reviewing the writing, the validity of the peer scores was enhanced. An examination of the students' writing scores, allocated by peers and expert, indicated that the students significantly improved their writing as the peer assessment activity proceeded. Content analyses of the students' writing also revealed that their writing gradually developed, with significantly better coverage, richness and organization resulting from the online peer assessment activity.

16.
Peer assessment of long written tasks poses particular problems as these tasks typically involve complex learning and solving ill‐structured problems which require divergent responses. Marking reliability of this kind of writing task is difficult to achieve. The author illustrates this through an evaluation of two implementations of peer assessment, involving 81 students, in a UK university. In these implementations, all peer assessor grades were returned to students (not just mean grades). In this way students were exposed to subjectivity in marking. The implementations were evaluated through questionnaires, focus groups, observations of lectures and tutor interview. While students reported a better understanding of quality in student writing as a result of their experience, many complained that peer assessors’ marks were not ‘fair’. The article draws on recent research on the reliability of tutor marking to argue that marking judgements are subjective and that peer assessment offers the opportunity to explore subjectivity in marking, creating an opportunity for dialogue between tutors and students.

17.
In recent years, at the same time that performance assessments in science have become more popular, the number of English language learners (ELLs) (i.e., students whose native language is other than English) served by the U.S. educational system has also increased rapidly. While the research base is growing in each of these areas independently, little attention has been paid to their intersection. This case study of the use of a science performance assessment with 96 ELLs in five high school science classes investigated the face, construct, and consequential validity of this intersection. Qualitative and quantitative data analyses showed that both teachers and students had an overall favorable response to the assessment, although students' English comprehension and expression skills were determining factors for certain items. While most responses were reliably scored, ELL spelling and syntax on certain responses were significant sources of error. The degree of specificity of teachers' guidance also significantly affected students' scores. Recommendations from this study include increasing the clarity of an assessment's design, allowing ELLs more time to complete assessments, and scoring by raters who are knowledgeable about typical patterns in written English for this student population. Furthermore, it is recommended that the use of performance assessments with ELLs be exploratory until such time as their validity and reliability with this population can be more adequately established. J Res Sci Teach 34: 721–743, 1997.

18.
In this paper, assessments of faculty performance for the determination of salary increases are analyzed to estimate interrater reliability. Using the independent ratings by six elected members of the faculty, correlations between the ratings are calculated and estimates of the reliability of the composite (group) ratings are generated. Average intercorrelations are found to range from 0.603 for teaching to 0.850 for research. The average intercorrelation for the overall faculty ratings is 0.794. Using these correlations, the reliability of the six-person group (the composite reliability) is estimated to be over 0.900 for each of the three areas and 0.959 for the overall faculty rating. Furthermore, little correlation is found between the ratings of performance levels of individual faculty members in the three areas of research, teaching, and service. The high intercorrelations and, consequently, the high composite reliabilities suggest that a reduction in the number of raters would have relatively small effects on reliability. The findings are discussed in terms of their relationship to issues of validity as well as to other questions of faculty assessment.
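
The composite figures follow from the Spearman-Brown prophecy applied to the average intercorrelations. For the overall faculty rating, with k = 6 raters and mean intercorrelation 0.794:

    \rho_{kk} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}} = \frac{6(0.794)}{1 + 5(0.794)} \approx 0.959

which reproduces the reported composite reliability. Re-running the formula with k = 5 gives roughly 0.95, which is the arithmetic behind the claim that dropping a rater would cost little reliability.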

19.
In writing assessment, the inconsistency of teachers’ scorings is among the most frequently reported concerns regarding the validity and reliability of assessment. The study aimed to find out to what extent participating in a community of assessment practice (CAP) can affect the discrepancies among raters’ scorings. Adopting a one-group pretest-posttest design, patterns in the teachers’ scoring judgements were explored based on both quantitative and qualitative data. The results indicate a significant increase in the degree of agreement among the teachers’ scorings, with changes in their severity tendencies for the structural variety, lexical accuracy, organization and mechanics criteria, while their scoring judgements on the structural accuracy, task achievement and lexical variety criteria retained low levels of agreement.

20.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
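
The two response processes correspond to different item response functions. In one common parameterization (the unfolding form is Andrich and Luo's hyperbolic cosine model in its simplified version with a common unit parameter \gamma; the abstract itself gives no formulas):

    \text{cumulative (Rasch):}\qquad P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

    \text{unfolding (HCM):}\qquad P(X_{ni}=1) = \frac{\exp(\gamma)}{\exp(\gamma) + 2\cosh(\theta_n - \delta_i)}

Under the cumulative form the probability of the higher rating rises monotonically with \theta_n - \delta_i; under the unfolding form it peaks where the person and item locations coincide and falls away in both directions, which is why the two models suit different rater judgment processes.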
