Similar documents
20 similar documents were retrieved (search time: 359 ms).
1.
Research evaluating student ratings of professors reveals continued controversy. Interpretations of student ratings of professors in terms of face validity are marred by halo effects, the apparent inability of even skilled raters to judge complex behaviors adequately, the salience of personality features in judging tasks, and a host of other variables. Research shows student ratings to be reliable, but design flaws in simple, first-order predictions usually omit the teacher as a cause. Interpretations of research are confusing because of justifications that indiscriminately involve nomological and applied models. Rating scale peculiarities, questionable validity, and scholastic homogeneity lead to diverse professional attitudes towards student opinions of professors, with a learner or consumer emphasis occupying the extremes. Several evaluation schemes are noted, along with behaviors that tend to produce favorable student opinions.

2.
Abstract

This study explored relationships among performance assessments of student teachers made by three groups of raters. A 78-item evaluation instrument was administered to 47 student teachers and to their academic and field supervisors. Analysis of the seven subsections of the instrument revealed that student teachers' self-evaluations were significantly higher, for most categories, than either academic or field supervisors' ratings. The high degree of agreement between the two types of supervisory ratings on all categories suggests the presence of halo effects.

3.
Despite considerable interest in the topic of instructional quality in research as well as practice, little is known about the quality of its assessment. Using generalizability analysis as well as content analysis, the present study investigates how reliably and validly instructional quality is measured by observer ratings. Twelve trained raters judged 57 videotaped lesson sequences with regard to aspects of domain-independent instructional quality. Additionally, 3 of these sequences were judged by 390 untrained raters (i.e., student teachers and teachers). Depending on scale level and dimension, 16–44% of the variance in ratings could be attributed to instructional quality, whereas rater bias accounted for 12–40% of the variance. Although the trained raters referred more often to aspects considered essential for instructional quality, this was not reflected in the reliability of their ratings. The results indicate that observer ratings should be treated in a more differentiated manner in the future.
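The variance shares reported above are the kind of quantities a generalizability (G) study yields. Below is a minimal sketch of such a decomposition, assuming a fully crossed lessons × raters design with one rating per cell; the simulated ratings array and the component labels are hypothetical stand-ins, not the study's data or code.

import numpy as np

# Hypothetical data: rows = lesson sequences, columns = trained raters,
# one rating per lesson-rater cell (fully crossed design).
rng = np.random.default_rng(0)
ratings = rng.normal(loc=3.0, scale=0.7, size=(57, 12))  # placeholder values

n_l, n_r = ratings.shape
grand = ratings.mean()
lesson_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# Mean squares for a two-way crossed design without replication.
ms_lesson = n_r * np.sum((lesson_means - grand) ** 2) / (n_l - 1)
ms_rater = n_l * np.sum((rater_means - grand) ** 2) / (n_r - 1)
resid = ratings - lesson_means[:, None] - rater_means[None, :] + grand
ms_resid = np.sum(resid ** 2) / ((n_l - 1) * (n_r - 1))

# Expected-mean-square solutions for the variance components.
var_resid = ms_resid
var_lesson = max((ms_lesson - ms_resid) / n_r, 0.0)   # "instructional quality"
var_rater = max((ms_rater - ms_resid) / n_l, 0.0)     # "rater bias"

total = var_lesson + var_rater + var_resid
print(f"lesson: {var_lesson / total:.1%}, rater: {var_rater / total:.1%}, "
      f"residual: {var_resid / total:.1%}")

# Relative G coefficient for the mean of k raters.
k = 12
g_coeff = var_lesson / (var_lesson + var_resid / k)
print(f"G coefficient (k={k} raters): {g_coeff:.2f}")

The last two lines turn the variance shares into a reliability statement for a score averaged over k raters, which is the usual way such percentages are interpreted.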

4.
This study examines the agreement across informant pairs of teachers, parents, and students regarding the students' social-emotional learning (SEL) competencies. Two student subsamples representative of the Social Skills Improvement System (SSIS) SEL edition rating forms national standardization sample were examined: first, 168 students (3rd to 12th grades) with ratings by three informants (a teacher, a parent, and the student him/herself), and a second group of 164 students who had ratings by two raters in a similar role (two parents or two teachers). To assess interrater agreement, two methods were employed: calculation of q correlations among pairs of raters and effect size indices to capture the extent to which rater pairs differed in their assessments of social-emotional skills. The empirical results indicated that pairs of different types of informants exhibited greater than chance levels of agreement as indexed by significant interrater correlations; teacher–parent informants showed higher correlations than teacher–student or parent–student pairs across all SEL competency domains assessed, and pairs of similar informants exhibited significantly higher correlations than pairs of dissimilar informants. Study limitations are identified and future research needs outlined.
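As a hedged illustration of the two kinds of agreement indices mentioned (not the study's exact q-correlation procedure), the sketch below computes a Pearson correlation between two informants' scores and a paired-difference effect size; the informant labels and simulated scores are hypothetical.

import numpy as np
from scipy import stats

# Hypothetical SEL competency scores for the same students from two informants.
rng = np.random.default_rng(1)
teacher = rng.normal(100, 15, size=168)
parent = teacher + rng.normal(0, 10, size=168)  # correlated but noisier

# Interrater agreement as a correlation between the two sets of ratings.
r, p = stats.pearsonr(teacher, parent)

# Effect size for the mean difference between the two informants
# (Cohen's d for paired ratings; a common choice, assumed here).
diff = teacher - parent
d = diff.mean() / diff.std(ddof=1)

print(f"interrater r = {r:.2f} (p = {p:.3f}), paired d = {d:.2f}")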

5.
Based on the review of student ratings myths by Aleamoni (1987, 1999), a survey research design was used to analyse differences between college students' (n = 968) and faculty's (n = 34) perceptions. Generally, students held stronger beliefs in these myths, in that they believed faculty with excellent publication records were better qualified to evaluate teaching and that student ratings on single general items are accurate measures of teaching effectiveness. On the other hand, faculty believed that student ratings were invalid and unreliable. Further examination of student characteristics revealed that male students held stronger beliefs in these myths. Finally, students' beliefs in these myths were correlated with their actual ratings on the nine dimensions of the Student Evaluation of Educational Quality. A discussion and suggestions for using student ratings are provided.

6.
Concerns relating to the reliability of teacher and student peer assessments are discussed, and some correlational analyses comparing student and teacher marks are described. The benefits of the use of multiple ratings are elaborated. The distinction between gender differences and gender bias is drawn, and some studies which have reported gender bias are reviewed. The issue of ‘blind marking’ is addressed. A technique for detecting gender bias in cases where student raters have awarded marks to same- and opposite-sex peers is described, and illustrated by data from two case studies. Effect sizes were found to be very small, indicating an absence of gender bias in these two cases. Results are discussed in relation to task and other contextual variables. The authors conclude that the technique described can contribute to the good practice necessary to ensure the success of peer assessment in terms of pedagogical benefits and reliable and fair marking outcomes.
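A minimal sketch of the kind of same-sex versus opposite-sex comparison described here is given below, assuming marks can be grouped by the rater's and ratee's sex and summarized with Cohen's d; the data frame and column names are hypothetical, not the case-study data.

import numpy as np
import pandas as pd

# Hypothetical peer-assessment records: each row is one mark awarded.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "rater_sex": rng.choice(["F", "M"], size=200),
    "ratee_sex": rng.choice(["F", "M"], size=200),
    "mark": rng.normal(65, 8, size=200),
})

same = df.loc[df.rater_sex == df.ratee_sex, "mark"]
opposite = df.loc[df.rater_sex != df.ratee_sex, "mark"]

# Cohen's d with a pooled standard deviation; values near zero would
# indicate an absence of gender bias, as reported in the two case studies.
pooled_sd = np.sqrt(((same.var(ddof=1) * (len(same) - 1)) +
                     (opposite.var(ddof=1) * (len(opposite) - 1))) /
                    (len(same) + len(opposite) - 2))
d = (same.mean() - opposite.mean()) / pooled_sd
print(f"same-sex vs opposite-sex marks: d = {d:.2f}")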

7.
Many states are implementing direct writing assessments to assess student achievement. While much literature has investigated minimizing raters' effects on writing scores, little attention has been given to the type of model used to prepare raters to score direct writing assessments. This study reports on an investigation that occurred in a state-mandated writing program when a scoring anomaly became apparent once assessments were put into operation. The study indicates that using a spiral model for training raters and scoring papers results in higher mean ratings than does using a sequential model for training and scoring. Findings suggest that making decisions about cut-scores based on pilot data has important implications for program implementation.

8.
Performance judgment is a situation of incomplete information where raters' inference would play an important role. Consequently, the schematic nature of human cognition may introduce implicit personality theory bias in performance judgment. To demonstrate this, a causal model of performance rating judgment was framed from the theories of person perception and social cognition. The model yielded a good fit to the data obtained from a performance rating task where the availability of performance information was manipulated. The results supported the hypotheses that student raters' inferences are partly contaminated by their implicit theories of a good instructor. Student raters inferred traits and behaviors and provided ratings for corresponding items even when the instructor behavior was limited to a subset of performance data only. The findings imply that one aspect of invalidity in student ratings of instructors is the bias in human inference due to the implicit theories of effective instructional behavior.

9.
The effects of rating scale format (behaviorally anchored vs. Likert) and rater training on leniency and halo in student ratings of instruction were investigated. The subjects (N = 269) were students enrolled in required courses at a graduate theological seminary in the Southwest United States. A repeated measures design controlling for teacher and course was used. Findings indicated: (a) training was effective in reducing leniency and halo in ratings from both instruments; (b) trained raters exhibited less leniency on two rating dimensions when using behaviorally anchored rating scales (BARS) than when using the Likert scale; and (c) trained raters exhibited less halo when using the Likert scale than when using the BARS. The findings demonstrate the importance of focusing efforts to improve the quality of ratings on the students rather than on the format of the instrument. (Presented at the Twenty-Eighth Annual Forum of the Association for Institutional Research, Phoenix, Ariz., May 1988.)
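Leniency and halo can be operationalized in several ways; the sketch below uses two common choices as an assumption (mean elevation above the scale midpoint for leniency, and the average intercorrelation among rating dimensions for halo), pooled over all respondents rather than computed per rater. The simulated ratings are hypothetical.

import numpy as np

# Hypothetical ratings: students x rating dimensions, on a 1-7 scale.
rng = np.random.default_rng(5)
ratings = np.clip(rng.normal(5.2, 1.0, size=(269, 6)), 1, 7)

scale_midpoint = 4.0

# Leniency index: average elevation of ratings above the scale midpoint.
leniency = ratings.mean() - scale_midpoint

# Halo index: average intercorrelation among the rating dimensions
# (higher values suggest the dimensions are not being differentiated).
# Computed across all students here as a simplification; a per-rater
# version would average within-rater correlations instead.
corr = np.corrcoef(ratings, rowvar=False)
upper = corr[np.triu_indices_from(corr, k=1)]
halo = upper.mean()

print(f"leniency = {leniency:.2f}, mean inter-dimension r = {halo:.2f}")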

10.
Recent work in reading and writing theory, research, and pedagogy has raised questions about relationships between fluent reading processes and holistic scoring of essays (e.g., Huot, 1993). In holistic scoring settings, are the raters behaving as normal fluent readers (i.e., readers interacting critically and personally with the text), or are they somehow disconnected from their normal reader responses because they are using reliable scoring guides? Related questions concern the behavior of such holistic raters when they are teachers (e.g., Barritt, Stock, & Clark, 1986), and when those teachers respond to student writing (Connors & Lunsford, 1993). How are teachers/raters behaving, and what are they responding to in judging the writing? Previous research has suggested a role for personality type in the study of the process of writing evaluation (Jensen & DiTiberio, 1984, 1989). Thus, it seems reasonable to ask what role personality types play in the holistic evaluation of writing. This empirical study addressed the general question: What role, if any, do personality types of writers and of raters play in the holistic rating of writing? Moreover, is there a relationship between writers' personalities and raters' personalities? Writers were native English-speaking university freshman composition students; raters were native English-speaking university freshman composition instructors. Results indicate that the personality types of writers affect the ratings their essays receive, and the personality types of raters affect the ratings they give to essays. However, there is no significant relationship between writers' styles and raters' styles. Implications for future research, as well as classroom implications of these results, are discussed.

11.
The problem of measuring instructional effectiveness was examined, and a rationale was offered for employing “student progress on relevant objectives” for this purpose. To assess such progress, it was suggested that instructor ratings of the importance of objectives be combined with student ratings of progress on these objectives. On the basis of this suggestion, data were collected from 708 undergraduate classes at Kansas State University. An analysis of these data resulted in the following conclusions:
  1. Faculty members appeared to make reliable judgments of the relative importance of these objectives.
  2. Student progress ratings were made with acceptable reliability when there were 20–25 raters. Reliability of the overall progress measure was satisfactory when only 10 raters were used.
  3. Students used some discrimination in rating progress on various objectives, but their ratings were also noticeably subject to the halo effect.
  4. An indirect test of the validity of class progress ratings yielded positive results.
The proposed method of evaluating instruction appears generally feasible and useful. Its application would provide a practical approach to judging teaching success. Such an approach is essential before investigations of how teaching might be improved can be undertaken.
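The dependence of rating reliability on the number of student raters (conclusion 2 above) is the kind of relationship described by the Spearman-Brown prophecy formula; the single-rater reliability of 0.20 in the worked values below is an assumed figure for illustration, not one reported in the article:

\[
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1},
\qquad \rho_1 = 0.20 \;\Rightarrow\; \rho_{10} \approx 0.71, \quad \rho_{25} \approx 0.86 .
\]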

12.
This study examined the effectiveness of rater training for prospective oral examiners for PETS Level 1 (PETS-1). Many-facet Rasch analysis was used to compare the examiners' rating performance before and after training. The results showed that, after training, the rate of exact agreement between examiners' and expert ratings increased, examiners who had rated too severely made appropriate adjustments to their application of the rating criteria, and all examiners' fit statistics fell within the acceptable range. Overall, the rater training was fairly effective and improved rating accuracy after training. Many-facet Rasch analysis helps identify examiners who rate too leniently or too severely, those with poor fit statistics, and anomalous ratings, and thus provides a reliable basis for targeted training.
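The many-facet Rasch analysis referred to here is usually written as a log-odds model for adjacent rating categories, with a separate facet for rater severity; the formulation below is the standard one, not notation taken from the article:

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k ,
\]

where \(\theta_n\) is examinee \(n\)'s ability, \(\delta_i\) the difficulty of task \(i\), \(\alpha_j\) the severity of rater \(j\) (the examiner), and \(\tau_k\) the threshold between categories \(k-1\) and \(k\). Training effects then show up as changes in the estimated \(\alpha_j\) and in rater fit statistics.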

13.
In an essay rating study, multiple ratings may be obtained by having different raters judge essays or by having the same rater(s) repeat the judging of essays. An important question in the analysis of essay ratings is whether multiple ratings, however obtained, may be assumed to represent the same true scores. When different raters judge the same essays only once, it is impossible to answer this question. In this study, 16 raters judged 105 essays on two occasions; hence, it was possible to test assumptions about true scores within the framework of linear structural equation models. It emerged that the ratings of a given rater on the two occasions represented the same true scores. However, the ratings of different raters did not represent the same true scores. The estimated intercorrelations of the true scores of different raters ranged from .415 to .910. Parameters of the best-fitting model were used to compute coefficients of reliability, validity, and invalidity. The implications of these coefficients are discussed.
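The hypothesis at issue, whether two sets of ratings represent the same true scores, can be written in classical true-score form; the notation below is a generic sketch of that setup, not the article's own model specification:

\[
X_{jt} = \lambda_{jt}\,\tau_j + \varepsilon_{jt},
\]

where \(X_{jt}\) is rater \(j\)'s rating on occasion \(t\) and \(\tau_j\) is that rater's true score. Equal true scores within a rater correspond to a single \(\tau_j\) fitting both occasions, while true-score intercorrelations between raters that fall below 1 (here .415 to .910) mean that different raters' ratings cannot be treated as measures of the same true score.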

14.
Rater-mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater-mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater-mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater-mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater-mediated assessments are suggested.
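The cumulative response process mentioned here is the one modeled by the dichotomous Rasch model, shown below in its standard textbook form (not notation taken from the article); an unfolding model such as the hyperbolic cosine model instead uses a single-peaked function of the distance between the person location \(\theta_n\) and the item location \(\delta_i\), so that endorsement is most likely when \(\theta_n \approx \delta_i\):

\[
P(X_{ni}=1 \mid \theta_n, \delta_i)
= \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)} .
\]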

15.
When multiple raters score a writing sample, on occasion they will award discrepant scores. To report a single score to the examinee, some method of resolving those differences must be applied to the ratings before an operational score can be reported. Several forms of resolving score discrepancies have been described in the literature. Initial studies of the various methods, however, have demonstrated that decisions about student performance may differ depending on the resolution method applied. Thus, studies are needed to investigate the quality of the scores associated with each model. To study score quality associated with each model, we conducted a Monte Carlo study and varied the factors associated with scoring and resolution to determine the conditions under which a particular resolution method might be superior.
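A minimal sketch of the simulation logic behind such a Monte Carlo comparison is given below, assuming a true-score-plus-error rating model and three illustrative resolution rules (averaging the two ratings, adding a third rating for discrepant cases, or replacing discrepant pairs with an adjudicated score); the rules, factor levels, and numbers are hypothetical, not those manipulated in the study.

import numpy as np

rng = np.random.default_rng(6)
n_essays, rater_sd = 10_000, 0.8

# True essay quality on a 1-6 scale and two independent operational ratings.
true = np.clip(rng.normal(3.5, 1.0, n_essays), 1, 6)
r1 = np.rint(np.clip(true + rng.normal(0, rater_sd, n_essays), 1, 6))
r2 = np.rint(np.clip(true + rng.normal(0, rater_sd, n_essays), 1, 6))

discrepant = np.abs(r1 - r2) > 1  # resolution needed beyond adjacent agreement

def resolve(rule: str) -> np.ndarray:
    """Return operational scores under one hypothetical resolution rule."""
    score = (r1 + r2) / 2
    if rule == "average":
        return score
    third = np.rint(np.clip(true + rng.normal(0, rater_sd, n_essays), 1, 6))
    if rule == "third_rater_mean":
        score = np.where(discrepant, (r1 + r2 + third) / 3, score)
    elif rule == "expert_replaces":
        expert = np.rint(np.clip(true + rng.normal(0, rater_sd / 2, n_essays), 1, 6))
        score = np.where(discrepant, expert, score)
    return score

for rule in ["average", "third_rater_mean", "expert_replaces"]:
    corr = np.corrcoef(true, resolve(rule))[0, 1]
    print(f"{rule:>18}: correlation with true score = {corr:.3f}")

Correlating each resolved score with the generating true score is one simple way to compare score quality across rules; the study itself varied additional scoring and resolution factors.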

16.
The purpose of this study was to elucidate the psychological concomitants of discrepancies between fourth- to sixth-grade children's perceptions of academic competence and 2 measures of their "actual" competence in this domain: teacher ratings and achievement test scores. Over-, under-, and congruent child raters were identified on the basis of the 2 external standards and then compared on child and teacher ratings of self-esteem, self-regulatory style, and coping with perceived failure. 6 teachers and 121 lower- to upper-middle-class suburban students participated. As predicted, no differences were obtained between congruent and distorted (combined over- and under-) raters on these self-system variables. Consistent with previous research, overrating children showed higher self-esteem on self- and teacher ratings than underraters. After controlling for level of perceived competence, overraters scored higher on anxiety, and, when overrating occurred against the teacher standard, these children were rated by teachers as having lower self-esteem, poorer coping strategies, and less internalized self-regulatory styles. Comparing the 2 standards, self-reported difficulties were associated with underrating against the teacher's standard but not the achievement standard. Teacher-reported difficulties were associated with the opposite pattern of underrating against the 2 standards. Motivational factors contributing to patterns of discrepancies are discussed, as are the educational implications of mismatches between teacher and student perceptions of objective and intrapsychic aspects of school experience.

17.
ABSTRACT

A Bayesian IRT-model approach was used to investigate the validity and reliability of student perceptions of teaching quality. Furthermore, the student perceptions were compared with ratings of teaching quality by external observers. Grade 4 students (n = 675) filled out a questionnaire that was used to measure their opinions about the lessons of their teachers. Three lessons of 39 teachers were recorded and rated by 4 raters. The analyses showed that student perception and lesson observation scales fit best in an 11-dimensional model, which was an indication of construct validity and discriminant validity. Student perception scales were reliable, although not all items contributed to the scales to the same extent. Student ratings and lesson observation scores generally correlated moderately (ranging from r = .18 to r = .50). Higher correlations were found for scales with similar content; however, no clear pattern was apparent. Suggestions for future research are presented.

18.
Rating criteria are critically important in writing assessment, and the use of different scoring methods affects raters' rating behavior. This study shows that although both holistic and analytic scoring of English writing are reliable, rater severity and examinees' writing scores vary considerably between the two methods. Overall, under holistic scoring, raters' severity tends to converge and approaches the ideal value; under analytic scoring, examinees obtain higher writing scores, and raters differ significantly in severity. Holistic scoring is therefore the more highly recommended method for high-stakes examinations that shape examinees' futures.

19.
This study of the reliability and validity of scales from the Child's Report of Parental Behavior (CRPBI) presents data on the utility of aggregating the ratings of multiple observers. Subjects were 680 individuals from 170 families. The participants in each family were a college freshman student, the mother, the father, and 1 sibling. The results revealed moderate internal consistency (M = .71) for all rater types on the 18 subscales of the CRPBI, but low interrater agreement (M = .30). The same factor structure was observed across the 4 rater types; however, aggregation within raters across salient scales to form estimated factor scores did not improve rater convergence appreciably (M = .36). Aggregation of factor scores across 2 raters yielded much higher convergence (M = .51), and the 4-rater aggregates yielded impressive generalizability coefficients (M = .69). These and other analyses suggested that the responses of each family member contained a small proportion of true variance and a substantial proportion of factor-specific systematic error. The latter can be greatly reduced by aggregating scores across multiple raters.

20.
Abstract

The purposes of the present study were to investigate the influence of three sets of instructions, class level, and academic rank on teacher/course evaluation by student raters. Students did not differ in their teacher/course evaluation ratings when the instructions specified that the evaluation results would be used: (a) only by the instructor, (b) by the administration, or (c) by students for course selection purposes. The evaluation of graduate courses did not differ from that of undergraduate courses. A statistically significant difference was found among the academic ranks examined. Specifically, graduate teaching assistants received higher ratings than did either assistant or full professors.

