期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An Investigation of Undefined Cut Scores With the Hofstee Standard‐Setting Method

Adam E. Wyse Ben Babcock 《Educational Measurement》2017,36(4):28-34

This article provides an overview of the Hofstee standard‐setting method and illustrates several situations where the Hofstee method will produce undefined cut scores. The situations where the cut scores will be undefined involve cases where the line segment derived from the Hofstee ratings does not intersect the score distribution curve based on actual exam performance data. Data from 15 standard settings performed by a credentialing organization are used to investigate how common undefined cut scores are with the Hofstee method and to compare cut scores derived from the Hofstee method with those from the Beuk method. Results suggest that when Hofstee cut scores exist that the Hofstee and Beuk methods often yield fairly similar results. However, it is shown that undefined Hofstee cut scores did occur in a few situations. When Hofstee cut scores are undefined, it is suggested that one extend the Hofstee line segment so that it intersects the score distribution curve to estimate cut scores. Analyses show that extending the line segment to estimate cut scores often yields similar results to the Beuk method. The article concludes with a discussion of what these results may imply for people who want to employ the Hofstee method. 相似文献

2.

Using Diagnostic Profiles to Describe Borderline Performance in Standard Setting

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Educational Measurement》2020,39(1):45-51

In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group study of borderline examinees, or a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists’ conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles. 相似文献

3.

An Experimental Study of the Internal Consistency of Judgments Made in Bookmark Standard Setting

下载免费PDF全文

Brian E. Clauser Peter Baldwin Melissa J. Margolis Janet Mee Marcia Winward 《Journal of Educational Measurement》2017,54(4):481-497

Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria—especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard‐setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts’ bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure. 相似文献

4.

Maintaining Equivalent Cut Scores for Small Sample Test Forms

Andrew C. Dwyer 《Journal of Educational Measurement》2016,53(1):3-22

This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common‐item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common‐item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content‐equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common‐item equating (circle‐arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small‐sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard. 相似文献

5.

Multivariate Generalizability Analysis of the Impact of Training and Examinee Performance Information on Judgments Made in an Angoff-Style Standard-Setting Procedure

Brian E. Clauser David B. Swanson Polina Harik 《Journal of Educational Measurement》2002,39(4):269-290

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent. 相似文献

6.

Evaluating the Operational Feasibility of Using Subsets of Items to Recommend Minimal Competency Cut Scores

Priya Kannan Adrienne Sgammato Richard J. Tannenbaum 《教育实用测度》2015,28(4):292-307

Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggest that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of ?45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected at each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test. 相似文献

7.

Diagnostic Profiles: A Standard Setting Method for Use With a Cognitive Diagnostic Model

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Journal of Educational Measurement》2016,53(4):448-458

This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard. 相似文献

8.

Combining Data on Criticality and Frequency in Developing Test Plans for Licensure and Certification Examinations

Michael T. Kane Carole Kingshury Dean Colton Carmen Estes 《Journal of Educational Measurement》1989,26(1):17-27

Job analysis is a critical component in evaluating the validity of many high-stakes testing programs, particularly those used for licensure or certification. The ratings of criticality and frequency of various activities that are derived from such job analyses can be combined in a number of ways. This paper develops a multiplicative model as a natural and effective way to combine ratings o f frequency and criticality in order to obtain estimates of the relative importance of different activities for practice. An example of the model's use is presented. The multiplicative model incorporates adjustments to ensure that the effective weights of frequency and criticality are appropriate. 相似文献

9.

The Impact of Examinee Performance Information on Judges’ Cut Scores in Modified Angoff Standard‐Setting Exercises

Melissa J. Margolis Brian E. Clauser 《Educational Measurement》2014,33(1):15-22

This research evaluated the impact of a common modification to Angoff standard‐setting exercises: the provision of examinee performance data. Data from 18 independent standard‐setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations pre‐ and post‐data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard‐setting exercises. This study is the first to provide a large‐scale, systematic evaluation of the impact of a common standard setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores. 相似文献

10.

Weighing the difference: the validity of multiplicative and subtractive approaches to item weights in an instrument assessing college choice decisions

Michael J. Roszkowski Scott Spreat 《Journal of Marketing for HIGHER EDUCATION》2013,23(2):209-239

The Admitted Student Questionnaire Plus (ASQ⁺) is a standardised measure that provides an analysis of the student's college selection process. Among other things, the instrument inquires about the importance of 16 college characteristics, followed by quality ratings of specific colleges that the student considered on these same characteristics. This study investigated the utility of importance weights in the assessment of college choice, examining how much the importance rating would improve one's ability to predict the student's actual college choice over and above what is possible with just the quality ratings. Another purpose of the study was to determine if importance ratings and quality ratings were independent of each other or associated in some way. Two types of weights were studied: (1) standardised weights created by averaging the importance ratings of the entire sample; and (2) subjective weights unique to each respondent. The weights were combined with quality ratings by either: (1) multiplying the quality rating by the importance rating; or by (2) subtracting the quality rating from the importance rating (gap score). Standardised weights did not improve prediction at all, and subjective weights only improved the predictability of college choice by a very miniscule amount (about 1%). Importance and quality ratings were found to be associated, especially in the ratings of the college that the student decided to attend. Some correlations were linear in nature, but many were non-linear, such that characteristics rated high or low were perceived as more important than characteristics assigned mid-range quality ratings. It was concluded that importance weights do not enhance prediction of college choice, but they may be useful for administrators in prioritising interventions. 相似文献

11.

Examining How Professional Roles and Test Development Experiences Impact Angoff Ratings

Adam E. Wyse 《教育实用测度》2018,31(4):324-334

An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided. 相似文献

12.

Scoring a Performance-Based Assessment by Modeling the Judgments of Experts

Brian E. Clauser Raja G. Subhiyah Ronald J. Nungester Douglas R. Ripkey Stephen G. Clyman Danette McKinley 《Journal of Educational Measurement》1995,32(4):397-415

Performance assessments typically require expert judges to individually rate each performance. This results in a limitation in the use of such assessments because the rating process may be extremely time consuming. This article describes a scoring algorithm that is based on expert judgments but requires the rating of only a sample of performances. A regression-based policy capturing procedure was implemented to model the judgment policies of experts. The data set was a seven-case performance assessment of physician patient management skills. The assessment used a computer-based simulation of the patient care environment. The results showed a substantial improvement in correspondence between scores produced using the algorithm and actual ratings, when compared to raw scores. Scores based on the algorithm were also shown to be superior to raw scores and equal to expert ratings for making pass/fail decisions which agreed with those made by an independent committee of experts 相似文献

13.

GI Forum v. Texas Education Agency: Observations for States

《教育实用测度》2013,26(4):411-418

Seven conclusions for professionals who administer state assessment programs are drawn from the GI Forum v. Texas Education Agency ruling: (a) the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999) standards are appropriate to use; (b) items showing different p values for subgroups may be used if they are selected as adequate for sound educational reasons; (c) a cut score setting process should be educationally justified; (d) a high-stakes testing program can appropriately address unfair access to education; (e) multiple opportunities to pass satisfies the standard that a single test score should not be the sole basis for a high-stakes decision; (f) a conjunctive decision-making model can appropriately motivate both students and schools; and (g) an 80% pass rate criterion applied to eventual, as opposed to initial, success rates for subgroups is a reasonable threshold for adverse impact. Caution is recommended because circumstances in other states may not parallel those in Texas in important ways. 相似文献

14.

It's Not Just Angoff: Misperceptions of Hard and Easy Items in Bookmark-Type Ratings

Adam E. Wyse Ben Babcock 《Educational Measurement》2020,39(1):22-29

A common belief is that the Bookmark method is a cognitively simpler standard-setting method than the modified Angoff method. However, a limited amount of research has investigated panelist's ability to perform well the Bookmark method, and whether some of the challenges panelists face with the Angoff method may also be present in the Bookmark method. This article presents results from three experiments where panelists were asked to give Bookmark-type ratings to separate items into groups based on item difficulty data. Results of the experiments showed, consistent with results often observed with the Angoff method, that panelists typically and paradoxically perceived hard items to be too easy and easy items to be too hard. These perceptions were reflected in panelists often placing their Bookmarks too early for hard items and often placing their Bookmarks too late for easy items. The article concludes with a discussion of what these results imply for educators and policymakers using the Bookmark standard-setting method. 相似文献

15.

The relative importance of teaching evaluation criteria from the points of view of students and faculty

Yossi Hadad Gali Naveh 《Assessment & Evaluation in Higher Education》2020,45(3):447-459

Abstract

The student evaluation of teaching (SET) tool is widely used to measure student satisfaction in institutions of higher education. A SET typically includes several criteria, which are assigned equal weights. The motivation for this research is to examine student and lecturer perceptions and the behaviour of the students (i.e. ratings given by them to lecturers) of various criteria on a SET. To this end, an analytic hierarchy process methodology was used to capture the importance (weights) of SET criteria from the points of view of students and lecturers; the students' actual ratings on the SET were then analysed. Results revealed statistically significant differences in the weights of the SET criteria; those weights differ for students and lecturers. However, analysis of 1436 SET forms of the same population revealed that, although students typically rate instructors very similarly on all criteria, they rate instructors higher on the criteria that are more important to them. The practical implications of this research is the reduction of the number of criteria on the SETs used for personnel decisions, while identifying for instructors and administrators those criteria that are perceived by students to be more important. 相似文献

16.

Five Methods for Estimating Angoff Cut Scores with IRT

下载免费PDF全文

Adam E. Wyse 《Educational Measurement》2017,36(4):16-27

This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a priori (EAP), modal a priori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test characteristic curve (i.e., the IRT true‐score (TS) estimator). The five methods are compared using a simulation study and a real data example. Results indicated that the application of different methods can sometimes lead to different estimated cut scores, and that there can be some key differences in impact data when using the IRT TS estimator compared to other methods. It is suggested that one should carefully think about their choice of methods to estimate ability and cut scores because different methods have distinct features and properties. An important consideration in the application of Bayesian methods relates to the choice of the prior and the potential bias that priors may introduce into estimates. 相似文献

17.

Differential Use of Item Information by Judges Using Angoff and Nedeisky Procedures

Robert L. Smith Jeffrey K. Smith 《Journal of Educational Measurement》1988,25(4):259-274

Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values 相似文献

18.

Faculty ratings of course evaluation items

John B. Francis 《Research in higher education》1976,4(1):23-40

Instructors whose teaching was evaluated by students were given the opportunity to rate how applicable the evaluation items were to their classes. This study examined the kinds of items which instructors felt to be applicable or inapplicable, the relationships between the student ratings and the instructor applicability ratings, and the effect on an overall evaluation score of using the instructor applicability judgments as weights.Results generally support the consensus procedure of establishing rating forms; they suggest that the common criticism that faculty judgments of item applicability are influenced by anticipation of student ratings may be true for specific items and that while weighting composite evaluation scores by means of faculty applicability judgments does not affect those overall scores, the distributions of certain items may be altered. 相似文献

19.

Regression Effects in Angoff Ratings: Examples from Credentialing Exams

Adam E. Wyse 《教育实用测度》2018,31(1):68-78

This article discusses regression effects that are commonly observed in Angoff ratings where panelists tend to think that hard items are easier than they are and easy items are more difficult than they are in comparison to estimated item difficulties. Analyses of data from two credentialing exams illustrate these regression effects and the persistence of these regression effects across rounds of standard setting, even after panelists have received feedback information and have been given the opportunity to discuss their ratings. Additional analyses show that there tended to be a relationship between the average item ratings provided by panelists and the standard deviations of those item ratings and that the relationship followed a quadratic form with peak variation in average item ratings found toward the middle of the item difficulty scale. The study concludes with discussion of these findings and what they may imply for future standard settings. 相似文献

20.

Teacher ratings in the validation of Guglielmino's Self-Directed Learning Readiness Scale

Huey B. Long Stephen K. Agyekum 《Higher Education》1984,13(6):709-715

The complexity of life and the increasing importance of learning across the lifespan puts an added emphasis on self-direction in learning. Guglielmino's Self-Directed Learning Readiness Scale (SDLRS) is one of the most frequently reported instruments designed to measure self-directed learning readiness. Therefore, the validity of the instrument is an important topic. This is the second study by the authors that is designed to contribute to knowledge of the validity of the SDLRS. In the first study, the authors concluded that general findings support the validity of the instrument. Questions generated by the lack of association between faculty ratings on self-direction and student scores on the SDLRS, however, needed further study.The second study, reported here, was specifically designed to examine the effects of two teacher rating scales as used in the two investigations. The extremely low(0.03) correlation between faculty ratings and the SDLRS scores noted in the first study is compared to the findings in the second study. A correlation of 0.20, significant at the 0.056 level, was noted in this study. It is concluded, therefore, that the rating scale as used in the first study may have been seriously flawed. A persistent tendency of the faculty to rate black students lower in self-direction and older students higher in self-direction raises additional questions concerning faculty rating procedures. Other findings reported in this study are similar to those reported in the earlier study 相似文献