Similar literature
Found 20 similar articles; search took 109 ms.
1.
This article discusses regression effects that are commonly observed in Angoff ratings, in which panelists tend to judge hard items to be easier, and easy items to be more difficult, than the estimated item difficulties indicate. Analyses of data from two credentialing exams illustrate these regression effects and their persistence across rounds of standard setting, even after panelists have received feedback information and have been given the opportunity to discuss their ratings. Additional analyses show that there tended to be a relationship between the average item ratings provided by panelists and the standard deviations of those item ratings and that the relationship followed a quadratic form with peak variation in average item ratings found toward the middle of the item difficulty scale. The study concludes with discussion of these findings and what they may imply for future standard settings.

2.
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that between 40 and 50 items are sufficient to make generalizable cut score recommendations.
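
The re-sampling idea in this abstract can be illustrated with a rough sketch. It is not the authors' procedure: it uses simple random sampling rather than proportional stratification, treats the cut score as the mean item rating on the proportion scale, and runs on synthetic item means. The function names `resample_cut_scores` and `rmsd` are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def resample_cut_scores(item_means, subset_size, replications=1000, seed=0):
    """Draw item subsets without replacement and return one cut score per draw.

    Assumption: the cut score is the mean of the panel's mean item ratings.
    """
    rng = random.Random(seed)
    return [mean(rng.sample(item_means, subset_size)) for _ in range(replications)]


def rmsd(estimates, target):
    """Root-mean-squared deviation of subset cut scores from the full-test cut score."""
    return sqrt(mean((e - target) ** 2 for e in estimates))


# Synthetic mean Angoff ratings for 200 items (illustrative data only).
data_rng = random.Random(42)
item_means = [data_rng.uniform(0.3, 0.9) for _ in range(200)]
full_cut = mean(item_means)

for k in (10, 45, 100):
    cuts = resample_cut_scores(item_means, k)
    # The spread of the subset cut scores stands in for the SE at this test length.
    print(k, round(pstdev(cuts), 4), round(rmsd(cuts, full_cut), 4))
```

Under these assumptions both the spread and the RMSD shrink as the subset grows; where the reduction levels off depends on the data, so the sketch does not reproduce the 45-item figure reported above.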

3.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two-stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and that those based on the standard errors of ratings be avoided. Suggestions are provided for using these indexes in future research and practice.

4.
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities and other judges’ ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.

5.
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.

6.
The credibility of standard‐setting cut scores depends in part on two sources of consistency evidence: intrajudge and interjudge consistency. Although intrajudge consistency feedback has often been provided to Angoff judges in practice, more evidence is needed to determine whether it achieves its intended effect. In this randomized experiment with 36 judges, non‐numeric item‐level intrajudge consistency feedback was provided to treatment‐group judges after the first and second rounds of Angoff ratings. Compared to the judges in the control condition, those receiving the feedback significantly improved their intrajudge consistency, with the effect being stronger after the first round than after the second round. To examine whether this feedback has deleterious effects on between‐judge consistency, I also examined interjudge consistency at the cut score level and the item level using generalizability theory. The results showed that without the feedback, cut score variability worsened; with the feedback, idiosyncratic item‐level variability improved. These results suggest that non‐numeric intrajudge consistency feedback achieves its intended effect and potentially improves interjudge consistency. The findings contribute to standard‐setting feedback research and provide empirical evidence for practitioners planning Angoff procedures.

7.
The purpose of this research was to recommend an item bias procedure when the number of minority examinees is too small to use preferred three-parameter IRT methods. The chi-square, Angoff delta-plot, and pseudo-IRT indices were compared with both real and simulated data. For the real test data a criterion of known bias had been established by cross-validated IRT-3 results. The findings from the Math Test and the simulated test were consistent. The pseudo-IRT approach was best (measured by both correlations and percent agreement) in detecting criterion bias. The chi-square was close in accuracy to the pseudo-IRT index. The Angoff delta-plot method was found to be inadequate on both heuristic and empirical grounds. In extreme cases it even identified items as biased against whites that were simulated to be biased against blacks. However, a modified Angoff index, where p-value differences were regressed on item point biserials (and the residualized values used as the index), was nearly as good as the chi-square in identifying known bias. A final caution was offered regarding the use of item bias techniques. The statistical flags should never be used mechanically to discard items; rather they should be used to inspect items for possible differences in meaning.
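
The residualization step behind the modified Angoff index can be sketched as follows. This is only an illustration of "regress p-value differences on point biserials and keep the residuals": the abstract does not say which group's p-value is subtracted from which or how flagging thresholds are set, so the direction of the difference, the function name `residualized_index`, and the data are all assumptions.

```python
from statistics import mean


def residualized_index(p_ref, p_focal, point_biserials):
    """Residuals from an ordinary least-squares fit of p-value differences
    on item point biserials (one residual per item).

    Assumption: the difference is reference-group p minus focal-group p.
    """
    diffs = [r - f for r, f in zip(p_ref, p_focal)]
    x_bar, y_bar = mean(point_biserials), mean(diffs)
    sxx = sum((x - x_bar) ** 2 for x in point_biserials)
    sxy = sum((x - x_bar) * (d - y_bar) for x, d in zip(point_biserials, diffs))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [d - (intercept + slope * x) for x, d in zip(point_biserials, diffs)]


# Illustrative data: the last item has a much larger group difference than
# its point biserial would predict from the other four items.
p_ref = [0.80, 0.70, 0.60, 0.50, 0.65]
p_focal = [0.70, 0.58, 0.46, 0.34, 0.30]
pbis = [0.20, 0.30, 0.40, 0.50, 0.35]

for item, residual in enumerate(residualized_index(p_ref, p_focal, pbis), start=1):
    print(item, round(residual, 3))
```

In line with the abstract's closing caution, a large residual in a sketch like this would only mark an item for inspection, not for automatic removal.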

8.
A conceptual framework is proposed for a psychometric theory of standard setting. The framework suggests that participants in a standard setting process (panelists) develop an internal, intended standard as a result of training and the participant's background. The goal of a standard setting process is to convert panelists' intended standards to points on a test's score scale. Psychometrics is involved in this process because the points on the score scale are estimated from ratings provided by participants. The conceptual framework is used to derive three criteria for evaluating standard setting processes. The use of these criteria is demonstrated by applying them to variations of bookmark and modified Angoff standard setting methods.

9.
Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to judge selection is whether the extent of judges’ content knowledge impacts their perceptions of the probability that a minimally proficient examinee will answer the item correctly. The present article reports on two studies conducted in the context of Angoff‐style standard setting for medical licensing examinations. In the first study, content experts answered and subsequently provided Angoff judgments for a set of test items. After accounting for perceived item difficulty and judge stringency, answering the item correctly accounted for a significant (and potentially important) impact on expert judgment. The second study examined whether providing the correct answer to the judges would result in a similar effect to that associated with knowing the correct answer. The results suggested that providing the correct answer did not impact judgments. These results have important implications for the validity of standard setting outcomes in general and on judge recruitment specifically.

10.
Judgmental standard-setting methods, such as the Angoff (1971) method, use item performance estimates as the basis for determining the minimum passing score (MPS). Therefore, the accuracy of these item performance estimates is crucial to the validity of the resulting MPS. Recent researchers (Shepard, 1995; Impara & Plake, 1998; National Research Council, 1999) have called into question the ability of judges to make accurate item performance estimates for target subgroups of candidates, such as minimally competent candidates. The purpose of this study was to examine the intra- and inter-rater consistency of item performance estimates from an Angoff standard setting. Results provide evidence that item performance estimates were consistent within and across panels within and across years. Factors that might have influenced this high degree of reliability in the item performance estimates in a standard setting study are discussed.

11.
The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed.

12.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
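
The traditional approach named in this abstract, a passing score equal to the sum of the item MPLs, can be written out in a few lines. The sketch assumes each item's MPL is the mean of the judges' probability ratings for that item; the abstract does not state how ratings are combined or what the other two methods are, so `item_mpls`, `passing_score`, and the ratings are illustrative only.

```python
from statistics import mean


def item_mpls(ratings):
    """ratings[j][i] is judge j's probability rating for item i.

    Assumption: an item's MPL is the mean rating across judges.
    """
    n_items = len(ratings[0])
    return [mean(judge[i] for judge in ratings) for i in range(n_items)]


def passing_score(ratings):
    """Traditional approach: passing score = sum of the item MPLs."""
    return sum(item_mpls(ratings))


# Two hypothetical judges rating a four-item test.
ratings = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.25, 0.75, 0.50],
]
print(item_mpls(ratings))      # [0.5, 0.5, 0.5, 0.75]
print(passing_score(ratings))  # 2.25 raw-score points out of 4
```

This covers only the summation rule that the paper argues is not the best of the three methods it compares.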

13.
Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values.

14.
This article presents a comparison of simplified variations on two prevalent methods, Angoff and Bookmark, for setting cut scores on educational assessments. The comparison is presented through an application with a Grade 7 Mathematics Assessment in a midwestern school district. Training and operational methods and procedures for each method are described in detail along with comparative results for the application. An alternative item ordering strategy for the Bookmark method that may increase its usability is also introduced. Although the Angoff method is more widely used, the Bookmark method has some promising features, specifically in educational settings. Teachers are able to focus on the expected performance of the "barely proficient" student without the additional challenge of estimating absolute item difficulty.

15.
16.
This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.

17.
A look at real data shows that Reckase's psychometric theory for standard setting is not applicable to bookmark and that his simulations cannot explain actual differences between methods. It is suggested that exclusively test-centered, criterion-referenced approaches are too idealized and that a psychophysics paradigm and a theory of group behavior could be more useful in thinking about the standard setting process. In this view, item mapping methods such as bookmark are reasonable adaptations to fundamental limitations in human judgments of item difficulty. They make item ratings unnecessary and have unique potential for integrating external validity data and student performance data more fully into the standard setting process.

18.
A common belief is that the Bookmark method is a cognitively simpler standard-setting method than the modified Angoff method. However, little research has investigated panelists' ability to perform the Bookmark method well, or whether some of the challenges panelists face with the Angoff method may also be present in the Bookmark method. This article presents results from three experiments where panelists were asked to give Bookmark-type ratings to separate items into groups based on item difficulty data. Results of the experiments showed, consistent with results often observed with the Angoff method, that panelists typically and paradoxically perceived hard items to be too easy and easy items to be too hard. These perceptions were reflected in panelists often placing their Bookmarks too early for hard items and often placing their Bookmarks too late for easy items. The article concludes with a discussion of what these results imply for educators and policymakers using the Bookmark standard-setting method.

19.
This research evaluated the impact of a common modification to Angoff standard‐setting exercises: the provision of examinee performance data. Data from 18 independent standard‐setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations, pre‐ and post‐data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard‐setting exercises. This study is the first to provide a large‐scale, systematic evaluation of the impact of a common standard setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores.

20.
Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard‐setting panels should have the proper qualifications to make the judgments asked of them; however, even qualified judges vary in expertise and in some cases, such as highly specialized areas or when members of the public are involved, it may be difficult to ensure that each member of a standard‐setting panel has the requisite expertise to make qualified judgments. Given the subjective nature of these types of judgments, and that a large part of the validity argument for an exam lies in the robustness of its passing standard, an examination of the influence of judge proficiency on the judgments is warranted. This study explores the use of the many‐facet Rasch model as a method for adjusting modified Angoff standard‐setting ratings based on judges’ proficiency levels. The results suggest differences in the severity and quality of standard‐setting judgments across levels of judge proficiency, such that judges who answered easy items incorrectly tended to perceive them as easier, but those who answered correctly tended to provide ratings within normal stochastic limits.

