Similar Documents
20 similar documents found.
1.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
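As a rough illustration of the traditional aggregation rule this abstract questions, the following Python sketch sums judge-averaged item MPLs into a test passing score. The ratings are hypothetical, not data from the study:

    import numpy as np

    # Hypothetical Angoff ratings: rows are judges, columns are items; each
    # entry is the judged probability that a minimally competent examinee
    # answers the item correctly (the item MPL).
    ratings = np.array([
        [0.6, 0.7, 0.5, 0.8],
        [0.5, 0.8, 0.4, 0.9],
        [0.7, 0.6, 0.6, 0.7],
    ])

    item_mpls = ratings.mean(axis=0)    # per-item MPL averaged over judges
    passing_score = item_mpls.sum()     # traditional rule: sum of item MPLs
    print(f"item MPLs = {item_mpls}, passing score = {passing_score:.2f}")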

2.
Applied Measurement in Education, 2013, 26(3): 203-205
Many credentialing agencies today are either administering their examinations by computer or are likely to be doing so in the coming years. Unfortunately, although several promising computer-based test designs are available, little is known about how well they function in examination settings. The goal of this study was to compare fixed-length examinations (both operational forms and newly constructed forms) with several variations of multistage test designs for making pass-fail decisions. Results were produced for 3 passing scores. Four operational 60-item examinations were compared to (a) 3 new 60-item forms, (b) 60-item 3-stage tests, and (c) 40-item 2-stage tests; all were constructed using automated test assembly software. The study was carried out using computer simulation techniques that were set to mimic common examination practices. All 60-item tests, regardless of design or passing score, produced accurate ability estimates and acceptable and similar levels of decision consistency and decision accuracy. One interesting finding was that the 40-item test results were poorer than the 60-item test results, as expected, but were in the range of acceptability. This raises the practical policy question of whether content-valid 40-item tests with lower item exposure levels and/or savings in item development costs are an acceptable trade-off for a small loss in decision accuracy and consistency.

3.
This article explores the amount of equating error at a passing score when equating scores from exams with small sample sizes. It focuses on classical test theory equating methods: Tucker linear, Levine linear, frequency estimation, and chained equipercentile equating. Both simulation and real data studies were used in the investigation. The results supported past findings that as sample sizes increase, the amount of bias in the equating at the passing score decreases. The research also highlights the importance for practitioners of understanding the data, having an informed expectation of the results, and having a documented rationale for an acceptable amount of equating error.
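For orientation, the general linear equating transformation that underlies both the Tucker and Levine methods (they differ in how the synthetic-population moments are estimated from the anchor) can be sketched as follows; the moments here are hypothetical:

    # Hypothetical synthetic-population moments for forms X and Y.
    mu_x, sd_x = 31.0, 6.2
    mu_y, sd_y = 29.5, 5.8

    def linear_equate(x: float) -> float:
        """Map a form-X score to the form-Y scale by matching means and SDs."""
        return mu_y + (sd_y / sd_x) * (x - mu_x)

    # Equate a passing score that was set on form X.
    print(f"Equated passing score on Y: {linear_equate(33):.2f}")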

4.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study: one panel established the cut score based on the entire test, while another comparable panel first used a proportionally stratified subset of 45 items and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test, both for the same panel and for a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
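A minimal sketch of drawing a proportionally stratified 45-item subset; the blueprint and content-area sizes below are hypothetical, not those used in the study:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical content blueprint for a 150-item test.
    blueprint = {"algebra": 50, "geometry": 40, "statistics": 60}
    total_items, subset_size = 150, 45

    # Each content area contributes items in proportion to its share of
    # the full test: 15, 12, and 18 items here.
    subset = {
        area: rng.choice(n_items, size=subset_size * n_items // total_items,
                         replace=False)
        for area, n_items in blueprint.items()
    }
    print({area: sorted(ids.tolist()) for area, ids in subset.items()})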

5.
Applied Measurement in Education, 2013, 26(3): 231-244
For any testing program intended for licensure, certification, competency, or proficiency, the estimation of content-relevant test scores for pass/fail decision making is necessary. This study compares number-correct scoring to empirical option weighting in the context of such tests. The study was conducted under two test design conditions, three test length conditions, and four passing score levels. Two criteria were used to evaluate the effectiveness of empirical option weighting versus number-correct scoring. Empirical option weighting typically produced slightly more reliable domain score estimates and more consistent pass/fail decisions than number-correct scoring, particularly in the lower half of the test score distribution. For many types of testing programs where the passing scores are established in the lower half of the test score distribution, the empirical option weighting method used in this study seems both appropriate and effective in improving the dependability of test scores and the consistency of pass/fail decisions. Test users, however, must weigh the effort required to use option weighting against the small gains obtained with this method. Other problems are discussed that may limit the usefulness of option weighting.
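One simple empirical option-weighting scheme consistent with the general idea (though not necessarily the exact method of the study) scores each option by the mean criterion score of the examinees who chose it. All data below are hypothetical:

    import numpy as np

    # Hypothetical data for one 4-option item: the option each of ten
    # examinees chose (coded 0-3) and their total test scores (criterion).
    choices = np.array([0, 1, 1, 2, 3, 1, 0, 2, 1, 3])
    totals = np.array([12, 25, 28, 18, 9, 30, 14, 20, 27, 11])

    # Weight each option by the mean criterion score of its choosers.
    weights = {opt: totals[choices == opt].mean() for opt in np.unique(choices)}

    # An examinee's item score is the weight of the chosen option,
    # rather than 1/0 for correct/incorrect.
    item_scores = np.array([weights[c] for c in choices])
    print(weights)
    print(item_scores)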

6.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.
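One way to write the decomposition this abstract describes (a sketch under the usual X = T + E true-score model with parallel measurements X_1 and X_2; the notation is ours, not the paper's):

    \operatorname{Var}(X_2 \mid X_1 = x) = \operatorname{Var}(T \mid X_1 = x) + \operatorname{Var}(E_2 \mid X_1 = x)

    \mathrm{CSEP}(x) = \sqrt{\operatorname{Var}(X_2 \mid X_1 = x)}, \quad
    \mathrm{CTSSD}(x) = \sqrt{\operatorname{Var}(T \mid X_1 = x)}, \quad
    \mathrm{CSEMP}(x) = \sqrt{\operatorname{Var}(E_2 \mid X_1 = x)}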

7.
Domain scores have been proposed as a user-friendly way of providing instructional feedback about examinees' skills. Domain performance typically cannot be measured directly; instead, scores must be estimated using available information. Simulation studies suggest that IRT-based methods yield accurate group domain score estimates. Because simulations can represent best-case scenarios for methodology, it is important to verify results with a real data application. This study administered a domain of elementary algebra (EA) items created from operational test forms. An IRT-based group-level domain score was estimated from responses to a subset of the items taken (the EA items from a single operational form) and compared to the actual observed domain score. Domain item parameters were calibrated both using item responses from the special study and from national operational administrations of the items. The accuracy of the domain score estimates was evaluated within schools and across school sizes for each set of parameters. The IRT-based domain score estimates typically were closer to the actual domain score than observed performance on the EA items from the single form. Previously simulated findings for the IRT-based domain score estimation procedure were supported by the results of the real data application.

8.
Examining committees often need to reach a compromise between absolute and relative standards. Unfortunately, the way in which the compromise is achieved is usually unclear. This paper proposes a systematic method for reaching a compromise. In this method, the estimated passing score (level of minimum knowledge) is assumed to be related to the expected pass rate (percentage of successful candidates) through a simple linear function. The examination results define a function relating the percentage of candidates who would be successful given a specified passing score to the passing score. The intersection of both functions gives the required compromise.
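A minimal Python sketch of the intersection idea; the score distribution and the judged linear function below are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.normal(60, 12, size=200)   # hypothetical exam scores (0-100)

    def judged_pass_rate(c):
        """Committee judgment: expected pass rate (%) as a linear function
        of the passing score c, here 100% at c = 40 falling to 0% at c = 90."""
        return np.clip(100 * (90 - c) / (90 - 40), 0, 100)

    def observed_pass_rate(c):
        """Percentage of candidates at or above the cut score c."""
        return 100 * np.mean(scores >= c)

    # The compromise is the cut score where the two functions intersect.
    cuts = np.linspace(40, 90, 501)
    gaps = np.array([judged_pass_rate(c) - observed_pass_rate(c) for c in cuts])
    compromise = cuts[np.argmin(np.abs(gaps))]
    print(f"Compromise passing score: {compromise:.1f}")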

9.
Judgmental standard-setting methods, such as the Angoff (1971) method, use item performance estimates as the basis for determining the minimum passing score (MPS). Therefore, the accuracy of these item performance estimates is crucial to the validity of the resulting MPS. Recent researchers (Shepard, 1995; Impara & Plake, 1998; National Research Council, 1999) have called into question the ability of judges to make accurate item performance estimates for target subgroups of candidates, such as minimally competent candidates. The purpose of this study was to examine the intra- and inter-rater consistency of item performance estimates from an Angoff standard setting. Results provide evidence that item performance estimates were consistent within and across panels, and within and across years. Factors that might have influenced this high degree of reliability in the item performance estimates in a standard setting study are discussed.

10.
This study describes the development and validation of the Homan-Hewitt Readability Formula, which estimates the readability level of single-sentence test items. Its initial development is based on the assumption that differences in readability level will affect item difficulty. The validation of the formula is achieved by (a) estimating the readability levels of sets of test items predicted to be written at 2nd- through 8th-grade levels; (b) administering the tests to 782 students in grades 2 through 5; and (c) using the class means as the unit of analysis and subjecting the data to a two-factor repeated measures ANOVA. Significant differences were found in class mean performance scores across the levels of readability. These results indicated that a relationship exists between students' reading grade levels and their responses to test items written at higher readability levels.

11.
A section of the secondary chemistry curriculum was analyzed to determine the level of cognitive demand of the various aspects of the selected topic. Piagetian levels of thinking of 71 pupils were initially assessed by two group tests, a unit on the mole was taught, and guidelines were used to estimate the level of cognitive operations required by each concept and problem type in the unit. Results of a 23-item test were used to compare the estimated level of cognitive demand of each test item with the Piagetian cognitive level of pupils who were able to answer the item correctly. It was found that pupil cognitive level was positively associated with overall unit test score and with percent success on all test items. Predicted levels of cognitive demand were confirmed for eight items and were within one level for nine additional items.

12.
Numerous methods have been proposed and investigated for estimating the standard error of measurement (SEM) at specific score levels. Consensus on the preferred method has not been obtained, in part because there is no standard criterion. The criterion procedure in previous investigations has been a single test occasion procedure. This study compares six estimation techniques. Two criteria were calculated by using test results obtained from a test-retest or parallel forms design. The relationship between estimated score level standard errors and the score scale was similar for the six procedures. These relationships were also congruent to findings from previous investigations. Similarity between estimates and criteria varied over methods and criteria. For test-retest conditions, the estimation techniques are interchangeable. The user's selection could be based on personal preference. However, for parallel forms conditions, the procedures resulted in estimates that were meaningfully different. The preferred estimation technique would be Feldt's method (cited in Gupta, 1965; Feldt, 1984).

13.
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated leveraging data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly different from human-to-human inter-rater agreement for essays but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.
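One standard way to test for average-rank differences across raters (not necessarily the exact procedure used in the study) is a Friedman test that treats items as blocks; the agreement values below are hypothetical:

    from scipy.stats import friedmanchisquare

    # Hypothetical item-level agreement statistics (e.g., kappa against
    # human scores) for three automated raters on eight items.
    rater_a = [0.72, 0.68, 0.75, 0.70, 0.66, 0.74, 0.71, 0.69]
    rater_b = [0.70, 0.69, 0.73, 0.68, 0.67, 0.72, 0.70, 0.68]
    rater_c = [0.65, 0.62, 0.69, 0.64, 0.60, 0.67, 0.66, 0.63]

    # The Friedman test ranks the raters within each item, then asks
    # whether the average ranks differ more than chance would allow.
    stat, p = friedmanchisquare(rater_a, rater_b, rater_c)
    print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")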

14.
This study investigated the effectiveness of a computer-based study guide using hypertext software to increase textbook comprehension among four learning disabled students enrolled in a remedial high school social studies class. The program provided four levels of instructional cues that matched students to their highest level of independent interaction with a textbook passage, based on item-to-item responses to computer-generated questions. Using alternative forms of a 45-item multiple-choice test, a pre-test/post-test design was arranged, with a retention test given after a 30-day period. Fifteen questions were designated as control items by placing them in the 45-item tests but not in the computer treatment. The computer program consisted of three separate lessons administered across consecutive class sessions, with each followed by a written 15-item multiple-choice test containing 10 computer questions and 5 control items. Results indicated a significant gain for pupils on computer items from pre-test to post-test and from pre-test to retention test, while no significant change occurred on control items across measures. A single-case analysis revealed a consistent relationship between gain scores on computer items, reading time on computer, and the number of instructional cues required by students. Two types of non-linear pathways that teachers might consider when constructing study guides are discussed.

15.
Person reliability parameters (PRPs) model temporary changes in individuals' attribute level perceptions when responding to self-report items (higher levels of PRPs represent less fluctuation). PRPs could be useful in measuring careless responding and traitedness. However, it is unclear how well current procedures for estimating PRPs can recover parameter estimates. This study assesses these procedures in terms of mean error (ME), average absolute difference (AAD), and reliability using simulated data with known values. Several prior distributions for PRPs were compared across a number of conditions. Overall, our results revealed little difference between using the χ or lognormal distributions as priors for estimated PRPs. Both distributions produced estimates with reasonable levels of ME; however, the AAD of the estimates was high. AAD did improve slightly as the number of items increased, suggesting that increasing the number of items would ameliorate this problem. Similarly, a larger number of items was necessary to produce reasonable levels of reliability. Based on our results, several conclusions are drawn and implications for future research are discussed.

16.
Studies of differential item functioning under item response theory require that item parameter estimates be placed on the same metric before comparisons can be made. The present study compared the effects of three methods for linking metrics on the detection of differential item functioning: a weighted mean and sigma method (WMS), the test characteristic curve method (TCC), and the minimum chi-square method (MCS). Both iterative and noniterative linking procedures were compared for each method. Results indicated that detection of differentially functioning items following linking via the test characteristic curve method gave the most accurate results when the sample size was small. When the sample size was large, results for the three linking methods were essentially the same. Iterative linking provided an improvement in detection of differentially functioning items over noniterative linking, particularly at the .05 alpha level. The weighted mean and sigma method showed greater improvement with iterative linking than either the test characteristic curve or minimum chi-square method.
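For reference, the basic (unweighted) mean and sigma transformation at the heart of methods like WMS can be sketched as follows, with hypothetical common-item difficulty estimates:

    import numpy as np

    # Hypothetical difficulty estimates for the common items on two forms.
    b_base = np.array([-1.2, -0.4, 0.3, 1.1])   # base-form metric
    b_new = np.array([-1.0, -0.1, 0.5, 1.4])    # new-form metric

    # Mean-sigma linking constants placing the new metric on the base metric.
    A = b_base.std(ddof=1) / b_new.std(ddof=1)
    B = b_base.mean() - A * b_new.mean()

    b_linked = A * b_new + B   # difficulties expressed on the base metric
    # Discriminations transform in the opposite direction: a_linked = a_new / A.
    print(f"A = {A:.3f}, B = {B:.3f}, linked b = {b_linked}")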

17.
A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form and not the entire domain of items. Six item response theory (IRT) based domain score estimation methods were evaluated under conditions of few items per content area per form taken, small domains, and small group sizes. The methods used item responses from the single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and least number of items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.
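A minimal sketch of the core computation shared by such methods: score each examinee's ability from the form taken, then average the model-expected proportion correct over the whole domain. A 2PL is assumed here; parameters and ability estimates are hypothetical:

    import numpy as np

    def p_2pl(theta, a, b):
        """2PL probability of a correct response."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    # Hypothetical parameters for a 10-item domain.
    a = np.full(10, 1.0)
    b = np.linspace(-2, 2, 10)

    def domain_score(theta):
        """Expected proportion correct over the whole domain at ability theta."""
        return p_2pl(theta, a, b).mean()

    # Group-level estimate: average the expected domain score over the
    # ability estimates obtained from the single form actually taken.
    theta_hats = np.array([-0.5, 0.2, 0.8, 1.1])
    print(f"Estimated group domain score: "
          f"{np.mean([domain_score(t) for t in theta_hats]):.3f}")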

18.
This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM model for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference) and, for triplets, the Triplet-2PLM. Fisher's information and directional information are described, and the test information for Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) trait score estimates is distinguished. Expected item/test information indexes at various levels are proposed and plotted to provide diagnostic information on items and tests. The expected test information indexes for EAP scores may be difficult to compute due to a typical test's vast number of item response patterns. The relationships of item/test information with discrimination parameters of statements, standard error, and reliability estimates of trait score estimates are discussed and demonstrated using real data. Practical suggestions for checking the various expected item/test information indexes and plots are provided.
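For orientation, a sketch of the familiar unidimensional 2PL case that these forced-choice models build on, where P_j(θ) is the 2PL response probability:

    I_j(\theta) = a_j^2 \, P_j(\theta)\,[1 - P_j(\theta)], \qquad
    I(\theta) = \sum_j I_j(\theta), \qquad
    \mathrm{SE}(\hat{\theta}_{\mathrm{ML}}) \approx 1 / \sqrt{I(\theta)}

The Rank-2PLM information functions generalize these quantities to rankings of pairs and triplets of statements.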

19.
Examined in this study were three procedures for estimating the standard errors of school passing rates using a generalizability theory model. Also examined was how these procedures behaved for student samples that differed in size. The procedures differed in terms of their assumptions about the populations from which students were sampled, and it was found that student sample size generally had a notable effect on the size of the standard error estimates they produced. In addition, the three procedures produced markedly different standard error estimates when the student sample size was small.

20.
Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians' patient management skills; one system uses regression-derived weights for components of the performance, the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances.
