Similar Articles
 20 similar articles found.
1.
This article presents a method for evaluating equating results. Within the kernel equating framework, the percent relative error (PRE) for chained equipercentile equating was computed under the nonequivalent groups with anchor test (NEAT) design. The method was applied to two data sets to obtain the PRE, which can be used to measure equating effectiveness. The study compared the PRE results for chained and poststratification equating. The results indicated that the chained method transformed the new form score distribution to the reference form scale more effectively than the poststratification method. In addition, the study found that in chained equating, the population weight had an impact on score distributions over the target population but not on the equating and PRE results.
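In kernel equating, PRE compares the moments of the equated new-form scores with the moments of the reference-form scores on the target population. Below is a minimal sketch of that computation, assuming the score points, equated scores, and target-population weights have already been obtained elsewhere; the function and argument names are illustrative, not taken from the article.

```python
import numpy as np

def percent_relative_error(equated_scores, new_weights, ref_scores, ref_weights, p_max=10):
    """Percent relative error (PRE) for moments 1..p_max.

    equated_scores : equated new-form scores e_Y(x) at each new-form score point
    new_weights    : target-population probabilities of the new-form score points
    ref_scores     : reference-form score points
    ref_weights    : target-population probabilities of the reference-form score points
    """
    pre = []
    for p in range(1, p_max + 1):
        mu_equated = np.sum(new_weights * equated_scores ** p)  # pth moment of equated scores
        mu_ref = np.sum(ref_weights * ref_scores ** p)          # pth moment of reference scores
        pre.append(100.0 * (mu_equated - mu_ref) / mu_ref)      # PRE(p)
    return np.array(pre)
```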

2.
In this article, linear item response theory (IRT) observed-score equating is compared under a generalized kernel equating framework with Levine observed-score equating for the nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed-score equating is virtually identical to Levine observed-score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as a curvilinear Levine observed-score equating.

3.
Tucker and chained linear equatings were evaluated in two testing scenarios. In Scenario 1, referred to as rater comparability scoring and equating, the anchor-to-total correlation is often very high for the new form but moderate for the reference form. This may adversely affect the results of Tucker equating, especially if the new and reference form samples differ in ability. In Scenario 2, the new and reference form samples are randomly equivalent but the correlation between the anchor and total scores is low. When the correlation between the anchor and total scores is low, Tucker equating assumes that the new and reference form samples are similar in ability (which, with randomly equivalent groups, is the correct assumption). Thus, Tucker equating should produce accurate results. Results indicated that in Scenario 1, the Tucker results were less accurate than the chained linear equating results. However, in Scenario 2, the Tucker results were more accurate than the chained linear equating results. Some implications are discussed.
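For context, chained linear equating composes two linear links through the anchor: new form X to anchor V in the new-form group, then anchor V to reference form Y in the reference-form group. A sketch under that standard definition is shown below; the means and standard deviations are assumed to have been estimated from the two samples, and the argument names are illustrative.

```python
def chained_linear(x, mx_P, sx_P, mv_P, sv_P, mv_Q, sv_Q, my_Q, sy_Q):
    """Chained linear equating of a new-form score x to the reference-form scale.

    Step 1: link X to the anchor V using means/SDs from the new-form group P.
    Step 2: link the resulting anchor score to Y using means/SDs from the
            reference-form group Q.
    """
    v = mv_P + (sv_P / sx_P) * (x - mx_P)   # X -> V in population P
    y = my_Q + (sy_Q / sv_Q) * (v - mv_Q)   # V -> Y in population Q
    return y
```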

4.
A goal for any linking or equating of two or more tests is that the linking function be invariant to the population used in conducting the linking or equating. Violations of population invariance in linking and equating jeopardize the fairness and validity of test scores, and pose particular problems for test‐based accountability programs that require schools, districts, and states to report annual progress on academic indicators disaggregated by demographic group membership. This instructional module provides a comprehensive overview of population invariance in linking and equating and the relevant methodology developed for evaluating violations of invariance. A numeric example is used to illustrate the comparative properties of available methods, and important considerations for evaluating population invariance in linking and equating are presented.  相似文献   
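One commonly used index of population invariance is the root-mean-square difference (RMSD) between subgroup linking functions and the total-group linking function at a given score. The sketch below follows that general definition under the assumption that the standardizer is the reference-form standard deviation in the total group; the function and argument names are illustrative.

```python
import numpy as np

def rmsd_invariance(x, subgroup_links, weights, total_link, sd_ref):
    """RMSD index of population invariance at new-form score x.

    subgroup_links : list of functions e_j mapping a new-form score to the
                     reference-form scale within each subgroup
    weights        : subgroup proportions (summing to 1)
    total_link     : linking function estimated in the total group
    sd_ref         : SD of reference-form scores in the total group (standardizer)
    """
    diffs = np.array([e_j(x) - total_link(x) for e_j in subgroup_links])
    return np.sqrt(np.sum(np.asarray(weights) * diffs ** 2)) / sd_ref
```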

5.
Two methods of local linear observed-score equating for use with anchor-test and single-group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed-score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor-test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single-group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression.

6.
The current study proposed several variants of simple-structure multidimensional item response theory equating procedures. Four distinct sets of data were used to demonstrate the feasibility of the proposed equating methods for two different equating designs: a random groups design and a common-item nonequivalent groups design. Findings indicated some notable differences between the multidimensional and unidimensional approaches when data exhibited evidence for multidimensionality. In addition, some of the proposed methods were successful in providing equating results for both section-level and composite-level scores, which has not been achieved by most of the existing methodologies. The traditional method of using a set of quadrature points and weights for equating turned out to be computationally intensive, particularly for data with higher dimensions. The study suggested a Monte Carlo approach as an alternative for such data. This study also proposed a simple-structure true-score equating procedure that does not rely on a multivariate observed-score distribution.
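The Monte Carlo alternative mentioned here can be pictured as drawing examinees from the estimated ability distribution and accumulating the model-implied number-correct distribution, rather than integrating over a high-dimensional quadrature grid. Below is a minimal sketch under a simple-structure 2PL MIRT model; it is an illustration of the general idea, not the study's procedure, and all parameter names are assumptions.

```python
import numpy as np

def mc_observed_score_dist(a, b, dim_of_item, mean, cov, n_draws=100_000, rng=None):
    """Monte Carlo estimate of the number-correct score distribution under a
    simple-structure MIRT model (each item loads on exactly one dimension).

    a, b        : item discriminations and difficulties (length = number of items)
    dim_of_item : index of the dimension each item measures
    mean, cov   : mean vector and covariance matrix of the ability distribution
    """
    rng = rng or np.random.default_rng(0)
    theta = rng.multivariate_normal(mean, cov, size=n_draws)        # draw abilities
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, dim_of_item] - b)))      # 2PL response probabilities
    scores = (rng.random(p.shape) < p).sum(axis=1)                  # simulate responses, count correct
    n_items = len(a)
    return np.bincount(scores, minlength=n_items + 1) / n_draws     # relative frequency of each score
```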

7.
The impact of log-linear presmoothing on the accuracy of small-sample chained equipercentile equating was evaluated under two conditions. In the first condition the small samples differed randomly in ability from the target population. In the second condition the small samples were systematically different from the target population. Results showed that equating with small samples (e.g., N < 25 or 50) using either raw or smoothed score distributions led to considerable random equating error (although smoothing reduced random equating error). Moreover, when the small samples were not representative of the target population, the amount of equating bias also was quite large. It is concluded that although presmoothing can reduce random equating error, it is not likely to reduce equating bias caused by using an unrepresentative sample. Other alternatives to the small sample equating problem (e.g., the SiGNET design) which focus more on improving data collection are discussed.

8.
This study investigated differences between two approaches to chained equipercentile (CE) equating (one- and bi-direction CE equating) in nearly equal groups and relatively unequal groups. In one-direction CE equating, the new form is linked to the anchor in one sample of examinees and the anchor is linked to the reference form in the other sample. In bi-direction CE equating, the anchor is linked to the new form in one sample of examinees and to the reference form in the other sample. The two approaches were evaluated in comparison to a criterion equating function (i.e., equivalent groups equating) using indexes such as root expected squared difference, bias, standard error of equating, root mean squared error, and number of gaps and bumps. The overall results across the equating situations suggested that the two CE equating approaches produced very similar results, whereas the bi-direction results were slightly less erratic, smoother (i.e., fewer gaps and bumps), usually closer to the criterion function, and also less variable.

9.
Five methods for equating in a random groups design were investigated in a series of resampling studies with samples of 400, 200, 100, and 50 test takers. Six operational test forms, each taken by 9,000 or more test takers, were used as item pools to construct pairs of forms to be equated. The criterion equating was the direct equipercentile equating in the group of all test takers. Equating accuracy was indicated by the root-mean-squared deviation, over 1,000 replications, of the sample equatings from the criterion equating. The methods investigated were equipercentile equating of smoothed distributions, linear equating, mean equating, symmetric circle-arc equating, and simplified circle-arc equating. The circle-arc methods produced the most accurate results for all sample sizes investigated, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
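The circle-arc methods rest on a simple geometric idea: pass an arc through three points of the equating relationship, typically the lowest and highest meaningful score pairs plus a middle point estimated from the data (e.g., by mean equating). The sketch below shows only that geometry and glosses over the distinction between the symmetric and simplified variants; the point choices and names are assumptions for illustration.

```python
import numpy as np

def fit_circle(p1, p2, p3):
    """Solve x^2 + y^2 + D*x + E*y + F = 0 for the circle through three
    (x, y) points; assumes the points are not collinear."""
    A = np.array([[p[0], p[1], 1.0] for p in (p1, p2, p3)])
    b = np.array([-(p[0] ** 2 + p[1] ** 2) for p in (p1, p2, p3)])
    D, E, F = np.linalg.solve(A, b)
    return D, E, F

def circle_arc_equate(x, low, mid, high):
    """Equate score x via the arc through three score pairs, e.g.
    low = (min score, min score), high = (max score, max score),
    mid = (new-form mean, mean-equated value)."""
    D, E, F = fit_circle(low, mid, high)
    disc = E ** 2 - 4.0 * (x ** 2 + D * x + F)      # quadratic in y at this x
    upper = mid[1] > -E / 2.0                       # are the fitted points on the upper semicircle?
    return (-E + np.sqrt(disc)) / 2.0 if upper else (-E - np.sqrt(disc)) / 2.0
```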

10.
Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.

11.
The study examined two approaches for equating subscores: (1) equating subscores using internal common items as the anchor, and (2) equating subscores using equated and scaled total scores as the anchor. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study, and results showed that when more internal common items were available (i.e., 10–12 items), using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) was not reasonable, as it resulted in a considerable amount of bias.

12.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman (2008b) suggested reporting an augmented subscore that is a linear combination of a subscore and the total score. Sinharay and Haberman (2008) and Sinharay (2010) showed that augmented subscores often lead to more accurate diagnostic information than subscores. In order to report augmented subscores operationally, they should be comparable across the different forms of a test. One way to achieve comparability is to equate them. We suggest several methods for equating augmented subscores. Results from several operational and simulated data sets show that the error in the equating of augmented subscores appears to be small in most practical situations.
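The augmented subscore itself is just a weighted combination of the observed subscore and the observed total score, with weights chosen (from reliabilities and covariances) to best approximate the true subscore. The sketch below shows only the combination step and takes the weights as given; their estimation is outside this snippet, and the names are illustrative.

```python
def augmented_subscore(subscore, total, mean_sub, mean_total, w_sub, w_total):
    """Augmented subscore: the subscore mean plus weighted deviations of the
    observed subscore and the observed total score from their means.

    w_sub and w_total are assumed to have been estimated beforehand,
    e.g., by minimizing the mean squared error of approximating the
    true subscore."""
    return mean_sub + w_sub * (subscore - mean_sub) + w_total * (total - mean_total)
```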

13.
The Non-Equivalent-groups Anchor Test (NEAT) design has been in wide use since at least the early 1940s. It involves two populations of test takers, P and Q, and makes use of an anchor test to link them. Two linking methods used for NEAT designs are (a) those based on chain equating and (b) those that use the anchor test to post-stratify the distributions of the two operational test scores to a common population (i.e., Tucker equating and frequency estimation). We show that, under different sets of assumptions, both methods are observed score equating methods, and we give conditions under which the methods give identical results. In addition, we develop analogues of the Dorans and Holland (2000) RMSD measures of population invariance of equating methods for the NEAT design for both chain and post-stratification equating methods.

14.
This study addressed the sampling error and linking bias that occur with small samples in a nonequivalent groups anchor test design. We proposed a linking method called the synthetic function, which is a weighted average of the identity function and a traditional equating function (in this case, the chained linear equating function). Specifically, we compared the synthetic, identity, and chained linear functions for various-sized samples from two types of national assessments. One design used a highly reliable test and an external anchor, and the other used a relatively low-reliability test and an internal anchor. The results from each of these methods were compared to the criterion equating function derived from the total samples with respect to linking bias and error. The study indicated that the synthetic functions might be a better choice than the chained linear equating method when samples are not large and, as a result, unrepresentative.
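Because the synthetic function is simply a convex combination of the identity function and a traditional equating function, it can be written in a few lines. A sketch follows; the weight w controls how strongly the small-sample equating is shrunk toward "no equating," and how w is chosen is left open here.

```python
def synthetic_function(x, traditional_fn, w):
    """Synthetic linking function: a weighted average of the identity
    function (equated score = raw score) and a traditional equating
    function, e.g., a chained linear function passed in as traditional_fn."""
    return w * x + (1.0 - w) * traditional_fn(x)
```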

15.
The synthetic function is a weighted average of the identity (the linking function for forms that are known to be completely parallel) and a traditional equating method. The purpose of the present study was to investigate the benefits of the synthetic function on small-sample equating using various real data sets gathered from different administrations of tests from a licensure testing program. We investigated the chained linear, Tucker, Levine, and mean equating methods, along with the identity and the synthetic functions with small samples (N = 19 to 70). The synthetic function did not perform as well as did other linear equating methods because test forms differed markedly in difficulty; thus, the use of the identity function produced substantial bias. The effectiveness of the synthetic function depended on the forms' similarity in difficulty.

16.
The nonequivalent groups with anchor test (NEAT) design involves missing data that are missing by design. Three equating methods that can be used with a NEAT design are the frequency estimation equipercentile equating method, the chain equipercentile equating method, and the item-response-theory observed-score-equating method. We suggest an approach to perform a fair comparison of the three methods. The approach is then applied to compare the three equating methods using three data sets from operational tests. For each data set, we examine how the three equating methods perform when the missing data satisfy the assumptions made by only one of these equating methods. The chain equipercentile equating method is somewhat more satisfactory overall than the other methods.
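The "missing by design" data are the new-form scores for the reference-form group and vice versa; frequency estimation fills them in by assuming the conditional distribution of the total score given the anchor score is the same in both groups. A minimal sketch of that step is below, assuming the arrays hold score probabilities; the names are illustrative.

```python
import numpy as np

def fe_marginal(joint_xv_P, anchor_dist_Q):
    """Frequency-estimation (post-stratification) estimate of the new-form
    score distribution in group Q.

    joint_xv_P    : joint probabilities P(X = x, V = v) observed in group P,
                    shape (number of X scores, number of V scores)
    anchor_dist_Q : anchor-score probabilities h_Q(v) observed in group Q

    Assumes P(X = x | V = v) is the same in P and Q, so
    f_Q(x) = sum_v P(X = x | V = v, group P) * h_Q(v).
    """
    cond_x_given_v = joint_xv_P / joint_xv_P.sum(axis=0, keepdims=True)  # P(X = x | V = v) in P
    return cond_x_given_v @ anchor_dist_Q                                # marginal of X in Q
```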

17.
This study investigated the effectiveness of equating with very small samples using the random groups design. Of particular interest was equating accuracy at specific scores where performance standards might be set. Two sets of simulations were carried out, one in which the two forms were identical and one in which they differed by a tenth of a standard deviation in overall difficulty. These forms were equated using mean equating, linear equating, unsmoothed equipercentile equating, and equipercentile equating using two through six moments of log-linear presmoothing with samples of 25, 50, 75, 100, 150, and 200. The results indicated that identity equating was preferable to any equating method when samples were as small as 25. For samples of 50 and above, the choice of an equating method over identity equating depended on the location of the passing score relative to examinee performance. If passing scores were located below the mean, where data were sparser, mean equating produced the smallest percentage of misclassified examinees. For passing scores near the mean, all methods produced similar results with linear equating being the most accurate. For passing scores above the mean, equipercentile equating with 2- and 3-moment presmoothing were the best equating methods. Higher levels of presmoothing did not improve the results.
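Log-linear presmoothing with C moments fits a polynomial log-linear model to the raw score frequencies; the fitted distribution matches the first C moments of the observed one and is then used for equipercentile equating. A sketch using a Poisson GLM is shown below; it assumes statsmodels is available and is a generic illustration of the smoothing step, not the study's code.

```python
import numpy as np
import statsmodels.api as sm

def loglinear_presmooth(freq, n_moments):
    """Fit a polynomial log-linear model to raw score frequencies.

    freq      : observed frequency at each score point 0..K
    n_moments : number of polynomial terms C; the fitted frequencies
                preserve the first C moments of the observed distribution
    """
    scores = np.arange(len(freq), dtype=float)
    z = (scores - scores.mean()) / scores.std()            # standardize for numerical stability
    design = np.column_stack([np.ones_like(z)] +
                             [z ** p for p in range(1, n_moments + 1)])
    fit = sm.GLM(freq, design, family=sm.families.Poisson()).fit()
    return fit.fittedvalues                                 # smoothed frequencies
```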

18.
In this study I compared results of chained linear, Tucker, and Levine observed-score equating under conditions where the new and old form samples were similar in ability and also when they were different in ability. The length of the anchor test was also varied to examine its effect on the three different equating methods. The three equating methods were compared to a criterion equating to obtain estimates of random equating error, bias, and root mean squared error (RMSE). Results showed that, for most studied conditions, chained linear equating produced fairly good equating results in terms of low bias and RMSE. Levine equating also produced low bias and RMSE in some conditions. Although the Tucker method always produced the lowest random equating error, it produced a larger bias and RMSE than either of the other equating methods. As noted in the literature, these results also suggest that either chained linear or Levine equating be used when new and old form samples differ on ability and/or when the anchor-to-total correlation is not very high. Finally, by testing the missing data assumptions of the three equating methods, this study also shows empirically why an equating method is more or less accurate under certain conditions.
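Unlike the chained approach sketched earlier, Tucker equating uses the anchor regressions to build synthetic-population moments for the two forms and then applies linear equating on that population. The sketch below follows the usual textbook formulation (population 1 takes the new form X, population 2 the reference form Y, with synthetic weights w1 and w2); the moment names are illustrative.

```python
import numpy as np

def tucker_linear(x, m, w1=0.5):
    """Tucker linear equating of new-form score x to the reference-form scale.

    m : dict of sample moments, e.g. 'mX1', 'sX1' (mean/SD of X in pop 1),
        'mV1', 'sV1', 'covXV1', 'mY2', 'sY2', 'mV2', 'sV2', 'covYV2'
    """
    w2 = 1.0 - w1
    g1 = m['covXV1'] / m['sV1'] ** 2             # regression slope of X on V in population 1
    g2 = m['covYV2'] / m['sV2'] ** 2             # regression slope of Y on V in population 2
    dV = m['mV1'] - m['mV2']                     # anchor mean difference between populations
    dV2 = m['sV1'] ** 2 - m['sV2'] ** 2          # anchor variance difference
    mXs = m['mX1'] - w2 * g1 * dV                # synthetic-population mean of X
    mYs = m['mY2'] + w1 * g2 * dV                # synthetic-population mean of Y
    vXs = m['sX1'] ** 2 - w2 * g1 ** 2 * dV2 + w1 * w2 * g1 ** 2 * dV ** 2
    vYs = m['sY2'] ** 2 + w1 * g2 ** 2 * dV2 + w1 * w2 * g2 ** 2 * dV ** 2
    return mYs + np.sqrt(vYs / vXs) * (x - mXs)  # linear equating on the synthetic population
```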

19.
The main purposes of this study were to develop bi-factor multidimensional item response theory (BF-MIRT) observed-score equating procedures for mixed-format tests and to investigate the relative appropriateness of the proposed procedures. Using data from a large-scale testing program, three types of pseudo data sets were formulated: matched samples, pseudo forms, and simulated data sets. Very minor within-format residual dependence was found in the mixed-format tests after controlling for the influence of the primary general factor. The unidimensional IRT and BF-MIRT equating methods produced similar equating results for the data used in this study. When a BF-MIRT model is implemented, we recommend the use of observed-score equating instead of true-score equating because the latter requires an arbitrary approximation or reduction process to relate true scores on test forms.

20.
Using data from a large-scale exam, this study compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.
