Similar Literature (20 results)
1.
An Extension of Four IRT Linking Methods for Mixed-Format Tests
Under item response theory (IRT), linking proficiency scales from separate calibrations of multiple forms of a test to achieve a common scale is required in many applications. Four IRT linking methods, including the mean/mean, mean/sigma, Haebara, and Stocking-Lord methods, have been presented for use with single-format tests. This study extends the four linking methods to a mixture of unidimensional IRT models for mixed-format tests. Each extended linking method is intended to handle mixed-format tests using any mixture of the following five IRT models: the three-parameter logistic, graded response, generalized partial credit, nominal response (NR), and multiple-choice (MC) models. A simulation study is conducted to investigate the performance of the four linking methods extended to mixed-format tests. Overall, the Haebara and Stocking-Lord methods yield more accurate linking results than the mean/mean and mean/sigma methods. When the NR model or the MC model is used to analyze data from mixed-format tests, limitations of the mean/mean, mean/sigma, and Stocking-Lord methods are described.
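For readers unfamiliar with the two moment-based methods named above, the following is a minimal sketch of mean/mean and mean/sigma linking for the dichotomous (a, b) parameters of common items, assuming the usual scale transformation theta* = A*theta + B. The function names, example values, and NumPy implementation are illustrative, not the authors' code.

```python
import numpy as np

def mean_sigma(b_base, b_new):
    """Mean/sigma linking: slope and intercept from the difficulty moments of the
    common items (theta* = A*theta + B places the new form on the base scale)."""
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

def mean_mean(a_base, a_new, b_base, b_new):
    """Mean/mean linking: slope from the discrimination means instead of the
    difficulty standard deviations."""
    A = np.mean(a_new) / np.mean(a_base)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

def rescale(a_new, b_new, A, B):
    """Apply the linking constants to new-form item parameters."""
    return np.asarray(a_new) / A, A * np.asarray(b_new) + B

# Illustrative use with made-up common-item parameters
a_base, b_base = np.array([1.0, 1.2, 0.8]), np.array([-0.5, 0.2, 1.1])
a_new,  b_new  = np.array([0.9, 1.1, 0.7]), np.array([-0.8, -0.1, 0.8])
print(mean_sigma(b_base, b_new))
print(mean_mean(a_base, a_new, b_base, b_new))
```

The Haebara and Stocking-Lord methods instead minimize differences between item or test characteristic curves, which is why they typically outperform the moment-based methods above.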

2.
As item response theory has been more widely applied, investigating the fit of a parametric model becomes an important part of the measurement process. There is a lack of promising solutions to the detection of model misfit in IRT. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purpose of this study was to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results from the simulation study demonstrated that RISE outperformed G2 and S-X2 in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected reasonable numbers of misfitting items compared to G2 and S-X2, and RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence for classification before and after replacement of the misfitting items detected by the three fit statistics.
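As a rough illustration of the RISE idea (not the authors' implementation), the index can be read as the square root of the squared difference between a parametric and a nonparametric item response function, integrated over the ability distribution. The sketch below evaluates that integral numerically on a theta grid; the kernel-smoothing step that would produce the nonparametric curve from data is replaced by a stand-in.

```python
import numpy as np

def rise(p_param, p_nonpar, theta_weights):
    """Root Integrated Squared Error between a parametric and a nonparametric
    item response function, both evaluated on a common theta grid.
    theta_weights: ability-density weights on the grid (summing to 1)."""
    return np.sqrt(np.sum((p_param - p_nonpar) ** 2 * theta_weights))

# Illustrative use: a 3PL parametric curve vs. a stand-in for a smoothed empirical curve
theta = np.linspace(-4, 4, 81)
w = np.exp(-0.5 * theta ** 2)
w /= w.sum()                                        # standard normal weights
p_3pl = 0.2 + 0.8 / (1 + np.exp(-1.2 * (theta - 0.3)))
p_np = p_3pl + 0.02 * np.sin(theta)                 # placeholder for a kernel-smoothed IRF
print(rise(p_3pl, p_np, w))
```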

3.
The posterior predictive model checking method is a flexible Bayesian model-checking tool and has recently been used to assess fit of dichotomous IRT models. This paper extended previous research to polytomous IRT models. A simulation study was conducted to explore the performance of posterior predictive model checking in evaluating different aspects of fit for unidimensional graded response models. A variety of discrepancy measures (test-level, item-level, and pair-wise measures) that reflected different threats to applications of graded IRT models to performance assessments were considered. Results showed that posterior predictive model checking exhibited adequate power in detecting different aspects of misfit for graded IRT models when appropriate discrepancy measures were used. Pair-wise measures were found to be more powerful in detecting violations of the unidimensionality and local independence assumptions.
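The core of a posterior predictive check like the one studied here can be summarized in a few lines: for each posterior draw, simulate a replicated data set and compare a discrepancy measure (for example, an item-pair covariance in the pair-wise case) computed on the replicated data with the same measure computed on the observed data. The generic skeleton below is only a sketch; `simulate_data` and `discrepancy` stand for model- and measure-specific functions that are assumed here, not given in the abstract.

```python
import numpy as np

def ppmc_pvalue(observed, posterior_draws, simulate_data, discrepancy):
    """Posterior predictive p-value.
    observed        : observed response matrix
    posterior_draws : iterable of posterior parameter draws
    simulate_data   : callable draw -> replicated response matrix (hypothetical helper)
    discrepancy     : callable (data, draw) -> scalar discrepancy (hypothetical helper)
    Returns the proportion of draws for which the replicated discrepancy is at least
    as large as the realized one; values near 0 or 1 suggest misfit."""
    exceed = [discrepancy(simulate_data(d), d) >= discrepancy(observed, d)
              for d in posterior_draws]
    return float(np.mean(exceed))

def pairwise_covariance(data, draw=None):
    """Example pair-wise discrepancy: covariance between the first two item scores."""
    data = np.asarray(data, float)
    return float(np.cov(data[:, 0], data[:, 1])[0, 1])
```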

4.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

5.
The main purposes of this study were to develop bi-factor multidimensional item response theory (BF-MIRT) observed-score equating procedures for mixed-format tests and to investigate relative appropriateness of the proposed procedures. Using data from a large-scale testing program, three types of pseudo data sets were formulated: matched samples, pseudo forms, and simulated data sets. Very minor within-format residual dependence in mixed-format tests was found after controlling for the influence of the primary general factor. The unidimensional IRT and BF-MIRT equating methods produced similar equating results for the data used in this study. When a BF-MIRT model is implemented, we recommend the use of observed-score equating instead of true-score equating because the latter requires an arbitrary approximation or reduction process to relate true scores on test forms.

6.
Orlando and Thissen's S-X2 item fit index has performed better than traditional item fit statistics such as Yen's Q1 and McKinley and Mills's G2 for dichotomous item response theory (IRT) models. This study extends the utility of S-X2 to polytomous IRT models, including the generalized partial credit model, partial credit model, and rating scale model. The performance of the generalized S-X2 in assessing item model fit was studied in terms of empirical Type I error rates and power and compared to G2. The results suggest that the generalized S-X2 is promising for polytomous items in educational and psychological testing programs.
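For orientation, the summed-score family of statistics compares, within each summed-score group, the observed distribution of an item's response categories with the distribution implied by the fitted model. The Pearson form is sketched below under the assumption that the model-implied category probabilities per score group have already been obtained (in practice via the Lord-Wingersky recursion) and that sparse score groups have been collapsed; the names and degrees-of-freedom bookkeeping are illustrative.

```python
import numpy as np

def summed_score_chi2(observed_counts, expected_probs, n_item_params=0):
    """Pearson form of a summed-score item fit statistic (S-X2 style).
    observed_counts[s, k] : examinees in summed-score group s responding in category k
    expected_probs[s, k]  : model-implied probability of category k given group s
    Assumes sparse score groups were collapsed beforehand so expected counts are positive."""
    observed_counts = np.asarray(observed_counts, float)
    expected_probs = np.asarray(expected_probs, float)
    group_n = observed_counts.sum(axis=1, keepdims=True)
    expected_counts = group_n * expected_probs
    chi2 = np.sum((observed_counts - expected_counts) ** 2 / expected_counts)
    n_groups, n_cats = observed_counts.shape
    df = n_groups * (n_cats - 1) - n_item_params   # item parameters estimated from the data
    return chi2, df
```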

7.
In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed-format data. The model fit indices used in this study include PARSCALE's G2, Orlando and Thissen's S-X2 and S-G2, and Stone's χ2* and G2*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: a Type I error study and a power study. We evaluated the performance of the item fit indices for various conditions of test length, sample size, and IRT models. Among the competing measures, the summed score-based indices S-X2 and S-G2 were found to be sensible and efficient choices for assessing model fit for mixed-format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ2* and G2*, showed inflated Type I error rates in some simulation conditions. Consistent with findings in the current literature, PARSCALE's G2 index was rarely useful, although it provided reasonable results for long tests.

8.
When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimate CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel-smoothed item response functions. The performance of the nonparametric CA and CC indices is tested in simulation studies under various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal.
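Two building blocks of the approach just described can be sketched compactly: a Ramsay-style kernel-smoothed item response function (a kernel regression of item scores on provisional ability estimates) and the recursion that turns per-item correct-response probabilities at a given ability into a conditional total-score distribution, which Lee-type classification indices then aggregate over the ability distribution. The bandwidth, grid, and function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kernel_irf(theta_grid, theta_hat, item_scores, bandwidth=0.3):
    """Kernel-smoothed item response function: Nadaraya-Watson regression of 0/1
    item scores on provisional ability estimates, evaluated on a theta grid."""
    z = (np.asarray(theta_grid)[:, None] - np.asarray(theta_hat)[None, :]) / bandwidth
    k = np.exp(-0.5 * z ** 2)                      # Gaussian kernel weights
    return (k * np.asarray(item_scores)).sum(axis=1) / k.sum(axis=1)

def conditional_score_dist(p_correct):
    """Lord-Wingersky recursion: distribution of the number-correct total score
    given the per-item probabilities of a correct response at one ability point."""
    dist = np.array([1.0])
    for p in p_correct:
        dist = np.convolve(dist, [1.0 - p, p])
    return dist

# Evaluating conditional_score_dist across an ability grid and weighting by the
# ability density yields the joint distribution needed for classification
# accuracy/consistency at a given cut score.
```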

9.
Linear factor analysis (FA) models can be reliably tested using test statistics based on residual covariances. We show that the same statistics can be used to reliably test the fit of item response theory (IRT) models for ordinal data (under some conditions). Hence, the fit of an FA model and of an IRT model to the same data set can now be compared. When applied to a binary data set, our experience suggests that IRT and FA models yield similar fits. However, when the data are polytomous ordinal, IRT models yield a better fit because they involve a higher number of parameters. But when fit is assessed using the root mean square error of approximation (RMSEA), similar fits are obtained again. We explain why. These test statistics have little power to distinguish between FA and IRT models; they are unable to detect that linear FA is misspecified when applied to ordinal data generated under an IRT model.

10.
The idea that test scores may not be valid representations of what students know, can do, and should learn next is well known. Person fit provides an important aspect of validity evidence. Person fit analyses at the individual student level are not typically conducted and person fit information is not communicated to educational stakeholders. In this study, we focus on a promising method for detecting and conveying person fit for large-scale educational assessments. This method uses multilevel logistic regression (MLR) to model the slopes of the person response functions, a potential source of person misfit for IRT models. We apply the method to a representative sample of students who took the writing section of the SAT (N = 19,341). The findings suggest that the MLR approach is useful for providing supplemental evidence of model–data fit in large-scale educational test settings. MLR can be useful for detecting general misfit at global and individual levels. However, as with other model–data fit indices, the MLR approach is limited in providing information regarding only some types of person misfit.
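The intuition behind the slope-based person-fit signal can be illustrated with a single-level stand-in for the multilevel model: for one examinee, regress item correctness on item easiness and examine the slope of that person response function. A clearly positive slope is the expected pattern; a slope near zero or negative (the person is not more likely to answer easier items correctly) flags potential misfit. The Newton-Raphson sketch below is a simplification, not the authors' MLR model, which fits random person slopes across all examinees simultaneously.

```python
import numpy as np

def person_slope(difficulty, responses, n_iter=25, ridge=1e-3):
    """Slope of one examinee's person response function: logistic regression of the
    person's 0/1 responses on item easiness (-difficulty), fit by Newton-Raphson.
    The small ridge term keeps the update stable when a response vector is
    (nearly) all correct or all incorrect."""
    d = np.asarray(difficulty, float)
    y = np.asarray(responses, float)
    X = np.column_stack([np.ones_like(d), -d])      # intercept + easiness
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(2)
        beta = beta + np.linalg.solve(hess, grad)
    return beta[1]

# Illustrative use with made-up item difficulties and one response pattern
print(person_slope([-1.5, -0.5, 0.0, 0.5, 1.5], [1, 1, 0, 1, 0]))
```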

11.
Based on concerns about the item response theory (IRT) linking approach used in the Programme for International Student Assessment (PISA) until 2012 as well as the desire to include new, more complex, interactive items with the introduction of computer-based assessments, alternative IRT linking methods were implemented in the 2015 PISA round. The new linking method represents a concurrent calibration using all available data, enabling us to find item parameters that maximize fit across all groups and allowing us to investigate measurement invariance across groups. Apart from the Rasch model that historically has been used in PISA operational analyses, we compared our method against more general IRT models that can incorporate item-by-country interactions. The results suggest that our proposed method holds promise not only to provide a strong linkage across countries and cycles but also to serve as a tool for investigating measurement invariance.

12.
《教育实用测度》2013,26(2):199-210
When an item response theory (IRT) model is estimated with marginal maximum likelihood, person parameters are usually treated as random parameters following a prior distribution in order to estimate the structural parameters in the model. For example, both PARSCALE (Muraki & Bock, 1999) and BILOG 3 (Mislevy & Bock, 1990) use a standard normal distribution as the default person prior. When the fixed-item linking method is used with an IRT program that has a fixed-person prior distribution, the misspecified prior biases person ability growth downward or upward, depending on the direction of the growth. This study used simulation to demonstrate how much bias the fixed prior distribution introduces into person ability growth under fixed-item linking for mixed-format test data. In addition, the study demonstrated how to recover the growth through an iterative prior-update calibration procedure. This shows that fixed-item linking is still a viable linking method for IRT calibration with a fixed-person prior.
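The iterative prior-update idea mentioned above can be written as a short loop: calibrate with the anchor item parameters fixed and a provisional person prior, estimate the latent mean and standard deviation of the current group, update the prior, and repeat until the prior stabilizes. The skeleton below is a sketch only; `calibrate` is a stand-in for a full MML calibration run (e.g., a PARSCALE/BILOG-style analysis) and is assumed, not provided by the abstract.

```python
def iterative_prior_update(data, fixed_items, calibrate, mu=0.0, sigma=1.0,
                           tol=0.01, max_iter=20):
    """Iterative prior-update calibration for fixed-item linking.
    calibrate(data, fixed_items, mu, sigma) is a hypothetical helper that runs an
    IRT calibration with anchor item parameters fixed and a N(mu, sigma) person
    prior, returning the estimated latent mean and SD on the base scale."""
    for _ in range(max_iter):
        new_mu, new_sigma = calibrate(data, fixed_items, mu, sigma)
        if abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol:
            break
        mu, sigma = new_mu, new_sigma
    return mu, sigma
```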

13.
Drawing valid inferences from item response theory (IRT) models is contingent upon a good fit of the data to the model. Violations of model-data fit have numerous consequences, limiting the usefulness and applicability of the model. This instructional module provides an overview of methods used for evaluating the fit of IRT models. Upon completing this module, the reader will have an understanding of traditional and Bayesian approaches for evaluating model-data fit of IRT models, the relative advantages of each approach, and the software available to implement each method.

14.
Background: Although on-demand testing is being increasingly used in many areas of assessment, it has not been adopted in high stakes examinations such as the General Certificate of Secondary Education (GCSE) and General Certificate of Education Advanced level (GCE A level) offered by awarding organisations (AOs) in the UK. One of the major issues with on-demand testing is that some of the methods used for maintaining the comparability of standards over time in conventional testing are no longer available, so new methods need to be developed.

Purpose: This paper proposes an item response theory (IRT) framework for implementing on-demand testing and maintaining the comparability of standards over time for general qualifications, including GCSEs and GCE A levels, in the UK and discusses procedures for its practical implementation.

Sources of evidence: Evidence is drawn from the literature on on-demand testing, the design of computer-based assessment, the development of IRT, and the application of IRT in educational measurement.

Main argument: On-demand testing presents many advantages over conventional testing. In view of the nature of general qualifications, including the use of multiple components and multiple question types, the advances made in item response modelling over the past 30 years, and the availability of sophisticated IRT analysis software systems, coupled with increasing IRT expertise in awarding organisations, IRT models could be used to implement on-demand testing in high stakes examinations in the UK. The proposed framework represents a coherent and complete approach to maintaining standards in on-demand testing. The procedures for implementing the framework discussed in the paper could be adapted by practitioners to suit their own needs and circumstances.

Conclusions: The use of IRT to implement on-demand testing could prove to be one of the viable approaches to maintaining standards over time or between test sessions for UK general qualifications.

15.
This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study results from real data analysis showed that the procedures exhibited similar patterns across various exams, but some tended to produce lower estimates of classification consistency and accuracy than others. As data became more multidimensional, unidimensional and multidimensional item response theory (IRT) methods tended to produce different results, with the unidimensional approach yielding lower estimates than the multidimensional approach. Results from simulated data analysis demonstrated smaller estimation error for the multidimensional IRT methods than for the unidimensional IRT method. The unidimensional approach yielded larger error as tests became more multidimensional, whereas a reverse relationship was observed for the multidimensional IRT approach. Among the non-IRT approaches, the normal approximation and Livingston-Lewis methods performed well, whereas the compound multinomial method tended to produce relatively larger error.

16.
Mokken scale analysis (MSA) is a probabilistic-nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic-nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple-choice physical science assessment and a rater-mediated writing assessment.
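As a concrete entry point into MSA, the central quantity for dichotomous items is Loevinger's scalability coefficient: the observed covariance of an item pair divided by the maximum covariance attainable given the item margins, aggregated into a scale-level H. The sketch below covers only this dichotomous case; per-item coefficients, standard errors, and MSA's automated item selection procedure are omitted.

```python
import numpy as np

def scalability_H(x):
    """Loevinger's scalability coefficients for a 0/1 response matrix x:
    item-pair Hjk = cov(Xj, Xk) / max attainable cov given the item margins,
    plus the overall scale H = sum(cov) / sum(max cov).
    Assumes no item is answered correctly by everyone or by no one."""
    x = np.asarray(x, float)
    p = x.mean(axis=0)
    cov = np.cov(x, rowvar=False, bias=True)
    n_items = x.shape[1]
    H = np.full((n_items, n_items), np.nan)
    num, den = 0.0, 0.0
    for j in range(n_items):
        for k in range(j + 1, n_items):
            cov_max = min(p[j], p[k]) - p[j] * p[k]
            H[j, k] = H[k, j] = cov[j, k] / cov_max
            num += cov[j, k]
            den += cov_max
    return H, num / den

# Illustrative use with a small toy data set
x = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 1]])
H_pairs, H_scale = scalability_H(x)
print(H_scale)
```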

17.
The power of the chi-square test statistic used in structural equation modeling decreases as the absolute value of the excess kurtosis of the observed data increases. Excess kurtosis is more likely when the number of item response categories is small. As a result, fit is likely to improve as the number of item response categories decreases, regardless of the true underlying factor structure or the χ2-based fit index used to examine model fit. Equivalently, given a target value of approximate fit (e.g., root mean square error of approximation ≤ .05), a model with more factors is needed to reach it as the number of categories increases. This is true regardless of whether the data are treated as continuous (common factor analysis) or as discrete (ordinal factor analysis). We recommend using a large number of response alternatives (≥ 5) to increase the power to detect incorrect substantive models.

18.
Given the relationships of item response theory (IRT) models to confirmatory factor analysis (CFA) models, IRT model misspecifications might be detectable through model fit indexes commonly used in categorical CFA. The purpose of this study is to investigate the sensitivity of mean- and variance-adjusted weighted least squares (WLSMV)-based root mean square error of approximation, comparative fit index, and Tucker-Lewis Index model fit indexes to IRT models that are misspecified due to local dependence (LD). It was found that WLSMV-based fit indexes have some functional relationships to parameter estimate bias in 2-parameter logistic models caused by violations of LD. Continued exploration of these functional relationships and development of LD-detection methods based on such relationships could hold much promise for providing IRT practitioners with global information on violations of local independence.

19.
Standard 3.9 of the Standards for Educational and Psychological Testing (1999) demands evidence of model fit when item response theory (IRT) models are employed with test data. Hambleton and Han (2005) and Sinharay (2005) recommended assessing the practical significance of misfit of IRT models, but few examples of such assessment can be found in the literature concerning IRT model fit. In this article, the practical significance of misfit of IRT models was assessed using data from several tests that employ IRT models to report scores. The IRT model did not fit any data set considered in this article. However, the extent of practical significance of misfit varied over the data sets.

20.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with a two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on the proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), the first four moments of the total score distribution, and coefficient alpha for three real data sets consisting of educational achievement test scores. The technique replicated the real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model did (and much better than the 1PL model) and is therefore a promising alternative simulation technique. The 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.
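To make the generation scheme concrete, a stripped-down sketch is given below: true proportion-correct scores are drawn from a four-parameter beta distribution, a total score is drawn given the true score (a simple binomial error model is used here in place of the two-term approximation to the compound binomial), and item scores are then drawn from classical item difficulties conditional on that total score (the item-true-score step functions). The conditional difficulty table is assumed to have been estimated from real data beforehand, and the constraint that item scores sum exactly to the drawn total score is not enforced in this simplification.

```python
import numpy as np

rng = np.random.default_rng(2024)

def draw_true_scores(n_persons, alpha, beta, lower, upper):
    """True proportion-correct scores from a four-parameter beta distribution
    (a standard beta with shape parameters alpha, beta rescaled to [lower, upper])."""
    return lower + (upper - lower) * rng.beta(alpha, beta, size=n_persons)

def generate_responses(tau, cond_difficulty):
    """Generate 0/1 item scores.
    tau             : true proportion-correct scores, one per person
    cond_difficulty : array of shape (n_items + 1, n_items); row s holds the classical
                      item difficulties conditional on total score s (estimated from
                      real data in the actual technique, supplied as a toy table here)."""
    n_items = cond_difficulty.shape[1]
    data = np.empty((len(tau), n_items), dtype=int)
    for i, t in enumerate(tau):
        total = rng.binomial(n_items, t)                 # simple binomial error model
        data[i] = rng.binomial(1, cond_difficulty[total])
    return data

# Illustrative use with a toy conditional-difficulty table for 4 items
tau = draw_true_scores(1000, alpha=4, beta=2, lower=0.2, upper=0.95)
cond_diff = np.linspace(0.05, 0.95, 5)[:, None] * np.ones((1, 4))
responses = generate_responses(tau, cond_diff)
print(responses.mean(axis=0))
```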

