Similar literature
 20 similar documents found; search time 21 ms
1.
Orlando and Thissen's S‐X2 item fit index has performed better than traditional item fit statistics such as Yen's Q1 and McKinley and Mills' G2 for dichotomous item response theory (IRT) models. This study extends the utility of S‐X2 to polytomous IRT models, including the generalized partial credit model, partial credit model, and rating scale model. The performance of the generalized S‐X2 in assessing item model fit was studied in terms of empirical Type I error rates and power and compared to G2. The results suggest that the generalized S‐X2 is promising for polytomous items in educational and psychological testing programs.

2.
As item response theory has been more widely applied, investigating the fit of a parametric model becomes an important part of the measurement process. There is a lack of promising solutions to the detection of model misfit in IRT. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purposes of this study were to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results from the simulation study demonstrated that RISE outperformed G2 and S‐X2 in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected reasonable numbers of misfitting items compared to G2 and S‐X2, and RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence to classification before and after replacement of misfitting items detected by three fit statistics.
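
The idea behind RISE described in the abstract above, comparing a fitted parametric item characteristic curve with a nonparametric estimate of the same curve, can be sketched roughly as follows. This is only an illustration under assumptions that do not come from the source: a two-parameter logistic curve stands in for the parametric model, a Gaussian-kernel (Nadaraya-Watson) smoother stands in for the nonparametric estimate, the integral is approximated by an unweighted average over a grid, and all function names and data are hypothetical. The resampling step needed to turn the statistic into a significance test is not shown.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def kernel_smoothed_icc(abilities, responses, grid, bandwidth=0.3):
    """Nadaraya-Watson estimate of P(correct | theta) at each grid point.

    abilities: ability estimates, one per examinee (assumed to be given)
    responses: 0/1 item scores in the same order as abilities
    """
    curve = []
    for g in grid:
        weights = [math.exp(-0.5 * ((g - t) / bandwidth) ** 2) for t in abilities]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("no examinees near grid point; widen the bandwidth")
        curve.append(sum(w * x for w, x in zip(weights, responses)) / total)
    return curve

def rise(parametric_curve, nonparametric_curve):
    """Root of the mean squared gap between two curves on a common grid."""
    gaps = [(p - q) ** 2 for p, q in zip(parametric_curve, nonparametric_curve)]
    return math.sqrt(sum(gaps) / len(gaps))

# Toy data (made up for illustration): six examinees, one item.
abilities = [-1.5, -1.0, -0.2, 0.3, 1.0, 1.6]
responses = [0, 0, 1, 0, 1, 1]
grid = [-1.0, 0.0, 1.0]

fitted = [icc_2pl(g, a=1.2, b=0.0) for g in grid]
smoothed = kernel_smoothed_icc(abilities, responses, grid)
print(round(rise(fitted, smoothed), 3))
```

A larger value indicates a wider gap between the parametric and nonparametric curves; plotting the two curves over the grid gives the kind of graphical view of misfit the abstract mentions.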

3.
Investigating the fit of a parametric model plays a vital role in validating an item response theory (IRT) model. An area that has received little attention is the assessment of multiple IRT models used in a mixed-format test. The present study extends the nonparametric approach, proposed by Douglas and Cohen (2001), to assess model fit of three IRT models (the three- and two-parameter logistic models and the generalized partial credit model) used in a mixed-format test. The statistical properties of the proposed fit statistic were examined and compared to S-X2 and PARSCALE’s G2. Overall, RISE (Root Integrated Square Error) outperformed the other two fit statistics under the studied conditions in that the Type I error rate was not inflated and the power was acceptable. A further advantage of the nonparametric approach is that it provides a convenient graphical inspection of the misfit.

4.
A rapidly expanding arena for item response theory (IRT) is in attitudinal and health‐outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). However, meeting model assumptions is necessary to realize the benefits of IRT in this setting. Although local item dependence has been investigated both for polytomous items in fixed‐form settings and for dichotomous items in CAT settings, there have been no publications applying local item dependence detection methodology to polytomous items in CAT despite its central importance to these applications. The current research uses a simulation study to investigate the extension of widely used pairwise statistics, Yen's Q3 Statistic and Pearson's Statistic X2, in this context. The simulation design and results are contextualized throughout with a real item bank of this type from the Patient‐Reported Outcomes Measurement Information System (PROMIS).
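
Yen's Q3 for a pair of items is the correlation, across examinees, between the two items' residuals (observed score minus model-expected score). The sketch below is a minimal illustration of that definition, not the procedure used in the study above: it assumes the expected item scores have already been computed from a fitted IRT model at each examinee's ability estimate, and the function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined if either residual is constant

def q3(observed_i, expected_i, observed_j, expected_j):
    """Correlation of residuals for items i and j across examinees."""
    residual_i = [o - e for o, e in zip(observed_i, expected_i)]
    residual_j = [o - e for o, e in zip(observed_j, expected_j)]
    return pearson(residual_i, residual_j)

# Toy polytomous data (made up): four examinees, two items scored 0-3.
observed_i = [1, 2, 3, 0]
expected_i = [1.5, 1.5, 2.0, 1.0]   # model-expected item scores (assumed given)
observed_j = [0, 2, 3, 1]
expected_j = [0.8, 1.4, 2.2, 1.6]

print(round(q3(observed_i, expected_i, observed_j, expected_j), 3))
```

Values far from zero suggest that the two items share variance the model does not account for.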

5.
The use of sample covariance matrices constructed with pairwise deletion for data missing completely at random (SPW) is addressed in a simulation study based on 3 sample sizes (n = 200, 500, 1,000) and 5 levels of missing data (%miss = 0, 1, 10, 25, and 50). Parameter estimates were unbiased, parameter variability was largely explicable in terms of the number of nonmissing cases, and no sample covariance matrices were nonpositive definite except when %miss was 50 and the sample size was 200. However, nominal χ2 test statistics (and, thus, fit indices based on χ2s) were substantially biased by %miss and its interaction with N. Corrected χ2s based on the minimum, mean, and maximum number of nonmissing cases per measured variable and cases per covariance term (NPC) reduced but did not eliminate the bias. Empirically derived power functions did substantially better but may not generalize to other situations. Whereas the minimum NPC (the default in the SPSS version of LISREL) is probably better than most simple alternatives in many applications, the problem of how to assess fit for models fit to SPWs has no simple solution; caution is recommended, and there is need for further research with more suitable methods for this problem.

6.
The posterior predictive model checking method is a flexible Bayesian model‐checking tool and has recently been used to assess fit of dichotomous IRT models. This paper extended previous research to polytomous IRT models. A simulation study was conducted to explore the performance of posterior predictive model checking in evaluating different aspects of fit for unidimensional graded response models. A variety of discrepancy measures (test‐level, item‐level, and pair‐wise measures) that reflected different threats to applications of graded IRT models to performance assessments were considered. Results showed that posterior predictive model checking exhibited adequate power in detecting different aspects of misfit for graded IRT models when appropriate discrepancy measures were used. Pair‐wise measures were found more powerful in detecting violations of the unidimensionality and local independence assumptions.

7.
When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimate CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel‐smoothed item response functions. The performance of the nonparametric CA and CC indices is tested in simulation studies in various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal.

8.
Marsh and Balla (1986) and Marsh, Balla, and McDonald (1988) proposed an index of fit called χ2I2, but McDonald and Marsh (1990) subsequently demonstrated that the index is biased and recommended that it not be used. Bollen (1989) independently proposed Δ2 which is the same as χ2I2 (hereafter referred to as χ2I2‐Δ2), indicating that it adjusts for sample size and degrees of freedom (df). Gerbing and Anderson (1992), apparently based on the assumption that the χ2I2‐Δ2 index is unbiased and appropriately corrects for df (penalizes a lack of parsimony), recommended its use, and the index is routinely presented by major computer programs (e.g., EQS and LISREL 8). However, a more critical evaluation of the χ2I2‐Δ2 index reveals that: (a) it is systematically biased (i.e., its value varies systematically with N) although the size of the bias may be small; (b) the adjustment for df is inappropriate in that it penalizes model parsimony instead of model complexity; and (c) the inappropriate penalty for model parsimony is larger for small N. Because of these undesirable properties, the χ2I2‐Δ2 index is not recommended for routine use.

9.
In this article I describe and evaluate an alternative baseline model for comparative fit assessment of structural equation models and compare it to the standard “null” baseline model. The new “equal correlation” baseline model constrains all variables to have equal, rather than zero, correlations, whereas all variances are free. The new baseline model reflects the reality of atheoretical background correlation in nonexperimental data sets, and it improves the ability of comparative fit indices to distinguish between better and worse target models. It also helps to preserve the statistical link between these indices and the noncentral χ2 distribution. Also, computing the same comparative fit indices using different baseline models will provide more information about model fit than computing multiple comparative fit indices using the same baseline. I also point out some limitations of the proposed baseline model.

10.
Testing the goodness of fit of item response theory (IRT) models is relevant to validating IRT models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type 1 error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). Different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type 1 error rates.

11.
The utility of Orlando and Thissen’s (2000, 2003) S-X2 fit index was extended to the model-fit analysis of the graded response model (GRM). The performance of a modified S-X2 in assessing item-fit of the GRM was investigated in light of empirical Type I error rates and power with a simulation study having various conditions typically encountered in applied testing situations. The results show that the Type I error rates were controlled adequately around the nominal alpha by S-X2. The power of the S-X2 statistic was much lower when the source of misfit was multidimensionality than when it was due to discrepancy from the true GRM curves. Once the data size increased sufficiently, however, appropriate power was obtained regardless of the source of the item-misfit. In summary, the generalized S-X2 appears to be a promising index for investigating item fit for polytomous items in educational and psychological assessments.

12.
Let M^n be a closed submanifold isometrically immersed in a unit sphere S^{n+p}. Denote by R, H and S the normalized scalar curvature, the mean curvature, and the square of the length of the second fundamental form of M^n, respectively. Suppose R is constant and ≥ 1. We study the pinching problem on S and prove a rigidity theorem for M^n immersed in S^{n+p} with parallel normalized mean curvature vector field. When n ≥ 8, or n = 7 and p ≤ 2, the pinching constant is best.

13.
The Math Essential Skill Screener–Elementary Version (MESS-E) is a screener devised to identify primary grade students at risk for math difficulties. Item analysis, interitem consistency, test–retest reliability, decision efficiency, and construct validity of the MESS-E were studied using four independent samples of boys and girls grades 1–3 (aged 6–8). Item analysis revealed median item difficulty of .64 and median item discrimination of .75. Interitem consistency was .92 (n = 171) and .94 (n = 711), while 30-day test–retest reliability was .86 (n = 125). Exploratory factor analysis indicated a one-factor solution accounting for 37% of observed variance. LISREL 7 confirmatory factor analysis procedures determined that the one-factor model fit the standardization sample data poorly (goodness-of-fit index = .729, χ2 to df ratio = 9.91). The MESS-E yielded concurrent validity coefficients (n = 171) of .74 with the Woodcock–Johnson: Tests of Achievement–Revised (WJ-R) Math Cluster, .80 with the Wide-Range Achievement Test–Revised (WRAT-R) Arithmetic subtest and .73 with the KeyMath-R Operations Area standard scores. A diagnostic efficiency study yielded a total predictive value (TPV) of .93, sensitivity = .98, specificity = .88, positive predictive power (PPP) = .89, negative predictive power (NPP) = .98, and incremental validity = 39%. The MESS-E displayed a slight tendency to overidentify children potentially at risk for math difficulties. © 1998 John Wiley & Sons, Inc.
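
The diagnostic efficiency indices named in the abstract above are all derived from a 2 × 2 table of screener decisions against true status. The sketch below shows the standard formulas with made-up counts, not the study's data. It assumes that the abstract's "total predictive value" means the overall proportion of correct classifications, which the abstract does not define; the function name and dictionary keys are hypothetical.

```python
def diagnostic_efficiency(true_pos, false_pos, false_neg, true_neg):
    """Standard screening indices from a 2 x 2 classification table."""
    total = true_pos + false_pos + false_neg + true_neg
    return {
        # Share of truly at-risk children the screener flags.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Share of not-at-risk children the screener clears.
        "specificity": true_neg / (true_neg + false_pos),
        # Share of flagged children who are truly at risk (PPP).
        "positive_predictive_power": true_pos / (true_pos + false_pos),
        # Share of cleared children who are truly not at risk (NPP).
        "negative_predictive_power": true_neg / (true_neg + false_neg),
        # Assumed meaning of TPV: overall proportion classified correctly.
        "overall_accuracy": (true_pos + true_neg) / total,
    }

# Hypothetical counts for illustration only.
indices = diagnostic_efficiency(true_pos=40, false_pos=10, false_neg=5, true_neg=45)
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

A screener that overidentifies, as the abstract reports for the MESS-E, shows up in such a table as a relatively large false-positive count, which lowers specificity and positive predictive power while leaving sensitivity high.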

14.
Arguments favoring free- over forced-distribution Q sorts have assumed that forcing leads to loss of important statistical information and interferes with interval properties, rendering Pearson's r inappropriate for analysis. Q sorts with identical item orderings but with varied distributions are shown to provide essentially the same correlations and factor structures when coefficients are computed using Spearman's rs, Kendall's τ, and Pearson's r, leading to the conclusion that the same results are obtained, despite distribution and whether interval or ordinal statistics are used.

15.
Based on concerns about the item response theory (IRT) linking approach used in the Programme for International Student Assessment (PISA) until 2012 as well as the desire to include new, more complex, interactive items with the introduction of computer-based assessments, alternative IRT linking methods were implemented in the 2015 PISA round. The new linking method represents a concurrent calibration using all available data, enabling us to find item parameters that maximize fit across all groups and allowing us to investigate measurement invariance across groups. Apart from the Rasch model that historically has been used in PISA operational analyses, we compared our method against more general IRT models that can incorporate item-by-country interactions. The results suggest that our proposed method holds promise not only to provide a strong linkage across countries and cycles but also to serve as a tool for investigating measurement invariance.

16.
Background: Although on-demand testing is being increasingly used in many areas of assessment, it has not been adopted in high stakes examinations like the General Certificate of Secondary Education (GCSE) and General Certificate of Education Advanced level (GCE A level) offered by awarding organisations (AOs) in the UK. One of the major issues with on-demand testing is that some of the methods used for maintaining the comparability of standards over time in conventional testing are no longer available and the development of new methods is required.

Purpose: This paper proposes an item response theory (IRT) framework for implementing on-demand testing and maintaining the comparability of standards over time for general qualifications, including GCSEs and GCE A levels, in the UK and discusses procedures for its practical implementation.

Sources of evidence: Sources of evidence include literature from the fields of on-demand testing, the design of computer-based assessment, the development of IRT, and the application of IRT in educational measurement.

Main argument: On-demand testing presents many advantages over conventional testing. In view of the nature of general qualifications, including the use of multiple components and multiple question types, the advances made in item response modelling over the past 30 years, and the availability of complex IRT analysis software systems, coupled with increasing IRT expertise in awarding organisations, IRT models could be used to implement on-demand testing in high stakes examinations in the UK. The proposed framework represents a coherent and complete approach to maintaining standards in on-demand testing. The procedures for implementing the framework discussed in the paper could be adapted by people to suit their own needs and circumstances.

Conclusions: The use of IRT to implement on-demand testing could prove to be one of the viable approaches to maintaining standards over time or between test sessions for UK general qualifications.

17.
Compared to unidimensional item response models (IRMs), cognitive diagnostic models (CDMs) based on latent classes represent examinees' knowledge and item requirements using discrete structures. This study systematically examines the viability of retrofitting CDMs to IRM‐based data with a linear attribute structure. The study utilizes a procedure to make the IRM and CDM frameworks comparable and investigates how estimation accuracy is affected by test diagnosticity and the match between the true and fitted models. The study shows that comparable results can be obtained when highly diagnostic IRM data are retrofitted with CDM, and vice versa; that retrofitting CDMs to IRM‐based data in some conditions can result in considerable examinee misclassification; and that model fit indices provide limited indication of the accuracy of item parameter estimation and attribute classification.

18.
While morale among the elderly has been widely and extensively studied, results are varied and at times conflicting. Hence, the purpose of this study is to explore the factors affecting elderly morale of a select group of Filipinos in a community setting. A 64-item questionnaire was utilized to survey 323 Filipinos aged 60 and above residing in the National Capital Region of the Philippines in May 2013. Respondents completed a robotfoto, a checklist of chronic illnesses, and measures of social support, functional ability, geriatric depression, and morale. Structural equation modeling was used to test the hypothesized model. Two competing models emerged in the study. Model 1 followed causal relationships indicated in the hypothesized model while model 2 considered modification indices that yielded more acceptable fit indices (χ2/df = 1.414, GFI [goodness of fit index] = 0.988, CFI [comparative fit index] = 0.987, RMSEA [root mean square error of approximation] = 0.036). Chronic illness, social support, and depression were found to be major predictors of morale. Number of chronic illnesses and depression were also found to have a negative relationship with functional ability, and chronic illness and social support were negatively correlated. Findings can assist health professionals such as nurses to identify the factors that shape elderly morale vis-a-vis the use of effective strategies that promote the well-being of elderly people. The emerging model can serve as reference to assess the effectiveness of quality of care rendered as manifested by morale.
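
Two of the fit indices reported in the abstract above are linked: under the commonly used point-estimate formula, RMSEA = sqrt(max(χ2 − df, 0) / (df × (N − 1))), so RMSEA depends only on the chi-square to df ratio and the sample size. The sketch below applies that formula to the reported ratio of 1.414, assuming all 323 respondents entered the analysis (the abstract does not state the analysis sample size); the function name is hypothetical.

```python
import math

def rmsea_from_ratio(chi2_over_df, n):
    """RMSEA point estimate from the chi-square/df ratio and sample size.

    Uses sqrt(max(chi2/df - 1, 0) / (n - 1)), which is algebraically the
    same as sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    """
    return math.sqrt(max(chi2_over_df - 1.0, 0.0) / (n - 1))

# Reported ratio 1.414 with an assumed analysis sample of N = 323.
print(round(rmsea_from_ratio(1.414, 323), 3))  # 0.036
```

The result agrees with the reported RMSEA of 0.036. Some programs divide by N rather than N − 1, which gives the same value to three decimals here.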

19.
School climate surveys are widely applied in school districts across the nation to collect information about teacher efficacy, principal leadership, school safety, students' activities, and so forth. They enable school administrators to understand and address many issues on campus when used in conjunction with other student and staff data. However, these days each district develops the questionnaire according to its own needs and rarely provides supporting evidence for the reliability of items in the scale, that is, whether an individual item contributes significant information to the questionnaire. Item response theory (IRT) is a useful tool that helps examine how much information each item and the whole scale can provide. Our study applied IRT to examine individual items in a school climate survey and assessed the efficiency of the survey after the removal of items that contributed little to the scale. The purpose of this study is to show how IRT can be applied to empirically validate school climate surveys.

20.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.
