Similar Articles
A total of 20 similar articles were retrieved.
1.
The present study examines bias in parameter estimates and standard errors in cross-classified random effects modeling (CCREM) caused by omitting the random interaction effects of the cross-classified factors, focusing on within-cell sample size and the proportion of small cells. A Monte Carlo simulation study was conducted to compare the correctly specified and the misspecified CCREM. While bias in the fixed effects was negligible, substantial bias was found in the random effects of the misspecified model, depending on the within-cell sample size and the proportion of small cells; no bias occurred under the correctly specified model. The study suggests including the random interaction effects when conducting CCREM to avoid overestimating variance components and to obtain accurate estimates. The findings clarify the cross-classification conditions under which this misspecification matters and provide a practical reference for applied researchers using CCREM.
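For readers less familiar with the specification issue, a generic two-way cross-classified random effects model with a random interaction term can be sketched as follows (the notation is illustrative and not taken from the study):

y_{i(jk)} = \gamma_0 + u_j + v_k + w_{jk} + e_{i(jk)}, \qquad u_j \sim N(0,\sigma^2_u),\; v_k \sim N(0,\sigma^2_v),\; w_{jk} \sim N(0,\sigma^2_{uv}),\; e_{i(jk)} \sim N(0,\sigma^2_e)

The misspecified model drops w_{jk}; its variance must then be absorbed by the remaining components, which is consistent with the overestimation of variance components described above.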

2.
Value-added assessment methods have been criticized by researchers and policy makers for a number of reasons. One issue is the sensitivity of model results across different outcome measures. This study examined the utility of incorporating multivariate latent variable approaches within a traditional value-added framework. We evaluated the stability of teacher rankings across univariate and multivariate measurement model structures and scaling metric combinations using a cumulative cross-classified mixed effects model. Our results showed that multivariate models were more stable across modeling conditions than univariate approaches. These findings suggest there is potential utility in incorporating multiple measures into teacher evaluation systems, yet future research will need to evaluate the degree to which models recover known population parameters via Monte Carlo simulation.

3.
In most U.S. schools, teachers are evaluated using observation of teaching practice (OTP). This study investigates rater effects on OTP ratings among 421 principals in an authentic teacher evaluation system. Many-facet Rasch analysis (MFR) using a block of shared ratings revealed that principals generally (a) differentiated between more and less effective teachers, (b) rated their teachers with leniency (i.e., overused higher rating categories), and (c) differentiated between teaching practices (e.g., Cognitive Engagement vs. Classroom Management) with minimal halo effect. Individual principals varied significantly in degree of leniency, and approximately 12% of principals exhibited severe rater bias. Implications for use of OTP ratings for evaluating teachers’ effectiveness are discussed. Strengths and limitations of MFR to analyze rater effects in OTP are also discussed.
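For reference, a common formulation of the many-facet Rasch model for a design of this kind (teachers rated by principals on several teaching-practice dimensions) is, in generic notation:

\log\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \lambda_j - \delta_i - \tau_k

where \theta_n is teacher n's effectiveness, \lambda_j is principal j's severity (leniency corresponds to low severity), \delta_i is the difficulty of teaching-practice dimension i, and \tau_k is the threshold for rating category k.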

4.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.
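To make the distinction between the three frameworks concrete, illustrative two-parameter logistic forms (a generic parameterization, not necessarily the one used in the article) are:

UIRT: \operatorname{logit} P_{ij} = a_i(\theta_j - b_i)
SS-MIRT: \operatorname{logit} P_{ij} = a_i(\theta_{j,f(i)} - b_i), with one correlated factor per item format f(i)
BF-MIRT: \operatorname{logit} P_{ij} = a^{g}_i \theta^{g}_j + a^{s}_i \theta^{s}_{j,f(i)} - \beta_i, with a general factor plus orthogonal format-specific factors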

5.
An item-preequating design and a random groups design were used to equate forms of the American College Testing (ACT) Assessment Mathematics Test. Equipercentile and 3-parameter logistic model item response theory (IRT) procedures were used for both designs. Both preequating methods produced inadequate equating results, and the IRT item preequating method resulted in more equating error than if no equating had been conducted. Although neither of the item preequating methods performed well, the results from the equipercentile preequating method were more consistent with those from the random groups method than were the results from the IRT item preequating method. Item context and position effects were likely responsible, at least in part, for the inadequate results for item preequating. Such effects need to be either controlled or modeled, and the design further researched, before the item preequating design can be recommended for operational use.
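The 3-parameter logistic model referenced here has the standard form

P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-D a_i(\theta - b_i)]}

with discrimination a_i, difficulty b_i, lower asymptote c_i, and the usual scaling constant D \approx 1.7; item preequating attempts to place the estimated a_i, b_i, c_i for pretested items onto the operational scale so that a new form can be equated before it is administered intact.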

6.
The usefulness of item response theory (IRT) models depends, in large part, on the accuracy of item and person parameter estimates. For the standard 3-parameter logistic model, for example, these parameters include the item parameters of difficulty, discrimination, and pseudo-chance, as well as the person ability parameter. Several factors affect traditional marginal maximum likelihood (ML) estimation of IRT model parameters, including sample size, with smaller samples generally being associated with lower parameter estimation accuracy and inflated standard errors for the estimates. Given this deleterious impact of small samples, estimation becomes difficult with the low-incidence populations where IRT might prove particularly useful, especially for more complex models. Recently, a Pairwise estimation method for Rasch model parameters has been suggested for use with missing data, and it may also hold promise for parameter estimation with small samples. This simulation study compared the item difficulty parameter estimation accuracy of ML with the Pairwise approach to ascertain the benefits of the latter method. The results support the use of the Pairwise method with small samples, particularly for obtaining item location estimates.
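To illustrate the general idea behind pairwise estimation (though not necessarily the exact algorithm evaluated in the study), the following Python sketch computes Choppin-style pairwise estimates of Rasch item difficulties from a complete 0/1 response matrix; the function name and the centering choice are assumptions of the example.

import numpy as np

def pairwise_rasch_difficulties(X):
    """Choppin-style pairwise estimates of Rasch item difficulties.

    X: (n_persons, n_items) matrix of 0/1 responses with no missing data.
    Returns difficulties centered at zero.
    """
    n_items = X.shape[1]
    b = np.zeros(n_items)
    for i in range(n_items):
        log_ratios = []
        for j in range(n_items):
            if i == j:
                continue
            # n_ij: persons who answered item i correctly and item j incorrectly
            n_ij = np.sum((X[:, i] == 1) & (X[:, j] == 0))
            n_ji = np.sum((X[:, j] == 1) & (X[:, i] == 0))
            if n_ij > 0 and n_ji > 0:  # skip uninformative pairs
                # Under the Rasch model, E[n_ji / n_ij] = exp(b_i - b_j)
                log_ratios.append(np.log(n_ji / n_ij))
        b[i] = np.mean(log_ratios)
    return b - b.mean()  # identify the scale by centering

Because the estimator uses only pairwise right/wrong counts, it remains computable with small samples and sparse designs, although it is generally less statistically efficient than full likelihood methods.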

7.
Mokken scale analysis (MSA) is a probabilistic‐nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic‐nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple‐choice physical science assessment and a rater‐mediated writing assessment.
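At the core of MSA is Loevinger's scalability coefficient H, which can be expressed in terms of observed and expected Guttman errors (notation illustrative):

H_{ij} = 1 - \frac{F_{ij}}{E_{ij}}, \qquad H = 1 - \frac{\sum_{i<j} F_{ij}}{\sum_{i<j} E_{ij}}

where F_{ij} is the observed number of Guttman errors for item pair (i, j) and E_{ij} is the number expected under marginal independence; a common rule of thumb treats H \geq .3 as the minimum for a (weak) Mokken scale.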

8.
An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together, in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model.
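In rough outline, for a single-facet persons-by-items design with binary items, the GIRM idea can be sketched as a Rasch-type model in which both facets are random (a generic formulation rather than the article's exact specification):

\operatorname{logit} P(X_{pi} = 1) = \theta_p - b_i, \qquad \theta_p \sim N(0, \sigma^2_{p}), \quad b_i \sim N(0, \sigma^2_{i})

so that MCMC estimation returns the item and person parameters together with variance components playing the role of the GT person and item components, with the residual Bernoulli variability completing the usual decomposition.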

9.
The present study evaluated the multiple imputation method, a procedure that is similar to the one suggested by Li and Lissitz (2004), and compared the performance of this method with that of the bootstrap method and the delta method in obtaining the standard errors for the estimates of the parameter scale transformation coefficients in item response theory (IRT) equating in the context of the common‐item nonequivalent groups design. Two different estimation procedures for the variance‐covariance matrix of the IRT item parameter estimates, which were used in both the delta method and the multiple imputation method, were considered: empirical cross‐product (XPD) and supplemented expectation maximization (SEM). The results of the analyses with simulated and real data indicate that the multiple imputation method generally produced very similar results to the bootstrap method and the delta method in most of the conditions. The differences between the estimated standard errors obtained by the methods using the XPD matrices and the SEM matrices were very small when the sample size was reasonably large. When the sample size was small, the methods using the XPD matrices appeared to yield slight upward bias for the standard errors of the IRT parameter scale transformation coefficients.
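For context, the delta method approximates the sampling variance of a scale transformation coefficient A = f(\hat{\eta}), where \hat{\eta} collects the common-item parameter estimates:

\operatorname{Var}(\hat{A}) \approx \left(\frac{\partial f}{\partial \eta}\right)^{\top} \Sigma_{\hat{\eta}} \left(\frac{\partial f}{\partial \eta}\right), \qquad SE(\hat{A}) = \sqrt{\operatorname{Var}(\hat{A})}

with \Sigma_{\hat{\eta}} the variance-covariance matrix of the item parameter estimates, here obtained from either the XPD or the SEM procedure.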

10.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

11.
Applied Measurement in Education, 2013, 26(2): 125-141
Item parameter instability can threaten the validity of inferences about changes in student achievement when using Item Response Theory (IRT)-based test scores obtained on different occasions. This article illustrates a model-testing approach for evaluating the stability of IRT item parameter estimates in a pretest-posttest design. Stability of item parameter estimates was assessed for a random sample of pretest and posttest responses to a 19-item math test. Using MULTILOG (Thissen, 1986), IRT models were estimated in which the item parameters were either constrained to be equal across samples (reflecting stability) or allowed to vary freely across samples (reflecting instability). These competing models were then compared statistically in order to test the invariance assumption. The results indicated a moderately high degree of stability in the item parameter estimates for a group of children assessed on two different occasions.
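The statistical comparison described here is a standard likelihood-ratio test of parameter invariance; schematically,

G^2 = -2\left[\ln L_{\text{constrained}} - \ln L_{\text{free}}\right] \sim \chi^2_{df}, \qquad df = \text{number of parameters freed}

so a nonsignificant G^2 supports the invariance (stability) of the item parameter estimates across occasions.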

12.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.
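For reference, a two-parameter member of Samejima's logistic positive exponent family can be sketched as

P_i(\theta) = \left[\frac{1}{1 + \exp\{-D a_i(\theta - b_i)\}}\right]^{\xi_i}, \qquad \xi_i > 0

where the acceleration parameter \xi_i controls the asymmetry of the response function (\xi_i = 1 recovers the ordinary 2PL), and a lower asymptote can be added in the same way the 3PL extends the 2PL; this asymmetry is the kind of systematic misfit the article links to distortions of the vertical scale.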

13.
The analytically derived asymptotic standard errors (SEs) of maximum likelihood (ML) item parameter estimates can be approximated from a mathematical function that does not require examinees' responses to test items, whereas empirically determined SEs of marginal maximum likelihood estimation (MMLE)/Bayesian item estimates are obtained by repeatedly estimating the same set of items from simulated (or resampled) test data. The latter method yields rather stable and accurate SE estimates as the number of replications increases, but requires cumbersome and time-consuming calculations. This study examined the adequacy of the analytical method, as a substitute for the empirical method, in predicting the SEs of item parameter estimates by comparing the results produced by both approaches. The results indicated that the SEs yielded by the two approaches were, in most cases, very similar, especially when applied to the generalized partial credit model. This finding encourages test practitioners and researchers to apply the analytically derived asymptotic SEs of item estimates in item-linking studies, as well as in quantifying the SEs of equated scores for the item response theory (IRT) true-score method. A three-dimensional graphical presentation of the analytical SEs of item estimates, as a bivariate function of item difficulty and item discrimination, was also provided to aid understanding of several frequently used IRT models.
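One standard route to such analytic SEs is the expected (Fisher) information: for an item with parameter vector \eta_i, a calibration sample of size N, and an assumed ability density g(\theta), a sketch of the approximation for a dichotomous item is

I(\eta_i) = N \int \frac{\left[\partial P_i(\theta;\eta_i)/\partial \eta_i\right]\left[\partial P_i(\theta;\eta_i)/\partial \eta_i\right]^{\top}}{P_i(\theta;\eta_i)\left[1 - P_i(\theta;\eta_i)\right]}\, g(\theta)\, d\theta, \qquad SE(\hat{\eta}_{ik}) \approx \sqrt{\left[I(\eta_i)^{-1}\right]_{kk}}

which requires only the item parameters and the assumed ability distribution, not observed responses; polytomous models such as the generalized partial credit model use the analogous category-probability expression.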

14.
A developmental scale for the North Carolina End-of-Grade Mathematics Tests was created using a subset of identical test forms administered to adjacent grade levels. Thurstone scaling and item response theory (IRT) techniques were employed to analyze the changes in grade distributions across these linked forms. Three variations of Thurstone scaling were examined, one based on Thurstone's 1925 procedure and two based on Thurstone's 1938 procedure. The IRT scaling was implemented using both BIMAIN and MULTILOG. All methods indicated that average mathematics performance improved from Grade 3 to Grade 8, with similar results for the two IRT analyses and one version of Thurstone's 1938 method. The standard deviations of the IRT scales did not show a consistent pattern across grades, whereas those produced by Thurstone's 1925 procedure generally decreased; one version of the 1938 method exhibited slightly increasing variation with increasing grade level, while the other version displayed inconsistent trends.

15.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches. However, this could not be applied in GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information.

16.
This study demonstrated the equivalence between the Rasch testlet model and the three‐level one‐parameter testlet model and explored the Markov Chain Monte Carlo (MCMC) method for model parameter estimation in WINBUGS. The estimation accuracy from the MCMC method was compared with those from the marginalized maximum likelihood estimation (MMLE) with the expectation‐maximization algorithm in ConQuest and the sixth‐order Laplace approximation estimation in HLM6. The results indicated that the estimation methods had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates. Analyses of three real tests generally supported the findings from the simulation and indicated that the estimates for item difficulty and ability parameters were highly correlated across estimation methods.
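For reference, the Rasch testlet model discussed here can be written as a Rasch model with an added person-by-testlet random effect (generic notation):

\operatorname{logit} P(X_{pi} = 1) = \theta_p - b_i + \gamma_{p\,d(i)}, \qquad \gamma_{pd} \sim N(0, \sigma^2_{d})

where d(i) indexes the testlet containing item i; treating \gamma as an additional random effect associated with persons within testlets is the link to the three-level one-parameter formulation examined in the study.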

17.
A mixed‐effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed‐effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated using a Metropolis‐Hastings Robbins‐Monro (MH‐RM) stochastic imputation algorithm to accommodate the increased dimensionality due to modeling multiple design‐ and trait‐based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four‐parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression model, along with their extensions, are presented and discussed; Monte Carlo simulations are conducted to determine the efficiency of parameter recovery of the MH‐RM algorithm; and an empirical example using the extended mixed‐effects IRT model is presented.

18.
This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a priori (EAP), modal a priori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test characteristic curve (i.e., the IRT true‐score (TS) estimator). The five methods are compared using a simulation study and a real data example. Results indicated that the application of different methods can sometimes lead to different estimated cut scores, and that there can be some key differences in impact data when using the IRT TS estimator compared to other methods. It is suggested that practitioners carefully consider their choice of methods for estimating ability and cut scores, because different methods have distinct features and properties. An important consideration in the application of Bayesian methods relates to the choice of the prior and the potential bias that priors may introduce into estimates.
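As an illustration of the TS approach, the Python sketch below translates a panel's summed Angoff ratings into a theta cut score by inverting the test characteristic curve under a 3PL model; the function names and the search interval are assumptions of the example, not part of the article.

import numpy as np
from scipy.optimize import brentq

def p3pl(theta, a, b, c, D=1.7):
    # 3PL item response function
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def tcc_cut(angoff_sum, a, b, c):
    # Solve sum_i P_i(theta) = summed Angoff rating for theta.
    # Assumes the root lies in (-6, 6) and that sum(c) < angoff_sum < n_items.
    f = lambda theta: np.sum(p3pl(theta, a, b, c)) - angoff_sum
    return brentq(f, -6.0, 6.0)

# Example usage (hypothetical parameter arrays): theta_cut = tcc_cut(12.5, a, b, c)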

19.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.
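The item and test information functions mentioned here are straightforward to compute; a minimal Python sketch under a 2PL model (generic parameter names) is:

import numpy as np

def item_info_2pl(theta, a, b, D=1.7):
    # 2PL item information: I_i(theta) = D^2 * a_i^2 * P_i * (1 - P_i)
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def test_info(theta, a, b):
    # Test information is the sum of the item informations at theta
    return np.sum(item_info_2pl(theta, np.asarray(a), np.asarray(b)))

Selecting the items with the largest information values near a target theta capitalizes on positively biased discrimination estimates, which is consistent with the overestimation of the test information function noted above.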

20.