Similar Articles
20 similar articles retrieved (search time: 31 ms)
1.
Two item selection algorithms were compared in simulated linear and adaptive tests of cognitive ability. One algorithm selected items that maximally differentiated between examinees. The other used item response theory (IRT) to select items having maximum information for each examinee. Normally distributed populations of 1,000 cases were simulated, using test lengths of 4, 5, 6, and 7 items. Overall, adaptive tests based on maximum information provided the most information over the widest range of ability values and, in general, differentiated among examinees slightly better than the other tests. Although the maximum differentiation technique may be adequate in some circumstances, adaptive tests based on maximum information are clearly superior.
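As a rough illustration of the maximum-information selection rule described above (a minimal sketch, not code from the study; the 2PL form, the item pool, and the parameter values are illustrative assumptions):

```python
import numpy as np

def prob_2pl(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = prob_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_max_info_item(theta_hat, a, b, administered):
    """Index of the not-yet-administered item with maximum information at theta_hat."""
    info = item_information(theta_hat, a, b)
    info[list(administered)] = -np.inf  # exclude items already given
    return int(np.argmax(info))

# Hypothetical 10-item pool and a provisional ability estimate of 0.3
rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, 10)   # discriminations (illustrative)
b = rng.normal(0.0, 1.0, 10)    # difficulties (illustrative)
print(select_max_info_item(0.3, a, b, administered={2, 5}))
```

In an adaptive test, this selection step would alternate with re-estimation of theta after each response.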

2.
This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item‐level multidimensionality, and (2) whether a Projection IRT model can provide a useful remedy. A real‐data example is used to illustrate the problem and also is used as a base model for a simulation study. The results suggest that ignoring item‐level multidimensionality might lead to inflated item discrimination parameter estimates when the proportion of multidimensional test items to unidimensional test items is as low as 1:5. The Projection IRT model appears to be a useful tool for updating unidimensional item parameter estimates of multidimensional test items for a purified unidimensional interpretation.

3.
Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to the simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions. This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.
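For reference, the partial credit model (Masters, 1982) that the abstract extends to multiple groups is commonly written as follows; the notation here is a standard one and not necessarily that of the article:

```latex
P_{ik}(\theta) =
\frac{\exp\!\Big[\sum_{j=0}^{k}(\theta - \delta_{ij})\Big]}
     {\sum_{h=0}^{m_i}\exp\!\Big[\sum_{j=0}^{h}(\theta - \delta_{ij})\Big]},
\qquad k = 0,1,\dots,m_i,
\qquad \sum_{j=0}^{0}(\theta - \delta_{ij}) \equiv 0 .
```

In a multiple-group formulation, each group is additionally assigned its own latent trait mean and standard deviation, and DIF appears as between-group differences in the item parameters.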

4.
Cognitive diagnosis models (CDMs) continue to generate interest among researchers and practitioners because they can provide diagnostic information relevant to classroom instruction and student learning. However, the modeling component of this framework has outpaced its complementary component: test construction. Thus, most applications of cognitive diagnosis modeling involve retrofitting of CDMs to assessments constructed using classical test theory (CTT) or item response theory (IRT). This study explores the relationship between item statistics used in the CTT, IRT, and CDM frameworks using such an assessment, specifically a large-scale mathematics assessment. Furthermore, by highlighting differences between tests with varying levels of diagnosticity using a measure of item discrimination from a CDM approach, this study empirically uncovers some important CTT and IRT item characteristics. These results can be used to formulate practical guidelines in using IRT- or CTT-constructed assessments for cognitive diagnosis purposes.

5.
Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. The expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items when computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at ability values of 1.0 to 1.5, whereas the population distribution had a mean of 0. When scored dichotomously, the information in polytomous items decreased sharply, but they still provided more expected information than the other response formats did. For reference, a derivation of the information function for the generalized partial credit model is included.
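The derivation referred to in the abstract is given in the article itself; under the usual integer category scoring and logistic metric, the generalized partial credit model and its item information function take the standard forms below (the notation is illustrative, not necessarily the article's):

```latex
P_{ik}(\theta) =
\frac{\exp\!\Big[\sum_{j=0}^{k} a_i(\theta - b_{ij})\Big]}
     {\sum_{h=0}^{m_i}\exp\!\Big[\sum_{j=0}^{h} a_i(\theta - b_{ij})\Big]},
\qquad \sum_{j=0}^{0} a_i(\theta - b_{ij}) \equiv 0,
\qquad
I_i(\theta) = a_i^{2}\left[\sum_{k=0}^{m_i} k^{2}\,P_{ik}(\theta)
  - \Big(\sum_{k=0}^{m_i} k\,P_{ik}(\theta)\Big)^{2}\right],
```

i.e., the item information is the squared slope times the conditional variance of the item score at $\theta$.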

6.
An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together, in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model.
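For context, the single-facet (person × item) design mentioned above is usually written in GT as a decomposition into variance components; the notation below is the standard GT form, not necessarily the exact GIRM specification:

```latex
X_{pi} = \mu + \nu_p + \nu_i + \nu_{pi,e},
\qquad
\sigma^2_X = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e},
\qquad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n'_i},
```

where $E\rho^2$ is the generalizability coefficient for a test of $n'_i$ items. The GIRM approach estimates these components alongside the IRT item parameters within a single MCMC run.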

7.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.
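The information functions at issue have closed forms for these models; for the 3PL model (and the 2PL as the special case $c_i = 0$), the standard expressions are:

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}},
\qquad
I_i(\theta) = a_i^{2}\,\frac{Q_i(\theta)}{P_i(\theta)}
  \left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^{2},
\qquad
I(\theta) = \sum_i I_i(\theta),
```

with $Q_i = 1 - P_i$; the test information function is the sum of the item information functions, and $SE(\hat\theta) \approx 1/\sqrt{I(\theta)}$. Errors in $\hat a_i$, $\hat b_i$, and $\hat c_i$ therefore propagate directly into $I(\theta)$.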

8.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): the item score approach (ISA), the testlet score approach (TSA), and the item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degree of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA because of the loss of information with IRT approaches; however, this pattern did not hold for GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller errors in reliability estimates were observed for ISA and for the IRT methods. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is relatively large dependence among within-testlet items, INTA should be considered for IRT because of the nonnegligible loss of information.

9.
A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understanding of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models.
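As a brief illustration of the step-function distinction such a module covers, cumulative-logit models (e.g., the graded response model) and adjacent-category models (the partial credit family) build category probabilities differently; the forms below are standard textbook versions:

```latex
\text{GRM:}\quad
P^{*}_{ik}(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_{ik})}},
\qquad
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
\qquad P^{*}_{i0} = 1,\; P^{*}_{i,m_i+1} = 0;
```

```latex
\text{Partial credit family:}\quad
\frac{P_{ik}(\theta)}{P_{i,k-1}(\theta) + P_{ik}(\theta)}
  = \frac{1}{1 + e^{-a_i(\theta - b_{ik})}} .
```

The first models the probability of scoring at or above category $k$; the second models each adjacent step between categories $k-1$ and $k$.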

10.
Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers' partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science.

11.
Item analysis is an integral part of operational test development and is typically conducted within two popular statistical frameworks: classical test theory (CTT) and item response theory (IRT). In this digital ITEMS module, Hanwook Yoo and Ronald K. Hambleton provide an accessible overview of operational item analysis approaches within these frameworks. They review the different stages of test development and associated item analyses to identify poorly performing items and effective item selection. Moreover, they walk through the computational and interpretational steps for CTT‐ and IRT‐based evaluation statistics using simulated data examples and review various graphical displays such as distractor response curves, item characteristic curves, and item information curves. The digital module contains sample data, Excel sheets with various templates and examples, diagnostic quiz questions, data‐based activities, curated resources, and a glossary.
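A minimal sketch of the CTT side of such an item analysis (item difficulty as proportion correct and corrected item-total discrimination) might look like the following; the function name and the toy data are illustrative, not material from the module:

```python
import numpy as np

def classical_item_analysis(X):
    """CTT statistics for a persons-by-items 0/1 response matrix X.

    Returns item difficulty (proportion correct) and corrected item-total
    (point-biserial) discrimination for each item.
    """
    X = np.asarray(X, dtype=float)
    difficulty = X.mean(axis=0)
    total = X.sum(axis=1)
    discrimination = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        rest = total - X[:, j]            # total score excluding item j
        discrimination[j] = np.corrcoef(X[:, j], rest)[0, 1]
    return difficulty, discrimination

# Illustrative 5-person, 3-item response matrix
X = [[1, 0, 1],
     [1, 1, 1],
     [0, 0, 1],
     [1, 1, 0],
     [0, 0, 0]]
diff, disc = classical_item_analysis(X)
print(np.round(diff, 2), np.round(disc, 2))
```

Items with very low or very high proportions correct, or with near-zero (or negative) discrimination, are the usual flags for review.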

12.
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats—the figural response (FR) and constructed response (CR) formats used in a K–12 computerized science test. The item response theory (IRT) information function and confirmatory factor analysis (CFA) were employed to address the research questions. It was found that the FR items were similar to the multiple-choice (MC) items in providing information and efficiency, whereas the CR items provided noticeably more information than the MC items but tended to provide less information per minute. The CFA suggested that the innovative formats and the MC format measure similar constructs. Innovations in computerized item formats are reviewed, and the merits as well as challenges of implementing the innovative formats are discussed.

13.
The nature of anatomy education has changed substantially in recent decades, though the traditional multiple‐choice written examination remains the cornerstone of assessing students' knowledge. This study sought to measure the quality of a clinical anatomy multiple‐choice final examination using item response theory (IRT) models. One hundred seventy‐six students took a multiple‐choice clinical anatomy examination. One‐ and two‐parameter IRT models (difficulty and discrimination parameters) were used to assess item quality. The two‐parameter IRT model demonstrated a wide range in item difficulty, with a median of −1.0 and a range from −2.0 to 0.0 (25th to 75th percentile). Similar results were seen for discrimination (median 0.6; range 0.4–0.8). The test information curve achieved maximum discrimination for an ability level one standard deviation below the average. There were 15 items with standardized loadings less than 0.3, which was due to several factors: two items had two correct responses, one was not well constructed, two were too easy, and the others revealed a lack of detailed knowledge by students. The test used in this study was more effective in discriminating students of lower ability than those of higher ability. Overall, the quality of the examination in clinical anatomy was confirmed by the IRT models. Anat Sci Educ 3:17–24, 2010. © 2009 American Association of Anatomists.

14.
Applied Measurement in Education, 2013, 26(4): 311-330
Referral, placement, and retention decisions were analyzed using item response theory (IRT) to investigate whether classification decisions could be placed on the latent continuum of ability normally associated with test items. A second question pertained to the existence of classification differential item functioning (DIF) for the various decisions. When the decisions were calibrated, the resulting "item" parameters were similar to those that might be expected from conventional test items. For classification DIF analyses, referral decisions were found to function differently for Whites versus non-Whites. Analyzing decisions introduces a new unit of analysis for IRT and represents a powerful methodology that could be applied to a variety of new problem types.

15.
Item response models are finding increasing use in achievement and aptitude test development. Item response theory (IRT) test development involves the selection of test items based on a consideration of their item information functions. But a problem arises because item information functions are determined by their item parameter estimates, which contain error. When the "best" items are selected on the basis of their statistical characteristics, there is a tendency to capitalize on chance due to errors in the item parameter estimates. The resulting test, therefore, falls short of the test that was desired or expected. The purposes of this article are (a) to highlight the problem of item parameter estimation errors in the test development process, (b) to demonstrate the seriousness of the problem with several simulated data sets, and (c) to offer a conservative solution for addressing the problem in IRT-based test development.
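The capitalization-on-chance mechanism described above can be reproduced in a few lines: if items are ranked by information computed from error-laden estimates, the selected set looks more informative than it really is. The sketch below is a toy simulation under an assumed 2PL pool with arbitrary error magnitudes, not the article's design:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_2pl(theta, a, b):
    """2PL item information a^2 * P * (1 - P) at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

n_pool, n_select, theta0 = 400, 30, 0.0
a_true = rng.lognormal(mean=0.0, sigma=0.3, size=n_pool)
b_true = rng.normal(0.0, 1.0, n_pool)
a_hat = a_true + rng.normal(0.0, 0.15, n_pool)   # estimation error in a
b_hat = b_true + rng.normal(0.0, 0.20, n_pool)   # estimation error in b

# Pick the items with the largest *estimated* information at theta0 ...
picked = np.argsort(info_2pl(theta0, a_hat, b_hat))[-n_select:]

# ... then compare the promised test information with the true value
estimated = info_2pl(theta0, a_hat[picked], b_hat[picked]).sum()
actual = info_2pl(theta0, a_true[picked], b_true[picked]).sum()
print(f"estimated test information: {estimated:.1f}, actual: {actual:.1f}")
```

Typically the estimated value exceeds the actual one, which is the shortfall that a conservative selection procedure is meant to guard against.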

16.
Bayesian methods incorporate model parameter information prior to data collection. Eliciting information from content experts is an option, but has seen little implementation in Bayesian item response theory (IRT) modeling. This study aims to use content experts in ethical reasoning to elicit prior information and incorporate this information into Markov Chain Monte Carlo (MCMC) estimation. A six‐step elicitation approach is followed, with relevant details at each stage for two IRT item parameters: difficulty and guessing. Results indicate that using content experts is the preferred approach, rather than noninformative priors, for both parameter types. The use of a noninformative prior for small samples produced dramatically different results compared with those from content expert–elicited priors. The WAMBS (When to worry and how to Avoid the Misuse of Bayesian Statistics) checklist is used to aid in comparisons.

17.
The analytically derived asymptotic standard errors (SEs) of maximum likelihood (ML) item estimates can be approximated by a mathematical function without examinees' responses to test items, whereas the empirically determined SEs of marginal maximum likelihood estimation (MMLE)/Bayesian item estimates can be obtained when the same set of items is repeatedly estimated from simulated (or resampled) test data. The latter method will result in rather stable and accurate SE estimates as the number of replications increases, but requires cumbersome and time-consuming calculations. Instead of using the empirically determined method, the adequacy of using the analytical method to predict the SEs of item parameter estimates was examined by comparing results produced from both approaches. The results indicated that the SEs yielded by both approaches were, in most cases, very similar, especially when they were applied to a generalized partial credit model. This finding encourages test practitioners and researchers to apply the analytically derived asymptotic SEs of item estimates to the context of item-linking studies, as well as to the method of quantifying the SEs of equating scores for the item response theory (IRT) true-score method. Three-dimensional graphical presentations of the analytical SEs of item estimates, as a bivariate function of item difficulty and item discrimination, were also provided for a better understanding of several frequently used IRT models.
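One common analytical approximation of the kind described, sketched here for a dichotomous item answered by $N$ examinees with ability density $g(\theta)$ (the exact expressions used in the article may differ), is based on the expected information matrix for the item parameter vector $\xi_i$:

```latex
\mathcal{I}(\xi_i) = N \int
\frac{1}{P_i(\theta)\,Q_i(\theta)}\,
\frac{\partial P_i(\theta)}{\partial \xi_i}\,
\frac{\partial P_i(\theta)}{\partial \xi_i^{\top}}\,
g(\theta)\, d\theta,
\qquad
SE(\hat{\xi}_{ij}) \approx \sqrt{\big[\mathcal{I}(\xi_i)^{-1}\big]_{jj}} .
```

Because the integral depends only on the item parameters, the sample size, and the assumed ability distribution, the SEs can be approximated without examinees' actual responses.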

18.
In this article we present a general approach that does not rely on item response theory models (non‐IRT) to detect differential item functioning (DIF) in dichotomous items in the presence of guessing. The proposed nonlinear regression (NLR) procedure for DIF detection is an extension of the logistic regression method. As a non‐IRT approach, NLR can be seen as a proxy for detection based on the three‐parameter IRT model, which is a standard tool in the field. Hence, NLR fills a logical gap in DIF detection methodology and as such is important for educational purposes. Moreover, the advantages of the NLR procedure, as well as comparisons to other commonly used methods, are demonstrated in a simulation study. A real-data analysis is offered to demonstrate the practical use of the method.
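The abstract does not give the exact parameterization; one common form of such a nonlinear regression DIF model with a lower (guessing) asymptote, for test-taker $j$ on item $i$ with matching score $X_j$ and group indicator $G_j$, is:

```latex
P(Y_{ij} = 1 \mid X_j, G_j) =
c_i + (1 - c_i)\,
\frac{1}{1 + \exp\!\big[-(\beta_{0i} + \beta_{1i} X_j + \beta_{2i} G_j + \beta_{3i} X_j G_j)\big]},
```

where $\beta_{2i} \neq 0$ signals uniform DIF and $\beta_{3i} \neq 0$ signals non-uniform DIF, tested against the null model with $\beta_{2i} = \beta_{3i} = 0$.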

19.
This study sought a scientific way to examine whether item response curves are influenced systematically by the cognitive processes underlying solution of the items in a procedural domain (addition of fractions). Starting from an expert teacher's logical task analysis and prediction of various erroneous rules and sources of misconceptions, an error diagnostic program was developed. This program was used to carry out an error analysis of test performance by three samples of students. After the cognitive structure of the subtasks was validated by a majority of the students, the items were characterized by their underlying subtask patterns. It was found that item response curves for items in the same categories were significantly more homogeneous than those in different categories. In other words, underlying cognitive subtasks appeared to systematically influence the slopes and difficulties of item response curves.

20.
Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.
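In a common notation (not necessarily the article's), the first two models for item $i$ belonging to testlet $d(i)$ can be written as:

```latex
\text{bi-factor:}\quad
\operatorname{logit} P_i(\theta_0, \gamma_{d(i)}) = a_{i0}\,\theta_0 + a_{id}\,\gamma_{d(i)} + c_i;
\qquad
\text{testlet effects:}\quad a_{id} = \lambda_{d(i)}\, a_{i0},
```

where $\theta_0$ is the primary trait, $\gamma_{d(i)}$ the nuisance testlet trait, and $\lambda_{d(i)}$ a single proportionality constant per testlet. Setting $a_{id} = 0$ for all items gives the independent-items model, and summing scores within each testlet gives the polytomous model.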
