Similar Articles
20 similar articles found.
1.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is item response theory–based true score equating, whose goal is to generate a conversion table that relates number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves, without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while at the same time retaining a large number of linking items.
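The abstract above hinges on IRT true score equating via test characteristic curves. The following is a minimal sketch, in Python with hypothetical 3PL item parameters, of the basic TCC-based conversion table such equating produces; it is not the stepwise drift-detection method the article proposes.

```python
import numpy as np

def p3pl(theta, a, b, c, D=1.7):
    """3PL correct-response probabilities; broadcasts theta against item parameter vectors."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

rng = np.random.default_rng(0)
n_items = 40
# hypothetical item parameters for the new form X and the old (base) form Y
aX, bX, cX = rng.uniform(0.8, 2.0, n_items), rng.normal(0.0, 1.0, n_items), np.full(n_items, 0.2)
aY, bY, cY = rng.uniform(0.8, 2.0, n_items), rng.normal(0.2, 1.0, n_items), np.full(n_items, 0.2)

theta = np.linspace(-6, 6, 1201)
tcc_X = p3pl(theta[:, None], aX, bX, cX).sum(axis=1)   # test characteristic curve of form X
tcc_Y = p3pl(theta[:, None], aY, bY, cY).sum(axis=1)   # test characteristic curve of form Y

conversion = {}
for x in range(n_items + 1):
    if not (tcc_X[0] < x < tcc_X[-1]):
        conversion[x] = None                            # scores at or below the chance level have no theta solving TCC_X(theta) = x
        continue
    theta_x = np.interp(x, tcc_X, theta)                # invert the monotone TCC of form X
    conversion[x] = round(float(np.interp(theta_x, theta, tcc_Y)), 2)

print({x: conversion[x] for x in (10, 20, 30, 40)})
```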

2.
Item analysis is an integral part of operational test development and is typically conducted within two popular statistical frameworks: classical test theory (CTT) and item response theory (IRT). In this digital ITEMS module, Hanwook Yoo and Ronald K. Hambleton provide an accessible overview of operational item analysis approaches within these frameworks. They review the different stages of test development and the associated item analyses used to identify poorly performing items and support effective item selection. Moreover, they walk through the computational and interpretational steps for CTT‐ and IRT‐based evaluation statistics using simulated data examples and review various graphical displays such as distractor response curves, item characteristic curves, and item information curves. The digital module contains sample data, Excel sheets with various templates and examples, diagnostic quiz questions, data‐based activities, curated resources, and a glossary.
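Because the module above centers on CTT item statistics, here is a minimal sketch, with a simulated 0/1 response matrix, of two classical quantities it reviews: item difficulty as proportion correct and item discrimination as the corrected point-biserial correlation. All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.integers(0, 2, size=(500, 30))      # 500 examinees x 30 scored items (1 = correct)

difficulty = scores.mean(axis=0)                 # classical p-values: proportion correct per item
total = scores.sum(axis=1)

point_biserial = np.empty(scores.shape[1])
for j in range(scores.shape[1]):
    rest = total - scores[:, j]                  # "corrected" total score excludes the item itself
    point_biserial[j] = np.corrcoef(scores[:, j], rest)[0, 1]

for j in range(3):
    print(f"item {j}: p = {difficulty[j]:.2f}, corrected r_pbis = {point_biserial[j]:.2f}")
```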

3.
It has long been argued that U.S. states’ differential performance on nationwide assessments may reflect differences in students’ opportunity to learn the tested content that is primarily due to variation in curricular content standards, rather than in instructional quality or educational investment. To quantify the effect of differences in states’ intended curricular goals on test item performance in the mid-to-late 2000s, we use fractional logit regression of state-specific mathematics item difficulty values on a measure of content emphasis in state elementary school mathematics curricular standards documents. Finding weak but positive associations between content emphasis in state standards and proportion-correct item difficulty, we conclude that variations in states’ intended curriculum content, alone, appear to have had limited influence on cross-state mathematics test item performance during the time frame examined. Implications for cross-state assessment are discussed.

4.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two-stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and those based on the standard errors of ratings should be avoided. Suggestions are provided for using these indexes in future research and practice.
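As a rough illustration of the recommended absolute-deviation type of fit index, the sketch below compares hypothetical Angoff-style ratings with 3PL ICC values at an assumed cut score; the item parameters, ratings, and cut point are all made up, and the exact indexes studied in the article may differ.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve value(s) at ability theta."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

# hypothetical 3PL parameters for four items and an assumed cut score on the theta scale
a = np.array([1.0, 1.3, 0.8, 1.6])
b = np.array([-0.4, 0.1, 0.7, 1.2])
c = np.array([0.20, 0.18, 0.25, 0.15])
theta_cut = 0.5

expected = icc_3pl(theta_cut, a, b, c)            # model-implied proportion correct at the cut
ratings = np.array([                              # two raters' Angoff-style item judgments
    [0.70, 0.60, 0.45, 0.30],
    [0.90, 0.80, 0.20, 0.10],
])
fit = np.abs(ratings - expected).mean(axis=1)     # mean absolute deviation from the ICCs, per rater
print(np.round(fit, 3))                           # smaller values indicate better rater fit
```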

5.
Simulation and real data studies are used to investigate the value of modeling multiple-choice distractors on item response theory linking. Using the characteristic curve linking procedure for Bock's (1972) nominal response model presented by Kim and Hanson (2002), all-category linking (i.e., a linking based on all category characteristic curves of the linking items) is compared against correct-only (CO) linking (i.e., linking based on the correct category characteristic curves only) using a common-item nonequivalent groups design. The CO linking is shown to represent an approximation to what occurs when using a traditional correct/incorrect item response model for linking. Results suggest that the number of linking items needed to achieve an equivalent level of linking precision declines substantially when incorporating the distractor categories.
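For readers unfamiliar with Bock's nominal response model, the sketch below computes the category characteristic curves that drive both the all-category and correct-only linkings; the slope and intercept values are hypothetical, and the Kim and Hanson linking procedure itself is not shown.

```python
import numpy as np

def nominal_ccc(theta, a, c):
    """Category probabilities of one item at each theta under Bock's nominal model (a softmax)."""
    z = np.outer(theta, a) + c                    # shape (n_theta, n_categories)
    z -= z.max(axis=1, keepdims=True)             # subtract row max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

theta = np.linspace(-3, 3, 7)
a = np.array([1.2, -0.4, -0.5, -0.3])             # hypothetical slopes: correct option first, then 3 distractors
c = np.array([0.5, 0.1, -0.2, -0.4])              # hypothetical intercepts; both vectors sum to zero (usual constraint)
probs = nominal_ccc(theta, a, c)
print(np.round(probs, 3))                         # each row sums to 1; column 0 rises with theta
```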

6.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who completed their instruction in the course from 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.

7.
This paper investigates gender differences in mathematics performance among Italian students at the end of lower secondary school. The study is based on a new large-scale assessment test developed and administered by the National Evaluation Institute for the School System. Given evidence in the literature favoring males, the performance of female and male students is compared using several approaches. Scores proposed by educational experts based on item subgroups were considered, and a model-based approach was used within item response theory. The results revealed a significant advantage for males in overall performance, while no meaningful differences were observed with respect to item domain and type. An interpretable item map was developed by crossing expert judgments with IRT ability estimates, and plausible proficiency levels were defined. According to the map-based classification of students, a lower percentage of females than males fell into the highest proficiency groups.

8.
Course Practice Based on Project Training (cited 3 times; self-citations: 0; citations by others: 3)
Drawing on constructivist learning theory and many years of practice in introductory computer education, this paper proposes a reform of course practice centered on project-based training. During instruction, the emphasis is placed on application-oriented goals, with an online teaching platform and rich instructional resources used to guide students in learning independently. By completing training projects, students develop practical application skills, creativity, and teamwork, and deepen their understanding and mastery of fundamental knowledge, so that the teaching goals of introductory computer education are truly achieved.

9.
Item response models are finding increasing use in achievement and aptitude test development. Item response theory (IRT) test development involves the selection of test items based on a consideration of their item information functions. But a problem arises because item information functions are determined by their item parameter estimates, which contain error. When the "best" items are selected on the basis of their statistical characteristics, there is a tendency to capitalize on chance due to errors in the item parameter estimates. The resulting test, therefore, falls short of the test that was desired or expected. The purposes of this article are (a) to highlight the problem of item parameter estimation errors in the test development process, (b) to demonstrate the seriousness of the problem with several simulated data sets, and (c) to offer a conservative solution for addressing the problem in IRT-based test development.
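The selection step described above can be sketched as follows: compute the 3PL item information at a target ability and keep the most informative items. The parameters below are simulated for illustration; with estimation error added to them, this is exactly where capitalization on chance would enter.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 2.0, 200)                   # hypothetical "estimated" parameters for a 200-item pool
b = rng.normal(0.0, 1.0, 200)
c = rng.uniform(0.1, 0.25, 200)

theta_target = 0.0
info = info_3pl(theta_target, a, b, c)
best = np.argsort(info)[::-1][:30]               # the 30 most informative items at the target ability
print(np.round(info[best][:5], 3))
```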

10.
Research on Test Construction Methods Based on Item Response Theory (cited 3 times; self-citations: 0; citations by others: 3)
After a brief introduction to item response theory, this paper examines, from a measurement-analysis perspective, the general steps for constructing various kinds of tests with item response theory; it discusses methods for building IRT-based item banks and for assembling tests from such banks; and it discusses methods for setting the passing score of criterion-referenced tests.

11.
Modern test theories, represented by generalizability theory and item response theory, arose to overcome the shortcomings of classical test theory. Generalizability theory builds on classical test theory by introducing experimental design and analysis-of-variance techniques to decompose and control the various sources of error in measurement settings; its development has passed through two main stages, univariate and multivariate generalizability theory, and it is currently applied mainly in evaluation, examinations, and the construction of rating scales. Item response theory is a modern test theory developed to overcome the sample dependence of item parameters and other indices in classical test theory; its development has gone through three stages: early theoretical exploration, initial formation of the theory, and its gradual refinement. It is used mainly for score equating, for analyzing item parameters and the quality of tests and items, for separating the influence of rater characteristics on test results, and for detecting differential item functioning and constructing adaptive tests.

12.
13.
This study investigates the comparability of two item response theory based equating methods: true score equating (TSE) and estimated true equating (ETE). Additionally, six scaling methods were implemented within each equating method: mean-sigma, mean-mean, two versions of fixed common item parameter, Stocking and Lord, and Haebara. Empirical test data were examined to investigate the consistency of scores resulting from the two equating methods, as well as the consistency of the scaling methods both within equating methods and across equating methods. Results indicate that although the degree of correlation among the equated scores was quite high, regardless of equating method/scaling method combination, non-trivial differences in equated scores existed in several cases. These differences would likely accumulate across examinees, making group-level differences greater. Systematic differences in the classification of examinees into performance categories were observed across the various conditions: ETE tended to place lower ability examinees into higher performance categories than TSE, while the opposite was observed for high ability examinees. Because the study was based on one set of operational data, the generalizability of the findings is limited and further study is warranted.
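Two of the scaling methods named above, mean-mean and mean-sigma, reduce to simple formulas for the slope A and intercept B of the scale transformation. The sketch below applies them to hypothetical common-item parameter estimates; the characteristic-curve methods (Stocking and Lord, Haebara) require an additional optimization and are not shown.

```python
import numpy as np

# hypothetical common-item parameter estimates from separate calibrations of the new and old forms
a_new = np.array([1.1, 0.8, 1.4, 0.9, 1.2])
b_new = np.array([-0.5, 0.2, 0.9, -1.1, 0.4])
a_old = np.array([1.0, 0.7, 1.3, 0.8, 1.1])
b_old = np.array([-0.3, 0.4, 1.2, -0.9, 0.6])

# mean-mean: slope from the ratio of mean discriminations (theta_old = A * theta_new + B)
A_mm = a_new.mean() / a_old.mean()
B_mm = b_old.mean() - A_mm * b_new.mean()

# mean-sigma: slope from the ratio of difficulty standard deviations
A_ms = b_old.std(ddof=1) / b_new.std(ddof=1)
B_ms = b_old.mean() - A_ms * b_new.mean()

# placing new-form parameters on the old form's scale: b* = A*b + B, a* = a / A
b_star = A_ms * b_new + B_ms
a_star = a_new / A_ms
print(round(A_mm, 3), round(B_mm, 3), round(A_ms, 3), round(B_ms, 3))
```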

14.
In this study, we examine the degree of construct comparability and possible sources of incomparability of the English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure administered in Canada. Several approaches were used to examine construct comparability at the test level (examination of test data structure, reliability comparisons, and test characteristic curves) and the item level (differential item functioning, item parameter correlations, and linguistic comparisons). Results from the test-level analyses indicate that the two language versions of PISA are highly similar, as shown by the similarity of internal consistency coefficients, test data structure (same number of factors and item factor loadings), and test characteristic curves for the two language versions of the tests. However, results of item-level analyses reveal several differences between the two language versions, as shown by the large proportion of items displaying differential item functioning, differences in item parameter correlations (discrimination parameters), and the number of items found to contain linguistic differences.

15.
From the perspective of how views of knowledge are applied in educational science, this paper compares the curriculum goals and curriculum content in the information technology curriculum standards of China, Japan, and the United Kingdom, and reaches the following conclusions: because of a shared understanding of the status and role of information technology courses and of the qualities required of future talent, the views of knowledge underlying information technology education in different countries have been converging; differences in national systems and cultural traditions produce marked differences in the value placed on knowledge in the area of "affect and attitudes"; development-oriented curricula have become the basic trend of curriculum reform worldwide; and China's curriculum has shifted toward greater attention to students' all-round development and toward strategies for cultivating innovative talent able to meet the needs of a future society. These conclusions offer guidance for understanding China's new information technology curriculum standards in depth and for implementing them effectively.

16.
Many standardized tests are now administered via computer rather than paper‐and‐pencil format. The computer‐based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed computerized adaptive testing (CAT). A second advantage is the ability to record not only the test taker's response to each item (i.e., question), but also the amount of time the test taker spends considering and answering each item. Combining these two advantages, various methods were explored for utilizing response time data in selecting appropriate items for an individual test taker. Four strategies for incorporating response time data were evaluated, and the precision of the final test‐taker score was assessed by comparing it to a benchmark value that did not take response time information into account. While differences in measurement precision and testing times were expected, results showed that the strategies did not differ much with respect to measurement precision but that there were differences with regard to the total testing time.
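A minimal sketch of the item-selection step in a CAT follows: choose the unadministered item with the largest information at the provisional ability estimate and, as one illustrative way of bringing response times in, choose by information per unit of expected time. The time-weighted rule is an assumption for illustration and is not necessarily one of the four strategies the study evaluated.

```python
import numpy as np

def info_2pl(theta, a, b, D=1.7):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

rng = np.random.default_rng(3)
a = rng.uniform(0.5, 2.0, 100)                   # hypothetical pool of 100 calibrated items
b = rng.normal(0.0, 1.0, 100)
exp_time = rng.uniform(20.0, 120.0, 100)         # hypothetical expected response time (seconds) per item

theta_hat = 0.3                                  # current provisional ability estimate
administered = [12, 57]                          # items already given in this session
info = info_2pl(theta_hat, a, b)
info[administered] = -np.inf                     # never reselect an administered item

next_by_info = int(np.argmax(info))              # classic maximum-information selection
next_by_rate = int(np.argmax(info / exp_time))   # information per unit of expected time (illustrative)
print(next_by_info, next_by_rate)
```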

17.
Although there is a common understanding of instructional sensitivity, it lacks a common operationalization. Various approaches have been proposed, some focusing on item responses, others on test scores. As approaches often do not produce consistent results, previous research has created the impression that approaches to instructional sensitivity are noticeably fragmented. To counter this impression, we present an item response theory–based framework that can help us to understand similarities and differences between existing approaches. Using empirical data for illustration, this article identifies three perspectives on instructional sensitivity: One perspective views instructional sensitivity as the capacity to detect differences in students' stages of learning across points of time. A second perspective treats instructional sensitivity as the capacity to detect differences between groups that have received different instruction. For a third perspective, the previous two are combined to consider differences between both time points and groups. We discuss linking sensitivity indices to measures of instruction.

18.
Six procedures for combining sets of IRT item parameter estimates obtained from different samples were evaluated using real and simulated response data. In the simulated data analyses, true item and person parameters were used to generate response data for three different-sized samples. Each sample was calibrated separately to obtain three sets of item parameter estimates for each item. The six procedures for combining multiple estimates were each applied, and the results were evaluated by comparing the true and estimated item characteristic curves. For the real data, the two best methods from the simulation data analyses were applied to three different-sized samples and the resulting estimated item characteristic curves were compared to the curves obtained when the three samples were combined and calibrated simultaneously. The results support the use of covariance matrix-weighted averaging and a procedure that involves sample-size-weighted averaging of estimated item characteristic curves at the center of the ability distribution.
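The sample-size-weighted averaging of estimated ICCs mentioned above can be sketched as follows: evaluate each calibration's 3PL curve on a grid of abilities near the center of the scale and pool the curves with weights proportional to sample size. The parameter estimates below are hypothetical, and the subsequent step of recovering item parameters from the pooled curve is omitted.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve evaluated on a theta grid."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-1.0, 1.0, 21)                # grid near the center of the ability distribution
estimates = [                                     # (a, b, c, calibration sample size) for one item, three samples
    (1.10, 0.20, 0.18, 500),
    (0.95, 0.35, 0.22, 1500),
    (1.25, 0.28, 0.20, 3000),
]

weights = np.array([n for *_, n in estimates], dtype=float)
weights /= weights.sum()                          # weights proportional to sample size
curves = np.array([icc_3pl(theta, a, b, c) for a, b, c, _ in estimates])
combined_icc = weights @ curves                   # pooled ICC on the grid
print(np.round(combined_icc[:5], 3))
```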

19.
Contrasts between constructed-response items and multiple-choice counterparts have yielded but a few weak generalizations. Such contrasts typically have been based on the statistical properties of groups of items, an approach that masks differences in properties at the item level and may lead to inaccurate conclusions. In this article, we examine item-level differences between a certain type of constructed-response item (called figural response) and comparable multiple-choice items in the domain of architecture. Our data show that in comparing two item formats, item-level differences in difficulty correspond to differences in cognitive processing requirements and that relations between processing requirements and psychometric properties are systematic. These findings illuminate one aspect of construct validity that is frequently neglected in comparing item types, namely the cognitive demand of test items.

20.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.
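A simplified sketch of the generation side of this technique is given below: true scores are drawn from a four-parameter beta distribution and dichotomous item scores are generated from nonparametric step functions. For brevity the step functions are indexed directly by true-score band rather than estimated from classical difficulties conditional on proportion-correct total score, the compound-binomial approximation is omitted, and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_examinees, n_items, n_bands = 1000, 5, 4

# four-parameter beta true scores: a standard beta rescaled to the interval [lower, upper]
alpha, beta, lower, upper = 2.0, 3.0, 0.15, 0.95
tau = lower + (upper - lower) * rng.beta(alpha, beta, n_examinees)

# hypothetical step functions: P(correct | true-score band) for each item (rows = bands)
step = np.array([
    [0.20, 0.45, 0.70, 0.90],
    [0.10, 0.30, 0.55, 0.80],
    [0.35, 0.55, 0.75, 0.95],
    [0.15, 0.40, 0.65, 0.85],
    [0.25, 0.50, 0.70, 0.92],
]).T                                                         # shape (n_bands, n_items)

band = np.minimum((tau * n_bands).astype(int), n_bands - 1)  # true-score band per examinee
p = step[band]                                               # (n_examinees, n_items) success probabilities
scores = (rng.random((n_examinees, n_items)) < p).astype(int)
print(scores.mean(axis=0))                                   # realized item proportion-correct difficulties
```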
