期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design

Jason L. Meyers G. Edward Miller Walter D. Way 《教育实用测度》2013,26(1):38-60

In operational testing programs using item response theory (IRT), item parameter invariance is threatened when an item appears in a different location on the live test than it did when it was field tested. This study utilizes data from a large state's assessments to model change in Rasch item difficulty (RID) as a function of item position change, test level, test content, and item format. As a follow-up to the real data analysis, a simulation study was performed to assess the effect of item position change on equating. Results from this study indicate that item position change significantly affects change in RID. In addition, although the test construction procedures used in the investigated state seem to somewhat mitigate the impact of item position change, equating results might be impacted in testing programs where other test construction practices or equating methods are utilized. 相似文献

2.

Modeling Item‐Position Effects Within an IRT Framework

Dries Debeer Rianne Janssen 《Journal of Educational Measurement》2013,50(2):164-185

相似文献

3.

Multilevel Modeling of Item Position Effects

Anthony D. Albano 《Journal of Educational Measurement》2013,50(4):408-426

In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A hierarchical generalized linear model is formulated for estimating item‐position effects. The model is demonstrated using data from a pilot administration of the GRE wherein the same items appeared in different positions across the test form. Methods for detecting and assessing position effects are discussed, as are applications of the model in the contexts of test development and item analysis. 相似文献

4.

Embedded Field Test Item Statistics: Can They Be Trusted for Estimating Student Proficiency?

Jeffrey T. Steedle Kristin M. Morrison 《Educational Assessment》2019,24(1):1-12

Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential. 相似文献

5.

Answer Changing on Objective Tests

《The Journal of educational research》2012,105(6):313-315

Abstract

In an attempt to identify some of the causes of answer changing behavior, the effects of four tests and item specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item in a group of items, the position of each item in the test, the discrimination index and the difficulty index of each item. It is shown that answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and items with poor discriminations to be changed more frequently. Some implications of answer changing in the design of tests are discussed. 相似文献

6.

Investigating the Effect of Item Position in Computer‐Based Tests

Feiming Li Allan Cohen Linjun Shen 《Journal of Educational Measurement》2012,49(4):362-379

Computer‐based tests (CBTs) often use random ordering of items in order to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models specifically were used to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible. 相似文献

7.

A Simulation and Comparison of Flexilevel and Bayesian Computerized Adaptive Testing

R. J. De Ayala Barbara G. Dodd William R. Koch 《Journal of Educational Measurement》1990,27(3):227-239

Computerized adaptive testing (CAT) is a testing procedure that adapts an examination to an examinee's ability by administering only items of appropriate difficulty for the examinee. In this study, the authors compared Lord's flexilevel testing procedure (flexilevel CAT) with an item response theory-based CAT using Bayesian estimation of ability (Bayesian CAT). Three flexilevel CATs, which differed in test length (36, 18, and 11 items), and three Bayesian CATs were simulated; the Bayesian CATs differed from one another in the standard error of estimate (SEE) used for terminating the test (0.25, 0.10, and 0.05). Results showed that the flexilevel 36- and 18-item CATs produced ability estimates that may be considered as accurate as those of the Bayesian CAT with SEE = 0.10 and comparable to the Bayesian CAT with SEE = 0.05. The authors discuss the implications for classroom testing and for item response theory-based CAT. 相似文献

8.

Estimating Non-Normal Latent Trait Distributions within Item Response Theory Using True and Estimated Item Parameters

D. A. Sass T. A. Schmitt C. M. Walker 《教育实用测度》2013,26(1):65-88

Item response theory (IRT) procedures have been used extensively to study normal latent trait distributions and have been shown to perform well; however, less is known concerning the performance of IRT with non-normal latent trait distributions. This study investigated the degree of latent trait estimation error under normal and non-normal conditions using four latent trait estimation procedures and also evaluated whether the test composition, in terms of item difficulty level, reduces estimation error. Most importantly, both true and estimated item parameters were examined to disentangle the effects of latent trait estimation error from item parameter estimation error. Results revealed that non-normal latent trait distributions produced a considerably larger degree of latent trait estimation error than normal data. Estimated item parameters tended to have comparable precision to true item parameters, thus suggesting that increased latent trait estimation error results from latent trait estimation rather than item parameter estimation. 相似文献

9.

Estimation Methods for One‐Parameter Testlet Models

Hong Jiao Shudong Wang Wei He 《Journal of Educational Measurement》2013,50(2):186-203

This study demonstrated the equivalence between the Rasch testlet model and the three‐level one‐parameter testlet model and explored the Markov Chain Monte Carlo (MCMC) method for model parameter estimation in WINBUGS. The estimation accuracy from the MCMC method was compared with those from the marginalized maximum likelihood estimation (MMLE) with the expectation‐maximization algorithm in ConQuest and the sixth‐order Laplace approximation estimation in HLM6. The results indicated that the estimation methods had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates. Analyses of three real tests generally supported the findings from the simulation and indicated that the estimates for item difficulty and ability parameters were highly correlated across estimation methods. 相似文献

10.

Estimation of Classification Consistency When the Probability of a Correct Response Varies

Judith A. Spray Catherine J. Welch 《Journal of Educational Measurement》1990,27(1):15-25

The purpose of this study was to examine the effect that large, within-examinee item difficulty variability had on estimates of the proportion of consistent classification of examinees into mastery categories over two test administrations. The classification consistency estimate was based on a single test administration from an estimation procedure suggested by Subkoviak (1976). Analyses of simulated data revealed that the use of a single estimate for an examinee's probability of a correct response, even when that probability varied greatly within a test for an examinee, did not affect the estimation of the proportion of consistent classifications. 相似文献

11.

Attainment of skill in using science processes. I. Instrumentation,methodology and analysis

Carl F. Berger 《科学教学研究杂志》1982,19(3):249-260

Instrumentation, methodology, and analysis techniques were developed to measure changes in adolescent skills in the estimation of linear distance. Microcomputers were used to present and control the amount of information available to assist students in estimation. Three different levels of difficulty in estimating were developed and students estimated ten positions of a point on a vertical line within each level of difficulty. Strategies used by students to estimate were found, and a model was developed using regression analysis to allow for the separation of variance of trend and individual skill differences. This model was used to predict decreases in number of estimates and decreases in average time per estimate. The number of estimates per position decreased rapidly and a limit was soon achieved. Thus average time per estimate was used as a measure of skill. Students improved performance within the first two levels, but not within the most difficult level. Across levels, average performance improved in spite of increasing difficulty, and a transfer effect appeared to occur over time. It was concluded that micro-computers were a valuable instrument for gathering and recording data. The model using regression analysis was an effective tool to study estimation. Students used effective strategies and improved their estimation skill quickly. Learning did occur, but the level of difficulty of information or practice effect available while solving an estimation problem may limit the extent of improvement. 相似文献

12.

A Comparison of Self-Adapted and Computerized Adaptive Tests

Steven L. Wise Barbara S. Plake Phillip L. Johnson Linda L. Roos 《Journal of Educational Measurement》1992,29(4):329-339

According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing. 相似文献

13.

Can Examinees Use Judgments of Item Difficulty to Improve Proficiency Estimates on Computerized Adaptive Vocabulary Tests?

Walter P. Vispoel Sara J. Clough Timothy Bleiler Amy B. Hendrickson Damien Ihrig 《Journal of Educational Measurement》2002,39(4):311-330

Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items. 相似文献

14.

The Stability of IRT b Values

Robert C. Sykes Anne R. Fitzpatrick 《Journal of Educational Measurement》1992,29(3):201-211

This study investigated possible explanations for an observed change in Rasch item parameters (b values) obtained from consecutive administrations of a professional licensure examination. Considered in this investigation were variables related to item position, item type, item content, and elapsed time between administrations of the item. An analysis of covariance methodology was used to assess the relations between these variables and change in item b values, with the elapsed time index serving to control for differences that could be attributed to average or pool changes in b values over time. A series of analysis of covariance models were fitted to the data in an attempt to identify item characteristics that were significantly related to the change in b values after the time elapsed between item administrations had been controlled. The findings indicated that the change in item b values was not related either to item position or to item type. A small, positive relationship between this change and elapsed time indicated that the pool b values were increasing over time. A test of simple effects suggested the presence of greater change for one of the content categories analyzed. These findings are interpreted, and suggestions for future research are provided. 相似文献

15.

Teachers’ and students’ perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items

Gerard van de Watering Janine van der Rijt 《Educational Research Review》2006,1(2):133-147

In today's higher education, high quality assessments play an important role. Little is known, however, about the degree to which assessments are correctly aimed at the students’ levels of competence in relation to the defined learning goals. This article reviews previous research into teachers’ and students’ perceptions of item difficulty. It focuses on the item difficulty of assessments and students’ and teachers’ abilities to estimate item difficulty correctly. The review indicates that teachers tend to overestimate the difficulty of easy items and underestimate the difficulty of difficult items. Students seem to be better estimators of item difficulty. The accuracy of the estimates can be improved by: the information the estimators or teachers have about the target group and their earlier assessment results; defining the target group before the estimation process; the possibility of having discussions about the defined target group of students and their corresponding standards during the estimation process; and by the amount of training in item construction and estimating. In the subsequent study, the ability and accuracy of teachers and students to estimate the difficulty levels of assessment items was examined. In higher education, results show that teachers are able to estimate the difficulty levels correctly for only a small proportion of the assessment items. They overestimate the difficulty level of most of the assessment items. Students, on the other hand, underestimate their own performances. In addition, the relationships between the students’ perceptions of the difficulty levels of the assessment items and their performances on the assessments were investigated. Results provide evidence that the students who performed best on the assessments underestimated their performances the most. Several explanations are discussed and suggestions for additional research are offered. 相似文献

16.

Item and testlet position effects in computer-based alternate assessments for students with disabilities

Okan Bulut Ming Lei Qi Guo 《International Journal of Research & Method in Education》2018,41(2):169-183

Item positions in educational assessments are often randomized across students to prevent cheating. However, if altering item positions results in any significant impact on students’ performance, it may threaten the validity of test scores. Two widely used approaches for detecting position effects – logistic regression and hierarchical generalized linear modelling – are often inconvenient for researchers and practitioners due to some technical and practical limitations. Therefore, this study introduced a structural equation modeling (SEM) approach for examining item and testlet position effects. The SEM approach was demonstrated using data from a computer-based alternate assessment designed for students with cognitive disabilities from three grade bands (3–5, 6–8, and high school). Item and testlet position effects were investigated in the field-test (FT) items that were received by each student at different positions. Results indicated that the difficulty of some FT items in grade bands 3–5 and 6–8 differed depending on the positions of the items on the test. Also, the overall difficulty of the field-test task in grade bands 6–8 increased as students responded to the field-test task in later positions. The SEM approach provides a flexible method for examining different types of position effects. 相似文献

17.

命题者：影响阅读理解测试效度的一个因素

李雪曾用强《考试研究》2012,(4):49-60

本研究应用项目反应理论,从被试的阅读能力值和题目的难度值这两个方面,分析阅读理解测试中多项选择题命题者对考试效度的影响。实验设计中,将两组被试同时施测于一项“阅读水平测试”,根据测试结果估计出的两组被试能力值之间无显著性差异。再次将这两组被试分别施测于两位不同命题者所命制的题目,尽管这些题目均产生于相同的阅读材料,且题目的难度值之间并没有显著性差异,被试的表现却显著不同。Rasch模型认为,被试表现由被试能力和试题难度共同决定。因此,可以推测,这是由于不同命题者所命制的题目影响了被试的表现,并进而影响了使用多项选择题进行阅读理解测试的效度。相似文献

18.

ITEM DIFFICULTY LEVEL AND SEQUENCE EFFECTS IN MULTIPLE-CHOICE ACHIEVEMENT TESTS

SCHUYLER W. HUCK NORMAN D. BOWERS 《Journal of Educational Measurement》1972,9(2):105-111

Certain testing authorities have implied that the proportion of examinees who answer an item correctly may be influenced by the difficulty of the immediately preceding item. If present, such a "sequence effect" would cause p (as an estimate of item difficulty level) to misrepresent an item's "true" level of difficulty. To investigate this hypothesis, a balanced Latin square design was used to rearrange examination items into various test forms. A unique analysis of variance procedure was used to analyze the resulting data. The alleged sequence effect was not found. Certain limitations preclude the generalization of this finding to all students or to all testing situations. However, the evidence provided by this investigation does suggest that comments relating to sequence effects should be qualified as compared with presently appearing statements. 相似文献

19.

An Investigation of Examinee Test-Taking Effort on a Large-Scale Assessment

J. Carl Setzer Steven L. Wise Jill R. van den Heuvel Guangming Ling 《教育实用测度》2013,26(1):34-49

Assessment results collected under low-stakes testing situations are subject to effects of low examinee effort. The use of computer-based testing allows researchers to develop new ways of measuring examinee effort, particularly using response times. At the item level, responses can be classified as exhibiting either rapid-guessing behavior or solution behavior based on the item response time. Most previous research involving the study of response times has been conducted using locally developed instruments. The purpose of the current study was to examine the amount of rapid-guessing behavior within a commercially available, low-stakes instrument. Results indicate that smaller amounts of rapid-guessing behavior exist within the data compared to published results using other instruments. Additionally, rapid-guessing behavior varied by item and was significantly related to item length, item position, and presence of ancillary reading material. The amount of rapid-guessing behavior was consistently very low among various demographic subpopulations. On average, rapid-guessing behavior was observed on only 1% of item responses. Also found was that a small amount of rapid-guessing behavior can impact institutional rankings. 相似文献

20.

Do images influence assessment in anatomy? Exploring the effect of images on item difficulty and item discrimination

Marc A. T. M. Vorstenbosch Tim P. F. M. Klaassen Jan G. M. Kooloos Sanneke M. Bolhuis Roland F. J. M. Laan 《Anatomical sciences education》2013,6(1):29-41

Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross‐sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross‐sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context‐dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items. Anat Sci Educ © 2012 American Association of Anatomists. 相似文献