Similar Literature (20 results found)
1.
Sometimes, test‐takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to consider such testing behaviors. In this study, a new class of mixture IRT models was developed to account for such testing behavior in dichotomous and polytomous items, by assuming test‐takers were composed of multiple latent classes and by adding a decrement parameter to each latent class to describe performance decline. Parameter recovery, effect of model misspecification, and robustness of the linearity assumption in performance decline were evaluated using simulations. It was found that the parameters in the new models were recovered fairly well by using the freeware WinBUGS; the failure to account for such behavior by fitting standard IRT models resulted in overestimation of difficulty parameters on items located toward the end of the test and overestimation of test reliability; and the linearity assumption in performance decline was rather robust. An empirical example is provided to illustrate the applications and the implications of the new class of models.
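As a rough sketch of how such a class-specific decrement parameter might enter a dichotomous model (the exact parameterization is not spelled out in the abstract; the linear-in-position form below is an assumption consistent with the linearity assumption the authors test):

\[
P(X_{ij}=1 \mid \theta_i, g) \;=\; \frac{\exp\{\theta_i - b_j - \delta_g\,(j-1)\}}{1 + \exp\{\theta_i - b_j - \delta_g\,(j-1)\}}, \qquad \delta_g \ge 0,
\]

where \(\theta_i\) is the ability of test-taker \(i\), \(b_j\) the difficulty of item \(j\) in administration order, \(g\) the test-taker's latent class, and \(\delta_g\) the class-specific per-item performance decrement; \(\delta_g = 0\) recovers a standard IRT model for that class.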

2.
In this research, the author addresses whether the application of unidimensional item response models provides valid interpretation of test results when administering items sensitive to multiple latent dimensions. Overall, the present study found that unidimensional models are quite robust to the violation of the unidimensionality assumption due to secondary dimensions from sensitive items. When secondary dimensions are highly correlated with the main construct, unidimensional models generally fit and the accuracy of ability estimation is comparable to that of strictly unidimensional tests. In addition, longer tests are more robust to the violation of the essential unidimensionality assumption than shorter ones. The author also shows that unidimensional item response theory models estimate the item difficulty parameter better than the item discrimination parameter in tests with secondary dimensions.

3.
Abstract

In an attempt to identify some of the causes of answer-changing behavior, the effects of four test- and item-specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item within a group of items, the position of each item in the test, and the discrimination and difficulty indices of each item. It is shown that answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and items with poor discrimination to be changed more frequently. Some implications of answer changing for the design of tests are discussed.

4.
Abstract

To combat problems of cheating arising from testing under crowded classroom conditions, instructors frequently use multiple arrangements of a set of test items. These different arrangements or forms should be nearly equivalent with respect to mean total scores. This study reports data from comparisons involving eleven pairs of equivalent tests. There were no significant linear relationships between equivalent test forms in the ordering of item difficulties. Reliabilities differed little within pairs of equivalent tests. Nine of eleven t-tests comparing mean total test scores were not significant. The bulk of these data supported the assumption that one may construct equivalent power tests by rearranging items, provided the ordering of item difficulty is non-systematic in both arrangements.

5.
This paper investigates whether different test methods - multiple choice and information transfer - produce a test-method effect in reading comprehension examinations. In addition to analyzing students' test scores, the study also analyzed item difficulty values, which were estimated using item response theory (IRT). The results show that the test method does affect item difficulty and examinee performance; in terms of item difficulty, information-transfer items were harder than multiple-choice items.

6.
This study describes the development and validation of the Homan-Hewitt Readability Formula, which estimates the readability level of single-sentence test items. Its development rests on the assumption that differences in readability level will affect item difficulty. The formula was validated by (a) estimating the readability levels of sets of test items predicted to be written at 2nd- through 8th-grade levels; (b) administering the tests to 782 students in grades 2 through 5; and (c) using the class means as the unit of analysis and subjecting the data to a two-factor repeated-measures ANOVA. Significant differences were found in class mean performance scores across the levels of readability. These results indicate that a relationship exists between students' reading grade levels and their responses to test items written at higher readability levels.

7.
One assumption common to all models for determining the optimal number of options per item (e.g., Lord, 1977) is that total testing time is proportional to the number of items and the number of options per item. Therefore, under this assumption, given a fixed testing time, the test can be shortened or lengthened by deleting or adding a proportional number of options. The present study examines the validity of this assumption in three tests which were administered with 2, 3, 4, and 5 options per item. The number of items attempted in the first 10 and 15 minutes of the testing session and the time needed to complete the tests were recorded. Thus, the rate of performance for both fixed time and fixed test length was analyzed. A strong and consistently negative relationship between rate of performance and the number of options was detected in all tests. The empirical results therefore did not support the assumption of proportionality. Furthermore, the data indicated that the method by which options are deleted can play a role in this context. A more realistic assumption of generalized proportionality, proposed by Grier (1976), was supported by the results from a Mathematical Reasoning test, but was only partially supported for a Vocabulary and a Reading Comprehension test.
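As a worked illustration of the assumption under examination (the notation below is ours, not the authors'): strict proportionality says that total testing time is

\[ T = c\,n\,k, \]

with \(n\) items, \(k\) options per item, and a constant time \(c\) per option, so that for fixed \(T\) the number of items can be raised in exact proportion to the number of options removed. A generalized proportionality assumption of the kind attributed to Grier (1976) can be written, for example, as

\[ T = n\,(a + b\,k), \]

where \(a\) is a fixed cost per item (e.g., reading the stem) and \(b\) the cost per option; Grier's exact formulation may differ, so this is only an illustrative reading.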

8.
Exploring methods for pre-estimating the difficulty of physics items on the academic proficiency examination
The Shanghai senior high school academic proficiency examination currently has no pretesting system, so item difficulty is estimated mainly from the item writers' experience, without any quantitative method. Drawing on domestic and international research, this study starts from three facets of an item - the physics concepts involved, the item design, and the mathematical computation required - and, combined with an analysis of the observed difficulty data from the 2011 Shanghai physics academic proficiency examination, constructs a quantitative method for pre-estimating item difficulty. Its accuracy is then checked against the observed difficulty data from the 2012 Shanghai physics academic proficiency examination. The study is intended to provide a research basis for future pre-estimation of physics item difficulty.
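A minimal sketch of how such a quantitative pre-estimation scheme could be set up, assuming each item is rated on the three facets named above and observed difficulty is regressed on the ratings; the facet codings, toy numbers, and linear form are illustrative assumptions, not the authors' actual procedure.

```python
# Hypothetical sketch: calibrate a pre-estimation model on items with observed
# 2011 difficulty data, then pre-estimate the difficulty of new (2012) items.
import numpy as np

# Expert ratings per item on the three facets: physics-concept load,
# item-design complexity, mathematical-computation load (toy data).
X_2011 = np.array([[1, 1, 0],
                   [2, 1, 1],
                   [2, 2, 1],
                   [3, 2, 2],
                   [3, 3, 2]], dtype=float)
# Observed difficulty (proportion correct) of the same items in 2011 (toy data).
y_2011 = np.array([0.93, 0.85, 0.71, 0.62, 0.48])

# Ordinary least squares with an intercept term.
A = np.column_stack([np.ones(len(X_2011)), X_2011])
coef, *_ = np.linalg.lstsq(A, y_2011, rcond=None)

def predict_difficulty(ratings):
    """Pre-estimate an item's proportion correct from its three facet ratings."""
    return float(coef[0] + coef[1:] @ np.asarray(ratings, dtype=float))

# Checking accuracy against the 2012 observed difficulties would follow the same pattern.
print(round(predict_difficulty([2, 2, 2]), 2))
```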

9.
The Test of Practical Chinese (C.TEST) assesses the ability of non-native speakers of Chinese to use the language in social life and everyday work in international settings. Because C.TEST items are made public and the item bank is small, it is difficult to obtain item difficulty parameters through the field testing on a sample of target examinees that standardized tests usually rely on. Artificial neural networks, a product of modern artificial intelligence research, have proved very useful for prediction. This study took the reading comprehension items of C.TEST (levels A-D) as material and used an artificial neural network to predict their difficulty; the network-predicted difficulty values correlated significantly with the difficulty values observed in actual administrations. The result suggests that using artificial neural network models to predict parameters such as item difficulty in language tests is feasible.
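A minimal sketch of the general approach, assuming a small feed-forward network is regressed from item features onto observed difficulty; the feature set, network size, and data below are illustrative assumptions (the abstract does not describe the actual architecture or features).

```python
# Illustrative only: predict reading-item difficulty from item features with a
# small neural network and check the correlation with observed difficulty.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Hypothetical item features, e.g., passage length, vocabulary level,
# sentence complexity, option-passage overlap (toy data).
X = rng.random((120, 4))
# Observed difficulty (proportion correct) from actual administrations (toy data).
y = 0.9 - 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.05, size=120)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
net.fit(X_train, y_train)

# The criterion reported in the abstract: correlation between network-predicted
# and observed difficulty values.
r, p = pearsonr(net.predict(X_test), y_test)
print(f"r = {r:.2f}, p = {p:.4f}")
```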

10.
Difficulty is not an inherent property of a test item but the result of the interaction between examinee factors and item characteristics. Many item analysts tend to attribute high item difficulty solely to students' failure to master the relevant knowledge or skills, ignoring the characteristics of the items themselves. This study analyzed 60 items from the college entrance English examination with difficulty values below 0.6 to explore the sources of their difficulty. The results show that, beyond examinee factors, the difficulty of hard or unduly hard items is also related to item-writing technique, for example problems with the uniqueness and acceptability of the keyed answer, content beyond the syllabus, and poorly set assessment points or scoring criteria. The study therefore suggests that testing agencies improve item-writing quality and strengthen item quality control to ensure that large-scale examinations select candidates scientifically.

11.
《教育实用测度》2013,26(1):89-97
Research on the use of multiple-choice tests has presented conflicting evidence about the use of statistical item difficulty as a means of ordering items. An alternate method advocated by many texts is the use of cognitive difficulty. This study examined the effect of using both statistical and cognitive item difficulty in determining item order. Results indicated that those students who received items in an increasing cognitive order, no matter what the order of statistical difficulty, scored higher on hard items. Those students who received the forms with opposing cognitive and statistical difficulty orders scored the highest on medium-level items. The study concludes with a call for more research on the effects of cognitive difficulty and suggests that future studies examine subscores as well as total test results.

12.
The effects of training tests on subsequent achievement were studied using two test-item characteristics: item difficulty and item complexity. Ninety subjects were randomly assigned to treatment conditions having easy or difficult items and calling for rote or complex skills. Each subject was administered two training tests during the quarter containing only items defined by his treatment condition. The dependent measure was a sixty-item final examination with fifteen items reflecting each of the four treatment-condition item types. The results showed greater achievement for those trained with difficult items and with rote items. In addition, two interactions of treatment conditions with type of test item were found. The results are discussed as supporting a hierarchical model rather than a “similarity” transfer model of learning.

13.
This study applies item response theory to analyze, in terms of examinees' reading ability estimates and item difficulty estimates, how the writers of multiple-choice items affect the validity of reading comprehension tests. In the experimental design, two groups of examinees first took the same reading proficiency test; the ability estimates of the two groups derived from this test did not differ significantly. The two groups were then administered items written by two different item writers. Although these items were based on the same reading passages and their difficulty estimates did not differ significantly, the two groups performed significantly differently. Under the Rasch model, examinee performance is jointly determined by examinee ability and item difficulty. It can therefore be inferred that the items written by different item writers affected examinee performance and, in turn, the validity of multiple-choice reading comprehension tests.
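For reference, the Rasch model invoked here specifies that the probability of a correct response depends only on the difference between examinee ability and item difficulty,

\[
P(X_{ni}=1 \mid \theta_n, b_i) \;=\; \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\]

so two groups of equal ability responding to items of statistically equal difficulty should, in expectation, perform equally well; the observed performance gap is therefore attributed to writer-related item properties not captured by \(b_i\).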

14.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content-representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
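For reference, under the one-parameter model the mean b-value method places the new form on the old scale by shifting every new-form difficulty by the difference between the anchor items' mean difficulties in the two calibrations,

\[
b_j^{\text{equated}} \;=\; b_j^{\text{new}} + \left(\bar{b}^{\,\text{old}}_{\text{anchor}} - \bar{b}^{\,\text{new}}_{\text{anchor}}\right),
\]

so the equating constant rests entirely on the anchor items; sampling fewer anchors, or anchors whose relative difficulty drifts between years, feeds directly into the variability in proficiency rates reported above.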

15.
Krarup, Niels; Naeraa, Noe; Olsen, Christian 《Higher Education》1974,3(2):157-164
In the few available studies on the use of books in examinations, open-book tests have been found to reduce pre-test memorization and anxiety during examinations without affecting academic performance. However, these studies were conducted with students in systems that did not normally allow books, whereas systems which allow books in all exams might be thought likely to create a non-fact-learning attitude in students. The present study was undertaken in a book-allowing system with 120 students during a regular course in physiology at a medical school. Each group sat two parallel 60-item multiple-choice tests and used books in one test but not in the other. The tests took place about four weeks prior to the final examination, which is of the same type as the experimental tests. Recall items could yield less than 15% of the maximum points, so that interpretation and problem-solving items predominated. Total test points with and without books did not differ significantly. An analysis of variance showed that the effect of books on recall items was only slight and that the two tests varied in difficulty, in spite of efforts to secure equality.

16.
Two experiments were conducted to determine if a relationship exists between test item arrangements and student performance on power tests. The primary hypotheses were: item arrangements based upon item difficulty, similarity of content, or order of class presentation do not influence test score or required testing time. In the first experiment 122 subjects were randomly assigned to three item difficulty arrangements of 139 test items with a 0–100% difficulty range, and in the second experiment 156 subjects were randomly assigned to three item content arrangements of 103 items. Results of analyses of variance with test anxiety used as a classification factor supported the hypotheses.

17.
Practical use of the matrix sampling (i.e. item sampling) technique requires the assumption that an examinee's response to an item is independent of the context in which the item occurs. This assumption was tested experimentally by comparing the responses of examinees to a population of items with the responses of examinees to item samples. Matrix sampling mean and variance estimates for verbal, quantitative, and attitude tests were used as dependent variables to test for differences between the “context” and “out-of-context” groups. The estimates obtained from both treatment groups were also compared with actual population values. No significant differences were found between treatments on matrix sample parameter estimates for any of the three types of tests.

18.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both the GRE and the SAT. The most important finding is that, for hard items, Black examinees perform differentially better than matched-ability White examinees for each of the four item types and for both the GRE and SAT tests. The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

19.
张军 《考试研究》2014,(1):56-61
The monotone homogeneity model (MHM) is the most widely used model in nonparametric item response theory; it rests on three basic assumptions and is suitable for analyzing small-scale tests. This study used the MHM to analyze a test administered at the College of Advanced Chinese Training, Beijing Language and Culture University. The results show that the test satisfies the weak unidimensionality and weak local independence assumptions; 9 of the 67 items had scalability coefficients below 0.3 and need to be revised or removed, after which the test forms a Mokken scale of medium strength. In addition, 2 items violated the monotonicity assumption and do not meet the requirements of a Mokken scale.
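The scalability coefficients referred to here are Mokken's item coefficients \(H_i\),

\[
H_i \;=\; \frac{\sum_{j \ne i} \operatorname{Cov}(X_i, X_j)}{\sum_{j \ne i} \operatorname{Cov}^{\max}(X_i, X_j)},
\]

where \(\operatorname{Cov}^{\max}\) is the largest covariance attainable given the two items' marginal distributions. Conventional benchmarks treat \(H_i < 0.3\) as unscalable and an overall scale coefficient \(0.4 \le H < 0.5\) as a medium-strength Mokken scale, which matches the decision rules applied in this study.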

20.
Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees had to guess on the final six questions of the analytical section of the Graduate Record Examination in order to finish before time expired. At the higher ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship between time and difficulty and the multidimensionality of the test.
