Similar Articles
1.
The current study investigated how item formats and their inherent affordances influence test‐takers’ cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants’ affirmative response tendency (i.e., judge the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments.

2.
Difficulty is not an inherent property of a test item but the result of the interaction between examinee factors and item characteristics. Many item analysts tend to attribute excessive item difficulty solely to students' failure to master the relevant knowledge or skills, while overlooking the features of the items themselves. This study analyzed 60 items from the National College Entrance Examination (Gaokao) English test with difficulty values below 0.6 to explore the sources of their difficulty. The results show that, in addition to examinee factors, the difficulty of hard or overly hard items is also related to item-writing technique, including problems such as the uniqueness and acceptability of the keyed answer, content beyond the syllabus, and poorly designed assessment points and scoring criteria. Accordingly, it is proposed that testing agencies should improve item-writing quality and strengthen item quality monitoring to ensure that large-scale examinations select talent scientifically.
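A note on the 0.6 threshold above: the difficulty value presumably refers to the classical facility index, i.e., the proportion of examinees answering the item correctly, under which lower values indicate harder items:

p_i = \frac{\#\{\text{examinees answering item } i \text{ correctly}\}}{\#\{\text{examinees}\}}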

3.
This study applied item response theory to analyze, in terms of examinees' reading ability estimates and item difficulty estimates, how the writers of multiple-choice items affect the validity of reading comprehension tests. In the experimental design, two groups of examinees first took the same reading proficiency test; the ability estimates of the two groups obtained from the results showed no significant difference. The two groups were then administered items written by two different item writers. Although these items were based on the same reading passages and did not differ significantly in difficulty, the two groups performed significantly differently. According to the Rasch model, examinee performance is jointly determined by examinee ability and item difficulty. It can therefore be inferred that the items written by different item writers influenced examinee performance, and in turn the validity of multiple-choice reading comprehension testing.
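For reference, the Rasch model invoked in the argument above writes the probability of a correct response as a function of examinee ability and item difficulty alone (the standard formulation, not a result of this study):

P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

With ability and difficulty matched across the two item writers' item sets, systematic performance differences must therefore stem from item characteristics outside the model, such as writer effects.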

4.
According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.
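The Simpson's-paradox argument can be made concrete with a small, entirely hypothetical tabulation (invented counts, not the study's data): pooled over all examinees, changed answers can show a much higher success rate than retained answers, even though within every ability stratum retaining the initial answer is the better bet. A minimal Python sketch:

# Hypothetical counts, invented for illustration only (not the study's data).
strata = {
    "low ability":  dict(keep_correct=40, keep_total=200, change_correct=3,   change_total=20),
    "high ability": dict(keep_correct=45, keep_total=50,  change_correct=170, change_total=200),
}

# Within each ability stratum, kept answers have the higher success rate ...
for name, s in strata.items():
    print(name,
          "keep:",   round(s["keep_correct"] / s["keep_total"], 2),
          "change:", round(s["change_correct"] / s["change_total"], 2))

# ... yet pooled over strata the comparison reverses (Simpson's paradox),
# because high-ability examinees contribute most of the changed answers.
keep_c = sum(s["keep_correct"] for s in strata.values())
keep_n = sum(s["keep_total"] for s in strata.values())
chg_c  = sum(s["change_correct"] for s in strata.values())
chg_n  = sum(s["change_total"] for s in strata.values())
print("pooled keep:", round(keep_c / keep_n, 2), "pooled change:", round(chg_c / chg_n, 2))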

5.
Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers’ partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science.
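For context, the partial credit model referred to above gives the probability of a response in category x of item i, with ordered categories 0, ..., m_i and step parameters \delta_{i1}, ..., \delta_{im_i}; this is the standard PCM form, the article's contribution being its derivation for elimination-scored responses:

P(X_i = x \mid \theta) = \frac{\exp \sum_{j=0}^{x} (\theta - \delta_{ij})}{\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\theta - \delta_{ij})}, \qquad \delta_{i0} \equiv 0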

6.
Performance assessments, scenario‐based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low‐ability examinees to have their scores overestimated and high‐ability examinees' scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether the location of the cut‐point matters for correct classification remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination system’s cut‐points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers.
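In this setting, sensitivity and specificity have their usual meaning relative to a cut-point \theta_c on the ability scale (a generic definition added here only for clarity, not a formula from the article):

\text{sensitivity} = P(\text{classified above the cut} \mid \theta \ge \theta_c), \qquad \text{specificity} = P(\text{classified below the cut} \mid \theta < \theta_c)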

7.
Understanding how situational features of assessment tasks impact reasoning is important for many educational pursuits, notably the selection of curricular examples to illustrate phenomena, the design of formative and summative assessment items, and determination of whether instruction has fostered the development of abstract schemas divorced from particular instances. The goal of our study was to employ an experimental research design to quantify the degree to which situational features impact inferences about participants’ understanding of Mendelian genetics. Two participant samples from different educational levels and cultural backgrounds (high school, n = 480; university, n = 444; Germany and USA) were used to test for context effects. A multi-matrix test design was employed, and item packets differing in situational features (e.g., plant, animal, human, fictitious) were randomly distributed to participants in the two samples. Rasch analyses of participant scores from both samples produced good item fit, person reliability, and item reliability and indicated that the university sample displayed stronger performance on the items compared to the high school sample. We found, surprisingly, that in both samples, no significant differences in performance occurred among the animal, plant, and human item contexts, or between the fictitious and “real” item contexts. In the university sample, we were also able to test for differences in performance between genders, among ethnic groups, and by prior biology coursework. None of these factors had a meaningful impact upon performance or context effects. Thus some, but not all, types of genetics problem solving or item formats are impacted by situational features.

8.
We investigated students' metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested either with a CAT or a FIT version of a matrices test. Data revealed that metacognitive experiences were not related to the resulting test scores for CAT: test takers who took the matrices test in an adaptive mode were paradoxically more satisfied with their performance the worse they had performed in terms of the resulting ability parameter. They also rated the test as easier the lower they had performed, but their estimates of effort were higher the better they had performed. For test takers who took the FIT version, completely different results were revealed. In line with previous results, test takers were assumed to base these experiences on the subjectively estimated percentage of items solved. This moderated mediation hypothesis was partly confirmed, as the relation between the percentage of items solved and FD, FS, and EE was revealed to be mediated by the estimated percentage of items solved. Results are discussed with reference to feedback acceptance, errant self-estimations, and test fairness with regard to a possible false regulation of effort in lower ability groups when using CAT.

9.
An important trend in educational measurement is the use of principles of cognitive psychology to design achievement and ability test items. Many studies show that manipulating the stimulus features of items influences the processes, strategies, and knowledge structures that are involved in solution. However, little is known about how cognitive design influences individual differences. That is, does applying cognitive design principles change the background skills and abilities that are associated with successful performance? This study compared the correlates of two spatial ability tests that used the same item type but different test design principles (cognitive design versus psychometric design). The results indicated differences in factorial complexity in the two tests; specifically, the impact of verbal abilities was substantially reduced by applying the cognitive design principles.

10.
To ensure the validity of statistical results, model-data fit must be evaluated for each item. In practice, certain actions or treatments are needed for misfit items. If all misfit items are treated, much item information would be lost during calibration. On the other hand, if only severely misfit items are treated, the inclusion of misfit items may invalidate the statistical inferences based on the estimated item response models. Hence, given response data, one has to find a balance between treating too few and too many misfit items. In this article, misfit items are classified into three categories based on the extent of misfit. Accordingly, three different item treatment strategies are proposed for determining which categories of misfit items should be treated. The impact of using different strategies is investigated. The results show that the test information functions obtained under different strategies can be substantially different in some ability ranges.
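For reference, the test information function mentioned above is the sum of item information functions over the items retained in calibration; under the two-parameter logistic model, for example (a generic formula, not specific to this article's strategies):

I(\theta) = \sum_i a_i^2 \, P_i(\theta)\bigl(1 - P_i(\theta)\bigr), \qquad P_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]}

Treating or dropping misfit items changes which terms enter this sum, which is why the different strategies can yield substantially different information in some ability ranges.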

11.
Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement. Scale anchoring, a technique which describes what students at different points on a score scale know and can do, is a tool to provide such information. Scale anchoring for a test involves a substantial amount of work, both by the statistical analysts and test developers involved with the test. In addition, scale anchoring involves considerable use of subjective judgment, so its conclusions may be questionable. We describe statistical procedures that can be used to determine if scale anchoring is likely to be successful for a test. If these procedures indicate that scale anchoring is unlikely to be successful, then there is little reason to perform a detailed scale anchoring study. The procedures are applied to several data sets from a teachers’ licensing test.

12.
This paper uses a comparative experiment to analyze the effect of embedded scoring rubrics on test takers' writing behavior, with independent-samples t-tests conducted in SPSS 13.0. The results show that embedded scoring rubrics strengthen test takers' understanding of the item writer's intent and help them produce essays that meet the writing requirements, but only for students whose language proficiency falls within a certain range. This finding enriches Bachman and Palmer's schema of the factors and pathways that influence test performance, makes communication between item writers and test takers more concrete and direct, and makes writing examinations more humane.

13.
Sometimes, test‐takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to consider such testing behaviors. In this study, a new class of mixture IRT models was developed to account for such testing behavior in dichotomous and polytomous items, by assuming test‐takers were composed of multiple latent classes and by adding a decrement parameter to each latent class to describe performance decline. Parameter recovery, effect of model misspecification, and robustness of the linearity assumption in performance decline were evaluated using simulations. It was found that the parameters in the new models were recovered fairly well by using the freeware WinBUGS; the failure to account for such behavior by fitting standard IRT models resulted in overestimation of difficulty parameters on items located toward the end of the test and overestimation of test reliability; and the linearity assumption in performance decline was rather robust. An empirical example is provided to illustrate the applications and the implications of the new class of models.
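One plausible way to write such a class-specific decrement for a dichotomous item — a sketch of the general idea, not necessarily the authors' exact specification — is to let the log-odds of success decline linearly with item position:

\operatorname{logit} P(X_{pi} = 1 \mid \theta_p, g) = \theta_p - b_i - \delta_g(\mathrm{pos}_i - 1), \qquad \delta_g \ge 0,

where \mathrm{pos}_i is the item's position in the test and \delta_g is the decrement parameter of latent class g (with \delta_g = 0 for a class whose performance does not decline).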

14.
Based on a previously validated cognitive processing model of reading comprehension, this study experimentally examines potential generative components of text-based multiple-choice reading comprehension test questions. Previous research (Embretson & Wetzel, 1987; Gorin & Embretson, 2005; Sheehan & Ginther, 2001) shows text encoding and decision processes account for significant proportions of variance in item difficulties. In the current study, Linear Logistic Latent Trait Model (LLTM; Fischer, 1973) parameter estimates of experimentally manipulated items are examined to further verify the impact of encoding and decision processes on item difficulty. Results show that manipulation of some passage features, such as increased use of negative wording, significantly increased item difficulty in some cases, whereas others, such as altering the order of information presentation in a passage, did not significantly affect item difficulty but did affect reaction time. These results suggest that reliable changes in difficulty and response time through algorithmic manipulation of certain task features are feasible. However, non-significant results for several manipulations highlight potential challenges to item generation in establishing direct links between theoretically relevant item features and individual item processing. Further examination of these relationships will be informative to item writers as well as test developers interested in the feasibility of item generation as an assessment tool.
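The LLTM cited above constrains Rasch item difficulties to be a weighted sum of scored item features (e.g., encoding and decision demands), which is what allows experimentally manipulated features to be tied to difficulty (standard LLTM formulation):

P(X_{pi} = 1 \mid \theta_p) = \frac{\exp\bigl(\theta_p - \sum_k q_{ik}\eta_k\bigr)}{1 + \exp\bigl(\theta_p - \sum_k q_{ik}\eta_k\bigr)}, \qquad b_i = \sum_k q_{ik}\eta_k,

where q_{ik} records how much of feature k item i contains and \eta_k is the difficulty weight of that feature.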

15.
Remote proctoring, or monitoring test takers through internet-based, video-recording software, has become critical for maintaining test security on high-stakes assessments. The main role of remote proctors is to make judgments about test takers' behaviors and decide whether these behaviors constitute rule violations. Variability in proctor decision making, or the degree to which humans/proctors make different decisions about the same test-taking behaviors, can be problematic for both test takers and test users (e.g., universities). In this paper, we measure variability in proctor decision making over time on a high-stakes English language proficiency test. Our results show that (1) proctors systematically differ in their decision making and (2) these differences are trait-like (i.e., ranging from lenient to strict), but (3) systematic variability in decisions can be reduced. Based on these findings, we recommend that test security providers conduct regular measurements of proctors’ judgments and take actions to reduce variability in proctor decision making.

16.
A multiple-choice test item is identified as flawed if it has no single best answer. In spite of extensive quality control procedures, the administration of flawed items to test takers is inevitable. A limited set of common strategies for dealing with flawed items in conventional testing, grounded in the principle of fairness to examinees, is reexamined in the context of adaptive testing. An additional strategy, available for adaptive testing, of retesting from a pool cleansed of flawed items, is compared to the existing strategies. Retesting was found to be no practical improvement over current strategies.

17.
The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60 years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment.

18.
Michael Scriven has suggested that student rating forms, for the purpose of evaluating college teaching, be designed for multiple audiences (instructor, administrator, student), and with a single global item for summative functions (determination of merit, retention, or promotion). This study reviewed approaches to rating form construction, e.g., factor analytic strategies of Marsh, and recommended the multiple audience design of Scriven. An empirical test of the representativeness of the single global item was reported from an analysis of 1,378 forms collected in a university department of education. The global item correlated most satisfactorily with other items, a computed total of items, items that represented underlying factors, and various triplets of items selected to represent all possible combinations of items. It was concluded that a multiple audience rating form showed distinct advantages in design and that the single global item most fairly and highly represented the overall teaching performance, as judged by students, for decisions about retention, promotion, and merit made by administrators.

19.
Many researchers have suggested that the main cause of item bias is the misspecification of the latent ability space, where items that measure multiple abilities are scored as though they are measuring a single ability. If two different groups of examinees have different underlying multidimensional ability distributions and the test items are capable of discriminating among levels of abilities on these multiple dimensions, then any unidimensional scoring scheme has the potential to produce item bias. It is the purpose of this article to provide the testing practitioner with insight about the difference between item bias and item impact and how they relate to item validity. These concepts will be explained from a multidimensional item response theory (MIRT) perspective. Two detection procedures, the Mantel-Haenszel (as modified by Holland and Thayer, 1988) and Shealy and Stout's Simultaneous Item Bias (SIB; 1991) strategies, will be used to illustrate how practitioners can detect item bias.
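As background, the Mantel-Haenszel procedure cited above compares the odds of a correct response for the reference and focal groups within strata matched on total score. A minimal Python sketch of the common odds-ratio computation (the counts and stratum boundaries are invented for illustration):

import math

# Per-stratum 2x2 counts for one studied item:
# A, B = reference group correct/incorrect; C, D = focal group correct/incorrect.
strata = [
    dict(A=30, B=10, C=25, D=15),   # matched total score 0-10
    dict(A=50, B=20, C=45, D=30),   # matched total score 11-20
    dict(A=70, B=15, C=60, D=25),   # matched total score 21-30
]

num = sum(s["A"] * s["D"] / (s["A"] + s["B"] + s["C"] + s["D"]) for s in strata)
den = sum(s["B"] * s["C"] / (s["A"] + s["B"] + s["C"] + s["D"]) for s in strata)
alpha_mh = num / den                    # MH common odds ratio; 1.0 indicates no DIF
mh_d_dif = -2.35 * math.log(alpha_mh)   # ETS delta-scale DIF effect size
print(round(alpha_mh, 3), round(mh_d_dif, 3))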

20.
A Theoretical Model of Pre-estimated Difficulty and Its Application
Difficulty determines the distribution of test scores and directly affects a test's evaluation and selection functions, so it receives considerable attention. Difficulty is usually obtained from statistical analysis of test data after administration, by which time the passing score has already been set and can no longer be adjusted. Pre-estimated difficulty is the item difficulty determined at the item-writing stage by item-writing experts who, drawing on item content, construct a standard norm and make a reasoned evaluation. Under the current practice of reporting test results as raw scores, pre-estimated difficulty is especially important for ensuring the stability and fairness of examinations. Unlike empirically observed difficulty, pre-estimated difficulty can be controlled and adjusted. Item writing is therefore not only a process of constructing items but also a process of designing pre-estimated difficulty.
