Similar Documents
20 similar documents retrieved (search time: 312 ms).
1.
Speededness refers to the extent to which time limits affect examinees' test performance, and it is often measured by calculating the proportion of examinees who do not reach a certain percentage of test items. However, when tests are number-right scored (i.e., no points are subtracted for incorrect responses), examinees are likely to rapidly guess on items rather than leave them blank. Therefore, this traditional measure of speededness probably underestimates the true amount of speededness on such tests. A more accurate assessment of speededness should also reflect the tendency of examinees to rapidly guess on items as time expires. This rapid-guessing component of speededness can be estimated by modeling response times with a two-state mixture model, as demonstrated with data from a computer-administered reasoning test. Taking into account the combined effect of unreached items and rapid guessing provides a more complete measure of speededness than has previously been available.
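A minimal sketch of the two-state response-time approach, assuming simulated lognormal response times with made-up parameters and fitting a Gaussian mixture to log times (equivalent to a lognormal mixture on the raw times); this illustrates the general technique, not the authors' exact estimation procedure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical response times (seconds) for one end-of-test item:
# most examinees respond deliberately, a minority rapidly guesses.
rng = np.random.default_rng(0)
deliberate = rng.lognormal(mean=3.5, sigma=0.4, size=800)
rapid = rng.lognormal(mean=0.8, sigma=0.3, size=200)
rt = np.concatenate([deliberate, rapid])

# Two-state mixture fit on log response times.
gm = GaussianMixture(n_components=2, random_state=0).fit(np.log(rt).reshape(-1, 1))

# The component with the smaller mean log time is the rapid-guessing state;
# its mixing weight estimates the item's rapid-guessing proportion.
rapid_state = int(np.argmin(gm.means_.ravel()))
print(f"estimated rapid-guessing proportion: {gm.weights_[rapid_state]:.2f}")
```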

2.
A potential undesirable effect of multistage testing is differential speededness, which happens if some of the test takers run out of time because they receive subtests with items that are more time intensive than others. This article shows how a probabilistic response-time model can be used for estimating differences in time intensities and speed between subtests and test takers and detecting differential speededness. An empirical data set for a multistage test in the computerized CPA Exam was used to demonstrate the procedures. Although the more difficult subtests appeared to have items that were more time intensive than the easier subtests, an analysis of the residual response times did not reveal any significant differential speededness because the time limit appeared to be appropriate. In a separate analysis within each of the subtests, we found minor but consistent patterns of residual times that are believed to be due to a warm-up effect, that is, test takers spending more time on the initial items than they actually need.
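The residual-time analysis can be sketched roughly as follows, assuming a simple lognormal response-time model with item time-intensity and examinee speed parameters and crude moment estimates on simulated data (the article used a full probabilistic calibration):

```python
import numpy as np

# Hypothetical log response times (examinees x items) for one subtest.
rng = np.random.default_rng(1)
log_t = rng.normal(loc=3.5, scale=0.5, size=(500, 25))

# Moment estimates under ln(t_ij) ~ beta_j - tau_i + noise:
beta = log_t.mean(axis=0)                       # item time intensities
tau = beta.mean() - log_t.mean(axis=1)          # examinee speed (higher = faster)
resid = log_t - (beta[None, :] - tau[:, None])  # residual log response times

# A warm-up effect would show up as positive mean residuals
# (more time used than predicted) on the first few items.
print(resid.mean(axis=0)[:5])
```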

3.
When tests are administered under fixed time constraints, test performances can be affected by speededness. Among other consequences, speededness can result in inaccurate parameter estimates in item response theory (IRT) models, especially for items located near the end of tests (Oshima, 1994). This article presents an IRT strategy for reducing contamination in item difficulty estimates due to speededness. Ordinal constraints are applied to a mixture Rasch model (Rost, 1990) so as to distinguish two latent classes of examinees: (a) a "speeded" class, composed of examinees who had insufficient time to adequately answer end-of-test items, and (b) a "nonspeeded" class, composed of examinees who had sufficient time to answer all items. The parameter estimates obtained for end-of-test items in the nonspeeded class are shown to more accurately approximate their difficulties when the items are administered at earlier locations on a different form of the test. A mixture model can also be used to estimate the class memberships of individual examinees. In this way, it can be determined whether membership in the speeded class is associated with other student characteristics. Results are reported for gender and ethnicity.
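In minimal form (notation ours, not necessarily Rost's), the constrained mixture Rasch model can be written as:

```latex
P(X_{ij} = 1 \mid \theta_i, g) = \frac{\exp(\theta_i - b_{jg})}{1 + \exp(\theta_i - b_{jg})},
\qquad b_{j,\text{speeded}} \ge b_{j,\text{nonspeeded}} \quad \text{for end-of-test items } j,
```

where g indexes the latent class. The ordinal constraint forces the speeded class to absorb the inflated end-of-test difficulty, leaving cleaner difficulty estimates in the nonspeeded class.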

4.
Test administrators are appropriately concerned about the potential for time constraints to impact the validity of score interpretations; psychometric efforts to evaluate the impact of speededness date back more than half a century. The widespread move to computerized test delivery has led to the development of new approaches to evaluating how examinees use testing time and to new metrics designed to provide evidence about the extent to which time limits impact performance. Much of the existing research is based on these types of observational metrics; relatively few studies use randomized experiments to evaluate the impact of time limits on scores. Of those studies that do report on randomized experiments, none directly compare the experimental results to evidence from observational metrics to evaluate the extent to which these metrics are able to sensitively identify conditions in which time constraints actually impact scores. The present study provides such evidence based on data from a medical licensing examination. The results indicate that these observational metrics are useful but provide an imprecise evaluation of the impact of time constraints on test performance.

5.
A sample of college-bound juniors from 275 high schools took a test consisting of 70 math questions from the SAT. A random half of the sample was allowed to use calculators on the test. Both genders and three ethnic groups (White, African American, and Asian American) benefited about equally from being allowed to use calculators; Latinos benefited slightly more than the other groups. Students who routinely used calculators on classroom mathematics tests were relatively advantaged on the calculator test. Test speededness was about the same whether or not students used calculators. Calculator effects on individual items ranged from positive through neutral to negative and could either increase or decrease the validity of an item as a measure of mathematical reasoning skills. Calculator effects could be either present or absent in both difficult and easy items.

6.
A critical component of test speededness is the distribution of the test taker’s total time on the test. A simple set of constraints on the item parameters in the lognormal model for response times is derived that can be used to control the distribution when assembling a new test form. As the constraints are linear in the item parameters, they can easily be included in a mixed integer programming model for test assembly. The use of the constraints is demonstrated for the problems of assembling a new test form to be equally speeded as a reference form, test assembly in which the impact of a change in the content specifications on speededness is to be neutralized, and the assembly of test forms with a revised level of speededness.
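Because the constraints are linear in the item parameters, they drop straight into a mixed integer program. Below is a simplified sketch using the open-source PuLP library, with a hypothetical item pool and a single aggregate time-intensity target standing in for the article's full lognormal-model constraints:

```python
import numpy as np
import pulp

rng = np.random.default_rng(2)
n_pool, n_form = 50, 10
q = rng.normal(4.0, 0.3, n_pool)  # hypothetical item time-intensity parameters
target = 40.0                     # summed time intensity of the reference form

prob = pulp.LpProblem("speededness_matched_form", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(n_pool)]
slack = pulp.LpVariable("slack", lowBound=0)

prob += slack                                  # minimize deviation from the target
prob += pulp.lpSum(x) == n_form                # fixed test length
total = pulp.lpSum(float(q[j]) * x[j] for j in range(n_pool))
prob += total - target <= slack                # linear speededness constraints
prob += target - total <= slack

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected items:", [j for j in range(n_pool) if x[j].value() == 1])
```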

7.
Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees had to guess on the final six questions of the analytical section of the Graduate Record Examination in order to finish before time expired. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test.

8.
Specially constructed “speeded” and “unspeeded” forms of a Reading Comprehension test were administered to both regular center and fee-free center LSAT candidates in an effort to determine: (1) if the test was more speeded for fee-free candidates, and (2) if reducing the amount of speededness was more beneficial to fee-free candidates. Results of the analyses show: (1) the test is somewhat more speeded for fee-free candidates than for regular candidates, (2) reducing the amount of speededness produces higher scores for both regular and fee-free center candidates, and (3) reducing speededness is not significantly more beneficial (in terms of increasing the number of items answered correctly) to fee-free than to regular center candidates. Lower KR-20 reliability was observed under speeded conditions in the fee-free sample.

9.
There is a paucity of research in item response theory (IRT) examining the consequences of violating the implicit assumption of nonspeededness. In this study, test data were simulated systematically under various speeded conditions. The three factors considered in relation to speededness were the proportion of the test not reached (5%, 10%, and 15%), the treatment of not-reached items (blank vs. random response), and item ordering (random vs. easy to hard). The effects of these factors on parameter estimation were then examined by comparing the item and ability parameter estimates with the known true parameters. Results indicated that ability estimation was least affected by speededness in terms of the correlation between true and estimated ability parameters. On the other hand, substantial effects of speededness were observed among item parameter estimates. Recommendations for minimizing the effects of speededness are discussed.
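A stripped-down version of such a simulation design, with Rasch-generated responses and one illustrative level from each factor (the study crossed all levels of the three factors):

```python
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 1000, 40
theta = rng.normal(size=n_examinees)
b = np.sort(rng.normal(size=n_items))    # easy-to-hard item ordering
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
resp = (rng.random((n_examinees, n_items)) < p).astype(float)

# 10% of the test not reached by a hypothetical 30% of slow examinees;
# the "blank" condition codes not-reached responses as missing.
k = int(0.10 * n_items)
speeded = rng.random(n_examinees) < 0.30
resp[speeded, -k:] = np.nan
# "Random response" condition instead:
# resp[speeded, -k:] = (rng.random((speeded.sum(), k)) < 0.5).astype(float)
```

Calibrating `resp` under each condition and comparing the recovered item parameters to `b` reproduces the basic logic of the study.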

10.
The presence of nuisance dimensionality is a potential threat to the accuracy of results for tests calibrated using a measurement model such as a factor analytic model or an item response theory model. This article describes a mixture group bifactor model to account for the nuisance dimensionality due to a testlet structure as well as the dimensionality due to differences in patterns of responses. The model can be used for testing whether or not an item functions differently across latent groups in addition to investigating the differential effect of local dependency among items within a testlet. An example is presented comparing test speededness results from a conventional factor mixture model, which ignores the testlet structure, with results from the mixture group bifactor model. Results suggested the 2 models treated the data somewhat differently. Analysis of the item response patterns indicated that the 2-class mixture bifactor model tended to categorize omissions as indicating speededness. With the mixture group bifactor model, more local dependency was present in the speeded than in the nonspeeded class. Evidence from a simulation study indicated the Bayesian estimation method used in this study for the mixture group bifactor model can successfully recover generated model parameters for 1- to 3-group models for tests containing testlets.

11.
The AERA, APA, NCME Standards define validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’. A century of disagreement about validity does not mean that there has not been substantial progress. This consensus definition brings together interpretations and use so that it is one idea, not a sequence of steps. Just as test design is framed by a particular context of use, so too must validation research focus on the adequacy of tests for specific purposes. The consensus definition also carries forward major reforms in validity theory begun in the 1970s that rejected separate types of validity evidence for different types of tests, e.g. content validity for achievement tests and predictive correlations for employment tests. When the current definition refers to both ‘evidence and theory’ the Standards are requiring not just that a test be well designed based on theory but that evidence be collected to verify that the test device is working as intended. Having taught policy-makers, citizens, and the courts to use the word validity, especially in high-stakes applications, we cannot after the fact substitute a more limited, technical definition of validity. An official definition provides clarity even for those who disagree, because it serves as a touchstone and obliges them to acknowledge when they are departing from it.

12.
In order to determine the role of time limits on both test performance and test validity, we asked approximately 300 volunteers (prospective graduate students) to each write two essays, one in a 40-minute time period and the other in 60 minutes. Analyses revealed that, on average, test performance was significantly better when examinees were given 60 minutes instead of 40. However, there was no interaction between test-taking style (fast vs. slow) and time limits. That is, examinees who described themselves as slow writers/test takers did not benefit any more (or any less) from generous time limits than did their quicker counterparts. In addition, there was no detectable effect of different time limits on the meaning of essay scores, as suggested by their relationship to several nontest indicators of writing ability.

13.
Summary

In this article my purpose has not been to indicate what kinds of things can and can't be assessed appropriately with tests. Rather, I have tried to illuminate how the key ideas of reliability and validity are used by test developers and what this means in practice — not least in terms of the decisions that are made about individual students on the basis of their test results. As I have stressed throughout this article, these limitations are not the fault of test developers. However inconvenient these limitations are for proponents of school testing, they are inherent in the nature of tests of academic achievement, and are as real as rocks. All users of the results of educational tests must understand what a limited technology this is.

14.
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies, and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test, and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may impact the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.

15.
Educational tests used for accountability purposes must represent the content domains they purport to measure. When such tests are used to monitor progress over time, the consistency of the test content across years is important for ensuring that observed changes in test scores are due to student achievement rather than to changes in what the test is measuring. In this study, expert science teachers evaluated the content and cognitive characteristics of the items from 2 consecutive annual administrations of a 10th-grade science assessment. The results indicated the content area representation was fairly consistent across years and the proportion of items measuring the different cognitive skill areas was also consistent. However, the experts identified important cognitive distinctions among the test items that were not captured in the test specifications. The implications of this research for the design of science assessments and for appraising the content validity of state-mandated assessments are discussed.

16.
Test-taking strategies are important cognitive skills that strongly affect students' performance in tests. Using appropriate test-taking strategies improves students' achievement and grades, improves students' attitudes toward tests and reduces test anxiety. This in turn improves test accuracy and validity. This study aimed at developing a scale to assess students' test-taking strategies at university level. The scale developed passed through several validation procedures that included content, construct and criterion-related validity. Similarly, scale reliability (internal reliability and stability over time) was assessed through several procedures. Four samples of students (50, 828, 553 and 235) participated by responding to different versions of the scale. The scale developed consists of 31 items distributed into four sub-scales: Before-test, Time management, During-test and After-test. To the researcher's knowledge, this is the first comprehensive scale developed to assess test-taking strategies used by university students.

17.
In this review, the origins and history of a test of rapid automatized naming (RAN) are traced from nineteenth-century classical brain-behavior analyses of cases of acquired “alexia without agraphia” through adaptations to studies of normal and reading-disabled children. The element of speed (of responding verbally to a visual stimulus) was derived from a test of color naming developed over 50 years ago as a bedside measure of recovery from brain injuries. Merging the “visual-verbal” connection essential to reading (specific) with the response-time element (general), RAN turned out to be a useful correlate and predictor of reading competence, explaining variance even beyond that accounted for by timed tests of discrete naming. As one of the two deficits highlighted in the Double Deficit hypothesis, along with phonological awareness, RAN has emerged as something more than a particularly difficult challenge to a unitary phonological retrieval deficit, and has itself been subjected to further dissection. Coming full circle to its origins, recent research suggests that RAN taps both visual-verbal (language domain) and processing speed (executive domain) contributions to reading.

18.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered for various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threaten the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students' responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students' nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items, to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete-case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments are discussed.
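One way to operationalize such a scoring rule is sketched below; the 0/1/2 scheme and the speed cutoff are illustrative assumptions, not the authors' exact specification. Correct-and-fast responses earn 2 points, correct-but-slow responses 1 point, incorrect responses 0, and not-reached items are left as not administered:

```python
import numpy as np

def polytomous_score(correct, rt, fast_cutoff, reached):
    """Hypothetical rule: 2 = correct and fast, 1 = correct but slow,
    0 = incorrect; not-reached items are treated as not administered (NaN)."""
    score = np.where(correct, np.where(rt <= fast_cutoff, 2, 1), 0).astype(float)
    score[~reached] = np.nan
    return score

# Example: 5 items, the last two not reached within the 5-minute limit.
correct = np.array([True, True, False, True, True])
rt = np.array([4.0, 9.5, 6.0, 3.0, 2.0])   # seconds per item
reached = np.array([True, True, True, False, False])
print(polytomous_score(correct, rt, fast_cutoff=6.0, reached=reached))
# -> [ 2.  1.  0. nan nan]
```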

19.
Addressing key issues that affect test security and validity, the International Test Commission has developed test security guidelines covering four areas: defining and distinguishing test irregularities; developing and implementing a test security plan; safeguarding security during test administration; and responding to security breaches that arise during testing. The guidelines are offered as a reference for testing organizations and service agencies in all countries. This article introduces the main content of the international test security guidelines and, in light of the actual circumstances of examinations in China, analyzes their implications for the organization and administration of examinations in China, providing a reference for further improving the organization and management of examinations, safeguarding test security, and upholding test fairness.

20.
The psychometric literature is replete with comprehensive discussions of test validity, test validation, and the characteristics of quality assessment programs. The most authoritative source for guidance regarding sound test development and evaluation practices is the Standards for Educational and Psychological Testing. However, the Standards are not legally binding. In this article, we review the way in which validity is conceptualized in the Standards and compare this conceptualization with validity evidence presented in specific court cases involving legal challenges to tests. Our review indicates that, in general, there is strong congruence between the Standards and how validity is viewed in the courts, and that testing agencies that conform to these guidelines are likely to withstand legal scrutiny. However, the courts have taken a more practical, less theoretical view on validity and tend to emphasize evidence based on test content and testing consequences.
