Similar Articles
1.
One assumption common to all models for determining the optimal number of options per item (e.g., Lord, 1977) is that total testing time is proportional to the number of items and the number of options per item. Under this assumption, given a fixed testing time, the test can therefore be shortened or lengthened by deleting or adding a proportional number of options. The present study examines the validity of this assumption in three tests which were administered with 2, 3, 4, and 5 options per item. The number of items attempted in the first 10 and 15 minutes of the testing session and the time needed to complete the tests were recorded, so the rate of performance could be analyzed for both fixed time and fixed test length. A strong and consistently negative relationship between rate of performance and the number of options was detected in all tests; thus, the empirical results did not support the assumption of proportionality. Furthermore, the data indicated that the method by which options are deleted can play a role in this context. A more realistic assumption of generalized proportionality, proposed by Grier (1976), was supported by the results from a Mathematical Reasoning test, but was only partially supported for a Vocabulary and a Reading Comprehension test.
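The proportionality assumption tested above implies a direct trade-off: with total time proportional to items × options, a fixed session budget buys fewer items as options are added. A minimal sketch of that arithmetic (all numbers are illustrative assumptions, not values from the study):

```python
def items_within_time(total_seconds, seconds_per_option, n_options):
    """Under strict proportionality, each item costs seconds_per_option * n_options,
    so a fixed testing time accommodates this many items."""
    return int(total_seconds // (seconds_per_option * n_options))

# A hypothetical 30-minute session at 6 seconds per option:
n5 = items_within_time(1800, 6, 5)  # 5-option items
n3 = items_within_time(1800, 6, 3)  # 3-option items
```

The study's negative relationship between response rate and option count is exactly what this simple model cannot capture: empirically, `seconds_per_option` itself changes with `n_options`.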

2.
Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed-response (CR) and multiple-choice (MC) items and in overall difficulty. The scalings included those based on scores for the CR items that maintained the number of levels in the item rubrics, produced either from single ratings or from multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of CR items obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the CR items and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the MC items. Item fit was substantially worse with the combined 1PL/1PPC model than with the 3PL/2PPC model, due to the former's restrictive assumptions that there would be no guessing on the MC items and that item discrimination would be equal across items and item types. The presence of varying item discriminations meant that the 1PL/1PPC model could produce spuriously inflated estimates of item information for CR items with three or more score levels. Information for some items with summed ratings was overestimated by 300% or more under the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints imposed by the restricted model suggest limits on the testing contexts in which the 1PL/1PPC model can be accurately applied.
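The guessing and discrimination assumptions at issue here can be seen in the standard 3PL formulas (textbook expressions, not the study's own code): the lower asymptote c depresses item information, and the 1PL, which fixes a and sets c = 0, cannot reflect that.

```python
import math

D = 1.7  # conventional logistic scaling constant

def p_3pl(theta, a, b, c):
    """Three-parameter logistic: a = discrimination, b = difficulty,
    c = lower asymptote (pseudo-guessing)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information for a 3PL item; a nonzero c reduces information."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# At theta = b with c = 0 this reduces to (D*a)^2 * p * (1 - p):
i_no_guess = info_3pl(0.0, 1.0, 0.0, 0.0)
i_guessing = info_3pl(0.0, 1.0, 0.0, 0.25)  # guessing costs information
```

When a model that assumes c = 0 is fit to data where guessing occurs, the information it reports is too optimistic, which is the mechanism behind the underestimated standard errors described above.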

3.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
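The mean b-value method used here fixes the linking slope at 1 and shifts the new form's difficulties by the mean difference on the anchor set. A minimal sketch with hypothetical anchor values (not parameters from the study):

```python
def mean_b_equate(b_new_anchor, b_old_anchor, b_new_items):
    """Mean b-value linking: shift new-form difficulties onto the old
    scale by the mean difference of the common (anchor) item b-values."""
    shift = (sum(b_old_anchor) / len(b_old_anchor)
             - sum(b_new_anchor) / len(b_new_anchor))
    return [b + shift for b in b_new_items]

# Hypothetical anchor b-values from two administrations:
old_anchor = [-0.5, 0.0, 0.4, 1.1]
new_anchor = [-0.3, 0.2, 0.6, 1.3]
linked = mean_b_equate(new_anchor, old_anchor, [0.0, 1.0])  # shift of about -0.2
```

Because the whole transformation rests on the anchor mean, a few anchor items that drift in relative difficulty between years move the mean directly, which is why the study found short anchor sets so unstable.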

4.
This article describes a comparative study conducted at the item level for paper and online administrations of a statewide high stakes assessment. The goal was to identify characteristics of items that may have contributed to mode effects. Item-level analyses compared two modes of the Texas Assessment of Knowledge and Skills (TAKS) for up to four subjects at two grade levels. The analyses included significance tests of p-value differences, DIF, and response distributions for each item. Additional analyses investigated item position effects and objective-level mode differences. No evidence of item position effects emerged, but significant differences were found for several items and objectives in all subjects at grade 8 and in mathematics and English language arts (ELA) at grade 11. Differences generally favored the paper group. ELA items that were longer in passage length and math items that required graphing and geometric manipulations or involved scrolling in the online administration tended to be the items showing mode differences.

5.
Educational Assessment, 2013, 18(4), 317–340
A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.

6.
Instructional sensitivity is the psychometric capacity of tests or single items to capture the effects of classroom instruction. Yet how current item sensitivity measures relate to (a) actual instruction and (b) overall test sensitivity is rather unclear. The present study aims to close these gaps by investigating test and item sensitivity to teaching quality, reanalyzing data from a quasi-experimental intervention study in primary school science education (1026 students, 53 classes, Mage = 8.79 years, SDage = 0.49, 50% female). We examine (a) the correlation of item sensitivity measures with the potential for cognitive activation in class and (b) the consequences for test score interpretation when assembling tests from items varying in their degree of sensitivity to cognitive activation. Our study (a) provides validity evidence that item sensitivity measures may be related to actual classroom instruction and (b) points out that inferences about teaching drawn from test scores may vary with test composition.

7.
A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form rather than the entire domain of items. Six item response theory (IRT) based domain score estimation methods were evaluated under conditions of few items per content area per form taken, small domains, and small group sizes. The methods used item responses to a single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and fewest items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.
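The core IRT idea is simple: once ability is estimated from the form actually taken, the expected percent correct over the whole domain is the average model-implied probability across all domain items, including those the examinee never saw. A sketch under a 2PL model with hypothetical item parameters (the study's six methods differ in detail and may use other models):

```python
import math

def p_2pl(theta, a, b):
    """2PL correct-response probability (D = 1.7 scaling)."""
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

def expected_domain_score(theta_hat, domain_items):
    """Expected percent-correct over the full domain, given an ability
    estimate obtained from a single form."""
    return sum(p_2pl(theta_hat, a, b) for a, b in domain_items) / len(domain_items)

# Hypothetical domain of (a, b) pairs; theta_hat estimated from one form:
domain = [(1.0, -1.0), (0.8, 0.0), (1.2, 0.5), (0.9, 1.5)]
score = expected_domain_score(0.3, domain)
```

The group-level variants in the study aggregate over examinees, e.g. by integrating this expectation over an estimated ability distribution rather than plugging in individual point estimates.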

8.
Educational Assessment, 2013, 18(2), 127–143
This study investigated factors related to score differences on computerized and paper-and-pencil versions of a series of primary K–3 reading tests. Factors studied included item and student characteristics. The results suggest that the score differences were more related to student than item characteristics. These student characteristics include response style variables, especially omitting, and socioeconomic status as measured by free lunch eligibility. In addition, response style and socioeconomic status appear to be relatively independent factors in the score differences. Variables studied but not found to be related to the format score differences included association of items with a reading passage, item difficulty, and teacher versus computer administration of items. However, because this study is the 1st to study the factors behind these score differences below Grade 3, and because a number of states are increasing computer testing at the primary grades, additional studies are needed to verify the importance of these 2 factors.

9.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
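Percentile matching itself is mechanical once score probabilities are in hand: compute the percentile rank of a score on form X and find the form-Y score with the same rank, interpolating within the discrete distribution. A minimal sketch of that step (not the kernel-equating continuization or the delta-method standard errors the study derives):

```python
def percentile_rank(probs, x):
    """Percentile rank of integer score x, given probabilities for scores 0..K."""
    return sum(probs[:x]) + 0.5 * probs[x]

def equipercentile_equate(probs_x, probs_y, x):
    """Map score x on form X to the form-Y score with the same percentile
    rank, interpolating linearly within the discrete Y distribution."""
    pr = percentile_rank(probs_x, x)
    cum = 0.0
    for y, p in enumerate(probs_y):
        if cum + p >= pr:
            return y - 0.5 + (pr - cum) / p
        cum += p
    return len(probs_y) - 0.5

# Identical score distributions must map a score onto itself:
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
same = equipercentile_equate(probs, probs, 2)  # approximately 2.0
```

In the study, `probs_x` and `probs_y` would come from a polytomous IRT model rather than observed frequencies, which is what makes the delta-method standard errors tractable.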

10.
The effects of training tests on subsequent achievement were studied using two test-item characteristics: item difficulty and item complexity. Ninety Ss were randomly assigned to treatment conditions having easy or difficult items and calling for rote or complex skills. Each S was administered two training tests during the quarter containing only items defined by his treatment condition. The dependent measure was a sixty-item final examination with fifteen items reflecting each of the four treatment-condition item types. The results showed greater achievement for those trained with difficult items and with rote items. In addition, two interactions of treatment conditions with type of test item were found. The results are discussed as supporting a hierarchical model rather than a “similarity” transfer model of learning.

11.
Two experiments were conducted to determine if a relationship exists between test item arrangements and student performance on power tests. The primary hypotheses were: item arrangements based upon item difficulty, similarity of content, or order of class presentation do not influence test score or required testing time. In the first experiment 122 subjects were randomly assigned to three item difficulty arrangements of 139 test items with a 0–100% difficulty range, and in the second experiment 156 subjects were randomly assigned to three item content arrangements of 103 items. Results of analyses of variance with test anxiety used as a classification factor supported the hypotheses.

12.
Previous research has shown that rapid-guessing behavior can degrade the validity of test scores from low-stakes proficiency tests. This study examined, using hierarchical generalized linear modeling, examinee and item characteristics for predicting rapid-guessing behavior. Several item characteristics were found significant; items with more text or those occurring later in the test were related to increased rapid guessing, while the inclusion of a graphic in an item was related to decreased rapid guessing. The sole significant examinee predictor was SAT total score. Implications of these results for measurement professionals developing low-stakes tests are discussed.

13.
All Year 2 children in six randomly selected primary schools within one Local Education Authority (LEA) comprised the sample to which the Lawseq self‐esteem questionnaire was administered. Four years later, when they were Year 6, they completed the Lawseq again. A two‐way analysis of variance with Sex and Occasions was carried out on the 12 individual items of the instrument and the total. There were no significant differences between occasions or sexes on the overall score, but there were significant differences between occasions on seven of the 12 items and between sexes on two items. On only one item was there a significant interaction between sexes and occasions. The mean for the total fell over the 4 years. The means for both occasions were considerably below the mean of 19.00 obtained when Lawrence standardised the test in 1981. Discussion centred on possible reasons for this, such as appropriacy of the instrument for the age‐groups under study, stability of administration and changes within society and school.

14.
Two item selection algorithms were compared in simulated linear and adaptive tests of cognitive ability. One algorithm selected items that maximally differentiated between examinees. The other used item response theory (IRT) to select items having maximum information for each examinee. Normally distributed populations of 1,000 cases were simulated, using test lengths of 4, 5, 6, and 7 items. Overall, adaptive tests based on maximum information provided the most information over the widest range of ability values and, in general, differentiated among examinees slightly better than the other tests. Although the maximum differentiation technique may be adequate in some circumstances, adaptive tests based on maximum information are clearly superior.
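Maximum-information selection is straightforward to state: at each step, pick the unused item whose Fisher information is largest at the current ability estimate. A minimal sketch under a 2PL model with a hypothetical item pool (the study's items and models may differ):

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta (D = 1.7 scaling)."""
    p = 1 / (1 + math.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * p * (1 - p)

def pick_max_info(theta_hat, item_pool, used):
    """Adaptive selection: among unused items, return the index of the
    item with maximum information at the current ability estimate."""
    candidates = [i for i in range(len(item_pool)) if i not in used]
    return max(candidates, key=lambda i: info_2pl(theta_hat, *item_pool[i]))

pool = [(1.0, -1.0), (1.5, 0.0), (0.7, 0.2), (1.2, 1.0)]  # (a, b) pairs
best = pick_max_info(0.1, pool, used={2})
```

Because 2PL information peaks at theta = b and grows with a squared, the rule favors highly discriminating items whose difficulty sits near the current estimate, which is why it concentrates information where it is needed.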

15.
This study investigated the usefulness of the bifactor model in the investigation of score equivalence from computerized and paper-and-pencil formats of the same reading tests. Concerns about the equivalence of the paper-and-pencil and computerized formats were warranted because of the use of reading passages, computer unfamiliarity of primary school students, teacher versus computer administration of the tests, and slightly lower scores on the computerized format than on the paper-and-pencil format across all 4 grades. A confirmatory item factor analysis implemented through the bifactor model in TESTFACT indicated that the best-fitting model had a general factor as well as skill-group factors. This model was more consistent with the data than a model with 2 method factors, paper-and-pencil and computer administration. In addition, the general and skill factor loadings for most of the items were reasonable. Although several instances of negative loadings were found for items on the skill factors, these did not appear to have any practical importance. As a result, the bifactor model proved useful for studying paper-and-pencil and computerized score equivalence because of the reasonable results and delineation of loadings for the method and skill factors at the item level as well as for the general factor.

16.
In an attempt to identify some of the causes of answer-changing behavior, the effects of four test- and item-specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item in a group of items, the position of each item in the test, and the discrimination and difficulty indices of each item. It is shown that answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and items with poor discrimination to be changed more frequently. Some implications of answer changing for the design of tests are discussed.

17.
In one study, parameters were estimated for constructed-response (CR) items in 8 tests from 4 operational testing programs using the 1-parameter and 2-parameter partial credit (1PPC and 2PPC) models. Where multiple-choice (MC) items were present, these models were combined with the 1-parameter and 3-parameter logistic (1PL and 3PL) models, respectively. We found that item fit was better when the 2PPC model was used alone or with the 3PL model. Also, the slopes of the CR and MC items were found to differ substantially. In a second study, item parameter estimates produced using the 1PL-1PPC and 3PL-2PPC model combinations were evaluated for fit to simulated data generated using true parameters known to fit one model combination or the other. The results suggested that the more flexible 3PL-2PPC model combination would produce better item fit than the 1PL-1PPC combination.

18.
This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM model for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference) and, for triplets, is the Triplet-2PLM. Fisher's information and directional information are described, and the test information for Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) trait score estimates is distinguished. Expected item/test information indexes at various levels are proposed and plotted to provide diagnostic information on items and tests. The expected test information indexes for EAP scores may be difficult to compute due to a typical test's vast number of item response patterns. The relationships of item/test information with discrimination parameters of statements, standard error, and reliability estimates of trait score estimates are discussed and demonstrated using real data. Practical suggestions for checking the various expected item/test information indexes and plots are provided.

19.
Increasingly, tests are being translated and adapted into different languages. Differential item functioning (DIF) analyses are often used to identify non-equivalent items across language groups. However, few studies have focused on understanding why some translated items produce DIF. The purpose of the current study is to identify sources of differential item and bundle functioning on translated achievement tests using substantive and statistical analyses. A substantive analysis of existing DIF items was conducted by an 11-member committee of testing specialists. In their review, four sources of translation DIF were identified. Two certified translators used these four sources to categorize a new set of DIF items from Grade 6 and 9 Mathematics and Social Studies Achievement Tests. Each item was associated with a specific source of translation DIF and each item was anticipated to favor a specific group of examinees. Then, a statistical analysis was conducted on the items in each category using SIBTEST. The translators sorted the mathematics DIF items into three sources, and they correctly predicted the group that would be favored for seven of the eight items or bundles of items across two grade levels. The translators sorted the social studies DIF items into four sources, and they correctly predicted the group that would be favored for eight of the 13 items or bundles of items across two grade levels. The majority of items in mathematics and social studies were associated with differences in the words, expressions, or sentence structure of items that are not inherent to the language and/or culture. By combining substantive and statistical DIF analyses, researchers can study the sources of DIF and create a body of confirmed DIF hypotheses that may be used to develop guidelines and test construction principles for reducing DIF on translated tests.

20.
The 1986 scores from Florida's Statewide Student Assessment Test, Part II (SSAT-II), a minimum-competency test required for high school graduation in Florida, were placed on the scale of the 1984 scores from that test using five different equating procedures. For the highest-scoring 84% of the students, four of the five methods yielded results within 1.5 raw-score points of each other. They would be essentially equally satisfactory in this situation, in which the tests were made parallel item by item in difficulty and content and the groups of examinees were population cohorts separated by only 2 years. Also, the results from six different anchor-test lengths were compared. Anchors of 25, 20, 15, or 10 randomly selected items provided equatings as effective as 30 items using the concurrent IRT equating method, but an anchor of 5 randomly selected items did not.
