1.

ALLOCATION OF ITEMS AND EXAMINEES IN ESTIMATING A NORM DISTRIBUTION BY ITEM-SAMPLING

DAVID M. SHOEMAKER 《Journal of Educational Measurement》1970,7(2):123-128

A norm distribution consisting of test scores received by 810 college students on a 150-item dichotomously-scored 4-alternative multiple-choice test was empirically estimated through several item-examinee sampling procedures. The post mortem item-sampling investigation was specifically designed to manipulate systematically the variables of number of subtests, number of items per subtest, and number of examinees responding to each subtest. Defining one observation as the score received by one examinee on one item, the results suggest that as the number of observations increases beyond 1.23% of the data base all procedures produce stochastically equivalent results. The results of this investigation indicate that, in estimating a norm distribution by item-sampling, the variable of importance is not the item-sampling procedure per se but is instead the number of observations obtained by the procedure. It should be noted, however, that in this investigation the test score norm distribution was approximately symmetrical and the possibility should not be overlooked that item-sampling as a procedure may be robust only for symmetrical norm distributions.
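
The sampling idea in this abstract can be sketched in a few lines. The sketch below is an illustration only, not Shoemaker's procedure: it assumes disjoint subtests, disjoint examinee groups, and a simulated 0/1 score matrix, and it estimates only the mean test score (the full norm distribution is not reconstructed). The function name `matrix_sample_mean` and all parameter values are hypothetical.

```python
import random

def matrix_sample_mean(scores, n_subtests, items_per_subtest,
                       examinees_per_subtest, seed=0):
    """Estimate the mean total-test score from a sampled part of the
    examinee-by-item 0/1 matrix. Returns (estimate, observations used)."""
    rng = random.Random(seed)
    n_examinees = len(scores)
    n_items = len(scores[0])
    # Draw items without replacement and split them into disjoint subtests.
    items = rng.sample(range(n_items), n_subtests * items_per_subtest)
    # Draw examinees without replacement; each subtest gets its own group.
    people = rng.sample(range(n_examinees), n_subtests * examinees_per_subtest)
    correct = 0
    observations = 0
    for s in range(n_subtests):
        sub_items = items[s * items_per_subtest:(s + 1) * items_per_subtest]
        sub_people = people[s * examinees_per_subtest:(s + 1) * examinees_per_subtest]
        for p in sub_people:
            for i in sub_items:
                correct += scores[p][i]   # one observation = one examinee on one item
                observations += 1
    # Scale the observed proportion correct up to the full test length.
    return n_items * correct / observations, observations

if __name__ == "__main__":
    # Simulated data (an assumption): 810 examinees, 150 items, each examinee
    # answering correctly with a personal probability between 0.3 and 0.9.
    sim = random.Random(1)
    matrix = []
    for _ in range(810):
        p_correct = sim.uniform(0.3, 0.9)
        matrix.append([1 if sim.random() < p_correct else 0 for _ in range(150)])
    true_mean = sum(sum(row) for row in matrix) / len(matrix)
    # 5 subtests x 30 items x 10 examinees = 1,500 of the 121,500 cells
    # (about 1.23%); this particular split is chosen for illustration only.
    estimate, used = matrix_sample_mean(matrix, 5, 30, 10)
    print(f"full-data mean: {true_mean:.2f}")
    print(f"sampled estimate: {estimate:.2f} from {used} observations")
```

Changing the three design arguments while holding their product fixed gives a rough feel for the abstract's claim that the number of observations matters more than the particular sampling plan.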

2.

A NOTE ON ALLOCATING ITEMS TO SUBTESTS IN MULTIPLE MATRIX SAMPLING AND APPROXIMATING STANDARD ERRORS OF ESTIMATE WITH THE JACKKNIFE

DAVID M. SHOEMAKER 《Journal of Educational Measurement》1973,10(3):211-219

Investigated empirically through post mortem item-examinee sampling were the relative merits of two alternative procedures for allocating items to subtests in multiple matrix sampling and the feasibility of using the jackknife in approximating standard errors of estimate. The results indicate clearly that a partially balanced incomplete block design is preferable to random sampling in allocating items to subtests. The jackknife was found to better approximate standard errors of estimate in the latter item allocation procedure than in the former.
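
For readers unfamiliar with the jackknife mentioned here, a minimal sketch of the standard delete-one version follows. It is a generic illustration, not the computation used in the article: the abstract does not say which units were omitted, so treating each input value as, say, one subtest-level estimate is an assumption, and the names `jackknife_se` and `mean` are hypothetical.

```python
import math

def jackknife_se(values, estimator):
    """Delete-one jackknife standard error of estimator(values)."""
    n = len(values)
    # Recompute the estimator n times, leaving out one value each time.
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((x - center) ** 2 for x in leave_one_out)
    return math.sqrt(variance)

def mean(xs):
    return sum(xs) / len(xs)

if __name__ == "__main__":
    # Hypothetical subtest-level estimates of a mean test score.
    estimates = [2.0, 4.0, 4.0, 6.0, 9.0]
    print(jackknife_se(estimates, mean))
```

When the estimator is the arithmetic mean, this quantity coincides with the familiar s divided by the square root of n, which is a convenient sanity check before applying it to less tractable estimators such as a variance.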

3.

AN APPLICATION OF ITEM-EXAMINEE SAMPLING TO SCALING ATTITUDES

DAVID M. SHOEMAKER 《Journal of Educational Measurement》1971,8(4):279-282

The post mortem item-examinee sampling investigation described herein explored the feasibility of using item-examinee sampling to estimate scale values denoting degree of affect toward stimuli when measured by the method of paired-comparisons. Results indicate that such scale values can be approximated satisfactorily through item-examinee sampling. Defining one observation as the response made by one examinee to one item, the similarity between the estimated scale values and normative scale values increased generally with increases in the number of observations acquired by the sampling plan.

4.

FURTHER RESULTS ON THE STANDARD ERRORS OF ESTIMATE ASSOCIATED WITH ITEM-EXAMINEE SAMPLING PROCEDURES

DAVID M. SHOEMAKER 《Journal of Educational Measurement》1971,8(3):215-220

Defining one observation as the score received by one examinee on one item, the results of this investigation suggest that, for a given test length, item-examinee sampling procedures having the same number of observations have, for all practical purposes, the same standard error in estimating μ but different standard errors in estimating σ. Additionally, the variance of the item difficulty indices (proportion answering the item correctly) was found to be a significant factor in accounting for differences in standard errors of estimating μ between normative distributions differing primarily in degree of skewness.

5.

Robust Detection of Examinees With Aberrant Answer Changes

Dmitry I. Belov 《Journal of Educational Measurement》2015,52(4):437-456

The statistical analysis of answer changes (ACs) has uncovered multiple testing irregularities on large-scale assessments and is now routinely performed at testing organizations. However, AC data has an uncertainty caused by technological or human factors. Therefore, existing statistics (e.g., number of wrong-to-right ACs) used to detect examinees with aberrant ACs capitalize on the uncertainty, which may result in a large Type I error. In this article, the information about ACs is used only for the partitioning of administered items into two disjoint subtests: items where ACs did not occur, and items where ACs did occur. A new statistic is based on the difference in performance between these subtests (measured as Kullback–Leibler divergence between corresponding posteriors of latent traits), where, in order to avoid the uncertainty, only final responses are used. One of the subtests can be filtered such that the asymptotic distribution of the statistic is chi-square with one degree of freedom. In computer simulations, the presented statistic demonstrated a strong robustness to the uncertainty and higher detection rates in contrast to two popular statistics based on wrong-to-right ACs.

6.

AN INVESTIGATION OF AN EXTENSION OF ITEM SAMPLING WHICH YIELDS INDIVIDUAL SCORES

MARY ANNE BUNDA 《Journal of Educational Measurement》1973,10(2):117-130

The sampling procedures were designed so that the full matrix of item variances and covariances could be estimated. Three subtest sizes were investigated: subtests of five, nine, and sixteen items. In each of these implementations a double cross validation was used, yielding two predicted scores for each individual. Discrepancy measures were also computed showing the difference between the observed and the predicted scores. The prediction of individual scores was accomplished within various ranges of error. The correlations between predicted scores and observed scores ranged from the .70s to the .90s, depending on the number of predictor variables used. The procedure is applicable in situations in which large numbers of individuals are tested or in situations where multiple measures are taken.

7.

IRT Ability Estimates from Customized Achievement Tests Without Representative Content Sampling

《Applied Measurement in Education》2013,26(1):15-35

This study examines the effects of using item response theory (IRT) ability estimates based on customized tests that were formed by selecting specific content areas from a nationally standardized achievement test. Subsets of items were selected from four different subtests of the Iowa Tests of Basic Skills (Hieronymus, Hoover, & Lindquist, 1985) on the basis of (a) selected content areas (content-customized tests) and (b) a representative sampling of content areas (representative-customized tests). For three of the four tests examined, ability estimates and estimated national percentile ranks based on the content-customized tests in school samples tended to be systematically higher than those based on the full tests. The results of the study suggested that for certain populations, IRT ability estimates and corresponding normative scores on content-customized versions of standardized achievement tests cannot be expected to be equivalent to scores based on the full-length tests.

8.

The Effects of Item and Examinee Sampling in the Analysis and Selection of Objective Test Items

G.M. Seddon R.M.H. Hind 《Educational Psychology》1986,6(1):71-77

The investigation sets out to determine whether in the construction of norm-referenced tests the effects of sampling errors in the pre-testing procedure outweigh the theoretical advantages to be gained in selecting items with the highest discrimination indices. The procedure for pre-testing and selecting the items is simulated by sampling artificial matrices each representing the scores of a population of examinees on a domain of items. The results indicate that the sampling errors do not have a significant deleterious effect if the samples in the pre-testing procedure contain more than 50 items and 25 examinees. Moreover, there may be very little to be gained by using larger samples.

9.

A COMPARISON OF EXAMINEE SAMPLING AND MULTIPLE MATRIX SAMPLING IN TEST DEVELOPMENT

RASHMI GARG MARVIN W. BOSS JAMES E. CARLSON 《Journal of Educational Measurement》1986,23(2):119-130

For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

10.

Diagnostic utility of the number of Wisc‐III subtests deviating from mean performance among students with learning disabilities

Marley W. Watkins Frank C. Worrell 《Psychology in the schools》2000,37(4):303-309

This paper examined the diagnostic utility of subtest variability, as represented by the number of subtests that deviate from examinees' mean IQ scores, for identifying students with a learning disability (LD). Participants consisted of the 2,200 students in the WISC-III normative sample and 684 students (Mdn grade = 5; M age = 10.8) identified as LD. The number of subtests deviating from examinees' Verbal, Performance, and Full Scale IQ by ±3 points for normative and exceptional samples were contrasted via Receiver Operating Curve (ROC) analyses. Results indicated that LD students did not differ from normative sample children at levels above chance. It was concluded that deviation of individual subtest scores from mean IQ scores has no diagnostic utility for hypothesizing about students with learning disabilities. © 2000 John Wiley & Sons, Inc.

11.

A Comparison of Item Sampling Plans in the Application of Multiple Matrix Sampling

Risa P. Gressard Brenda H. Loyd 《Journal of Educational Measurement》1991,28(2):119-130

This study used a Monte Carlo approach to investigate the effect of item sampling by item stratification on parameter estimation when applying multiple matrix sampling to achievement data. From the results of this study it was concluded that the item sampling method and sampling plan which is a practical compromise in terms of precision and sample size is one based on item stratification by item discrimination and a sampling plan with a moderate number of subtests. This sampling condition provides reasonable precision of the mean and variance estimates but requires only a moderately sized sample.

12.

The Use of Multiple Matrix Sampling for Survey Research

Gail F. Munger Brenda H. Loyd 《Journal of Experimental Education》2013,81(4):187-191

Multiple matrix sampling procedures can be employed to improve survey research when the results of matrix sampling are equivalent to those obtained by the traditional census testing approach. This study examined the use of multiple matrix sampling as a strategy for the collection of data and compared rates of response when subgroups of items were administered as opposed to an entire instrument. In addition, the study investigated whether responses were equivalent in the two sampling procedures and whether bias was present. The results indicate that multiple matrix sampling is a viable and reasonable procedure to use when a mail survey questionnaire consists of a large number of pages and/or items.

13.

Detecting Differential Speededness in Multistage Testing

Wim J. van der Linden Krista Breithaupt Siang Chee Chuah Yanwei Zhang 《Journal of Educational Measurement》2007,44(2):117-130

A potential undesirable effect of multistage testing is differential speededness, which happens if some of the test takers run out of time because they receive subtests with items that are more time intensive than others. This article shows how a probabilistic response-time model can be used for estimating differences in time intensities and speed between subtests and test takers and detecting differential speededness. An empirical data set for a multistage test in the computerized CPA Exam was used to demonstrate the procedures. Although the more difficult subtests appeared to have items that were more time intensive than the easier subtests, an analysis of the residual response times did not reveal any significant differential speededness because the time limit appeared to be appropriate. In a separate analysis, within each of the subtests, we found minor but consistent patterns of residual times that are believed to be due to a warm-up effect, that is, use of more time on the initial items than they actually need.

14.

Influence of Type of Judge, Normative Information, and Discussion on Standards Recommended for the National Teacher Examinations

John Christian Busch Richard M. Jaeger 《Journal of Educational Measurement》1990,27(2):145-163

There are few empirical investigations of the consequences of using widely recommended data collection procedures in conjunction with a specific standard-setting method such as the Angoff (1971) procedure. Such recommendations include the use of several types of judges, the provision of normative information on examinees' test performance, and the opportunity to discuss and reconsider initial recommendations in an iterative standard-setting procedure. This study of 236 expert judges investigated the effects of using these recommended procedures on (a) average recommended test standards, (b) the variability of recommended test standards, and (c) the reliability of recommended standards for seven subtests of the National Teacher Examinations Communication Skills and General Knowledge Tests. Small, but sometimes statistically significant, changes in mean recommended test standards were observed when judges were allowed to reconsider their initial recommendations following review of normative information and discussion. Means for public school judges changed more than did those for college or university judges. In addition, there was a significant reduction in the within-group variability of standards recommended for several subtests. Methods for estimating the reliability of recommended test standards proposed by Kane and Wilson (1984) were applied, and their hypothesis of positive covariation between empirical item difficulties and mean recommended standards was confirmed. The data collection procedures examined in this study resulted in substantial increases in the reliability of recommended test standards.

15.

NCME 2008 Presidential Address: The Impact of Anchor Test Configuration on Student Proficiency Rates

Anne R. Fitzpatrick 《Educational Measurement》2008,27(4):34-40

Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple-choice tests administered in an annual, large-scale, high-stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one-parameter model and equated using the mean b-value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.

16.

Gender Differences in Multiple-Choice Tests: The Role of Differential Guessing Tendencies

Gershon Ben-Shakhar Yakov Sinai 《Journal of Educational Measurement》1991,28(1):23-35

The present study focused on gender differences in the tendency to omit items and to guess in multiple-choice tests. It was hypothesized that males would show greater guessing tendencies than females and that the use of formula scoring rather than the use of number of correct answers would result in a relative advantage for females. Two samples were examined: ninth graders and applicants to Israeli universities. The teenagers took a battery of five or six aptitude tests used to place them in various high schools, and the adults took a battery of five tests designed to select candidates to the various faculties of the Israeli universities. The results revealed a clear male advantage in most subtests of both batteries. Four measures of item-omission tendencies were computed for each subtest, and a consistent pattern of greater omission rates among females was revealed by all measures in most subtests of the two batteries. This pattern was observed even in the few subtests that did not show male superiority and even when permissive instructions were used. Correcting the raw scores for guessing reduced the male advantage in all cases (and in the few subtests that showed female advantage the difference increased as a result of this correction), but this effect was small. It was concluded that although gender differences in guessing tendencies are robust they account for only a small fraction of the observed gender differences in multiple-choice tests. The results were discussed, focusing on practical implications.
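
To make "formula scoring" concrete, the sketch below applies the conventional correction for guessing, R - W/(k - 1), where R is the number right, W the number wrong, and k the number of options per item. The abstract does not state which variant the authors used, so this formula, the function name `formula_score`, and the two example examinees are assumptions for illustration.

```python
def formula_score(right, wrong, n_options):
    """Conventional correction for guessing: R - W / (k - 1).
    Omitted items contribute nothing to the score."""
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (n_options - 1)

if __name__ == "__main__":
    # Two hypothetical examinees on a 40-item, 5-option test who both know 30 answers.
    # The first omits the 10 unknown items; the second guesses on all 10 and
    # happens to get 2 right and 8 wrong.
    omitter = {"right": 30, "wrong": 0}
    guesser = {"right": 32, "wrong": 8}
    for label, e in (("omitter", omitter), ("guesser", guesser)):
        raw = e["right"]
        corrected = formula_score(e["right"], e["wrong"], 5)
        print(label, "raw:", raw, "formula score:", corrected)
```

Under number-right scoring the guesser is ahead by two points; under the correction the two examinees tie, which is the mechanism by which formula scoring would favor a group that omits more often.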

17.

On the perennial argument about grading “on the curve” in college courses

Donald W. Zimmerman 《Educational Psychologist》2013,48(3):175-178

If letter grades assigned in a course are determined by the percent of objective test items answered correctly, then under various assumptions about the mean and standard deviation of normally distributed test scores, the distribution of the percent of students receiving each letter grade is highly anomalous. Conversely, if one insists that letter grades be “reasonably” distributed and at the same time determined by a typical percentage scheme, then allowed test parameters are severely constrained and not the ones which psychometricians consider most defensible. One recourse available to an instructor who does not wish to grade “on the curve” is to adjust the test parameters to dubious values by manipulation of item difficulty so that a desired grade distribution based on percentages is achieved. This informal adjustment procedure constitutes an indirect and inefficient method of employing test norms: in effect, grading “on the curve.”
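
The argument can be checked numerically. The sketch below computes the share of students falling in each letter-grade band when percent-correct scores are normally distributed. The 90/80/70/60 cutoffs and the example means and standard deviations are assumptions chosen for illustration; the abstract does not give the values Zimmerman examined. The normal curve is left unbounded, so the "A" band includes the small tail above 100%.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def grade_shares(mean, sd, cutoffs, lowest="F"):
    """cutoffs: (grade, lower bound) pairs ordered from highest to lowest.
    Returns the expected fraction of students receiving each grade."""
    shares = {}
    prob_below_upper = 1.0
    for grade, lower in cutoffs:
        prob_below_lower = normal_cdf(lower, mean, sd)
        shares[grade] = prob_below_upper - prob_below_lower
        prob_below_upper = prob_below_lower
    shares[lowest] = prob_below_upper  # everything under the last cutoff
    return shares

if __name__ == "__main__":
    # Hypothetical percentage scheme and test parameters.
    scheme = [("A", 90.0), ("B", 80.0), ("C", 70.0), ("D", 60.0)]
    for mean, sd in ((50.0, 10.0), (70.0, 10.0), (80.0, 8.0)):
        shares = grade_shares(mean, sd, scheme)
        summary = ", ".join(f"{g}: {100 * s:.1f}%" for g, s in shares.items())
        print(f"mean={mean}, sd={sd} -> {summary}")
```

Trying a few parameter pairs shows how strongly the grade distribution depends on the test's mean and spread once the percentage cutoffs are fixed.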

18.

The Influence of Test-Wiseness on Performance of High School Seniors on School Leaving Examinations

《Applied Measurement in Education》2013,26(2):159-183

The influence of test-wiseness on the performance of high school seniors on provincial school leaving examinations in English, algebra, geography, history, biology, and chemistry was empirically assessed. The percentages of test-wise susceptible four-option multiple-choice items varied from 43% to 80% across the six examinations. The mean score of provincially representative samples on the subject area subtests comprised of these faulty items exceeded by 9.2% to 18.4% (p < .01) the mean scores of the same students on the corresponding subtests consisting of the nonsusceptible items.

19.

Teaching Computer Technology to Adult Computer Novices. Normative Didactics Based on Adults’ Perception of Information Technology

Sven Erik Nordenbo 《Scandinavian Journal of Educational Research》2013,57(4):243-258

20.

DETECTING EXPERIMENTALLY INDUCED ITEM BIAS USING THE ITERATIVE LOGIT METHOD

FRANK G. KOK GIDEON J. MELLENBERGH HENK VAN DER FLIER 《Journal of Educational Measurement》1985,22(4):295-303

A test for mental arithmetic was constructed, consisting of items written in Dutch (the subjects' native language), Spanish, and Roman numerals. A group of 286 subjects received some information on Spanish numerals. The group was randomly split into a Spanish Group and a Roman Group. The Spanish Group received further instruction on Spanish numerals, while the Roman Group got instruction on Roman numerals. Checks on the experimental manipulations showed that the Spanish Group had better knowledge of Spanish numerals than the Roman Group, whereas the Roman Group had better knowledge of Roman numerals. From the total test two subtests were constructed: a 30-item Dutch/Spanish subtest (15 items in Dutch and 15 in Spanish), and a 25-item Dutch/Roman subtest (15 items in Dutch and 10 in Roman). The Dutch items were unbiased between the Spanish and Roman groups, whereas the Spanish items of the Dutch/Spanish subtest were biased against the Roman Group, and the Roman items of the Dutch/Roman subtest were biased against the Spanish Group. The iterative logit method was applied to the two subtests. The method showed very good results in detecting biased items.