
1.

A Comparison of Developmental Scales Based on Thurstone Methods and Item Response Theory

Valerie S. L. Williams Mary Pommerich David Thissen 《Journal of Educational Measurement》1998,35(2):93-107

A developmental scale for the North Carolina End-of-Grade Mathematics Tests was created using a subset of identical test forms administered to adjacent grade levels. Thurstone scaling and item response theory (IRT) techniques were employed to analyze the changes in grade distributions across these linked forms. Three variations of Thurstone scaling were examined, one based on Thurstone's 1925 procedure and two based on Thurstone's 1938 procedure. The IRT scaling was implemented using both BIMAIN and MULTILOG. All methods indicated that average mathematics performance improved from Grade 3 to Grade 8, with similar results for the two IRT analyses and one version of Thurstone's 1938 method. The standard deviations of the IRT scales did not show a consistent pattern across grades, whereas those produced by Thurstone's 1925 procedure generally decreased; one version of the 1938 method exhibited slightly increasing variation with increasing grade level, while the other version displayed inconsistent trends.

2.

IRT Item Parameter Scaling for Developing New Item Pools

Hyeon-Ah Kang Ying Lu Hua-Hua Chang 《Applied Measurement in Education》2017,30(1):1-15

Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. Three scaling procedures are considered: (a) concurrent calibration, (b) separate calibration with one linking, and (c) separate calibration with three sequential linkings. Evaluation across varying sample sizes and item pool sizes suggests that calibrating an item pool simultaneously results in the most stable scaling. The separate calibration with linking procedures produced larger scaling errors as the number of linking steps increased. Haebara's item characteristic curve linking method performed better than the test characteristic curve (TCC) linking method. The present article provides an analytic illustration that the test characteristic curve method may fail to find global solutions with polytomous items. Finally, comparison of the single- and mixed-format item pools suggests that the use of polytomous items as the anchor can improve the overall scaling accuracy of the item pools.

3.

A Comparison of IRT Linking Procedures

Won-Chan Lee Jae-Chun Ban 《Applied Measurement in Education》2013,26(1):23-48

Various applications of item response theory often require linking to achieve a common scale for item parameter estimates obtained from different groups. This article used a simulation to examine the relative performance of four different item response theory (IRT) linking procedures in a random groups equating design: concurrent calibration with multiple groups, separate calibration with the Stocking-Lord method, separate calibration with the Haebara method, and proficiency transformation. The simulation conditions used in this article included three sampling designs, two levels of sample size, and two levels of the number of items. In general, the separate calibration procedures performed better than the concurrent calibration and proficiency transformation procedures, even though some inconsistent results were observed across different simulation conditions. Some advantages and disadvantages of the linking procedures are discussed.

4.

A Longitudinal Study of Sex-Related Item Bias in Mathematics Subtests of the California Achievement Test

《Applied Measurement in Education》2013,26(3):275-284

The purpose of this study was to investigate the mathematics components of an achievement battery for differential item performance between boys and girls. By employing a longitudinal study, changes in differential item performance from Grades 4 to 6 were explored, thereby avoiding confounding of age and cohort differences. Two statistical procedures were employed to investigate differential performance at both the test and item level (Spearman's rho and Camilli's chi-square). Items in two mathematics subtests, Mathematics Computation (MC) and Mathematics Concepts and Applications (MCA), did not appear to be a source of sex-related bias. Girls scored higher than boys on the MC subtest across grades. Boys scored higher than girls on the MCA subtest across grades. However, there were no skill classifications, ability levels, or item locations that favored one sex group consistently across grades.

5.

Introduction and training of students to use separate answer sheets: Effects on standardized test scores

Steven L. Wise Barbara S. Plake Leslie A. Eastman Carl D. Novak 《Psychology in the schools》1987,24(3):285-288

Previous research has found conflicting evidence regarding how early children can effectively use separate answer sheets with achievement tests. This study looked at the effects of separate answer sheets on the California Achievement Test (CAT) scores of third, fourth, and fifth graders. The Mathematics Computation and the Reading Comprehension subtests of the CAT were used. Seventy-one classrooms were randomly assigned to have students record their answers on either: (a) their test booklets, (b) separate answer sheets, or (c) separate answer sheets after being given training in the use of separate answer sheets. The results were consistent across both subtests and grades; no response mode treatment effect was found. Further, no evidence of a treatment by ability interaction was found, which was contrary to previously reported research. The results of this study suggest that students can, as early as grade three, effectively use separate answer sheets without prior training.

6.

A Comparison of Six Methods for Combining Multiple IRT Item Parameter Estimates

Robert L. McKinley 《Journal of Educational Measurement》1988,25(3):233-246

Six procedures for combining sets of IRT item parameter estimates obtained from different samples were evaluated using real and simulated response data. In the simulated data analyses, true item and person parameters were used to generate response data for three different-sized samples. Each sample was calibrated separately to obtain three sets of item parameter estimates for each item. The six procedures for combining multiple estimates were each applied, and the results were evaluated by comparing the true and estimated item characteristic curves. For the real data, the two best methods from the simulation data analyses were applied to three different-sized samples and the resulting estimated item characteristic curves were compared to the curves obtained when the three samples were combined and calibrated simultaneously. The results support the use of covariance matrix-weighted averaging and a procedure that involves sample-size-weighted averaging of estimated item characteristic curves at the center of the ability distribution.

7.

Comparisons of Methodologies and Results in Vertical Scaling for Educational Achievement Tests

Ye Tong Michael J. Kolen 《Applied Measurement in Education》2013,26(2):227-253


8.

Constructing a Universal Scale of High School Course Difficulty

Dina Bassiri E. Matthew Schulz 《Journal of Educational Measurement》2003,40(2):147-161

This study examined the usefulness of applying the Rasch rating scale model (Andrich, 1978) to high school grade data. ACT Assessment test scores (English, Mathematics, Reading, and Science Reasoning) were used as "common items" to adjust for different grading standards in individual high school courses both within and across schools. This scaling approach yielded an ACT Assessment-adjusted high school grade point average (AA-HSGPA) on a common scale across high schools and cohorts within a large public university. AA-HSGPA was a better predictor of first-year college grade point average (CGPA) than the regular high school grade point average. The best model for predicting CGPA included both the ACT composite score and AA-HSGPA.

9.

An Empirical Investigation of Thurstone and IRT Methods of Scaling Achievement Tests

Douglas F. Becker Robert A. Forsyth 《Journal of Educational Measurement》1992,29(4):341-354


10.

Employing teachers' ratings in selection of achievement tests in reading and mathematics with a behaviorally disturbed population

Irwin F. Altrows Stephen Maunula Brian D. Lalonde 《Psychology in the schools》1986,23(3):316-319

The usefulness of a particular standardized achievement test with a specific population may be determined largely on the basis of experience. Sixty-six behaviourally disturbed students were administered portions of a test battery including the Reading Recognition subtest of the Peabody Individual Achievement Test (PIAT), PIAT Reading Comprehension, the Reading subtest of the Wide Range Achievement Test (WRAT), and Stanford Diagnostic Reading Test (SDRT); PIAT Mathematics, WRAT Arithmetic, Stanford Diagnostic Mathematics Test (SDMT), and KeyMath Diagnostic Arithmetic Test. Toward the end of the academic year, teachers estimated students' grade levels in reading and mathematics. Results indicated that, in mathematics, the SDMT and the PIAT predicted teachers' ratings equally well and better than the other tests; in reading, all tests predicted teachers' ratings equally well except for the PIAT Reading Comprehension, which performed less well than others. Explanations for these results are offered, together with suggestions for identifying achievement tests suitable to specific populations.

11.

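A Study of Reading Comprehension Ability in the Test for English Majors, Band 4 (TEM-4)

Li Chuanyi (李传益) 《考试研究》 (Examinations Research) 2014,(4):35-40

Drawing on Grabe & Stoller's (2005) hierarchy of reading abilities and on the requirements that the Syllabus for the Test for English Majors, Band 4 (2004) sets for the English reading ability of foundation-stage English majors, this study examines the reading comprehension abilities reflected in the 2010 TEM4 reading comprehension items from three angles: ability classification, ability-level evaluation, and the relationship of item difficulty and discrimination to ability level. The results show that teachers disagreed markedly in their ability classifications and ability-level evaluations of the items, but that item difficulty and discrimination were positively correlated with item ability level.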

12.

The Performance of the Mantel-Haenszel Procedure Across Samples and Matching Criteria

Katherine E. Ryan 《Journal of Educational Measurement》1991,28(4):325-337

This study examined the reliability of the Mantel-Haenszel indexes across different samples of test takers as well as across sample sizes and investigated whether these indexes are robust to item context effects. Mathematics data from the Second International Mathematics Study (SIMS; 1985) for U.S. eighth-grade students were analyzed. The results suggest that the MH D-DIF is robust to item context effects. However, larger sample sizes than those used in this investigation (N = 141-167 for the focal group) may be necessary to obtain stable estimates from the Mantel-Haenszel procedure.

13.

Sex and grade differences and learning rate in an intensive summer reading clinic

Lorynne D. Cahn 《Psychology in the schools》1988,25(1):84-91

In a long-term study of student progress in the Loyola University of the South Summer Reading Clinic, patterns of variance for sex and grade level were examined. Three assessment tools were used: the Nelson Reading Test (Vocabulary and Paragraph Comprehension) for grades four through eight, the Gates-MacGinitie Reading Test (Vocabulary and Comprehension) for grades one through three, and the Spache Diagnostic Reading Scales (Instructional [oral] and Independent [silent] subtests) for all students. Subjects were 684 public and private school students in grades one through eight referred to the Clinic over an eight-year period. All were referred for possible reading disabilities. Because reading disabled males outnumber reading disabled females in the general population, the Clinic's data were examined to elucidate the comparative success rates of boys and girls in an intensive reading clinic setting. Grade differences also were examined to find significant differences in rate of learning among different grades. Females outscored males significantly on all measures. Both a difference in performance among grades and a difference in rate of learning among grades were shown.

14.

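A Brief Discussion of Constructing Vertical Scales

Liu Zhiming (刘志明) 《考试研究》 (Examinations Research) 2005,(2)

Equating and vertical scaling both serve to establish relationships between scores from different tests. Equating is applied to test forms of the same grade and the same nature, whereas vertical scaling is applied to forms from different grades that are similar in nature. Vertical scaling places results from different grades on a single growth score scale. A vertical scale is an extended score scale whose metric spans and connects different grades, and it is used to evaluate students' continuous growth in achievement (Nitko, 2004). In teaching, students' progress can be monitored and evaluated with a vertical scale. In educational research, a vertical scale can be a powerful tool for longitudinal studies. This article discusses the methodology of vertical scaling, including the definition of growth, data collection methods, test design, and methods that use item response theory, and it offers some practical suggestions for constructing vertical scales.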

15.

Applications of the Analytically Derived Asymptotic Standard Errors of Item Response Theory Item Parameter Estimates

Yuan H. Li Robert W. Lissitz 《Journal of Educational Measurement》2004,41(2):85-117

The analytically derived asymptotic standard errors (SEs) of maximum likelihood (ML) item estimates can be approximated by a mathematical function without examinees' responses to test items, and the empirically determined SEs of marginal maximum likelihood estimation (MMLE)/Bayesian item estimates can be obtained when the same set of items is repeatedly estimated from simulated (or resampled) test data. The latter method will result in rather stable and accurate SE estimates as the number of replications increases, but requires cumbersome and time-consuming calculations. Instead of using the empirically determined method, the adequacy of using the analytical-based method in predicting the SEs for item parameter estimates was examined by comparing results produced from both approaches. The results indicated that the SEs yielded by both approaches were, in most cases, very similar, especially when they were applied to a generalized partial credit model. This finding encourages test practitioners and researchers to apply the analytically derived asymptotic SEs of item estimates to the context of item-linking studies, as well as to the method of quantifying the SEs of equating scores for the item response theory (IRT) true-score method. Three-dimensional graphical presentation for the analytical SEs of item estimates as the bivariate function of item difficulty together with item discrimination was also provided for a better understanding of several frequently used IRT models.

16.

Comparison of Item Response Theory and Thurstone Methods of Vertical Scaling

Wendy M. Yen George R. Burket 《Journal of Educational Measurement》1997,34(4):293-313

Vertical achievement scales, which range from the lower elementary grades to high school, are used pervasively in educational assessment. Using simulated data modeled after real tests, the present article examines two procedures available for vertical scaling: a Thurstone method and three-parameter item response theory. Neither procedure produced artifactual scale shrinkage; both procedures produced modest scale expansion for one simulated condition.

17.

The concurrent validity of the Peabody Individual Achievement Test and Woodcock Reading Mastery Tests among students with mild learning problems

Ronald C. Eaves Craig Darch Maureen Haynes 《Psychology in the schools》1989,26(3):261-266

Previous research has revealed moderate to high validity coefficients between the Peabody Individual Achievement Test and the Woodcock Reading Mastery Tests. However, the same research has indicated rather consistently that the latter instrument provides significantly lower means than do several screening tests currently being used in the field. This investigation replicated and extended previous research by comparing the two instruments across three grade clusters in order to determine whether the lower Woodcock scores are equally robust for each level. As in prior research, validity coefficients were moderate to high in magnitude. However, the differences between the means of two instruments were found to decrease in size from earlier to later grades. That is, seven of the eight significantly different means were found to occur in grades 1 through 4. Only one significant difference was found for grades 5 through 8. Discussion sought to explain the results in terms of the representativeness of the Woodcock norms and the novel method that Woodcock used to estimate sample means and standard deviations.

18.

Measuring value added effects across schools: Should schools be compared in performance?

《Studies in Educational Evaluation》2005,31(2-3):247-266

This article traces the evolution of a quest for solving the problems involved in the analysis of multilevel data and the estimation of the value added effects of schools in influencing educational outcomes. The authors report the findings of two studies that followed several cohorts of students that were tested at two grade levels (Grade 3 and Grade 5) and at three grade levels (Grade 3, 5, and 7) respectively with basic skills tests of literacy and numeracy in a single school system. The tests were calibrated using Rasch scaling and equated using concurrent equating procedures for approximately 8000 students in 440 schools. Hierarchical linear modelling was employed in the analysis with different multilevel models used in the two studies to assess relative and absolute change in performance respectively. The findings of these studies show that with different regression models and different variables taken into consideration there are very different estimates of the variance between schools and under some circumstances the residual variance between schools is very small. Research is clearly needed into the procedures of analysis and the different value added effects that could and should be employed.

19.

Comparative investigation of the psychometric properties of three tests of logical thinking

Kapur S. Ahlawat Victor Y. Billeh 《Journal of Research in Science Teaching》1987,24(2):93-105

Using a sample of 908 eleventh grade science stream male and female students from similar socioeconomic area schools, variance based psychometric properties of three paper-and-pencil tests of logical thinking (Longeot test, Lawson's test TOFR, and Tobin and Capie's test TOLT) are investigated. A sub-sample of 212 students took the three tests in randomly allocated different sequential orders of presentation, while 696 students took only two tests. Alpha coefficients for each test separately and for the three tests combined, concurrent validity coefficients, measures of item difficulty, item discrimination, item-criterion correlation, and 30-day stability coefficients are calculated. Considering the relative homogeneity of the sample, the reliability coefficients of the tests are judged satisfactory, but concurrent validity coefficients are quite low, which implies incongruency in decisions made on the basis of the three tests. The need for estimating various psychometric parameters of alternative tests of logical thinking over different grade populations is emphasized.

20.

The Relationship Between Item Parameters and Item Fit

Hamzeh Dodeen 《Journal of Educational Measurement》2004,41(3):261-270

The effect of item parameters (discrimination, difficulty, and level of guessing) on the item-fit statistic was investigated using simulated dichotomous data. Nine tests were simulated using 1,000 persons, 50 items, three levels of item discrimination, three levels of item difficulty, and three levels of guessing. The item fit was estimated using two fit statistics: the likelihood ratio statistic (X²_B), and the standardized residuals (SRs). All the item parameters were simulated to be normally distributed. Results showed that the levels of item discrimination and guessing affected the item-fit values. As the level of item discrimination or guessing increased, item-fit values increased and more items misfit the model. The level of item difficulty did not affect the item-fit statistic.