Similar Articles
20 similar articles retrieved.
1.
2.
A developmental scale for the North Carolina End-of-Grade Mathematics Tests was created using a subset of identical test forms administered to adjacent grade levels. Thurstone scaling and item response theory (IRT) techniques were employed to analyze the changes in grade distributions across these linked forms. Three variations of Thurstone scaling were examined, one based on Thurstone's 1925 procedure and two based on Thurstone's 1938 procedure. The IRT scaling was implemented using both BiMain and Multilog. All methods indicated that average mathematics performance improved from Grade 3 to Grade 8, with similar results for the two IRT analyses and one version of Thurstone's 1938 method. The standard deviations of the IRT scales did not show a consistent pattern across grades, whereas those produced by Thurstone's 1925 procedure generally decreased; one version of the 1938 method exhibited slightly increasing variation with increasing grade level, while the other version displayed inconsistent trends.
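For readers unfamiliar with the Thurstone side of this comparison, the following is a minimal sketch (not the study's actual analysis) of the idea behind Thurstone's 1925 absolute scaling: within each grade, item proportions correct are converted to normal deviates, and a linear relation fitted to the deviates of common items places the upper grade's mean and standard deviation on the lower grade's scale. All item p-values below are invented.

    # Illustrative sketch of Thurstone (1925) absolute scaling; all p-values are hypothetical.
    import numpy as np
    from scipy.stats import norm

    # proportion correct on five common items in two adjacent grades
    p_g3 = np.array([0.35, 0.48, 0.55, 0.62, 0.71])
    p_g4 = np.array([0.52, 0.63, 0.70, 0.76, 0.83])

    # within-grade normal deviates for each item (larger = harder for that grade)
    z_g3 = norm.ppf(1 - p_g3)
    z_g4 = norm.ppf(1 - p_g4)

    # if grade 3 is fixed at mean 0 and SD 1, the fitted line z_g3 = A*z_g4 + B
    # gives grade 4 an SD of A and a mean of B on the grade 3 scale
    A, B = np.polyfit(z_g4, z_g3, deg=1)
    print("grade 4 mean:", round(B, 2), "grade 4 SD:", round(A, 2))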

3.
《Educational Assessment》2013,18(2):203-206
This rejoinder responds to the major statements and claims made in Clemans (this issue). The arbitrary and unrealistic assumptions underlying the Thurstone procedure are described. We point out the logical inconsistency of Clemans's claim that the relationship between raw scores and abilities holds when transforming abilities into raw scores but not when transforming raw scores into abilities. Two effects that Clemans claims are caused by item response theory (IRT) scaling are examined, and we demonstrate that they occur more often with Thurstone scaling than with IRT scaling. We reiterate our belief in the superiority of IRT scaling over Thurstone scaling.

4.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.
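Claim (a) above is easy to reproduce in a toy simulation. The sketch below is an assumption-laden illustration, not the authors' analysis: two grades are given the same uniform growth of 0.5 on a latent trait, and growth is then measured at several percentiles of a bounded, error-laden number-correct score. The item parameters and group means are invented.

    # Growth at selected percentiles on a latent trait vs. on a bounded number-correct score.
    import numpy as np

    rng = np.random.default_rng(0)
    n, n_items = 20000, 40
    a = rng.uniform(0.8, 1.6, n_items)       # 2PL discriminations (hypothetical)
    b = rng.uniform(-1.0, 2.0, n_items)      # difficulties (test aimed a bit high)

    def number_correct(theta):
        p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
        return (rng.random((theta.size, n_items)) < p).sum(axis=1)

    theta_lo = rng.normal(0.0, 1.0, n)       # lower grade
    theta_hi = rng.normal(0.5, 1.0, n)       # upper grade: uniform growth of 0.5

    for q in (10, 50, 90):
        latent_growth = np.percentile(theta_hi, q) - np.percentile(theta_lo, q)
        raw_growth = np.percentile(number_correct(theta_hi), q) - np.percentile(number_correct(theta_lo), q)
        print(q, round(latent_growth, 2), round(raw_growth, 1))
    # latent growth is about 0.5 at every percentile; raw-score growth differs by percentile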

5.
《Educational Assessment》2013,18(4):329-347
It is generally accepted that variability in performance will increase throughout Grades 1 to 12. Those with minimal knowledge of a domain should vary little, but, because learning rates differ, variability should increase as a function of growth. In this article, the series of reading tests from a widely used test battery for Grades 1 through 12 was singled out for study because the scale scores for the series have the opposite characteristic: variability is greatest at Grade 1 and decreases as growth proceeds. Item response theory (IRT) scaling was used; in previous editions, the publisher had used Thurstonian scaling and the variance increased with growth. Using data with known characteristics (i.e., weight distributions for ages 6 through 17), a comparison was made between the effectiveness of IRT and Thurstonian scaling procedures. The Thurstonian scaling more accurately reproduced the characteristics of the known distributions. Because IRT scaling was shown to improve when perfect scores were included in the analyses and when items were selected whose difficulties reflected the entire range of ability, these steps were recommended. However, even when these steps were implemented with IRT, the Thurstonian scaling was still found to be more accurate.

6.
《Applied Measurement in Education》2013,26(1):15-35
This study examines the effects of using item response theory (IRT) ability estimates based on customized tests that were formed by selecting specific content areas from a nationally standardized achievement test. Subsets of items were selected from four different subtests of the Iowa Tests of Basic Skills (Hieronymus, Hoover, & Lindquist, 1985) on the basis of (a) selected content areas (content-customized tests) and (b) a representative sampling of content areas (representative-customized tests). For three of the four tests examined, ability estimates and estimated national percentile ranks based on the content-customized tests in school samples tended to be systematically higher than those based on the full tests. The results of the study suggested that for certain populations, IRT ability estimates and corresponding normative scores on content-customized versions of standardized achievement tests cannot be expected to be equivalent to scores based on the full-length tests.

7.
Vertical achievement scales, which range from the lower elementary grades to high school, are used pervasively in educational assessment. Using simulated data modeled after real tests, the present article examines two procedures available for vertical scaling: a Thurstone method and three-parameter item response theory. Neither procedure produced artifactual scale shrinkage; both procedures produced modest scale expansion for one simulated condition.
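The three-parameter model referred to here is the standard 3PL item response function. A minimal implementation, with arbitrary example parameters, is:

    # 3PL item response function: a = discrimination, b = difficulty,
    # c = lower asymptote (pseudo-guessing), D = scaling constant
    # (1.7 approximates the normal ogive metric).
    import math

    def p_3pl(theta, a, b, c, D=1.7):
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    # e.g., probability of a correct response for theta = 0 on an item located
    # at b = 0.5 on the vertical scale (values are arbitrary)
    print(round(p_3pl(0.0, a=1.2, b=0.5, c=0.2), 3))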

8.
It has long been a part of psychometric lore that the variance of children's scores on cognitive tests increases with age. This increasing-variance phenomenon was first observed on Binet's intelligence measures in the early 1900s. An important detail in this matter is the fact that developmental scales based on age or grade have served as the medium for demonstrating the increasing-variance phenomenon. Recently, developmental scales based on item response theory (IRT) have shown constant or decreasing variance of measures of achievement with increasing age. This discrepancy is of practical and theoretical importance. Conclusions about the effects of variables on growth in achievement will depend on the metric chosen. In this study, growth in the mean of a latent educational achievement variable is assumed to be a negatively accelerated function of grade; within-grade variance is assumed to be constant across grade, and observed test scores are assumed to follow an IRT model. Under these assumptions, the variance of grade equivalent scores increases markedly. Perspective on this phenomenon is gained by examining longitudinal trends in centimeter and age equivalent measures of height.
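The mechanism described in this abstract can be illustrated with a small simulation: if the latent mean follows a negatively accelerated growth curve and the within-grade standard deviation is held constant, grade-equivalent scores obtained by inverting that curve spread out at the higher grades where the curve flattens. The curve and standard deviation below are invented, chosen only to show the direction of the effect.

    # Constant latent SD, negatively accelerated mean growth, and the resulting
    # increase in grade-equivalent SD. All numbers are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)

    def mean_growth(grade):
        return 2.5 * np.log(grade + 1.0)     # rapid early gains that level off

    fine_grid = np.linspace(0.5, 20, 4000)   # used to invert the growth curve

    for g in (2, 6, 11):
        theta = mean_growth(g) + rng.normal(0.0, 0.3, 50000)        # constant within-grade SD
        ge = np.interp(theta, mean_growth(fine_grid), fine_grid)    # grade equivalents
        print(g, round(np.std(theta), 2), round(np.std(ge), 2))
    # latent SD is 0.3 at every grade; grade-equivalent SD grows as the curve flattens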

9.
Rates of agreement among alternative definitions of reading disability and their 1- and 2-year stabilities were examined using a new measure of agreement, the affected-status agreement statistic. Participants were 288,114 first- through third-grade students. Reading measures were Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency and Nonsense Word Fluency, and six levels of severity of poor reading were examined (25th, 20th, 15th, 10th, 5th, and 3rd percentile ranks). Four definitions were compared, including traditional unexpected low achievement and three response-to-intervention-based definitions: low achievement, low growth, and dual discrepancy. Rates of agreement were variable but only poor to moderate overall, with the poorest agreement between unexpected low achievement and the other definitions. Longitudinal stability was poor, with the poorest stability for the low growth definition. Implications for research and practice are discussed.

10.
The National Assessment of Educational Progress (NAEP) uses item response theory (IRT)–based scaling methods to summarize the information in complex data sets. Scale scores are presented as tools for illuminating patterns in the data and for exploiting regularities across patterns of responses to tasks requiring similar skills. In this way, the dominant features of the data are captured. Discussed are the necessity of global scores or more detailed subscores, the creation of developmental scales spanning different age levels, and the use of scale anchoring as a way of interpreting the scales.

11.
This study examines the use of cross-classified random effects models (CCrem) and cross-classified multiple membership random effects models (CCMMrem) to model rater bias and estimate teacher effectiveness. Effect estimates are compared using classical test theory (CTT) versus item response theory (IRT) scaling methods and three models (i.e., conventional multilevel model, CCrem, CCMMrem). Results indicate that ignoring rater bias can lead to teachers being misclassified within an evaluation system. The best estimates of teacher effectiveness are produced using CCrems, regardless of scaling method. Use of CCMMrems to model rater bias cannot be recommended based on the results of this study; combining the use of CCMMrems with an IRT scaling method produced especially unstable results.

12.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum, and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who completed their instruction in the course from 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.

13.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.
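For reference, one common parameterization of the logistic positive exponent family raises a 2PL-type logistic curve to a positive power xi, which is what makes the item response function asymmetric. The sketch below uses arbitrary parameter values and is not the article's actual data-generating setup.

    # LPE-style item response function: with xi = 1 it reduces to the ordinary 2PL;
    # xi != 1 produces the asymmetry that 2PL/3PL models cannot reproduce exactly.
    import math

    def p_lpe(theta, a, b, xi):
        logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        return logistic ** xi

    for theta in (-1.0, 0.0, 1.0):
        print(theta,
              round(p_lpe(theta, a=1.0, b=0.0, xi=1.0), 3),   # symmetric (2PL)
              round(p_lpe(theta, a=1.0, b=0.0, xi=2.0), 3))   # asymmetric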

14.
《Africa Education Review》2013,10(2):356-375
The aim of this article is to report on a study conducted to assess the effect of an intervention programme to improve self-regulated learning (SRL) and the achievement of a group of poorly performing undergraduate students at the Tshwane University of Technology. SRL was used as the theoretical framework. The case study reports on 20 Engineering students who attended learning skills intervention sessions and completed a college version of the Learning and Study Strategies Inventory (LASSI) as a pre-test and post-test. The intervention consisted of 12 workshop sessions presented over a period of three months. The LASSI pre-test showed that the group scored below the 50th percentile on four scales (anxiety, attitude, selecting main ideas, and test-taking strategies). Improvements in post-test scores were statistically significant for seven of the ten LASSI scales. The students' academic achievement also improved. The findings are important for improving student success and throughput in South African higher education.

15.
Item response theory (IRT) methods are generally used to create score scales for large-scale tests. Research has shown that IRT scales are stable across groups and over time. Most studies have focused on items that are dichotomously scored. Now Rasch and other IRT models are used to create scales for tests that include polytomously scored items. When tests are equated across forms, researchers check for the stability of common items before including them in equating procedures. Stability is usually examined in relation to polytomous items' central "location" on the scale without taking into account the stability of the different item scores (step difficulties). We examined the stability of score scales over a 3–5-year period, considering both stability of location values and stability of step difficulties for common item equating. We also investigated possible changes in the scale measured by the tests and systematic scale drift that might not be evident in year-to-year equating. Results across grades and content areas suggest that equating results are comparable whether or not the stability of step difficulties is taken into account. Results also suggest that there may be systematic scale drift that is not visible using year-to-year common item equating.
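The distinction between an item's overall location and its step difficulties can be made concrete with a toy example; here location is taken as the mean of the step difficulties, a common summary. The values are invented and this is not the study's data.

    # An item whose location looks stable across years even though its
    # individual step difficulties have drifted. Values are hypothetical.
    import numpy as np

    steps_year1 = np.array([-1.2, 0.1, 1.1])    # step difficulties, base year
    steps_year2 = np.array([-1.6, 0.3, 1.3])    # same item after re-calibration

    print("location drift:", round(steps_year2.mean() - steps_year1.mean(), 3))   # ~0
    print("step drift:    ", np.round(steps_year2 - steps_year1, 3))              # clearly nonzero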

16.
17.
This study describes gender differences in achievement in the academic areas of mathematics problem solving and science. Standardized achievement test scores for a sample of 3002 students who were tested in each of ten consecutive years, grades 3 through 12, were used to assess the differences in mathematics, while matched data for grades 9 through 12 were available in science. Selected percentiles (90th, 75th, 50th, 25th, and 10th) were estimated for both female and male score distributions for each area. The results indicate fairly consistent patterns of differences. Males generally performed better at the upper percentile levels of the score distributions in mathematics problem solving and science, while females closed the gap and, in some instances, outperformed males at the lower percentile levels.

18.
Item response theory (IRT) models can be subsumed under the larger class of statistical models with latent variables. IRT models are increasingly used for the scaling of responses derived from standardized assessments of competencies. The paper summarizes the strengths of IRT in contrast to more traditional techniques as well as in contrast to alternative models with latent variables (e.g., structural equation modeling). Subsequently, specific limitations of IRT and cases where other methods might be preferable are outlined.

19.
《Educational Assessment》2013,18(2):181-190
Clemans (1993) argued that the use of item response theory (IRT) to vertically scale Form E of the California Achievement Tests produces inappropriate results. In this response we show that (a) Clemans's analysis of school district data was incomplete, inconsistent, and did not follow good measurement practice; (b) the simulation he conducted was unfairly stacked against IRT, was unrealistic, and ignored other realistic published simulations that demonstrated the accuracy of IRT scaling procedures; (c) his "common sense" evaluations of student performance ignored basic facts about the measurement of student achievement; and (d) the concerns expressed in his article were irrelevant to the vast majority of uses of norm-referenced tests.

20.
《Educational Assessment》2013,18(2):191-202
In their response to my article, "Item Response Theory, Vertical Scaling, and Something's Awry in the State of Test Mark," Yen, Burket, and Fitzpatrick (this issue) question the validity of my field observations. I present evidence that validates those observations. They claim that my simulation was unrealistic; I present evidence (convincing, I believe) that they are simply misinformed. They argue that Thurstone scaling has several weaknesses; I present information that should enable them to understand the procedure better and that reveals that the supposed weaknesses do not, in fact, exist. They say they are very "up front" about not being able to measure students at the extremes accurately but claim that the vast majority of students are assessed well, thus implying that my use of data for students at the 2nd and 98th percentiles led to conclusions that would not be found if other segments of the score distribution were examined. I duplicated the analyses at the 15th and 85th percentile points and demonstrated that they were wrong. Yen et al. seem to be convinced that the variance of performance decreases (they use the term "homogenization") as learning progresses. Using their published data for 7 on-grade tests administered at the beginning and end of each school year, when the same on-grade test form was used (thus eliminating any confounding introduced by scaling), I show that in 67 of 77 instances the variance increased. This should serve as convincing evidence to the most doubtful person that the variance of performance increases as learning progresses. Given that there is a serious problem, as clearly illustrated in Figure 2, I suggest some avenues that research could take to address it.
