Similar Articles
20 similar articles retrieved.
1.
2.
A developmental scale for the North Carolina End-of-Grade Mathematics Tests was created using a subset of identical test forms administered to adjacent grade levels. Thurstone scaling and item response theory (IRT) techniques were employed to analyze the changes in grade distributions across these linked forms. Three variations of Thurstone scaling were examined, one based on Thurstone's 1925 procedure and two based on Thurstone's 1938 procedure. The IRT scaling was implemented using both BiMain and Multilog. All methods indicated that average mathematics performance improved from Grade 3 to Grade 8, with similar results for the two IRT analyses and one version of Thurstone's 1938 method. The standard deviations of the IRT scales did not show a consistent pattern across grades, whereas those produced by Thurstone's 1925 procedure generally decreased; one version of the 1938 method exhibited slightly increasing variation with increasing grade level, while the other version displayed inconsistent trends.
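For readers unfamiliar with the Thurstone side of this comparison, the following is a minimal sketch (not the study's actual analysis) of the idea behind Thurstone's 1925 absolute scaling: within each grade, item proportions correct are converted to normal deviates, and a linear relation fitted to the deviates of common items places the upper grade's mean and standard deviation on the lower grade's scale. All item p-values below are invented.

    # Illustrative sketch of Thurstone (1925) absolute scaling; all p-values are hypothetical.
    import numpy as np
    from scipy.stats import norm

    # proportion correct on five common items in two adjacent grades
    p_g3 = np.array([0.35, 0.48, 0.55, 0.62, 0.71])
    p_g4 = np.array([0.52, 0.63, 0.70, 0.76, 0.83])

    # within-grade normal deviates for each item (larger = harder for that grade)
    z_g3 = norm.ppf(1 - p_g3)
    z_g4 = norm.ppf(1 - p_g4)

    # if grade 3 is fixed at mean 0 and SD 1, the fitted line z_g3 = A*z_g4 + B
    # gives grade 4 an SD of A and a mean of B on the grade 3 scale
    A, B = np.polyfit(z_g4, z_g3, deg=1)
    print("grade 4 mean:", round(B, 2), "grade 4 SD:", round(A, 2))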

3.
《Educational Assessment》2013,18(2):203-206
This rejoinder responds to the major statements and claims made in Clemans (this issue). The arbitrary and unrealistic assumptions underlying the Thurstone procedure are described. We point out the logical inconsistency of Clemans's claim that the relationship between raw scores and abilities holds when transforming abilities into raw scores but not when transforming raw scores into abilities. Two effects that Clemans claims are caused by item response theory (IRT) scaling are examined, and we demonstrate that they occur more often with Thurstone scaling than with IRT scaling. We reiterate our belief in the superiority of IRT scaling over Thurstone scaling.

4.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.
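Claim (a) above is easy to reproduce in a toy simulation. The sketch below is an assumption-laden illustration, not the authors' analysis: two grades are given the same uniform growth of 0.5 on a latent trait, and growth is then measured at several percentiles of a bounded, error-laden number-correct score. The item parameters and group means are invented.

    # Growth at selected percentiles on a latent trait vs. on a bounded number-correct score.
    import numpy as np

    rng = np.random.default_rng(0)
    n, n_items = 20000, 40
    a = rng.uniform(0.8, 1.6, n_items)       # 2PL discriminations (hypothetical)
    b = rng.uniform(-1.0, 2.0, n_items)      # difficulties (test aimed a bit high)

    def number_correct(theta):
        p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
        return (rng.random((theta.size, n_items)) < p).sum(axis=1)

    theta_lo = rng.normal(0.0, 1.0, n)       # lower grade
    theta_hi = rng.normal(0.5, 1.0, n)       # upper grade: uniform growth of 0.5

    for q in (10, 50, 90):
        latent_growth = np.percentile(theta_hi, q) - np.percentile(theta_lo, q)
        raw_growth = np.percentile(number_correct(theta_hi), q) - np.percentile(number_correct(theta_lo), q)
        print(q, round(latent_growth, 2), round(raw_growth, 1))
    # latent growth is about 0.5 at every percentile; raw-score growth differs by percentile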

5.
《Educational Assessment》2013,18(4):329-347
It is generally accepted that variability in performance will increase throughout Grades 1 to 12. Those with minimal knowledge of a domain should vary little, but, because learning rates differ, variability should increase as a function of growth. In this article, the series of reading tests from a widely used test battery for Grades 1 through 12 was singled out for study because the scale scores for the series have the opposite characteristic: variability is greatest at Grade 1 and decreases as growth proceeds. Item response theory (IRT) scaling was used; in previous editions, the publisher had used Thurstonian scaling and the variance increased with growth. Using data with known characteristics (i.e., weight distributions for ages 6 through 17), a comparison was made between the effectiveness of IRT and Thurstonian scaling procedures. The Thurstonian scaling more accurately reproduced the characteristics of the known distributions. Because IRT scaling was shown to improve when perfect scores were included in the analyses and when items were selected whose difficulties reflected the entire range of ability, these steps were recommended. However, even when these steps were implemented with IRT, the Thurstonian scaling was still found to be more accurate.

6.
《Applied Measurement in Education》2013,26(1):15-35
This study examines the effects of using item response theory (IRT) ability estimates based on customized tests that were formed by selecting specific content areas from a nationally standardized achievement test. Subsets of items were selected from four different subtests of the Iowa Tests of Basic Skills (Hieronymus, Hoover, & Lindquist, 1985) on the basis of (a) selected content areas (content-customized tests) and (b) a representative sampling of content areas (representative-customized tests). For three of the four tests examined, ability estimates and estimated national percentile ranks based on the content-customized tests in school samples tended to be systematically higher than those based on the full tests. The results of the study suggested that for certain populations, IRT ability estimates and corresponding normative scores on content-customized versions of standardized achievement tests cannot be expected to be equivalent to scores based on the full-length tests.

7.
Vertical achievement scales, which range from the lower elementary grades to high school, are used pervasively in educational assessment. Using simulated data modeled after real tests, the present article examines two procedures available for vertical scaling: a Thurstone method and three-parameter item response theory. Neither procedure produced artifactual scale shrinkage; both procedures produced modest scale expansion for one simulated condition.
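The three-parameter model referred to here is the standard 3PL item response function. A minimal implementation, with arbitrary example parameters, is:

    # 3PL item response function: a = discrimination, b = difficulty,
    # c = lower asymptote (pseudo-guessing), D = scaling constant
    # (1.7 approximates the normal ogive metric).
    import math

    def p_3pl(theta, a, b, c, D=1.7):
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    # e.g., probability of a correct response for theta = 0 on an item located
    # at b = 0.5 on the vertical scale (values are arbitrary)
    print(round(p_3pl(0.0, a=1.2, b=0.5, c=0.2), 3))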

8.
It has long been a part of psychometric lore that the variance of children's scores on cognitive tests increases with age. This increasing-variance phenomenon was first observed on Binet's intelligence measures in the early 1900s. An important detail in this matter is the fact that developmental scales based on age or grade have served as the medium for demonstrating the increasing-variance phenomenon. Recently, developmental scales based on item response theory (IRT) have shown constant or decreasing variance of measures of achievement with increasing age. This discrepancy is of practical and theoretical importance. Conclusions about the effects of variables on growth in achievement will depend on the metric chosen. In this study, growth in the mean of a latent educational achievement variable is assumed to be a negatively accelerated function of grade; within-grade variance is assumed to be constant across grade, and observed test scores are assumed to follow an IRT model. Under these assumptions, the variance of grade equivalent scores increases markedly. Perspective on this phenomenon is gained by examining longitudinal trends in centimeter and age equivalent measures of height.
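The mechanism described in this abstract can be illustrated with a small simulation: if the latent mean follows a negatively accelerated growth curve and the within-grade standard deviation is held constant, grade-equivalent scores obtained by inverting that curve spread out at the higher grades where the curve flattens. The curve and standard deviation below are invented, chosen only to show the direction of the effect.

    # Constant latent SD, negatively accelerated mean growth, and the resulting
    # increase in grade-equivalent SD. All numbers are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)

    def mean_growth(grade):
        return 2.5 * np.log(grade + 1.0)     # rapid early gains that level off

    fine_grid = np.linspace(0.5, 20, 4000)   # used to invert the growth curve

    for g in (2, 6, 11):
        theta = mean_growth(g) + rng.normal(0.0, 0.3, 50000)        # constant within-grade SD
        ge = np.interp(theta, mean_growth(fine_grid), fine_grid)    # grade equivalents
        print(g, round(np.std(theta), 2), round(np.std(ge), 2))
    # latent SD is 0.3 at every grade; grade-equivalent SD grows as the curve flattens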

9.
Rates of agreement among alternative definitions of reading disability and their 1- and 2-year stabilities were examined using a new measure of agreement, the affected-status agreement statistic. Participants were 288,114 first- through third-grade students. Reading measures were Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency and Nonsense Word Fluency, and six levels of severity of poor reading were examined (25th, 20th, 15th, 10th, 5th, and 3rd percentile ranks). Four definitions were compared, including traditional unexpected low achievement and three response-to-intervention-based definitions: low achievement, low growth, and dual discrepancy. Rates of agreement were variable but only poor to moderate overall, with the poorest agreement between unexpected low achievement and the other definitions. Longitudinal stability was poor, with the poorest stability for the low growth definition. Implications for research and practice are discussed.

10.
The National Assessment of Educational Progress (NAEP) uses item response theory (IRT)–based scaling methods to summarize the information in complex data sets. Scale scores are presented as tools for illuminating patterns in the data and for exploiting regularities across patterns of responses to tasks requiring similar skills. In this way, the dominant features of the data are captured. Discussed are the necessity of global scores or more detailed subscores, the creation of developmental scales spanning different age levels, and the use of scale anchoring as a way of interpreting the scales.

11.
This study examines the use of cross-classified random effects models (CCrem) and cross-classified multiple membership random effects models (CCMMrem) to model rater bias and estimate teacher effectiveness. Effect estimates are compared using classical test theory (CTT) versus item response theory (IRT) scaling methods and three models (i.e., conventional multilevel model, CCrem, CCMMrem). Results indicate that ignoring rater bias can lead to teachers being misclassified within an evaluation system. The best estimates of teacher effectiveness are produced using CCrems, regardless of scaling method. Use of CCMMrems to model rater bias cannot be recommended based on the results of this study; combining the use of CCMMrems with an IRT scaling method produced especially unstable results.

12.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum, and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who completed their instruction in the course from 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.

13.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.
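For reference, one common parameterization of the logistic positive exponent family raises a 2PL-type logistic curve to a positive power xi, which is what makes the item response function asymmetric. The sketch below uses arbitrary parameter values and is not the article's actual data-generating setup.

    # LPE-style item response function: with xi = 1 it reduces to the ordinary 2PL;
    # xi != 1 produces the asymmetry that 2PL/3PL models cannot reproduce exactly.
    import math

    def p_lpe(theta, a, b, xi):
        logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        return logistic ** xi

    for theta in (-1.0, 0.0, 1.0):
        print(theta,
              round(p_lpe(theta, a=1.0, b=0.0, xi=1.0), 3),   # symmetric (2PL)
              round(p_lpe(theta, a=1.0, b=0.0, xi=2.0), 3))   # asymmetric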

14.
《Africa Education Review》2013,10(2):356-375
The aim of this article is to report on a study conducted to assess the effect of an intervention programme to improve self-regulated learning (SRL) and the achievement of a group of poorly performing undergraduate students at the Tshwane University of Technology. SRL was used as the theoretical framework. The case study reports on 20 Engineering students who attended learning skills intervention sessions and completed a college version of the Learning and Study Strategies Inventory (LASSI) as a pre-test and post-test. The intervention consisted of 12 workshop sessions presented over a period of three months. The LASSI pre-test showed that the group scored below the 50th percentile on four scales (anxiety, attitude, selecting main ideas, and test-taking strategies). Improvements in post-test scores were statistically significant for seven of the ten LASSI scales. The students' academic achievement also improved. The findings are important for improving student success and throughput in South African higher education.

15.
Item response theory (IRT) methods are generally used to create score scales for large-scale tests. Research has shown that IRT scales are stable across groups and over time. Most studies have focused on items that are dichotomously scored. Now Rasch and other IRT models are used to create scales for tests that include polytomously scored items. When tests are equated across forms, researchers check for the stability of common items before including them in equating procedures. Stability is usually examined in relation to polytomous items' central "location" on the scale without taking into account the stability of the different item scores (step difficulties). We examined the stability of score scales over a 3–5-year period, considering both stability of location values and stability of step difficulties for common item equating. We also investigated possible changes in the scale measured by the tests and systematic scale drift that might not be evident in year-to-year equating. Results across grades and content areas suggest that equating results are comparable whether or not the stability of step difficulties is taken into account. Results also suggest that there may be systematic scale drift that is not visible using year-to-year common item equating.
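The distinction between an item's overall location and its step difficulties can be made concrete with a toy example; here location is taken as the mean of the step difficulties, a common summary. The values are invented and this is not the study's data.

    # An item whose location looks stable across years even though its
    # individual step difficulties have drifted. Values are hypothetical.
    import numpy as np

    steps_year1 = np.array([-1.2, 0.1, 1.1])    # step difficulties, base year
    steps_year2 = np.array([-1.6, 0.3, 1.3])    # same item after re-calibration

    print("location drift:", round(steps_year2.mean() - steps_year1.mean(), 3))   # ~0
    print("step drift:    ", np.round(steps_year2 - steps_year1, 3))              # clearly nonzero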

16.
17.
This study describes gender differences in achievement in the academic areas of mathematics problem solving and science. Standardized achievement test scores for a sample of 3002 students who were tested in each of ten consecutive years, grades 3 through 12, were used to assess the differences in mathematics, while matched data for grades 9 through 12 were available in science. Selected percentiles (90th, 75th, 50th, 25th, and 10th) were estimated for both female and male score distributions for each area. The results indicate fairly consistent patterns of differences. Males generally performed better at the upper percentile levels of the score distributions in mathematics problem solving and science, while females closed the gap and, in some instances, outperformed males at the lower percentile levels.

18.
Item response theory (IRT) models can be subsumed under the larger class of statistical models with latent variables. IRT models are increasingly used for the scaling of responses derived from standardized assessments of competencies. The paper summarizes the strengths of IRT in contrast to more traditional techniques as well as in contrast to alternative models with latent variables (e.g., structural equation modeling). Subsequently, specific limitations of IRT and cases where other methods might be preferable are outlined.

19.
《Educational Assessment》2013,18(2):181-190
Clemans (1993) argued that the use of item response theory (IRT) to vertically scale Form E of the California Achievement Tests produces inappropriate results. In this response we show that (a) Clemans's analysis of school district data was incomplete, inconsistent, and did not follow good measurement practice; (b) the simulation he conducted was unfairly stacked against IRT, was unrealistic, and ignored other realistic published simulations that demonstrated the accuracy of IRT scaling procedures; (c) his "common sense" evaluations of student performance ignored basic facts about the measurement of student achievement; and (d) the concerns expressed in his article were irrelevant to the vast majority of uses of norm-referenced tests.

20.
《Educational Assessment》2013,18(2):191-202
In their response to my article, "Item Response Theory, Vertical Scaling, and Something's Awry in the State of Test Mark," Yen, Burket, and Fitzpatrick (this issue) question the validity of my field observations. I present evidence that validates those observations. They claim that my simulation was unrealistic; I present evidence (convincing, I believe) that they are simply misinformed. They argue that Thurstone scaling has several weaknesses; I present information that should enable them to understand the procedure better and that reveals that the supposed weaknesses do not, in fact, exist. They say they are very "up front" about not being able to measure students at the extremes accurately but claim that the vast majority of students are assessed well, thus implying that my use of data for students at the 2nd and 98th percentiles led to conclusions that would not be found if other segments of the score distribution were examined. I duplicated the analyses at the 15th and 85th percentile points and demonstrated that they were wrong. Yen et al. seem to be convinced that the variance of performance decreases (they use the term "homogenization") as learning progresses. Using their published data for 7 on-grade tests administered at the beginning and end of each school year, when the same on-grade test form was used (thus eliminating any confounding introduced by scaling), I show that in 67 of 77 instances the variance increased. This should serve as convincing evidence to the most doubtful person that the variance of performance increases as learning progresses. Given that there is a serious problem, as clearly illustrated in Figure 2, I suggest some avenues that research could take to address it.
