相似文献 (Similar Articles: 20 results)
1.
This study illustrates how generalizability theory can be used to evaluate the dependability of school-level scores in situations where test forms have been matrix sampled within schools, and to estimate the minimum number of forms required to achieve acceptable levels of score reliability. Data from a statewide performance assessment in reading, writing, and language usage were analyzed in a series of generalizability studies using a person: (school x form) design that provided variance component estimates for four sources: school, form, school x form, and person: (school x form). Six separate scores were examined. The results of the generalizability studies were then used in decision studies to determine the impact on score reliability when the number of forms administered within schools was varied. Results from the decision studies indicated that score generalizability could be improved when the number of forms administered within schools was increased from one to three forms, but that gains in generalizability were small when the number of forms was increased beyond three. The implications of these results for planning large-scale performance assessments are discussed.
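As a rough illustration of the decision-study computation described above, the following sketch projects school-level coefficients for a person:(school x form) design as the number of forms per school grows. The variance components and sample sizes are hypothetical placeholders, not the study's estimates.

```python
# Hypothetical D-study sketch for a person:(school x form) design.
# The variance components below are illustrative, not the study's estimates.
def school_coefficients(v_s, v_f, v_sf, v_psf, n_f, n_p):
    """Return (generalizability, dependability) coefficients for school means
    based on n_f forms per school and n_p persons per school-form cell."""
    rel_err = v_sf / n_f + v_psf / (n_f * n_p)   # relative error variance
    abs_err = v_f / n_f + rel_err                # absolute error adds the form main effect
    return v_s / (v_s + rel_err), v_s / (v_s + abs_err)

if __name__ == "__main__":
    for n_f in (1, 2, 3, 4, 5):
        g, phi = school_coefficients(v_s=0.20, v_f=0.05, v_sf=0.10,
                                     v_psf=1.00, n_f=n_f, n_p=25)
        print(f"forms={n_f}  E-rho2={g:.3f}  Phi={phi:.3f}")
```

With these illustrative components, the coefficients rise sharply from one to three forms and only marginally thereafter, mirroring the diminishing returns the study reports.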

2.
The QUASAR Cognitive Assessment Instrument (QCAI) is designed to measure program outcomes and growth in mathematics. It consists of a relatively large set of open-ended tasks that assess mathematical problem solving, reasoning, and communication at the middle-school grade levels. This study provides some evidence for the generalizability and validity of the assessment. The results from the generalizability studies indicate that the error due to raters is minimal, whereas there is considerable differential student performance across tasks. The dependability of grade level scores for absolute decision making is encouraging; when the number of students is equal to 350, the coefficients are between .80 and .97 depending on the form and grade level. As expected, there tended to be a higher relationship between the QCAI scores and both the problem solving and conceptual subtest scores from a mathematics achievement multiple-choice test than between the QCAI scores and the mathematics computation subtest scores.

3.
《教育实用测度》2013,26(2):191-203
Generalizability theory provides a conceptual and statistical framework for estimating variance components and measurement precision. The theory has been widely used in evaluating technical qualities of performance assessments. However, estimates of variance components, measurement error variances, and generalizability coefficients are likely to vary from one sample to another. This study empirically investigates sampling variability of estimated variance components using data collected in several years for a listening and writing performance assessment. This study also evaluates stability of estimated measurement precision from year to year. The results indicated that the estimated variance components varied from one study to another, especially when sample sizes were small. The estimated measurement error variances and generalizability coefficients also changed from one year to another. Measurement precision projected by a generalizability study may not be fully realized in an actual decision study. The study points out the importance of examining variability of estimated variance components and related statistics in performance assessments.

4.
This study evaluated the reliability and validity of a performance assessment designed to measure students' thinking and reasoning skills in mathematics. The QUASAR Cognitive Assessment Instrument (QCAI) was administered to over 1,700 sixth- and seventh-grade students of various ethnic backgrounds in six schools participating in the QUASAR project. The consistency of students' responses across tasks and the validity of inferences drawn from the scores on the assessment to the more broadly defined construct domain were examined. The intertask consistency and the dimensionality of the assessment were assessed through the use of polychoric correlations and confirmatory factor analysis, and the generalizability of the derived scores was examined through the use of generalizability theory. The results from the confirmatory factor analysis indicate that a one-factor model fits the data for each of the four QCAI forms. The major findings from the generalizability studies (person x task and person x rater x task) indicate that, for each of the four forms, the person x task variance component accounts for the largest percentage of the total variability, and that the percentage of variance accounted for by the variance components that include the rater effect is negligible. The generalizability and dependability coefficients for the person x task decision studies (nt = 9) range from .71 to .84. These results indicate that the use of nine tasks may not be adequate for generalizing to the larger domain of mathematics for individual student-level scores. The QUASAR project, however, is interested in assessing mathematics achievement at the program level, not the student level; therefore, these coefficients are not alarmingly low.
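The coefficients reported for a person x task decision study follow directly from the estimated variance components. A minimal sketch of that computation, using hypothetical components rather than the QCAI estimates:

```python
# Generalizability and dependability coefficients for a person x task design
# with n_t tasks. Variance components are hypothetical, not the QCAI values.
def pxt_coefficients(v_p, v_t, v_pt, n_t):
    g = v_p / (v_p + v_pt / n_t)            # relative (generalizability)
    phi = v_p / (v_p + (v_t + v_pt) / n_t)  # absolute (dependability)
    return g, phi

g, phi = pxt_coefficients(v_p=0.30, v_t=0.05, v_pt=1.00, n_t=9)
print(f"E-rho2={g:.2f}  Phi={phi:.2f}")
```

A large person x task component relative to the person component, as the study found, caps how high these coefficients can climb with only nine tasks.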

5.
The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests.

6.
This article treats various procedures for examining the reliability of group mean difference scores, with particular emphasis on procedures from univariate and multivariate generalizability theory. Attention is given to both traditional norm-referenced perspectives on reliability as well as criterion-referenced perspectives that focus on error-tolerance ratios and functions of them. The procedures discussed are illustrated using three cohorts of data for third- and fourth-grade students in Iowa who took the Iowa Tests of Basic Skills in recent years. For these data, estimates of reliability for norm-referenced decisions tend to be relatively low. By contrast, for criterion-referenced decisions, estimates of reliability-like coefficients based on error-tolerance ratios tend to be noticeably larger.

7.
Implementing the idea that more emphasis should be placed on student achievement in the affective domain is contingent upon the concurrent development of suitable instruments for the assessment of prescribed criteria. One such instrument, the Schwirian Science Support Scale (Tri-S scale), was reported in a recent NSTA publication as a promising tool for measuring student science support. Recent research using the Tri-S scale with high school pupils showed that scores on this instrument did not increase after the students had taken a tenth-grade introductory course in biology. Further analysis indicated that students of teachers scoring “high” in science support did not produce higher scores on the Tri-S scale than students studying biology under teachers “low” in science support. Reliability estimates using high school student scores were well below previous estimates based on scores from college undergraduates. Factor analysis of inter-item correlations indicated that student interpretation of item meaning did not correspond to the five-subtest structure of the Tri-S scale. Findings from this study demonstrate that the Tri-S scale is not an appropriate instrument for measuring attitudinal changes of tenth-grade high school students. The study suggests that current and future instruments that purport to measure achievement in noncognitive areas should be carefully analyzed before they are recommended for use with specific populations.

8.
The purpose of this study was to investigate methods of estimating the reliability of school-level scores using generalizability theory and multilevel models. Two approaches, ‘students within schools’ and ‘students within schools and subject areas,’ were conceptualized and implemented in this study. Four methods resulting from the combination of these two approaches with generalizability theory and multilevel models were compared for both balanced and unbalanced data. The generalizability theory and multilevel models for the ‘students within schools’ approach produced the same variance components and reliability estimates for the balanced data, but not for the unbalanced data. The different results from the two models can be explained by the fact that they employ different procedures to estimate the variance components used, in turn, to estimate reliability. Among the estimation methods investigated in this study, the generalizability theory model with the ‘students nested within schools crossed with subject areas’ design produced the lowest reliability estimates. Fully nested designs such as (students:schools) or (subject areas:students:schools) did not have any significant impact on reliability estimates of school-level scores; both methods provide very similar reliability estimates.
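For the balanced ‘students within schools’ case, the variance components underlying school-level reliability can be estimated with a one-way random-effects ANOVA. A self-contained sketch on simulated data; the variance components (4.0 between schools, 25.0 within) and sample sizes are hypothetical:

```python
import numpy as np

# Method-of-moments (one-way random-effects ANOVA) estimate of school-level
# score reliability for balanced 'students within schools' data.
def school_reliability(scores):
    """scores: (n_schools, n_students) array of balanced data."""
    n_s, n = scores.shape
    school_means = scores.mean(axis=1)
    grand = scores.mean()
    ms_between = n * ((school_means - grand) ** 2).sum() / (n_s - 1)
    ms_within = ((scores - school_means[:, None]) ** 2).sum() / (n_s * (n - 1))
    var_school = max((ms_between - ms_within) / n, 0.0)
    var_student = ms_within
    return var_school / (var_school + var_student / n)

rng = np.random.default_rng(42)
true_school, true_student = 4.0, 25.0                     # hypothetical components
data = (rng.normal(0, true_school ** 0.5, (100, 1))       # school effects
        + rng.normal(0, true_student ** 0.5, (100, 50)))  # student residuals
print(f"estimated school-mean reliability: {school_reliability(data):.3f}")
```

With 50 students per school the theoretical value here is 4.0 / (4.0 + 25.0/50) ≈ .89, and the ANOVA estimate lands near it for balanced data; it is the unbalanced case where, as the study notes, methods begin to diverge.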

9.
《教育实用测度》2013,26(4):301-309
The relevance of test content to practice is essential for credentialing examinations, and one way to ensure it is to collect ratings of item relevance from job incumbents. This study analyzed ratings of the 132 single-best-answer items and 117 multiple true-false item sets that formed the pretest books in a single administration of a medical certifying examination. Ratings collected from 57 practitioners were high (an average of more than 4 on a 5-point scale) and correlated with item difficulty (r = .31 to .34). The relationship between ratings and item discrimination is less clear (r = -.04 to .31). Application of generalizability theory to the ratings shows that reasonable estimates of item, stem, and total test relevance can be obtained with about 10 raters.

10.
Person reliability parameters (PRPs) model temporary changes in individuals’ attribute level perceptions when responding to self‐report items (higher levels of PRPs represent less fluctuation). PRPs could be useful in measuring careless responding and traitedness. However, it is unclear how well current procedures for estimating PRPs can recover parameter estimates. This study assesses these procedures in terms of mean error (ME), average absolute difference (AAD), and reliability using simulated data with known values. Several prior distributions for PRPs were compared across a number of conditions. Overall, our results revealed little difference between using the χ and lognormal distributions as priors for estimated PRPs. Both distributions produced estimates with reasonable levels of ME; however, the AAD of the estimates was high. AAD did improve slightly as the number of items increased, suggesting that increasing the number of items would ameliorate this problem. Similarly, a larger number of items was necessary to produce reasonable levels of reliability. Based on our results, several conclusions are drawn and implications for future research are discussed.

11.
In educational assessment, overall scores obtained by simply averaging a number of domain scores are sometimes reported. However, simply averaging the domain scores ignores the fact that different domains have different score points, that scores from those domains are related, and that at different score points the relationship between overall score and domain score may be different. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulation data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which simultaneously estimates the overall ability and domain abilities; (c) the multidimensional IRT (MIRT) model, which estimates domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model not only provides reliable domain scores, but also produces reliable overall scores. The overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, there is no linear relationship assumed between overall score and domain scores. Recommendations for sizes of correlations between domains and the number of items needed for reporting purposes are provided.
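One plausible reading of an information-based composite, sketched below, weights each domain ability estimate by its score information (the inverse squared standard error); this is an illustrative stand-in, not necessarily the exact maximum information method the study used, and the domain estimates and SEs are hypothetical.

```python
# Sketch of an information-weighted overall score: each domain ability is
# weighted by its information (inverse squared standard error).
# Domain estimates and SEs below are hypothetical, not from the study.
def overall_score(domain_thetas, domain_ses):
    weights = [1.0 / se ** 2 for se in domain_ses]
    total = sum(weights)
    theta = sum(w * t for w, t in zip(weights, domain_thetas)) / total
    se = (1.0 / total) ** 0.5   # SE of the composite, assuming independence
    return theta, se

theta, se = overall_score([0.4, -0.2, 0.9], [0.30, 0.45, 0.25])
print(f"overall theta={theta:.3f}  SE={se:.3f}")
```

Note that the composite SE is smaller than any single domain SE, which is the sense in which an information-pooled overall score achieves the smallest standard error of measurement.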

12.
This paper investigates whether inferences about school performance based on longitudinal models are consistent when different assessments and metrics are used as the basis for analysis. Using norm-referenced (NRT) and standards-based (SBT) assessment results from panel data of a large heterogeneous school district, we examine inferences based on vertically equated scale scores, normal curve equivalents (NCEs), and nonvertically equated scale scores. The results indicate that the effect of the metric depends upon the evaluation objective. NCEs significantly underestimate absolute individual growth, but NCEs and scale scores yield highly correlated (r >.90) school-level results based on mean initial status and growth estimates. SBT and NRT results are highly correlated for status but only moderately correlated for growth. We also find that as few as 30 students per school provide consistent results and that mobility tends to affect inferences based on status but not growth, irrespective of the assessment or metric used.

13.
Large‐scale assessments such as the Programme for International Student Assessment (PISA) have field trials where new survey features are tested for utility in the main survey. Because of resource constraints, there is a trade‐off between how much of the sample can be used to test new survey features and how much can be used for the initial item response theory (IRT) scaling. Utilizing real assessment data of the PISA 2015 Science assessment, this article demonstrates that using fixed item parameter calibration (FIPC) in the field trial yields stable item parameter estimates in the initial IRT scaling for samples as small as n = 250 per country. Moreover, the results indicate that for the recovery of the country‐specific latent trait distributions, the estimates of the trend items (i.e., the information introduced into the calibration) are crucial. Thus, concerning the country‐level sample size of n = 1,950 currently used in the PISA field trial, FIPC is useful for increasing the number of survey features that can be examined during the field trial without the need to increase the total sample size. This enables international large‐scale assessments such as PISA to keep up with state‐of‐the‐art developments regarding assessment frameworks, psychometric models, and delivery platform capabilities.
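The core mechanic behind FIPC is estimation with some item parameters held fixed rather than re-estimated. A minimal illustration of that fixed-parameter idea, reduced to maximum likelihood ability estimation under the Rasch model with all item difficulties fixed; the difficulties and response pattern are hypothetical, and real FIPC additionally calibrates the new items' parameters:

```python
import math

# ML ability estimation under the Rasch model with item difficulties held
# fixed -- the 'fixed parameter' ingredient of FIPC, in miniature.
# Difficulties and the response pattern below are hypothetical.
def theta_mle(responses, difficulties, iters=25):
    theta = 0.0
    for _ in range(iters):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        grad = sum(x - p for x, p in zip(responses, probs))  # score function
        info = sum(p * (1.0 - p) for p in probs)             # Fisher information
        theta += grad / info                                 # Newton-Raphson step
    return theta

b_fixed = [-1.5, -0.5, 0.0, 0.5, 1.5]       # trend-item difficulties, held fixed
theta = theta_mle([1, 1, 1, 0, 0], b_fixed)
print(f"theta = {theta:.3f}")
```

Because the fixed trend items pin down the scale, abilities (and, in full FIPC, new-item parameters) estimated this way remain on the established metric.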

14.
Contemporary educational accountability systems, including state‐level systems prescribed under No Child Left Behind as well as those envisioned under the “Race to the Top” comprehensive assessment competition, rely on school‐level summaries of student test scores. The precision of these score summaries is almost always evaluated using models that ignore the classroom‐level clustering of students within schools. This paper reports balanced and unbalanced generalizability analyses investigating the consequences of ignoring variation at the level of classrooms within schools when analyzing the reliability of such school‐level accountability measures. Results show that the reliability of school means cannot be determined accurately when classroom‐level effects are ignored. Failure to take between‐classroom variance into account biases generalizability (G) coefficient estimates downward and standard errors (SEs) upward if classroom‐level effects are regarded as fixed, and biases G‐coefficient estimates upward and SEs downward if they are regarded as random. These biases become more severe as the difference between the school‐level intraclass correlation (ICC) and the class‐level ICC increases. School‐accountability systems should be designed so that classroom (or teacher) level variation can be taken into consideration when quantifying the precision of school rankings, and statistical models for school mean score reliability should incorporate this information.
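A numeric sketch of one of the bias directions described above: folding classroom variance into the student term overstates the precision of school means, because a classroom effect shared by many students does not average away at the rate 1/(students). The variance components below are hypothetical.

```python
# Reliability of school mean scores with and without a classroom level.
# Variance components (school, classroom, student) are hypothetical.
def school_mean_reliability(v_school, v_class, v_student, n_classes, n_students):
    err = v_class / n_classes + v_student / (n_classes * n_students)
    return v_school / (v_school + err)

def naive_reliability(v_school, v_class, v_student, n_classes, n_students):
    # Classroom level ignored: its variance is lumped into the student term.
    n_total = n_classes * n_students
    return v_school / (v_school + (v_class + v_student) / n_total)

args = dict(v_school=0.15, v_class=0.10, v_student=1.00,
            n_classes=4, n_students=25)
print(f"correct={school_mean_reliability(**args):.3f}  "
      f"naive={naive_reliability(**args):.3f}")
```

The gap between the two widens as the classroom component grows relative to the school component, consistent with the paper's finding that the bias worsens as the school- and class-level ICCs diverge.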

15.
Using generalizability (G-) theory and rater interviews as research methods, this study examined the impact of the current scoring system of the CET-4 (College English Test Band 4, a high-stakes national standardized EFL assessment in China) writing on its score variability and reliability. One hundred and twenty CET-4 essays written by 60 non-English major undergraduate students at one Chinese university were scored holistically by 35 experienced CET-4 raters using the authentic CET-4 scoring rubric. Ten purposively selected raters were further interviewed for their views on how the current scoring system could impact its score variability and reliability. The G-theory results indicated that the current single-task and single-rater holistic scoring system would not be able to yield acceptable generalizability and dependability coefficients. The rater interview results supported the quantitative findings. Important implications for the CET-4 writing assessment policy in China are discussed.

16.
《Educational Assessment》2013,18(4):329-347
It is generally accepted that variability in performance will increase throughout Grades 1 to 12. Those with minimal knowledge of a domain should vary but little, but, as learning rates differ, variability should increase as a function of growth. In this article, the series of reading tests from a widely used test battery for Grades 1 through 12 was singled out for study, as the scale scores for the series have the opposite characteristic; that is, variability is greatest at Grade 1 and decreases as growth proceeds. Item response theory (IRT) scaling was used; in previous editions, the publisher had used Thurstonian scaling and the variance increased with growth. Using data with known characteristics (i.e., weight distributions for ages 6 through 17), a comparison was made between the effectiveness of IRT and Thurstonian scaling procedures. The Thurstonian scaling more accurately reproduced the characteristics of the known distributions. As IRT scaling was shown to improve when perfect scores were included in the analyses and when items were selected whose difficulties reflected the entire range of ability, these steps were recommended. However, even when these steps were implemented with IRT, the Thurstonian scaling was still found to be more accurate.

17.
Stanford Binet: Fourth Edition (SB:IV) assessments have been collected longitudinally for 195 individuals with Down syndrome. This article discusses individual assessments which were selected for their ability to highlight major concerns that practitioners need to consider when interpreting intelligence test scores with this population. In this study, Intelligence Quotient (IQ) changed substantially for many individuals, demonstrating changes in classification from a mild level of intellectual impairment on initial assessment to a severe level on later assessment. Subtests used in calculating composite scores were found to have a dramatic effect on IQ. There was up to 9 IQ points difference depending on whether only the “core” subtests or all subtests used by the assessor were included in the calculations. Thirty‐seven percent of the assessments were at “floor level” (i.e., IQ of 36), despite obvious divergent abilities illustrated by age equivalent scores. Mean Age Equivalent (MAE) scores were also problematic as they failed to adequately represent either the range, or divergence, of abilities of the individuals whose data are presented. Directions for future research are discussed.

18.
Accountability for educational quality is a priority at all levels of education. Low-stakes testing is one way to measure the quality of education that students receive and make inferences about what students know and can do. Aggregate test scores from low-stakes testing programs are suspect, however, to the degree that these scores are influenced by low test-taker effort. This study examined the generalizability of a recently developed technique called motivation filtering, whereby scores for students of low motivation are systematically filtered from test data to determine aggregate test scores that more accurately reflect student performance and that can be used for reporting purposes. Across assessment tests in five different content areas, motivation filtering was found to consistently increase mean test performance and convergent validity.
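The filtering step itself is mechanically simple: drop examinees whose effort measure falls below a cutoff before aggregating. A sketch, assuming a per-examinee self-reported effort score on a 1-5 scale and a cutoff of 3.0; the scores, effort values, and cutoff are all hypothetical.

```python
# Motivation filtering sketch: remove examinees whose effort score falls
# below a cutoff before computing the aggregate mean. All values hypothetical.
def filtered_mean(scores, effort, cutoff=3.0):
    kept = [s for s, e in zip(scores, effort) if e >= cutoff]
    return sum(kept) / len(kept)

scores = [62, 75, 48, 88, 55, 91, 40, 79]
effort = [4.1, 3.8, 1.9, 4.5, 2.2, 4.0, 1.5, 3.6]
raw_mean = sum(scores) / len(scores)
print(f"unfiltered mean={raw_mean:.1f}  filtered mean={filtered_mean(scores, effort):.1f}")
```

If low effort depresses scores, the filtered mean rises above the unfiltered one, which is the pattern of increased mean performance the study reports.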

19.
Although multivariate generalizability theory was developed more than 30 years ago, little published research utilizing this framework exists and most of what does exist examines tests built from tables of specifications. In this context, it is assumed that the universe scores from levels of the fixed multivariate facet will be correlated, but the error terms will be uncorrelated because subscores result from mutually exclusive sets of test items. This paper reports on an application in which multiple subscores are derived from each task completed by the examinee. In this context, both universe scores and errors may be correlated across levels of the fixed multivariate facet. The data described come from the United States Medical Licensing Examination® Step 2 Clinical Skills Examination. In this test, each examinee interacts with a series of standardized patients and each interaction results in four component scores. The paper focuses on the application of multivariate generalizability theory in this context and on the practical interpretation of the resulting estimated variance and covariance components.

20.
This paper addresses the contextual interference hypothesis, which was originally formulated by Battig (1966) and later adapted to motor learning by Shea and Morgan (1979). The hypothesis has generated much research, and its application has been readily suggested to practitioners. According to the hypothesis, high contextual interference (random practice) impairs acquisition but enhances retention and transfer, whereas low contextual interference (blocked practice) has the opposite effects. The empirical basis for the hypothesis—from laboratory-oriented and field-based settings—is examined. The generalizability of the hypothesis is also assessed. Recommendations are made for practitioners for optimal use of the contextual interference effect.
