Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
An IRT method for estimating conditional standard errors of measurement of scale scores is presented, where scale scores are nonlinear transformations of number-correct scores. The standard errors account for measurement error that is introduced due to rounding scale scores to integers. Procedures for estimating the average conditional standard error of measurement for scale scores and reliability of scale scores are also described. An illustration of the use of the methodology is presented, and the results from the IRT method are compared to the results from a previously developed method that is based on strong true-score theory.
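
The idea in this abstract can be sketched in a few lines. The sketch below is not the authors' procedure; it assumes a 3PL item model, uses the Lord-Wingersky recursion to get the number-correct distribution at a fixed ability, and then takes the standard deviation of the rounded scale score. The function names, the item parameters, and the raw-to-scale function are hypothetical.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under a 3PL model (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def number_correct_dist(p_items):
    """Lord-Wingersky recursion: P(X = x) for x = 0..n, given per-item success probabilities."""
    dist = [1.0]
    for p in p_items:
        new = [0.0] * (len(dist) + 1)
        for x, prob in enumerate(dist):
            new[x] += prob * (1.0 - p)   # item answered incorrectly
            new[x + 1] += prob * p       # item answered correctly
        dist = new
    return dist

def scale_score_csem(p_items, raw_to_scale):
    """Conditional mean and SD of the integer-rounded scale score at one ability level."""
    dist = number_correct_dist(p_items)
    # Round half up explicitly; Python's built-in round() rounds halves to even.
    scores = [math.floor(raw_to_scale(x) + 0.5) for x in range(len(dist))]
    mean = sum(pr * s for pr, s in zip(dist, scores))
    var = sum(pr * (s - mean) ** 2 for pr, s in zip(dist, scores))
    return mean, math.sqrt(var)

# Hypothetical three-item test and a hypothetical nonlinear raw-to-scale conversion.
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.7, 0.25)]
probs = [p_correct_3pl(0.0, a, b, c) for a, b, c in items]
mean_ss, csem_ss = scale_score_csem(probs, lambda x: 10 + 3.7 * math.sqrt(x))
```

Because rounding is applied before the variance is taken, the resulting standard deviation includes the error contributed by rounding, which is the point the abstract stresses.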

2.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.

3.
Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines.
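
A small numerical check shows why an arcsine transformation can equalize error variability. The sketch below is an illustration under stated assumptions, not the article's procedure: it assumes binomial measurement error for number-correct scores and uses the Freeman-Tukey form of the arcsine transformation. The function names and the 40-item test length are hypothetical.

```python
import math

def arcsine_transform(x, n_items):
    """Freeman-Tukey style arcsine transformation of a number-correct score x (assumed form)."""
    return 0.5 * (math.asin(math.sqrt(x / (n_items + 1)))
                  + math.asin(math.sqrt((x + 1) / (n_items + 1))))

def conditional_sd(p_true, n_items, transform):
    """SD of transform(X) when X ~ Binomial(n_items, p_true), i.e. error SD at one true proportion."""
    probs = [math.comb(n_items, x) * p_true ** x * (1.0 - p_true) ** (n_items - x)
             for x in range(n_items + 1)]
    vals = [transform(x) for x in range(n_items + 1)]
    mean = sum(pr * v for pr, v in zip(probs, vals))
    var = sum(pr * (v - mean) ** 2 for pr, v in zip(probs, vals))
    return math.sqrt(var)

n = 40  # hypothetical test length
for p in (0.3, 0.5, 0.7, 0.9):
    raw_sd = conditional_sd(p, n, lambda x: x)
    arc_sd = conditional_sd(p, n, lambda x: arcsine_transform(x, n))
    print(f"p={p:.1f}  raw SD={raw_sd:.3f}  arcsine SD={arc_sd:.4f}")
```

Under these assumptions the raw-score error SD shrinks toward the extremes of the scale, while the SD of the transformed score stays close to 1/sqrt(4n + 2) over the middle of the range. A linear rescaling of the transformed score can then set the common error SD to any desired value.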

4.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.

5.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.

6.
7.
Student–teacher interactions are dynamic relationships that change and evolve over the course of a school year. Measuring classroom quality through observations that focus on these interactions presents challenges when observations are conducted throughout the school year. Variability in observed scores could reflect true changes in the quality of student–teacher interaction or simply reflect measurement error. Classroom observation protocols should be designed to minimize measurement error while allowing measurable changes in the construct of interest. Treating occasions as fixed multivariate outcomes allows true changes to be separated from random measurement error. These outcomes may also be summarized through trend score composites to reflect different types of growth over the school year. We demonstrate the use of multivariate generalizability theory to estimate reliability for trend score composites, and we compare the results to traditional methods of analysis. Reliability estimates computed for average, linear, quadratic, and cubic trend scores from 118 classrooms participating in the MyTeachingPartner study indicate that universe scores account for between 57% and 88% of observed score variance.

8.
A decision-theoretic approach to the question of reliability in categorically scored examinations is explored. The concepts of true scores and errors are discussed as they deviate from conventional psychometric definitions, and measurement error in categorical scores is cast in terms of misclassifications. A reliability measure based on proportional reduction in loss (PRL) is then presented and exemplified with data from a large-scale assessment. The link between the PRL approach and the classical conception of reliability is discussed. Some design considerations for reliability studies are also discussed.

9.
The importance of Rotter's concept of locus of control is discussed and its educational implications outlined. A procedure involving relaxation, suggestion, and imagery (RSI) is described and its use in modifying perception of locus of control postulated. This hypothesis is tested in a study with 36 final-year secondary school students who were paired according to their scores on Rotter's I-E (internal-external) scale. Members of each pair were allocated at random to either an experimental group, which experienced three half-hour RSI sessions distributed over a 3-week period, or a control group which spent the same amount of time in discussing ways of modifying locus of control. Immediate posttreatment administration of the I-E scale indicated a definite shift toward increased internal control by subjects of both groups, with the experimental group's scores being significantly better. A further administration of the I-E scale after 12 months confirmed that this superiority had been maintained.

10.
Book reviews     
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.

11.
Confidence intervals often are recommended as a means of communicating the extent to which individual test scores may be influenced by measurement error. However, test manuals and assessment texts vary widely in their recommendations about how confidence intervals should be constructed, and several contain misinterpretations of classical test theory. The most widely used procedure for constructing confidence intervals misrepresents the likely distribution of true scores, and confidence intervals constructed with it will be inaccurate, especially when extreme scores are involved. The various procedures for constructing confidence intervals that have been suggested in measurement texts are examined in relation to their approximation to the most accurate procedure that uses the estimated true score as the center of the confidence interval and the standard error of estimate to determine the width. In addition, the problems of applying these procedures to norm-referenced scores are discussed, an issue that has been largely ignored in the assessment literature and that leads to further misinterpretations of confidence intervals.
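
The interval described here, centered on the estimated true score with a width set by the standard error of estimate, can be sketched with the standard classical test theory formulas. The function name and the example values (mean 100, SD 15, reliability 0.90, observed score 130) are hypothetical, and the 1.96 multiplier assumes a normal-theory 95% interval.

```python
import math

def true_score_interval(observed, mean, sd, reliability, z=1.96):
    """Return (lower, center, upper) for an interval centered on the estimated true score."""
    # Estimated true score: the observed score regressed toward the group mean.
    est_true = mean + reliability * (observed - mean)
    # Standard error of estimate: SD of true scores around the estimated true score.
    see = sd * math.sqrt(reliability * (1.0 - reliability))
    return est_true - z * see, est_true, est_true + z * see

def observed_score_interval(observed, sd, reliability, z=1.96):
    """The widely used alternative: centered on the observed score, width from the SEM."""
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem

lower, center, upper = true_score_interval(130, mean=100, sd=15, reliability=0.90)
conventional = observed_score_interval(130, sd=15, reliability=0.90)
```

With these values the estimated true score is 127 and the standard error of estimate is 4.5, so the interval is shifted toward the mean and slightly narrower than the observed-score interval. The shift grows as scores become more extreme, which matches the abstract's remark about extreme scores.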

12.
Educational measurement specialists in undertaking test equating in applied settings have been plagued by the absence of a logically or mathematically compelling rationale for their test equating efforts. Classical test theory and other test theories based on the assumption of identically distributed true scores are tautological in terms of test equating. The present study examined (by means of a Monte Carlo procedure) the effects of four parameters on the accuracy of test equating under a relaxed definition of test form equivalence. The four parameters studied were sample size, test form length, test form reliability, and the correlation between the true scores of the test forms to be equated. Significant interactions involving sample size and the other parameters indicated that smaller samples of observations yielded disproportionately larger errors in test equating for fixed values of the test form parameters. In terms of main effects, sample size emerged as most important in controlling equating error. Taken together, the results suggest that when test equating is carried out on larger samples of observations, errors of equating will tend to be relatively small even though the test forms are not strictly parallel. For arbitrarily small samples, however, errors of equating will tend to be larger regardless of how equivalent the test forms are.

13.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different format? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's ‘reliability length’, the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but can be focused on specific test objectives (e.g. in relation to cut‐off scores) and can be expressed as easily communicated statements even before tests are written.

14.
Equatings were performed on both simulated and real data sets using the common-examinee design and two abilities for each examinee (i.e., two dimensions). Item and ability parameter estimates were found by using the Multidimensional Item Response Theory Estimation (MIRTE) program. The amount of equating error was evaluated by a comparison of the mean difference and the mean absolute difference between the true scores and ability estimates found on both tests for the common examinees used in the equating. The results indicated that effective equating, as measured by comparability of true scores, was possible with the techniques used in this study. When the stability of the ability estimates was examined, unsatisfactory results were found.

15.
Longitudinal studies offer unique opportunities to identify the specificity variance in the components of a psychometric scale that is administered repeatedly. This article discusses a procedure for evaluation of the relationship between true scale scores and criterion variables uncorrelated with measurement errors in longitudinally presented measures comprising unidimensional multicomponent instruments. The approach provides point and interval estimates of the true scale criterion validity with respect to a criterion that is assessed once or repeatedly, as well as a means for testing temporal stability in this validity. The outlined method is based on an application of the latent variable modeling methodology, is readily applicable with popular software, and is illustrated using empirical data.

16.
With a focus on performance assessments, this paper describes procedures for calculating conditional standard error of measurement (CSEM) and reliability of scale scores and classification consistency of performance levels. Scale scores that are transformations of total raw scores are the focus of these procedures, although other types of raw scores are considered as well. Polytomous IRT models provide the psychometric foundation for the procedures that are described. The procedures are applied using test data from ACT's Work Keys Writing Assessment to demonstrate their usefulness. Two polytomous IRT models were compared, as were two different procedures for calculating scores. One simulation study was done using one of the models to evaluate the accuracy of the proposed procedures. The results suggest that the procedures provide quite stable estimates and have the potential to be useful in a variety of performance assessment situations.

17.
An improved method is derived for estimating conditional measurement error variances, that is, error variances specific to individual examinees or specific to each point on the raw score scale of the test. The method involves partitioning the test into short parallel parts, computing for each examinee the unbiased estimate of the variance of part-test scores, and multiplying this variance by a constant dictated by classical test theory. Empirical data are used to corroborate the principal theoretical deductions.
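
The three steps in this abstract (partition into parts, take each examinee's unbiased part-score variance, multiply by a constant) can be sketched as follows. The abstract does not state the constant. The sketch assumes k strictly parallel parts of equal length, in which case the total-score error variance is k times the error variance of one part, so the multiplier is k. The function name and the example part scores are hypothetical, and the article's actual constant may differ.

```python
import math
import statistics

def examinee_error_variance(part_scores):
    """Estimate one examinee's total-score error variance from parallel part-test scores.

    Assumes strictly parallel, equal-length parts, so the multiplying constant
    is simply the number of parts k (an assumption, not taken from the abstract).
    """
    k = len(part_scores)
    if k < 2:
        raise ValueError("at least two parts are needed to estimate a variance")
    # Unbiased (n - 1 denominator) variance of the part scores for this examinee.
    part_variance = statistics.variance(part_scores)
    return k * part_variance

# Hypothetical examinee: four parallel parts with scores 5, 7, 6, 6.
error_var = examinee_error_variance([5, 7, 6, 6])
conditional_sem = math.sqrt(error_var)
```

Averaging these per-examinee estimates over everyone with the same total raw score gives an error variance for each point on the raw score scale, the second use mentioned in the abstract.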

18.
Reliability coefficients of linear combinations of observed scores have anomalous properties which have led to persistent difficulties in the investigation of difference scores and gain scores in test theory. Interpretation of these test scores is further complicated by effects of correlated errors of measurement which are likely to appear in difference scores and gain scores in practice. In this paper the discrepancies between classical results and correct results obtained from more general formulas, which allow for correlated errors, are examined systematically. These discrepancies depend strongly on the reliability coefficients of the respective tests and are smallest when the influence of the variables related by the formulas is least. A vector representation of difference scores reveals that these anomalies arise from simple geometric relations among observed scores, true scores, and error scores inherent in the test-theory model. In this context, doubts as to the usefulness of difference scores and gain scores in testing practice expressed by previous authors appear to be justified.

19.
Any standardized method for identifying cases of likely child abuse requires specification of a cutting score (or scores) on a predictor variable. In this paper, we describe two criteria for determining cutting scores--utility maximizing (UtilMax) and error minimizing (ErrMin)--and we demonstrate that UtilMax is often the superior, and never the inferior, criterion. Two types of ErrMin cutting scores, true and artificial, are distinguishable based on whether realistic or artificial base rates are used to find the cutting score. Since studies often compute artificial ErrMin cutting scores, these scores must be modified to produce true ErrMin cutting scores. UtilMax cutting scores are explained and a numerical example is presented to show that maximizing utility is the preferable criterion in that it optimizes the balance between the costs of incorrect decisions and the benefits of correct decisions. The example also illustrates how UtilMax cutting scores help one to decide whether attempting to predict abuse would be worthwhile or not.

20.
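To enable vision-based measurement of both regular-sized and large-sized mechanical parts, a new measurement method based on the dimensional features of sequential local images is proposed. Instead of stitching the images together, the method extracts dimensional features from the sequential local images and solves for the part dimensions based on the relationships among the images in the sequence. To address image-plane rotation, which affects measurement accuracy, the surface texture information of the workpiece is fully exploited and a texture-feature-based calibration method for sequential local images is proposed. The calibration obtains the relative rotation angles between the sequential images and resolves the changes in dimensional-feature orientation caused by camera shake or part rotation during measurement. An example illustrates the implementation and effectiveness of the proposed method. Experiments show that, when the sequential-image measurement method is applied to large-sized parts, the relative measurement error is within 0.012%, which meets the precision measurement requirements for plate-type parts.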
