Similar Articles
 20 similar articles found
1.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.
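The core idea of the new method, conditioning observed scores on a fixed value of a parallel measurement and splitting the conditional variance into true and error parts, can be illustrated with a small Monte Carlo sketch. The normal score model, the seed, and the bin half-width of 0.05 are illustrative assumptions, not the paper's setup.

```python
import random
import statistics

def conditional_decomposition(n=200000, var_t=1.0, var_e=0.25, x1_value=0.0):
    """Sketch: Var(X2 | X1 near x1_value) for two parallel measurements
    X1 = T + E1, X2 = T + E2 decomposes into Var(T | X1) plus the error
    variance. All distributional choices here are illustrative."""
    random.seed(1)
    t_kept, x2_kept = [], []
    for _ in range(n):
        t = random.gauss(0.0, var_t ** 0.5)
        x1 = t + random.gauss(0.0, var_e ** 0.5)      # parallel measurement
        if abs(x1 - x1_value) < 0.05:                 # condition on X1 ~ x1_value
            t_kept.append(t)
            x2_kept.append(t + random.gauss(0.0, var_e ** 0.5))
    cond_obs = statistics.variance(x2_kept)           # Var(X2 | X1)
    cond_true = statistics.variance(t_kept)           # Var(T | X1)
    return cond_obs, cond_true, cond_obs - cond_true  # difference ~ var_e
```

The simulated error part recovers var_e because the error of the second measurement is independent of both the true score and the conditioning measurement.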

2.
The focus of this paper is assessing the impact of measurement errors on the prediction error of an observed‐score regression. Measures are presented and described for decomposing the linear regression's prediction error variance into parts attributable to the true score variance and the error variances of the dependent variable and the predictor variable(s). These measures are demonstrated for regression situations reflecting a range of true score correlations and reliabilities and using one and two predictors. Simulation results also are presented which show that the measures of prediction error variance and its parts are generally well estimated for the considered ranges of true score correlations and reliabilities and for homoscedastic and heteroscedastic data. The final discussion considers how the decomposition might be useful for addressing additional questions about regression functions’ prediction error variances.
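The paper's exact measures are not reproduced here, but one natural decomposition of this kind can be sketched for a single predictor. With Y = Ty + Ey and X = Tx + Ex (independent errors), the prediction error variance is Var(Y) minus Cov(X,Y)²/Var(X); the sketch splits it against an error-free benchmark. This particular split is an illustration, not necessarily the authors' measures.

```python
def prediction_error_parts(var_ty, var_ey, var_tx, var_ex, cov_t):
    """Illustrative decomposition of the linear regression prediction error
    variance into a true score part and parts attributable to the error
    variances of Y and X. cov_t is Cov(Tx, Ty)."""
    pev = (var_ty + var_ey) - cov_t ** 2 / (var_tx + var_ex)  # total PEV
    pev_true = var_ty - cov_t ** 2 / var_tx                   # error-free benchmark
    part_ey = var_ey                                          # criterion error adds directly
    part_ex = pev - pev_true - part_ey                        # predictor error's share
    return {"total": pev, "true": pev_true, "error_y": part_ey, "error_x": part_ex}
```

Predictor error contributes by attenuating the regression slope (the denominator grows from var_tx to var_tx + var_ex), which is why its share appears as a residual term here.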

3.
The latent state–trait (LST) theory is an extension of the classical test theory that allows one to decompose a test score into a true trait, a true state residual, and an error component. For practical applications, the variances of these latent variables may be estimated with standard methods of structural equation modeling (SEM). These estimates allow one to decompose the coefficient of reliability into a coefficient of consistency (indicating true effects of the person) plus a coefficient of occasion specificity (indicating true effects of the situation and the person–situation interaction). One disadvantage of this approach is that the standard SEM analysis requires large sample sizes. This article aims to overcome this disadvantage by presenting a simple method that allows one to estimate the LST parameters algebraically from the observed covariance matrix. A Monte Carlo simulation suggests that the proposed method may be superior to the standard SEM analysis in small samples.
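An algebraic estimate of this kind can be sketched for the common two-occasion, parallel-indicator layout (an assumption here, not necessarily the article's design): the covariance of indicators from different occasions isolates trait variance, while the within-occasion covariance carries trait plus state-residual variance.

```python
def lst_coefficients(var_obs, cov_within, cov_between):
    """Algebraic LST decomposition from three covariance summaries.
    cov_between: covariance of parallel indicators at different occasions
                 -> trait variance (state residuals uncorrelated across occasions)
    cov_within:  covariance of parallel indicators at the same occasion
                 -> trait + state-residual variance"""
    var_trait = cov_between
    var_state = cov_within - cov_between
    var_error = var_obs - cov_within
    return {
        "reliability": (var_trait + var_state) / var_obs,  # consistency + specificity
        "consistency": var_trait / var_obs,
        "specificity": var_state / var_obs,
        "error": var_error / var_obs,                      # 1 - reliability
    }
```

Because the quantities come straight from the observed covariance matrix, no iterative SEM estimation (and hence no large sample) is needed, which is the point of the article's proposal.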

4.
An improved method is derived for estimating conditional measurement error variances, that is, error variances specific to individual examinees or specific to each point on the raw score scale of the test. The method involves partitioning the test into short parallel parts, computing for each examinee the unbiased estimate of the variance of part-test scores, and multiplying this variance by a constant dictated by classical test theory. Empirical data are used to corroborate the principal theoretical deductions.
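The partition-and-multiply step described above is concrete enough to sketch. For p parallel parts, the parts share the same true score, so the unbiased variance of one examinee's part scores estimates the per-part error variance, and multiplying by p gives the total-score error variance. The function name and interface are illustrative.

```python
import statistics

def conditional_error_variance(part_scores):
    """Estimate one examinee's conditional error variance on the total score
    from scores on p parallel test parts: unbiased part-score variance
    times p, the constant dictated by classical test theory."""
    p = len(part_scores)
    s2 = statistics.variance(part_scores)  # unbiased, n-1 denominator
    return p * s2
```

For example, part scores [2, 4] give a part-score variance of 2 and a conditional total-score error variance of 4.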

5.
It is well known that measurement error in observable variables induces bias in estimates in standard regression analysis and that structural equation models are a typical solution to this problem. Often, multiple indicator equations are subsumed as part of the structural equation model, allowing for consistent estimation of the relevant regression parameters. In many instances, however, embedding the measurement model into structural equation models is not possible because the model would not be identified. To correct for measurement error one has no other recourse than to provide the exact values of the variances of the measurement error terms of the model, although in practice such variances cannot be ascertained exactly, but only estimated from an independent study. The usual approach so far has been to treat the estimated values of error variances as if they were known exact population values in the subsequent structural equation modeling (SEM) analysis. In this article we show that fixing measurement error variance estimates as if they were true values can make the reported standard errors of the structural parameters of the model smaller than they should be. Inferences about the parameters of interest will be incorrect if the estimated nature of the variances is not taken into account. For general SEM, we derive an explicit expression that provides the terms to be added to the standard errors provided by the standard SEM software that treats the estimated variances as exact population values. Interestingly, we find there is a differential impact of the corrections to be added to the standard errors depending on which parameter of the model is estimated. The theoretical results are illustrated with simulations and also with empirical data on a typical SEM model.

6.
This Monte Carlo simulation study compares methods to estimate the effects of programs with multiple versions when assignment of individuals to program version is not random. These methods use generalized propensity scores, which are predicted probabilities of receiving a particular level of the treatment conditional on covariates, to remove selection bias. The results indicate that inverse probability of treatment weighting (IPTW) removes the most bias, followed by optimal full matching (OFM), and marginal mean weighting through stratification (MMWTS). The study also compared standard error estimation with Taylor series linearization, bootstrapping and the jackknife across propensity score methods. With IPTW, these standard error estimation methods performed adequately, but standard error estimates were biased in most conditions with OFM and MMWTS.
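The IPTW step can be sketched for a multi-version program: each unit is weighted by the inverse of its estimated generalized propensity score for the version it actually received, and version means are computed with those weights. How the scores are estimated (e.g., multinomial logistic regression) is left out; the interface is illustrative.

```python
from collections import defaultdict

def iptw_means(versions, probs, outcomes):
    """Inverse probability of treatment weighting across program versions.
    probs[i][v] is the estimated generalized propensity score
    P(version v | covariates of unit i). Returns weighted mean outcomes."""
    num = defaultdict(float)
    den = defaultdict(float)
    for v, p, y in zip(versions, probs, outcomes):
        w = 1.0 / p[v]          # inverse probability of the version received
        num[v] += w * y
        den[v] += w
    return {v: num[v] / den[v] for v in num}
```

Normalizing by the sum of weights per version (a "stabilized" Hajek-style mean) keeps the estimate bounded even when some propensity scores are small.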

7.
Instruction cannot be really personalised, as long as assessment remains norm‐referenced. Whereas psychometrics aims at differentiating the performances of individuals at a given moment, edumetrics aims at differentiating stages of learning for a given individual. The structure of the two projects is the same and generalisability theory offers symmetrical formulae for estimating the reliability of each of these measurement designs. An example is presented in this paper which shows that satisfactory reliability can be obtained in an edumetric situation, where the between‐pupils variance is completely ignored. Even though the absolute error variance is the same in both cases, the relative error variances and hence the standard errors of measurement are different. As the true score variances are also different, the edumetric properties of a test should be considered alongside its psychometric ones. Certification of progress by the teacher, supporting a portfolio of achievement, could even have a summative, as well as a formative, function.

8.
It has been suggested that females who score extremely high on the mathematical portions of the Scholastic Aptitude Test (SAT) do so because they have very high verbal skills, whereas some males score extremely high on the mathematical portion despite their relatively low verbal skills. This hypothesis was investigated with data from two SAT administrations by comparing the conditional distributions of SAT-Verbal (SAT-V), given SAT-Mathematical (SAT-M), for males and females. Evidence for and against the hypothesis was observed. As implied by the hypothesis, the males had lower conditional SAT-V mean scores. However, in contradiction to the hypothesis, the males did not have greater conditional variances of SAT-V given SAT-M.

9.
The purpose of this study was to examine the impact of misspecifying a growth mixture model (GMM) by assuming that Level-1 residual variances are constant across classes, when they do, in fact, vary in each subpopulation. Misspecification produced bias in the within-class growth trajectories and variance components, and estimates were substantially less precise than those obtained from a correctly specified GMM. Bias and precision became worse as the ratio of the largest to smallest Level-1 residual variances increased, class proportions became more disparate, and the number of class-specific residual variances in the population increased. Although the Level-1 residuals are typically of little substantive interest, these results suggest that researchers should carefully estimate and report these parameters in published GMM applications.

10.
Student–teacher interactions are dynamic relationships that change and evolve over the course of a school year. Measuring classroom quality through observations that focus on these interactions presents challenges when observations are conducted throughout the school year. Variability in observed scores could reflect true changes in the quality of student–teacher interaction or simply reflect measurement error. Classroom observation protocols should be designed to minimize measurement error while allowing measurable changes in the construct of interest. Treating occasions as fixed multivariate outcomes allows true changes to be separated from random measurement error. These outcomes may also be summarized through trend score composites to reflect different types of growth over the school year. We demonstrate the use of multivariate generalizability theory to estimate reliability for trend score composites, and we compare the results to traditional methods of analysis. Reliability estimates computed for average, linear, quadratic, and cubic trend scores from 118 classrooms participating in the MyTeachingPartner study indicate that universe scores account for between 57% and 88% of observed score variance.
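The average, linear, quadratic, and cubic trend scores mentioned above are conventionally formed with orthogonal polynomial contrast coefficients. A sketch for four equally spaced occasions (the four-occasion layout is an assumption for illustration; the study's design may differ):

```python
def trend_scores(occasion_scores):
    """Orthogonal polynomial trend composites for four equally spaced
    occasions, using the standard contrast coefficients."""
    contrasts = {
        "average":   (1, 1, 1, 1),
        "linear":    (-3, -1, 1, 3),
        "quadratic": (1, -1, -1, 1),
        "cubic":     (-1, 3, -3, 1),
    }
    return {name: sum(c * x for c, x in zip(coefs, occasion_scores))
            for name, coefs in contrasts.items()}
```

A classroom whose scores rise steadily, e.g. [1, 2, 3, 4], has a positive linear composite and zero quadratic and cubic composites, so its variability is attributable to trend rather than occasion-specific noise.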

11.
This module describes and extends X‐to‐Y regression measures that have been proposed for use in the assessment of X‐to‐Y scaling and equating results. Measures are developed that are similar to those based on prediction error in regression analyses but that are directly suited to interests in scaling and equating evaluations. The regression and scaling function measures are compared in terms of their uncertainty reductions, error variances, and the contribution of true score and measurement error variances to the total error variances. The measures are also demonstrated as applied to an assessment of scaling results for a math test and a reading test. The results of these analyses illustrate the similarity of the regression and scaling measures for scaling situations when the tests have a correlation of at least .80, and also show the extent to which the measures can be adequate summaries of nonlinear regression and nonlinear scaling functions, and of heteroskedastic errors. After reading this module, readers will have a comprehensive understanding of the purposes, uses, and differences of regression and scaling functions.

12.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.

13.
We evaluated the statistical power of single-indicator latent growth curve models to detect individual differences in change (variances of latent slopes) as a function of sample size, number of longitudinal measurement occasions, and growth curve reliability. We recommend the 2 degree-of-freedom generalized test assessing loss of fit when both slope-related random effects, the slope variance and intercept-slope covariance, are fixed to 0. Statistical power to detect individual differences in change is low to moderate unless the residual error variance is low, sample size is large, and there are more than four measurement occasions. The generalized test has greater power than a specific test isolating the hypothesis of zero slope variance, except when the true slope variance is close to 0, and has uniformly superior power to a Wald test based on the estimated slope variance.

14.
Three experienced assessment experts were selected to evaluate the self-assessment scales of eight key disciplines in Hebei Province, and a mixed-design model from generalizability theory was used to analyze the scale structure and error variances reflected in the evaluation results. The results show that the different experts did not introduce large systematic error into the ratings of discipline capability, but the variance of the discipline × indicator × expert interaction, as well as that of the indicator × expert interaction, was large, indicating that the second-level indicators of the evaluation system still have substantial defects.

15.
Under a given weighted regression model, the performance of the least squares estimator, the optimal weighted least squares estimator, and the linear unbiased minimum variance estimator is compared. It is shown that, when the random error covariance matrix is invertible, an explicit expression can be derived for the difference between the error covariance matrices of the optimal weighted least squares estimator and the linear unbiased minimum variance estimator, and that under certain conditions the two estimators coincide.

16.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating with an exception at the low scores.
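IRT true score equating itself follows a fixed recipe: find the theta at which form X's true score function (the sum of its item response probabilities) equals the raw score, then evaluate form Y's true score function at that theta. A minimal sketch with 2PL items and a bisection solver (both illustrative choices):

```python
import math

def p2pl(theta, a, b):
    """2PL item response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """Test true score: sum of item probabilities at theta."""
    return sum(p2pl(theta, a, b) for a, b in items)

def irt_true_score_equate(x, items_x, items_y, lo=-6.0, hi=6.0):
    """Solve true_score(theta, items_x) = x by bisection, then return
    the form-Y true score at that theta."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if true_score(mid, items_x) < x:
            lo = mid
        else:
            hi = mid
    return true_score((lo + hi) / 2.0, items_y)
```

Operational equatings additionally handle scores below the chance floor (for 3PL items) and place both forms on a common theta scale first, which is exactly where the scale transformation approaches compared in the study come in.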

17.
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transformed to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.

18.
《教育实用测度》2013,26(2):191-203
Generalizability theory provides a conceptual and statistical framework for estimating variance components and measurement precision. The theory has been widely used in evaluating technical qualities of performance assessments. However, estimates of variance components, measurement error variances, and generalizability coefficients are likely to vary from one sample to another. This study empirically investigates sampling variability of estimated variance components using data collected in several years for a listening and writing performance assessment. This study also evaluates stability of estimated measurement precision from year to year. The results indicated that the estimated variance components varied from one study to another, especially when sample sizes were small. The estimated measurement error variances and generalizability coefficients also changed from one year to another. Measurement precision projected by a generalizability study may not be fully realized in an actual decision study. The study points out the importance of examining variability of estimated variance components and related statistics in performance assessments.

19.
Mean or median student growth percentiles (MGPs) are a popular measure of educator performance, but they lack rigorous evaluation. This study investigates the error in MGP due to test score measurement error (ME). Using analytic derivations, we find that errors in the commonly used MGP are correlated with average prior latent achievement: Teachers with low prior achieving students have MGPs that underestimate true teacher performance and vice versa for teachers with high achieving students. We evaluate alternative MGP estimators, showing that aggregates of SGP that correct for ME only contain errors independent of prior achievement. The alternatives are thus more fair because they are not biased by prior mean achievement and have smaller overall variance and larger marginal reliability than the Standard MGP approach. In addition, the mean estimators always outperform their median counterparts.
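For readers unfamiliar with the underlying quantity, a student growth percentile is the percentile rank of a student's current score among students with comparable prior scores. The sketch below conditions by exact prior-score match with a mid-rank tie convention; operational SGPs use quantile regression over several prior years, so this is a simplification for illustration only.

```python
from collections import defaultdict

def student_growth_percentiles(prior, current):
    """Simplified SGPs: percentile rank of each student's current score
    among students sharing the same prior score (student included,
    mid-rank convention for ties)."""
    groups = defaultdict(list)
    for p, c in zip(prior, current):
        groups[p].append(c)
    sgps = []
    for p, c in zip(prior, current):
        peers = groups[p]
        below = sum(1 for x in peers if x < c)
        ties = sum(1 for x in peers if x == c)
        sgps.append(100.0 * (below + 0.5 * ties) / len(peers))
    return sgps
```

An MGP would then be the mean or median of these values over a teacher's students, which is where the aggregation and measurement-error issues studied in the paper arise.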

20.
Testing the goodness of fit of item response theory (IRT) models is relevant to validating IRT models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type 1 error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). Different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type 1 error rates.

