Similar Documents
20 similar documents found (search time: 46 ms)
1.
Reporting confidence intervals with test scores helps test users make important decisions about examinees by providing information about the precision of test scores. Although a variety of estimation procedures based on the binomial error model are available for computing intervals for test scores, these procedures assume that items are randomly drawn from an undifferentiated universe of items, and therefore might not be suitable for tests developed according to a table of specifications. To address this issue, four interval estimation procedures that use category subscores for the computation of confidence intervals are presented in this article. All four estimation procedures assume that subscores instead of test scores follow a binomial distribution (i.e., compound binomial error model). The relative performance of the four compound binomial–based interval estimation procedures is compared to each other and to the better known normal approximation and Wilson score procedures based on the binomial error model.
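For reference, the two binomial-error-model baselines named in the abstract can be sketched in a few lines (a minimal illustration with a hypothetical 40-item test; these are the generic textbook formulas, not the article's compound-binomial procedures):

```python
import math

def wilson_interval(x, n, z=1.96):
    """Wilson score interval for the proportion-correct true score
    of an examinee with x correct answers on an n-item test."""
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def normal_approx_interval(x, n, z=1.96):
    """Normal (Wald) approximation interval, for comparison."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half
```

Unlike the Wald interval, the Wilson interval stays inside [0, 1] even for perfect or zero scores.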

2.
A Monte Carlo approach was used to examine bias in the estimation of indirect effects and their associated standard errors. In the simulation design, (a) sample size, (b) the level of nonnormality characterizing the data, (c) the population values of the model parameters, and (d) the type of estimator were systematically varied. Estimates of model parameters were generally unaffected by either nonnormality or small sample size. Under severely nonnormal conditions, normal theory maximum likelihood estimates of the standard error of the mediated effect exhibited less bias (approximately 10% to 20% too small) compared to the standard errors of the structural regression coefficients (20% to 45% too small). Asymptotically distribution free standard errors of both the mediated effect and the structural parameters were substantially affected by sample size, but not nonnormality. Robust standard errors consistently yielded the most accurate estimates of sampling variability.
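A stripped-down version of such a mediation Monte Carlo can be sketched as follows, with hypothetical population values a = b = 0.5; the delta-method `sobel_se` stands in for the normal-theory standard error of the mediated effect (this is an illustration, not the authors' full design):

```python
import numpy as np

def sobel_se(a, b, se_a, se_b):
    """Normal-theory (delta-method) standard error of the indirect effect a*b."""
    return np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)

# Tiny Monte Carlo: simulate a mediation chain x -> m -> y and check that
# the indirect-effect point estimate a_hat * b_hat recovers a * b.
rng = np.random.default_rng(0)
n, a_pop, b_pop = 200, 0.5, 0.5
estimates = []
for _ in range(500):
    x = rng.normal(size=n)
    m = a_pop * x + rng.normal(size=n)              # mediator
    y = b_pop * m + rng.normal(size=n)              # outcome (no direct x -> y path)
    a_hat = np.cov(x, m)[0, 1] / np.var(x, ddof=1)  # OLS slope of m on x
    b_hat = np.cov(m, y)[0, 1] / np.var(m, ddof=1)  # OLS slope of y on m
    estimates.append(a_hat * b_hat)
```

Under normal data the point estimates are nearly unbiased, which matches the abstract's finding that bias concentrates in the standard errors rather than the parameter estimates.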

3.
In the logistic regression (LR) procedure for differential item functioning (DIF), the parameters of LR have often been estimated using maximum likelihood (ML) estimation. However, ML estimation suffers from finite-sample bias, and ML estimation for LR can be substantially biased in the presence of rare-event data. The bias of ML estimation due to small samples and rare-event data can degrade the performance of the LR procedure, especially when testing the DIF of difficult items in small samples. Penalized ML (PML) estimation was originally developed to reduce the finite-sample bias of conventional ML estimation and is also known to reduce the bias in the estimation of LR for rare-event data. The goal of this study is to compare the performance of the LR procedures based on ML and PML estimation in terms of statistical power and Type I error. In a simulation study, Swaminathan and Rogers's Wald test based on PML estimation (PSR) showed the highest statistical power in most of the simulation conditions, and the likelihood ratio test based on conventional PML estimation (PLRT) showed the most robust and stable Type I error. The trade-off between bias and variance is also discussed.
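Firth's penalty (a Jeffreys-prior adjustment to the score equations) is the standard formulation of PML for logistic regression; a plain-numpy sketch of a generic Firth fit (an illustrative implementation, not the study's DIF-specific code):

```python
import numpy as np

def firth_logistic(X, y, n_iter=50, tol=1e-8):
    """Firth-penalized ML for logistic regression, which reduces the
    small-sample / rare-event bias of plain ML and yields finite
    estimates even under complete separation."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = mu * (1.0 - mu)
        info = (X.T * W) @ X                 # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # hat-matrix diagonals of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        h = np.einsum('ij,jk,ik->i', X, info_inv, X) * W
        score = X.T @ (y - mu + h * (0.5 - mu))   # Firth-adjusted score
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

In the intercept-only case the Firth solution reduces to the "add 1/2 success and 1/2 failure" estimate (k + 0.5)/(n + 1), a handy sanity check.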

4.
This article examines whether Bayesian estimation with minimally informed prior distributions can alleviate the estimation problems often encountered with fitting the true score multitrait–multimethod structural equation model with split-ballot data. In particular, the true score multitrait–multimethod structural equation model encounters an empirical underidentification when (a) latent variable correlations are homogenous, and (b) fitted to data from a 2-group split-ballot design; an understudied case of empirical underidentification due to a planned missingness (i.e., split-ballot) design. A Monte Carlo simulation and 3 empirical examples showed that Bayesian estimation performs better than maximum likelihood (ML) estimation. Therefore, we suggest using Bayesian estimation with minimally informative prior distributions when estimating the true score multitrait–multimethod structural equation model with split-ballot data. Furthermore, given the increase in planned missingness designs in psychological research, we also suggest using Bayesian estimation as a potential alternative to ML estimation for analyses using data from planned missingness designs.

5.
This paper discusses the joint eigenvalue estimation problem for a family A of commuting simple (diagonalizable) matrices. To overcome the convergence and performance-analysis shortcomings of algorithms based on simultaneous Schur decomposition and unitary transformations, a joint eigenstructure estimation algorithm based on simultaneous similarity diagonalization is proposed. The algorithm diagonalizes the matrix family by alternately applying simultaneous Schur decomposition and norm balancing to A. Its effectiveness lies in the fact that each subprocess, while optimizing its own cost function, also accelerates the convergence of the other subprocess. Under suitable assumptions, the convergence of the two alternately optimized cost functions (the norm of the matrix family and the norm of its lower-triangular elements) can be proven. Numerical simulations based on multidimensional harmonic retrieval show that the algorithm converges significantly faster than the simultaneous-Schur-decomposition/unitary-transformation algorithm when the matrix family deviates from normality, and that the joint eigenvalue estimation performance admits a concise closed-form analysis.
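For an exactly commuting, diagonalizable family, the shared eigenstructure can be read off from a single generic linear combination of the matrices, which is the property such iterative algorithms exploit; a numpy sketch of this idealized noise-free case (not the paper's Schur-decomposition/norm-balancing iteration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a commuting, simultaneously diagonalizable family: A_k = V D_k V^{-1}
V = rng.normal(size=(4, 4))
D1 = np.diag(rng.normal(size=4))
D2 = np.diag(rng.normal(size=4))
Vinv = np.linalg.inv(V)
A1, A2 = V @ D1 @ Vinv, V @ D2 @ Vinv

# Eigenvectors of one generic linear combination diagonalize every member
_, U = np.linalg.eig(A1 + 0.7 * A2)
Uinv = np.linalg.inv(U)
lam1 = np.diag(Uinv @ A1 @ U)          # joint eigenvalues of A1
lam2 = np.diag(Uinv @ A2 @ U)          # paired eigenvalues of A2
off1 = Uinv @ A1 @ U - np.diag(lam1)   # off-diagonal residual, ~ 0
```

With noisy estimates the family only approximately commutes, which is why the paper needs an iterative joint-diagonalization scheme instead of this one-shot eigendecomposition.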

6.
This study investigates the development of an adaptive strategy for the estimation of numerosity from the theoretical perspective of “strategic change” (Lemaire & Siegler, 1995; Siegler & Shipley, 1995). A simple estimation task was used in which participants of three different age groups (20 university students, 20 sixth-graders and 10 second-graders) had to estimate 100 numerosities of (colored) blocks presented in a 10×10 rectangular grid. Generally speaking, this task allows for two distinct estimation procedures: either repeatedly adding estimations of groups of blocks (the addition procedure) or subtracting the estimated number of empty squares from the (estimated) total number of squares in the grid (the subtraction procedure). A rational task analysis indicates that the most efficient overall estimation strategy consists of the adaptive use of both procedures, depending on the ratio of the blocks to the empty squares. The first hypothesis was that there will be a developmental difference in the adaptive use of the two procedures, and according to the second hypothesis this adaptive use will result in better estimation accuracy. Converging evidence from different kinds of data (i.e., response times, error rates, and retrospective reports) supported both hypotheses. From a methodological point of view, the study shows the potential of Beem’s (1995a, 1995b) “segmentation analysis” for unravelling subjects’ adaptive choices between different procedures in cognitive tasks, and for examining the relationship between these adaptive choices and performance.

7.
This research focuses on the problem of model selection between the latent change score (LCS) model and the autoregressive cross-lagged (ARCL) model when the goal is to infer the longitudinal relationship between variables. We conducted a large-scale simulation study to (a) investigate the conditions under which these models return statistically (and substantively) different results concerning the presence of bivariate longitudinal relationships, and (b) ascertain the relative performance of an array of model selection procedures when such different results arise. The simulation results show that the primary sources of differences in parameter estimates across models are model parameters related to the slope factor scores in the LCS model (specifically, the correlation between the intercept factor and the slope factor scores) as well as the size of the data (specifically, the number of time points and sample size). Among several model selection procedures, correct selection rates were higher when using model fit indexes (i.e., comparative fit index, root mean square error of approximation) than when using a likelihood ratio test or any of several information criteria (i.e., Akaike’s information criterion, Bayesian information criterion, consistent AIC, and sample-size-adjusted BIC).
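The information criteria compared above are all simple penalized functions of the maximized log-likelihood; a minimal sketch (the saBIC here uses Sclove's (n + 2)/24 sample-size adjustment):

```python
import math

def info_criteria(loglik, k, n):
    """AIC, BIC, consistent AIC, and sample-size-adjusted BIC from a model's
    maximized log-likelihood, free-parameter count k, and sample size n."""
    return {
        "AIC": -2 * loglik + 2 * k,
        "BIC": -2 * loglik + k * math.log(n),
        "CAIC": -2 * loglik + k * (math.log(n) + 1),      # consistent AIC
        "saBIC": -2 * loglik + k * math.log((n + 2) / 24),  # sample-size-adjusted BIC
    }
```

Lower values favor a model; the criteria differ only in how heavily they penalize each extra parameter.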

8.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.

9.
In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2-parameter item response theory (IRT) model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and platykurtic latent variable distributions, 3 methods were compared in Mplus: limited information, full information integrating over a normal distribution, and full information integrating over the known underlying distribution. Interfactor correlation estimates were similar for all 3 estimation methods. For the platykurtic distribution, estimation method made little difference for the item parameter estimates. When the latent variable was negatively skewed, for the most discriminating easy or difficult items, limited-information estimates of both parameters were considerably biased. Full-information estimates obtained by marginalizing over a normal distribution were somewhat biased. Full-information estimates obtained by integrating over the true latent distribution were essentially unbiased. For the a parameters, standard errors were larger for the limited-information estimates when the bias was positive but smaller when the bias was negative. For the d parameters, standard errors were larger for the limited-information estimates of the easiest, most discriminating items. Otherwise, they were generally similar for the limited- and full-information estimates. Sample size did not substantially impact the differences between the estimation methods; limited information did not gain an advantage for smaller samples.

10.
Simulation and real data studies are used to investigate the value of modeling multiple-choice distractors on item response theory linking. Using the characteristic curve linking procedure for Bock's (1972) nominal response model presented by Kim and Hanson (2002), all-category linking (i.e., a linking based on all category characteristic curves of the linking items) is compared against correct-only (CO) linking (i.e., linking based on the correct category characteristic curves only) using a common-item nonequivalent groups design. The CO linking is shown to represent an approximation to what occurs when using a traditional correct/incorrect item response model for linking. Results suggest that the number of linking items needed to achieve an equivalent level of linking precision declines substantially when incorporating the distractor categories.
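Category characteristic curves in Bock's nominal response model are softmax functions of ability; a minimal sketch (with illustrative parameters chosen to satisfy the usual sum-to-zero identification constraints):

```python
import numpy as np

def nrm_probs(theta, a, c):
    """Bock's nominal response model: the probability of response category k
    is softmax(a_k * theta + c_k) over the item's categories."""
    z = np.asarray(a) * theta + np.asarray(c)
    z = z - z.max()                  # guard against overflow
    ez = np.exp(z)
    return ez / ez.sum()

# Illustrative parameters (sum(a) = sum(c) = 0); the first category plays
# the role of the correct response, the others are distractors.
a = [1.0, 0.2, -1.2]
c = [0.0, 0.3, -0.3]
probs = nrm_probs(1.5, a, c)
```

All-category linking matches every one of these curves across forms, whereas CO linking uses only the correct category's curve, discarding the information carried by the distractor slopes.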

11.
This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study results from real data analysis showed that the procedures exhibited similar patterns across various exams, but some tended to produce lower estimates of classification consistency and accuracy than others. As data became more multidimensional, unidimensional and multidimensional item response theory (IRT) methods tended to produce different results, with the unidimensional approach yielding lower estimates than the multidimensional approach. Results from simulated data analysis demonstrated smaller estimation error for the multidimensional IRT methods than for the unidimensional IRT method. The unidimensional approach yielded larger error as tests became more multidimensional, whereas a reverse relationship was observed for the multidimensional IRT approach. Among the non-IRT approaches, the normal approximation and Livingston-Lewis methods performed well, whereas the compound multinomial method tended to produce relatively larger error.

12.
Increasingly, assessment practitioners use generalizability coefficients to estimate the reliability of scores from performance tasks. Little research, however, examines the relation between the estimation of generalizability coefficients and the number of rubric scale points and score distributions. The purpose of the present research is to inform assessment practitioners of (a) the optimum number of scale points necessary to achieve the best estimates of generalizability coefficients and (b) the possible biases of generalizability coefficients when the distribution of scores is non-normal. Results from this study indicate that the number of scale points substantially affects the generalizability estimates. Generalizability estimates increase as scale points increase, with little bias after scales reach 12 points. Score distributions had little effect on generalizability estimates.
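For a fully crossed persons × raters design, a generalizability coefficient can be estimated from two-way ANOVA mean squares; a sketch of a generic single-facet G study (not the article's simulation design):

```python
import numpy as np

def g_coefficient(scores):
    """Generalizability coefficient (E-rho-squared) for relative decisions
    in a fully crossed persons x raters design, computed from two-way
    ANOVA mean squares without replication."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    r_means = scores.mean(axis=0)
    ms_p = n_r * ((p_means - grand) ** 2).sum() / (n_p - 1)
    resid = scores - p_means[:, None] - r_means[None, :] + grand
    ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_pr) / n_r, 0.0)   # person variance component
    var_pr = ms_pr                           # residual (person x rater) component
    return var_p / (var_p + var_pr / n_r)
```

Coarse rubric scales shrink the estimated person variance component relative to the residual, which is one mechanism behind the scale-point effect the study reports.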

13.
In operational equating situations, frequency estimation equipercentile equating is considered only when the old and new groups have similar abilities. The frequency estimation assumptions are investigated in this study under various situations, from the standpoints of both theoretical interest and practical use. The results show that frequency estimation equating can be used under circumstances in which it is not normally used. To link theoretical results with practice, statistical methods are proposed for checking frequency estimation assumptions based on available data: observed‐score distributions and item difficulty distributions of the forms. In addition to the conventional use of frequency estimation equating when the group abilities are similar, three situations are identified when the group abilities are dissimilar: (a) when the two forms and the observed conditional score distributions are similar (in this situation, the frequency estimation equating assumptions are likely to hold, and frequency estimation equating is appropriate); (b) when the forms are similar but the observed conditional score distributions are not (in this situation, frequency estimation equating is not appropriate); and (c) when the forms are not similar but the observed conditional score distributions are (frequency estimation equating is also not appropriate). Statistical analysis procedures for comparing distributions are provided. Data from a large‐scale test are used to illustrate the use of frequency estimation equating when the group difference in ability is large.

14.
The generality of the frustration effect in an aversive stimulus conditioning procedure was examined by training and testing 40 rats in a double cold-waterway escape conditioning apparatus. Experimental and control procedures analogous to appetitive conditioning experiments (e.g., Amsel & Roussel, 1952; Wagner, 1959) indicated that frustrative nonrelief (i.e., reinforcement omission) in the first goal tank yielded significant increments in swimming speed in the second waterway, and that these increments in performance were dependent upon initial training with continuous relief (i.e., reinforcement) in the first goal tank. Extensions of the generality of the frustration effect are discussed.

15.
The conventional noncentrality parameter estimator of covariance structure models, which is currently implemented in widely circulated structural modeling programs (e.g., LISREL, EQS, AMOS, RAMONA), is shown to possess asymptotically potentially large bias, variance, and mean squared error (MSE). A formal expression for its large-sample bias is presented, and its large-sample variance and MSE are quantified. Based on these results, it is suggested that future research needs to develop means of possibly unbiased estimation of the noncentrality parameter, with smaller variance and MSE.
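The conventional estimator is max(T − df, 0), and its truncation at zero is one visible source of bias; a small simulation sketch with illustrative values (not the article's analytic derivation):

```python
import numpy as np

rng = np.random.default_rng(2)
df, ncp = 30, 10.0

# T plays the role of the model test statistic, distributed (asymptotically)
# as noncentral chi-square with noncentrality ncp.
T = rng.noncentral_chisquare(df, ncp, size=100_000)

# Conventional estimator: max(T - df, 0).  T - df itself is unbiased for
# ncp (E[T] = df + ncp), but truncating at zero pushes the mean upward.
ncp_hat = np.maximum(T - df, 0.0)
bias = ncp_hat.mean() - ncp
```

The positive bias grows as the true noncentrality shrinks relative to its sampling variability, i.e., exactly where fit is close and precision matters most.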

16.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
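One IRT-based alternative to simply summing item MPLs maps the summed MPLs onto the ability scale through the test characteristic curve; a sketch with hypothetical 2PL item parameters and MPLs (in the spirit of Van der Linden's approach, not the paper's exact methods):

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters (a, b) and judges' minimum pass levels
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.6)]
mpls = [0.7, 0.6, 0.5]

trad_cut = sum(mpls)   # traditional rule: passing score = sum of item MPLs

def tcc(theta):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p2pl(theta, a, b) for a, b in items)

# Ability theta* whose expected test score equals the summed MPLs,
# found by bisection (the TCC is monotone increasing).
lo, hi = -6.0, 6.0
for _ in range(60):
    mid = (lo + hi) / 2
    if tcc(mid) < trad_cut:
        lo = mid
    else:
        hi = mid
theta_star = (lo + hi) / 2
```

Because theta* lives on the ability scale, it can be carried to other test forms, unlike a raw summed-MPL cut that is tied to one form's items.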

17.
Ridge generalized least squares (RGLS) is a recently proposed estimation procedure for structural equation modeling. In the formulation of RGLS, a key element is the ridge tuning parameter, whose value determines the efficiency of parameter estimates. This article aims to optimize RGLS by developing formulas for the ridge tuning parameter that yield the most efficient parameter estimates in practice. For the formulas to have a wide scope of applicability, they are calibrated using empirical efficiency and via many conditions on population distribution, sample size, number of variables, and model structure. Results show that RGLS with the tuning parameter determined by the formulas can substantially improve the efficiency of parameter estimates over commonly used procedures with real data, which are typically nonnormally distributed.

18.
When the assumption of multivariate normality is violated or when a discrepancy function other than (normal theory) maximum likelihood is used in structural equation models, the null distribution of the test statistic may not be χ2 distributed. Most existing methods to approximate this distribution only match up to 2 moments. In this article, we propose 2 additional approximation methods: a scaled F distribution that matches 3 moments simultaneously and a direct Monte Carlo–based weighted sum of i.i.d. χ2 variates. We also conduct comprehensive simulation studies to compare the new and existing methods for both maximum likelihood and nonmaximum likelihood discrepancy functions and to separately evaluate the effect of sampling uncertainty in the estimated weights of the weighted sum on the performance of the approximation methods.
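The direct Monte Carlo approximation of a weighted sum of i.i.d. one-degree-of-freedom χ² variates is straightforward to sketch (a generic illustration in which the weights are treated as known, setting aside the sampling-uncertainty issue the article studies):

```python
import numpy as np

def weighted_chi2_pvalue(stat, weights, n_mc=200_000, seed=0):
    """Monte Carlo p-value for a test statistic whose null distribution is
    sum_i w_i * chi2_1 over i.i.d. one-df chi-square variates."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    draws = rng.chisquare(1, size=(n_mc, w.size)) @ w
    return float(np.mean(draws >= stat))
```

With equal unit weights the reference distribution collapses to an ordinary χ² with as many degrees of freedom as there are weights, which makes a convenient sanity check.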

19.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
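Under the simpler binomial error model (a stand-in here for the article's strong true score model), Lord's formula gives the raw-score conditional SEM, and a delta-method step propagates it through a raw-to-scale transformation; a sketch:

```python
import math

def conditional_sem_raw(x, n):
    """Lord's binomial-error conditional SEM for raw score x on an n-item test:
    sqrt(x * (n - x) / (n - 1)).  Zero at the extremes, largest mid-scale."""
    return math.sqrt(x * (n - x) / (n - 1))

def conditional_sem_scale(x, n, gprime):
    """Delta-method conditional SEM of a scale score s = g(x), where gprime
    is the derivative of the raw-to-scale transformation evaluated at x."""
    return abs(gprime(x)) * conditional_sem_raw(x, n)
```

The delta-method step makes visible why nonlinear transformations reshape the conditional SEM profile: a transformation that is flat where the raw SEM peaks can equalize the scale-score SEM along the scale.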

20.
We measure the impact of observed teacher characteristics on student math and reading proficiency using a rich dataset from Florida. We expand upon prior work by accounting directly for nonrandom attrition of teachers from the classroom in a sample selection framework. We find evidence that sample selection is present in the estimation of the influence of teacher characteristics, but that failure to account for it does not appear to substantially bias estimation. Further, our procedure produces some evidence that more effective teachers are more likely to exit the classroom.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号