Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Increasingly, assessment practitioners use generalizability coefficients to estimate the reliability of scores from performance tasks. Little research, however, examines the relation between the estimation of generalizability coefficients and the number of rubric scale points and score distributions. The purpose of the present research is to inform assessment practitioners of (a) the optimum number of scale points necessary to achieve the best estimates of generalizability coefficients and (b) the possible biases of generalizability coefficients when the distribution of scores is non-normal. Results from this study indicate that the number of scale points substantially affects the generalizability estimates. Generalizability estimates increase as scale points increase, with little bias after scales reach 12 points. Score distributions had little effect on generalizability estimates.

2.
A common practice in the field of learning disabilities is analysis of ability-achievement discrepancies. The reliability of discrepancy scores is an important statistic in such decision making. In this study, selected ability and achievement devices were administered to a sample of low achievers (N = 99), and the reliability of various difference scores was analyzed. In all cases, the reliabilities of difference scores were moderately high. Reliabilities of differences for devices normed on the same population and differences for devices normed on different populations were comparable. These results are discussed in light of current psychometric practices.
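
As a reading aid for the abstract above, the following is a minimal sketch of the classical test theory formula for the reliability of a difference score D = X - Y. It assumes that measurement errors on the two instruments are uncorrelated; the function name and arguments are illustrative and are not taken from the study, which may have used a different estimator.

```python
def difference_score_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical reliability of D = X - Y (assumes uncorrelated errors).

    r_xx, r_yy: reliabilities of the two scores
    r_xy:       observed correlation between the two scores
    sd_x, sd_y: observed standard deviations (equal by default)
    """
    covariance = r_xy * sd_x * sd_y
    # True-score variance of the difference.
    true_var = sd_x ** 2 * r_xx + sd_y ** 2 * r_yy - 2 * covariance
    # Observed variance of the difference.
    observed_var = sd_x ** 2 + sd_y ** 2 - 2 * covariance
    if observed_var <= 0:
        raise ValueError("observed variance of the difference must be positive")
    return true_var / observed_var


if __name__ == "__main__":
    # Two reliable scores that correlate .50 with each other.
    print(difference_score_reliability(0.9, 0.9, 0.5))
```

With equal standard deviations the expression reduces to ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy), which makes visible why difference scores lose reliability as the two measures become more highly correlated.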

3.
Behavior rating scales aid in the identification of problem behaviors, as well as the development of interventions to reduce such behavior. Although scores on many behavior rating scales have been validated in the United States, there have been few such studies in other cultural contexts. In this study, the structural validity of scores on a Spanish translation of the six‐factor Child Behavior Scale (CBS) was assessed in a sample of 265 Peruvian preschool children who ranged from 2 to 6 years in age. Exploratory factor analysis yielded a four‐factor structure, and reliability estimates for scores on the four factors were adequate. The authors suggest replicating the study and examining the utility of CBS scores in predicting future problem behaviors in this population. © 2011 Wiley Periodicals, Inc.

4.
It is considered standard practice to transform IRT-scaled test scores into standard normal variables for regression analysis in order to enable comparison with other research whose test scores are similarly transformed. This paper calls this practice into question. I show that these transformations can potentially result in radically different estimates of regression parameters due to differences in sample composition. Regression coefficient comparisons between different samples that use z-standardized test scores are only possible if the samples are considered to be random draws from the same population. I outline several different methods to deal with this problem and the caveats attached to each.
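
The point made in the abstract above can be illustrated with a small sketch. The data below are invented for illustration only: two hypothetical samples share the same raw 5-point gap between groups, but their score spreads differ, so the coefficient on within-sample z-scores differs. The helper names are assumptions, not the author's code.

```python
from statistics import mean, pstdev


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def zscore(values):
    """Standardize values using the sample's own mean and SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: group indicator and test scores in two samples.
group = [0, 0, 0, 1, 1, 1]
scores_wide = [40, 50, 60, 45, 55, 65]    # heterogeneous sample
scores_narrow = [48, 50, 52, 53, 55, 57]  # homogeneous sample

raw_wide = ols_slope(group, scores_wide)
raw_narrow = ols_slope(group, scores_narrow)
std_wide = ols_slope(group, zscore(scores_wide))
std_narrow = ols_slope(group, zscore(scores_narrow))

if __name__ == "__main__":
    print("raw gaps:", raw_wide, raw_narrow)
    print("standardized gaps:", std_wide, std_narrow)
```

In this toy case the raw gaps are identical, while the standardized gap is larger in the homogeneous sample simply because its standard deviation is smaller.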

5.
When a constructed‐response test form is reused, raw scores from the two administrations of the form may not be comparable. The solution to this problem requires a rescoring, at the current administration, of examinee responses from the previous administration. The scores from this “rescoring” can be used as an anchor for equating. In this equating, the choice of weights for combining the samples to define the target population can be critical. In rescored data, the anchor usually correlates very strongly with the new form but only moderately with the reference form. This difference has a predictable impact: the equating results are most accurate when the target population is the reference form sample, least accurate when the target population is the new form sample, and somewhere in the middle when the new form and reference form samples are equally weighted in forming the target population.

6.
Person reliability parameters (PRPs) model temporary changes in individuals’ attribute level perceptions when responding to self‐report items (higher levels of PRPs represent less fluctuation). PRPs could be useful in measuring careless responding and traitedness. However, it is unclear how well current procedures for estimating PRPs can recover parameter estimates. This study assesses these procedures in terms of mean error (ME), average absolute difference (AAD), and reliability using simulated data with known values. Several prior distributions for PRPs were compared across a number of conditions. Overall, our results revealed little difference between using the χ or lognormal distributions as priors for estimated PRPs. Both distributions produced estimates with reasonable levels of ME; however, the AAD of the estimates was high. AAD did improve slightly as the number of items increased, suggesting that increasing the number of items would ameliorate this problem. Similarly, a larger number of items was necessary to produce reasonable levels of reliability. Based on our results, several conclusions are drawn and implications for future research are discussed.
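
The abstract above does not define its recovery criteria, so the sketch below assumes the conventional definitions: ME as the average signed difference between estimated and true values, and AAD as the average unsigned difference. The function names and the toy numbers are illustrative assumptions.

```python
from statistics import mean


def mean_error(estimates, true_values):
    """Average signed error; over- and under-estimates can cancel out."""
    return mean(e - t for e, t in zip(estimates, true_values))


def average_absolute_difference(estimates, true_values):
    """Average unsigned error; cancellation is not possible."""
    return mean(abs(e - t) for e, t in zip(estimates, true_values))


if __name__ == "__main__":
    # Invented values: estimates scatter symmetrically around the truth.
    true_values = [1.0, 1.0, 1.0, 1.0]
    estimates = [0.5, 1.5, 0.8, 1.2]
    print(mean_error(estimates, true_values))
    print(average_absolute_difference(estimates, true_values))
```

Under these definitions a procedure can show an ME near zero while its AAD stays large, which is the pattern the abstract reports.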

7.
This article treats various procedures for examining the reliability of group mean difference scores, with particular emphasis on procedures from univariate and multivariate generalizability theory. Attention is given both to traditional norm-referenced perspectives on reliability and to criterion-referenced perspectives that focus on error-tolerance ratios and functions of them. The procedures discussed are illustrated using three cohorts of data for third- and fourth-grade students in Iowa who took the Iowa Tests of Basic Skills in recent years. For these data, estimates of reliability for norm-referenced decisions tend to be relatively low. By contrast, for criterion-referenced decisions, estimates of reliability-like coefficients based on error-tolerance ratios tend to be noticeably larger.

8.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.

9.
The Pervasive Developmental Disorders Rating Scale (PDDRS; Eaves, 1993) is a screening instrument used in the assessment of autistic disorder. In this study, the reliability of test scores for the PDDRS was examined with three samples. The first sample consisted of 456 participants ranging in age from 1 to 12 years old and the second sample consisted of 111 participants in the 13 to 24 year‐old range. Additionally, the test‐retest reliability of scores for the PDDRS was examined with a sample of 40 participants. The results indicated that coefficient alpha for the PDDRS Total Score was adequate for screening purposes (r = .89) for both age groups. The results of the test‐retest study also suggested that PDDRS had adequate test‐retest reliability (r = .92) for the PDDRS Total Score. © 2002 Wiley Periodicals, Inc. Psychol Schs 39: 605–611, 2002.
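
Coefficient alpha, as reported in the abstract above, is conventionally computed from item variances and the variance of the total score. The sketch below shows that standard formula on invented item scores; it is not the PDDRS data, and the function name is an assumption.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items table of scores.

    scores: list of rows, one row per person, one column per item.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two persons")
    # Sample variance of each item across persons.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the total (summed) score.
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Invented ratings for four respondents on three items.
    ratings = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 3, 5]]
    print(round(cronbach_alpha(ratings), 3))
```

Alpha approaches 1 as the items covary more strongly relative to their individual variances, and it generally rises with the number of items.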

10.
Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
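
Two of the five methods named above, chained mean and chained linear equating, are simple enough to sketch. The version below follows the usual textbook description of a common-item design: the new form is linked to the anchor in the new-form sample, and the anchor is linked to the reference form in the reference-form sample. Function and argument names are illustrative assumptions, the data are invented, and the study's operational procedures may differ in detail.

```python
from statistics import mean, stdev


def chained_mean_equate(x, new_form, anchor_new, anchor_ref, ref_form):
    """Map new-form score x to the reference-form scale by matching means."""
    # Step 1: new form -> anchor, using the new-form sample.
    v = x - mean(new_form) + mean(anchor_new)
    # Step 2: anchor -> reference form, using the reference-form sample.
    return v - mean(anchor_ref) + mean(ref_form)


def chained_linear_equate(x, new_form, anchor_new, anchor_ref, ref_form):
    """Same chain, matching both means and standard deviations."""
    v = mean(anchor_new) + stdev(anchor_new) / stdev(new_form) * (x - mean(new_form))
    return mean(ref_form) + stdev(ref_form) / stdev(anchor_ref) * (v - mean(anchor_ref))


if __name__ == "__main__":
    # Invented scores: (new form, anchor) in one sample,
    # (anchor, reference form) in the other.
    new_form, anchor_new = [10, 20, 30], [5, 10, 15]
    anchor_ref, ref_form = [6, 12, 18], [12, 24, 36]
    print(chained_mean_equate(20, new_form, anchor_new, anchor_ref, ref_form))
    print(chained_linear_equate(20, new_form, anchor_new, anchor_ref, ref_form))
```

Mean equating estimates only four means, which is one plausible reason it can remain stable in very small samples, whereas linear equating also has to estimate four standard deviations.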

11.
This study examined the extent to which log-linear smoothing could improve the accuracy of differential item functioning (DIF) estimates in small samples of examinees. Examinee responses from a certification test were analyzed using White examinees in the reference group and African American examinees in the focal group. Using a simulation approach, separate DIF estimates for seven small-sample-size conditions were obtained using unsmoothed (U) and smoothed (S) score distributions. These small sample U and S DIF estimates were compared to a criterion (i.e., DIF estimates obtained using the unsmoothed total data) to assess their degree of variability (random error) and accuracy (bias). Results indicate that for most studied items smoothing the raw score distributions reduced random error and bias of the DIF estimates, especially in the small-sample-size conditions. Implications of these results for operational testing programs are discussed.

12.
The accuracy of short-cut estimates of standard deviation was investigated for distributions of raw scores on classroom tests and computer-generated samples. Estimates proposed by Mason and Odeh, Jenkins, Diederich, Ebel, and Davenport were compared for relative accuracy in three studies. The loss in accuracy due to short-cut methods versus the conventional statistic ranged from 0 to 7.8%. Based on the findings of these studies, it is recommended that a short-cut method for computing standard deviations be included in courses where the conventional formula is taught.

13.
The purpose of this study was to investigate the methods of estimating the reliability of school-level scores using generalizability theory and multilevel models. Two approaches, ‘students within schools’ and ‘students within schools and subject areas,’ were conceptualized and implemented in this study. Four methods resulting from the combination of these two approaches with generalizability theory and multilevel models were compared for both balanced and unbalanced data. The generalizability theory and multilevel models for the ‘students within schools’ approach produced the same variance components and reliability estimates for the balanced data, while failing to do so for the unbalanced data. The different results from the two models can be explained by the fact that they apply different procedures in estimating the variance components used, in turn, to estimate reliability. Among the estimation methods investigated in this study, the generalizability theory model with the ‘students nested within schools crossed with subject areas’ design produced the lowest reliability estimates. Fully nested designs such as (students:schools) or (subject areas:students:schools) would not have any significant impact on reliability estimates of school-level scores. Both methods provide very similar reliability estimates of school-level scores.
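
For the ‘students within schools’ approach with balanced data, the reliability of a school mean is commonly written as the school variance component divided by the sum of that component and the student-within-school component divided by the number of students per school. The sketch below estimates both components from one-way ANOVA mean squares. It is an illustration under those assumptions, with invented scores and an illustrative function name, and is not the study's own code.

```python
from statistics import mean


def school_mean_reliability(schools):
    """Reliability of school means for a balanced students:schools design.

    schools: list of lists, one inner list of student scores per school.
    """
    n_schools = len(schools)
    n = len(schools[0])
    if any(len(s) != n for s in schools):
        raise ValueError("this sketch assumes a balanced design")
    school_means = [mean(s) for s in schools]
    grand_mean = mean(school_means)
    # Mean squares from a one-way random-effects ANOVA.
    ms_between = n * sum((m - grand_mean) ** 2 for m in school_means) / (n_schools - 1)
    ms_within = sum(
        (x - m) ** 2 for s, m in zip(schools, school_means) for x in s
    ) / (n_schools * (n - 1))
    # Variance components; a negative school estimate is truncated at zero.
    var_school = max((ms_between - ms_within) / n, 0.0)
    var_student = ms_within
    return var_school / (var_school + var_student / n)


if __name__ == "__main__":
    # Invented scores for three schools with three students each.
    print(school_mean_reliability([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

In the balanced case this estimate equals (MS_between - MS_within) / MS_between. With unbalanced data, ANOVA-based and likelihood-based estimates of the variance components need not coincide, which is consistent with the divergence the abstract describes.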

14.
The reliability and validity of a revised version of Finucci's (1982) Reading History Questionnaire was examined in two adult samples. One sample included 84 adults from an ongoing study of familial dyslexia, and a second sample was composed of parents of 107 children from a longitudinal study of reading development. Internal consistency was demonstrated by Cronbach's alphas of .94 and .92 in the two samples. Test-retest reliability was demonstrated by significant correlations (.87 and .84 in the two samples) over several years between an earlier and revised form of the questionnaire. Validity was demonstrated via (a) correlations between the questionnaire score and reading measures (rs = .57-.70), (b) the results of a discriminant function analysis that used questionnaire scores to predict reading disability diagnosis, and (c) the finding that the questionnaire had substantial incremental validity in predicting reading skill in a hierarchical regression analysis that first entered IQ and SES. These results indicated that the questionnaire is both reliable and valid.

15.
This article proposes 2 classes of ridge generalized least squares (GLS) procedures for structural equation modeling (SEM) with unknown population distributions. The weight matrix for the first class of ridge GLS is obtained by combining the sample fourth-order moment matrix with the identity matrix. The weight matrix for the second class is obtained by combining the sample fourth-order moment matrix with its diagonal matrix. Empirical results indicate that, with data from an unknown population distribution, parameter estimates by ridge GLS can be much more accurate than those by either GLS or normal-distribution-based maximum likelihood; and standard errors of the parameter estimates also become more accurate in predicting the empirical ones. Rescaled and adjusted statistics are proposed for overall model evaluation, and they also perform much better than the default statistic following from the GLS method. The use of the ridge GLS procedures is illustrated with a real data set.

16.
This study investigated the extent to which log-linear smoothing could improve the accuracy of common-item equating by the chained equipercentile method in small samples of examinees. Examinee response data from a 100-item test were used to create two overlapping forms of 58 items each, with 24 items in common. The criterion equating was a direct equipercentile equating of the two forms in the full population of 93,283 examinees. Anchor equatings were performed in samples of 25, 50, 100, and 200 examinees, with 50 pairs of samples at each size level. Four equatings were performed with each pair of samples: one based on unsmoothed distributions and three based on varying degrees of smoothing. Smoothing reduced, by at least half, the sample size required for a given degree of accuracy. Smoothing that preserved only two moments of the marginal distributions resulted in equatings that failed to capture the curvilinearity in the population equating.

17.
The purpose of this study was to examine the reliability and validity of the School Anxiety Inventory (SAI) using a sample of 646 Slovenian adolescents (48% boys), ranging in age from 12 to 19 years. Single confirmatory factor analyses replicated the correlated four‐factor structure of scores on the SAI for anxiety‐provoking school situations (Anxiety about School Failure and Punishment, Anxiety about Aggression, Anxiety about Social Evaluation, and Anxiety about Academic Evaluation), and the three‐factor structure of the anxiety response systems (Physiological Anxiety, Cognitive Anxiety, and Behavioral Anxiety). Equality of factor structures was compared using multigroup confirmatory factor analyses. Measurement invariance for the four‐ and three‐factor models was obtained across gender and school‐level samples. The scores of the instrument showed high internal reliability and adequate test–retest reliability. The concurrent validity of the SAI scores was also examined through its relationship with the Social Anxiety Scale for Adolescents (SASA) scores and the Questionnaire about Interpersonal Difficulties for Adolescents (QIDA) scores. Correlations of the SAI scores with scores on the SASA and the QIDA were of low to moderate effect sizes.

18.
The test-retest reliability of the Bender was examined for a sample of 84 reading-disabled children. The time interval between the first and second administrations of the test ranged from 12 to 24 days. The total working time necessary to reproduce the designs was considered along with total Koppitz score and errors of distortion, rotation, integration, and perseveration. The estimated reliability (r = .83) was quite satisfactory, and the estimate for working time (r = .70) was large enough to indicate a relatively stable behavioral dimension. The estimates for errors of distortion (r = .62) and rotation (r = .56) were somewhat smaller. The estimates for integration (r = .33) and perseveration (r = .29) errors were significant but much smaller than those for other scores.

19.
This paper serves as an illustration of the usefulness of structurally incomplete designs as an approach to reduce the length of educational questionnaires. In structurally incomplete test designs, respondents only fill out a subset of the total item set, while all items are still provided to the whole sample. The scores on the unadministered items are subsequently dealt with by using methods for the estimation of missing data. Two structurally incomplete test designs (one recording two thirds, and the other recording a half, of the potentially complete data) were applied to the complete item scores on 8 educational psychology scales. The incomplete item scores were estimated with the missing-data method Data Augmentation. Complete and estimated test data were compared on the estimates of total scores, reliability, and predictive validity of an external criterion. The reconstructed data yielded estimates that were very close to the values in the complete data. As expected, the statistical uncertainty was higher in the design that recorded fewer item scores. It was concluded that the procedure of applying incomplete test designs and subsequently dealing with the missing values is very fruitful for reducing questionnaire length.

20.
Recent methods to improve generalizations from nonrandom samples typically invoke assumptions such as the strong ignorability of sample selection, which is challenging to meet in practice. Although researchers acknowledge the difficulty in meeting this assumption, point estimates are still provided and used without considering alternative assumptions. We compare the point identifying assumption of strong ignorability of sample selection with two alternative assumptions (bounded sample variation and monotone treatment response) that partially identify the parameter of interest, yielding interval estimates. Additionally, we explore the role that population data frames play in contributing identifying power for the interval estimates. We situate the comparison around causal generalization with nonrandom samples by applying the assumptions to a cluster randomized trial in education. Bounds on the population average treatment effect are derived under the alternative assumptions and the case when no assumptions are made on the data. While comparing the bounds, we discuss the plausibility of each alternative assumption and the practical trade-offs. We highlight the importance of thoughtfully considering the role that assumptions play in causal generalization by illustrating the differences in inferences from different assumptions.
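
To make the idea of interval estimates concrete, the sketch below computes textbook worst-case bounds of the kind used in the partial identification literature. It assumes a bounded outcome and treats the trial sample as representing a known share of the population; the unrepresented share could then have any treatment effect between -(y_max - y_min) and +(y_max - y_min). This is an illustration of the no-assumption case only, with hypothetical names and inputs, and is not the authors' derivation.

```python
def no_assumption_pate_bounds(sample_ate, sample_share, y_min, y_max):
    """Worst-case bounds on the population average treatment effect.

    sample_ate:   average treatment effect estimated in the trial sample
    sample_share: share of the population the sample represents (0 to 1)
    y_min, y_max: logical bounds of the outcome
    """
    if not 0.0 <= sample_share <= 1.0:
        raise ValueError("sample_share must lie between 0 and 1")
    if y_max < y_min:
        raise ValueError("y_max must be at least y_min")
    # Largest possible effect magnitude for units outside the sample.
    widest_effect = y_max - y_min
    unsampled_share = 1.0 - sample_share
    lower = sample_share * sample_ate - unsampled_share * widest_effect
    upper = sample_share * sample_ate + unsampled_share * widest_effect
    return lower, upper


if __name__ == "__main__":
    # Hypothetical inputs: effect of 0.2 on a 0-1 outcome,
    # sample representing a quarter of the population.
    print(no_assumption_pate_bounds(0.2, 0.25, 0.0, 1.0))
```

The interval has width 2 × (1 - sample_share) × (y_max - y_min), so it collapses to the sample estimate only when the sample represents the whole population. Assumptions such as those compared in the abstract serve to narrow it.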

