Similar Documents
20 similar documents found (search time: 953 ms)
1.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

2.
Touch screen tablets are increasingly being used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills, and compared tablet-based tests with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests of letter, word, and numeral skills delivered via a tablet. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all αs > .94) and acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching, indicating good predictive validity. Agreement between scores on the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests thus provide valid and reliable measures of children's early literacy skills. Their strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.

3.
This paper illustrates that the psychometric properties of scores and scales that are used with mixed‐format educational tests can impact the use and interpretation of the scores that are reported to examinees. Psychometric properties that include reliability and conditional standard errors of measurement are considered in this paper. The focus is on mixed‐format tests in situations for which raw scores are integer‐weighted sums of item scores. Four associated real‐data examples include (a) effects of weights associated with each item type on reliability, (b) comparison of psychometric properties of different scale scores, (c) evaluation of the equity property of equating, and (d) comparison of the use of unidimensional and multidimensional procedures for evaluating psychometric properties. Throughout the paper, and especially in the conclusion section, the examples are related to issues associated with test interpretation and test use.

4.
Studies investigating invariance have often been limited to measurement or prediction invariance. Selection invariance, wherein the use of test scores for classification results in equivalent classification accuracy between groups, has received comparatively little attention in the psychometric literature. Previous research suggests that some form of selection bias (a lack of selection invariance) will exist in most testing contexts where classification decisions are made, even when the conditions of measurement invariance are met. We define this conflict between measurement and selection invariance as the invariance paradox. Previous research has identified test reliability as an important factor in minimizing selection bias. This study demonstrates that the location of maximum test information may be a more important factor than overall test reliability in minimizing decision errors between groups.

5.
Scientific communication in the field of educational technology was examined by analyzing references from and citations to articles published in Educational Technology Research and Development (ETR&D) for the period 1990–2004 with particular emphasis on other journals found in the citation record. Data were collected on the 369 core articles found in the 60 issues published during that time period, their reference lists (containing over 14,805 individual items), and citations of those articles in other journals (1,896 entries). The top cited and citing journals during that time period are listed. Nine symbiotic journals (i.e. those that are most cited by ETR&D and frequently cite it) were identified: Contemporary Educational Psychology, Educational Psychologist, Instructional Science, Journal of Computer-Based Instruction (no longer published), Journal of Educational Computing Research, Journal of Educational Psychology, Journal of Educational Research, Journal of Research in Science Teaching, and the Review of Educational Research. The results provide an in-depth, quantitative view of informal connections within the field via the citation record. Implications for further research and the potential influence of new technologies on scientific communication are also discussed.

6.
This article introduces procedures for the computation and asymptotic statistical inference for classification consistency and accuracy indices specifically designed for cognitive diagnostic assessments. The new classification indices can serve as important indicators of the reliability and validity of classification results produced by cognitive diagnostic assessments. For tests with known or previously calibrated item parameters, the sampling distributions of the two new indices are shown to be asymptotically normal. To illustrate the computation of the new indices, we apply them to real diagnostic data from a fraction subtraction test (Tatsuoka). We also use simulated data to evaluate their performance and distributional properties.
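The computation behind indices of this kind can be illustrated with a minimal sketch. Assuming each examinee's posterior distribution over the latent attribute patterns is already available from a calibrated diagnostic model, one common posterior-based estimator takes accuracy as the mean maximum posterior and consistency as the mean sum of squared posteriors; the article's indices may differ in detail, and all names and numbers below are hypothetical.

```python
import numpy as np

def classification_indices(posterior):
    """Posterior-based estimates of test-level classification accuracy
    and consistency for a diagnostic assessment.

    posterior: (N examinees x K attribute patterns) array of posterior
    probabilities P(alpha_k | X_i); each row sums to 1.
    """
    posterior = np.asarray(posterior, dtype=float)
    # Accuracy: expected probability that the modal (MAP) classification
    # matches the examinee's true attribute pattern.
    accuracy = posterior.max(axis=1).mean()
    # Consistency: expected probability that two independent parallel
    # administrations classify the examinee identically.
    consistency = (posterior ** 2).sum(axis=1).mean()
    return accuracy, consistency

# Toy posterior matrix: 3 examinees, 4 attribute patterns
post = [[0.70, 0.10, 0.10, 0.10],
        [0.25, 0.25, 0.25, 0.25],
        [0.05, 0.05, 0.00, 0.90]]
print(classification_indices(post))  # well-separated posteriors push both indices toward 1
```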

7.
Reliability of a criterion-referenced test is often viewed as the consistency with which individuals who have taken two strictly parallel forms of a test are classified as being masters or nonmasters. However, in practice, it is rarely possible to retest students, especially with equivalent forms. For this reason, methods for making conservative approximations of alternate form (or test-retest "without the effects of testing") reliability have been developed. Because these methods are computationally tedious and require some psychometric sophistication, they have rarely been used by teachers and school psychologists. This paper (a) describes one method (Subkoviak's) for estimating alternate-form reliability from one administration of a criterion-referenced test and (b) describes a computer program developed by the authors that will handle tests containing hundreds of items for large numbers of examinees and allow any test user to apply the technique described. The program is a superior alternative to other methods of simplifying this estimation procedure that rely upon tables; a user can check classification consistency estimates for several prospective cut scores directly from a data file, without having to make prior calculations.
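For readers who want the gist of Subkoviak's procedure, here is a minimal sketch under a binomial error model: each examinee's true proportion-correct is estimated by regressing the observed proportion toward the group mean using a reliability estimate such as KR-20, and the agreement coefficient averages the probability of identical master/nonmaster classifications on two parallel forms. The function name and toy numbers are hypothetical; the authors' program handles much larger datasets and multiple cut scores.

```python
import numpy as np
from scipy.stats import binom

def subkoviak_agreement(scores, n_items, cut, reliability):
    """Single-administration estimate of the classification agreement
    coefficient, in the spirit of Subkoviak's method.

    scores: raw scores from one administration
    cut: minimum raw score classified as "master"
    reliability: internal-consistency estimate (e.g., KR-20)
    """
    scores = np.asarray(scores, dtype=float)
    mean_prop = scores.mean() / n_items
    # Regress each observed proportion-correct toward the group mean
    tau = reliability * (scores / n_items) + (1 - reliability) * mean_prop
    # P(classified as master on a parallel form), binomial error model
    p_master = binom.sf(cut - 1, n_items, tau)
    # P(same classification on two independent parallel forms), averaged
    return float(np.mean(p_master ** 2 + (1 - p_master) ** 2))

print(subkoviak_agreement([18, 12, 25, 22, 9], n_items=30, cut=21, reliability=0.85))
```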

8.
Research Findings: Few rating scales measure social competence in very young Spanish- or Catalan-speaking children. We aimed to analyze the psychometric characteristics of the California Preschool Social Competence Scale (CPSCS) when applied to a Spanish- and Catalan-speaking population. Children were rated by their teachers within 6 months of their 4th birthday in two population-based birth cohorts in Spain (N = 378). A confirmatory factor analysis (CFA) was used to compare the underlying structure of the Spanish–Catalan version with that of the original version. Cronbach's alpha was used to determine the internal consistency of each confirmed factor, and Cohen's kappa was used to calculate test–retest reliability in a small subset of children who were rated again one month later. Five correlated factors (Considerateness, Task Orientation, Extraversion, Verbal Facility, and Response to Unfamiliar) were confirmed by the CFA. The first three factors had robust internal consistency, and the kappa coefficient was satisfactory for 29 of the 30 items. Children's cognitive abilities (assessed with the McCarthy Scales), gender, maternal social class, and maternal education were related to social competence scores, providing evidence of criterion-related validity. Practice or Policy: The bilingual version of the CPSCS has good psychometric properties, allowing it to be used in further studies in either Spanish or Catalan populations.

9.
This study evaluated the classification accuracy of a second grade oral reading fluency curriculum‐based measure (R‐CBM) in predicting third grade state test performance. It also compared the long‐term classification accuracy of local and publisher‐recommended R‐CBM cut scores. Participants were 266 students who were divided into a calibration sample (n = 170) and two cross‐validation samples (n = 46; n = 50), respectively. Using calibration sample data, local fall, winter, and spring R‐CBM cut scores for predicting students' state test performance were developed using three methods: discriminant analysis (DA), logistic regression (LR), and receiver operating characteristic curve analysis (ROC). The classification accuracy of local and publisher‐recommended cut scores was evaluated across subsamples. Only DA and ROC produced cut scores that maintained adequate sensitivity (≥.70) across cohorts; however, LR and publisher‐recommended scores had higher levels of specificity and overall correct classification. Implications for developing local cut scores are discussed.
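As a concrete illustration of the ROC approach to setting a local cut score, the sketch below picks the R-CBM score that maximizes Youden's J (sensitivity + specificity − 1), one common criterion in ROC analysis; the study may have used a different criterion, and the function name and data are hypothetical.

```python
import numpy as np

def youden_cut(rcbm, met_standard):
    """Choose the R-CBM cut score maximizing Youden's J = sens + spec - 1."""
    rcbm = np.asarray(rcbm, dtype=float)
    met = np.asarray(met_standard, dtype=bool)  # True = passed the state test
    best_cut, best_j = None, -np.inf
    for c in np.unique(rcbm):
        pred = rcbm >= c                  # predicted to pass at this cut
        sens = pred[met].mean()           # proportion of true passers flagged
        spec = (~pred)[~met].mean()       # proportion of true failers screened out
        j = sens + spec - 1.0
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j

# Hypothetical fluency scores and eventual state-test outcomes
print(youden_cut([35, 62, 48, 90, 22, 71], [False, True, False, True, False, True]))
```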

10.
With a focus on performance assessments, this paper describes procedures for calculating conditional standard error of measurement (CSEM) and reliability of scale scores and classification consistency of performance levels. Scale scores that are transformations of total raw scores are the focus of these procedures, although other types of raw scores are considered as well. Polytomous IRT models provide the psychometric foundation for the procedures that are described. The procedures are applied using test data from ACT's Work Keys Writing Assessment to demonstrate their usefulness. Two polytomous IRT models were compared, as were two different procedures for calculating scores. One simulation study was done using one of the models to evaluate the accuracy of the proposed procedures. The results suggest that the procedures provide quite stable estimates and have the potential to be useful in a variety of performance assessment situations.
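The core machinery such procedures typically rest on is the Lord–Wingersky recursion: at a fixed θ, convolve the items' category-probability vectors to obtain the conditional raw-score distribution, then push that distribution through the raw-to-scale transformation to get the scale-score CSEM. The sketch below assumes category probabilities at a given θ have already been computed from a fitted polytomous IRT model; all names and numbers are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def raw_score_dist(cat_probs):
    """Lord-Wingersky recursion: P(raw score x | theta) at one theta.

    cat_probs: list of 1-D arrays; cat_probs[j][k] = P(score k on item j | theta).
    """
    dist = np.array([1.0])
    for p in cat_probs:
        new = np.zeros(len(dist) + len(p) - 1)
        for k, pk in enumerate(p):
            new[k:k + len(dist)] += pk * dist  # convolve item j into the running total
        dist = new
    return dist

def scale_score_csem(cat_probs, raw_to_scale):
    """CSEM of the scale score at a fixed theta.

    raw_to_scale: array mapping each raw score x to its reported scale score.
    """
    dist = raw_score_dist(cat_probs)
    s = np.asarray(raw_to_scale, dtype=float)
    mean_s = dist @ s
    return float(np.sqrt(dist @ (s - mean_s) ** 2))

# Two hypothetical 3-category items (scores 0-2) at some fixed theta
items = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.4, 0.5])]
print(scale_score_csem(items, raw_to_scale=[100, 110, 120, 130, 140]))
```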

11.
A common suggestion made in the psychometric literature for fixed‐length classification tests is that one should design tests so that they have maximum information at the cut score. Designing tests in this way is believed to maximize the classification accuracy and consistency of the assessment. This article uses simulated examples to illustrate that one can obtain higher classification accuracy and consistency by designing tests that have maximum test information at locations other than the cut score. We show that the location where one should maximize the test information depends on the length of the test, the mean of the ability distribution relative to the cut score, and, to a lesser degree, whether one wants to optimize classification accuracy or consistency. Analyses also suggested that the differences in classification performance between designing tests optimally and maximizing information at the cut score tended to be greatest when tests were short and the mean of the ability distribution was farther from the cut score. Larger differences were also found in the simulated examples that used the 3PL model than in those that used the Rasch model.
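A rough way to see why the optimum need not sit at the cut score is a normal approximation to the ability estimate (a sketch, not the article's derivation): taking SE(θ) = 1/√I(θ) and z_θ = (θ − θ_c)√I(θ), the correct-classification and consistent-classification probabilities are approximately

```latex
P(\text{correct} \mid \theta) \approx \Phi\!\bigl(\lvert z_\theta \rvert\bigr),
\qquad
\text{Accuracy} \approx \int \Phi\!\bigl(\lvert z_\theta \rvert\bigr)\, f(\theta)\, d\theta,
\qquad
\text{Consistency} \approx \int \Bigl[\Phi(z_\theta)^{2} + \bigl(1-\Phi(z_\theta)\bigr)^{2}\Bigr] f(\theta)\, d\theta .
```

Because both integrals weight the information function by the ability density f(θ), shifting the information peak toward the mass of f(θ) can outperform peaking it exactly at θ_c, especially for short tests.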

12.
13.
The aim of this study is to link the science scale of the German National Educational Panel Study (NEPS) with the science scale of the Programme for International Student Assessment (PISA). One requirement for a strong linking of test scores from different studies is sufficient similarity of the tests with respect to their constructs. The present study therefore assesses the similarity of the operationalized constructs of the NEPS and PISA scientific literacy tests in order to link the two scales. For this purpose, a linking study was carried out in which 1,079 students worked on the tasks of both studies. The comparison between NEPS and PISA indicated a high overlap between the constructs; however, the two studies treat missing responses differently. Linking via equipercentile equating showed high classification consistency, which was highest when missing responses were ignored in both studies.
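Equipercentile linking itself is conceptually simple: a NEPS score is mapped to the PISA score with the same percentile rank. The following is a deliberately crude sketch (no continuization, presmoothing, or the missing-response handling the study compares); the names and data are hypothetical.

```python
import numpy as np

def equipercentile_link(x, x_scores, y_scores):
    """Map score x on test X to the test-Y score at the same percentile rank."""
    p = np.mean(np.asarray(x_scores) <= x)   # percentile rank of x on test X
    return float(np.quantile(y_scores, p))   # test-Y score at that rank

neps = [12, 15, 18, 20, 23, 25, 28]          # hypothetical NEPS raw scores
pisa = [410, 445, 470, 500, 520, 555, 590]   # hypothetical PISA scale scores
print(equipercentile_link(20, neps, pisa))
```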

14.
The present study compared the performance of six cognitive diagnostic models (CDMs) in exploring inter-skill relationships in a reading comprehension test. To this end, item responses of about 21,642 test-takers to a high-stakes reading comprehension test were analyzed. The models were compared in terms of model fit at both the test and item levels, classification consistency and accuracy, and the proportions of skill mastery profiles. The results showed that the G-DINA performed best, with the C-RUM, NC-RUM, and ACDM showing the closest affinity to the G-DINA. On some criteria, the DINA performed comparably to the G-DINA. The test-level results were corroborated by the item-level model comparison, where the DINA, DINO, and ACDM variously fit some of the items. The results suggest that relationships among the subskills of reading comprehension may be a combination of compensatory and non-compensatory. It is therefore suggested that the choice of CDM be made at the item level rather than the test level.

15.
This study investigates the comparability of two item response theory based equating methods: true score equating (TSE), and estimated true equating (ETE). Additionally, six scaling methods were implemented within each equating method: mean-sigma, mean-mean, two versions of fixed common item parameter, Stocking and Lord, and Haebara. Empirical test data were examined to investigate the consistency of scores resulting from the two equating methods, as well as the consistency of the scaling methods both within equating methods and across equating methods. Results indicate that although the degree of correlation among the equated scores was quite high, regardless of equating method/scaling method combination, non-trivial differences in equated scores existed in several cases. These differences would likely accumulate across examinees making group-level differences greater. Systematic differences in the classification of examinees into performance categories were observed across the various conditions: ETE tended to place lower ability examinees into higher performance categories than TSE, while the opposite was observed for high ability examinees. Because the study was based on one set of operational data, the generalizability of the findings is limited and further study is warranted.
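For reference, true score equating works by inverting the test characteristic curve (TCC): find the θ* at which Form X's TCC equals the raw score, then read off Form Y's TCC at θ*. A minimal sketch, assuming Rasch item characteristic curves and hypothetical difficulties (with a 3PL, scores below the sum of the guessing parameters would need special handling):

```python
import numpy as np
from scipy.optimize import brentq

def true_score_equate(x, tcc_x, tcc_y, lo=-6.0, hi=6.0):
    """IRT true score equating: solve TCC_X(theta*) = x, return TCC_Y(theta*)."""
    theta_star = brentq(lambda t: tcc_x(t) - x, lo, hi)
    return tcc_y(theta_star)

def rasch_tcc(difficulties):
    """Test characteristic curve (sum of item expected scores) for Rasch items."""
    bs = np.asarray(difficulties, dtype=float)
    return lambda t: float(np.sum(1.0 / (1.0 + np.exp(-(t - bs)))))

tcc_x = rasch_tcc([-1.0, 0.0, 1.0])   # hypothetical Form X difficulties
tcc_y = rasch_tcc([-0.5, 0.2, 0.8])   # hypothetical Form Y difficulties
print(true_score_equate(1.8, tcc_x, tcc_y))  # Form Y equivalent of Form X raw score 1.8
```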

16.
This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true-score distribution is estimated by fitting a 4-parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores.
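The effective test length step is the distinctive move here. As usually presented in this line of work (a hedged reconstruction, not necessarily the article's exact notation), a score with reliability r, mean μ, variance σ², and possible range [l, u] is converted into an equivalent number of discrete items:

```latex
% mu, sigma^2: observed-score mean and variance; r: reliability;
% l, u: minimum and maximum possible scores
\tilde{n} \;=\; \frac{(\mu - l)\,(u - \mu) \;-\; r\,\sigma^{2}}{\sigma^{2}\,(1 - r)}
```

The fitted 4-parameter beta true-score distribution and the binomial conditional distribution based on ñ then combine, under conditional independence, to produce the accuracy and consistency estimates.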

17.
In criterion‐referenced tests (CRTs), the traditional measures of reliability used in norm‐referenced tests (NRTs) have often proved problematic because of NRT assumptions of one underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or most test takers is expected, and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classical NRT reliability theory, and estimating their reliability has traditionally required the logistical burden of administering multiple tests to the same test takers. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all sources of error and estimates of consistency in test outcome. For a single administration of a CRT, Livingston's k² is recommended for estimates sensitive to all sources of error, and Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
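Livingston's coefficient generalizes classical reliability by measuring squared deviations from the cut score C rather than from the mean, so it remains informative even when score variance is small. As usually written (with r_XX' the classical reliability, μ_X and σ²_X the observed-score mean and variance):

```latex
k^{2} \;=\; \frac{r_{XX'}\,\sigma_{X}^{2} \;+\; (\mu_X - C)^{2}}{\sigma_{X}^{2} \;+\; (\mu_X - C)^{2}}
```

As μ_X moves away from C, k² approaches 1 even for modest r_XX', which is exactly the low-variance mastery-testing situation the abstract describes.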

18.
A scale was developed to assess primary school Teachers' Self-Efficacy on Education for Sustainable Development (TSESESD). It covers four domains of competence: values and ethics, systems thinking, emotions and feelings, and actions. The scale development is consistent with key principles of educational and social psychology research. Nine hundred twenty-four (924) primary education student teachers and 88 in-service primary teachers participated in the study. Findings demonstrated that the TSESESD has good psychometric properties, strong validity and reliability scores, adequate internal consistency (Cronbach's α = 0.97), and satisfactory mean inter-correlation of items within domains (M = 0.78). The TSESESD is considered a reliable instrument for teacher preparation programs aiming to develop primary school teachers' self-efficacy in ESD.

19.

Strategic Uncertainties: Ethics, Politics and Risk in Contemporary Educational Research. Phyllida Coombes, Mike Danaher, Patrick Alan Danaher (Editors). Post Pressed, Flaxton Qld, 2004, 210 pp. ISBN 1876682728 (paperback) AUD$29.50

Making Hope Practical. School Reform for Social Justice. Peter McInerney. Post Pressed, Qld, 2004, 244 pp. ISBN 187668271X AUD$45.00

20.
Selected marker tests of Educational Testing Service and Sheridan Psychological Services, Inc., were examined for problems in scoring and internal consistency. The tests were administered orally to 116 sixth- and seventh-grade students. Problems in scoring were discovered and changes were suggested. Agreement between two independent judges on part scores was high: twenty-one of 28 correlations were .90 or above. Item correlations with part and total scores, using Cureton's correction, were frequently very low, and many items did not meet the desirable difficulty level for norm-referenced tests. The study suggests that, with the sample used, the problem is not one of agreement among judges but rather one of item reliability.
