Similar Articles
 20 similar articles found
1.
The Trends in International Mathematics and Science Study (TIMSS) is a comparative assessment of the achievement of students in many countries. In the present study, a rigorous independent evaluation was conducted of a representative sample of TIMSS science test items because item quality influences the validity of the scores used to inform educational policy in those countries. The items had been administered internationally to 16,009 students in their eighth year of formal schooling. The evaluation had three components. First, the Rasch model, which emphasizes high quality items, was used to evaluate the items psychometrically. Second, readability and vocabulary analyses were used to evaluate the wording of the items to ensure they were comprehensible to the students. And third, item development guidelines were used by a focus group of science teachers to evaluate the items in light of the TIMSS assessment framework, which specified the format, content, and cognitive domains of the items. The evaluation components indicated that the majority of the items were of high quality, thereby contributing to the validity of TIMSS scores. These items had good psychometric characteristics, readability, vocabulary, and compliance with the assessment framework. Overall, the items tended to be difficult: constructed response items assessing reasoning or application were the most difficult, and multiple choice items assessing knowledge or application were less difficult. The teachers revised some of the sampled items to improve their clarity of content, conciseness of wording, and fit with format specifications. For TIMSS, the findings imply that some of the non-sampled items may need revision, too. For researchers and teachers, the findings imply that the TIMSS science items and the Rasch model are valuable resources for assessing the achievement of students. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 1321–1344, 2012
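For reference, the dichotomous Rasch model invoked in this and several of the abstracts below expresses the probability of a correct response as a function of the difference between a person's ability and an item's difficulty (the notation θ_n and b_i is added here for illustration):

\[
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
\]

Items whose observed response patterns depart systematically from these expected probabilities show up as misfit, which is the sense in which the model "emphasizes high quality items."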

2.
Mathematical word problems represent a common item format for assessing student competencies. Automatic item generation (AIG) is an effective way of constructing many items with predictable difficulties, based on a set of predefined task parameters. The current study presents a framework for the automatic generation of probability word problems based on templates that allow for the generation of word problems involving different topics from probability theory. It was tested in a pilot study with N = 146 German university students. The items show a good fit to the Rasch model. Item difficulties can be explained by the Linear Logistic Test Model (LLTM) and by the random-effects LLTM. The practical implications of these findings for future test development in the assessment of probability competencies are also discussed.
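The LLTM mentioned here constrains the Rasch item difficulties to be a weighted sum of the difficulties of the predefined task parameters; the random-effects variant adds an item-specific error term for difficulty variance that the task parameters do not explain. In outline (notation added for illustration, with q_ik taken from the templates' design matrix):

\[
b_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c,
\qquad
b_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c + \varepsilon_i,\quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2),
\]

where η_k is the difficulty contribution of task parameter k and c is a normalization constant. A close LLTM fit means the difficulties of generated items are predictable from the templates, which is what makes AIG with predictable difficulties workable.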

3.
《教育实用测度》2013,26(3):233-241
Tests of educational achievement typically present items in the multiple-choice format. Some achievement test items may be so "saturated with aptitude" (Willingham, 1980) as to be insensitive to skills acquired through education. Multiple-choice tests are ill-suited for assessing productive thinking and problem-solving skills, skills that often constitute important objectives of education. Viewed as incentives for learning, multiple-choice tests may impede student progress toward these objectives. There is need for accelerated research to develop alternatives to multiple-choice achievement tests, with content selected to match the specified educational objectives.

4.
This study used a construct map to design a 30-item test consisting of six sections; item formats included spelling, multiple-choice, and short-answer items. A total of 175 children aged 6 to 14 took the test. Rasch analysis showed that local item dependence within testlets was not serious, reliability was 0.85, and item difficulty was well matched to examinee ability. Because the items were written from the construct map, the test has a reasonable degree of content validity, although the difficulties of 9 items deviated slightly from expectation. Five items did not fit the Rasch model well, and no notable differential item functioning by gender was found. Examinee ability was positively correlated with time spent learning English. Finally, technical issues in ICT-based remote computerized adaptive testing are discussed.

5.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large-scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person-fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person-fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person-fit statistic. The approach can be applied to all major testing programs: paper-and-pencil testing (P&P), computer-based testing (CBT), multiple-stage testing (MST), and CAT. Also, the definition of test center is not limited by the geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test-prep center, from the same group at a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre-knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
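As a rough illustration of Stage 1, the sketch below flags test centers whose empirical distribution of a person-fit statistic diverges from the pooled distribution by more than a chosen Kullback–Leibler threshold. All names and the threshold are hypothetical, and it assumes a person-fit value has already been computed for each examinee; it is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

def flag_centers(fit_by_center, bins=20, threshold=0.5):
    """Stage 1 sketch: flag centers with an unusual person-fit distribution.

    fit_by_center : dict mapping center id -> 1-D array of person-fit values
    threshold     : KL-divergence cutoff (hypothetical; in practice it would be
                    calibrated, e.g., by simulation under no collusion)
    """
    pooled = np.concatenate(list(fit_by_center.values()))
    edges = np.histogram_bin_edges(pooled, bins=bins)

    # Pooled reference distribution, lightly smoothed to avoid zero cells.
    q, _ = np.histogram(pooled, bins=edges)
    q = (q + 1e-6) / (q + 1e-6).sum()

    flagged = []
    for center, values in fit_by_center.items():
        p, _ = np.histogram(values, bins=edges)
        p = (p + 1e-6) / (p + 1e-6).sum()
        kl = np.sum(p * np.log(p / q))  # KL divergence D(p || q)
        if kl > threshold:
            flagged.append(center)
    return flagged
```

In Stage 2, examinees from the flagged centers would then be screened with the person-fit statistic itself, with the critical value computed from the non-flagged centers only.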

6.
Identifying the Causes of DIF in Translated Verbal Items
Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, it could have important implications for test development, scoring, and equating. This study focuses on two questions: "Is DIF related to item type?" and "What are the causes of DIF?" The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic, with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

7.
The problem of assessing the content validity (or relevance) of standardized achievement tests is considered within the framework of generalizability theory. Four illustrative designs are described that may be used to assess test-item fit to a curriculum. For each design, appropriate variance components are identified for making relative and absolute item (or test) selection decisions. Special consideration is given to use of these procedures for determining the number of raters and/or schools needed in a content-validation decision-making study. Application of these procedures is illustrated using data from an international assessment of mathematics achievement.
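In the simplest of such designs (items crossed with raters, with items or tests as the objects of measurement), the coefficients for relative and absolute selection decisions take the familiar one-facet forms below; the article's four designs may add facets such as schools, but the role of the number of raters n_r in a decision study is the same (notation added here for illustration):

\[
E\rho^2 = \frac{\sigma^2_{i}}{\sigma^2_{i} + \sigma^2_{ir,e}/n_r},
\qquad
\Phi = \frac{\sigma^2_{i}}{\sigma^2_{i} + \bigl(\sigma^2_{r} + \sigma^2_{ir,e}\bigr)/n_r},
\]

where σ²_i is the variance among items in rated curriculum relevance, σ²_r the rater variance, and σ²_{ir,e} the residual. Solving for the n_r that pushes the chosen coefficient above a target value gives the number of raters needed.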

8.
Individual person fit analyses provide important information regarding the validity of test score inferences for an individual test taker. In this study, we use data from an undergraduate statistics test (N = 1135) to illustrate a two-step method that researchers and practitioners can use to examine individual person fit. First, person fit is examined numerically with several indices based on the Rasch model (i.e., Infit, Outfit, and Between-Subset statistics). Second, person misfit is presented graphically with person response functions, and these person response functions are interpreted using a heuristic. Individual person fit analysis holds promise for improving score interpretation in that it may detect potential threats to validity of score inferences for some test takers. Individual person fit analysis may also highlight particular subsets of items (on which a test taker performs unexpectedly) that can be used to further contextualize her or his test performance.
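For concreteness, Rasch-based Outfit and Infit for a single person can be computed from the standardized residuals of that person's item responses. The sketch below assumes dichotomous items and previously calibrated item difficulties, and is an illustration of the standard mean-square indices rather than the exact procedure used in the study (which also included a Between-Subset statistic).

```python
import numpy as np

def person_outfit_infit(responses, theta, difficulties):
    """Rasch person-fit sketch for dichotomous items.

    responses    : 1-D array of 0/1 scores for one person
    theta        : the person's estimated ability
    difficulties : 1-D array of calibrated item difficulties
    """
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))  # model-expected scores
    w = p * (1.0 - p)                                   # response variances
    z2 = (responses - p) ** 2 / w                       # squared standardized residuals

    outfit = z2.mean()                  # unweighted mean-square (outlier-sensitive)
    infit = np.sum(w * z2) / np.sum(w)  # information-weighted mean-square
    return outfit, infit
```

Values near 1.0 indicate responses consistent with the Rasch model; values well above 1.0 signal unexpected responses that, as the abstract notes, can then be examined graphically with person response functions.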

9.
This paper discusses various issues involved in using the Rasch model with multiple choice tests. By presenting a modified test that is much more powerful, the value of Wright and Panchapakesan's test as evidence of model fit is shown to be questionable. According to the new test, the model failed to fit 68% of the items in the Anchor Test Study. Effects of such misfit on test equating are demonstrated. Results of some past studies purporting to support the Rasch model are shown to be irrelevant, or to yield the conclusion that the Rasch model did not fit the data. Issues like "objectivity" and consistent estimation are shown to be unimportant in selection of a latent trait model. Thus, available evidence shows the Rasch model to be unsuitable for multiple choice items.

10.
The purposes of this study were to (a) test the hypothesized factor structure of the Student-Teacher Relationship Scale (STRS; Pianta, 2001) for 308 African American (AA) and European American (EA) children using confirmatory factor analysis (CFA) and (b) examine the measurement invariance of the factor structure across AA and EA children. CFA of the hypothesized three-factor model with correlated latent factors did not yield an optimal model fit. Parameter estimates obtained from CFA identified items with low factor loadings and R2 values, suggesting that content revision is required for those items on the STRS. Deletion of two items from the scale yielded a good model fit, suggesting that the remaining 26 items reliably and validly measure the constructs for the whole sample. Tests for configural invariance, however, revealed that the underlying constructs may differ for AA and EA groups. Subsequent exploratory factor analyses (EFAs) for AA and EA children were carried out to investigate the comparability of the measurement model of the STRS across the groups. The results of EFAs provided evidence suggesting differential factor models of the STRS across AA and EA groups. This study provides implications for construct validity research and substantive research using the STRS given that the STRS is extensively used in intervention and research in early childhood education.

11.
In this study, the psychometric properties of the scenario-based Achievement Guilt and Shame Scale (AGSS) were established. The AGSS and scales assessing interpersonal guilt and shame, high standards, overgeneralization, self-criticism, self-esteem, academic self-concept, fear of failure, and tendency to respond in a socially desirable manner were completed by 322 undergraduate students. A confirmatory factor analysis indicated that a 12-scenario model had an acceptable fit to the data, with guilt and shame items forming separate, weakly correlated subscales. Each of the guilt and shame subscales of the AGSS demonstrated good internal and test–retest reliability. Good construct validity was also evident, with each subscale uniquely correlating with constructs in ways that were consistent with predictions. Acceptable discriminant validity was also evident. These outcomes provide support for the utility of the AGSS in assessing guilt and shame reactions in achievement situations.

12.
The purpose of this study was to evaluate the adequacy of three cognitive models, one developed by content experts and two generated from student verbal reports, for explaining examinee performance on a grade 3 diagnostic mathematics test. For this study, the items were developed to directly measure the attributes in the cognitive model. The performance of each cognitive model was evaluated by examining its fit to different data samples (verbal report, total, high-, moderate-, and low-ability) using the Hierarchy Consistency Index (Cui & Leighton, 2009), a model-data fit index. This study utilized cognitive diagnostic assessments developed under the framework of construct-centered test design and analyzed using the Attribute Hierarchy Method (Gierl, Wang, & Zhou, 2008; Leighton, Gierl, & Hunka, 2004). Both the expert-based and the student-based cognitive models provided excellent fit to the verbal report and high-ability samples, but moderate to poor fit to the total, moderate-, and low-ability samples. Implications for cognitive model development for cognitive diagnostic assessment are discussed.

13.
14.
The Progressive Matrices items require varying degrees of analytical reasoning. Individuals high on the underlying trait measured by the Raven should score high on the test. Latent trait models applied to data of the Raven form provide a useful methodology for examining the tenability of the above hypothesis. In this study the Rasch latent trait model was applied to investigate the fit of observed performance on Raven items to what was expected by the model for individuals at six different levels of the underlying scale. For the most part the model showed a good fit to the test data. The findings were similar to previous empirical work that has investigated the behavior of Rasch test scores. In three instances, however, the item fit statistic was relatively large. A closer study of the "misfitting" items revealed that two items were of extreme difficulty, which is likely to contribute to the misfit. The study raises issues about the use of the Rasch model in instances of small samples. Other issues related to the application of the Rasch model to Raven-type data are discussed.

15.
To ensure the validity of statistical results, model-data fit must be evaluated for each item. In practice, certain actions or treatments are needed for misfit items. If all misfit items are treated, much item information is lost during calibration. On the other hand, if only severely misfit items are treated, the inclusion of the remaining misfit items may invalidate the statistical inferences based on the estimated item response models. Hence, given response data, one has to find a balance between treating too few and too many misfit items. In this article, misfit items are classified into three categories based on the extent of misfit. Accordingly, three different item treatment strategies are proposed for determining which categories of misfit items should be treated. The impact of using different strategies is investigated. The results show that the test information functions obtained under different strategies can be substantially different in some ability ranges.
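The test information functions being compared are the sums of item information over whichever items are retained in calibration; for dichotomous items under the Rasch model, item information reduces to P_i(θ)[1 − P_i(θ)] (a 2PL item would add a squared discrimination factor):

\[
I(\theta) = \sum_{i=1}^{n} I_i(\theta), \qquad I_i(\theta) = P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr).
\]

Treating or removing misfit items therefore changes I(θ) most in the ability ranges where those items are most informative, which is why different strategies can diverge substantially in some ranges.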

16.
This study examined the relationship of multiple-choice and free-response items contained on the College Board's Advanced Placement Computer Science (APCS) examination. Confirmatory factor analysis was used to test the fit of a two-factor model where each item format marked its own factor. Results showed a single-factor solution to provide the most parsimonious fit in each of two random-half samples. This finding might be accounted for by several mechanisms, including overlap in the specific processes assessed by the multiple-choice and free-response items and the limited opportunity for skill differentiation afforded by the year-long APCS course.

17.
Divgi's (1986) study concludes, largely on the basis of a proposed test of fit that is designed to be more powerful than Wright and Panchapakesan's (1969) test, that the Rasch model is never appropriate for use with multiple-choice test items. This paper is an attempt to refute the conclusions of Divgi's study.

18.
This article proposes a model-based procedure, intended for personality measures, for exploiting the auxiliary information provided by the certainty with which individuals answer every item (response certainty). This information is used to (a) obtain more accurate estimates of individual trait levels, and (b) provide a more detailed assessment of the consistency with which the individual responds to the test. The base model consists of two submodels: an item response theory submodel for the responses, and a linear-in-the-coefficients submodel that describes the response certainties. The latter is based on the distance-difficulty hypothesis, and is parameterized as a factor-analytic model. Procedures for (a) estimating the structural parameters, (b) assessing model–data fit, (c) estimating the individual parameters, and (d) assessing individual fit are discussed. The proposal was used in an empirical study. Model–data fit was acceptable and estimates were meaningful. Furthermore, the precision of the individual trait estimates and the assessment of individual consistency improved noticeably.

19.
Efficacy of the Measure of Understanding of Macroevolution (MUM) as a measurement tool has been a point of contention among scholars needing a valid measure for knowledge of macroevolution. We explored the structure and construct validity of the MUM using Rasch methodologies in the context of a general education biology course designed with an emphasis on macroevolution content. The Rasch model was utilized to quantify item- and test-level characteristics, including dimensionality, reliability, and fit with the Rasch model. Contrary to previous work, we found that the MUM provides a valid, reliable, and unidimensional scale for measuring knowledge of macroevolution in introductory non-science majors, and that its psychometric behavior does not exhibit large changes across time. While we found that all items provide productive measurement information, several depart substantially from ideal behavior, warranting a collective effort to improve these items. Suggestions for improving the measurement characteristics of the MUM at the item and test levels are put forward and discussed.

20.
Educational Assessment, 2013, 18(3), 201–224
This article discusses an approach to analyzing performance assessments that identifies potential reasons for misfitting items and uses this information to improve on items and rubrics for these assessments. Specifically, the approach involves identifying psychometric features and qualitative features of items and rubrics that may possibly influence misfit; examining relations between these features and the fit statistic; conducting an analysis of student responses to a sample of misfitting items; and finally, based on the results of the previous analyses, modifying characteristics of the items or rubrics and reexamining fit. A mathematics performance assessment containing 53 constructed-response items scored on a holistic scale from 0 to 4 is used to illustrate the approach. The 2-parameter graded response model (Samejima, 1969) is used to calibrate the data. Implications of this method of data analysis for improving performance assessment items and rubrics are discussed as well as issues and limitations related to the use of the approach.
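For context, the graded response model used here describes the probability of scoring in category k or higher on item i with a 2PL-type boundary curve and obtains category probabilities by differencing (notation added for illustration):

\[
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\!\bigl[-a_i(\theta - b_{ik})\bigr]},
\qquad
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
\]

with P*_{i0}(θ) = 1 and P*_{i,m+1}(θ) = 0 for an item scored 0 to m. Misfit for a 0–4 holistic item can therefore be localized to particular score boundaries, which is one way the item- and rubric-level diagnosis described in the abstract can be approached.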

