Similar articles
 Found 20 similar articles (search time: 78 ms)
1.
Gender fairness in testing can be impeded by the presence of differential item functioning (DIF), which potentially causes test bias. In this study, the presence and causes of gender-related DIF were investigated with real data from 800 items answered by 250,000 test takers. DIF was examined using the Mantel–Haenszel and logistic regression procedures. Little DIF was found in the quantitative items and a moderate amount was found in the verbal items. Vocabulary items sampled from traditionally female domains favored women, whereas items sampled from traditionally male domains generally did not favor men. The sentence completion item format in the English reading comprehension subtest favored men regardless of content. The findings, if supported in a cross-validation study, can potentially lead to changes in how vocabulary items are sampled and in the use of the sentence completion format in English reading comprehension, thereby increasing gender fairness in the examined test.
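
As an illustration of the Mantel–Haenszel procedure named in this abstract, here is a minimal Python sketch for one dichotomous item. It is not the study's own code. The function name `mantel_haenszel_dif`, the `(group, total_score, correct)` record layout, and the use of the total score as the matching variable are assumptions made for the example. The constant -2.35 is the usual ETS delta-scale conversion.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(responses):
    """Return (alpha_MH, MH D-DIF) for a single dichotomous item.

    `responses` is an iterable of (group, total_score, correct) tuples,
    where group is "ref" or "focal" and correct is 0 or 1.
    This record layout is a hypothetical choice for the sketch.
    """
    # One 2x2 table per matching-score stratum:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in responses:
        column = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][column] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n    # reference correct x focal incorrect
        denominator += b * c / n  # reference incorrect x focal correct

    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio undefined: a cross-product sum is zero")

    alpha = numerator / denominator   # common odds ratio across strata
    delta = -2.35 * math.log(alpha)   # negative values favor the reference group
    return alpha, delta
```

A common odds ratio above 1 (negative MH D-DIF) indicates that, among test takers matched on total score, the item is easier for the reference group. Real analyses usually add a significance test and a purification step for the matching score, which this sketch omits.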

2.
Applied Measurement in Education, 2013, 26(2): 175-199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of DIF with respect to gender, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, which were distributed within four test forms, were evaluated with respect to gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items that were detected as functioning differently for male and female students, several plausible factors for DIF were discussed. The results from the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPP1, which favored female students rather than their matched male students, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of Item PPP1 because they were both in the same test form.

3.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross-sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross-sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context-dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items. Anat Sci Educ © 2012 American Association of Anatomists.

4.
The effect of item parameters (discrimination, difficulty, and level of guessing) on the item-fit statistic was investigated using simulated dichotomous data. Nine tests were simulated using 1,000 persons, 50 items, three levels of item discrimination, three levels of item difficulty, and three levels of guessing. The item fit was estimated using two fit statistics: the likelihood ratio statistic (X²_B), and the standardized residuals (SRs). All the item parameters were simulated to be normally distributed. Results showed that the levels of item discrimination and guessing affected the item-fit values. As the level of item discrimination or guessing increased, item-fit values increased and more items misfit the model. The level of item difficulty did not affect the item-fit statistic.

5.
This article demonstrates the utility of restricted item response models for examining item difficulty ordering and slope uniformity for an item set that reflects varying cognitive processes. Twelve sets of paired algebra word problems were developed to systematically reflect various types of cognitive processes required for successful performance. This resulted in a total of 24 items. They reflected distance-rate-time (DRT), interest, and area problems. Hypotheses concerning difficulty ordering and slope uniformity for the items were tested by constraining item difficulty and discrimination parameters in hierarchical item response models. The first set of model comparisons tested the equality of the discrimination and difficulty parameters for each set of paired items. The second set of model comparisons examined slope uniformity within the complex DRT problems. The third set of model comparisons examined whether the familiarity of the story context affected item difficulty for two types of complex DRT problems. The last set of model comparisons tested the hypothesized difficulty ordering of the items.

6.
Researchers interested in exploring substantive group differences are increasingly attending to bundles of items (or testlets): the aim is to understand how gender differences, for instance, are explained by differential performances on different types or bundles of items, hence differential bundle functioning (DBF). Some previous work has modelled hierarchies in data in this context or considered item responses within persons, but here we model the bundles themselves as explanatory variables at the item level potentially explaining significant intra-class correlation due to gender differences in item difficulty, and thus explaining variation at the second item level. In this study, we analyse DBF using single- and two-level models (the latter modelling random item effects, which models responses at Level 1 and items at Level 2) in a high-stakes National Mathematics test. The models show comparable regression coefficients but the statistical significances of the two-level models are smaller due to the larger values of the estimated standard errors. We discuss the contrasting relevance of this effect for test developers and gender researchers.

7.
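Drawing on Grabe & Stoller's (2005) hierarchy of reading abilities and on the requirements that the Syllabus for the Test for English Majors, Band 4 (2004) sets for assessing the English reading ability of English majors at the foundation stage, this study examines the reading comprehension abilities reflected in the 2010 TEM4 reading comprehension items. It does so from three angles: ability classification, ability-level rating, and the relationship of item difficulty and discrimination to ability level. The results show that teachers disagreed markedly in their ability classifications and ability-level ratings of the items, but that item difficulty and discrimination were positively correlated with the ability level of the items.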

8.
Studies that have investigated differences in examinee performance on items administered in paper-and-pencil form or on a computer screen have produced equivocal results. Certain item administration procedures were hypothesized to be among the most important variables causing differences in item performance and ultimately in test scores obtained from these different administration media. A study where these item administration procedures were made as identical as possible for each presentation medium is described. In addition, a methodology is presented for studying the difficulty and discrimination of items under each presentation medium as a post hoc procedure.

9.
In order to investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210). About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR20 values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests.

10.
A procedure for the detection of differential item performance (DIP) is used to investigate the relationships between characteristics of mathematics achievement items and gender differences in performance. Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test (ACTM). Students without requisite mathematics courses were deleted from the samples to reduce the confounding effects of differences in instruction at the high school level. Signed measures of DIP were obtained for each item in the eight ACTM forms. These DIP estimates were then analyzed in a 6 × 8 (item category by form) experimental design. A significant item category effect was found indicating a relationship between item characteristics and gender-based DIP. Predictions, based on previous research about the categories of items that would contribute to gender-based DIP, were supported: Geometry and mathematics reasoning items were relatively more difficult for female examinees and the more algorithmic, computation-oriented items were relatively easier.

11.
In this study, we examine the degree of construct comparability and possible sources of incomparability of the English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure administered in Canada. Several approaches were used to examine construct comparability at the test level (examination of test data structure, reliability comparisons, and test characteristic curves) and at the item level (differential item functioning, item parameter correlations, and linguistic comparisons). Results from the test-level analyses indicate that the two language versions of PISA are highly similar, as shown by the similarity of their internal consistency coefficients, test data structure (same number of factors and item factor loadings), and test characteristic curves. However, results of item-level analyses reveal several differences between the two language versions, as shown by large proportions of items displaying differential item functioning, differences in item parameter correlations (discrimination parameters), and the number of items found to contain linguistic differences.

12.
Reading and Mathematics tests of multiple-choice items for grades Kindergarten through 9 were vertically scaled using the three-parameter logistic model and two different scaling procedures: concurrent and separate by grade groups. Item parameters were estimated using Markov chain Monte Carlo methodology while fixing the grade 4 population abilities to have a standard normal distribution. For the separate grade-groups scaling, grade groupings were linked using the Stocking and Lord test characteristic curve procedure. Abilities were estimated using the maximum-likelihood method. In either content area, scatterplots of item difficulty, discrimination, and ability estimates from the two methods showed consistently strong linear relationships. However, as grade deviated from the base grade of four, the best-fit line through the pairs of item discriminations started to rotate away from the identity line. This indicated the discrimination estimates from the separate grade-groups procedure for extreme grades to be, on average, higher than those from the concurrent analysis. The study also observed some systematic change in score variability across grades. In general, the two vertical scaling approaches yielded similar results at more grades in Reading than in Mathematics.
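
For reference, the two models named in this abstract can be written out as follows. This is the standard textbook form, not notation taken from the article. The symbols $a_i$, $b_i$, $c_i$ (item discrimination, difficulty, and lower asymptote), $\theta$ (ability), the linking constants $A$ and $B$, and the quadrature points $\theta_q$ are assumptions of this sketch. The abstract does not say whether a scaling constant such as $D = 1.7$ was used, so none is shown.

```latex
% Three-parameter logistic (3PL) model for item i
P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}

% Stocking-Lord criterion: choose A and B to minimize the squared
% distance between the two test characteristic curves over the
% common (linking) items, with the second scale transformed by
% a_i^{*} = a_i^{(2)} / A and b_i^{*} = A\, b_i^{(2)} + B
F(A, B) = \sum_{q} \Bigl[ \sum_{i} P_i\bigl(\theta_q;\, a_i^{(1)}, b_i^{(1)}, c_i^{(1)}\bigr)
        - \sum_{i} P_i\bigl(\theta_q;\, a_i^{(2)}/A,\; A\, b_i^{(2)} + B,\; c_i^{(2)}\bigr) \Bigr]^2
```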

13.
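Using random cluster sampling, 505 primary and secondary school teachers were selected as participants, including 189 male and 271 female teachers, all aged between 25 and 55. A teaching efficacy questionnaire was administered, and the results were analyzed with item response theory to obtain the discrimination, difficulty, and peak item information of every item. The teaching efficacy scale was then revised with reference to item discrimination, difficulty, and the peak of the item information function. Structural equation modeling, facet theory techniques, and smallest space analysis were used to check the quality of the revised scale; the results indicate that the revised scale has clearer construct validity, higher reliability, and greater measurement precision. Data were managed with SPSS 15.0 and analyzed with Hudap 6.0 and MULTILOG 7.03. The study reaches five conclusions: 1) the teaching efficacy scale is unidimensional and can be analyzed with item response theory; 2) the discrimination and difficulty of the items in the revised scale are more reasonable; 3) the peak test information of the revised scale is slightly lower than that of the original scale; 4) corresponding facet elements of the scale before and after revision are highly correlated; 5) the three-part content structure of the scale is confirmed, namely education in student moral conduct, classroom organization and management, and knowledge transmission.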

14.
In an attempt to identify some of the causes of answer-changing behavior, the effects of four test- and item-specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item in a group of items, the position of each item in the test, the discrimination index, and the difficulty index of each item. It is shown that answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and items with poor discrimination to be changed more frequently. Some implications of answer changing in the design of tests are discussed.

15.
This study investigated the psychometric characteristics of constructed-response (CR) items referring to choice and non-choice passages administered to students in Grades 3, 5, and 8. The items were scaled using item response theory (IRT) methodology. The results indicated no consistent differences in the difficulty and discrimination of the items referring to the two types of passages. On average, students' scale scores on the choice and non-choice passages were comparable. Finally, the choice passages differed in terms of overall popularity and in their attractiveness to different gender and ethnic groups.

16.
This paper presents findings from research exploring gender by item difficulty interaction on mathematics test scores in Cyprus. Data stemmed from 2 longitudinal studies with 4 different age groups of primary school students. The hypothesis that boys tended to outperform girls on the hardest items and girls tended to outperform boys on the easiest items was generally supported for each year group. The effect of social class was also examined. For each social class, there was a correlation between the item difficulty differences estimated on girls and boys separately and the difficulty of the item estimated on the whole sample. It is claimed that in understanding gender differences in mathematics, item difficulty should be treated as an independent variable. Suggestions for further studies are provided, and implications for the development of assessment policy in mathematics are drawn.

17.
Gender effects in large-scale assessments have become an increasingly important research area within and across countries. Yet few studies have linked differences in assessment results of male and female students in higher education to construct-relevant features of the target construct. This paper examines gender effects on students’ economic content knowledge with a focus on construct-relevant explanations. Moreover, we compare gender effects cross-nationally between Germany, Japan, and the United States. To assess economic content knowledge of higher education students, we used translated, adapted, and validated versions of the Test of Understanding in College Economics (TUCE, 4th ed.), an instrument that is commonly used internationally. We found gender effects on test scores in all three countries; effects were larger in Germany and the United States than in Japan. Gender effects were generally more pronounced on the numeracy subscale than on the literacy subscale, that is, male students had a greater edge over female students when items required calculations. In our conclusion, we discuss how numeracy and literacy items may tap different abilities.

18.
A 1998 study by Bielinski and Davison reported a sex difference by item difficulty interaction in which easy items tended to be easier for females than males, and hard items tended to be harder for females than males. To extend their research to nationally representative samples of students, this study used math achievement data from the 1992 NAEP, the TIMSS, and the NELS:88. The data included students in grades 4, 8, 10, and 12. The interaction was assessed by correlating the item difficulty difference (b_male − b_female) with item difficulty computed on the combined male/female sample. Using only the multiple-choice mathematics items, the predicted negative correlation was found for all eight populations and was significant in five. An argument is made that this phenomenon may help explain the greater variability in math achievement among males as compared to females and the emergence of higher performance of males in late adolescence.
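
The correlation described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the study's code. The function names `pearson_r` and `difficulty_interaction` and the list-of-floats inputs are assumptions, and the difficulty estimates are taken as already computed for the male, female, and combined samples.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def difficulty_interaction(b_male, b_female, b_combined):
    """Correlate (b_male - b_female) with combined-sample difficulty.

    All three arguments are per-item difficulty estimates in the same
    item order (hypothetical inputs for this sketch).
    """
    difference = [m - f for m, f in zip(b_male, b_female)]
    return pearson_r(difference, b_combined)
```

If hard items are harder for females (b_female above b_male at high difficulty) and easy items are easier for them, the difference falls as combined difficulty rises, which gives the negative correlation the abstract reports.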

19.
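To ensure the quality of language test items and to strengthen item bank construction, this paper applies classical test theory and uses Gitest III to carry out an item analysis of the reading section of a college entrance examination paper. The results show that the difficulty and discrimination of the reading items are fairly satisfactory, but that the distribution of difficulty is not. It is recommended that test papers assembled from an item bank be trialled before use, so as to improve the difficulty distribution and the quality of some item options and thereby raise the reliability and validity of the test.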

20.
The “Teacher Education and Development Study in Mathematics” assessed the knowledge of primary and lower-secondary teachers at the end of their training. The large-scale assessment represented the common denominator of what constitutes mathematics content knowledge and mathematics pedagogical content knowledge in the 16 participating countries. The country means provided information on the overall teacher performance in these 2 areas. By detecting and explaining differential item functioning (DIF), this paper goes beyond the country means and investigates item-by-item strengths and weaknesses of future teachers. We hypothesized that due to differences in the cultural context, teachers from different countries responded differently to subgroups of test items with certain item characteristics. Content domains, cognitive demands (including item difficulty), and item format represented, in fact, such characteristics: They significantly explained variance in DIF. Country pairs showed similar patterns in the relationship of DIF to the item characteristics. Future teachers from Taiwan and Singapore were particularly strong on mathematics content and constructed-response items. Future teachers from Russia and Poland were particularly strong on items requiring non-standard mathematical operations. The USA and Norway did particularly well on mathematics pedagogical content and data items. Thus, conditional on the countries’ mean performance, the knowledge profiles of the future teachers matched the respective national debates. This result points to the influences of the cultural context on mathematics teacher knowledge.
