期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A Comparison of Alternative Matching Strategies for DIF Detection in Tests That Are Multidimensional

Brian E. Clauser Ronald J. Nungester Kathleen Mazor Douglas Ripkey 《Journal of Educational Measurement》1996,33(2):202-214

Most currently accepted approaches for identifying differentially functioning test items compare performance across groups after first matching examinees on the ability of interest. The typical basis for this matching is the total test score. Previous research indicates that when the test is not approximately unidimensional, matching using the total test score may result in an inflated Type I error rate. This study compares the results of differential item functioning (DIF) analysis with matching based on the total test score, matching based on subtest scores, or multivariate matching using multiple subtest scores. Analysis of both actual and simulated data indicate that for the dimensionally complex test examined in this study, using the total test score as the matching criterion is inappropriate. The results suggest that matching on multiple subtest scores simultaneously may be superior to using either the total test score or individual relevant subtest scores. 相似文献

2.

Using Weighted Sum Scores to Close the Gap Between DIF Practice and Theory

Hongwen Guo Neil J. Dorans 《Journal of Educational Measurement》2020,57(4):484-510

We make a distinction between the operational practice of using an observed score to assess differential item functioning (DIF) and the concept of departure from measurement invariance (DMI) that conditions on a latent variable. DMI and DIF indices of effect sizes, based on the Mantel-Haenszel test of common odds ratio, converge under restricted conditions if a simple sum score is used as the matching or conditioning variable in a DIF analysis. Based on theoretical results, we demonstrate analytically that matching on a weighted sum score can significantly reduce the difference between DIF and DMI measures over what can be achieved with a simple sum score. We also examine the utility of binning methods that could facilitate potential operational use of DIF with weighted sum scores. A real data application was included to show this feasibility. 相似文献

3.

Decisions that make a difference in detecting differential item functioning

Stephen G. Sireci Joseph A. Rios 《Educational Research and Evaluation》2013,19(2-3):170-187

There are numerous statistical procedures for detecting items that function differently across subgroups of examinees that take a test or survey. However, in endeavouring to detect items that may function differentially, selection of the statistical method is only one of many important decisions. In this article, we discuss the important decisions that affect investigations of differential item functioning (DIF) such as choice of method, sample size, effect size criteria, conditioning variable, purification, DIF amplification, DIF cancellation, and research designs for evaluating DIF. Our review highlights the necessity of matching the DIF procedure to the nature of the data analysed, the need to include effect size criteria, the need to consider the direction and balance of items flagged for DIF, and the need to use replication to reduce Type I errors whenever possible. Directions for future research and practice in using DIF to enhance the validity of test scores are provided. 相似文献

4.

Using Log-Linear Smoothing to Improve Small-Sample DIF Estimation

Gautam Puhan Timothy P. Moses Lei Yu Neil J. Dorans 《Journal of Educational Measurement》2009,46(1):59-83

This study examined the extent to which log-linear smoothing could improve the accuracy of differential item functioning (DIF) estimates in small samples of examinees. Examinee responses from a certification test were analyzed using White examinees in the reference group and African American examinees in the focal group. Using a simulation approach, separate DIF estimates for seven small-sample-size conditions were obtained using unsmoothed (U) and smoothed (S) score distributions. These small sample U and S DIF estimates were compared to a criterion (i.e., DIF estimates obtained using the unsmoothed total data) to assess their degree of variability (random error) and accuracy (bias). Results indicate that for most studied items smoothing the raw score distributions reduced random error and bias of the DIF estimates, especially in the small-sample-size conditions. Implications of these results for operational testing programs are discussed. 相似文献

5.

Using Logistic Regression and the Mantel-Haenszel With Multiple Ability Estimates to Detect Differential Item Functioning

Kathleen M. Mazor Anil Kanjee Brian E. Clauser 《Journal of Educational Measurement》1995,32(2):131-144

Logistic regression has recently been advanced as a viable procedure for detecting differential item functioning (DIF). One of the advantages of this procedure is the considerable flexibility it offers in the specification of the regression equation. This article describes incorporating two ability estimates into a single regression analysis, with the result that substantially fewer items exhibit DIF. A comparable analysis is conducted using the Mantel-Haenszel with similar results. It is argued that by simultaneously conditioning on two relevant ability estimates, more accurate matching of examinees in the reference and focal groups is obtained, and thus multidimensional item impact is not mistakenly identified as DIF. 相似文献

6.

国家职业汉语能力测试的效度分析 总被引：1，自引：1，他引：1

XIE Xiaoqing REN Jie 《中国考试》2008,(9)

国家职业汉语能力测试(ZHC)由中国劳动和社会保障部职业技能鉴定中心(OSTA)组织国内语言学、语言教学、心理学和教育测量学等方面的专家开发研制。ZHC是测查应试者在职业活动中的汉语能力的国家级职业核心能力测试。效度研究是一个关于考试有效性资料的积累过程,是通过积累证据对考试提供支持的过程,我们需要从多种角度对考试的有效性进行检验,积累资料。在本项研究中,从ZHC成绩与学历的相关、试卷的内部结构分析(不同意型间相关和因素分析)、DIF分析(包含关于性别和文理科专业的分析)、名牌大学与普通大学在校生成绩的比较、用户调查等几个方面对ZHC进行了效度分析。测验分数与学历的相关分析、试卷结构分析、DIF分析、组间比较和用户调查的结果都对ZHC的效度提供了支持,显示ZHC具有较好的效度,可以比较真实准确地反映出受测者的实际语言能力和逻辑思维能力。相似文献

7.

An Empirical Investigation Demonstrating the Multidimensional DIF Paradigm: A Cognitive Explanation for DIF

Cindy M. Walker S. Natasha Beretvas 《Journal of Educational Measurement》2001,38(2):147-163

Differential Item Functioning (DIF) is traditionally used to identify different item performance patterns between intact groups, most commonly involving race or sex comparisons. This study advocates expanding the utility of DIF as a step in construct validation. Rather than grouping examinees based on cultural differences, the reference and focal groups are chosen from two extremes along a distinct cognitive dimension that is hypothesized to supplement the dominant latent trait being measured. Specifically, this study investigates DIF between proficient and non-proficient fourth- and seventh-grade writers on open-ended mathematics test items that require students to communicate about mathematics. It is suggested that the occurrence of DIF in this situation actually enhances, rather than detracts from, the construct validity of the test because, according to the National Council of Teachers of Mathematics (NCTM), mathematical communication is an important component of mathematical ability, the dominant construct being assessed. However, the presence of DIF influences the validity of inferences that can be made from test scores and suggests that two scores should be reported, one for general mathematical ability and one for mathematical communication. The fact that currently only one test score is reported, a simple composite of scores on multiple-choice and open-ended items, may lead to incorrect decisions being made about examinees. 相似文献

8.

Applying the Mantel-Haenszel Procedure to Complex Samples of Items

Nancy L. Allen John R. Donoghue 《Journal of Educational Measurement》1996,33(2):231-251

This Monte Carlo study examined the effect of complex sampling of items on the measurement of differential item functioning (DIF) using the Mantel-Haenszel procedure. Data were generated using a 3-parameter logistic item response theory model according to the balanced incomplete block (BIB) design used in the National Assessment of Educational Progress (NAEP). The length of each block of items and the number of DIF items in the matching variable were varied, as was the difficulty, discrimination, and presence of DIF in the studied item. Block, booklet, pooled booklet, and extra-information analyses were compared to a complete data analysis using the transformed log-odds on the delta scale. The pooled booklet approach is recommended for use when items are selected for examinees according to a BIB design. This study has implications for DIF analyses of other complex samples of items, such as computer administered testing or another complex assessment design. 相似文献

9.

Assessing Differential Item Functioning in Direct Writing Assessments: Problems and an Example

Catherine J. Welch Timothy R. Miller 《Journal of Educational Measurement》1995,32(2):163-178

The recent emphasis on various types of performance assessments raises questions concerning the differential effects of such assessments on population subgroups. Procedures for detecting differential item functioning (DIF) in data from performance assessments are available but may be hindered by problems that stem from this mode of assessment. Foremost among these are problems related to finding an appropriate matching variable. These problems are discussed and results are presented for three methods for DIF detection in polytomous items using data from a direct writing assessment. The purpose of the study is to examine the effects of using different combinations of internal and external matching variables. The procedures included a generalized Mantel-Haenszel statistic, a technique based on meta-analysis methodology, and logistic discriminant function analysis. In general, the results did not support the use of an external matching criterion and indicated that continued problems may be expected in attempts to assess DIF in performance assessments. 相似文献

10.

Testing Features of Graphical DIF: Application of a Regression Correction to Three Nonparametric Statistical Tests

Daniel M. Bolt Mark J. Gierl 《Journal of Educational Measurement》2006,43(4):313-333

Inspection of differential item functioning (DIF) in translated test items can be informed by graphical comparisons of item response functions (IRFs) across translated forms. Due to the many forms of DIF that can emerge in such analyses, it is important to develop statistical tests that can confirm various characteristics of DIF when present. Traditional nonparametric tests of DIF (Mantel-Haenszel, SIBTEST) are not designed to test for the presence of nonuniform or local DIF, while common probability difference (P-DIF) tests (e.g., SIBTEST) do not optimize power in testing for uniform DIF, and thus may be less useful in the context of graphical DIF analyses. In this article, modifications of three alternative nonparametric statistical tests for DIF, Fisher's χ ² test, Cochran's Z test, and Goodman's U test ( Marascuilo & Slaughter, 1981 ), are investigated for these purposes. A simulation study demonstrates the effectiveness of a regression correction procedure in improving the statistical performance of the tests when using an internal test score as the matching criterion. Simulation power and real data analyses demonstrate the unique information provided by these alternative methods compared to SIBTEST and Mantel-Haenszel in confirming various forms of DIF in translated tests. 相似文献

11.

Using Dimensionality-Based DIF Analyses to Identify and Interpret Constructs That Elicit Group Differences

Mark J. Gierl 《Educational Measurement》2005,24(1):3-14

In this paper I describe and illustrate the Roussos-Stout (1996) multidimensionality-based DIF analysis paradigm, with emphasis on its implication for the selection of a matching and studied subtest for DIF analyses. Standard DIF practice encourages an exploratory search for matching subtest items based on purely statistical criteria, such as a failure to display DIF. By contrast, the multidimensional DIF paradigm emphasizes a substantively-informed selection of items for both the matching and studied subtest based on the dimensions suspected of underlying the test data. Using two examples, I demonstrate that these two approaches lead to different interpretations about the occurrence of DIF in a test. It is argued that selecting a matching and studied subtest, as identified using the DIF analysis paradigm, can lead to a more informed understanding of why DIF occurs. 相似文献

12.

Effect of Rasch Calibration on Ability and DIF Estimation in Computer-Adaptive Tests

Rebecca Zwick Dorothy T. Thayer Marilyn Wingersky 《Journal of Educational Measurement》1995,32(4):341-363

In a previous simulation study of methods for assessing differential item functioning (DIF) in computer-adaptive tests (Zwick, Thayer, & Wingersky, 1993, 1994), modified versions of the Mantel-Haenszel and standardization methods were found to perform well. In that study, data were generated using the 3-parameter logistic (3PL) model and this same model was assumed in obtaining item parameter estimates. In the current study, the 3PL data were used but the Rasch model was assumed in obtaining the item parameter estimates, which determined the information table used for item selection. Although the obtained DIF statistics were highly correlated with the generating DIF values, they tended to be smaller in magnitude than in the 3PL analysis, resulting in a lower probability of DIF detection. This reduced sensitivity appeared to be related to a degradation in the accuracy of matching. Expected true scores from the Rasch-based computer-adaptive test tended to be biased downward, particularly for lower-ability examinees 相似文献

13.

Exploring the Utility of Background and Cognitive Variables in Explaining Latent Differential Item Functioning: An Example of the PISA 2009 Reading Assessment

Ying-Fang Chen Hong Jiao 《Educational Assessment》2013,18(2):77-96

Differential item functioning (DIF) may be caused by an interaction of multiple manifest grouping variables or unexplored manifest variables, which cannot be detected by conventional DIF detection methods that are based on a single manifest grouping variable. Such DIF may be detected by a latent approach using the mixture item response theory model and subsequently explained by multiple manifest variables. This study facilitates the interpretation of latent DIF with the use of background and cognitive variables. The PISA 2009 reading assessment and student survey are analyzed. Results show that members in manifest groups were not homogenously advantaged or disadvantaged and that a single manifest grouping variable did not suffice to be a proxy of latent DIF. This study also demonstrates that DIF items arising from the interaction of multiple variables can be effectively screened by the latent DIF analysis approach. Background and cognitive variables jointly well predicted latent class membership. 相似文献

14.

Analyzing Fairness Among Linguistic Minority Populations Using a Latent Class Differential Item Functioning Approach

Maria Elena Oliveri Kadriye Ercikan Juliette Lyons-Thomas Steven Holtzman 《教育实用测度》2013,26(1):17-29

ABSTRACT

Differential item functioning (DIF) analyses have been used as the primary method in large-scale assessments to examine fairness for subgroups. Currently, DIF analyses are conducted utilizing manifest methods using observed characteristics (gender and race/ethnicity) for grouping examinees. Homogeneity of item responses is assumed denoting that all examinees respond to test items using a similar approach. This assumption may not hold with all groups. In this study, we demonstrate the first application of the latent class (LC) approach to investigate DIF and its sources with heterogeneous (linguistic minority groups). We found at least three LCs within each linguistic group, suggesting the need to empirically evaluate this assumption in DIF analysis. We obtained larger proportions of DIF items with larger effect sizes when LCs within language groups versus the overall (majority/minority) language groups were examined. The illustrated approach could be used to improve the ways in which DIF analyses are typically conducted to enhance DIF detection accuracy and score-based inferences when analyzing DIF with heterogeneous populations. 相似文献

15.

Hierarchical Logistic Regression: Accounting for Multilevel Data in DIF Detection

Brian F. French W. Holmes Finch 《Journal of Educational Measurement》2010,47(3):299-317

The purpose of this study was to examine the performance of differential item functioning (DIF) assessment in the presence of a multilevel structure that often underlies data from large-scale testing programs. Analyses were conducted using logistic regression (LR), a popular, flexible, and effective tool for DIF detection. Data were simulated using a hierarchical framework, such as might be seen when examinees are clustered in schools, for example. Both standard and hierarchical LR (accounting for multilevel data) approaches to DIF detection were employed. Results highlight the differences in DIF detection rates when the analytic strategy matches the data structure. Specifically, when the grouping variable was within clusters, LR and HLR performed similarly in terms of Type I error control and power. However, when the grouping variable was between clusters, LR failed to maintain the nominal Type I error rate of .05. HLR was able to maintain this rate. However, power for HLR tended to be low under many conditions in the between cluster variable case. 相似文献

16.

云南首次初级建(构)筑物消防员基础知识考试结果分析

颜连宇阎文华李世友王加春魏振华《中国教育技术装备》2012,(18):30-32

对首次参加初级建(构)筑物消防员职业技能鉴定全国基础知识统一考试的云南考生考试结果进行统计分析,1716名考生的平均分为73.2,及格率为86.6%,计算各分数段的人数、平均分和标准差。对考生成绩与学历结构、年龄结构、性别结构、所服务的行业等因素的关系分析表明,总体上高学历、中青年、机场等大型消防安全重点单位考生的成绩较好,女性考生的平均分和及格率均高于男性考生。据此提出提高培训质量的多项建议。相似文献

17.

Item Difficulty of Four Verbal Item Types and an Index of Differential Item Functioning for Black and White Examinees 总被引：1，自引：0，他引：1

Roy Freedle Irene Kostin 《Journal of Educational Measurement》1990,27(4):329-343

In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both GRE and SAT. The most important finding indicates that for hard items, Black examinees perform differentially better than matched ability White examinees for each of the four item types and for both the GRE and SAT tests! The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored. 相似文献

18.

DIF Detection and Interpretation in Large-Scale Science Assessments: Informing Item Writing Practices

April L. Zenisky Ronald K. Hambleton Frederic Robin 《Educational Assessment》2013,18(1-2):61-78

Differential item functioning (DIF) analyses are a routine part of the development of large-scale assessments. Less common are studies to understand the potential sources of DIF. The goals of this study were (a) to identify gender DIF in a large-scale science assessment and (b) to look for trends in the DIF and non-DIF items due to content, cognitive demands, item type, item text, and visual-spatial or reference factors. To facilitate the analyses, DIF studies were conducted at 3 grade levels and for 2 randomly equivalent forms of the science assessment at each grade level (administered in different years). The DIF procedure itself was a variant of the "standardization procedure" of Dorans and Kulick (1986) and was applied to very large sets of data (6 sets of data, each involving 60,000 students). It has the advantages of being easy to understand and to explain to practitioners. Several findings emerged from the study that would be useful to pass on to test development committees. For example, when there was DIF in science items, MC items tended to favor male examinees and OR items tended to favor female examinees. Compiling DIF information across multiple grades and years increases the likelihood that important trends in the data will be identified and that item writing practices will be informed by more than anecdotal reports about DIF. 相似文献

19.

Using a Multidimensional Differential Item Functioning Framework to Determine if Reading Ability Affects Student Performance in Mathematics

Cindy M. Walker Bo Zhang John Surber 《教育实用测度》2013,26(2):162-181

Many teachers and curriculum specialists claim that the reading demand of many mathematics items is so great that students do not perform well on mathematics tests, even though they have a good understanding of mathematics. The purpose of this research was to test this claim empirically. This analysis was accomplished by considering examinees that differed in reading ability within the context of a multidimensional DIF framework. Results indicated that student performance on some mathematics items was influenced by their level of reading ability so that examinees with lower proficiency classifications in reading were less likely to obtain correct answers to these items. This finding suggests that incorrect proficiency classifications may have occurred for some examinees. However, it is argued that rather than eliminating these mathematics items from the test, which would seem to decrease the construct validity of the test, attempts should be made to control the confounding effect of reading that is measured by some of the mathematics items. 相似文献

20.

Examing Gender DIF on a Multiplechoice Test of Mathematics: A Confirmatory Approach

Katherine E. Ryan Meichu Fan 《Educational Measurement》1996,15(4):15-20

Do eighth-grade males and females display the same DIF patterns as older examinees? Are the patterns the same for different content areas in mathematics? Does a DIF test for essential dimensionality yield expected results? 相似文献