
1.

SCRAMBLING CONTENT IN ACHIEVEMENT TESTING: AN APPLICATION OF MULTIPLE MATRIX SAMPLING IN EXPERIMENTAL DESIGN

KEN SIROTNIK ROGER WELLINGTON 《Journal of Educational Measurement》1974,11(3):179-188

This study was designed to research the question of scrambling item content in the construction of achievement tests, so that very general implications could be drawn for both examinee and item populations. To achieve this generality, the methodology of multiple matrix sampling was combined with a simple two group experimental design: a random group of 8th graders responded to mathematics, science, social studies, reading, and language arts achievement items organized in a scrambled (random) test format, while another random group responded to the same items organized in a fixed (segregated by subject matter) test format. The results indicated that scrambling cognitive test items has minimal or no effect on mean examinee test performance or on any of the other parameters included in the analysis.

2.

Effects of Item Wording on Sex Bias (total citations: 1; self-citations: 0; citations by others: 1)

Joyce R. McLarty A. Candace Noble Renee M. Huntley 《Journal of Educational Measurement》1989,26(3):285-293

This study examined the effects of gender-related item-wording changes on the performance of male and female examinees. Mathematics word problems and English language items were created in neuter, male, and female versions. Items were administered to randomly equivalent samples of about 300 high school juniors and seniors. Loglinear analysis was used to assess the impact of item gender and its interaction with examinee sex on the difficulty and discrimination of each item in each context. No items were found to have sex bias in either context. Mathematics items did not have different difficulty or discrimination in the three gender versions. Neither mathematics nor English items had different discrimination levels in the three gender-related versions. Some English items, however, were found to have different difficulty levels in the three gender-related versions. These difficulty differences were not systematic: none of the three gender versions appeared consistently more or less difficult than the others.

3.

Quantitative Methods for Assessing the Fit Between Test and Curriculum

《Applied Measurement in Education》2013,26(2):179-194

Techniques for quantifying the degree of fit between test items and curricula are classified according to three distinct purposes: (a) assessing overall fit between test and curriculum, (b) assessing the fit of individual items to a content domain, and (c) assessing the impact of test specifications on examinee performance. Procedures for calculating each index are provided, accompanied by discussion of specific properties of the indices within the context of content validation.

4.

Examinee Non-Effort on Contextualized and Non-Contextualized Mathematics Items in Large-Scale Assessments

Daniel Van Nijlen Rianne Janssen 《Applied Measurement in Education》2015,28(1):68-84

This study investigates the extent to which contextualized and non-contextualized mathematics test items have a differential impact on examinee effort. Mixture item response theory (IRT) models are applied to two subsets of items from a national assessment on mathematics in the second grade of the pre-vocational track in secondary education in Flanders. One subset focused on elementary arithmetic and consisted of non-contextualized items. Another subset of contextualized items focused on the application of arithmetic in authentic problem-solving situations. Results indicate that differential performance on the subsets is to a large extent due to test effort. The non-contextualized items appear to be much more susceptible to low examinee effort in low-stakes testing situations. However, subgroups of students can be found with regard to the extent to which they show low effort. One can distinguish a compliant, an underachieving, and a dropout group. Group membership is also linked to relevant background characteristics.

5.

Hierarchical Generalized Linear Models for the Analysis of Judge Ratings

Timothy J. Muckle George Karabatsos 《Journal of Educational Measurement》2009,46(2):198-219

It is known that the Rasch model is a special two-level hierarchical generalized linear model (HGLM). This article demonstrates that the many-faceted Rasch model (MFRM) is also a special case of the two-level HGLM, with a random intercept representing examinee ability on a test, and fixed effects for the test items, judges, and possibly other facets. This perspective suggests useful modeling extensions of the MFRM. For example, in the HGLM framework it is possible to model random effects for items and judges in order to assess their stability across examinees. The MFRM can also be extended so that item difficulty and judge severity are modeled as functions of examinee characteristics (covariates), for the purposes of detecting differential item functioning and differential rater functioning. Practical illustrations of the HGLM are presented through the analysis of simulated and real judge-mediated data sets involving ordinal responses.
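
For orientation, the two-level structure described in this abstract can be written in one common form. The adjacent-category logit and all symbols below ($\theta_n$, $\delta_i$, $\lambda_j$, $\tau_k$, $\sigma^2_\theta$) are assumptions chosen for illustration; they are not notation taken from the article, whose link function and parameterization may differ.

```latex
% Level 1 (assumed adjacent-category form): examinee n, item i, judge j, category k
\begin{equation}
  \log \frac{P(X_{nij} = k)}{P(X_{nij} = k-1)}
    = \theta_n - \delta_i - \lambda_j - \tau_k
\end{equation}
% Level 2: examinee ability enters as a random intercept;
% item difficulty \delta_i, judge severity \lambda_j and thresholds \tau_k are fixed effects
\begin{equation}
  \theta_n \sim N(0, \sigma^2_\theta)
\end{equation}
```

Under this reading, the extensions mentioned in the abstract amount to giving $\delta_i$ or $\lambda_j$ their own random components, or regressing them on examinee covariates.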

6.

AN EXPERIMENTAL COMPARISON OF ITEM SAMPLING AND EXAMINEE SAMPLING FOR ESTIMATING TEST NORMS

THOMAS R. OWENS DANIEL L. STUFFLEBEAM 《Journal of Educational Measurement》1969,6(2):75-83

An empirical comparison was made of the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower ability and higher ability student populations in accuracy of estimated means and standard deviations when the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation of sequences of items occurring in item sampling need not have a significant effect on test performance.
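
The contrast between the two sampling schemes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design used in the study: the simulated logistic response model and the function names (`simulate_responses`, `examinee_sampling_mean`, `item_sampling_mean`) are hypothetical. Examinee sampling gives the whole test to a subset of examinees; item sampling gives every examinee one short form and estimates the mean total score as the sum, over forms, of form length times the mean proportion correct on that form.

```python
import math
import random
import statistics


def simulate_responses(n_examinees, n_items, seed=0):
    """Simulate a 0/1 response matrix from an assumed logistic (Rasch-type) model."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    difficulties = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    return [
        [1 if rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) else 0
         for b in difficulties]
        for theta in abilities
    ]


def examinee_sampling_mean(matrix, n_sampled, rng):
    """Examinee sampling: a random subset of examinees takes every item."""
    rows = rng.sample(matrix, n_sampled)
    return statistics.mean(sum(row) for row in rows)


def item_sampling_mean(matrix, items_per_form, rng):
    """Item sampling: each examinee takes one short form of randomly assigned items."""
    n_items = len(matrix[0])
    item_ids = list(range(n_items))
    rng.shuffle(item_ids)
    # Partition the shuffled items into non-overlapping forms.
    forms = [item_ids[i:i + items_per_form]
             for i in range(0, n_items, items_per_form)]
    examinees = list(range(len(matrix)))
    rng.shuffle(examinees)
    estimate = 0.0
    for f, form in enumerate(forms):
        group = examinees[f::len(forms)]  # examinees assigned to this form
        correct = sum(matrix[e][i] for e in group for i in form)
        proportion = correct / (len(group) * len(form))
        estimate += proportion * len(form)  # this form's share of the total score
    return estimate


if __name__ == "__main__":
    data = simulate_responses(2000, 50, seed=1)
    rng = random.Random(2)
    print("full-data mean:   ", statistics.mean(sum(r) for r in data))
    print("examinee sampling:", examinee_sampling_mean(data, 200, rng))
    print("item sampling (6):", item_sampling_mean(data, 6, rng))
```

Both estimators target the same quantity, the mean total score on the 50-item test; the sketch estimates only the mean, not the standard deviation that the study also examined.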

7.

The Use of Hierarchical Generalized Linear Model for Item Dimensionality Assessment

S. Natasha Beretvas Natasha J. Williams 《Journal of Educational Measurement》2004,41(4):379-395

To assess item dimensionality, the following two approaches are described and compared: hierarchical generalized linear model (HGLM) and multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ₁ and θ₂, with the remaining 10 items modeled unidimensionally, emulating a mathematics test with seven items requiring an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of HGLM and MIRT results is possible through a transformation of items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ₁ and θ₂. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.

8.

A COMPARISON OF EXAMINEE SAMPLING AND MULTIPLE MATRIX SAMPLING IN TEST DEVELOPMENT

RASHMI GARG MARVIN W. BOSS JAMES E. CARLSON 《Journal of Educational Measurement》1986,23(2):119-130

For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

9.

Gender Differences in Performance on Mathematics Achievement Items

《Applied Measurement in Education》2013,26(2):161-177

Gender differences in performance on three types of mathematics test items were investigated using data from students with three different course backgrounds. Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test. Only students with three specific profiles of high school mathematics coursework were considered in the analysis. The three background conditions ranged from little mathematics (Algebra I only) to a modest background (two Algebra courses and Geometry) to a full mathematics program including Introductory Calculus. For each background condition, examinee performance was analyzed in a 2 (Gender) x 3 (Item Category) x 8 (Test Form) split-plot factorial design. The results indicated that, at each of the studied background levels, females performed less well than males on geometry (strategic, geometric) and reasoning (strategic, nongeometric) items. On the other hand, females performed as well as males on algorithmic, operations-oriented items.

10.

Exploration of Factors Affecting the Added Value of Test Subscores

Xiaolin Wang Dubravka Svetina Shenghai Dai 《Journal of Experimental Education》2019,87(2):179-192

Recently, interest in test subscore reporting for diagnostic purposes has been growing rapidly. The two simulation studies here examined factors (sample size, number of subscales, correlation between subscales, and three factors affecting subscore reliability: number of items per subscale, item parameter distribution, and data generating model) that affected the value of reporting subscores within the classical test theory framework. Results showed that a higher proportion of subscores of added value was related to lower correlation between subscales, more items per subscale, no guessing in responses, smaller variability in difficulty parameters, and matched average item difficulty and average examinee ability.

11.

COLLEGE STUDENTS’ REACTIONS TOWARDS KEY FACETS OF CLASSROOM TESTING

Moshe Zeidner 《Assessment & Evaluation in Higher Education》1990,15(2):151-169

The major aim of the present study is to assess college students’ attitudes, perceptions, emotional reactions and affective dispositions with respect to various critical dimensions of course achievement testing and assessment, including: “papers” vs. “exams”, “essay” vs. “multiple choice” type formats, “open book” vs. “closed book” exams, “free choice” among items vs. “no free choice” among items, and “oral” vs. “written” modes of test administration. A further aim is to delineate the construction, properties, and potential classroom uses and applications of a selected sample of examinee feedback inventories designed to gauge students’ test attitudes and dispositions. The use of each examinee feedback inventory is demonstrated and exemplified in the context of an empirical study. This paper discusses the assumptions underlying the use of feedback systems in college achievement evaluation; their importance for assessing the face validity of classroom tests; some possible future applications of feedback inventories for research and applied purposes in college; and some guidelines for future research. A mapping sentence specifying the universe of content of test attitude and examinee feedback research is suggested as a heuristic device for guiding future research.

12.

Using a Multidimensional Differential Item Functioning Framework to Determine if Reading Ability Affects Student Performance in Mathematics

Cindy M. Walker Bo Zhang John Surber 《Applied Measurement in Education》2013,26(2):162-181

Many teachers and curriculum specialists claim that the reading demand of many mathematics items is so great that students do not perform well on mathematics tests, even though they have a good understanding of mathematics. The purpose of this research was to test this claim empirically. This analysis was accomplished by considering examinees who differed in reading ability within the context of a multidimensional DIF framework. Results indicated that student performance on some mathematics items was influenced by their level of reading ability so that examinees with lower proficiency classifications in reading were less likely to obtain correct answers to these items. This finding suggests that incorrect proficiency classifications may have occurred for some examinees. However, it is argued that rather than eliminating these mathematics items from the test, which would seem to decrease the construct validity of the test, attempts should be made to control the confounding effect of reading that is measured by some of the mathematics items.

13.

Mathematics Assessment for Children with English as an Additional Language

Eleanore Hargreaves 《Assessment in Education: Principles, Policy & Practice》1997,4(3):401-412

The research reported in this paper was carried out to establish whether children in England who spoke English as an additional, rather than a first, language performed significantly less well in mathematics than their counterparts whose first language was English. An evaluation of the two National Curriculum Mathematics Assessment instruments for Year 2 children in England provided the context. The sample consisted of over 600 children, mainly of Asian origin, who spoke English as an additional language. Their mathematics results were compared to those of children who spoke English as a first language, and it was established that, overall, mean scores were significantly lower for children with English as an additional language. This was true for all four of the most highly represented additional language groups, and for both assessment instruments. An item analysis of the written test indicated that the difference was not constant across all items.

14.

Test item linguistic complexity and assessments for deaf students

Cawthon S 《American annals of the deaf》2011,156(3):255-269

Linguistic complexity of test items is one test format element that has been studied in the context of struggling readers and their participation in paper-and-pencil tests. The present article presents findings from an exploratory study on the potential relationship between linguistic complexity and test performance for deaf readers. A total of 64 students completed 52 multiple-choice items, 32 in mathematics and 20 in reading. These items were coded for linguistic complexity components of vocabulary, syntax, and discourse. Mathematics items had higher linguistic complexity ratings than reading items, but there were no significant relationships between item linguistic complexity scores and student performance on the test items. The discussion addresses issues related to the subject area, student proficiency levels in the test content, factors to look for in determining a "linguistic complexity effect," and areas for further research in test item development and deaf students.

15.

Are Accommodations for English Learners on State Accountability Assessments Evidence-Based? A Multistudy Systematic Review and Meta-Analysis

Joseph A. Rios Samuel D. Ihlenfeldt Carlos Chavez 《Educational Measurement》2020,39(4):65-75

The objectives of this two-part study were to: (a) investigate English learner (EL) accommodation practices on state accountability assessments of reading/English language arts and mathematics in grades 3–8, and (b) conduct a meta-analysis of EL accommodation effectiveness on improving test performance. Across all distinct testing programs, we found that at least one EL test accommodation was provided for both test content areas. The most popular accommodations provided were supplying students with word-to-word dual language dictionaries, reading aloud test directions and items in English, and allowing flexible time/scheduling. However, we found minimal evidence that testing programs provide practitioners with recommendations on how to assign relevant accommodations to EL test takers’ English proficiency level. To evaluate whether accommodations used in practice are supported with evidence of their effectiveness, a meta-analysis was conducted. On average, across 26 studies and 95 effect sizes (N = 11,069), accommodations improved test performance by .16 standard deviations. Both test content and sampling design were found to moderate accommodation effectiveness; however, none of the accommodations investigated were found to have intervention effects that were statistically different from zero. Overall, these results suggest that currently employed EL test accommodations lack evidence of their effectiveness.

16.

Item Difficulty and Interviewer Knowledge Effects on the Accuracy and Consistency of Examinee Response Processes in Verbal Reports

Jacqueline P. Leighton 《Applied Measurement in Education》2013,26(2):136-157

The Standards for Educational and Psychological Testing indicate that multiple sources of validity evidence should be used to support the interpretation of test scores. In the past decade, examinee response processes, as a source of validity evidence, have received increased attention. However, there have been relatively few methodological studies of the accuracy and consistency of examinee response processes as measured by verbal reports in the context of educational measurement. The objective of the current study was to investigate the accuracy and consistency of examinee response processes, as measured by verbal reports, as a function of varying interviewer and item variables in a think-aloud interview within an educational measurement context. Results indicate that the accuracy of responses may be undermined when students perceive the interviewer to be an expert in the domain. Further, the consistency of response processes may be undermined when items that are too easy or too difficult are used to elicit reports. The implications of these results for conducting think-aloud studies are explored.

17.

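Reflections on the English Entrance Examination Model for Master's Candidates in China

Li Xinghua 《Journal of Yangzhou University (Higher Education Study Edition)》2005,9(4):69-71

The English component of China's entrance examination for master's programs currently has many problems: the excessive weight given to English seriously crowds out the specialized subjects, the question types are not entirely well designed, and formulaic item writing steers ordinary English study in an unhealthy direction. The traditional master's English entrance examination could therefore be replaced by the College English Test Band 6 or a dedicated certificate-style examination of equivalent level, with the testing of subject-specific English ability strengthened at the second-round examination.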

18.

Multilevel Modeling of Item Position Effects

Anthony D. Albano 《Journal of Educational Measurement》2013,50(4):408-426

In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A hierarchical generalized linear model is formulated for estimating item‐position effects. The model is demonstrated using data from a pilot administration of the GRE wherein the same items appeared in different positions across the test form. Methods for detecting and assessing position effects are discussed, as are applications of the model in the contexts of test development and item analysis.

19.

Identification and Evaluation of Local Item Dependencies in the Medical College Admissions Test

April L. Zenisky Ronald K. Hambleton Stephen G. Sireci 《Journal of Educational Measurement》2002,39(4):291-309

Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may impact on the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.

20.

Using a New Statistical Model for Testlets to Score TOEFL (total citations: 1; self-citations: 0; citations by others: 1)

Howard Wainer Xiaohui Wang 《Journal of Educational Measurement》2000,37(3):203-220

Standard item response theory (IRT) models fit to examination responses ignore the fact that sets of items (testlets) often are matched with a single common stimulus (e.g., a reading comprehension passage). In this setting, all items given to an examinee are unlikely to be conditionally independent (given examinee proficiency). Models that assume conditional independence will overestimate the precision with which examinee proficiency is measured. Overstatement of precision may lead to inaccurate inferences as well as prematurely ended examinations in which the stopping rule is based on the estimated standard error of examinee proficiency (e.g., an adaptive test). The standard three parameter IRT model was modified to include an additional random effect for items nested within the same testlet (Wainer, Bradlow, & Du, 2000). This parameter, γ, characterizes the amount of local dependence in a testlet.
We fit 86 TOEFL testlets (50 reading comprehension and 36 listening comprehension) with the new model, and obtained a value for the variance of γ for each testlet. We compared the standard parameters (discrimination (a), difficulty (b) and guessing (c)) with those obtained through traditional modeling. We found that difficulties were well estimated either way, but estimates of both a and c were biased if conditional independence was incorrectly assumed. Of greater import, we found that test information was substantially over-estimated when conditional independence was incorrectly assumed.
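
The modification described in this abstract can be sketched as a three-parameter logistic model with a testlet term added inside the logit. The indexing below (examinee $i$, item $j$, testlet $d(j)$ containing item $j$) and the normal prior on $\gamma$ are assumptions made for illustration and may not match the article's exact notation.

```latex
% Assumed form: 3PL model with an examinee-by-testlet random effect \gamma
\begin{equation}
  P(y_{ij} = 1) = c_j + (1 - c_j)\,
    \frac{\exp\{a_j(\theta_i - b_j - \gamma_{i\,d(j)})\}}
         {1 + \exp\{a_j(\theta_i - b_j - \gamma_{i\,d(j)})\}}
\end{equation}
% One variance per testlet; a larger variance means stronger local dependence
\begin{equation}
  \gamma_{i\,d(j)} \sim N\bigl(0, \sigma^2_{\gamma,\,d(j)}\bigr)
\end{equation}
```

Setting every $\sigma^2_{\gamma,\,d(j)}$ to zero removes the testlet effect and recovers the standard three-parameter model with discrimination $a_j$, difficulty $b_j$, and guessing $c_j$.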