1.

A Computer-Based Task for Measuring the Representational Component of Quantitative Proficiency

Randy Elliot Bennett Marc M. Sebrechts 《Journal of Educational Measurement》1997,34(1):64-77

In this study, we created a computer-delivered problem-solving task based on the cognitive research literature and investigated its validity for graduate admissions assessment. The task asked examinees to sort mathematical word problem stems according to prototypes. Data analyses focused on the meaning of sorting scores and examinee perceptions of the task. Results showed that those who sorted well tended to have higher GRE General Test scores and college grades than did examinees who sorted less proficiently. Examinees generally preferred this task to multiple-choice items like those found on the General Test's Quantitative section and felt the task was a fairer measure of their ability to succeed in graduate school. Adaptations of the task might be used in admissions tests, as well as for instructional assessments to help lower-scoring examinees localize and remediate problem-solving difficulties.

2.

Improving Measurement for Graduate Admissions

Mary K. Enright Donald A. Rock Randy Elliot Bennett 《Journal of Educational Measurement》1998,35(3):250-267

In this study we examined alternative item types and section configurations for improving the discriminant and convergent validity of the GRE General Test. A computer-based test of reasoning items and a generating-explanations measure was administered to a sample of 388 examinees who previously had taken the General Test. Confirmatory factor analyses indicated that three dimensions of reasoning—verbal, analytical, and quantitative—and a fourth dimension of verbal fluency based on the generating-explanations task could be distinguished. Notably, generating explanations was as distinct from new variations of reasoning items as it was from verbal and quantitative reasoning. In the full sample, this differentiation was evident in relation to such external criteria as undergraduate grade point average (UGPA), self-reported accomplishments, and a measure of ideational fluency, with generating explanations relating uniquely to aesthetic and linguistic accomplishments and to ideational fluency. For the subset of participants with undergraduate majors in the humanities and social sciences, generating explanations added to the relationship with UGPA over that contributed by the General Test.

3.

Language and Cultural Characteristics That Explain Differential Item Functioning for Hispanic Examinees on the Scholastic Aptitude Test

Alicia P. Schmitt 《Journal of Educational Measurement》1988,25(1):1-13

The standardization methodology was used to help identify item characteristics that might explain differential item functioning among Hispanics on the Scholastic Aptitude Test. Results indicated that true cognates (words with a common root in English and Spanish) and content of special interest to Hispanics seemed to help Hispanics' performance. Limited occurrence of false cognates (words that appear to be cognates but have different meanings in the two languages) and of homographs (words that are spelled alike but have different meanings in English) restricted their evaluation. Nevertheless, examination of items with false cognates or homographs gave some evidence that their occurrence might make items unexpectedly more difficult for Hispanic examinees.

4.

An Experimental, Exploratory Study of Causes of Bias in Test Items

Janice Dowd Scheuneman 《Journal of Educational Measurement》1987,24(2):97-118

This study evaluated 16 hypotheses, subsumed under 7 more general hypotheses, concerning possible sources of bias in test items for black and white examinees on the Graduate Record Examination General Test (GRE). Items were developed in pairs that were varied according to a particular hypothesis, with each item from a pair administered in different forms of an experimental portion of the GRE. Data were analyzed using log-linear methods. Ten of the 16 hypotheses showed interactions between group membership and the item version, indicating a differential effect of the item manipulation on the performance of black and white examinees. The complexity of some of the interactions found, however, suggested that uncontrolled factors were also differentially affecting performance.

5.

Graphical Modeling: A New Response Type for Measuring the Qualitative Component of Mathematical Reasoning

《Applied Measurement in Education》2013,26(3):303-322

We investigated the functioning of a new computer-delivered response type for potential use in graduate admissions assessment. This response type, which is open-ended and automatically scorable, presents problems calling for the examinee to draw a graph modeling a given situation. Problem situations can be like the single-best-answer items currently found on the Graduate Record Examinations (GRE) General Test (ETS, 1998) or they can be more loosely defined, allowing for multiple-correct responses. Two graphical modeling (GM) tests differing from one another in the manipulation of specific item features were randomly spiraled among study participants. Results showed that GM scores were very reliable and moderately related to the General Test's quantitative section, suggesting that GM might help broaden the GRE quantitative construct. In exploratory difficulty analyses, 1 of 3 manipulated item features, problem structure, had a dependable effect. No significant gender differences independent of those associated with the GRE quantitative section were detected. Finally, more participants preferred regular multiple-choice graphical reasoning questions to GM items but thought GM was the fairer indicator of their ability to undertake graduate study.

6.

Effects of Coaching on GRE Aptitude Test Scores

Donald E. Powers 《Journal of Educational Measurement》1985,22(2):121-136

Test preparation activities were determined for a large representative sample of Graduate Record Examination (GRE) Aptitude Test takers. About 3% of these examinees had attended formal coaching programs for one or more sections of the test.
After adjusting for differences in the background characteristics of coached and uncoached students, effects on test scores were related to the length and the type of programs offered. The effects on GRE verbal ability scores were not significantly related to the amount of coaching examinees received, and quantitative coaching effects increased slightly but not significantly with additional coaching. Effects on analytical ability scores, on the other hand, were related significantly to the length of coaching programs, through improved performance on two analytical item types, which have since been deleted from the test.
Overall, the data suggest that, when compared with the two highly susceptible item types that have been removed from the GRE Aptitude Test, the test item types in the current version of the test (now called the GRE General Test) appear to show relatively little susceptibility to formal coaching experiences of the kinds considered here.

7.

The Effect of Including Pretest Items in an Operational Computerized Adaptive Test: Do Different Ability Examinees Spend Different Amounts of Time on Embedded Pretest Items?

Abdullah A. Ferdous Barbara S. Plake Shu-Ren Chang 《Educational Assessment》2013,18(2):161-173

The purpose of this study was to examine the effect of pretest items on response time in an operational, fixed-length, time-limited computerized adaptive test (CAT). These pretest items are embedded within the CAT, but unlike the operational items, are not tailored to the examinee's ability level. If examinees with higher ability levels need less time to complete these items than do their counterparts with lower ability levels, they will have more time to devote to the operational test questions. Data were from a graduate admissions test that was administered worldwide. Data from both quantitative and verbal sections of the test were considered. For the verbal section, examinees in the lower ability groups spent systematically more time on their pretest items than did those in the higher ability groups, though for the quantitative section the differences were less clear.

8.

The Effects of Component Variables on Performance in Graph Comprehension Tests

Yigal Attali Chanan Goldschmidt 《Journal of Educational Measurement》1996,33(1):93-105

Some cognitive characteristics of graph comprehension items were studied, and a model comprised of several variables was developed. 132 graph items of the Psychometric Entrance Test were included in the study. By analyzing the actual difficulty of the items, an evaluation of the impact of the cognitive variables on item difficulties could be made. Results indicate that successful prediction of item difficulty can be calculated on the basis of a wide range of item characteristics and task demands. This suggests that items can be screened for processing difficulty prior to being administered to examinees. However, the results also have implications for test validity in that the various processing variables identified involve distinct ability dimensions.

9.

The validity and comparability of entrance examination scores after accommodations are made for students with LD

Zurcher R Bryant DP 《Journal of learning disabilities》2001,34(5):462-471

Every year, thousands of college and university applicants with learning disabilities (LD) present scores from standardized examinations as part of the admissions process for postsecondary education. Many of these scores are from tests administered with nonstandard procedures due to the examinees' learning disabilities. Using a sample of college students with LD and a control sample, this study investigated the criterion validity and comparability of scores on the Miller Analogies Test when accommodations for the examinees with LD were in place. Scores for examinees with LD from test administrations with accommodations were similar to those of examinees without LD on standard administrations, but less well associated with grade point averages. The results of this study provide evidence that although scores for examinees with LD from nonstandard test administrations are comparable to scores for examinees without LD, they have less criterion validity and are less meaningful for their intended purpose.

10.

Improving Multiple-Choice Test Performance for Examinees with Different Levels of Test Anxiety

Linda Crocker Alicia Schmitt 《Journal of Experimental Education》2013,81(4):201-205

The effectiveness of a strategy for improving performance on multiple-choice items for examinees was assessed. An aptitude-treatment interaction model was used to test the possibility of different treatment effects for examinees with different levels of test anxiety. Undergraduate measurement students responded to the Mandler-Sarason Test Anxiety Scale and to an objective test covering course content. For low-anxious examinees, generation of an answer before selecting a multiple-choice response led to higher test performance; for highly test anxious examinees, there was a slightly negative effect on performance.

11.

Evaluating an Automatically Scorable, Open-Ended Response Type for Measuring Mathematical Reasoning in Computer-Adaptive Tests

Randy Elliot Bennett Manfred Steffen Mark Kevin Singley Mary Morley Daniel Jacquemin 《Journal of Educational Measurement》1997,34(2):162-176

The first generation of computer-based tests depends largely on multiple-choice items and constructed-response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open-ended response type where correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed automatic scoring to approximate the accuracy of multiple-choice scanning, with all processing errors stemming from examinees improperly entering responses. In addition, the items functioned similarly in difficulty, item-total relations, and male-female performance differences to other response types being considered for the measure.

12.

Item analysis of written expression scoring systems from the PIAT‐R and WIAT

Tracy A. Muenz Bryan Y. Ouchi Jason C. Cole 《Psychology in the schools》1999,36(1):31-40

The Peabody Individual Achievement Test–Revised (PIAT‐R) and Wechsler Individual Achievement Test (WIAT) contain measures of written expression. However, these subtests are not theory‐based and were assessed with inappropriate psychometric analyses. This study attempted to enhance the study of written expression by reexamining the reliability and validity of the PIAT‐R and WIAT Written Expression scoring systems, applying theory and more appropriate statistical analyses. First, items were identified that were the most and least reliable, determined by interrater agreement. Next, the most and least valid items were identified, based on item–total correlations. Subjects included 50 adolescents, men, and women aged 13 to 46 years; raters were three graduate students with experience and training similar to that of the typical test user. Results indicate that seven items were too easy, as virtually all subjects received the maximum score on these items—these items were eliminated. The remaining 24 items were classified as both reliable and valid (9 items), reliable but not valid (4 items), valid with limited reliability (5 items), and neither reliable nor valid (6 items). The WIAT written expression scoring system was found to have more items that were both reliable and valid compared to the PIAT‐R scoring system. Items measuring global, rather than specific, content were also found to be more reliable and valid. © 1999 John Wiley & Sons, Inc.

13.

The Personal Statement as an Indicator of Writing Skill: A Cautionary Note

《Educational Assessment》2013,18(1):75-87

The objective of this study was to evaluate the personal statement as an indicator of writing skill. The evaluation was based on a comparison of specially evaluated personal statements with a standardized measure of writing ability requiring examinees to write timed expository essays. A sample of prospective graduate students wrote test essays and provided copies of personal statements they had submitted for application to graduate school. A majority of the sample acknowledged receiving help in either drafting or revising their statements. Correlations of the 2 indicators (test essay and personal statement) with each of several nontest indicators of writing skill (e.g., self-reports of writing ability and grades on writing assignments) revealed that the traditional expository essay was significantly more highly related to nearly all of the indicators. It is suggested that, although the personal statement may provide certain unique and important information about applicants (and thus be valid for this purpose), its validity as an indicator of writing skill (defined as the ability to present and sustain a coherent discussion of a complex issue) needs to be better established.

14.

A Note on Presenting What Predictive Validity Numbers Mean

Brent Bridgeman Nancy Burton Frederick Cline 《Applied Measurement in Education》2013,26(2):109-119

15.

Repeater Analysis for Combining Information From Different Assessments

Shelby Haberman Lili Yao 《Journal of Educational Measurement》2015,52(2):223-251

Admission decisions frequently rely on multiple assessments. As a consequence, it is important to explore rational approaches to combine the information from different educational tests. For example, U.S. graduate schools usually receive both TOEFL iBT® scores and GRE® General scores of foreign applicants for admission; however, little guidance has been given to combine information from these two assessments, even though the relationships between such sections as GRE Verbal and TOEFL iBT Reading are obvious. In this study, principles are provided to explore the extent to which different assessments complement one another and are distinguishable. Augmentation approaches developed for individual tests are applied to provide an accurate evaluation of combined assessments. Because augmentation methods require estimates of measurement error and internal reliability data are unavailable, required estimates of measurement error are obtained from repeaters, examinees who took the same test more than once. Because repeaters are not representative of all examinees in typical assessments, minimum discriminant information adjustment techniques are applied to the available sample of repeaters to treat the effect of selection bias. To illustrate methodology, combining information from TOEFL iBT scores and GRE General scores is examined. Analysis suggests that information from the GRE General and TOEFL iBT assessments is complementary but not redundant, indicating that the two tests measure related but somewhat different constructs. The proposed methodology can be readily applied to other situations where multiple assessments are needed.

16.

Item-Response Changes on Multiple-Choice Tests as a Function of Test Anxiety

Kathy Green 《Journal of Experimental Education》2013,81(4):225-228

Item-response changing as a function of test anxiety was investigated. Seventy graduate students completed the Test Anxiety Scale and 73 multiple-choice items during the quarter. The data supported the hypothesis that high test-anxious students make more item-response changes than low test-anxious students. Results also suggested that both high- and low-anxious students profit to a similar extent proportionally from answer changing. It was further found that more responses were changed on difficult than on easy items for both high- and low-anxious students. Test anxiety is suggested as a factor forming test-taking style.

17.

Using the Attribute Hierarchy Method to Make Diagnostic Inferences about Examinees’ Cognitive Skills in Critical Reading

Changjiang Wang Mark J. Gierl 《Journal of Educational Measurement》2011,48(2):165-187

The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees’ test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model‐data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided best data‐model fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.

18.

Equating Minimum-Competency Tests: Comparisons of Methods

John R. Hills Raja G. Subhiyah Thomas M. Hirsch 《Journal of Educational Measurement》1988,25(3):221-231

The 1986 scores from Florida's Statewide Student Assessment Test, Part II (SSAT-II), a minimum-competency test required for high school graduation in Florida, were placed on the scale of the 1984 scores from that test using five different equating procedures. For the highest scoring 84% of the students, four of the five methods yielded results within 1.5 raw-score points of each other. They would be essentially equally satisfactory in this situation, in which the tests were made parallel item by item in difficulty and content and the groups of examinees were population cohorts separated by only 2 years. Also, the results from six different lengths of anchor items were compared. Anchors of 25, 20, 15, or 10 randomly selected items provided equatings as effective as 30 items using the concurrent IRT equating method, but an anchor of 5 randomly selected items did not.

19.

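A Corpus-Based Quality Analysis of the "Business Phrase Multiple Choice" Test Item

Li Yanyu (李延玉) 《Journal of Shenzhen Polytechnic (深圳职业技术学院学报)》2014,(6):50-56

Using all sittings of the National International Business English Test (Level 1) from April 2007 to May 2014 as the sample, and taking the "Business Phrase Multiple Choice" (BPMC) item in the reading module as an example, this study uses an English corpus and test-score analysis to examine the item-writing quality of BPMC items, together with their difficulty, discrimination, and reliability. Drawing on item-writing and test-administration practice, it also explores ways to improve the quality of business English vocabulary testing.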

20.

Differential Item Functioning for Minority Examinees on the SAT

Alicia P. Schmitt Neil J. Dorans 《Journal of Educational Measurement》1990,27(1):67-81

The standardization approach to assessing differential item functioning (DIF), including standardized distractor analysis, is described. The results of studies conducted on Asian Americans, Hispanics (Mexican Americans and Puerto Ricans), and Blacks on the Scholastic Aptitude Test (SAT) are described and then synthesized across studies. Where the groups were limited to include only examinees who spoke English as their best language, very few items across forms and ethnic groups exhibited large DIF. Major findings include evidence of differential speededness (where minority examinees did not complete SAT-Verbal sections at the same rate as White students with comparable SAT-Verbal scores) for Blacks and Hispanics and, when the item content is of special interest, advantages for the relevant ethnic group. In addition, homographs tend to disadvantage all three ethnic groups, but the effect of vertical relationships in analogy items is not as consistent. Although these findings are important in understanding DIF, they do not seem to account for all differences. Other variables related to DIF still need to be identified. Furthermore, these findings are seen as tentative until corroborated by studies using controlled data collection designs.