期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Can a Two‐Question Test Be Reliable and Valid for Predicting Academic Outcomes?

Brent Bridgeman 《Educational Measurement》2016,35(4):21-24

Scores on essay‐based assessments that are part of standardized admissions tests are typically given relatively little weight in admissions decisions compared to the weight given to scores from multiple‐choice assessments. Evidence is presented to suggest that more weight should be given to these assessments. The reliability of the writing scores from two of the large volume admissions tests, the GRE General Test (GRE) and the Test of English as a Foreign Language Internet‐based test (TOEFL iBT), based on retesting with a parallel form, is comparable to the reliability of the multiple‐choice Verbal or Reading scores from those tests. Furthermore, and even more important, the writing scores from both tests are as effective as the multiple‐choice scores in predicting academic success and could contribute to fairer admissions decisions. 相似文献

2.

Validity of GRE General Test scores and TOEFL scores for graduate admission to a technical university in Western Europe*

Judith Zimmermann Alina A. von Davier Joachim M. Buhmann Hans R. Heinimann 《European Journal of Engineering Education》2018,43(1):144-165

Graduate admission has become a critical process in tertiary education, whereby selecting valid admissions instruments is key. This study assessed the validity of Graduate Record Examination (GRE) General Test scores for admission to Master’s programmes at a technical university in Europe. We investigated the indicative value of GRE scores for the Master’s programme grade point average (GGPA) with and without the addition of the undergraduate GPA (UGPA) and the TOEFL score, and of GRE scores for study completion and Master’s thesis performance. GRE scores explained 20% of the variation in the GGPA, while additional 7% were explained by the TOEFL score and 3% by the UGPA. Contrary to common belief, the GRE quantitative reasoning score showed only little explanatory power. GRE scores were also weakly related to study progress but not to thesis performance. Nevertheless, GRE and TOEFL scores were found to be sensible admissions instruments. Rigorous methodology was used to obtain highly reliable results. 相似文献

3.

A study of the long-term stability of GRE general test scores

Kenneth M. Wilson 《Research in higher education》1988,29(1):3-40

Some applicants for admission to graduate programs present Graduate Record Examinations (GRE) General Test scores that are several years old. Due to different experiences over time, older GRE verbal, quantitative, and analytical scores may no longer accurately reflect the current capabilities of the applicants. To provide evidence regarding the long-term stability of GRE scores, test-retest correlations and average change (net gain) in test performance were analyzed for GRE General Test repeaters classified by time between test administrations in intervals ranging from less than 6 months to 10 years or more. Findings regarding average changes in verbal and quantitative test performance for long-term repeaters (with 5 years or more between tests), generally, and by graduate major area, sex, and ethnicity, appeared to be consistent with a differential growth hypothesis: Long-term repeaters generally, and in all of the subgroups, registered greater average (net) score gain on verbal tests than on quantitative tests and, for subgroups, the amount of gain tended to vary directly with initial means. A rationale is presented for a growth interpretation of the observed average gains in test performance. Implications for graduate school and GRE Program policies regarding the treatment of older test scores are considered. 相似文献

4.

A Computer-Based Task for Measuring the Representational Component of Quantitative Proficiency

Randy Elliot Bennett Marc M. Sebrechts 《Journal of Educational Measurement》1997,34(1):64-77

In this study, we created a computer-delivered problem-solving task based on the cognitive research literature and investigated its validity for graduate admissions assessment. The task asked examinees to sort mathematical word problem stems according to prototypes. Data analyses focused on the meaning of sorting scores and examinee perceptions of the task. Results showed that those who sorted well tended to have higher GRE General Test scores and college grades than did examinees who sorted less proficiently. Examinees generally preferred this task to multiple-choice items like those found on the General Test's Quantitative section and felt the task was a fairer measure of their ability to succeed in graduate school. Adaptations of the task might be used in admissions tests, as well as for instructional assessments to help lower- scoring examinees localize and remediate problem-solving difficulties. 相似文献

5.

Viability of construct validity of the speaking modules of international language examinations (IELTS vs. TOEFL iBT): evidence from Iranian test-takers

Keivan Zahedi Saeedeh Shamsaee 《Educational Assessment, Evaluation and Accountability》2012,24(3):263-277

The aim of the present research is to examine the viability of the construct validity of the speaking modules of two internationally recognized language proficiency examinations, namely IELTS and TOEFL iBT. High-stake standardized tests play a crucial and decisive role in determining the future academic life of many people. Overall obtained scores of candidates are believed to reflect their general proficiency level. Appropriate interpretation and use of test scores depend on the extent to which items measuring a particular skill (here speaking) can meet the criteria to examine the intended construct. Speaking, amongst the other four skills, has a central place in assessing general proficiency of the candidates. This research seeks to scrutinize how IELTS and TOEFL iBT tap on the speaking proficiency of their candidates. Moreover, this study investigates whether obtained speaking scores of candidates in these two international high-stake tests show an acceptable degree of consistency in measuring the skill being examined. The chosen sample of the study consisted of 60 students who successfully completed TOEFL iBT and IELTS preparation courses in Tehran. The results of the statistical analysis show that there is a meaningful discrepancy between the two exams in assessing the speaking abilities of the exam-takers and therefore challenge the construct validity of the exams in question. Findings are then used to discuss the repercussions for language proficiency measurement and assessment. 相似文献

6.

Computer-Adaptive Testing of ESL Reading Proficiency

杨眉《读与写:教育教学刊》2011,(3):10+14

With the advent of modern computer technology,there have been growing efforts in recent years to computerize standardized tests,including the popular Graduate Record Examination(GRE),the Graduate Management Admission Test(GMAT) and the Test of English as a Foreign Language(TOEFL).Many of such computer-based tests are known as the computerized adaptive tests,whose major feature is that,depending on their performance in the course of testing,different examinees may be given with different sets of items(questions).In doing so,items can be efficiently utilized to realize maximum accuracy for estimation of examinee’s ability. In this short paper we will introduce briefly the computer-adaptive test(CAT).The application of CAT to the assessment of reading comprehension in a second language will also be illustrated in this paper.The advantages and disadvantages will be analyzed,based on which some recommendations will be given for future study. 相似文献

7.

从新托福模式看大学英语四六级考试改革新动向

周文《四川教育学院学报》2009,25(6):94-95

托福考试的变革增加了口语部分,实现了人机对话,并且除了阅读部分,在其他部分中都与听力理解密切结合,听力的地位凸显。同时在口语和写作部分都有综合试题,分别融合了从读到听再到说或是写的过程,或是从听到说的过程。这些都对新一轮大学英语四六级考试的改革方向带来新的契机,并产生了重要的影响,具有深远的意义。相似文献

8.

EFFECTS OF COACHING ON GRE APTITUDE TEST SCORES

DONALD E. POWERS 《Journal of Educational Measurement》1985,22(2):121-136

Test preparation activities were determined for a large representative sample of Graduate Record Examination (GRE) Aptitude Test takers. About 3% of these examinees had attended formal coaching programs for one or more sections of the test.
After adjusting for differences in the background characteristics of coached and uncoached students, effects on test scores were related to the length and the type of programs offered. The effects on GRE verbal ability scores were not significantly related to the amount of coaching examinees received, and quantitative coaching effects increased slightly but not significantly with additional coaching. Effects on analytical ability scores, on the other hand, were related significantly to the length of coaching programs, through improved performance on two analytical item types, which have since been deleted from the test.
Overall, the data suggest that, when compared with the two highly susceptible item types that have been removed from the GRE Aptitude Test, the test item types in the current version of the test (now called the GRE General Test) appear to show relatively little susceptibility to formal coaching experiences of the kinds considered here. 相似文献

9.

Who Benefits Most From Preparing for a “Coachable” Admissions Test?

Donald E. Powers 《Journal of Educational Measurement》1987,24(3):247-262

A previous study of the initial, preoperational version of the Graduate Record Examinations (GRE) analytical ability measure (Powers & Swinton, 1984) revealed practically and statistically significant effects of test familiarization on analytical test scores. (Two susceptible item types were subsequently removed from the test.) Data from this study were reanalyzed for evidence of differential effects for subgroups of examinees classified by age, ethnicity, degree aspiration, English language dominance, and performance on other sections of the GRE General Test. The results suggested little, if any, difference among subgroups of examinees with respect to their response to the particular kind of test preparation considered in the study. Within the limits of the data, no particular subgroup appeared to benefit significantly more or significantly less than any other subgroup. 相似文献

10.

Predictors of Success on the Counselor Preparation Comprehensive Examination

E. A. Schmidt Linda E. Homeyer John L. Walker 《Counselor Education & Supervision》2009,48(4):226-238

This study examined the relationship between 403 counseling graduate students' scores on the Counselor Preparation Comprehensive Examination (CPCE; Center for Credentialing and Education, n.d.) and 3 admissions requirements used as predictor variables: undergraduate grade point average (UGPA), Graduate Record Examinations (GRE) General Test Verbal Reasoning (GRE‐V) score, and GRE General Test Quantitative Reasoning (GRE‐Q) score. Multiple regression analyses revealed that all predictor variables accounted for somewhat limited, yet significant variations in the CPCE‐Total scores (R² = .21). Results indicated that UGPAs, GRE‐V scores, and GRE‐Q scores are valid criteria for determining counseling graduate student success on the CPCE. 相似文献

11.

Effects of Differentially Time-Consuming Tests on Computer-Adaptive Test Scores

Brent Bridgeman Frederick Cline 《Journal of Educational Measurement》2004,41(2):137-148

Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees must guess on the final six questions of the analytical section of the Graduate Record Examination if they were to finish before time expires. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test. 相似文献

12.

雅思与托福交际测试特点与有用性的对比分析 总被引：1，自引：0，他引：1

齐佳敏《怀化学院学报》2010,29(10)

雅思与托福iBT皆是交际测试理论实践的先驱和典范。从Weir(2003)所归纳的交际测试特征以及Bachman和Palmer(1996)的交际测试有用性评价原则出发,对这两种大型语言测试分项进行交际测试对比和评价,以检测两者在应用交际测试理论的彻底性。研究结果显示两种语言测试在各项交际测试理论应用中各有侧重,互有千秋。相似文献

13.

使用聚类分析验证Angoff专家判断法有效性的研究——以医师资格考试医学综合笔试临床执业类别考试为例

卢燕张颖《中国考试》2010,(5)

本研究随机抽取了21387名参加某年医师资格考试医学综合笔试临床执业类别考试考生,对其在外科学上的得分进行了聚类分析,并将按照边界组法以及对照组法计算出的边界分数与Angoff专家判断的合格分数进行了对比。结果表明:两者对考生分类的一致性Kappa系数高达0.934,充分证明了Angoff合格分数判断法的有效性。相似文献

14.

Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL

《Assessing Writing》2005,10(1):5-43

We assessed whether and how the discourse written for prototype integrated tasks (involving writing in response to print or audio source texts) field tested for Next Generation TOEFL^® differs from the discourse written for independent essays (i.e., the TOEFL Essay^®). We selected 216 compositions written for six tasks by 36 examinees in a field test—representing score levels 3, 4, and 5 on the TOEFL Essay—then coded the texts for lexical and syntactic complexity, grammatical accuracy, argument structure, orientations to evidence, and verbatim uses of source text. Analyses with non-parametric MANOVAs followed a three (task type: TOEFL Essay, writing in response to a reading passage, writing in response to a listening passage) by three (English proficiency level: score levels 3, 4, and 5 on the TOEFL Essay) within-subjects factorial design. The discourse produced for the integrated writing tasks differed significantly from the discourse produced in the independent essay for the variables of: lexical complexity (text length, word length, ratio of different words to total words written), syntactic complexity (number of words per T-unit, number clauses per T-unit), rhetoric (quality of propositions, claims, data, warrants, and oppositions in argument structure), and pragmatics (orientations to source evidence in respect to self or others and to phrasing the message as either declarations, paraphrases, or summaries). Across the three English proficiency levels, significant differences appeared for the variables of grammatical accuracy as well as all indicators of lexical complexity (text length, word length, ratio of different words to total words written), one indicator of syntactic complexity (words per T-unit), one rhetorical aspect (quality of claims in argument structure), and two pragmatic aspects (expression of self as voice, messages phrased as summaries). 相似文献

15.

Patterns of Solution Behavior across Items in Low-Stakes Assessments

Dena A. Pastor 《Educational Assessment》2019,24(3):189-212

The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees. 相似文献

16.

Psychometric and Cognitive Functioning of an Under-Determined Computer-Based Response Type for Quantitative Reasoning

Randy Elliot Bennett Mary Morley Dennis Quardt Donald A. Rock Mark K. Singley Irvin R. Katz Adisack Nhouyvanisvong 《Journal of Educational Measurement》1999,36(3):233-252

We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses related to internal consistency reliability, external relations, and features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions. 相似文献

17.

Generalizability, Validity, and Examinee Perceptions of a Computer-Delivered Formulating-Hypotheses Test

Randy Elliot Bennett Donald A. Rock 《Journal of Educational Measurement》1995,32(1):19-36

The Formulating-Hypotheses (F-H) item presents a situation and asks examinees to generate as many explanations for it as possible. This study examined the generalizability, validity, and examinee perceptions of a computer-delivered version of the task. Eight F-H questions were administered to 192 graduate students. Half of the items restricted examinees to 7 words per explanation, and half allowed up to 15 words. Generalizability results showed high interrater agreement, with tests of between 2 and 4 items scored by one judge achieving coefficients in the .80s. Construct validity analyses found that F-H was only marginally related to the GRE General Test, and more strongly related than the General Test to a measure of ideational fluency. Different response limits tapped somewhat different abilities, with the 15-word constraint appearing more useful for graduate assessment. These items added significantly to conventional measures in explaining school performance and creative expression. 相似文献

18.

An Experimental, Exploratory Study of Causes of Bias in Test Items

Janiee Dowd Seheuneman 《Journal of Educational Measurement》1987,24(2):97-118

This study evaluated 16 hypotheses, subsumed under 7 more general hypotheses, concerning possible sources of bias in test items for black and white examinees on the Graduate Record Examination General Test (GRE). Items were developed in pairs that were varied according to a particular hypothesis, with each item from a pair administered in different forms of an experimental portion of the GRE. Data were analyzed using log linear methods. Ten of the 16 hypotheses showed interactions between group membership and the item version indicating a differential effect of the item manipulation on the performance of black and white examinees. The complexity of some of the interactions found, however, suggested that uncontrolled factors were also differentially affecting performance. 相似文献

19.

OBTAINING MAXIMUM LIKELIHOOD TRAIT ESTIMATES FROM NUMBER-CORRECT SCORES FOR THE THREE-PARAMETER LOGISTIC MODEL

WENDY M. YEN 《Journal of Educational Measurement》1984,21(2):93-111

A procedure is presented for obtaining maximum likelihood trait estimates from number-correct (NC) scores for the three-parameter logistic model. The procedure produces an NC score to trait estimate conversion table, which can be used when the hand scoring of tests is desired or when item response pattern (IP) scoring is unacceptable for other (e.g., political) reasons. Simulated data are produced for four 20-item and four 40-item tests of varying difficulties. These data indicate that the NC scoring procedure produces trait estimates that are tau-equivalent to the IP trait estimates (i.e., they are expected to have the same mean for all groups of examinees), but the NC trait estimates have higher standard errors of measurement than IP trait estimates. Data for six real achievement tests verify that the NC trait estimates are quite similar to the IP trait estimates but have higher empirical standard errors than IP trait estimates, particularly for low-scoring examinees. Analyses in the estimated true score metric confirm the conclusions made in the trait metric. 相似文献

20.

Psychometric Equivalence of Ratings for Repeat Examinees on a Performance Assessment for Physician Licensure

Mark R. Raymond Kimberly A. Swygert Nilufer Kahraman 《Journal of Educational Measurement》2012,49(4):339-361

Although a few studies report sizable score gains for examinees who repeat performance‐based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single‐take examinees and 4,030 repeat examinees who completed a 6‐hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication‐interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single‐take and multiple‐take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low‐scoring examinees regardless of retest status. In addition, on their first attempt multiple‐take examinees exhibited less score consistency across the skill domains but on their second attempt their scores became more consistent. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple‐take examinees on their first attempt but increased to .27 for their second attempt, a value, which was comparable to the median correlation of .26 for single‐take examinees. The findings support the validity of inferences based on scores from the second attempt. 相似文献