1.

Test Development with Performance Standards and Achievement Growth in Mind

Steve Ferrara Dubravka Svetina Sylvia Skucha Anne H. Davidson 《Educational Measurement》2011,30(4):3-15

2.

Same-Form Retest Effects on Credentialing Examinations

Mark R. Raymond Sandra Neustel Dan Anderson 《Educational Measurement》2009,28(2):19-27

Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.

3.

Repeat Testing Effects on Credentialing Exams: Are Repeaters Misinformed or Uninformed?

Richard A. Feinberg Mark R. Raymond Steven A. Haist 《Educational Measurement》2015,34(1):34-39

To mitigate security concerns and unfair score gains, credentialing programs routinely administer new test material to examinees retesting after an initial failing attempt. Counterintuitively, a small but growing body of recent research suggests that repeating the identical form does not create an unfair advantage. This study builds upon and extends this research by investigating changes in responses to specific items encountered on both the first and repeat attempts. Results indicate that score gains for repeat examinees who were assigned an identical form were not different from those of repeat examinees who received a different, but parallel, form. Analyses of responses to individual items answered incorrectly on the initial attempt found that examinees selected the same incorrect option 68% of the time on their second attempt, suggesting repeaters are misinformed rather than uninformed. Implications for feedback, remediation, and retesting policies are discussed.

4.

An Application of Item Response Time: The Effort-Moderated IRT Model

Steven L. Wise Christine E. DeMars 《Journal of Educational Measurement》2006,43(1):19-38

The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
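The abstract above does not state the model's functional form, so the following is only a minimal sketch of the general idea, under stated assumptions: a response whose time falls below a per-item threshold is treated as a rapid guess and assigned chance probability, and every other response follows the standard 3PL curve. The function names, the `threshold` parameter, and the `1 / n_options` chance level are illustrative assumptions, not details taken from the article.

```python
import math


def three_pl(theta, a, b, c):
    """Standard 3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def effort_moderated_prob(theta, a, b, c, response_time, threshold, n_options):
    """Sketch of an effort-moderated response probability.

    Assumption: a response faster than `threshold` seconds is a rapid guess,
    so its probability of being correct is chance (1 / n_options) and does
    not depend on proficiency. Slower responses follow the 3PL model.
    """
    solution_behavior = response_time >= threshold
    if solution_behavior:
        return three_pl(theta, a, b, c)
    return 1.0 / n_options
```

Under this sketch, rapid guesses carry no information about `theta`, which is one way to read the abstract's claim that accounting for response time improves proficiency estimates.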

5.

Viability of construct validity of the speaking modules of international language examinations (IELTS vs. TOEFL iBT): evidence from Iranian test-takers

Keivan Zahedi Saeedeh Shamsaee 《Educational Assessment, Evaluation and Accountability》2012,24(3):263-277

The aim of the present research is to examine the viability of the construct validity of the speaking modules of two internationally recognized language proficiency examinations, namely IELTS and TOEFL iBT. High-stakes standardized tests play a crucial and decisive role in determining the future academic life of many people. Candidates' overall scores are believed to reflect their general proficiency level. Appropriate interpretation and use of test scores depend on the extent to which items measuring a particular skill (here, speaking) can meet the criteria to examine the intended construct. Speaking, among the four skills, has a central place in assessing the general proficiency of candidates. This research scrutinizes how IELTS and TOEFL iBT tap into the speaking proficiency of their candidates. Moreover, this study investigates whether candidates' speaking scores on these two international high-stakes tests show an acceptable degree of consistency in measuring the skill being examined. The sample consisted of 60 students who successfully completed TOEFL iBT and IELTS preparation courses in Tehran. The results of the statistical analysis show that there is a meaningful discrepancy between the two exams in assessing the speaking abilities of the exam-takers and therefore challenge the construct validity of the exams in question. Findings are then used to discuss the repercussions for language proficiency measurement and assessment.

6.

Uncovering Multivariate Structure in Classroom Observations in the Presence of Rater Errors

Daniel F. McCaffrey Kun Yuan Terrance D. Savitsky J. R. Lockwood Maria O. Edelen 《Educational Measurement》2015,34(2):34-46

We examine the factor structure of scores from the CLASS‐S protocol obtained from observations of middle school classroom teaching. Factor analysis has been used to support both interpretations of scores from classroom observation protocols, like CLASS‐S, and the theories about teaching that underlie them. However, classroom observations contain multiple sources of error, most predominantly rater errors. We demonstrate that errors in scores made by two raters on the same lesson have a factor structure that is distinct from the factor structure at the teacher level. Consequently, the “standard” approach of analyzing teacher‐level average dimension scores can yield incorrect inferences about the factor structure at the teacher level and possibly misleading evidence about the validity of scores and theories of teaching. We consider alternative hierarchical estimation approaches designed to prevent the contamination of estimated teacher‐level factors. These alternative approaches find a teacher‐level factor structure for CLASS‐S that consists of strongly correlated support and classroom management factors. Our results have implications for future studies using factor analysis on classroom observation data to develop validity evidence and test theories of teaching and for practitioners who rely on the results of such studies to support their use and interpretation of the classroom observation scores.

7.

Preventive screening for early readers: Predictive validity of the Dynamic Indicators of Basic Early Literacy Skills (DIBELS)

Catherine T. Goffreda James Clyde Diperna Jason A. Pedersen 《Psychology in the schools》2009,46(6):539-552

Current empirical evidence indicates poor learning trajectories for students with early literacy skill deficits. As such, reliable and valid detection of at‐risk students through regular screening and progress monitoring is imperative. This study investigated the predictive validity of scores on the Dynamic Indicators of Basic Early Literacy Skills (DIBELS). Logistic regression analyses were used to test the utility of the DIBELS first grade indicators for predicting reading proficiency on TerraNova California Achievement Test (CAT) Assessment and Pennsylvania System of School Assessment (PSSA) in second and third grade, respectively. Results suggest that students' first grade Oral Reading Fluency (ORF) DIBELS risk category scores were the only significant predictor of future TerraNova and PSSA reading proficiency. Although the current data present encouraging results for the predictive validity of ORF as a screening tool for early readers, further investigations of the utility of the remaining indicators (Letter Naming Fluency, Nonsense Word Fluency, and Phonemic Segmentation Fluency) are warranted. © 2009 Wiley Periodicals, Inc.

8.

Application of Implicational Scaling to the Study of Fairness in the HSK Reading Comprehension Test

柴省三《考试研究》2012,(5):54-62

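Whether the passages selected for a reading comprehension test are fair in content to subgroups of examinees with different academic backgrounds is important evidence of test validity and a key step in test validation. Taking examinees majoring in Chinese language and literature as the focal group, and examinees majoring in economics and in biomedicine as reference groups, this study combined criterion measurement with implicational scaling analysis to examine whether the passage difficulty of the HSK (Advanced) reading comprehension test is fair to the three groups of examinees with different academic backgrounds. The results show that although the two reference groups each have their own relative disciplinary advantages, the difficulty ordering they obtained on the six reading passages was identical to that of the focal group. Although the focal group has no disciplinary advantage beyond knowledge of Chinese, the reading passages selected for the HSK impose no specialized subject-matter demands beyond language knowledge itself, so the test is highly fair to examinees from the three academic backgrounds.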

9.

INTERNAL FACTOR STRUCTURE AND CONVERGENT VALIDITY EVIDENCE: THE SELF‐REPORT VERSION OF SELF‐REGULATION STRATEGY INVENTORY

Timothy J. Cleary Leah Dembitzer Ryan J. Kettler 《Psychology in the schools》2015,52(9):829-844

Using a sample of 348 middle school students, we gathered evidence regarding the internal consistency of scores, as well as the internal factor structure and convergent validity evidence for inferences from a self‐report questionnaire called the Self‐Regulation Strategy Inventory–Self Report. Confirmatory factor analysis revealed that the fit indexes for a hierarchical model (composite, three factors) and a single‐level, three‐factor model were highly similar but mixed. Respecification of the hierarchical model based on conceptual overlap of items led to substantial improvement in the overall fit of the model, as indicated by the root mean square error of approximation, chi‐square/df, and the comparative fit index. Correlational analyses also provided strong convergent validity evidence, as the three subscales exhibited statistically significant relations with four motivation beliefs (i.e., self‐efficacy, perceived instrumentality, task interest, perceived responsibility) and two distinct markers of regulation‐related behaviors (i.e., teacher ratings, office discipline referrals).

10.

Building Validity Evidence for Scores on a State-Wide Alternate Assessment: A Contrasting Groups, Multimethod Approach

Stephen N. Elliott Elizabeth Compton Andrew T. Roach 《Educational Measurement》2007,26(2):30-43

The relationships between ratings on the Idaho Alternate Assessment (IAA) for 116 students with significant disabilities and corresponding ratings for the same students on two norm-referenced teacher rating scales were examined to gain evidence about the validity of resulting IAA scores. To contextualize these findings, another group of 54 students who had disabilities but were not officially eligible for the alternate assessment was also assessed. Evidence to support the validity of the inferences about IAA scores was mixed, yet promising. Specifically, the relationships among the reading, language arts, and mathematics achievement level ratings on the IAA and the concurrent scores on the ACES-Academic Skills scales for the eligible students varied across grade clusters, but in general were moderate. These findings provided evidence that IAA scales measure skills indicative of the state's content standards. This point was further reinforced by moderate to high correlations between the IAA and Idaho State Achievement Test (ISAT) for the not-eligible students. Additional evidence concerning the valid use of the IAA was provided by logistic regression results showing that the scores do an excellent job of differentiating students who were eligible from those not eligible to participate in an alternate assessment. The collective evidence for the validity of the IAA scores suggests it is a promising assessment for NCLB accountability of students with significant disabilities. The methods of establishing this evidence have the potential to advance validation efforts of other states' alternate assessments.

11.

Comprehension monitoring and reading comprehension in bilingual students

Svjetlana Kolić‐Vehovec Igor Bajšanski 《Journal of Research in Reading》2007,30(2):198-211

This study explored comprehension monitoring, use of reading strategies and reading comprehension of bilingual students at different levels of perceived proficiency in Italian. The participants were bilingual fifth to eighth‐grade elementary school students from four Italian schools in Rijeka, Croatia. Students' reading comprehension was assessed. Their comprehension monitoring skill was measured on the Metacomprehension test and through use of a cloze task. The Strategic Reading Questionnaire (SRQ) was used as a self‐report measure of strategic reading. A questionnaire investigating Italian language use and perceived proficiency in the Italian language was also administered. Perceived proficiency in Italian was not clearly determined by early or late preschool age of second language acquisition. Bilingual students with high perceived proficiency in Italian (high PP group) had better meta‐cognitive reading skills than those with low perceived proficiency in Italian (low PP group). Comprehension monitoring was the most important predictor of reading comprehension in all students.

12.

Validity Issues for Performance-Based Tests Scored With Computer-Automated Scoring Systems

《教育实用测度》2013,26(4):413-432

With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer.

13.

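A Survey of the Practical-Skills Examination in the College Entrance Examination for Physical Education Majors in Guizhou Province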

王有智张东秀李康林单春华《黔南民族师范学院学报》2013,(6):97-101

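Using expert interviews, literature review, and mathematical statistics, this study examined the practical-skills examination of the college entrance examination for physical education majors in Guizhou Province. The results show that the current fixed set of four physical fitness tests cannot comprehensively assess candidates. Recommendations: add an agility component to the physical fitness tests, expanding them from four categories with four items to five categories with five items; add a sport-specific skills examination; weight physical fitness and sport-specific skills at 75:25; set the total test score uniformly at 100 points; and admit candidates whose physical education and academic examination scores both meet the cutoff in descending order of their physical education score.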

14.

Can Examinees Use Judgments of Item Difficulty to Improve Proficiency Estimates on Computerized Adaptive Vocabulary Tests?

Walter P. Vispoel Sara J. Clough Timothy Bleiler Amy B. Hendrickson Damien Ihrig 《Journal of Educational Measurement》2002,39(4):311-330

Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items.

15.

Psychometric and Cognitive Functioning of an Under-Determined Computer-Based Response Type for Quantitative Reasoning

Randy Elliot Bennett Mary Morley Dennis Quardt Donald A. Rock Mark K. Singley Irvin R. Katz Adisack Nhouyvanisvong 《Journal of Educational Measurement》1999,36(3):233-252

We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses related to internal consistency reliability, external relations, and features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.

16.

Using behavioral and academic indicators in the classroom to screen for at‐risk status

Laura Belsito Bruce A. Ryan Kathleen Brophy 《Psychology in the schools》2005,42(2):151-158

The present study validated a brief at‐risk screening instrument designed for easy use by teachers in the elementary school. School performance measures were collected for students in first to sixth grade one year following initial teacher ratings using the Screening For At‐Risk Status screening instrument. Findings indicated that the instrument is best seen as measuring a single at‐risk construct with items drawn from three domains: academic skills, social confidence, and social cooperation. Correlations between at‐risk scores and school performance measures taken one year later demonstrated predictive validity. The screening instrument correctly identified at‐risk students with 88% accuracy and not‐at‐risk students with 74% accuracy. There were 12% false negatives. Use of the instrument provides teachers with a quick, easy screening of students who may develop difficulties in the future. For schools, the screening can be used as the first step in a supportive response system to keep at‐risk students from developing serious school difficulties and possibly failure in the longer term. © 2005 Wiley Periodicals, Inc. Psychol Schs 42: 151–158, 2005.
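The accuracy figures in this abstract can be read as sensitivity (88%), specificity (74%), and a false-negative rate (12%) that is the complement of sensitivity. The sketch below shows that arithmetic under this reading; the counts are hypothetical values chosen only to reproduce the reported percentages and are not data from the study.

```python
def screening_rates(true_pos, false_neg, true_neg, false_pos):
    """Return sensitivity, specificity, and false-negative rate.

    Sensitivity is the share of truly at-risk students the screen flags;
    specificity is the share of not-at-risk students it correctly clears.
    """
    at_risk = true_pos + false_neg
    not_at_risk = true_neg + false_pos
    sensitivity = true_pos / at_risk
    specificity = true_neg / not_at_risk
    false_negative_rate = false_neg / at_risk
    return sensitivity, specificity, false_negative_rate


# Hypothetical counts per 100 students in each group (illustration only).
sens, spec, fnr = screening_rates(true_pos=88, false_neg=12, true_neg=74, false_pos=26)
```

With these illustrative counts, sensitivity and the false-negative rate sum to one, which is consistent with the 88% and 12% figures reported together in the abstract.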

17.

The validity and comparability of entrance examination scores after accommodations are made for students with LD

Zurcher R Bryant DP 《Journal of learning disabilities》2001,34(5):462-471

Every year, thousands of college and university applicants with learning disabilities (LD) present scores from standardized examinations as part of the admissions process for postsecondary education. Many of these scores are from tests administered with nonstandard procedures due to the examinees' learning disabilities. Using a sample of college students with LD and a control sample, this study investigated the criterion validity and comparability of scores on the Miller Analogies Test when accommodations for the examinees with LD were in place. Scores for examinees with LD from test administrations with accommodations were similar to those of examinees without LD on standard administrations, but less well associated with grade point averages. The results of this study provide evidence that although scores for examinees with LD from nonstandard test administrations are comparable to scores for examinees without LD, they have less criterion validity and are less meaningful for their intended purpose.

18.

Interactive Homework: A Tool for Fostering Parent–Child Interactions and Improving Learning Outcomes for At-risk Young Children

Lora Battle Bailey 《Early Childhood Education Journal》2006,34(2):155-167

The notion that parent involvement impacts student learning outcomes for children who are at risk for failing academically has been supported by prominent early childhood education experts. Recent attention has been given to specific ways parents can help increase student learning through their interactions with children as they complete home learning activities. It is important to note that the term parent is used interchangeably with the terms adult, guardian and family member. The term “at-risk reader” refers to readers who are at risk of failing school because of reading deficiencies. This report will examine whether parent training to increase parent–child interactions during the completion of second grade Interactive Homework Assignments (IHA) can facilitate increases in a student’s ability to draw inferences from reading selections, a skill closely aligned with proficiency in reading acquisition. The second grade level was chosen because these children were those whose teachers were concerned with preparing them to take the third grade SAT9. Third grade level was not selected because many of their professional development activities were prescribed due to their immediate concern with preparing students to take the SAT9. IHA, for the scope of this study, is homework designed to increase parent involvement and student achievement. The results indicate that specific parent training during a brief period of time, approximately four weeks, has the potential for improving academic performance for academically at-risk students.

19.

Development and Application of a CEFR-Based Basic-Level Chinese Proficiency Test

王暄博蔡雅熏郭伯臣赵日彰《暨南大学华文学院学报》2012,(1):32-41

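In recent years, as demand for learning Chinese has grown, Chinese proficiency tests for non-native speakers have drawn increasing attention worldwide; however, these tests still have limitations and shortcomings. Based on the Common European Framework of Reference (CEFR) and with reference to the Chinese Language Proficiency Indicators (《华语文能力指标》) compiled by 蔡雅熏 (2009), this study developed A2-level Chinese listening and reading tests and applied the techniques of modern test theory (item response theory, IRT) to build a reliable and valid computerized Chinese proficiency test. Finally, using subscale score estimation methods, the paper examines examinees' performance on the four language abilities in the CEFR; the results show that examinees' production and reception abilities are stronger than their interaction and mediation abilities.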

20.

EFFECTS OF TEACHER–CHILD INTERACTION TRAINING (TCIT) ON TEACHER RATINGS OF BEHAVIOR CHANGE

Lauren L. Garbacz Kristen E. Zychinski Rachel M. Feuer Jocelyn S. Carter Karen S. Budd 《Psychology in the schools》2014,51(8):850-865

Problem behaviors in preschool‐aged children negatively affect teacher‐child relationships and children's skill development. In this clinical replication of an initial study, we implemented Teacher–Child Interaction Training (TCIT), a teacher‐delivered, universal intervention designed for early childhood settings. The initial study evaluated the TCIT program in a sample of 4‐ to 5‐year‐old children, whereas the current study focused on 2‐ to 3‐year‐old children. Teacher ratings of children's behavior indicated a significant main effect for time on children's protective factor scores, but not on behavioral concerns. However, for children whose ratings fell in the below‐average range at baseline, significant large effect sizes were obtained for changes over time for both protective factors and behavioral concerns. Higher levels of teacher skill change were significantly associated with overall higher protective factor scores, as well as lower behavioral concern scores for children when baseline levels of behavioral concerns were high. Results provide further support for the effectiveness of TCIT as a universal intervention designed to improve children's behaviors through targeted improvements in teachers’ relationship‐building skills and classroom management strategies.