Similar Literature
20 similar documents found.
1.
This article evaluates a procedure-based scoring system for a performance assessment (an observed paper towels investigation) and a notebook surrogate completed by fifth-grade students varying in hands-on science experience. Results suggested that interrater reliability of scores was adequate (>.80) for both observed performance and notebooks, with observed performance scored somewhat more reliably. Interrater agreement on procedures, in contrast, was higher for observed hands-on performance (.92) than for notebooks (.66). Moreover, for the notebooks, the reliability of scores and agreement on procedures varied by student experience, whereas this was not so for observed performance. Both the observed-performance and notebook measures correlated less with traditional ability than did a multiple-choice science achievement test. The correlation between the two performance assessments and the multiple-choice test was only moderate (mean = .46), suggesting that different aspects of science achievement were measured. Finally, the correlation between the observed-performance scores and the notebook scores was .83, suggesting that notebooks may provide a reasonable, albeit less reliable, surrogate for observed hands-on performance.
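
The reliability and agreement statistics above can be made concrete with a small sketch. The Python example below uses purely hypothetical rater data (the study's scores are not reproduced here) and shows one common way such figures are obtained: score reliability as a correlation between two raters' totals, and procedure agreement as the proportion of identical procedure codes.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores from two raters for the same eight student notebooks
# (illustrative only; the study's data are not reproduced here).
rater_a_scores = np.array([12, 15, 9, 14, 11, 16, 10, 13])
rater_b_scores = np.array([11, 15, 10, 13, 12, 16, 9, 14])

# Interrater reliability of total scores, estimated here as a Pearson correlation.
score_reliability, _ = pearsonr(rater_a_scores, rater_b_scores)

# Interrater agreement on procedures: the proportion of procedure codes on
# which the two raters assigned the same category.
rater_a_codes = ["saturate", "weigh", "pour", "weigh", "saturate", "pour"]
rater_b_codes = ["saturate", "weigh", "weigh", "weigh", "saturate", "pour"]
procedure_agreement = np.mean([a == b for a, b in zip(rater_a_codes, rater_b_codes)])

print(f"score reliability ≈ {score_reliability:.2f}")
print(f"procedure agreement ≈ {procedure_agreement:.2f}")
```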

2.
This study investigates the effect of the method of assessment on student performance. Five research conditions were each associated with one of four assessment modes: portfolio, case-based, peer, and multiple-choice evaluation. Data were collected with a pre-test/post-test design using two standardised tests (N = 816). Results show that the assessment method does make a difference, although assessments do not produce uniform overall effects on student performance. Moreover, student-activating instruction efforts do not automatically result in more extensive learning gains. Finally, the test results show a statistically significant positive effect of the multiple-choice test on students' scores compared with the other assessments. However, students' level of preparation and the closed-book format of the tests might explain this effect.

3.
Policies that require the use of information about student achievement to evaluate teacher performance are becoming increasingly common across the United States, but there is some question as to how or whether to use student test-based teacher evaluations when student assessments change. We bring empirical evidence to bear on this issue. Specifically, we examine how estimates of teacher value-added are influenced by assessment changes across 12 test transitions in two subjects and five states. In all of the math transitions we study, value-added measures from test change years and stable regime years are broadly similar in terms of their statistical properties and informational content. This is also true for some of the reading transitions; we do find, however, some cases in which an assessment change in reading meaningfully alters value-added measures. Our study directly informs contemporary policy debates about how to evaluate teachers when new assessments are introduced and provides a general analytic framework for examining employee evaluation policies in the face of changing evaluation metrics.

4.
The number of raters is theoretically central to peer assessment reliability and validity, yet it is rarely studied. Requiring each student to assess more peers' documents increases both the number of evaluations per document and the assessor's workload, which can degrade performance. Moreover, task complexity is likely a moderating factor, influencing both workload and validity. This study examined whether changing the number of required peer assessments per student (and thus the number of raters per document) affected peer assessment reliability and validity for tasks at different levels of complexity. A total of 181 students completed and provided peer assessments for tasks at three levels of task complexity: low (dictation), medium (oral imitation), and high (writing). Adequate validity of peer assessments was observed for all three task complexities at low reviewing loads. However, the impact of increasing the reviewing load varied between reliability and validity outcomes and by task complexity.
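
The study itself does not state which reliability model underlies the claim that the number of raters matters, but the classical Spearman-Brown prophecy formula is one standard way to see it: averaging more independent ratings raises the reliability of the composite rating. A minimal Python sketch follows; the single-rater reliability of .40 is an illustrative assumption, not a figure from the study.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Projected reliability of the mean of n_raters ratings, given the
    reliability of a single rating (Spearman-Brown prophecy formula)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)

# With an assumed single-rater reliability of .40, averaging ratings from
# 1, 3, and 5 peers projects composite reliabilities of about .40, .67, and .77.
for k in (1, 3, 5):
    print(f"{k} raters: projected reliability {spearman_brown(0.40, k):.2f}")
```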

5.
High-stakes standardized student assessments are increasingly used in value-added evaluation models to connect teacher performance to P–12 student learning. These assessments are also being used to evaluate teacher preparation programs, despite validity and reliability threats. A more rational model linking student performance to candidates who actually teach these students is presented. Preliminary findings with three candidate cohorts indicate that the majority of their students met learning objectives and showed substantial pre-to-post learning gains.

6.
Web-based homework (WBH) technology can simplify the creation and grading of assignments as well as provide a feasible platform for assessment testing, but its effect on student learning in business statistics is unknown. This is particularly true of the latest software development of Web-based tutoring agents that dynamically evaluate individual students' skill level and purport to respond with appropriate, targeted teaching to improve learning efficiency. In this article, we compare traditional, textbook-based homework assignments with three systems of WBH for undergraduate business statistics courses: ALEKS, PH Grade Assist, and custom-made online quizzes in Blackboard. These systems represent a range of media from artificial intelligence–based tutoring to instructor-controlled objective testing. Using a common assessment test, we compare the performance of students taught with these different systems. Our study finds, as we anticipated, that student performance depends significantly upon teacher experience and student academic competence. Once these factors are controlled for, however, the technique used to deliver homework makes little difference in student success. In contrast to other published research, we do not find any advantage to automated tutoring and identify some limitations of this approach based on both instructor and student feedback.

7.
This study examined the validity and generalizability of the use of Teacher Work Samples to assess preservice and inservice teachers' abilities to meet national and state teaching standards and to impact the learning of their students. Our approach built upon the Teacher Work Sample Methodology of Western Oregon University (Schalock, 1998; Schalock, Cowart, & Staebler, 1993). To assess the ability of work sample assessments to differentiate performances along the full developmental continuum from beginning to expert teaching, we recruited junior-level candidates, student teaching interns, experienced teachers, and National Board Certified teachers to complete teacher work samples. We also examined whether work samples could be feasibly and equitably administered and scored with sufficient reliability to warrant their use for high-stakes decisions about the effectiveness of teaching performance. Results of the study show initial support for teacher work sample assessment as a way to provide valid and credible evidence connecting teaching performance to student learning.

8.
This article addresses the rhetoric of performance assessment with research on important claims about science performance assessments. We found the following: (a) Concepts and terminology used to refer to performance assessments often were not consistent within and across researchers, educators, and policy-makers. (b) Performance assessments are highly sensitive not only to the tasks and the occasions sampled, but also to the method (e.g., hands-on, computer simulation) used to measure performance. (c) Performance assessments do not necessarily tap higher-order thinking, especially when they are poorly designed. (d) Performance assessments are expensive to develop and use: technology is needed for developing these assessments in an efficient way. (e) Performance assessments do not necessarily have the expected positive impact on teachers' teaching and students' understanding. (f) If teachers are to use performance assessments in their classrooms, they need professional development to help them construct the necessary knowledge and skills. This article attempts to address some of these realities by presenting a conceptual framework that might guide the development and the evaluation of performance assessments, as well as steps that might be taken to create a performance assessment technology and develop teacher inservice programs. © 1996 John Wiley & Sons, Inc.

9.
Digitally simulated laboratory assessments (DSLAs) may be used to measure competencies such as problem solving and scientific inquiry because they provide an environment that allows the process of learning to be captured. These assessments offer many advantages over traditional hands-on laboratory tasks; as such, it is important to investigate different ways to maximize the potential of DSLAs for increasing student learning. This study investigated two enhancements, a pre-laboratory activity (PLA) and a learning error intervention (LEI), that are hypothesized to enhance the use of DSLAs as an educational tool. The results indicate that students who were administered the PLA reported statistically lower levels of test anxiety than their peers who did not receive the activity. Furthermore, students who received the LEI obtained statistically higher scores on the more difficult problems administered during and after the DSLA. These findings provide preliminary evidence that both a PLA and an LEI may be beneficial in improving students' performance on a DSLA. Understanding the benefits of these enhancements may help educators better utilize DSLAs in the classroom to improve student science achievement.

10.
One of the benefits claimed for computer-based assessment is that it can improve student performance in summative assessments. During the introduction of computer-based assessment in a first-year module on numeracy and statistics in Biology, online assessment was used to replace OMR-marked multiple-choice tests. Analysis of student results after the first year (Ricketts & Wilks, 2001) showed that students using online assessment did not perform as well as those using OMR-marked multiple-choice questions. The difference in performance could not be attributed to a weaker student cohort. In the second year, student performance improved dramatically when students were not required to scroll through the question paper. Our results suggest that students may be disadvantaged by the introduction of online assessment unless care is taken with the student-assessment interface.

11.
The use of assessment results to inform school accountability relies on the assumption that the test design appropriately represents the content and cognitive emphasis reflected in the state's standards. Since the passage of the Every Student Succeeds Act and the certification of accountability assessments through federal peer review practices, the content validity arguments supporting accountability have relied almost exclusively on the alignment of statewide assessments to state standards. It is assumed that if alignment does not hold, the scores will not support valid inferences about test takers' performance. Although alignment results are commonly used as evidence of test appropriateness, Polikoff (this issue) would argue that, given the importance of alignment in policy decisions, research related to alignment is surprisingly limited. Few studies have addressed the adequacy of alignment methodologies and results as support for the inferences to be made (i.e., proficiency on state standards). This paper uses an example of test taker performance (and common performance indicators) to investigate the extent to which the degree of alignment affects inferences made about performance (i.e., classification into performance levels, estimates of student ability, and student rank order).

12.
Educators in higher education commonly use peer and self evaluations to help assess student performance on group projects. Although these evaluations provide multiple benefits, many educators are wary of using them due to concerns about their quality. This study addresses three questions debated in the literature regarding the quality of these assessments. How much do students differentiate among peer contributions through their ratings? How reliable are peer ratings? How much agreement exists between peer and self ratings? Although these questions have been addressed to varying degrees in past work, their answers have been far from settled. While many studies focus on just one of the questions, this study's data make it possible to address all three questions for the same group of students as well as examine each question by student performance level. The evaluations assessed in this study were completed by a large number of students under conditions associated with obtaining more valid and reliable ratings. Overall, the results provide support for using peer and self evaluations to help assess student contributions to group projects. Peer ratings were largely reliable as group members generally agreed on the scores given to their peers. In addition, most students differentiated among group member contributions through their ratings. Students also tended to rate themselves higher than their peers rated them. This study has implications for how peer and self evaluations can be most effectively used by educators to measure student performance in group work.

13.
Currently, there are no Hebrew (L2) reading assessments that have been tested to obtain evidence for reliability and validity on which to base decisions about Hebrew instruction. The authors developed a Hebrew benchmark assessment tool for first grade students modeled after Dynamic Indicators of Basic Early Literacy Skills, a standardized test of accuracy and fluency used to identify at-risk students and to monitor student progress. Results of pilot data collection (N=53) provide evidence for strong alternate form reliability for this measure, as well as evidence for content, face and criterion-related validity. Future directions for research and development are discussed.

14.
Applied Measurement in Education, 2013, 26(2): 173-185
More attention is being given to evaluating the quality of school-level assessment scores due to their importance for school-based planning and monitoring effectiveness. In this study, cross-year stability is proposed as an indicator of data quality and the degree of stability that is appropriate for large-scale assessments of student performance is explored. Following a search of Internet sites, Year 1 to Year 2 stability coefficients were calculated for assessment data from 21 states and 2 provinces. The median stability coefficient was .78 in mathematics and reading, but coefficients for writing were generally lower. A stability coefficient of .80 is recommended as the standard for large-scale assessments of student performance. A high degree of cross-year stability makes it easier to detect and attribute changes in school-level scores to school improvement efforts. The link between stability and reliability and several factors that may attenuate stability are discussed.
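
Cross-year stability as used above is essentially a correlation of school-level scores across adjacent years. A minimal Python sketch of that computation follows, using made-up school means rather than the state and province data analysed in the article.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical school-level mean scores for the same six schools in two
# adjacent years (illustrative values; not the data analysed in the article).
year1_means = np.array([231.4, 245.0, 219.8, 250.3, 238.7, 226.1])
year2_means = np.array([233.9, 243.2, 222.5, 251.0, 235.4, 229.8])

# Cross-year stability, estimated as the Pearson correlation of school means.
stability, _ = pearsonr(year1_means, year2_means)
print(f"cross-year stability coefficient ≈ {stability:.2f}")
```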

15.
Africa Education Review, 2013, 10(4): 563-583
Summative assessment qualifies the achievement of a student in a particular field of specialization at a given time. Questions should cover a range of cognitive levels from Bloom's taxonomy and be consistent with the learning outcomes of the module in question. Furthermore, a holistic approach to assessment, such as applying the principles of the Herrmann Whole Brain Model, is needed to accommodate diversity in learning styles. The purpose of this study was to analyse, assess, and compare the summative assessment of two third-year modules in the Bachelor of Science degree programme, namely Biochemistry and Zoology, as part of action research aimed at enhancing the professional development of the lecturers involved. The questions posed in summative assessments were classified in terms of Bloom's cognitive levels and the four learning styles identified by Herrmann. Spearman's non-parametric analysis indicated that, in this study, no correlation existed between cognitive level and student performance based on achievement. In addition, the relationship between cognitive level and student performance differed little between the two disciplines. Although the students seemed to do better on application-level questions, the authors need to reflect on whether the assessments were valid with respect to the learning outcomes, the methods of facilitating learning, and the assessments based on cognitive levels and learning style preferences. We conclude that continuous action research must be undertaken to improve the formulation of learning outcomes, students' achievement of these outcomes, and the quality of student learning, the main aim being the successful completion of the modules.

16.
Educational Assessment, 2013, 18(4): 255-258
Editor's Introduction. Reliability Versus Accuracy: A Critical Distinction. Test reliability coefficients traditionally have been used to judge the quality of measurement, and reliability coefficients of .90 have often been considered adequate to assure quality for standardized testing and large-scale assessment programs. However, a test reliability of .90 (or above) does not ensure that individual test scores, such as national percentile ranks, are accurate. Consider, for example, a mathematics test with a reliability of .90, and imagine a student taking that test whose true score is at the 50th percentile; that is, we know that the student's actual capability is at that level. The probability is less than one third (.309) that, when the student takes the test, he or she will obtain a score within 5 percentile points of his or her true score, the 50th percentile (Rogosa, 1999a, 1999b). The following informal example attempts to explain why high test reliability does not indicate good accuracy for an individual score, without the encumbrances of percentile rank scoring, complex measurement models, and other technical detail. Dedicated to Al Bundy, a man who cares as much about good measurement as he does about his own children.
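
The .309 figure can be reproduced under a simple normal classical-test-theory model, in which an observed z-score equals the true z-score plus error with standard deviation sqrt(1 - reliability). The Python sketch below is an illustrative reconstruction of that calculation, not necessarily the exact derivation in Rogosa (1999a, 1999b).

```python
from math import sqrt
from scipy.stats import norm

reliability = 0.90          # test reliability assumed in the example
true_percentile = 0.50      # the student's true score sits at the 50th percentile

# In z-score units, the observed score is the true score plus error with
# standard deviation sqrt(1 - reliability) under this simple model.
error_sd = sqrt(1 - reliability)
z_true = norm.ppf(true_percentile)

# z-score bounds corresponding to percentile ranks 45 and 55.
z_lo = norm.ppf(true_percentile - 0.05)
z_hi = norm.ppf(true_percentile + 0.05)

# Probability that the observed score lands within 5 percentile points
# of the true score.
p = norm.cdf(z_hi, loc=z_true, scale=error_sd) - norm.cdf(z_lo, loc=z_true, scale=error_sd)
print(f"P(observed within 5 percentile points of the true score) ≈ {p:.3f}")  # ≈ 0.309
```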

17.
This study focuses on an issue of recent policy significance: the need to aid teachers in successfully identifying why children struggle to acquire literacy. This study (a) asked K–1 teachers to nominate students whom they believed to be at risk for literacy difficulties and to provide reasons for their concern, (b) examined how these reasons relate to teachers' broader conceptions of literacy, and (c) investigated whether teachers' initial reasons and checklist-guided ratings align with concurrently administered standardized assessments. Results revealed that teachers have a wide array of initial concerns for students. There was some discordance between teachers' specific reasons for concern and their broader conceptions of early literacy. Comparison of student performance on standardized measures with teachers' rationales also revealed discordance. Specific guidelines to teachers on use of a literacy checklist increased concordance between subsequent teacher ratings and standardized measures in some reading-related skills but not others. Implications for the use of multiple sources of evidence for student performance, as well as professional development, are discussed.

18.
Assessing the degree to which interventions are implemented in school settings is critical to making decisions about student outcomes. School psychologists may not be available to conduct regular observations of intervention implementation; however, their data may be used alongside other methods for multi-informant assessment. Teacher self-report is a commonly used and feasible assessment method. Students have been trained to implement interventions with their peers in instances where traditional adult interventionists were unavailable. This exploratory study investigated the accuracy with which classroom teachers and middle and high school students assessed implementation of the Good Behavior Game and the impact of performance feedback on their accuracy. Results indicated that most students and teachers were able to provide accurate assessments of treatment integrity compared to researcher direct observation; however, some required performance feedback to do so. These findings suggest that multi-informant assessment may be a feasible and accurate way for school psychologists to collect formative treatment-integrity data in the classroom. Limitations and future directions are discussed.

19.
We propose a multilevel-multifaceted approach to evaluating the impact of education reform on student achievement that would be sensitive to context and small treatment effects. The approach uses different assessments based on their proximity to the enacted curriculum. Immediate assessments are artifacts (students' products) from the enactment of the curriculum; close assessments parallel the content and activities of the unit/curriculum; proximal assessments tap knowledge and skills relevant to the curriculum, but topics can be different; and distal assessments reflect state/national standards in a particular knowledge domain. To provide evidence about the sensitivity of the multilevel approach in ascertaining outcomes of hands-on science programs, we administered close, proximal, and distal performance assessments to evaluate the impact of instruction based on two Full Option Science System units (Variables; Mixtures and Solutions) in a Bay Area school district. Results indicated that close assessments were more sensitive to the changes in students' pre- to post-test performance than proximal assessments. © 2002 Wiley Periodicals, Inc. J Res Sci Teach 39: 369–393, 2002

20.
Conventional analysis of student results, as in rubric-based assessments (RBA), has emphasized numeric scores as the primary way of communicating information to teachers about their students' learning. In this light, it is of utmost importance to rethink and reflect not only on how scores are generated but also on what analyses are done with them to inform classroom practice. Informed by Systemic Functional Linguistics and Latent Dirichlet Allocation analyses, this study uses an innovative bilingual (Spanish-English) constructed-response assessment of science and language practices for middle and high school students to perform a multilayered analysis of student responses. We explore multiple ways of looking at students' performance through their written assessments and discuss features of student responses that are made visible through these analyses. Findings from this study suggest that science educators would benefit from a multidimensional model that deploys complementary ways of interpreting student performance. This understanding leads us to think that researchers and developers in the field of assessment need to promote approaches that analyze student science performance as a multilayered phenomenon.
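
The abstract names Latent Dirichlet Allocation as one analytic layer. The sketch below shows, in outline, how a topic model might be fit to constructed responses with scikit-learn; the responses, vocabulary handling, and number of topics are illustrative assumptions rather than the authors' actual pipeline (which also draws on Systemic Functional Linguistics).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical bilingual constructed responses (illustrative; not the study corpus).
responses = [
    "The salt dissolved faster in warm water because the particles move more",
    "La sal se disolvio mas rapido en agua caliente porque las particulas se mueven mas",
    "We measured the temperature of the water before and after mixing the solution",
    "The mixture separated into its parts when we filtered it through the paper",
]

# Bag-of-words representation, then a small LDA topic model over the responses.
vectorizer = CountVectorizer(lowercase=True)
doc_term = vectorizer.fit_transform(responses)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)       # per-response topic mixtures

# Top words per latent topic: one possible "layer" of a multilayered analysis.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top_terms)}")
```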
