Similar Documents
 20 similar documents found (search time: 515 ms)
1.
In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained, open-ended test items by mimicking human scoring. Its use is increasing in educational assessment programs because it allows scores to be returned more quickly and at lower cost. In the module, the authors discuss automated scoring from a number of perspectives. First, they discuss the benefits and weaknesses of automated scoring and what psychometricians should know about it. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how these align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary.
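As a rough illustration of the train-then-score pipeline that the module's first activity walks through, the sketch below fits a simple regression model to human-scored responses and then scores a new response. It is a minimal, hypothetical example (toy responses, scikit-learn's TfidfVectorizer and Ridge), not the module's actual engine or its R-Shiny application.

```python
# Minimal, hypothetical automated scoring engine: bag-of-words features,
# a regression model trained on human scores, and rounding to the rubric
# scale. Toy data for illustration only; not the module's engine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# "Engine training": responses already scored by human raters (0-2 rubric).
responses = [
    "Plants use sunlight to make food through photosynthesis.",
    "The plant eats dirt to grow bigger.",
    "Photosynthesis converts light energy into chemical energy in glucose.",
    "Plants grow.",
]
human_scores = [2, 0, 2, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(responses)

model = Ridge(alpha=1.0)
model.fit(X, human_scores)

# "Operational scoring": predict, then clip and round to the rubric range.
new_response = ["Sunlight lets plants make their own food."]
raw = model.predict(vectorizer.transform(new_response))[0]
predicted_score = int(round(min(max(raw, 0), 2)))
print(predicted_score)
```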

2.
A framework for the evaluation and use of automated scoring of constructed-response tasks is provided that entails both evaluation of automated scoring and guidelines for its implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting the use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring on tests used for high-stakes purposes. These guidelines are intended to generalize to new automated scoring systems and to existing systems as they change over time.
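For concreteness, human-machine agreement of the kind this framework calls for is commonly summarized with exact agreement and quadratic weighted kappa. The sketch below computes both with made-up scores, using scikit-learn's cohen_kappa_score; it illustrates the statistics only and is not the framework's prescribed evaluation code.

```python
# Hypothetical sketch of agreement statistics often reported when
# evaluating automated scoring against human raters. Scores are invented.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([0, 1, 2, 2, 3, 1, 0, 2, 3, 1])
engine = np.array([0, 1, 2, 3, 3, 1, 1, 2, 3, 1])

exact_agreement = np.mean(human == engine)
qwk = cohen_kappa_score(human, engine, weights="quadratic")

print(f"Exact agreement: {exact_agreement:.2f}")
print(f"Quadratic weighted kappa: {qwk:.2f}")
```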

3.
Multiple scoring is widely used in large-scale assessments. Using a single response to support multiple inferences, as is done in multiple scoring, has implications for the validity of those inferences and of the interpretations based on assessment results. The purpose of this article is to review two types of multiple scoring practices and to discuss how multiple scoring affects inferences.

4.
5.
Applied Measurement in Education, 2013, 26(3): 281–299
The growing use of computers for test delivery, along with increased interest in performance assessments, has motivated test developers to develop automated systems for scoring complex constructed-response assessment formats. In this article, we add to the available information describing the performance of such automated scoring systems by reporting on generalizability analyses of expert ratings and computer-produced scores for a computer-delivered performance assessment of physicians' patient management skills. Two different automated scoring systems were examined. These automated systems produced scores that were approximately as generalizable as those produced by expert raters. Additional analyses also suggested that the traits assessed by the expert raters and the automated scoring systems were highly related (i.e., true correlations between test forms, across scoring methods, were approximately 1.0). In the appendix, we discuss methods for estimating this correlation, using ratings and scores produced by an automated system from a single test form.
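The appendix's estimator is not reproduced here, but the classical correction for attenuation conveys how a "true" correlation near 1.0 can be inferred from observed correlations and reliabilities; this is a standard textbook formula, shown for orientation, and not necessarily the exact method used in the article.

```latex
% Correction for attenuation: estimated true-score correlation between
% expert ratings (X) and automated scores (Y), given their observed
% correlation and each method's reliability (e.g., a generalizability
% coefficient). Standard formula; not necessarily the article's estimator.
\[
  \hat{\rho}_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}
\]
```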

6.
Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians' patient management skills; one system uses regression-derived weights for components of the performance, while the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances.
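To make the contrast between the two systems concrete, the sketch below scores invented simulation performances in two ways: a weighted sum of performance components (standing in for the regression-derived approach) and a rule-based mapping to score levels. Component counts, weights, and rules are all hypothetical; they are not the scoring algorithms actually studied.

```python
# Hypothetical contrast between regression-weighted and rule-based scoring
# of performance components. All numbers and rules are invented.
import numpy as np

# Each case: counts of beneficial, neutral, and harmful actions ordered.
cases = np.array([
    [8, 2, 0],
    [5, 3, 2],
    [2, 1, 6],
])

# (a) Regression-style scoring: weighted sum of components, with weights
#     that would be estimated from expert ratings (assumed here).
weights = np.array([0.9, 0.1, -1.2])
intercept = 1.0
regression_scores = cases @ weights + intercept

# (b) Rule-based scoring: map performance patterns to discrete levels.
def rule_based_level(beneficial, neutral, harmful):
    if harmful >= 4:
        return 1          # dangerous management
    if beneficial >= 7 and harmful == 0:
        return 4          # thorough and safe
    if beneficial >= 4:
        return 3
    return 2

rule_scores = [rule_based_level(*row) for row in cases]
print(regression_scores, rule_scores)
```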

7.
The scoring process is critical in the validation of tests that rely on constructed responses. Documenting that readers carry out the scoring in ways consistent with the construct and measurement goals is an important aspect of score validity. In this article, rater cognition is approached as a source of support for a validity argument for scores based on constructed responses, whether such scores are to be used on their own or as the basis for other scoring processes, for example, automated scoring.

8.
This article discusses critical methodological design decisions for collecting, interpreting, and synthesizing empirical evidence during the design, deployment, and operational quality-control phases of automated scoring systems. The discussion is inspired by work on operational large-scale systems for automated essay scoring, but many of the principles have implications for principled reasoning and workflow management in other use contexts. The overall workflow is described as a series of five phases, each with two critical sub-phases and a large number of associated methodological design decisions. These phases involve assessment design, linguistic component design, model design, model validation, and operational deployment. Brief examples illustrate the considerations behind these design decisions, which have to be carefully weighed in the overall decision-making process for the system, in order to unveil the complexities that underlie this work. The article closes with reflections on resource demands as well as recommendations for best practices for the interdisciplinary teams who engage in this work, underscoring how it blends scientific rigor and artful practice.

9.
What are the validity issues involved in automated scoring of tests? What is the nature of the interplay among construct definition, task design, examinee interface, tutorial, test development tools, and automated scoring and reporting?

10.
This article presents considerations for using automated scoring systems to evaluate second language writing. A distinction is made between English language learners in English-medium educational systems and those studying English in their own countries for a variety of purposes, and between learning-to-write and writing-to-learn in a second language (Manchón, 2011a), extending Manchón's framework from instruction to assessment and drawing implications for construct definition. Next, an approach to validity based on articulating an interpretive argument is presented and discussed with reference to a recent study of the use of e-rater on the TOEFL. Challenges and opportunities for the use of automated scoring systems are presented.

11.
'Mental models' used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, thereby suggesting that similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified discrepancies. The tendency of the human graders to be compelled by automated score rationales varied with the nature of the original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, between human and automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.

12.
Martin. Assessing Writing, 2009, 14(2): 88–115
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test, which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.

13.
The purpose of this study was to examine how different scoring procedures affect interpretation of maze curriculum-based measurements. Fall and spring data were collected from 199 students receiving supplemental reading instruction. Maze probes were scored first by counting all correct maze choices, followed by four scoring variations designed to reduce the effect of random guessing. Pearson's r correlation coefficients were calculated among scoring procedures and between maze scores and a standardized measure of reading. In addition, t tests were conducted to compare fall-to-spring growth for each scoring procedure. Results indicated that scores derived from the different procedures are highly correlated, demonstrate criterion-related validity, and show fall-to-spring growth. Educators working with struggling readers may use any of the five scoring procedures to obtain technically sound scores.
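The analyses described (correlations among scoring procedures and fall-to-spring growth) follow a standard pattern; the sketch below reproduces that pattern with invented scores for two hypothetical procedures, using scipy.stats. It is illustrative only and does not use the study's data.

```python
# Hypothetical sketch: correlation between two maze scoring procedures
# and a paired t test for fall-to-spring growth. Scores are invented.
import numpy as np
from scipy import stats

all_correct = np.array([14, 20, 11, 25, 18, 22, 9, 16])       # all correct choices
guess_adjusted = np.array([13, 19, 11, 24, 16, 21, 8, 15])    # a guessing-adjusted variant

r, p_r = stats.pearsonr(all_correct, guess_adjusted)

fall = np.array([12, 15, 9, 20, 14, 18, 7, 13])
spring = np.array([16, 21, 12, 26, 19, 23, 10, 17])
t, p_t = stats.ttest_rel(spring, fall)

print(f"r between scoring procedures: {r:.2f} (p = {p_r:.3f})")
print(f"fall-to-spring growth: t = {t:.2f}, p = {p_t:.3f}")
```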

14.
As methods for automated scoring of constructed-response items become more widely adopted in state assessments, and are used in more consequential operational configurations, it is critical that their susceptibility to gaming behavior be investigated and managed. This article provides a review of research relevant to how construct-irrelevant response behavior may affect automated constructed-response scoring, and aims to address a gap in that literature: the need to assess the degree of risk before operational launch. A general framework is proposed for evaluating susceptibility to gaming, and an initial empirical demonstration is presented using the open-source short-answer scoring engines from the Automated Student Assessment Prize (ASAP) Challenge.

15.
Scientific argumentation is one of the core practices for teachers to implement in science classrooms. We developed a computer-based formative assessment to support students' construction and revision of scientific arguments. The assessment is built upon automated scoring of students' arguments and provides feedback to students and teachers. Preliminary validity evidence was collected in this study to support the use of automated scoring in this formative assessment. The results showed satisfactory psychometric properties for the formative assessment. The automated scores showed satisfactory agreement with human scores, although small discrepancies remained. Automated scores and feedback encouraged students to revise their answers, and students' scientific argumentation skills improved during the revision process. These findings provide preliminary evidence supporting the use of automated scoring in the formative assessment to diagnose and enhance students' argumentation skills in the context of climate change in secondary school science classrooms.

16.
Due to myriad applications of the nominal group technique (NGT), a highly flexible iterative focus group method, researchers know little about its optimal scoring procedures. Exploring the benefits and biases that such procedures might present, we aim to clarify how NGT scoring systems can privilege consensus or prioritization. In conducting the first study both to feature NGT data from the same participants at multiple time points and to compare scoring procedures with actual, not simulated, data, we found clear differences between the consensus (ratings) and prioritization (rankings) scoring schemata's abilities to discriminate categories. We recommend that NGT users (1) state whether they intend to emphasize consensus, prioritization, or both; (2) name their scoring schema and explain it mathematically; and (3) detail the implications of their choices. We also discuss uses of NGT as a research tool, especially for global citizenship education, including study-abroad programmes in contexts where reliable access to electricity and/or the internet may be challenging.
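As a toy illustration of the consensus-versus-prioritization distinction the authors draw, the sketch below scores the same set of ideas under a ratings (consensus) schema and a rankings (prioritization) schema. Participants, ideas, and point values are invented; this is not one of the schemata evaluated in the study.

```python
# Hypothetical NGT scoring: consensus via summed ratings vs.
# prioritization via top-3 rank points. Data are invented.
import numpy as np

# Rows = participants, columns = candidate ideas A, B, C, D.
# Consensus schema: each participant rates every idea 1-5.
ratings = np.array([
    [5, 4, 4, 2],
    [4, 5, 4, 1],
    [5, 4, 3, 2],
])
consensus_scores = ratings.sum(axis=0)

# Prioritization schema: each participant ranks a top-3,
# awarding 3 points to the first pick, 2 to the second, 1 to the third.
top3_picks = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["A", "C", "B"],
]
ideas = ["A", "B", "C", "D"]
priority_scores = {idea: 0 for idea in ideas}
for picks in top3_picks:
    for points, idea in zip((3, 2, 1), picks):
        priority_scores[idea] += points

print(dict(zip(ideas, consensus_scores)), priority_scores)
```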

17.
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are used to calibrate and test AES systems; (b) which human raters provided the scores on these essays; and (c) given that multiple human raters are generally used for this purpose, which human scores should ultimately be used when there are score disagreements? This article provides commentary on the first two questions and an empirical investigation into the third question. The authors suggest that addressing these three questions strengthens the scoring component of the validity argument for any assessment that includes AES scoring.
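On the third question, one widely used convention (not necessarily the policy the authors investigate) is to average adjacent human scores and route larger discrepancies to an adjudicator. The sketch below encodes that convention; the function name and threshold are hypothetical.

```python
# Hypothetical score-resolution rule for double-scored essays:
# average adjacent scores, adjudicate larger discrepancies.
def resolved_score(rater1, rater2, adjudicator=None, max_adjacent_gap=1):
    gap = abs(rater1 - rater2)
    if gap == 0:
        return float(rater1)
    if gap <= max_adjacent_gap:
        return (rater1 + rater2) / 2.0       # adjacent: average the two
    if adjudicator is None:
        raise ValueError("discrepant scores require an adjudicated rating")
    return float(adjudicator)                # discrepant: adjudicator decides

print(resolved_score(3, 3))                  # 3.0
print(resolved_score(3, 4))                  # 3.5
print(resolved_score(2, 5, adjudicator=4))   # 4.0
```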

18.
Typical assessment systems often measure isolated ideas rather than the coherent understanding valued in current science classrooms. Such assessments may motivate students to memorize, rather than to use new ideas to solve complex problems. To meet the requirements of the Next Generation Science Standards, instruction needs to emphasize sustained investigations, and assessments need to create a detailed picture of students’ conceptual understanding and reasoning processes.

This article describes the design process and potential for automated scoring of 2 forms of inquiry assessment: Energy Stories and MySystem. To design these assessments, we formed a partnership of teachers, discipline experts, researchers, technologists, and psychometricians to align curriculum, assessments, and rubrics. We illustrate how these items document middle school students' reasoning about energy flow in life science. We used evidence from review by science teachers and experts in the discipline; classroom experiments; and psychometric analysis to validate the assessments, rubrics, and automated scoring.

19.
The purpose of this study was to test a taxonomy of seven proposed responses to anomalous data. Our results generally supported the taxonomy but indicated that one additional type of response should be added to the taxonomy. We conclude that there are eight possible responses to anomalous data: (a) ignoring the data, (b) rejecting the data, (c) professing uncertainty about the validity of the data, (d) excluding the data from the domain of the current theory, (e) holding the data in abeyance, (f) reinterpreting the data, (g) accepting the data and making peripheral changes to the current theory, and (h) accepting the data and changing theories. We suggest that this taxonomy could help science teachers in two ways. First, science teachers could use the taxonomy to try to anticipate how students might react to anomalous data so as to make theory change more likely. Second, science teachers could use the taxonomy as a framework to guide classroom discussion about the nature of scientific rationality. In addition, the taxonomy suggests directions for future research. © 1998 John Wiley & Sons, Inc. J Res Sci Teach 35: 623–654, 1998.

20.
Automated essay scoring is a developing technology that can provide efficient scoring of large numbers of written responses. Its use in higher education admissions testing provides an opportunity to collect validity and fairness evidence to support current uses and inform its emergence in other areas such as K–12 large-scale assessment. In this study, human and automated scores on essays written by college students with and without learning disabilities and/or attention deficit hyperactivity disorder were compared, using a nationwide (U.S.) sample of prospective graduate students taking the revised Graduate Record Examination. The findings are that, on average, human raters and the automated scoring engine assigned similar essay scores for all groups, despite average differences among groups with respect to essay length and spelling errors.
