Similar Documents
20 similar documents found (search time: 437 ms)
1.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. The method is illustrated using a strong true score model, and practical applications are given, including a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine how scale score reliability and conditional standard errors of measurement are affected by (a) the type of raw-to-scale score transformation used (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
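The article's strong true score model is not reproduced here, but the flavor of a conditional raw-score SEM can be sketched with Lord's binomial error model (an assumption for illustration, not the paper's exact method): for raw score x on an n-item test, SEM(x) = sqrt(x(n − x)/(n − 1)).

```python
import math

def binomial_csem(x, n):
    """Lord's binomial conditional SEM for raw score x on an n-item test."""
    return math.sqrt(x * (n - x) / (n - 1))

# Error is largest mid-scale and shrinks to zero at the extremes, which is
# one reason nonlinear raw-to-scale transformations reshape the pattern of
# scale-score conditional SEMs along the score scale.
```

A scale built to equalize these errors must stretch the scale where the raw-score error is small and compress it where the error is large.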

2.
Several studies have shown that the standard error of measurement (SEM) can be used as an additional “safety net” to reduce the frequency of false‐positive or false‐negative student grading classifications. Practical examinations in clinical anatomy are often used as diagnostic tests to admit students to course final examinations. The aim of this study was to explore the diagnostic value of SEM using the likelihood ratio (LR) in establishing decisions about students with practical examination scores at or below the pass/fail cutoff score in a clinical anatomy course. Two hundred sixty‐seven students took three clinical anatomy practical examinations in 2011. The students were asked to identify 40 anatomical structures in images and prosected specimens in the practical examination. Practical examination scores were then divided according to the following cutoff scores: 2, 1 SEM below, and 0, 1, 2 SEM above the pass score. The positive predictive value (+PV) and LR of passing the final examination were estimated for each category to explore the diagnostic value of practical examination scores. The +PV (LR) in the six categories defined by the SEM was 39.1% (0.08), 70.0% (0.30), 88.9% (1.04), 91.7% (1.43), 95.8% (3.00), and 97.8% (5.74), respectively. The LR of categories 2 SEM above/below the pass score generated a moderate/large shift in the pre‐ to post‐test probability of passing. The LR increased the usefulness and practical value of SEM by improving confidence in decisions about the progress of students with borderline scores 2 SEM above/below the pass score in practical examinations in clinical anatomy courses. Anat Sci Educ. © 2013 American Association of Anatomists.
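The study's raw category counts are not given here, but the LR arithmetic it relies on is standard diagnostic-test algebra. A minimal sketch (function names are mine):

```python
def likelihood_ratio(cat_pass, total_pass, cat_fail, total_fail):
    """LR for a score category: P(category | passed final) / P(category | failed final)."""
    return (cat_pass / total_pass) / (cat_fail / total_fail)

def post_test_probability(pre_prob, lr):
    """Shift a pre-test pass probability to a post-test probability via the LR."""
    pre_odds = pre_prob / (1 - pre_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)
```

An LR near 1 (as in the 0-SEM category, 1.04) barely moves the pre-test probability, while the extreme categories (0.08 and 5.74) shift it substantially, which is the study's rationale for trusting decisions only about scores 2 SEM above or below the cut.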

3.
Research has shown that many educators do not understand the terminology or displays used in test score reports and that measurement error is a particularly challenging concept. We investigated graphical and verbal methods of representing measurement error associated with individual student scores. We created four alternative score reports, each constituting an experimental condition, and randomly assigned them to research participants. We then compared comprehension and preferences across the four conditions. In our main study, we collected data from 148 teachers. For comparison, we studied 98 introductory psychology students. Although we did not detect statistically significant differences across conditions, we found that participants who reported greater comfort with statistics tended to have higher comprehension scores and tended to prefer more informative displays that included variable-width confidence bands for scores. Our data also yielded a wealth of information regarding existing misconceptions about measurement error and about score-reporting conventions.

4.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different format? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's ‘reliability length’, the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but can be focused on specific test objectives (e.g. in relation to cut‐off scores) and can be expressed as easily communicated statements even before tests are written.
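The proposed index is straightforward once the blind-guessing score is fixed. A sketch, approximating the most probable blind-guessing score by the chance score of a no-penalty multiple-choice test (an approximation; strictly, the binomial mode is intended, and negative marking changes the guessing score):

```python
def sem_per_reliability_length(sem, max_score, guessing_score):
    """SEM divided by 'reliability length' (max score minus blind-guessing score)."""
    return sem / (max_score - guessing_score)

# 100-item, 4-option MC test without penalty: guessing score ~25, so an
# SEM of 3 points gives an index of 0.04 that can be compared across
# tests of different length and format.
```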

5.
This paper describes four procedures previously developed for estimating conditional standard errors of measurement for scale scores: the IRT procedure (Kolen, Zeng, & Hanson, 1996), the binomial procedure (Brennan & Lee, 1999), the compound binomial procedure (Brennan & Lee, 1999), and the Feldt-Qualls procedure (Feldt & Qualls, 1998). These four procedures are based on different underlying assumptions. The IRT procedure is based on the unidimensional IRT model assumptions. The binomial and compound binomial procedures employ, as the distribution of errors, the binomial model and compound binomial model, respectively. By contrast, the Feldt-Qualls procedure does not depend on a particular psychometric model; it simply translates any estimated conditional raw-score SEM to a conditional scale-score SEM. These procedures are compared in a simulation study involving two-dimensional data sets, where the presence of two dimensions reflects a violation of the IRT unidimensionality assumption. The relative accuracy of these procedures for estimating conditional scale-score standard errors of measurement is evaluated under various circumstances. The effects of three different types of transformations of raw scores are investigated: developmental standard scores, grade equivalents, and percentile ranks. All the procedures discussed appear viable. A general recommendation is made that test users select a procedure based on factors such as the type of scale score of concern, characteristics of the test, assumptions involved in the estimation procedure, and the feasibility and practicability of the estimation procedure.
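The Feldt-Qualls idea of translating a conditional raw-score SEM into a scale-score SEM through the raw-to-scale transformation can be sketched as a delta-method approximation using the transformation's local slope (an illustrative sketch; the published procedure's details may differ):

```python
def scale_csem(raw_score, raw_csem, transform, h=1e-5):
    """Approximate the conditional scale-score SEM as |g'(x)| * raw CSEM,
    where g is the raw-to-scale transformation, via a numerical slope."""
    slope = (transform(raw_score + h) - transform(raw_score - h)) / (2 * h)
    return abs(slope) * raw_csem
```

For a linear transformation the scale CSEM is just the raw CSEM times the slope; nonlinear transformations such as grade equivalents or percentile ranks inflate or shrink the error wherever the transformation is steep or flat, which is exactly what the simulation study compares.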

6.
Whenever the purpose of measurement is to inform an inference about a student’s achievement level, it is important that we be able to trust that the student’s test score accurately reflects what that student knows and can do. Such trust requires the assumption that a student’s test event is not unduly influenced by construct-irrelevant factors that could distort the score. This article examines one such factor—test-taking motivation—that tends to induce a person-specific, systematic negative bias on test scores. Because current measurement models underlying achievement testing assume students respond effortfully to test items, it is important to identify test scores that have been materially distorted by non-effortful test taking. A method for conducting effort-related individual score validation is presented, and it is argued that measurement professionals have a responsibility to flag invalid scores for those who make inferences about student achievement on the basis of those scores.

7.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.
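The paper's full decomposition into true and error parts is not reproduced here, but its starting point — the spread of observed scores on one test at each fixed observed score on a parallel form — is easy to sketch (names are mine):

```python
from collections import defaultdict
import statistics

def conditional_sds(x_scores, y_scores):
    """SD of test X scores at each fixed observed score on parallel form Y.
    Returns {y_score: sd}, skipping score levels with fewer than two cases."""
    groups = defaultdict(list)
    for x, y in zip(x_scores, y_scores):
        groups[y].append(x)
    return {y: statistics.pstdev(v) for y, v in groups.items() if len(v) > 1}
```

The new method then splits each of these conditional observed-score variances into a true-score part and an error part, which is what distinguishes its CSEP and CSEMP estimates.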

8.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of about 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and for a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
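An Angoff cut score is simply the mean, over raters, of each rater's summed item probability judgments; the "within one standard error" comparison uses the standard error of that mean across raters. A minimal sketch:

```python
import statistics

def angoff_cut(judgments):
    """judgments[r][i]: rater r's probability that a minimally competent
    examinee answers item i correctly. Returns (cut score, SE across raters)."""
    per_rater = [sum(row) for row in judgments]
    cut = statistics.mean(per_rater)
    se = statistics.stdev(per_rater) / len(per_rater) ** 0.5
    return cut, se
```

With a 45-item stratified subset, the same computation runs over 45 judgments per rater, and the subset cut is rescaled to the full-test metric before comparison.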

9.
Numerous methods have been proposed and investigated for estimating the standard error of measurement (SEM) at specific score levels. Consensus on the preferred method has not been obtained, in part because there is no standard criterion. The criterion procedure in previous investigations has been a single test occasion procedure. This study compares six estimation techniques. Two criteria were calculated by using test results obtained from a test-retest or parallel forms design. The relationship between estimated score level standard errors and the score scale was similar for the six procedures. These relationships were also congruent with findings from previous investigations. Similarity between estimates and criteria varied over methods and criteria. For test-retest conditions, the estimation techniques are interchangeable, and the user's selection could be based on personal preference. However, for parallel forms conditions, the procedures resulted in estimates that were meaningfully different. The preferred estimation technique would be Feldt's method (cited in Gupta, 1965; Feldt, 1984).

10.
In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

11.
The goal of this study was to explore the effectiveness of a short web-based tutorial in helping teachers to better understand the portrayal of measurement error in test score reports. The short video tutorial included both verbal and graphical representations of measurement error. Results showed a significant difference in comprehension scores between each of two tutorial groups (basic and enhanced) and the control group (no tutorial) but not between the two tutorial groups. Results also provided evidence of teachers' misconceptions about the meaning of measurement error and confidence bands.

12.
Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines.
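The variance-stabilizing step can be sketched directly: under a binomial error model, the transformed score arcsin(sqrt(x/n)) has approximately constant error variance 1/(4n) across the score range (illustrative only; the full scaling method also applies a linear rescaling and rounding):

```python
import math

def arcsine_transform(x, n):
    """Variance-stabilizing arcsine transformation of a number-correct score."""
    return math.asin(math.sqrt(x / n))

# The raw proportion x/n has binomial error variance p(1-p)/n, which peaks
# mid-scale; after the transformation the error variance is roughly 1/(4n)
# everywhere, so a fixed +/- band gives near-constant coverage.
```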

13.
Four methods are outlined for estimating or approximating from a single test administration the standard error of measurement of number-right test score at specified ability levels or cutting scores. The methods are illustrated and compared on one set of real test data.

14.
A Review of the Evolution of the Conceptual Framework of “Test Linking”   Cited by: 1 (self-citations: 0, external citations: 1)
Cheng Qian (程乾). Examinations Research (考试研究), 2013, (2): 71-79
Test linking is an important area of psychological and educational measurement research: statistical methods are used to express the scores of one test in the score units of another test, or to place the scores of two tests on a common score scale. Although test linking has a long research history, different scholars have classified it in different ways. Some of these classifications use identical terminology yet define it very differently, which has caused considerable confusion among researchers and practitioners. It is therefore necessary to trace the conceptual framework of linking and its evolution from a historical perspective, so that test linking can be better understood and applied.

15.
We make a distinction between the operational practice of using an observed score to assess differential item functioning (DIF) and the concept of departure from measurement invariance (DMI) that conditions on a latent variable. DMI and DIF indices of effect sizes, based on the Mantel-Haenszel test of common odds ratio, converge under restricted conditions if a simple sum score is used as the matching or conditioning variable in a DIF analysis. Based on theoretical results, we demonstrate analytically that matching on a weighted sum score can significantly reduce the difference between DIF and DMI measures over what can be achieved with a simple sum score. We also examine the utility of binning methods that could facilitate potential operational use of DIF with weighted sum scores. A real-data application is included to demonstrate feasibility.
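The Mantel-Haenszel common odds ratio underlying both indices is computed from the 2×2 tables formed at each matched score level (notation here is mine; operational DIF programs add the ETS delta scaling and significance tests):

```python
def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score levels.
    Each stratum is (a, b, c, d): a/b are the reference group's
    correct/incorrect counts, c/d the focal group's."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# A ratio of 1 means no DIF at the matched levels; values above 1 favor
# the reference group. Replacing the simple sum score used to form the
# strata with a weighted sum score is the modification the paper studies.
```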

16.
Cutoff scores based on absolute standards can be unacceptable in terms of the number of failures they produce. Cutoff scores based on relative standards, that is, cutoff scores set to achieve a fixed percentage of failures, can be unacceptable because an acceptable performance level for passed examinees cannot be guaranteed. In some situations one can improve upon an absolute standard using a compromise model, which draws on the information in the observed score distribution for a test to adjust the standard. Three compromise models are described and compared in this article.
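The abstract does not name the three models it compares, but Hofstee's compromise is the best-known example of the idea: judges bound both the acceptable cut scores and the acceptable failure rates, and the cut is taken where the line joining (k_min, f_max) and (k_max, f_min) meets the observed failure-rate curve. A sketch over integer cut scores:

```python
def hofstee_cut(scores, k_min, k_max, f_min, f_max):
    """Pick the integer cut in [k_min, k_max] whose observed failure rate
    is closest to the Hofstee line joining (k_min, f_max) and (k_max, f_min)."""
    best_c, best_gap = k_min, float("inf")
    for c in range(k_min, k_max + 1):
        fail_rate = sum(s < c for s in scores) / len(scores)
        line = f_max + (f_min - f_max) * (c - k_min) / (k_max - k_min)
        gap = abs(fail_rate - line)
        if gap < best_gap:
            best_c, best_gap = c, gap
    return best_c
```

This is the compromise in action: the absolute judgment (the cut range) is adjusted toward whatever the observed score distribution says the failure rate would actually be.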

17.
Educational Assessment, 2013, 18(4): 317-340
A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.
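The weighting scheme evaluated — CR points scaled so the CR section's maximum contribution equals the SR section's — is simple to state in code (a sketch; operational scoring rounds and rescales further):

```python
def weighted_total(sr_score, cr_score, sr_max, cr_max):
    """Weight the CR section so its maximum contribution equals the SR maximum."""
    w = sr_max / cr_max
    return sr_score + w * cr_score

# With 40 SR points and 20 CR points, each CR point counts double, so a
# perfect CR section contributes 40 points, matching a perfect SR section.
```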

18.
The Tianjin Elementary Information Technology Examination is an assessment system, open to the public, that tests candidates' computer skills. As a criterion-referenced examination, it has used a score of 60 as the passing standard since it was introduced in 2004; practice has shown, however, that 60 cannot serve as a permanent standard for deciding whether a candidate passes. The examination is computer-based and candidates register voluntarily, so examinee ages vary widely, spanning primary school grades 2-6, with students of different ages taking each level. A fixed cutoff of 60 ignores the fact that the average ability of the examinees differs from one administration to the next, as well as the fact that different candidates in the same administration are not presented with exactly the same items. As a result, we can learn only candidates' relative ability and relative standing, and if candidates cannot be correctly assigned to the appropriate grade categories, the value of such a graded examination is greatly reduced. This article therefore studies how the "pass" standard score for this examination system should be set, using the Angoff method to establish the cutoff score and applying it objectively to the examinee population, offering a useful exploration of improving the reliability and validity of the examination.

19.
In competency testing, it is sometimes difficult to properly equate scores of different forms of a test and thereby assure equivalent cutting scores. Under such circumstances, it is possible to set standards separately for each test form and then scale the judgments of the standard setters to achieve equivalent pass/fail decisions. Data from standard setters and examinees for a medical certifying examination were reanalyzed. Cutting score equivalents were derived by applying a linear procedure to the standard-setting results. These were compared against criteria along with the cutting score equivalents derived from typical examination equating procedures. Results indicated that the cutting score equivalents produced by the experts were closer to the criteria than standards derived from examinee performance, especially when the number of examinees used in equating was small. The root mean square error estimate was about 1 item on a 189-item test.

20.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
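As one concrete instance of the Bayes/non-Bayes contrast, an EAP (expected a posteriori) pattern-score estimator for a 2PL model can be sketched with a simple quadrature grid under a standard normal prior (illustrative only; operational programs use proper quadrature weights and calibrated item parameters):

```python
import math

def p_correct(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, n_points=81):
    """EAP proficiency estimate for 0/1 responses under a N(0, 1) prior.
    items: list of (a, b) parameter pairs, one per response."""
    grid = [-4 + 8 * i / (n_points - 1) for i in range(n_points)]
    post = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # prior kernel; normalization cancels
        for u, (a, b) in zip(responses, items):
            p = p_correct(t, a, b)
            w *= p if u else 1 - p
        post.append(w)
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total
```

Unlike maximum likelihood, the prior keeps EAP estimates finite for all-correct and all-incorrect patterns and shrinks extreme scores toward the prior mean, which is one mechanism behind the practical differences in score distributions the paper reports.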


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司) · 京ICP备09084417号