Similar literature
Found 20 similar articles; search took 281 ms
1.
Do subscores provide information beyond what the total score provides? Is there a method that can estimate subscores that are more trustworthy than observed subscores? To answer the first question, this study evaluated whether the true subscore was predicted more accurately by the observed subscore or by the total score. To answer the second question, three subscore estimation methods (i.e., subscore estimated from the observed subscore, from the total score, or from a combination of both) were compared. Analyses were conducted using data from six licensure tests. Results indicated that reporting subscores at the examinee level may not be necessary, as they did not provide much information beyond what the total score provides. However, at the institutional level (for institution size ≥ 30), reporting subscores may not be harmful, although they may be redundant because the subscores were predicted equally well by the observed subscores and by the total scores. Finally, results indicated that estimating the subscore using a combination of observed subscore and total score resulted in the highest reliability.

2.
Recently, interest in test subscore reporting for diagnosis purposes has been growing rapidly. The two simulation studies here examined factors (sample size, number of subscales, correlation between subscales, and three factors affecting subscore reliability: number of items per subscale, item parameter distribution, and data generating model) that affected the value of reporting subscores within the classical test theory framework. Results showed that a higher proportion of subscores of added value was related to lower correlation between subscales, more items per subscale, no guessing in responses, smaller variability in difficulty parameters, and matched average item difficulty and average examinee ability.

3.
This study investigates the relationships among factor correlations, inter-item correlations, and the reliability estimates of subscores, providing a guideline with respect to psychometric properties of useful subscores. In addition, it compares subscore estimation methods with respect to reliability and distinctness. The subscore estimation methods explored in the current study include augmentation based on classical test theory and multidimensional item response theory (MIRT). The study shows that there is no estimation method that is optimal according to both criteria. Augmented subscores show the most improvement in reliability compared to observed subscores but are the least distinct.

4.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman suggested a method based on classical test theory to determine whether subscores have added value over total scores. In this article I first provide a rich collection of results regarding when subscores were found to have added value for several operational data sets. I then provide results from a detailed simulation study that examines what properties subscores should possess in order to have added value. The results indicate that subscores have to satisfy strict standards of reliability and correlation to have added value. A weighted average of the subscore and the total score was found to have added value more often.

5.
Reporting confidence intervals with test scores helps test users make important decisions about examinees by providing information about the precision of test scores. Although a variety of estimation procedures based on the binomial error model are available for computing intervals for test scores, these procedures assume that items are randomly drawn from an undifferentiated universe of items, and therefore might not be suitable for tests developed according to a table of specifications. To address this issue, four interval estimation procedures that use category subscores for the computation of confidence intervals are presented in this article. All four estimation procedures assume that subscores, instead of test scores, follow a binomial distribution (i.e., compound binomial error model). The relative performance of the four compound binomial-based interval estimation procedures is compared to each other and to the better known normal approximation and Wilson score procedures based on the binomial error model.
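
The two baseline procedures named at the end of this abstract, the normal approximation and the Wilson score interval, can be sketched as follows. This is a minimal illustration under the simple binomial error model only; it does not implement the four compound binomial procedures, and the function names and the default z = 1.96 are assumptions made for the example.

```python
import math


def normal_interval(correct, n_items, z=1.96):
    """Normal-approximation interval for a proportion-correct score."""
    p = correct / n_items
    half = z * math.sqrt(p * (1 - p) / n_items)
    # Clip to the valid range of a proportion.
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(correct, n_items, z=1.96):
    """Wilson score interval for a proportion-correct score."""
    p = correct / n_items
    denom = 1 + z ** 2 / n_items
    centre = (p + z ** 2 / (2 * n_items)) / denom
    half = z * math.sqrt(p * (1 - p) / n_items + z ** 2 / (4 * n_items ** 2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical examinee: 5 of 10 items correct.
    print(normal_interval(5, 10))
    print(wilson_interval(5, 10))
```

For short tests the Wilson interval is narrower than the normal approximation at mid-range scores and stays inside [0, 1] at extreme scores without clipping.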

6.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman (2008b) suggested reporting an augmented subscore that is a linear combination of a subscore and the total score. Sinharay and Haberman (2008) and Sinharay (2010) showed that augmented subscores often lead to more accurate diagnostic information than subscores. To be reported operationally, augmented subscores should be comparable across the different forms of a test. One way to achieve comparability is to equate them. We suggest several methods for equating augmented subscores. Results from several operational and simulated data sets show that the error in the equating of augmented subscores appears to be small in most practical situations.
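
The abstract describes an augmented subscore only as "a linear combination of a subscore and the total score". The sketch below shows one way such a combination might be written, centred on group means. The centred form, the function name, and both weights are assumptions for illustration; the abstract does not say how the weights are obtained, and this sketch does not estimate them.

```python
def augmented_subscore(subscore, total, mean_subscore, mean_total,
                       w_subscore, w_total):
    """Linear combination of an observed subscore and the total score.

    w_subscore and w_total are placeholder weights (assumptions); in practice
    they would be estimated from data, which this sketch does not attempt.
    """
    return (mean_subscore
            + w_subscore * (subscore - mean_subscore)
            + w_total * (total - mean_total))


if __name__ == "__main__":
    # Hypothetical values: subscore 12 (group mean 10), total 60 (group mean 50).
    print(augmented_subscore(12, 60, 10, 50, w_subscore=0.4, w_total=0.1))
```

With both weights between 0 and 1, an examinee at the group means receives the mean subscore, and deviations on either score pull the estimate proportionally.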

7.
The purpose of this ITEMS module is to provide an introduction to subscores. First, examples of subscores from an operational test are provided. Then, a review of methods that can be used to examine if subscores have adequate psychometric quality is provided. It is demonstrated, using results from operational and simulated data, that subscores have to be based on a sufficient number of items and have to be sufficiently distinct from each other to have adequate psychometric quality. It is also demonstrated that several operationally reported subscores do not have adequate psychometric quality. Recommendations are made for those interested in reporting subscores for educational tests.

8.
It has been suggested that the primary purpose for criterion-referenced testing in objective-based instructional programs is to classify examinees into mastery states or categories on the objectives included in the test. We have proposed that the reliability of the criterion-referenced test scores be defined in terms of the consistency of the decision-making process across repeated administrations of the test. Specifically, reliability is defined as a measure of agreement over and above that which can be expected by chance between the decisions made about examinee mastery states in repeated test administrations for each objective measured by the criterion-referenced test.
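
A standard index of agreement beyond chance is Cohen's kappa. The sketch below computes it for mastery decisions from two administrations, assuming that kappa is the kind of index meant here (the abstract does not name one). The function name and the example data are hypothetical.

```python
def decision_consistency_kappa(decisions_first, decisions_second):
    """Cohen's kappa between mastery decisions from two test administrations."""
    n = len(decisions_first)
    if n == 0 or n != len(decisions_second):
        raise ValueError("need two non-empty decision lists of equal length")

    # Observed proportion of examinees classified the same way both times.
    p_observed = sum(a == b for a, b in zip(decisions_first, decisions_second)) / n

    # Agreement expected by chance from the marginal classification rates.
    categories = set(decisions_first) | set(decisions_second)
    p_chance = sum(
        (decisions_first.count(c) / n) * (decisions_second.count(c) / n)
        for c in categories
    )
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    # Hypothetical decisions for one objective: 1 = master, 0 = non-master.
    first = [1, 1, 0, 0, 1, 0]
    second = [1, 1, 0, 1, 1, 0]
    print(decision_consistency_kappa(first, second))
```

The index would be computed separately for each objective, since the abstract defines reliability per objective.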

9.
《教育实用测度》2013,26(2):163-183
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
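
The abstract does not give the scoring rule for RTE. One plausible reading, sketched here as an assumption, is the proportion of items on which the examinee's response time reached an item-specific threshold, that is, items not answered "too quickly". The function name, the thresholds, and the example times are all hypothetical.

```python
def response_time_effort(response_times, thresholds):
    """Proportion of items answered no faster than the item's time threshold.

    thresholds holds one hypothetical cut-off (in seconds) per item; responses
    faster than the cut-off are treated as rapid guesses (an assumption).
    """
    if not response_times or len(response_times) != len(thresholds):
        raise ValueError("need one threshold per response time")
    effortful = [1 if rt >= th else 0 for rt, th in zip(response_times, thresholds)]
    return sum(effortful) / len(effortful)


if __name__ == "__main__":
    # Hypothetical examinee: seconds spent on four items, and per-item thresholds.
    times = [12.0, 1.5, 30.0, 2.0]
    cutoffs = [3.0, 3.0, 5.0, 3.0]
    print(response_time_effort(times, cutoffs))
```

Under this reading a score near 1 suggests effortful responding throughout, and a score near 0 suggests mostly rapid guessing.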

10.
Test unreliability due to guessing in multiple‐choice and true/false tests is analysed from first principles, and two new measures are described, with the intention that they should be of a sort that is easily communicated without reference to the underlying statistics. One measure is concerned with the resolution of defined levels of knowledge and the other with the probability of examinees being incorrectly ranked. How the measures decrease with both test length and number of response options per question is quantified. It is concluded that the results of many tests currently conducted are likely to be unacceptably unreliable. Procedures for increasing test reliability are discussed in a logical sequence intended to aid their understanding.

11.
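Test reliability under item response theory (IRT) evaluates the dependability and stability of latent trait estimates. Because it is a global, test-level summary, IRT reliability cannot be replaced by the test information function and remains an important index for IRT-based tests. Drawing on the Chinese and international literature, this paper first presents scholars' views on the role of IRT reliability, then introduces and evaluates several methods for estimating it, briefly describes the factors that affect it, and finally outlines directions for future research on IRT reliability.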

12.
《教育实用测度》2013,26(1):81-91
In this study, we investigated the hypothesis that the previously found positive effects of self-adapted testing are attributable to examinees having an increased perception of control over a stressful testing situation. Examinees were randomly assigned to either (a) take a computerized-adaptive test (CAT), (b) take a self-adapted test (SAT), or (c) choose between taking a CAT or SAT. Results showed that the strongest preference for SAT was shown by examinees reporting high levels of math anxiety. Moreover, highly math-anxious examinees who were allowed to choose between the test types exhibited higher mean proficiency estimates than examinees who were assigned to test type.

13.
In this study we describe an analytic method for aiding in the generation of subscales that characterize the deep structure of tests. We also derive a procedure for estimating scores for these scales that are much more statistically stable than subscores computed solely from the items that are contained on that scale. These scores achieve their stability through augmentation with other related information on the test. These methods were used to complement each other on a data set obtained from a Praxis administration. We found that the deep structure of the test yielded ten subscales and that, because the test was essentially unidimensional, ten subscores could be computed, all with very high reliability. This result was contrasted with the calculation of six traditional subscales based on surface features of the items. These subscales also yielded augmented subscores of high reliability.

14.
15.
As the primary interface between test developers and multiple educational stakeholders, score reports are a critical component to the success (or failure) of any assessment program. The purpose of this review is to document recent research on individual‐level score reporting to advance the research and practice of score reporting. We conducted a search for research studies published or presented between 2005 and 2015, examining 60 scholarly works for (1) the research focus, (2) stated or implied theoretical frameworks of communication, and (3) the characteristics of data sets employed in the studies. Results show that research on score properties, especially subscores, and score report design/layout are well‐represented in the literature base. The predominant approach to score reporting has been through a cybernetics tradition of communication. Data sets were often small or localized to a single context. We present example research questions from novel communication frameworks, and encourage our colleagues to adopt new roles in their relationships to stakeholders to advance score reporting research and practice.

16.
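Reliability Analysis of Academic Achievement Examinations   Total citations: 1, self-citations: 0, citations by others: 1
Reliability is indispensable to any valid examination: only a highly reliable examination allows teachers to evaluate students objectively and dependably, and only then do scores accurately reflect examinees' level. Educational measurement and educational statistics have laid the theoretical foundation for making examinations scientific and modern, and have made examination analysis quantitative. The SPSS statistical package for the social sciences, in turn, makes it possible for teachers to carry out quantitative reliability analyses of academic achievement examinations on a computer.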

17.
Criterion‐related profile analysis (CPA) can be used to assess whether subscores of a test or test battery account for more criterion variance than does a single total score. Application of CPA to subscore evaluation is described, compared to alternative procedures, and illustrated using SAT data. Considerations other than validity and reliability are discussed, including broad societal goals (e.g., affirmative action), fairness, and ties in expected criterion predictions. In simulation data, CPA results were sensitive to subscore correlations, sample size, and the proportion of criterion‐related variance accounted for by the subscores. CPA can be a useful component in a thorough subscore evaluation encompassing subscore reliability, validity, distinctiveness, fairness, and broader societal goals.

18.
When tests are designed to measure dimensionally complex material, DIF analysis with matching based on the total test score may be inappropriate. Previous research has demonstrated that matching can be improved by using multiple internal or both internal and external measures to more completely account for the latent ability space. The present article extends this line of research by examining the potential to improve matching by conditioning simultaneously on test score and a categorical variable representing the educational background of the examinees. The responses of male and female examinees from a test of medical competence were analyzed using a logistic regression procedure. Results show a substantial reduction in the number of items identified as displaying significant DIF when conditioning is based on total test score and a variable representing educational background as opposed to total test score only.

19.
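Reallocation of applicants between programs (调剂) is an important means of stabilizing the applicant pool and safeguarding quality in master's admissions. Duplicate admissions, however, create unnecessary trouble for admitting institutions and admissions authorities: because other qualified applicants cannot be reallocated in time, institutions' enrollment quotas are wasted and some applicants lose the chance to be admitted. This paper analyzes the causes of duplicate admission and proposes improvements to the current graduate reallocation and admission process. The proposals have practical value for carrying out the reallocation of second-choice applicants in an orderly, secure, stable, and simple way and for reducing duplicate admissions.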

20.
Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only multiple-choice score KR20's (more misinformation was associated with higher KR20's). Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice was judged sufficiently small that other considerations might justifiably dictate format choice.
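
The KR20 coefficient used in this study can be computed from a matrix of dichotomous (0/1) item scores. The sketch below uses the standard Kuder-Richardson formula 20 with the population variance of total scores; the function name and the tiny data set are made up for illustration and are unrelated to the simulation described above.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a list of examinee rows of 0/1 item scores."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_examinees < 2 or n_items < 2:
        raise ValueError("need at least two examinees and two items")

    # Population variance of the total scores.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_examinees
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    if var_total == 0:
        raise ValueError("KR20 is undefined when all total scores are equal")

    # Sum over items of p * (1 - p), where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


if __name__ == "__main__":
    # Hypothetical data: four examinees, three items.
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(kr20(data))
```

Some presentations use the sample variance (dividing by n - 1) for the total scores, which gives a slightly different value in small samples.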
