Similar literature
 20 similar documents found (search time: 535 ms)
1.
Computerized adaptive testing (CAT) and multistage testing (MST) have become two of the most popular modes in large‐scale computer‐based sequential testing. Though most designs of CAT and MST have exhibited both strengths and weaknesses in recent large‐scale implementations, there is no simple answer to the question of which design is better, because different modes may fit different practical situations. This article proposes a hybrid adaptive framework that combines CAT and MST, inspired by an analysis of the history of both. The proposed procedure is a design that transitions from a group sequential design to a fully sequential design. This allows for the robustness of MST in early stages, but also shares the advantages of CAT in later stages, with fine tuning of the ability estimator once its neighborhood has been identified. Simulation results showed that hybrid designs following our proposed principles provided comparable or even better estimation accuracy and efficiency than standard CAT and MST designs, especially for examinees at the two ends of the ability range.

2.
The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees’ test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model‐data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided the best model‐data fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.

3.
This article demonstrates the use of a new class of model‐free cumulative sum (CUSUM) statistics to detect person fit given the responses to a linear test. The fundamental statistic being accumulated is the likelihood ratio of two probabilities. The detection performance of this CUSUM scheme is compared to other model‐free person‐fit statistics found in the literature as well as an adaptation of another CUSUM approach. The study used both simulated responses and real response data from a large‐scale standardized admission test.
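The idea of accumulating a likelihood ratio over item responses can be illustrated with a generic one-sided CUSUM. This is a minimal sketch under stated assumptions, not the statistic defined in the article: the function name, the per-item probabilities `p_normal` and `p_aberrant`, and the `threshold` are all hypothetical, and the abstract does not say which two probabilities form the ratio.

```python
import math

def cusum_person_fit(responses, p_normal, p_aberrant, threshold):
    """One-sided CUSUM over log-likelihood ratios of 0/1 item responses.

    responses  : item scores (0 or 1) in administration order
    p_normal   : assumed P(correct) per item under normal behavior
    p_aberrant : assumed P(correct) per item under aberrant behavior
    threshold  : hypothetical decision limit (would need calibration)
    Returns (path, flagged_at); flagged_at is the first item index at which
    the statistic exceeds the threshold, or None if it never does.
    """
    s = 0.0
    path = []
    flagged_at = None
    for t, (x, p0, p1) in enumerate(zip(responses, p_normal, p_aberrant)):
        # Likelihood ratio of the observed response: aberrant vs. normal.
        lr = (p1 / p0) if x == 1 else ((1.0 - p1) / (1.0 - p0))
        # Accumulate the log ratio, resetting to zero when it goes negative.
        s = max(0.0, s + math.log(lr))
        path.append(s)
        if flagged_at is None and s > threshold:
            flagged_at = t
    return path, flagged_at

# Illustrative data only: an examinee who starts well and then fails
# four items that most examinees answer correctly.
path, flagged_at = cusum_person_fit(
    [1, 1, 0, 0, 0, 0], [0.8] * 6, [0.4] * 6, threshold=2.0
)
```

The reset to zero is what makes the scheme sensitive to a run of unexpected responses starting anywhere in the test, rather than only to the overall score.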

4.
Part of the controversy about allowing examinees to review and change answers to previous items on computerized adaptive tests (CATs) centers on a strategy for obtaining positively biased ability estimates attributed to Wainer (1993) in which examinees intentionally answer items incorrectly before review and to the best of their abilities upon review. Our results, based on both simulated and live testing data, showed that there were instances in which the Wainer strategy yielded inflated ability estimates as well as instances in which it yielded deflated ability estimates. The success of the strategy in inflating ability estimates depended on the ability estimation method used (maximum likelihood versus Bayesian), the examinee's true ability level, the standard error of the ability estimate, the examinee's ability to implement the strategy, and the type of decision made from the ability estimate. We discuss approaches to dealing with the Wainer strategy in operational CAT settings.

5.
The aim of this study is to assess the efficiency of using the multiple‐group categorical confirmatory factor analysis (MCCFA) and the robust chi‐square difference test in differential item functioning (DIF) detection for polytomous items under the minimum free baseline strategy. While testing for DIF items, despite the strong assumption that all but the examined item are set to be DIF‐free, MCCFA with such a constrained baseline approach is commonly used in the literature. The present study relaxes this strong assumption and adopts the minimum free baseline approach where, aside from those parameters constrained for identification purposes, parameters of all but the examined item are allowed to differ among groups. Based on the simulation results, the robust chi‐square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF for polytomous items in terms of the empirical power and Type I error rates. To sum up, MCCFA under the minimum free baseline strategy is useful for DIF detection for polytomous items.

6.
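Using expert interviews, literature review, and mathematical statistics, this study examined the practical skills examination of the college entrance examination for physical education majors in Guizhou Province. The results show that the four fixed physical fitness tests in this examination cannot comprehensively assess candidates. Recommendations: add an agility component to the physical fitness tests, expanding them from 4 categories with 4 events to 5 categories with 5 events; add a sport-specific skills test; weight physical fitness and sport-specific skills at 75:25; set the total test score uniformly at 100 points; and, among candidates whose physical education and academic examination scores both reach the cutoff, admit in descending order of physical education score.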

7.
The purpose of this study was to examine the effect of pretest items on response time in an operational, fixed-length, time-limited computerized adaptive test (CAT). These pretest items are embedded within the CAT, but unlike the operational items, are not tailored to the examinee's ability level. If examinees with higher ability levels need less time to complete these items than do their counterparts with lower ability levels, they will have more time to devote to the operational test questions. Data were from a graduate admissions test that was administered worldwide. Data from both quantitative and verbal sections of the test were considered. For the verbal section, examinees in the lower ability groups spent systematically more time on their pretest items than did those in the higher ability groups, though for the quantitative section the differences were less clear.

8.
Recent guidelines for fair educational testing advise checking the validity of individual test scores through the use of person‐fit statistics. For practitioners, it is unclear from the existing literature which statistic to use. An overview of relatively simple existing nonparametric approaches to identify atypical response patterns is provided. A simulation study was conducted to compare the different approaches, and on the basis of the literature review and the simulation study, guidelines for the use of person‐fit approaches are given.

9.
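This paper uses mathematical statistics and educational measurement methods to study the basketball test of the 2005 Guangxi physical education college entrance examination. The results show that the events in the 2005 basketball specialty test were set reasonably, and that dropping the full-court game test was well founded and fairly reasonable. The scoring standards for one-minute shooting and for shuttle dribbling and shooting were unreasonable: they lowered the difficulty and discrimination of the test, candidates could obtain high scores fairly easily, and the standards could not objectively and effectively evaluate and distinguish candidates' performance. The scoring standards for these two tests should be revised.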

10.
《教育实用测度》 (Applied Measurement in Education), 2013, 26(1): 81-91
In this study, we investigated the hypothesis that the previously found positive effects of self-adapted testing are attributable to examinees having an increased perception of control over a stressful testing situation. Examinees were randomly assigned to either (a) take a computerized-adaptive test (CAT), (b) take a self-adapted test (SAT), or (c) choose between taking a CAT or SAT. Results showed that the strongest preference for SAT was shown by examinees reporting high levels of math anxiety. Moreover, highly math-anxious examinees who were allowed to choose between the test types exhibited higher mean proficiency estimates than examinees who were assigned to test type.

11.
This article addresses the issue of how to detect item preknowledge using item response time data in two computer‐based large‐scale licensure examinations. Item preknowledge is indicated by an unexpectedly short response time and a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article.

12.
The statistical analysis of answer changes (ACs) has uncovered multiple testing irregularities on large‐scale assessments and is now routinely performed at testing organizations. However, AC data are subject to uncertainty caused by technological or human factors. Therefore, existing statistics (e.g., number of wrong‐to‐right ACs) used to detect examinees with aberrant ACs capitalize on the uncertainty, which may result in a large Type I error. In this article, the information about ACs is used only for the partitioning of administered items into two disjoint subtests: items where ACs did not occur, and items where ACs did occur. A new statistic is based on the difference in performance between these subtests (measured as Kullback–Leibler divergence between corresponding posteriors of latent traits), where, in order to avoid the uncertainty, only final responses are used. One of the subtests can be filtered such that the asymptotic distribution of the statistic is chi‐square with one degree of freedom. In computer simulations, the presented statistic demonstrated a strong robustness to the uncertainty and higher detection rates in contrast to two popular statistics based on wrong‐to‐right ACs.
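The comparison of two subtests through the Kullback–Leibler divergence between ability posteriors can be sketched as follows. Everything specific here is an assumption made for illustration: the Rasch response model, the standard normal prior, the grid approximation, and the names `rasch_posterior` and `kl_divergence`. The filtering step and the chi‐square reference distribution described in the abstract are not reproduced.

```python
import math

def rasch_posterior(responses, difficulties, grid):
    """Posterior of ability on a grid, assuming a Rasch model and N(0, 1) prior."""
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # log of the unnormalized normal prior
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            log_w += math.log(p if x == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q) over a shared grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative data only: final responses on items with and without ACs.
grid = [-4.0 + 0.1 * k for k in range(81)]
post_ac = rasch_posterior([1, 1, 1], [0.0, 0.5, 1.0], grid)
post_no_ac = rasch_posterior([0, 0, 1, 0], [-1.0, 0.0, 0.5, 1.0], grid)
divergence = kl_divergence(post_ac, post_no_ac)
```

A larger divergence means the two subtests point to different ability levels for the same examinee; only final responses enter the computation, which matches the abstract's stated way of sidestepping the uncertainty in the recorded changes.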

13.
In this article, we propose using Bayes factors (BF) to evaluate person fit in item response theory models under the framework of Bayesian evaluation of an informative diagnostic hypothesis. We first discuss the theoretical foundation for this application and how to analyze person fit using BF. To demonstrate the feasibility of this approach, we further use it to evaluate person fit in simulated and empirical data, and compare the results with those of HT and the infit and outfit statistics. We found that overall BF performed as well as HT statistics and better than the infit and outfit statistics when detecting aberrant responses. Given the flexibility of BF in handling data sets with a small number of examinees, we suggest that BF can be used as a person‐fit statistic, especially in computerized adaptive tests.

14.
In this article, we introduce a person‐fit statistic called the hierarchy consistency index (HCI) to help detect misfitting item response vectors for tests developed and analyzed based on a cognitive model. The HCI ranges from −1.0 to 1.0, with values close to −1.0 indicating that students respond unexpectedly or differently from the responses expected under a given cognitive model. A simulation study was conducted to evaluate the power of the HCI in detecting different types of misfitting item response vectors. Simulation results revealed that the detection rate of the HCI was a function of type of misfit, item discriminating power, and test length. The best detection rates were achieved when the HCI was applied to tests that consisted of a large number of highly discriminating items. In addition, whether a misfitting item response vector can be correctly identified depends, to a large degree, on the number of misfits of the item response vector relative to the cognitive model. When misfitting response behavior only affects a small number of item responses, the resulting item response vector will not be substantially different from the expectations under the cognitive model and consequently may not be statistically identified as misfitting. As an item response vector deviates further from the model expectations, misfits are more easily identified and consequently higher detection rates of the HCI are expected.
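A small sketch of how such an index could be computed is given below. The abstract does not state the formula; the code assumes the commonly cited form HCI = 1 − 2 × (misfits) / (comparisons), where each correctly answered item j is compared against every other item whose required attributes are a subset of those of item j, and a misfit is an incorrect answer to such an item. The function name `hci`, the Q-matrix layout, and the exclusion of item j from its own comparison set are assumptions.

```python
def hci(responses, q_matrix):
    """Hierarchy consistency index for one examinee (assumed formula).

    responses : 0/1 item scores
    q_matrix  : one row per item, 1 where the item requires the attribute
    Returns a value in [-1, 1], or None when no comparison is possible.
    """
    n_items = len(responses)
    attrs = [frozenset(k for k, v in enumerate(row) if v) for row in q_matrix]
    comparisons = 0
    misfits = 0
    for j in range(n_items):
        if responses[j] != 1:
            continue  # only correctly answered items generate comparisons
        for g in range(n_items):
            # Item g requires a subset of the attributes required by item j.
            if g != j and attrs[g] <= attrs[j]:
                comparisons += 1
                misfits += 1 - responses[g]
    if comparisons == 0:
        return None
    return 1.0 - 2.0 * misfits / comparisons

# Hypothetical linear hierarchy: item 1 needs A1; item 2 needs A1, A2;
# item 3 needs A1, A2, A3.
q = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
consistent = hci([1, 1, 1], q)    # every prerequisite item answered correctly
inconsistent = hci([0, 0, 1], q)  # hardest item right, prerequisites wrong
```

Under these assumptions the first vector yields 1.0 and the second −1.0, the two ends of the range described in the abstract.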

15.
The goal of this study was to investigate the usefulness of person‐fit analysis in validating student score inferences in a cognitive diagnostic assessment. In this study, a two‐stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, the person‐fit statistic, the hierarchy consistency index (HCI; Cui, 2007; Cui & Leighton, 2009), was used to identify the misfitting student item‐score vectors. In the second stage, students’ verbal reports were collected to provide additional information about students’ response processes so as to reveal the actual causes of misfits. This two‐stage procedure helped to identify the misfits of item‐score vectors to the cognitive model used in the design and analysis of the diagnostic test, and to discover the reasons for misfits so that students’ problem‐solving strategies were better understood and their performances were interpreted in a more meaningful way.

16.
Educational Testing Service
A multiple-choice test item is identified as flawed if it has no single best answer. In spite of extensive quality control procedures, the administration of flawed items to test takers is inevitable. A limited set of common strategies for dealing with flawed items in conventional testing, grounded in the principle of fairness to examinees, is reexamined in the context of adaptive testing. An additional strategy, available for adaptive testing, of retesting from a pool cleansed of flawed items, is compared to the existing strategies. Retesting was found to be no practical improvement over current strategies.

17.
In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A hierarchical generalized linear model is formulated for estimating item‐position effects. The model is demonstrated using data from a pilot administration of the GRE wherein the same items appeared in different positions across the test form. Methods for detecting and assessing position effects are discussed, as are applications of the model in the contexts of test development and item analysis.

18.
The personal biserial index is a correlation which measures the relationship between the difficulty of the items in a test for the person, as evidenced by his passes and failures, and the difficulty of the items, as evidenced by group-determined item difficulties. Properties of the personal biserial index were studied empirically, including an examination of the reliability of the index and the effect of using the index as a predictor of college success. The findings indicate that the reliability of the index is quite low and that a knowledge of the index does not significantly increase the predictability of college success from SAT scores and high school averages. Evidence was provided which supports the hypothesis that the personal biserial index is sensitive to variations in the extent to which examinees guess.
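For illustration, an index of this kind can be computed with the standard biserial correlation formula, treating one examinee's passes and failures as the dichotomous variable and the group-determined item difficulties as the continuous one. This is a sketch under assumptions: the abstract does not give the exact computation, the function name `personal_biserial` is hypothetical, and difficulties are taken here to be group proportions correct (higher means easier).

```python
from statistics import NormalDist, fmean, pstdev

def personal_biserial(passes, item_difficulties):
    """Biserial correlation between one person's 0/1 results and item difficulties.

    passes            : 0/1 results of one examinee, one per item
    item_difficulties : group proportion correct per item (assumed scale)
    Returns None when the person passed or failed every item, or when
    all items share the same difficulty.
    """
    passed = [d for x, d in zip(passes, item_difficulties) if x == 1]
    failed = [d for x, d in zip(passes, item_difficulties) if x == 0]
    if not passed or not failed:
        return None
    p = len(passed) / len(passes)
    q = 1.0 - p
    sd = pstdev(item_difficulties)
    if sd == 0:
        return None
    # Ordinate of the standard normal density at the point that splits
    # the distribution into proportions p and q.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (fmean(passed) - fmean(failed)) / sd * (p * q / y)

# Illustrative data only: the examinee passes the three easiest items.
index = personal_biserial([1, 1, 1, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.3])
```

A positive value indicates that the examinee tends to pass the items the group finds easy and fail the ones it finds hard. With very few items the coefficient can fall outside the interval from −1 to 1, which is a known property of the biserial formula.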

19.
This case-study investigates the predictive validity and reliability of Key Stage 2 test results, and teacher assessments, for target-setting and value-added assumptions at Key Stage 3. (In England Key Stage 2 tests are taken in the core subjects of English, Mathematics and Science at the age of 11. Key Stage 3 tests are taken in the same subjects at the age of 14. Teacher assessments are also completed for these subjects at both key stages.) The study employed the type of linear regression analysis recommended in several government reports, to correlate Key Stage 2 test results, and teacher assessments, in core subjects, with Key Stage 3 test results, and teacher assessments, in both core and non-core subjects. Following government recommendations that the use of any other form of testing - such as the National Foundation for Educational Research (NFER) Cognitive Abilities Test (CAT) - was now no longer necessary to provide baseline data for value-added calculations, or to set targets, correlations were also investigated between results on the CAT, and test results and teacher assessments at Key Stage 3, for both core and non-core subjects, to see whether this recommendation was well founded. The results of the case-study suggest that Key Stage 2 data, both in the form of test results and teacher assessments, have little or no predictive validity, or reliability, for test results or teacher assessments at Key Stage 3. Indeed, the predictive validity for non-core subjects at Key Stage 3 was so low as to be negligible. However, the CAT average score correlated more highly with both teacher assessments and test results at Key Stage 3 in core subjects, although this relationship was not reflected in non-core subjects. These findings suggest that the predictive validity and reliability of Key Stage 2 data is seriously open to question as baseline data for either value-added, or target-setting procedures, at Key Stage 3. 
It should be pointed out, however, that these findings are provisional, since they are based on data from two intake years, but preliminary analysis of data from a further three intake years appears to indicate that the concerns identified are well founded.

20.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed that (1) the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.
