Similar literature
 Found 20 similar articles (search time: 366 ms)
1.
Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

2.
Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.

3.
The use of accommodations has been widely proposed as a means of including English language learners (ELLs) or limited English proficient (LEP) students in state and districtwide assessments. However, very little experimental research has been done on specific accommodations to determine whether these pose a threat to score comparability. This study examined the effects of linguistic simplification of 4th- and 6th-grade science test items on a state assessment. At each grade level, 4 experimental 10-item testlets were included on operational forms of a statewide science assessment. Two testlets contained regular field-test items, but in a linguistically simplified condition. The testlets were randomly assigned to LEP and non-LEP students through the spiraling of test booklets. For non-LEP students, in 4 t-test analyses of the differences in means for each corresponding testlet, 3 of the mean score comparisons were not significantly different, and the 4th showed the regular version to be slightly easier than the simplified version. Analysis of variance (ANOVA), followed by pairwise comparisons of the testlets, showed no significant differences in the scores of non-LEP students across the 2 item types. Among the 40 items administered in both regular and simplified format, item difficulty did not vary consistently in favor of either format. Qualitative analyses of items that displayed significant differences in p values were not informative, because the differences were typically very small. For LEP students, there was 1 significant difference in student means, and it favored the regular version. However, because the study was conducted in a state with a small number of LEP students, the analyses of LEP student responses lacked statistical power. The results of this study show that linguistic simplification is not helpful to monolingual English-speaking students who receive the accommodation. 
Therefore, the results provide evidence that linguistic simplification is not a threat to the comparability of scores of LEP and monolingual English-speaking students when offered as an accommodation to LEP students. The study findings may also have implications for the use of linguistic simplification accommodations in science assessments in other states and in content areas other than science.

4.
This study investigates the effect of several design and administration choices on item exposure and person/item parameter recovery under a multistage test (MST) design. In a simulation study, we examine whether number‐correct (NC) or item response theory (IRT) methods are differentially effective at routing students to the correct next stage(s) and whether routing choices (optimal versus suboptimal routing) have an impact on achievement precision. Additionally, we examine the impact of testlet length on both person and item recovery. Overall, our results suggest that no single approach works best across the studied conditions. With respect to the mean person parameter recovery, IRT scoring (via either Fisher information or preliminary EAP estimates) outperformed classical NC methods, although differences in bias and root mean squared error were generally small. Item exposure rates were found to be more evenly distributed when suboptimal routing methods were used, and item recovery (both difficulty and discrimination) was most precisely observed for items with moderate difficulties. Based on the results of the simulation study, we draw conclusions and discuss implications for practice in the context of international large‐scale assessments that recently introduced adaptive assessment in the form of MST. Future research directions are also discussed.

5.
The arrangement of response options in multiple-choice (MC) items, especially the location of the most attractive distractor, is considered critical in constructing high-quality MC items. In the current study, a sample of 496 undergraduate students taking an educational assessment course was given three test forms consisting of the same items but the positions of the most attractive distractor varied across the forms. Using a multiple-indicators–multiple-causes (MIMIC) approach, the effects of the most attractive distractor's positions on item difficulty were investigated. The results indicated that the relative placement of the most attractive distractor and the distance between the most attractive distractor and the keyed option affected students’ response behaviors. Moreover, low-achieving students were more susceptible to response-position changes than high-achieving students.

6.
Although multiple choice examinations are often used to test anatomical knowledge, these often forgo the use of images in favor of text‐based questions and answers. Because anatomy is reliant on visual resources, examinations using images should be used when appropriate. This study was a retrospective analysis of examination items that were text based compared to the same questions when a reference image was included with the question stem. Item difficulty and discrimination were analyzed for 15 multiple choice items given across two different examinations in two sections of an undergraduate anatomy course. Results showed that there were some differences in item difficulty, but these did not consistently favor either text items or items with reference images. Differences in difficulty were mainly attributable to one group of students performing better overall on the examinations. There were no significant differences in item discrimination for any of the analyzed items. This implies that reference images do not significantly alter the item statistics; however, it does not indicate whether these images were helpful to the students when answering the questions. Care should be taken by question writers to analyze item statistics when making changes to multiple choice questions, including ones that are included for the perceived benefit of the students. Anat Sci Educ 10: 68–78. © 2016 American Association of Anatomists.

7.
This study focuses on teachers’ predictions of students’ performance, in particular that of medium-low achievers, on tasks testing inquiry competencies. The tasks come from PISA science. More specifically, we study science teachers’ predictions of several aspects: the levels of difficulty of the tasks, the potential sources of difficulty, and the potential difficulty medium-low achievers have in solving them. We also study which assessed competencies science teachers identify in the tasks. Our approach is a questionnaire-based study. A sample of 125 French science and technology teachers responded to the questionnaire. The teachers show a rather good ability to predict inquiry task levels of difficulty for medium-low achievers and are able to identify relevant potential sources of difficulty or easiness in the items. However, they are not aware of some essential difficulties that medium-low students encounter while solving science inquiry tasks. Moreover, the teachers have difficulty identifying the competencies that are tested by an item.

8.
This study demonstrated the equivalence between the Rasch testlet model and the three‐level one‐parameter testlet model and explored the Markov Chain Monte Carlo (MCMC) method for model parameter estimation in WINBUGS. The estimation accuracy from the MCMC method was compared with those from the marginalized maximum likelihood estimation (MMLE) with the expectation‐maximization algorithm in ConQuest and the sixth‐order Laplace approximation estimation in HLM6. The results indicated that the estimation methods had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates. Analyses of three real tests generally supported the findings from the simulation and indicated that the estimates for item difficulty and ability parameters were highly correlated across estimation methods.

9.
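Ye Meng (叶萌). Examinations Research (《考试研究》), 2010, (2): 96-107
This article reviews the main research findings on the local independence problem in item response theory (IRT). On this basis, it explains the definition of the local independence assumption. It also discusses the relationship between local independence and test dimensionality; the detection and computation of local dependence, its causes, and procedures for controlling it; and the impact of local dependence on measurement practice. Finally, it explores strategies for addressing local item dependence within testlets.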

10.
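This study applies item response theory to analyze, in terms of examinees' reading ability estimates and item difficulty values, how the writers of multiple-choice items affect test validity in reading comprehension tests. In the experimental design, two groups of examinees took the same "reading proficiency test," and the ability estimates derived from the results did not differ significantly between the two groups. The two groups were then given items written by two different item writers. Although the items were all based on the same reading passages and their difficulty values did not differ significantly, the examinees' performance differed significantly. Under the Rasch model, examinee performance is jointly determined by examinee ability and item difficulty. It can therefore be inferred that the items written by different item writers affected examinee performance and, in turn, the validity of reading comprehension tests that use multiple-choice items.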
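
The Rasch relationship mentioned in entry 10 can be sketched as follows. This is only an illustrative sketch, not code from the study: the function name `rasch_probability` and the sample ability and difficulty values are hypothetical. Under the Rasch model, the probability of a correct response is a logistic function of the difference between examinee ability (theta) and item difficulty (b).

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta is the examinee ability and b is the item difficulty,
    both expressed on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical values: the same examinee facing an easier and a harder item.
p_easy = rasch_probability(theta=0.5, b=-1.0)
p_hard = rasch_probability(theta=0.5, b=1.5)
```

When ability equals difficulty, the probability is exactly 0.5, so two groups with equal ability estimates facing items of equal difficulty would be expected to perform alike. That expectation is what makes the performance gap reported in the entry noteworthy.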

11.
In teaching, representations are used as ways to illustrate the concepts underlying a specific topic. For example, symbols (e.g., 1 + 2 = 3) are used to express the concept of addition. To compare students’ abilities to interpret different representations in mathematics, the symbolic representation (SR) test and the pictorial representation (PR) test were designed, and then administered to 681 sixth graders in Taipei, Taiwan. This study adopts two different modeling perspectives, the testlet perspective and the multi-ability perspective, to analyze this SR and PR test data in the context of item response theory. The main results show that:
  1. Students scored on average significantly higher on the SR test than the PR test.
  2. The effects of the item stem testlets could be large, but they are statistically non-significant; however, the influence of the number of items in the testlet should also be considered.
  3. The nature of the option representations, SR and PR, represents two different mathematics abilities.
  4. The main factor that influences students’ item responses is students’ abilities to interpret SR and PR, and the testlet effects generated from the shared item stem can be ignored.
  5. Regarding the parameter estimates of the best-fitting model: (a) the person ability variance estimates show that the ability distributions on the SR and PR dimension may not be the same, (b) the correlation estimate between the SR and PR dimension indicates that these two abilities are moderately correlated, and (c) the item difficulty estimates for different models are similar.
Suggestions for teaching practice and future studies are provided in the Conclusion.

12.
Testlets, or groups of related items, are commonly included in educational assessments due to their many logistical and conceptual advantages. Despite their advantages, testlets introduce complications into the theory and practice of educational measurement. Responses to items within a testlet tend to be correlated even after controlling for latent ability, which violates the assumption of conditional independence made by traditional item response theory models. The present study used Monte Carlo simulation methods to evaluate the effects of testlet dependency on item and person parameter recovery and classification accuracy. Three calibration models were examined, including the traditional 2PL model with marginal maximum likelihood estimation, a testlet model with Bayesian estimation, and a bi-factor model with limited-information weighted least squares mean and variance adjusted estimation. Across testlet conditions, parameter types, and outcome criteria, the Bayesian testlet model outperformed, or performed equivalently to, the other approaches.

13.
This study reports the results of a componential analysis of items comprising Sections A and C of Form Z of the reading comprehension portions of the California Achievement Tests (CAT) (Tiegs & Clark, 1963). A set of problem components or attributes characterizing the test items in terms of manifest content, psychologically salient features, and processing demands was developed, including methods for their quantification. The contributions of these components to task difficulty were then evaluated using linear regression methodology. Item difficulty indices were transformations of the familiar proportion-correct item score, obtained from data gathered during the spring of 1989 from 158 deaf examinees. Variation in the item difficulty values was substantially accounted for in terms of a small number of predictor variables (R² ≥ .90). Implications of the results for construct validity and interpretation of test scores are discussed.

14.
Applied Measurement in Education (《教育实用测度》), 2013, 26(2): 175-199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of DIF with respect to gender, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions, etc.) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, which were distributed within four test forms, were evaluated with respect to gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items that were detected as functioning differently for male and female students, several plausible factors for DIF were discussed. The results from the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPPl, which favored female students rather than their matched male students, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of Item PPPl because they were both in the same test form.

15.
This article describes a comparative study conducted at the item level for paper and online administrations of a statewide high stakes assessment. The goal was to identify characteristics of items that may have contributed to mode effects. Item-level analyses compared two modes of the Texas Assessment of Knowledge and Skills (TAKS) for up to four subjects at two grade levels. The analyses included significance tests of p-value differences, DIF, and response distributions for each item. Additional analyses investigated item position effects and objective-level mode differences. No evidence of item position effects emerged, but significant differences were found for several items and objectives in all subjects at grade 8 and in mathematics and English language arts (ELA) at grade 11. Differences generally favored the paper group. ELA items that were longer in passage length and math items that required graphing and geometric manipulations or involved scrolling in the online administration tended to be the items showing mode differences.

16.
Selected Marker Tests, of Educational Testing Service and Sheridan Psychological Services, Inc., were examined in terms of problems in scoring and internal consistency. The tests were administered orally to 116 sixth and seventh grade students. Problems in scoring were discovered and changes were suggested. The agreement between two independent judges on part scores was high. Twenty-one of 28 correlations were .90 and above. Item correlations with part and total scores, using Cureton’s correction, were frequently very low, and many items did not meet the desirable difficulty level for norm-referenced tests. The study suggests that, with the sample used, the problem is not one of agreement among judges but, rather, one of item reliability.

17.
A subset of the items of both forms of the Peabody Picture Vocabulary Test (PPVT) was administered to a sample of 452 fourth-, fifth- and sixth-grade students. This sample of students was randomly divided into two equal subgroups. Item difficulty indices were calculated for each of the two subsamples for each of the two forms of the test. Data obtained from the first subsample were used to evaluate the published ordering of items of Forms A and B of the PPVT and to reorder the items according to the empirically derived item difficulties. The second subsample was used as a cross-validation sample to evaluate the empirically derived reordering of items. The results of the cross-validation of the reordering indicate a substantial and significant increase in the validity of the item orderings for this subset of items on both forms of the PPVT. Therefore, this new ordering may yield a more accurate estimate of the intelligence of average and above students in the fourth-, fifth-, and sixth-grades than the present, published ordering of items.
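
The reordering step described in entry 17 can be sketched as follows, assuming the classical proportion-correct index as the item difficulty measure. The entry does not specify the exact index used, and the function names and toy response matrix are hypothetical. The sketch computes a difficulty index per item from one subsample and orders the items from easiest to hardest.

```python
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Proportion of examinees answering each item correctly.

    responses has one row per examinee and one column per item,
    with 1 for a correct response and 0 for an incorrect one.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees
            for j in range(n_items)]


def easiest_first(p_values: List[float]) -> List[int]:
    """Item indices ordered from easiest (highest p) to hardest (lowest p)."""
    return sorted(range(len(p_values)), key=lambda j: -p_values[j])


# Hypothetical toy data: 4 examinees x 3 items from a derivation subsample.
derivation = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 0, 1]]
p = item_difficulty(derivation)   # [0.75, 0.25, 0.75]
order = easiest_first(p)          # [0, 2, 1]
```

In a cross-validation design like the one in the entry, the ordering derived from the first subsample would then be compared against the difficulty indices computed independently from the second subsample.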

18.
This study presents a quantitative approach based on Differential Item Functioning analysis within the Item Response Theory framework, to quantify gender differences in tackling specific mathematical items. We use this approach to explore two crucial topics in mathematics education (misconceptions in decimal numbers, and estimation) by analysing answers given by Italian students to specific mathematical items taken from a sample of 1400 items administered in INVALSI tests since 2008. For each item, we have a sample of 30,000 students per year, statistically representative of the entire Italian student population. The results section presents a didactic interpretation of the statistical evidence and shows how interdisciplinarity between statistics and mathematics education, with a mixed-method approach (Johnson & Onwuegbuzie, 2004), represents a good approach to exploring the gender gap in relation to specific constructs of math education. This approach, common in many disciplines, can also make an interesting contribution to mathematics education research.

19.
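Difficulty is not an inherent property of a test item; it results from the interaction between examinee factors and item characteristics. Many item analysts tend to attribute high item difficulty solely to students' failure to master the relevant knowledge or skills, overlooking the characteristics of the items themselves. This study analyzes 60 items from the college entrance examination (gaokao) English test with difficulty values below 0.6 to explore the sources of their difficulty. The results show that, in addition to examinee factors, the difficulty of hard or moderately hard items is also related to item-writing technique, including problems with the uniqueness and acceptability of the answer, content that goes beyond the syllabus, and inappropriate choice of testing points and scoring criteria. The article therefore proposes that examination agencies improve their item writing and strengthen item quality monitoring to ensure that large-scale examinations select talent on a sound scientific basis.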

20.
The purpose of this study was, first, to understand the item hierarchy regarding students’ understanding of scientific models and modeling (USM). Secondly, this study investigated Taiwanese students’ USM progression from 7th to 12th grade, and after participating in a model-based curriculum. The questionnaire items were developed based on 6 aspects of USM, namely, model type, model content, constructed nature of models, multiple models, change of models, and purpose of models. Moreover, 10 representations of models were included for surveying what a model is. Results show that the purpose of models and model type items covered a wide range of item difficulties. At the one end, items for the purpose of models are most likely to be endorsed by the students, except for the item “models are used to predict.” At the other end, the “model type” items tended to be difficult. The students were least likely to agree that models can be text, mathematical, or dynamic. The items of the constructed nature of models were consistently located above the average, while the change of models items were consistently located around the mean level of difficulty. In terms of the natural progression of USM, the results show significant differences between 7th grade and all grades above 10th, and between 8th grade and 12th grade. The students in the 7th grade intervention group performed better than the students in the 7th and 8th grades who received no special instruction on models.
