Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Methods are presented for comparing grades obtained in a situation where students can choose between different subjects. It must be expected that the comparison between the grades is complicated by the interaction between the students' pattern and level of proficiency on the one hand, and the choice of the subjects on the other hand. Three methods based on item response theory (IRT) for the estimation of proficiency measures that are comparable over students and subjects are discussed: a method based on a model with a unidimensional representation of proficiency, a method based on a model with a multidimensional representation of proficiency, and a method based on a multidimensional representation of proficiency where the stochastic nature of the choice of examination subjects is explicitly modeled. The methods are compared using the data from the Central Examinations in Secondary Education in the Netherlands. The results show that the unidimensional IRT model produces unrealistic results, which do not appear when using the two multidimensional IRT models. Further, it is shown that both multidimensional models produce acceptable model fit. However, the model that explicitly takes the choice process into account produces the best model fit.

2.
The paper investigates gender differences in mathematics performance among Italian students at the end of lower secondary school. The study is based on a new large-scale assessment test developed and administered by the National Evaluation Institute for the School System. Given the evidence in the literature that favors males, the performances of female and male students are compared using different approaches. Scores proposed by educational experts based on item subgroups were considered, while a model-based approach was used within item response theory. The results revealed a significant advantage to males in overall performance, while no meaningful differences were observed with respect to item domain and type. An interpretable item map was developed crossing expert opinions with IRT abilities, and plausible proficiency levels were defined. According to the map-based student classification, a relatively lower percentage of females than of males fell into the highest proficiency groups.

3.
Empirical studies demonstrated Type-I error (TIE) inflation (especially for highly discriminating easy items) of the Mantel-Haenszel chi-square test for differential item functioning (DIF), when data conformed to item response theory (IRT) models more complex than Rasch, and when IRT proficiency distributions differed only in means. However, no published study manipulated proficiency variance ratio (VR). Data were generated with the three-parameter logistic (3PL) IRT model. Proficiency VRs were 1, 2, 3, and 4. The present study suggests inflation may be greater, and may affect all highly discriminating items (low, moderate, and high difficulty), when IRT proficiency distributions of reference and focal groups differ also in variances. Inflation was greatest on the 21-item test (vs. 41) and 2,000 total sample size (vs. 1,000). Previous studies had not systematically examined sample size ratio. Sample size ratio of 1:1 produced greater TIE inflation than 3:1, but primarily for total sample size of 2,000.
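
For readers unfamiliar with the statistic examined in the abstract above, the following is a minimal sketch of the Mantel-Haenszel chi-square test for DIF with the usual 0.5 continuity correction. The function name `mantel_haenszel_chi2` and the tuple layout of each score-level table are illustrative assumptions and are not taken from the study.

```python
def mantel_haenszel_chi2(tables):
    """Mantel-Haenszel chi-square (1 df) with a 0.5 continuity correction.

    `tables` holds one 2x2 table per matched total-score level, given as
    (ref_correct, ref_wrong, focal_correct, focal_wrong). This layout is
    an assumption made for the sketch.
    """
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        n_ref, n_focal = a + b, c + d          # group sizes at this score level
        m_correct, m_wrong = a + c, b + d      # correct / wrong margins
        total = n_ref + n_focal
        if total < 2:
            continue  # the variance term is undefined for such a stratum
        sum_a += a
        sum_expected += n_ref * m_correct / total
        sum_var += (n_ref * n_focal * m_correct * m_wrong) / (total**2 * (total - 1))
    if sum_var == 0.0:
        raise ValueError("no informative strata")
    return (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
```

A call such as `mantel_haenszel_chi2([(15, 5, 5, 15)])` returns the statistic for a single score level; under the null hypothesis of no DIF it is referred to a chi-square distribution with 1 degree of freedom (5% critical value of about 3.84).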

4.
Given the relationships of item response theory (IRT) models to confirmatory factor analysis (CFA) models, IRT model misspecifications might be detectable through model fit indexes commonly used in categorical CFA. The purpose of this study is to investigate the sensitivity of three model fit indexes based on weighted least squares with adjusted means and variance (WLSMV), namely the root mean square error of approximation, the comparative fit index, and the Tucker–Lewis index, to IRT models that are misspecified due to local dependence (LD). It was found that WLSMV-based fit indexes have some functional relationships to the parameter estimate bias that violations of local independence cause in 2-parameter logistic models. Continued exploration into these functional relationships and development of LD-detection methods based on such relationships could hold much promise for providing IRT practitioners with global information on violations of local independence.

5.
A key intent of the NCLB growth pilot is to reward low-status schools that are closing the gap to proficiency. In this article, we demonstrate that the capability of proposed models to identify those schools depends on how the growth model is incorporated into accountability decisions. Six pilot-approved growth models were applied to vertically scaled mathematics assessment data from a single state collected over 2 years. Student and school classifications were compared across models. Accountability classifications using status and growth to proficiency as defined by each model were considered from two perspectives. The first involved adding the number of students moving toward proficiency to the count of proficient students, while the second involved a multitier accountability system where each school was first held accountable for status and then held accountable for the growth of its nonproficient students. Our findings emphasize the importance of evaluating status and growth independently when attempting to identify low-status schools with insufficient growth among nonproficient students.

6.
An approach to essay grading based on signal detection theory (SDT) is presented. SDT offers a basis for understanding rater behavior with respect to the scoring of constructed responses, in that it provides a theory of psychological processes underlying the raters' behavior. The approach also provides measures of the precision of the raters and the accuracy of classifications. An application of latent class SDT to essay grading is detailed, and similarities to and differences from item response theory (IRT) are noted. The validity and utility of classifications obtained from the SDT model and scores obtained from IRT models are compared. Validity coefficients were found to be about equal in magnitude across SDT and IRT models. Results from a simulation study of a 5-class SDT model with eight raters are also presented.

7.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.

8.
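A mathematics achievement test was administered to 774 valid participants from different types of schools, and the results were analyzed with parametric IRT models. Four conclusions were drawn: (1) test scores and optimal scores follow a negatively skewed distribution; (2) the test information function is shifted in the negative direction and is roughly bimodal; (3) subjective items fit the logistic model poorly; (4) mathematics achievement differs significantly among students from different types of schools.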

9.
《教育实用测度》 (Applied Measurement in Education), 2013, 26(4): 371-383
School-level assessment of student writing ability using a group-level, polytomous item response theory (IRT) model was illustrated in this study. The study supported the viability of an IRT-based school assessment as an alternative to the conventional approach based on aggregation of individual scores. The precision provided by the assumed assessment design varied dramatically depending on school size and school average ability. For small schools and students with low average abilities, differences in average school performance had to be quite large to be trustworthy. In contrast, the design provided greater precision in detecting differences for large schools and students with high average abilities. An operational use of this design would require great care in the reporting of results to ensure that unreliable school comparisons are clearly identified.

10.
Developing understanding of models and proficiency with modeling practice is challenging for both teachers and students. This 2-year study first investigated existing instructional strategies employed by teachers while teaching Earth and Space Science with dynamic physical models. Summer professional development introduced a conceptual framework, based on analogical reasoning, to help students strengthen and deepen the connections they make between a model and its real-world referent. The framework draws explicit attention to correspondences and non-correspondences between model and referent, an often overlooked component of modeling practice which underpins the ability to evaluate and thus improve a model. Teachers were guided to reflect on their own instructional use of models and to plan for integrating specific instructional strategies around models into their Year 2 practice. Classroom observation data reveal that from Years 1 to 2, teachers shifted from a more didactic approach in which they used physical models primarily as tools for demonstration toward more student engagement with models as problem-solving tools. On an assessment measuring their students' ability to reason with and about models, pre-post learning gains were higher in Year 2 than Year 1 across students at all ability levels. Together, these findings present evidence that teachers can learn to guide their students toward using physical models in ways that approximate key aspects of how scientists use runnable models, as envisioned by the Developing and Using Models practice of the Next Generation Science Standards.

11.
The article focuses on the development and assessment of skills during the work-based placements of business studies sandwich degree students. A total of forty-two skills are identified and these have been subsumed within two distinct frameworks according to their general or vocational nature. The importance of the work-based placement in acquiring these skills is borne out when this is compared against a variety of other methods.

Aggregated students’ ratings have been compared with those of employers. Whilst there appears to be some agreement about development having occurred during the placement for virtually all skills, the research highlights major inconsistencies between employers and students in their ratings of the actual level of proficiency finally achieved. This discrepancy between student and employer assessments suggests that a more comprehensive and systematic approach to assessment is required if formal credit is to be given for work-based learning within degree programmes. This could involve techniques such as triangulation, contract learning or the assessment of competencies, but the difficulty of ensuring a common frame of reference between the parties involved, and across the methods of assessment employed, is likely to remain a major issue, particularly among those advocating formal credit towards degree classifications.


12.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.

13.
When dealing with missing responses, two types of omissions can be discerned: items can be skipped or not reached by the test taker. When the occurrence of these omissions is related to the proficiency process, the missingness is nonignorable. The purpose of this article is to present a tree-based IRT framework for modeling responses and omissions jointly, taking into account that test takers as well as items can contribute to the two types of omissions. The proposed framework covers several existing models for missing responses, and many IRTree models can be estimated using standard statistical software. Further, simulated data are used to show that ignoring missing responses is less robust than often considered. Finally, as an illustration of its applicability, the IRTree approach is applied to data from the 2009 PISA reading assessment.

14.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods (maximum no target, maximum target, and theta maximum) using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.

15.
Standard 3.9 of the Standards for Educational and Psychological Testing (1999) demands evidence of model fit when item response theory (IRT) models are applied to data from tests. Hambleton and Han (2005) and Sinharay (2005) recommended the assessment of practical significance of misfit of IRT models, but few examples of such assessment can be found in the literature concerning IRT model fit. In this article, practical significance of misfit of IRT models was assessed using data from several tests that employ IRT models to report scores. The IRT model did not fit any data set considered in this article. However, the extent of practical significance of misfit varied over the data sets.

16.
《教育实用测度》 (Applied Measurement in Education), 2013, 26(2): 195-208
The consistency between raters over 3 years of a high-stakes performance assessment was examined in 2 studies that involved students in Grades 3, 5, and 8. The students' performance was evaluated in reading, writing, language usage, mathematics, science, and social studies. The results showed that the groups of raters used in different years differed in severity. Their consistency tended to improve over years, but differences between the rater groups remained. It is shown that these differences could affect students' proficiency classifications, indicating the need to adjust for rater effects during the equating process. The Grade 8 raters generally were found to be more consistent than the Grade 3 and Grade 5 raters. Also, the raters in mathematics generally were the most consistent, those in the language arts areas were the least consistent, and the consistency of raters in science and social studies varied over grade levels.

17.
Using a New Statistical Model for Testlets to Score TOEFL   (Total citations: 1; self-citations: 0; citations by others: 1)
Standard item response theory (IRT) models fit to examination responses ignore the fact that sets of items (testlets) often are matched with a single common stimulus (e.g., a reading comprehension passage). In this setting, all items given to an examinee are unlikely to be conditionally independent (given examinee proficiency). Models that assume conditional independence will overestimate the precision with which examinee proficiency is measured. Overstatement of precision may lead to inaccurate inferences as well as prematurely ended examinations in which the stopping rule is based on the estimated standard error of examinee proficiency (e.g., an adaptive test). The standard three-parameter IRT model was modified to include an additional random effect for items nested within the same testlet (Wainer, Bradlow, & Du, 2000). This parameter, γ, characterizes the amount of local dependence in a testlet.
We fit 86 TOEFL testlets (50 reading comprehension and 36 listening comprehension) with the new model, and obtained a value for the variance of γ for each testlet. We compared the standard parameters (discrimination (a), difficulty (b), and guessing (c)) with those obtained through traditional modeling. We found that difficulties were well estimated either way, but estimates of both a and c were biased when conditional independence was incorrectly assumed. Of greater import, we found that test information was substantially overestimated when conditional independence was incorrectly assumed.
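
The abstract above does not state the model equation. As a minimal sketch, assuming the parameterization commonly given for the Wainer, Bradlow, & Du (2000) testlet model, in which the person-by-testlet effect γ is subtracted inside the logit alongside the difficulty b, the response probability could be written as below. The function name `prob_testlet_3pl` is hypothetical.

```python
import math


def prob_testlet_3pl(theta, a, b, c, gamma):
    """Probability of a correct response under a 3PL model with a testlet effect.

    Assumed form: P = c + (1 - c) * logistic(a * (theta - b - gamma)),
    where gamma is the examinee's random effect for the testlet that
    contains the item. Setting gamma = 0 recovers the standard 3PL.
    """
    logit = a * (theta - b - gamma)
    return c + (1.0 - c) / (1.0 + math.exp(-logit))


# The same examinee and item, without and with a testlet effect.
p_standard = prob_testlet_3pl(theta=0.0, a=1.0, b=0.0, c=0.2, gamma=0.0)
p_with_effect = prob_testlet_3pl(theta=0.0, a=1.0, b=0.0, c=0.2, gamma=0.5)
```

In this form a shared γ shifts every item in the same testlet in the same direction for a given examinee, which is the local dependence that a model assuming conditional independence ignores; the variance of γ then indicates how strong that dependence is in each testlet.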

18.
A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understanding of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models.

19.
This article investigates ways to improve the assessment of English learner students' English language proficiency given the current movement of creating next-generation English language proficiency assessments in the Common Core era. In particular, this article discusses the integration of scaffolding strategies, which are prevalently utilized as an instructional strategy for English learner students, into the design of technology-enhanced assessment tasks. The article includes sample tasks and student responses to illustrate the design of scaffolding assessment tasks and their potential to increase the accuracy of measuring students' English language proficiency. We also explore possible scoring and psychometric models for the scaffolding tasks in large-scale standardized assessments.

20.
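Using second-year undergraduates at Taiyuan University of Technology as a sample, this study analyzed the correlation between self-assessment under metacognitive monitoring and actual test performance. The results show that the self-assessments of science and engineering students differ significantly from their actual test results. These students show a clear tendency to underestimate their own scores; they assess their translation ability fairly accurately, but their assessment of their reading ability is not accurate enough.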
