Similar Documents
 20 similar documents retrieved (search time: 343 ms)
1.
Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed response (c.r.) and multiple choice (m.c.) items and in overall difficulty. The scalings included those based on scores for the c.r. items that maintained the same number of levels as the item rubrics, either produced from single ratings or from multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form in which c.r. item scores were obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the c.r. items and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the m.c. items. Item fit was substantially worse with the combination 1PL/1PPC model than the 3PL/2PPC model due to the former's restrictive assumptions that there would be no guessing on the m.c. items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the 1PL/1PPC model producing estimates of item information that could be spuriously inflated for c.r. items that had three or more score levels. Information for some items with summed ratings was usually overestimated by 300% or more under the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints posed by the restricted model suggest limitations on the testing contexts in which the 1PL/1PPC model can be accurately applied.
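A minimal sketch of the mechanism described above (Python; the item parameters are illustrative, not from the study, and a dichotomous item stands in for the partial credit case): when an item's true discrimination is low, a model that fixes discrimination at a common value reports more information than the item actually provides.

```python
import numpy as np

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item: I = a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def info_1pl(theta, b):
    """Fisher information under a 1PL constraint (a fixed at 1, no guessing): I = P * Q."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

theta = np.linspace(-2, 2, 5)
# Hypothetical weakly discriminating item: the constrained model overstates its information
print(info_3pl(theta, a=0.6, b=0.0, c=0.0))
print(info_1pl(theta, b=0.0))
```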

2.
An important part of test development is ensuring alignment between test forms and content standards. One common way of measuring alignment is the Webb (1997, 2007) alignment procedure. This article investigates (a) how well item writers understand components of the definition of Depth of Knowledge (DOK) from the Webb alignment procedure and (b) how consistent their DOK ratings are with ratings provided by other committees of educators across grade levels, content areas, and alternate assessment levels in a Midwestern state alternate assessment system. Results indicate that many item writers understand key features of DOK. However, some item writers struggled to articulate what DOK means and had some misconceptions. Additional analyses suggested some lack of consistency between the item writer DOK ratings and the committee DOK ratings. Some notable differences were found across alternate assessment levels and content areas. Implications for future item writing training and alignment studies are provided.

3.
The alignment of test items to content standards is critical to the validity of decisions made from standards‐based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content‐matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between‐rater severity, evaluate intrarater consistency, and provide item‐level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim‐formative mathematics test items. Implications for the field and limitations of this approach are discussed.

4.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and those based on the standard errors of ratings should be avoided. Suggestions are provided for using these indexes in future research and practice.
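A rough sketch of the recommended absolute-deviation idea (Python; the item parameters, ratings, and cut point below are hypothetical, and the study's exact indexes may be defined differently):

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Model-implied probability of success at ability theta (3PL ICC)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def rater_abs_misfit(ratings, item_params, theta_cut):
    """Mean absolute deviation of a rater's Angoff ratings (judged borderline-group
    probabilities) from the ICC values at the target competence threshold."""
    expected = np.array([icc_3pl(theta_cut, *p) for p in item_params])
    return float(np.mean(np.abs(np.asarray(ratings) - expected)))

# Hypothetical: three items (a, b, c) and one panelist's ratings at theta_cut = 0.0
item_params = [(1.2, -0.5, 0.20), (0.8, 0.3, 0.15), (1.5, 1.0, 0.25)]
ratings = [0.75, 0.55, 0.40]
print(round(rater_abs_misfit(ratings, item_params, theta_cut=0.0), 3))
```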

5.
Students rated the quality of the items on a classroom test that had been taken previously. On the same test, psychometric item indices were calculated. The results showed that the student ratings were related to the item difficulty, but not to the item-test correlation. In addition, the better-achieving students tended to rate the items as less ambiguous. Finally, the ambiguity ratings were more highly related to the item-test correlations for the better-achieving students. These findings support opinions held by many instructors regarding students' judgments of item quality.

6.
An assumption of item response theory is that a person's score is a function of the item response parameters and the person's ability. In this paper, the effect of variations in instructional coverage on item characteristic functions is examined. Using data from the Second International Mathematics Study (1985), curriculum clusters were formed based on teachers' ratings of their students' opportunities to learn the items on a test. After forming curriculum clusters, item response curves were compared using signed and unsigned sums of squared differences. Some of the differences in the item response curves between curriculum clusters were found to be large, but better performance was not necessarily related to greater opportunity to learn. The item response curve differences were much larger than differences reported in prior studies based on comparisons of black and white students. Implications of the findings for applications of item response theory to educational achievement test data are discussed.

7.
Sixty-eight graduate students made general and specific ratings of the quality of 12 classroom test items, which varied in difficulty and discrimination. Four treatment combinations defined two additional factors: group discussion/no group discussion of test items and exposure/no exposure to an instructional module on test item construction. The students rated the items differentially, depending not only on item difficulty level but also on item discriminative power. The group discussion and exposure to module factors had significant effects on the general item ratings only. Implications of the research were discussed.

8.
This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common‐item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common‐item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content‐equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common‐item equating (circle‐arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small‐sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.
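The circle‐arc and nominal weights mean procedures in the study involve more machinery, but the core idea of carrying a cut score to a new form through common items can be sketched with chained mean equating (Python; all scores below are hypothetical):

```python
import numpy as np

def chained_mean_equate(x, total_old, anchor_old, anchor_new, total_new):
    """Chained mean equating through the common (anchor) items:
    old-form total -> anchor scale (old-form group) -> new-form total (new-form group)."""
    return (x - np.mean(total_old) + np.mean(anchor_old)
              - np.mean(anchor_new) + np.mean(total_new))

# Hypothetical small samples: total and anchor scores on each form
total_old  = np.array([68, 72, 75, 61, 70, 66, 74, 69, 71, 65])
anchor_old = np.array([14, 15, 16, 12, 15, 13, 16, 14, 15, 13])
total_new  = np.array([66, 70, 72, 60, 68, 64, 71, 67, 69, 63])
anchor_new = np.array([14, 15, 16, 12, 14, 13, 15, 14, 15, 13])

print(chained_mean_equate(65.0, total_old, anchor_old, anchor_new, total_new))  # new-form cut
```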

9.
This paper examines the stability and validity of a student evaluation of teaching (SET) instrument used by the administration at a university in the PR China. The SET scores for two semesters of courses taught by 435 teachers were collected. A total of 388 teachers (170 males and 218 females) were also invited to fill out the 60‐item NEO Five‐Factor Inventory together with a demographic information questionnaire. The SET responses were found to have very high internal consistency, and confirmatory factor analysis supported a one‐factor solution. The SET re‐test correlations were .62 both for the teachers who taught the same course (n = 234) and for those who taught a different course in the second semester (n = 201). Linguistics teachers received higher SET scores than either social science and humanities or science and technology teachers. Student ratings were significantly related to Neuroticism and Extraversion. Regression results showed that the Big‐Five personality traits as a group explained only 2.6% of the total variance of student ratings and academic discipline explained 12.7% of the total variance of student ratings. Overall, the stability and validity of the SET were supported, and future uses of SET scores in the PR China are discussed.

10.
Instructional sensitivity is the psychometric capacity of tests or single items to capture effects of classroom instruction. Yet, current item sensitivity measures' relationship to (a) actual instruction and (b) overall test sensitivity is rather unclear. The present study aims at closing these gaps by investigating test and item sensitivity to teaching quality, reanalyzing data from a quasi-experimental intervention study in primary school science education (1026 students, 53 classes, M_age = 8.79 years, SD_age = 0.49, 50% female). We examine (a) the correlation of item sensitivity measures and the potential for cognitive activation in class and (b) consequences for test score interpretation when assembling tests from items varying in their degree of sensitivity to cognitive activation. Our study (a) provides validity evidence that item sensitivity measures may be related to actual classroom instruction and (b) points out that inferences on teaching drawn from test scores may vary due to test composition.

11.
This article discusses regression effects that are commonly observed in Angoff ratings where panelists tend to think that hard items are easier than they are and easy items are more difficult than they are in comparison to estimated item difficulties. Analyses of data from two credentialing exams illustrate these regression effects and the persistence of these regression effects across rounds of standard setting, even after panelists have received feedback information and have been given the opportunity to discuss their ratings. Additional analyses show that there tended to be a relationship between the average item ratings provided by panelists and the standard deviations of those item ratings and that the relationship followed a quadratic form with peak variation in average item ratings found toward the middle of the item difficulty scale. The study concludes with discussion of these findings and what they may imply for future standard settings.
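The reported quadratic relationship between average ratings and their spread can be illustrated with a simple polynomial fit (Python; the rating summaries below are made up, not the credentialing-exam data):

```python
import numpy as np

# Hypothetical per-item summaries: mean Angoff rating and SD of panelists' ratings
mean_rating = np.array([0.25, 0.40, 0.50, 0.60, 0.75, 0.90])
sd_rating   = np.array([0.06, 0.10, 0.12, 0.11, 0.08, 0.04])

# Quadratic fit; a negative leading coefficient gives the inverted-U (peak mid-scale) pattern
a2, a1, a0 = np.polyfit(mean_rating, sd_rating, deg=2)
print(a2, a1, a0)
print(-a1 / (2 * a2))  # location of peak variation on the rating scale
```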

12.
The purpose of this study was to investigate whether a linear factor analytic method commonly used to investigate violation of the item response theory (IRT) unidimensionality assumption is sensitive to measurable curricular differences within a school district and to examine the possibility of differential item performance for groups of students receiving different instruction. For grades 3 and 6 in reading and mathematics, personnel from two midwestern school systems that regularly administer standardized achievement tests identified the formal textbook series used and provided ratings of test-instructional match for each school building (classroom). For both districts, the factor analysis results suggested no differences in percentages of variance for large first factors and relatively small second factors across ratings or series groups. The IRT analyses indicated little, if any, differential item performance for curricular subgroups. Thus, the impact of factors that might be related to curricular differences was judged to be minor.

13.
This article describes a method for identifying test items as disability neutral for children with vision and motor disabilities. Graduate students rated 130 items of the Preschool Language Scale and obtained inter‐rater correlation coefficients of 0.58 for ratings of items as disability neutral for children with vision disability, and 0.77 for ratings of items as disability neutral for children with motor disability. These ratings were used to create three item sets considered disability neutral for children with vision disability, motor disability, or both disabilities. Two methods for scoring the item sets were identified: scoring each set as a partially administered developmental test, or computing standard scores based upon pro‐rated raw score totals. The pro‐rated raw score method generated standard scores that were significantly inflated and therefore less useful for the assessment purposes than the ratio quotient method. This research provides a test accommodation technique for assessing children with multiple disabilities.

14.
Applied Measurement in Education, 2013, 26(1): 33-51
The objectives of this study were to examine the impact of different curricula on standardized achievement test scores at item and objective levels and to determine if different curricula generate different patterns of item factor loadings. School buildings from a middle-sized district were rated regarding the degree to which their curricula matched the content of the standardized test, and the actual textbook series used within each building (classroom) was determined. Covariate analyses of objective scores and plots and correlations of item p values indicated very small, nonsignificant differential effects across ratings and textbook series. Factor patterns indicated no curricular effects on large first factors. These findings parallel the results of a previous study conducted at the subtest level. We conclude that educators need not be unduly concerned about the impact of specific and generally small differences in curricular offerings within a district on standardized test scores or inferences to a broad content domain.

15.
This study investigated the comparability of Angoff-based item ratings on a general education test battery made by judges from within-content specialties and across content domains. Judges were from English, mathematics, science, and social studies specialties in teacher education programs in a midwestern state. Cutscores established from the judges' ratings of out-of-content items differed little from the cutscores set using the ratings made by the content specialists. Further, out-of-content ratings by judges were not more influenced by performance data than were the ratings provided by judges rating items within their content specialty. The degree to which these results generalize to other content specialties needs to be investigated.

16.
Identifying the Causes of DIF in Translated Verbal Items
Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, it could have important implications for test development, scoring, and equating. This study focuses on two questions: “Is DIF related to item type?” and “What are the causes of DIF?” The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic, with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.
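DIF screening of this kind is often done with procedures such as Mantel-Haenszel; the sketch below (Python; the counts are hypothetical, and this is not necessarily the method used for the Psychometric Entrance Test) computes the common odds ratio for one translated item across total-score strata:

```python
import numpy as np

def mantel_haenszel(right_ref, wrong_ref, right_foc, wrong_foc):
    """Mantel-Haenszel common odds ratio for one item; each array holds counts of
    correct/incorrect answers per matched (total-score) stratum."""
    n = right_ref + wrong_ref + right_foc + wrong_foc
    return np.sum(right_ref * wrong_foc / n) / np.sum(wrong_ref * right_foc / n)

# Hypothetical counts in three ability strata (reference = source-language examinees)
right_ref = np.array([30, 45, 60]); wrong_ref = np.array([20, 15, 10])
right_foc = np.array([25, 35, 50]); wrong_foc = np.array([25, 25, 20])

alpha = mantel_haenszel(right_ref, wrong_ref, right_foc, wrong_foc)
print(alpha, -2.35 * np.log(alpha))  # odds ratio and the ETS delta (MH D-DIF) scale
```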

17.
Large-scale assessments of student competencies address rather broad constructs and use parsimonious, unidimensional measurement models. Differential item functioning (DIF) in certain subpopulations usually has been interpreted as error or bias. Recent work in educational measurement, however, assumes that DIF reflects the multidimensionality that is inherent in broad competency constructs and leads to differential achievement profiles. Thus, DIF parameters can be used to identify the relative strengths and weaknesses of certain student subpopulations. The present paper explores profiles of mathematical competencies in upper secondary students from six countries (Austria, France, Germany, Sweden, Switzerland, the US). DIF analyses are combined with analyses of the cognitive demands of test items based on psychological conceptualisations of mathematical problem solving. Experts judged the cognitive demands of TIMSS test items, and these demand ratings were correlated with DIF parameters. We expected that cultural framings and instructional traditions would lead to specific aspects of mathematical problem solving being fostered in classroom instruction, which should be reflected in differential item functioning in international comparative assessments. Results for the TIMSS mathematics test were in line with expectations about cultural and instructional traditions in mathematics education of the six countries.

18.
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning fixed‐effect parameters to describe response patterns in the bundle. Unfortunately, this model becomes difficult to manage when a polytomous item is graded by more than two raters. In this study, by adding random‐effect parameters to the facets model, we propose a class of generalized rater models to account for the local dependence among multiple ratings and intrarater variation in severity. A series of simulations was conducted with the freeware WinBUGS to evaluate parameter recovery of the new models and consequences of ignoring the local dependence or intrarater variation in severity. The results revealed a good parameter recovery when the data‐generating models were fit, and a poor estimation of parameters and test reliability when the local dependence or intrarater variation in severity was ignored. An empirical example is provided.
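For context, the fixed-effects facets model that these generalized rater models extend can be written as a partial credit model with an added rater severity term; a minimal sketch follows (Python; the parameter values are invented, and the random-effect extensions estimated in WinBUGS are not shown):

```python
import numpy as np

def facets_probs(theta, item_diff, rater_sev, thresholds):
    """Category probabilities under a many-facet Rasch (facets) model:
    the log-odds of step k is theta - item_diff - rater_sev - threshold_k."""
    steps = theta - item_diff - rater_sev - np.asarray(thresholds)
    cum = np.concatenate(([0.0], np.cumsum(steps)))   # cumulative step sums, category 0 first
    expcum = np.exp(cum - cum.max())
    return expcum / expcum.sum()

# Hypothetical: one examinee, one item, a slightly severe rater, four score categories
print(facets_probs(theta=0.5, item_diff=0.0, rater_sev=0.3, thresholds=[-1.0, 0.0, 1.0]))
```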

19.
In this study we evaluated and compared three item selection procedures: the maximum Fisher information procedure (F), the a-stratified multistage computer adaptive testing (CAT) procedure (STR), and a refined stratification procedure that allows more items to be selected from the high a strata and fewer items from the low a strata (USTR), along with completely random item selection (RAN). The comparisons were with respect to error variances, reliability of ability estimates, and item usage through CATs simulated under nine test conditions of various practical constraints and item selection space. The results showed that F had an apparent precision advantage over STR and USTR under unconstrained item selection, but with very poor item usage. USTR reduced error variances for STR under various conditions, with small compromises in item usage. Compared to F, USTR enhanced item usage while achieving comparable precision in ability estimates; it achieved a precision level similar to F with improved item usage when items were selected under exposure control and with limited item selection space. The results provide implications for choosing an appropriate item selection procedure in applied settings.
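The contrast between the maximum information (F) and a-stratified (STR) rules can be sketched compactly (Python; the 2PL item pool is hypothetical, and the refined USTR variant and exposure controls are not implemented):

```python
import numpy as np

def info_2pl(theta, a, b):
    """2PL item information at theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_max_info(theta, pool, used):
    """F rule: administer the unused item with maximum Fisher information at theta."""
    return max((i for i in range(len(pool)) if i not in used),
               key=lambda i: info_2pl(theta, *pool[i]))

def select_stratified(theta, pool, used, stage, n_strata=3):
    """STR rule: restrict selection to the a-stratum for the current test stage
    (low a early, high a late), then match item difficulty to the current theta."""
    order = np.argsort([a for a, _ in pool])          # items ordered by discrimination
    stratum = np.array_split(order, n_strata)[min(stage, n_strata - 1)]
    candidates = [i for i in stratum if i not in used]
    return min(candidates, key=lambda i: abs(pool[i][1] - theta))

pool = [(0.6, -1.0), (0.9, 0.2), (1.4, 0.1), (1.8, 1.0), (1.1, -0.4), (0.7, 0.8)]
print(select_max_info(0.0, pool, used=set()))             # picks a high-a item immediately
print(select_stratified(0.0, pool, used=set(), stage=0))  # saves high-a items for later stages
```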

20.
