Similar literature
 20 similar documents found
1.
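A Multivariate Generalizability Theory Study of the Reliability of the HSK Advanced Examination   Total citations: 2 (self-citations: 0; citations by others: 2)
This study applies multivariate generalizability theory to the objective paper of the HSK Advanced examination, examining its reliability, test structure, the composition of the total score, and possible improvements to item pretesting. The results show that: (1) the overall reliability of the objective paper and the reliability of each of its parts are good, and the composition of the total score is reasonable; (2) the proportion that each part contributes to the variance of the composite universe score is broadly consistent with the intended score weights, so the test structure is reasonable; (3) reliability remains high when the number of items in each part is moderately reduced, so item pretesting within the operational examination could be considered in future.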

2.
This paper describes a four-step approach to constructing diagnostic test profiles that provide precise but practical information on students' instructional needs. The approach is based on the specification and analysis of a domain and uses generalizability theory to determine which skills within the domain need to be assessed to diagnose gaps in students' skills and to estimate score profiles. A 64-item test of pronoun use was constructed to represent 32 categories of usage defined by different combinations of five factors in the domain. Generalizability analyses were conducted to determine the optimal number of categories to be included in students' profiles and the number of items needed for each category, and to produce univariate and multivariate estimates of students' universe scores. Multivariate profiles of universe scores were the most accurate and differed substantially from observed score and univariate universe score profiles.

3.
《教育实用测度》2013,26(3):225-264
The principal purpose of this article is to provide a conceptual framework and heuristic model for considering the existence, magnitude, and consequences of context effects. This purpose is addressed through an extension of some concepts in generalizability theory. In particular, distinctions are drawn between different types of facets and different types of universes. For example, sets of items are distinguished from conditions of measurement typically associated with context effects (e.g., item sequence). In addition, a validity-defining universe of generalization is distinguished from a universe of allowable observations associated with a standardized measurement procedure and certain fixed conditions of measurement. Other fixed conditions of measurement are also considered in order to examine context effects involved in applications such as the use of so-called "scrambled" test forms and item or section preequating. It is concluded that context effects are often misunderstood or masked, that current measurement models have rather serious limitations for examining context effects, and that the importance and magnitude of context effects need to be evaluated in context.

4.
Student–teacher interactions are dynamic relationships that change and evolve over the course of a school year. Measuring classroom quality through observations that focus on these interactions presents challenges when observations are conducted throughout the school year. Variability in observed scores could reflect true changes in the quality of student–teacher interaction or simply reflect measurement error. Classroom observation protocols should be designed to minimize measurement error while allowing measurable changes in the construct of interest. Treating occasions as fixed multivariate outcomes allows true changes to be separated from random measurement error. These outcomes may also be summarized through trend score composites to reflect different types of growth over the school year. We demonstrate the use of multivariate generalizability theory to estimate reliability for trend score composites, and we compare the results to traditional methods of analysis. Reliability estimates computed for average, linear, quadratic, and cubic trend scores from 118 classrooms participating in the MyTeachingPartner study indicate that universe scores account for between 57% and 88% of observed score variance.

5.
《教育实用测度》2013,26(4):331-345
To obtain objective measurement for examinations that are graded by judges, an extension of the Rasch model designed to analyze examinations with more than the usual two facets (items and examinees) is used. This extended Rasch model calibrates the elements of each facet of the examination (i.e., examinee performances, items, and judges) on a common log-linear scale. A network for assigning judges to examinations is used to link all facets. Real examination data from the "clinical assessment" part of a certification examination are used to illustrate the application. A range of item difficulties and judge severities was found. Comparison of examinee raw scores with objective linear measures corrected for variations in judge severity shows that judge severity can have a substantial impact on a raw score. Correcting for judge severity improves the fairness of examinee measures and of the subsequent pass-fail decisions because the uncorrected raw scores favor examinee performances graded by lenient judges.

6.
This study examined the exchangeability of total scores (i.e., intelligence quotients [IQs]) from three brief intelligence tests. Tests were administered to 36 children with intellectual giftedness, scored live by one set of primary examiners and later scored by a secondary examiner. For each student, six IQs were calculated, and all 216 values were submitted to a generalizability theory analysis. Despite strong convergent validity and reliability evidence supporting brief IQs, the resulting dependability coefficient was only .80, which indicates relatively low exchangeability across tests and examiners. Although error variance components representing the effects of the examiner, the examiner-by-examinee interaction, the examiner-by-test interaction, and the test contributed little to IQ variability, the component representing the test-by-examinee interaction contributed about one-third of the variance in IQs. These findings hold implications for selecting and interpreting brief intelligence tests and general testing for intellectual giftedness.

7.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.

8.
The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests.

9.
Item sampling and/or multiple matrix sampling techniques have been recommended for a variety of purposes. For some of these purposes, it must be assumed that examinee performance on a set of items is unaffected by the conditions under which the items are taken (i.e., no context effect exists). In this paper, factors that may lead to a context effect among high school students are discussed. The net effect of such factors on examinee scores for an English test and a mathematics test is investigated empirically. For the English test there was little support for the existence of a context effect. However, a definite context effect was found for the mathematics test.

10.
Under the generalizability-theory (G-theory) framework, the estimation precision of variance components (VCs) is of significant importance in that they serve as the foundation of estimating reliability. Zhang and Lin advanced the discussion of nonadditivity in data from a theoretical perspective and showed the adverse effects of nonadditivity on the estimation precision of VCs in 2016. Contributing to this line of research, the current article directs the discussion of nonadditivity from a theoretical perspective to a practical application and highlights the importance of detecting nonadditivity in G-theory applications. To this end, Tukey's test for nonadditivity is the only method to date that is appropriate for the typical single-facet G-theory design, in which a single observation is made per element within a facet. The current article evaluates the Type I and Type II error rates of Tukey's test. Results show that Tukey's test is satisfactory in controlling for falsely detecting nonadditivity when the data are actually additive and that it is generally powerful in detecting nonadditivity when it exists. Finally, the article demonstrates an application of Tukey's test in detecting nonadditivity in a judgmental study of educational standards and shows how Tukey's test results can be used to correct imprecision in the estimated VC in the presence of nonadditivity.
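As an illustration of the test named in this abstract, the following is a minimal sketch of the textbook form of Tukey's one-degree-of-freedom test for a two-way table with one observation per cell (for example, persons by items). The function name, its return values, and the sample table are hypothetical and are not taken from the article; the article's own procedure for correcting the estimated variance component is not reproduced here.

```python
def tukey_nonadditivity(table):
    """Tukey's one-df nonadditivity statistic for a two-way table
    with a single observation per cell (rows x columns)."""
    a, b = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (a * b)
    row_eff = [sum(row) / b - grand for row in table]
    col_eff = [sum(table[i][j] for i in range(a)) / a - grand
               for j in range(b)]

    # Residual sum of squares left by the additive (main-effects) model.
    ss_resid = sum(
        (table[i][j] - grand - row_eff[i] - col_eff[j]) ** 2
        for i in range(a) for j in range(b)
    )

    # One-df sum of squares attributable to a multiplicative interaction.
    cross = sum(table[i][j] * row_eff[i] * col_eff[j]
                for i in range(a) for j in range(b))
    ss_nonadd = cross ** 2 / (
        sum(r * r for r in row_eff) * sum(c * c for c in col_eff)
    )

    # Remainder after removing the nonadditivity term.
    df_rem = (a - 1) * (b - 1) - 1
    ss_rem = ss_resid - ss_nonadd
    f_stat = ss_nonadd / (ss_rem / df_rem)
    return ss_nonadd, ss_rem, df_rem, f_stat


# Hypothetical 3 x 3 table of scores (not data from the article).
scores = [[1, 2, 3],
          [2, 4, 5],
          [3, 5, 9]]
ss_nonadd, ss_rem, df_rem, f_stat = tukey_nonadditivity(scores)
```

The F statistic would be referred to an F distribution with 1 and `df_rem` degrees of freedom. The sketch assumes the table has nonzero row and column effects and is not perfectly fitted by the one-term interaction model; otherwise a division by zero occurs.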

11.
Equating test forms is an essential activity in standardized testing, and its importance has grown under the accountability systems created by the Adequate Yearly Progress mandate. It is through equating that scores from different test forms become comparable, which allows for the tracking of changes in the performance of students from one year to the next. This study compares three different item response theory scaling methods (fixed common item parameter, Stocking & Lord, and concurrent calibration) with respect to examinee classification into performance categories and estimation of the ability parameter when the content of the test form changes slightly from year to year and the examinee ability distribution changes. The results indicate that calibration methods, especially concurrent calibration, produced more stable results than the transformation method.

12.
For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

13.
In this article, performance assessments are cast within a sampling framework. More specifically, a performance assessment is viewed as a sample of student performance drawn from a complex universe defined by a combination of all possible tasks, occasions, raters, and measurement methods. Using generalizability theory, we present evidence bearing on the generalizability and convergent validity of performance assessments sampled from a range of measurement facets and measurement methods. Results at both the individual and school level indicate that task-sampling variability is the major source of measurement error. Large numbers of tasks are needed to get a reliable measure of mathematics and science achievement at the elementary level. With respect to convergent validity, results suggest that methods do not converge. Students' performance scores, then, are dependent on both the task and method sampled.

14.
An improved method is derived for estimating conditional measurement error variances, that is, error variances specific to individual examinees or specific to each point on the raw score scale of the test. The method involves partitioning the test into short parallel parts, computing for each examinee the unbiased estimate of the variance of part-test scores, and multiplying this variance by a constant dictated by classical test theory. Empirical data are used to corroborate the principal theoretical deductions.
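The three steps in this abstract can be sketched as follows. The abstract does not state the multiplying constant, so the sketch assumes strictly parallel parts with independent errors, under which the error variance of the total score is k times the error variance of a single part, where k is the number of parts. The function name and the sample scores are hypothetical.

```python
from statistics import variance


def conditional_error_variance(part_scores):
    """Estimate one examinee's error variance on the total score from
    that examinee's scores on k short parallel parts.

    Assumption (not stated in the abstract): the constant is k, which
    holds for strictly parallel parts with uncorrelated errors.
    """
    k = len(part_scores)
    # statistics.variance uses the unbiased (n - 1) denominator.
    return k * variance(part_scores)


# Hypothetical examinee with four part-test scores.
estimate = conditional_error_variance([3, 4, 5, 4])
```

Averaging such estimates over examinees who share a raw score would give an error variance for that point on the score scale, which is the second use the abstract mentions.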

15.
Responses to a 50-item, four-choice test were simulated for 1,000 examinees under conventional formula-scoring instructions. One hundred ninety-two simulation runs reflected variations in the average level of item difficulty, the extent to which examinees tended to omit inappropriately (when the formula-scoring directions recommended guessing), the extent to which they were misinformed (classified correct answers as distractors), the extent to which they guessed contrary to the formula-scoring instructions, the extent to which examinee ability and tendency to omit inappropriately were correlated, the examinee ability level at which misinformation was most prevalent, and the extent to which item difficulty was related to the probability that an examinee would be misinformed. For each examinee, formula scores and expected formula scores were determined allowing and not allowing inappropriate omissions. Under certain conditions, failure to guess as recommended by the formula-scoring instructions produced nontrivial proportions of examinees with expected score losses of one or more points. These conditions were a test of at least moderate difficulty, a low level for the tendency to be misinformed, and at least a moderate level for the tendency to omit inappropriately.
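To make the scoring rule concrete, here is a small sketch that assumes the conventional formula score, rights minus wrongs divided by (number of choices minus 1), with omits scored zero. It also shows why omitting can cost points: once an examinee can rule out at least one distractor, a guess has a positive expected formula score. The function names are illustrative and the simulation design of the study is not reproduced.

```python
from fractions import Fraction


def formula_score(right, wrong, n_choices=4):
    """Conventional formula score: R - W / (k - 1); omits count as zero."""
    return right - Fraction(wrong, n_choices - 1)


def expected_gain_from_guessing(n_remaining, n_choices=4):
    """Expected formula-score change from guessing at random among
    n_remaining options, exactly one of which is correct."""
    p_correct = Fraction(1, n_remaining)
    penalty = Fraction(1, n_choices - 1)
    return p_correct - (1 - p_correct) * penalty


# 30 right, 12 wrong, 8 omitted on a 50-item four-choice test.
score = formula_score(30, 12)

# Blind guessing breaks even; eliminating one distractor makes
# guessing favorable on average.
blind = expected_gain_from_guessing(4)
informed = expected_gain_from_guessing(3)
```

Exact fractions are used so that the break-even property of blind guessing shows up as exactly zero rather than as a rounding-level float.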

16.
Practical considerations in conducting an equating study often require a trade-off between testing time and sample size. A counterbalanced design (Angoff's Design II) is often selected because, as each examinee is administered both test forms and therefore the errors are correlated, sample sizes can be dramatically reduced over those required by a spiraling design (Angoff's Design I), where each examinee is administered only one test form. However, the counterbalanced design may be subject to fatigue, practice, or context effects. This article investigated these two data collection designs (for a given sample size) with equipercentile and IRT equating methodology in the vertical equating of two mathematics achievement tests. Both designs and both methodologies were judged to adequately meet an equivalent expected score criterion; Design II was found to exhibit more stability over different samples.

17.
The major aim of the present study is to assess college students’ attitudes, perceptions, emotional reactions and affective dispositions with respect to various critical dimensions of course achievement testing and assessment, including: “papers” vs. “exams”, “essay” vs. “multiple choice” type formats, “open book” vs. “closed book” exams, “free choice” among items vs. “no free choice” among items, and “oral” vs. “written” modes of test administration. A further aim is to delineate the construction, properties, and potential classroom uses and applications of a selected sample of examinee feedback inventories designed to gauge students’ test attitudes and dispositions. The use of each examinee feedback inventory is demonstrated and exemplified in the context of an empirical study. This paper discusses the assumptions underlying the use of feedback systems in college achievement evaluation; their importance for assessing the face validity of classroom tests; some possible future applications of feedback inventories for research and applied purposes in college; and some guidelines for future research. A mapping sentence specifying the universe of content of test attitude and examinee feedback research is suggested as a heuristic device for guiding future research.

18.
This study illustrates how generalizability theory can be used to evaluate the dependability of school-level scores in situations where test forms have been matrix sampled within schools, and to estimate the minimum number of forms required to achieve acceptable levels of score reliability. Data from a statewide performance assessment in reading, writing, and language usage were analyzed in a series of generalizability studies using a person:(school × form) design that provided variance component estimates for four sources: school, form, school × form, and person:(school × form). Six separate scores were examined. The results of the generalizability studies were then used in decision studies to determine the impact on score reliability when the number of forms administered within schools was varied. Results from the decision studies indicated that score generalizability could be improved when the number of forms administered within schools was increased from one to three forms, but that gains in generalizability were small when the number of forms was increased beyond three. The implications of these results for planning large-scale performance assessments are discussed.
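The decision-study logic described here can be sketched for the person:(school × form) design. The sketch assumes schools are the objects of measurement, forms and persons are random, cell sizes are equal, and decisions are relative, so the error variance is the school-by-form component divided by the number of forms plus the person component divided by the total number of persons per school. The variance components below are made-up values chosen only to show the diminishing returns pattern; they are not the study's estimates.

```python
def school_generalizability(var_s, var_sf, var_p_sf, n_forms, n_persons):
    """Generalizability coefficient for school mean scores in a
    person:(school x form) design (relative decisions).

    n_persons is the number of persons taking each form in a school.
    """
    rel_error = var_sf / n_forms + var_p_sf / (n_forms * n_persons)
    return var_s / (var_s + rel_error)


# Hypothetical variance components (not taken from the study).
VAR_S, VAR_SF, VAR_P_SF = 0.10, 0.05, 1.00

for n_forms in (1, 2, 3, 4, 5):
    coef = school_generalizability(VAR_S, VAR_SF, VAR_P_SF,
                                   n_forms=n_forms, n_persons=20)
    print(n_forms, round(coef, 3))
```

The form main effect does not enter the relative error under these assumptions; it would be added, divided by the number of forms, if absolute decisions (a dependability coefficient) were of interest.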

19.
The Standards for Educational and Psychological Testing indicate that multiple sources of validity evidence should be used to support the interpretation of test scores. In the past decade, examinee response processes, as a source of validity evidence, have received increased attention. However, there have been relatively few methodological studies of the accuracy and consistency of examinee response processes as measured by verbal reports in the context of educational measurement. The objective of the current study was to investigate the accuracy and consistency of examinee response processes, as measured by verbal reports, as a function of varying interviewer and item variables in a think-aloud interview within an educational measurement context. Results indicate that the accuracy of responses may be undermined when students perceive the interviewer to be an expert in the domain. Further, the consistency of response processes may be undermined when items that are too easy or too difficult are used to elicit reports. The implications of these results for conducting think-aloud studies are explored.

20.
Reliability coefficients of linear combinations of observed scores have anomalous properties which have led to persistent difficulties in the investigation of difference scores and gain scores in test theory. Interpretation of these test scores is further complicated by effects of correlated errors of measurement which are likely to appear in difference scores and gain scores in practice. In this paper the discrepancies between classical results and correct results obtained from more general formulas, which allow for correlated errors, are examined systematically. These discrepancies depend strongly on the reliability coefficients of the respective tests and are smallest when the influence of the variables related by the formulas is least. A vector representation of difference scores reveals that these anomalies arise from simple geometric relations among observed scores, true scores, and error scores inherent in the test-theory model. In this context, doubts as to the usefulness of difference scores and gain scores in testing practice expressed by previous authors appear to be justified.
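The contrast between the classical result and a formula that allows correlated errors can be sketched as below. This is one standard generalization derived from the classical true-score model, with the error covariance subtracted from the observed covariance to recover the true-score covariance; the abstract does not give the paper's own formulas, so the parameterization (an error correlation `r_err`) and the function name are assumptions.

```python
from math import sqrt


def difference_score_reliability(sd_x, sd_y, rel_x, rel_y, r_xy, r_err=0.0):
    """Reliability of D = X - Y under the classical true-score model.

    r_err is the correlation between the errors of X and Y;
    r_err = 0 reproduces the classical formula.
    """
    cov_xy = sd_x * sd_y * r_xy
    # Error standard deviations implied by each test's reliability.
    sd_ex = sd_x * sqrt(1 - rel_x)
    sd_ey = sd_y * sqrt(1 - rel_y)
    cov_err = r_err * sd_ex * sd_ey

    # True-score covariance is observed covariance minus error covariance.
    true_var_d = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - 2 * (cov_xy - cov_err)
    obs_var_d = sd_x ** 2 + sd_y ** 2 - 2 * cov_xy
    return true_var_d / obs_var_d


# Two tests with unit variance, reliability .80, observed correlation .50.
classical = difference_score_reliability(1, 1, 0.8, 0.8, 0.5)
with_corr = difference_score_reliability(1, 1, 0.8, 0.8, 0.5, r_err=0.5)
```

In this hypothetical case the classical formula understates the reliability of the difference score when errors are positively correlated, because part of the observed correlation between the two tests is then due to shared error rather than shared true score.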
