Similar Documents
20 similar documents were retrieved.
1.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): the item score approach (ISA), the testlet score approach (TSA), and the item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degree of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches; this pattern did not hold for GT. For a given approach, the IRT methods produced higher reliability estimates than the GT methods. Relatively smaller errors in the reliability estimates were observed for ISA and for the IRT methods. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is relatively strong dependence among within-testlet items, INTA should be considered for IRT because of the nonnegligible loss of information.
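For orientation, a standard way to write the generalizability coefficient for a persons-crossed-with-items-nested-in-testlets design, p × (i:t), of the kind compared above is shown below; this is the textbook form, not a formula reproduced from the study.

    E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t} + \dfrac{\sigma^2_{pi:t}}{n_t\, n_i}}

Here \sigma^2_p is the person variance, \sigma^2_{pt} the person-by-testlet variance, \sigma^2_{pi:t} the residual, n_t the number of testlets, and n_i the number of items per testlet; the larger \sigma^2_{pt} is (i.e., the stronger the within-testlet dependence), the more an item-level model that ignores it will overstate reliability.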

2.
An essential question when computing test–retest and alternate forms reliability coefficients is how many days there should be between tests. This article uses data from reading and math computerized adaptive tests to explore how the number of days between tests affects alternate forms reliability coefficients. Results suggest that the highest alternate forms reliability coefficients were obtained when the second test was administered at least 2 to 3 weeks after the first test. Even though reliability coefficients after this amount of time were often similar, the results suggested a potential tradeoff in waiting longer to retest, as student ability tended to grow with time. These findings indicate that if keeping student ability similar is a concern, the best time to retest is shortly after 3 weeks have passed since the first test. Additional analyses suggested that alternate forms reliability coefficients were lower when tests were shorter and that narrowing the ability distribution of examinees on the first test also affected the estimates. Results did not appear to be strongly affected by differences in first-test average ability, student demographics, or whether the student took the test under standard or extended time. For math and reading tests like the ones analyzed in this article, the optimal retest interval therefore appears to be shortly after 3 weeks have passed since the first test.
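A minimal sketch of how such an analysis can be organized in Python, assuming a hypothetical table with columns score_form1, score_form2, and days_between; the column names and interval bins are illustrative and are not taken from the article.

    import pandas as pd

    def reliability_by_gap(df, bins=(0, 7, 14, 21, 28, 60)):
        # Alternate forms reliability = Pearson r between the two test scores,
        # computed separately within each retest-interval bin.
        df = df.assign(gap_bin=pd.cut(df["days_between"], bins=bins))
        return df.groupby("gap_bin", observed=True).apply(
            lambda g: g["score_form1"].corr(g["score_form2"])
        )

Plotting the resulting coefficients against the interval bins is one simple way to look for the plateau that the study reports at roughly three weeks.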

3.
C‐tests are a specific variant of cloze tests that are considered time‐efficient, valid indicators of general language proficiency. They are commonly analyzed with models of item response theory assuming local item independence. In this article we estimated local interdependencies for 12 C‐tests and compared the changes in item difficulties, reliability estimates, and person parameter estimates for different modeling approaches: (a) Rasch, (b) testlet, (c) partial credit, and (d) copula models. The results are complemented with findings of a simulation study in which sample size, number of testlets, and strength of residual correlations between items were systematically manipulated. Results are discussed with regard to the pivotal question whether residual dependencies between items are an artifact or part of the construct.
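For readers unfamiliar with approach (b), the Rasch testlet model adds a person-specific testlet effect to the ordinary Rasch model; the standard form (not quoted from the article) is

    P(X_{pi} = 1 \mid \theta_p) = \frac{\exp(\theta_p - b_i + \gamma_{p\,t(i)})}{1 + \exp(\theta_p - b_i + \gamma_{p\,t(i)})}

where \theta_p is the person parameter, b_i the item difficulty, and \gamma_{p t(i)} the person's effect for the testlet t(i) containing item i; setting all \gamma to zero recovers the Rasch model in (a).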

4.
The latent state–trait (LST) theory is an extension of classical test theory that allows one to decompose a test score into a true trait, a true state residual, and an error component. For practical applications, the variances of these latent variables may be estimated with standard methods of structural equation modeling (SEM). These estimates allow one to decompose the coefficient of reliability into a coefficient of consistency (indicating true effects of the person) plus a coefficient of occasion specificity (indicating true effects of the situation and the person–situation interaction). One disadvantage of this approach is that the standard SEM analysis requires large sample sizes. This article aims to overcome this disadvantage by presenting a simple method that allows one to estimate the LST parameters algebraically from the observed covariance matrix. A Monte Carlo simulation suggests that the proposed method may be superior to the standard SEM analysis in small samples.
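The decomposition referred to above can be summarized compactly; this is the standard LST variance decomposition rather than a formula quoted from the article.

    \text{reliability} = \frac{\sigma^2_{\text{trait}} + \sigma^2_{\text{state residual}}}{\sigma^2_{\text{observed}}}, \qquad
    \text{consistency} = \frac{\sigma^2_{\text{trait}}}{\sigma^2_{\text{observed}}}, \qquad
    \text{occasion specificity} = \frac{\sigma^2_{\text{state residual}}}{\sigma^2_{\text{observed}}}

so that the coefficient of reliability equals the coefficient of consistency plus the coefficient of occasion specificity.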

5.
The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests.

6.
Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only the multiple-choice KR20's, with more misinformation producing higher KR20's. Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice testing was judged sufficiently small that other considerations might justifiably dictate the choice of format.
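For reference, the KR20 coefficient reported above is the standard internal consistency index for dichotomous items (general formula, not specific to this simulation):

    KR20 = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma^2_X}\right)

where k is the number of items, p_i the proportion of examinees answering item i correctly, q_i = 1 - p_i, and \sigma^2_X the variance of total scores.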

7.
This article focuses on a problem described by Anderson: the quality of the construction and reporting of research that relies on “homemade” achievement tests. The current status of homemade achievement tests was examined in this study. Research reports in two science education journals were analyzed using Anderson's eight categories of information that a high-quality research report should include. The journals examined were the Journal of Research in Science Teaching and Science Education from January 1975 to January 1980. The findings indicate that reliability estimates and the procedures for selecting test items were mentioned more frequently in the science education journals than in Anderson's study. However, there was little or no improvement in describing the relation of test items to instruction. These findings should be of interest both to researchers who use “homemade” achievement tests in their studies and to journal panels that review studies for publication.

8.
For a national test, regular evaluation is indispensable. The key to evaluating a language test and studying its validity is the study of reliability, or consistency. This study used parallel TEM4 test papers and carried out reliability statistics and difference analyses on each. It not only examined the consistency between the parallel tests but also, where differences were found, located the specific tests or items responsible for them. Such localization will play a positive role in subsequent test construction, pretesting, and test-form assembly.

9.
This study illustrates how generalizability theory can be used to evaluate the dependability of school-level scores in situations where test forms have been matrix sampled within schools, and to estimate the minimum number of forms required to achieve acceptable levels of score reliability. Data from a statewide performance assessment in reading, writing, and language usage were analyzed in a series of generalizability studies using a person:(school × form) design that provided variance component estimates for four sources: school, form, school × form, and person:(school × form). Six separate scores were examined. The results of the generalizability studies were then used in decision studies to determine the impact on score reliability when the number of forms administered within schools was varied. Results from the decision studies indicated that score generalizability could be improved when the number of forms administered within schools was increased from one to three forms, but that gains in generalizability were small when the number of forms was increased beyond three. The implications of these results for planning large-scale performance assessments are discussed.
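One plausible way to write the school-level generalizability coefficient for the person:(school × form) design described above, with n_f forms per school and n_p persons per school-form combination, is given below; the exact decision-study formulation used in the article may differ.

    E\rho^2_{\text{school}} = \frac{\sigma^2_s}{\sigma^2_s + \dfrac{\sigma^2_{sf}}{n_f} + \dfrac{\sigma^2_{p:(sf)}}{n_p\, n_f}}

Increasing n_f shrinks both error terms, which is why generalizability improves as more forms are matrix sampled within a school.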

10.
This Monte Carlo simulation study investigated methods of forming product indicators for the unconstrained approach for latent variable interaction estimation when the exogenous factors are measured by large and unequal numbers of indicators. Product indicators were created based on multiplying parcels of the larger scale by indicators of the smaller scale, multiplying the three most reliable indicators of each scale matched by reliability, and matching items by reliability to create as many product indicators as the number of indicators of the smallest scale. The unconstrained approach was compared with the latent moderated structural equations (LMS) approach. All methods considered provided unbiased parameter estimates. Unbiased standard errors were obtained in all conditions with the LMS approach and when the sample size was large with the unconstrained approach. Power levels to test the latent interaction and Type I error rates were similar for all methods but slightly better for the LMS approach.

11.
This article examines the issue of the quality of teacher-produced tests, limiting itself in the current context to objective, multiple-choice tests. The article investigates a short, two-part, 20-item English language test. After a brief overview of the key test qualities of reliability and validity, the article examines the two subtests in terms of test and item quality, using standard classical test statistics. Unsurprisingly, the pretested items outperformed the teacher-produced items. The differences between the two subtests underscore issues about the quality (or lack thereof) of teacher-produced tests. The article ends with suggestions for how teacher-produced tests might be improved.

12.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

13.
Unidimensionality and local independence are two common assumptions of item response theory. The former implies that all items measure a common latent trait, while the latter implies that responses are independent, conditional on respondents’ location on the latent trait. Yet, few tests are truly unidimensional. Unmodeled dimensions may result in test items displaying dependencies, which can lead to misestimated parameters and inflated reliability estimates. In this article, we investigate the dimensionality of interim mathematics tests and evaluate the extent to which modeling minor dimensions in the data change model parameter estimates. We found evidence of minor dimensions, but parameter estimates across models were similar. Our results indicate that minor dimensions outside the primary trait have negligible consequences on parameter estimates. This finding was observed despite the ratio of multidimensional to unidimensional items being above previously recommended thresholds.

14.
Decisions on admission to university and placement into university courses are usually based on the results of achievement tests (as in secondary school exams) and/or aptitude tests (intelligence-type tests and the SAT). This paper argues that in a situation where educational provision at secondary school level is highly unequal, a third approach to testing offers an alternative that is preferable both on grounds of cognitive psychological theory and because it yields much better discrimination. The Alternative Admissions Research Project at the University of Cape Town has developed a mathematics test according to the dynamic testing approach advocated by Miller (1990), both for admitting African students from grossly under-resourced schools and for placing these and other students into a diversifying first-year curriculum. This approach aims to assess the ability of a candidate to learn from authentic academic material within the test. This paper focuses on the reasons for developing the mathematics test and on the process by which the test questions were developed and piloted. The reliability of the test and its correlations with subsequent mathematical performance data are discussed. Following the encouraging data for the test as an admission mechanism, the value of the dynamic testing approach for furnishing additional information for placement into an increasingly varied first-year curriculum was investigated. This enabled the piloting of more topics and a more comprehensive validation of this type of testing. The paper concerns itself with the reliability and predictive value of each of the topics in this placement test for a range of core courses in various faculties, and with the extent to which these tests can identify potentially at-risk students who should be placed onto an appropriate curriculum.

15.
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies, and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test; these were due to test speededness or context dependence (related to passage structure). The results also highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may affect the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.
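One widely used LID detection method of the kind reviewed in (a) is Yen's Q3 statistic: the correlation, across examinees, between the residuals of a pair of items after an IRT model has been fitted. A minimal sketch is given below, assuming responses is a persons-by-items array of 0/1 scores and expected is the matching array of model-implied probabilities from a previously fitted model; the variable names are illustrative only.

    import numpy as np

    def q3_matrix(responses, expected):
        # Residual for each person-item cell: observed score minus model-implied probability.
        residuals = np.asarray(responses, dtype=float) - np.asarray(expected, dtype=float)
        # Q3 for an item pair is the correlation of the two residual columns across persons.
        q3 = np.corrcoef(residuals, rowvar=False)
        np.fill_diagonal(q3, np.nan)  # an item paired with itself is not informative
        return q3

Item pairs with Q3 values well above the average off-diagonal value are typically flagged as locally dependent and are candidates for grouping into testlets.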

16.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test, both for the same panel and for a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
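A rough sketch of the two ingredients described above, under simple assumptions: an Angoff cut score computed as the average over panelists of their summed item probability judgments, and a proportionally stratified draw of roughly 45 items by content area. The function names, column names, and rounding rule are hypothetical, not the operational procedure used in the studies.

    import pandas as pd

    def angoff_cut_score(ratings):
        # ratings: one row per panelist, one column per item; each cell is the judged
        # probability that a minimally competent examinee answers the item correctly.
        return ratings.sum(axis=1).mean()

    def stratified_subset(items, n_total=45, strata_col="content_area", seed=0):
        # Draw items from each content stratum in proportion to its share of the full test.
        frac = n_total / len(items)
        return (items.groupby(strata_col, group_keys=False)
                     .apply(lambda g: g.sample(max(1, round(len(g) * frac)),
                                               random_state=seed)))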

17.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
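The coefficient described above is usually attributed to Livingston. In its standard form, with criterion (cut) score C, squared deviations from the mean are replaced by squared deviations from C; the textbook version is

    k^2(X, T_X) = \frac{\sigma^2_T + (\mu_X - C)^2}{\sigma^2_X + (\mu_X - C)^2}

and when C equals the mean \mu_X it reduces to the conventional norm-referenced coefficient \sigma^2_T / \sigma^2_X, which is the sense in which norm-referenced measurement is a special case.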

18.
Various item selection techniques are compared with respect to the resulting criterion-referenced reliability and validity. Techniques compared include three nominal criterion-referenced methods, traditional point-biserial selection, teacher selection, and random selection. Eighteen volunteer junior and senior high school teachers supplied behavioral objectives and item pools ranging from 26 to 40 items. Each teacher obtained responses from four classes. Pairs of tests of various lengths were developed using each item selection method. Estimates of test reliability and validity were obtained using responses independent of the test construction sample. The resulting reliability and validity estimates were compared across item selection techniques. Two of the criterion-referenced item selection methods resulted in consistently higher observed validity. However, the small magnitude of the improvement over teacher or random selection raises a question as to whether the benefit warrants the extra effort required of the classroom teacher.

19.
Assessment practitioners are often encouraged to adopt an “intelligent” approach to the interpretation of intelligence tests. A fundamental assumption of the “intelligent testing” philosophy is that psychometric test information (e.g., subtest g loadings) should be considered during the interpretive process. The relevant psychometric information is provided in the form of sample-based estimates. Unfortunately, the accuracy of these estimates, and the subsequent qualitative classification of intelligence subtests (e.g., good, fair, poor), are influenced to an unknown degree by sampling error. The current study demonstrated how data smoothing procedures, procedures commonly used in the development of continuous test norms, can be used to provide better estimates of the reliability, uniqueness, and general factor characteristics for the WISC-III subtests. © 1997 John Wiley & Sons, Inc.

20.
Reliability of a criterion-referenced test is often viewed as the consistency with which individuals who have taken two strictly parallel forms of a test are classified as being masters or nonmasters. However, in practice, it is rarely possible to retest students, especially with equivalent forms. For this reason, methods for making conservative approximations of alternate form (or test-retest “without the effects of testing”) reliability have been developed. Because these methods are computationally tedious and require some psychometric sophistication, they have rarely been used by teachers and school psychologists. This paper (a) describes one method (Subkoviak's) for estimating alternate-form reliability from one administration of a criterion-referenced test and (b) describes a computer program developed by the authors that will handle tests containing hundreds of items for large numbers of examinees and allow any test user to apply the technique described. The program is a superior alternative to other methods of simplifying this estimation procedure that rely upon tables; a user can check classification consistency estimates for several prospective cut scores directly from a data file, without having to make prior calculations.
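A heavily simplified sketch of the general idea behind a Subkoviak-type single-administration estimate: approximate each examinee's true proportion-correct by regressing the observed proportion toward the group mean using a reliability estimate, use a binomial error model to get the probability of being classified a master on a hypothetical parallel form, and average the agreement probability over examinees. This is an illustrative approximation, not the authors' program or Subkoviak's exact procedure.

    import numpy as np
    from scipy.stats import binom

    def classification_consistency(scores, n_items, cut_score, reliability):
        # scores: observed raw scores; cut_score: minimum raw score classified as "master".
        scores = np.asarray(scores, dtype=float)
        p_obs = scores / n_items
        # Regress observed proportion-correct toward the group mean (Kelley-type estimate).
        p_true = reliability * p_obs + (1.0 - reliability) * p_obs.mean()
        # Probability that an examinee with this true proportion passes a parallel form.
        p_pass = 1.0 - binom.cdf(cut_score - 1, n_items, p_true)
        # Probability of the same master/nonmaster decision on two independent forms,
        # averaged over examinees.
        return float(np.mean(p_pass**2 + (1.0 - p_pass)**2))

Looping this function over several prospective cut scores mimics the kind of check the program described above makes available directly from a data file.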
