Similar literature
1.
This study used post mortem item-examinee sampling to investigate empirically the relative merits of two alternative procedures for allocating items to subtests in multiple matrix sampling, and the feasibility of using the jackknife to approximate standard errors of estimate. The results indicate clearly that a partially balanced incomplete block design is preferable to random sampling in allocating items to subtests. The jackknife was found to approximate standard errors of estimate better under random sampling than under the partially balanced incomplete block design.
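
The jackknife mentioned in this abstract can be illustrated with a minimal sketch. It shows the generic delete-one jackknife standard error, not the specific resampling scheme the study applied to matrix-sampled data, which the abstract does not describe. The function name `jackknife_se` and the sample scores are hypothetical.

```python
import math
from statistics import mean


def jackknife_se(data, statistic):
    """Delete-one jackknife standard error of statistic(data).

    `data` is a list of observations; `statistic` maps a list to a number.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the statistic with each observation left out in turn.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


# Hypothetical scores, for illustration only.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
se_mean = jackknife_se(scores, mean)
```

For the sample mean, this estimate coincides with the familiar s / sqrt(n), which makes the mean a convenient check before applying the function to less tractable statistics.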

2.
Exact nonparametric procedures have been used to identify the level of differential item functioning (DIF) in binary items. This study explored the use of exact DIF procedures with items scored on a Likert scale. The results from an attitude survey suggest that the large-sample Cochran-Mantel-Haenszel (CMH) procedure identifies more items as statistically significant than two comparable exact nonparametric methods. This finding is consistent with previous findings; however, when items are classified in National Assessment of Educational Progress DIF categories, the results show that the CMH and its exact nonparametric counterparts produce almost identical classifications. Since DIF is often evaluated in terms of statistical and practical significance, this study provides evidence that the large-sample CMH procedure may be safely used even when the focal group has as few as 76 cases.

3.
Monte Carlo studies offer the opportunity to manipulate data sets with specified characteristics and to examine the generalizability of statistical procedures in ways that are not practical using actual (empirical) data. To determine whether computer-generated (simulated) data results accurately represent empirical data, this study replicated an investigation of the effects of item sampling plans in the application of multiple matrix sampling using both simulated and empirical data sets. Although results were similar, the empirical data results were more precise. This study suggests that, for some investigations, it may be important to confirm simulated study with empirical study.

4.
To combat problems of cheating arising from testing under crowded classroom conditions, instructors frequently use multiple arrangements of a set of test items. These different arrangements or forms should be nearly equivalent relative to mean total scores. This study reports data from comparisons involving eleven pairs of equivalent tests. There were no significant linear relationships between equivalent test forms on the ordering of item difficulties. Reliabilities differed little within pairs of equivalent tests. Nine of eleven t-tests comparing mean total test scores were nonsignificant. The bulk of these data supported the assumption that one may construct equivalent power tests by rearranging items, when the ordering of item difficulty is non-systematic on both arrangements.

5.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
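
The mean b-value method named in this abstract can be sketched roughly as follows. Under the one-parameter model the scale transformation is assumed to reduce to a constant shift, taken here as the difference between the mean anchor-item difficulties in the two calibrations. The function names and the difficulty values are hypothetical, and the study's exact operational procedure may differ.

```python
from statistics import mean


def mean_b_shift(anchor_b_base, anchor_b_new):
    """Equating constant: mean base-year b minus mean new-year b on the anchors."""
    if not anchor_b_base or len(anchor_b_base) != len(anchor_b_new):
        raise ValueError("anchor difficulty lists must be non-empty and paired")
    return mean(anchor_b_base) - mean(anchor_b_new)


def to_base_scale(values, shift):
    """Place new-year difficulties (or proficiencies) on the base-year scale."""
    return [v + shift for v in values]


# Hypothetical anchor-item difficulties from two separate calibrations.
base_year = [-1.0, 0.0, 1.0]
new_year = [-1.5, -0.5, 0.5]
shift = mean_b_shift(base_year, new_year)
rescaled = to_base_scale(new_year, shift)
```

Because the shift is an average over the anchor items, a single anchor item whose relative difficulty drifts moves the constant more when the anchor set is short, which is consistent with the instability the abstract reports.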

6.
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated leveraging data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly different from human-to-human inter-rater agreement for essays but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.

7.
A computer simulation study was conducted to determine the feasibility of using logistic regression procedures to detect differential item functioning (DIF) in polytomous items. One item in a simulated test of 25 items contained DIF; parameters for that item were varied to create three conditions of nonuniform DIF and one of uniform DIF. Item scores were generated using a generalized partial credit model, and the data were recoded into multiple dichotomies in order to use logistic regression procedures. Results indicate that logistic regression is powerful in detecting most forms of DIF; however, it required large amounts of data manipulation, and interpretation of the results was sometimes difficult. Some logistic regression procedures may be useful in the post hoc analysis of DIF for polytomous items.

8.
For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

9.
Parent or guardian perceptions play a specialized role in the evaluation of school teachers. Parents are important stakeholders in teacher success; they are in some instances partners in the teachers' work, they have unique personal information about student learning, and they can report on the teacher's duties to inform parents about the classroom and child progress. This study analyzed the responses of parents to 12 survey items concerning teacher performance in 201 classrooms. The surveys were used as part of an innovative teacher evaluation program in which teachers elected to include parent feedback as one objective data source for annual review. In this study three factors emerged as important concerns for parents: humane treatment of students, support for pupil learning, and effective communication and collaboration with parents. Recommendations for use of specific survey items can be based on the empirical results of this sampling. The data gathered by parent surveys define one dimension of quality which may vary in importance from one teacher to another.

10.
This study was designed to research the question of scrambling item content in the construction of achievement tests, so that very general implications could be drawn for both examinee and item populations. To achieve this generality, the methodology of multiple matrix sampling was combined with a simple two group experimental design: a random group of 8th graders responded to mathematics, science, social studies, reading, and language arts achievement items organized in a scrambled (random) test format, while another random group responded to the same items organized in a fixed (segregated by subject matter) test format. The results indicated that scrambling cognitive test items has minimal or no effect on mean examinee test performance or on any of the other parameters included in the analysis.

11.
The sampling procedures were designed so that the full matrix of item variances and covariances could be estimated. Three subtest sizes were investigated: subtests of five, nine, and sixteen items. In each of these implementations a double cross validation was used, yielding two predicted scores for each individual. Discrepancy measures were also computed showing the difference between the observed and the predicted scores. The prediction of individual scores was accomplished within various ranges of error. The correlations between predicted scores and observed scores ranged from the .70s to the .90s, depending on the number of predictor variables used. The procedure is applicable in situations in which large numbers of individuals are tested or in situations where multiple measures are taken.

12.
Conventional multilevel modeling works well with purely hierarchical data; however, pure hierarchies rarely exist in real datasets. Applied researchers employ ad hoc procedures to create purely hierarchical data. For example, applied educational researchers either delete mobile participants' data from the analysis or identify the student only with the last school attended while including an explanatory variable indicating whether a student is mobile. This simulation study compared the parameter and standard error estimates of these two ad hoc procedures for handling and assessing the influence of mobility on outcomes with results based on use of the multiple membership random effects model. Substantial bias was found for some parameters when multiple membership data structures were ignored.

13.
Although the Angoff procedure is among the most widely used standard setting procedures for tests comprising multiple‐choice items, research has shown that subject matter experts have considerable difficulty accurately making the required judgments in the absence of examinee performance data. Some authors have viewed the need to provide performance data as a fatal flaw for the procedure; others have considered it appropriate for experts to integrate performance data into their judgments but have been concerned that experts may rely too heavily on the data. There have, however, been relatively few studies examining how experts use the data. This article reports on two studies that examine how experts modify their judgments after reviewing data. In both studies, data for some items were accurate and data for other items had been manipulated. Judges in both studies substantially modified their judgments whether the data were accurate or not.

14.
This study used a Monte Carlo approach to investigate the effect of item sampling by item stratification on parameter estimation when applying multiple matrix sampling to achievement data. The results indicate that the combination offering a practical compromise between precision and sample size is item sampling based on stratification by item discrimination, together with a sampling plan with a moderate number of subtests. This sampling condition provides reasonable precision of the mean and variance estimates but requires only a moderately sized sample.

15.
A norm distribution consisting of test scores received by 810 college students on a 150-item, dichotomously scored, 4-alternative multiple-choice test was empirically estimated through several item-examinee sampling procedures. The post mortem item-sampling investigation was specifically designed to manipulate systematically the variables of number of subtests, number of items per subtest, and number of examinees responding to each subtest. Defining one observation as the score received by one examinee on one item, the results suggest that as the number of observations increases beyond 1.23% of the data base all procedures produce stochastically equivalent results. The results of this investigation indicate that, in estimating a norm distribution by item-sampling, the variable of importance is not the item-sampling procedure per se but is instead the number of observations obtained by the procedure. It should be noted, however, that in this investigation the test score norm distribution was approximately symmetrical and the possibility should not be overlooked that item-sampling as a procedure may be robust only for symmetrical norm distributions.

16.
This study empirically compared the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower ability and higher ability student populations in accuracy of estimated means and standard deviations when the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation of sequences of items occurring in item sampling need not have a significant effect on test performance.

17.
When judgmental and statistical procedures are both used to identify potentially gender-biased items in a test, to what extent do the results agree? In this study, both procedures were used to evaluate the items in a statewide, 78-item, multiple-choice test of science knowledge. Only one item was flagged by the sensitivity reviewers as being potentially biased, but this item was not flagged by the statistical procedure. None of the nine items flagged by the Mantel-Haenszel procedure were flagged by the sensitivity reviewers. Eight of the nine statistically flagged items were differentially easier for males. Four of these eight measured the same category of objectives. The authors conclude that both judgmental and statistical procedures provide useful information and that both should be used in test construction. They caution readers that content-validity issues need to be addressed when making decisions based on the results of either procedure.
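
As a rough illustration of the Mantel-Haenszel procedure named in this abstract, the sketch below computes the common odds ratio across matched score strata and the corresponding delta-scale index (commonly written MH D-DIF). The counts, the function names, and the assignment of reference and focal groups are assumptions for illustration; the abstract does not report the flagging rule the study used.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score strata.

    Each stratum is (ref_right, ref_wrong, focal_right, focal_wrong).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty stratum contributes nothing
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_d_dif(alpha):
    """Delta-scale index; negative values indicate the item favors the reference group."""
    return -2.35 * math.log(alpha)


# Hypothetical counts for a single score stratum.
alpha = mantel_haenszel_odds_ratio([(20, 10, 10, 20)])
delta = mh_d_dif(alpha)
```

An odds ratio of 1 (delta of 0) corresponds to no DIF after matching on total score; an item that is differentially easier for the reference group yields an odds ratio above 1 and a negative delta.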

18.
Practical use of the matrix sampling (i.e. item sampling) technique requires the assumption that an examinee's response to an item is independent of the context in which the item occurs. This assumption was tested experimentally by comparing the responses of examinees to a population of items with the responses of examinees to item samples. Matrix sampling mean and variance estimates for verbal, quantitative, and attitude tests were used as dependent variables to test for differences between the “context” and “out-of-context” groups. The estimates obtained from both treatment groups were also compared with actual population values. No significant differences were found between treatments on matrix sample parameter estimates for any of the three types of tests.
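
A matrix sampling mean estimate of the kind referred to in this abstract can be sketched as below. This is a simple pooled estimator for dichotomously scored items, assuming every subtest has the same number of items and the same number of examinees; it is not necessarily the estimator the study used, and the function name and data are hypothetical.

```python
def matrix_sample_mean(subtests, total_items):
    """Estimate the mean total-test score from item-examinee samples.

    `subtests` is a list of score matrices, one per subtest; each matrix is a
    list of examinee rows holding item scores (1 = correct, 0 = incorrect).
    """
    score_sum = 0
    observations = 0
    for matrix in subtests:
        for row in matrix:
            score_sum += sum(row)
            observations += len(row)
    if observations == 0:
        raise ValueError("no observations")
    # Pooled mean item score, scaled up to the full test length.
    return total_items * score_sum / observations


# Hypothetical design: a 6-item test split into three 2-item subtests,
# each administered to two different examinees.
samples = [
    [[1, 0], [1, 1]],
    [[0, 0], [1, 0]],
    [[1, 1], [1, 1]],
]
estimate = matrix_sample_mean(samples, total_items=6)
```

The estimator treats each examinee-by-item response as one observation, so it relies on the same context-independence assumption the abstract sets out to test.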

19.
Various applications of item response theory often require linking to achieve a common scale for item parameter estimates obtained from different groups. This article used a simulation to examine the relative performance of four different item response theory (IRT) linking procedures in a random groups equating design: concurrent calibration with multiple groups, separate calibration with the Stocking-Lord method, separate calibration with the Haebara method, and proficiency transformation. The simulation conditions used in this article included three sampling designs, two levels of sample size, and two levels of the number of items. In general, the separate calibration procedures performed better than the concurrent calibration and proficiency transformation procedures, even though some inconsistent results were observed across different simulation conditions. Some advantages and disadvantages of the linking procedures are discussed.

20.
Data collected in the 1976-1977 NAEP survey of seventeen-year-olds were used to reanalyze the hypothesis that there are affective determinants of science achievement. Factor and item analysis procedures were used to examine affective and cognitive items from Booklet 4. Eight affective scales and one cognitive achievement scale were identified. Using stepwise multiple regression procedures, the four affective scales of Motivation, Anxiety, Student Choice, and Teacher Support were found to account for the majority of the correlation between the affective determinants and achievement.
