Similar literature
Found 20 similar articles (search time: 21 ms)
1.
The authors performed a Monte Carlo simulation to empirically investigate the robustness and power of 4 methods in testing mean differences for 2 independent groups under conditions in which 2 populations may not demonstrate the same pattern of nonnormality. The approaches considered were the t test, Wilcoxon rank-sum test, Welch-James test with trimmed means and Winsorized variances, and a nonparametric bootstrap test. Results showed that the Wilcoxon rank-sum test and Welch-James test with trimmed means and Winsorized variances were not robust in terms of type I error control when the 2 populations showed different patterns of nonnormality. The nonparametric bootstrap test provided power advantages over the t test. The authors discuss other results from the simulation study and provide recommendations.
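As a rough illustration of what a nonparametric bootstrap test of a two-group mean difference can look like, here is a minimal sketch. The abstract does not say which bootstrap variant the authors used; the mean-shifting scheme, the function name `bootstrap_mean_diff_test`, and its parameters are assumptions made for this example only.

```python
import random
from statistics import mean

def bootstrap_mean_diff_test(x, y, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for H0: the two population means are equal.

    Hypothetical helper for illustration; not the procedure from the study.
    """
    rng = random.Random(seed)
    mx, my = mean(x), mean(y)
    observed = mx - my
    grand = mean(list(x) + list(y))
    # Shift each group to the pooled mean. This imposes H0 while keeping
    # each group's own shape and spread, so the groups may differ in nonnormality.
    x0 = [v - mx + grand for v in x]
    y0 = [v - my + grand for v in y]
    extreme = 0
    for _ in range(n_boot):
        bx = rng.choices(x0, k=len(x0))  # resample with replacement within group
        by = rng.choices(y0, k=len(y0))
        if abs(mean(bx) - mean(by)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (extreme + 1) / (n_boot + 1)
```

Resampling within each group, rather than from the pooled data, is one way to avoid assuming that the two populations share a distribution, which is the situation the abstract studies.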

2.
Uses and consequences of educational testing have increased dramatically in recent years. Professional standards to ensure fair treatment of all affected by test results are more important than ever, but standards for developing and using educational tests are only helpful if they are followed. Test developers and users each have a role to play in meeting testing standards, but these roles have become less distinct as developers work more collaboratively with users, particularly states, to develop tests customized to user specifications. This paper explores various mechanisms for increasing compliance with testing standards through collaborative efforts, including (1) specifying adherence to test standards in contracts with test developers; (2) use of Technical Advisory Committees (TACs) to guide test development processes; (3) requiring compliance with test standards as part of the peer review process for state accountability programs; and (4) oversight by independent organizations.

3.
The interpretability of score comparisons depends on the design and execution of a sound data collection plan and the establishment of linkings between these scores. When comparisons are made between scores from two or more assessments that are built to different specifications and are administered to different populations under different conditions, the validity of the comparisons hinges on untestable assumptions. For example, tests administered across different disability groups or tests administered to different language groups produce scores for which implicit linkings are presumed to hold. Presumed linking makes use of extreme assumptions to produce links between scores on tests in the absence of common test material or equivalent groups of test takers. These presumed linkings lead to dubious interpretations. This article suggests an approach that indirectly assesses the validity of these presumed linkings among scores on assessments that contain neither equivalent groups nor common anchor material.

4.
Conventional null hypothesis testing (NHT) is a very important tool if the ultimate goal is to find a difference or to reject a model. However, the purpose of structural equation modeling (SEM) is to identify a model and use it to account for the relationship among substantive variables. With the setup of NHT, a nonsignificant test statistic does not necessarily imply that the model is correctly specified or the size of misspecification is properly controlled. To overcome this problem, this article proposes to replace NHT by equivalence testing, the goal of which is to endorse a model under a null hypothesis rather than to reject it. Differences and similarities between equivalence testing and NHT are discussed, and new “T-size” terminology is introduced to convey the goodness of the current model under equivalence testing. Adjusted cutoff values of root mean square error of approximation (RMSEA) and comparative fit index (CFI) corresponding to those conventionally used in the literature are obtained to facilitate the understanding of T-size RMSEA and CFI. The single most notable property of equivalence testing is that it allows a researcher to confidently claim that the size of misspecification in the current model is below the T-size RMSEA or CFI, which gives SEM a desirable property to be a scientific methodology. R code for conducting equivalence testing is provided in an appendix.

5.
Simultaneous protocols typically yield poorer stimulus equivalence outcomes than do other protocols commonly used in equivalence research. Two independent groups of three 3-member equivalence sets of stimuli were used in conditional discrimination procedures in two conditions, one using the standard simultaneous protocol and the other using a hybrid simultaneous training and simple-to-complex testing. Participants completed the two conditions in one long session in Experiment 1, but in separate sessions in Experiment 2. The same stimulus sets used in Experiment 1 were randomized for the two conditions in Experiment 2. Overall, accuracy was better with the hybrid than with the standard protocol in both experiments. The equivalence yield was also better under the hybrid than under the standard protocol in each experiment. The results suggest that the order of testing for emergent relations may account for the difficulty often encountered with the standard simultaneous protocol.

6.
As an alternative to adaptation, tests may also be developed simultaneously in multiple languages. Although the items on such tests could vary substantially, scores from these tests may be used to make the same types of decisions about different groups of examinees. The ability to make such decisions is contingent upon setting performance standards for each exam that allow for comparable interpretations of test results. This article describes a standard setting process used for a multilingual high school literacy assessment constructed under these conditions. This methodology was designed to address the specific challenges presented by this testing program including maintaining equivalent expectations for performance across different student populations. The validity evidence collected to support the methodology and results is discussed along with recommendations for future practice.

7.
Recent advances in testing mediation have found that certain resampling methods and tests based on the mathematical distribution of 2 normal random variables substantially outperform the traditional z test. However, these studies have primarily focused only on models with a single mediator and 2 component paths. To address this limitation, a simulation was conducted to evaluate these alternative methods in a more complex path model with multiple mediators and indirect paths with 2 and 3 paths. Methods for testing contrasts of 2 effects were evaluated also. The simulation included 1 exogenous independent variable, 3 mediators and 2 outcomes and varied sample size, number of paths in the mediated effects, test used to evaluate effects, effect sizes for each path, and the value of the contrast. Confidence intervals were used to evaluate the power and Type I error rate of each method, and were examined for coverage and bias. The bias-corrected bootstrap had the least biased confidence intervals, greatest power to detect nonzero effects and contrasts, and the most accurate overall Type I error. All tests had less power to detect 3-path effects and more inaccurate Type I error compared to 2-path effects. Confidence intervals were biased for mediated effects, as found in previous studies. Results for contrasts did not vary greatly by test, although resampling approaches had somewhat greater power and might be preferable because of ease of use and flexibility.

8.
Book reviews     
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.
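The three ingredients listed above can be made concrete with a small simulation. The sketch below is not the article's method: it assumes a classical model (observed score = true score + independent normal error), standard-normal true scores, and a reliability of 0.8, all chosen purely for illustration; the function name and parameters are hypothetical.

```python
import random

def simulate_accuracy_and_consistency(cuts, reliability=0.8, n=20000, seed=1):
    """Estimate classification accuracy and consistency by simulation.

    Assumptions (illustrative only): true scores ~ N(0, 1), observed score =
    true score + normal error, and cut-scores on that same scale.
    """
    rng = random.Random(seed)
    # reliability = var(true) / (var(true) + var(error)) with var(true) = 1
    error_sd = ((1 - reliability) / reliability) ** 0.5

    def classify(score):
        # Category index = number of cut-scores at or below the score.
        return sum(score >= c for c in cuts)

    accurate = consistent = 0
    for _ in range(n):
        true = rng.gauss(0, 1)
        obs1 = true + rng.gauss(0, error_sd)  # first administration
        obs2 = true + rng.gauss(0, error_sd)  # second, parallel administration
        accurate += classify(obs1) == classify(true)     # observed vs true class
        consistent += classify(obs1) == classify(obs2)   # observed vs observed
    return accurate / n, consistent / n
```

Calling it with a single cut, for example `simulate_accuracy_and_consistency([0.0])`, corresponds to the pass–fail case; passing several cuts lets one explore the multi-category case the article discusses.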

9.
We make a distinction between two types of test changes: inevitable deviations from specifications versus planned modifications of specifications. We describe how score equity assessment (SEA) can be used as a tool to assess a critical aspect of construct continuity, the equivalence of scores, whenever planned changes are introduced to testing programs. We also report on how SEA can be used as a quality control check to evaluate whether tests developed to a static set of specifications remain within acceptable tolerance levels with respect to equatability.

10.
This paper presents the results of a simulation study to compare the performance of the Mann-Whitney U test, Student's t test, and the alternate (separate variance) t test for two mutually independent random samples from normal distributions, with both one-tailed and two-tailed alternatives. The estimated probability of a Type I error was controlled (in the sense of being reasonably close to the attainable level) by all three tests when the variances were equal, regardless of the sample sizes. However, it was controlled only by the alternate t test for unequal variances with unequal sample sizes. With equal sample sizes, the probability was controlled by all three tests regardless of the variances. When it was controlled, we also compared the power of these tests and found very little difference. This means that very little power will be lost if the Mann-Whitney U test is used instead of tests that require the assumption of normal distributions.
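To make the difference between the pooled (Student) and separate-variance t statistics concrete, here is a minimal sketch. It assumes the "alternate" test is the usual Welch form with Welch-Satterthwaite degrees of freedom, which the abstract does not state explicitly; the function names are hypothetical.

```python
from statistics import mean, variance

def student_t(x, y):
    """Pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = (pooled * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(x) - mean(y)) / se, n1 + n2 - 2

def welch_t(x, y):
    """Separate-variance t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2  # per-group squared standard errors
    t = (mean(x) - mean(y)) / (a + b) ** 0.5
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ, which is consistent with the abstract's finding that unequal variances cause trouble mainly when sample sizes are also unequal.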

11.
Educational tests are standardized so that all examinees are tested on the same material, under the same testing conditions, and with the same scoring protocols. This uniformity is designed to provide a level “playing field” for all examinees so that the test is “the same” for everyone. Thus, standardization is designed to promote fairness in testing. In practice, the material tested, the conditions under which a test is administered, and the scoring processes, are often too rigid to provide the intended level playing field. For example, standardized testing conditions may interact with personal characteristics of examinees that affect test performance, but are not construct-relevant. Thus, more flexibility in standardization is needed to account for the diversity of experiences, talents, and handicaps of the incredibly heterogeneous populations of examinees we currently assess. Traditional standardization procedures grew out of experimental psychology and psychophysics laboratories where keeping all conditions constant was crucial. Today, accounting for and measuring what is not constant across examinees is crucial to valid construct interpretations. To meet this need I introduce the concept of understandardization, which refers to ensuring sufficient flexibility in standardized testing conditions to yield the most accurate measurement of proficiency for each examinee.

12.
This paper introduces an empirical study testing three kinds of bias in higher education student assessment. All of them are connected to the repetitive use of the same test questions which may facilitate academic cheating. The ‘same test effect’ may appear if two or more groups of students are writing the same test one after the other and, as a result, a statistically significant improvement is detectable in the test scores of the second student group. The ‘revealed sameness effect’ is the impact of informing the students in some way that the test questions will be repeated. The ‘self selection effect’ arises when the students choose their examination turn themselves and this boosts their measured performance. The present study examines the three effects with independent t-tests and linear regression models on samples of 1221, 235, and 201 students (in this order), from four business courses in six academic semesters. The results do not support the ‘same test effect’, but support the ‘revealed sameness effect’ and the ‘self selection effect’.

13.
Researchers are often interested in establishing equivalence of population variances. Traditional difference-based procedures are appropriate to answer questions about differences in some statistic (e.g., variances). However, if a researcher is interested in evaluating the equivalence of population variances, it is more appropriate to use a procedure designed to determine equivalence. A simulation study was used to compare novel equivalence-based tests to traditional variance homogeneity tests under common data conditions. Results demonstrated that traditional difference-based tests assess equality of variances from the wrong perspective and that the proposed Levene-Wellek-Welch test for equivalence of group variances was the best performing test for detecting equivalence. An R function is provided in order to facilitate use of this test for equivalence of population variances.

14.
In two experiments, 90 undergraduates took six tests as part of an educational psychology course. Using a crossover design, students took three tests individually without feedback and then took the same test again, following the process of team-based testing (TBT), in teams in which the members reached consensus for each question and answered until they were correct. Students took the other three tests individually with feedback. All students were individually tested over a portion of this content two weeks later and again after two months. Independent samples t tests revealed that TBT students scored higher when retested two months later than those who took the test individually. Finally, three-fourths of the students reported that they enjoyed TBT more than individual testing. Although TBT requires more class time to administer, it appears to be beneficial for long-term student learning.

15.
Differential weighting of response alternatives and confidence testing have been proposed as ways to assess partial knowledge on multiple-choice tests. 211 students in an educational measurement course took their midterm examination under one of three procedures. Results from those students administered the test under conventional directions provided a baseline for comparing, in terms of reliability and validity, the results from students who took the test under the differential weighting of response alternatives or the confidence testing instructions. Reliability was estimated by the split-half technique. Validity was estimated by correlating midterm test scores with scores on a final examination. This investigation provides some support for the contention that validity can be improved using more sophisticated testing techniques. Suggestions for the conduct of more definitive studies were offered.
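For readers unfamiliar with the split-half technique, a minimal sketch follows. The abstract does not say how the halves were formed or whether a correction was applied; the odd/even item split and the Spearman-Brown step-up used here are common conventions assumed for illustration, and the function names are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length score lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    ssa = sum((u - ma) ** 2 for u in a)
    ssb = sum((v - mb) ** 2 for v in b)
    return cov / (ssa * ssb) ** 0.5

def split_half_reliability(item_scores):
    """Return (half-test correlation, Spearman-Brown corrected reliability).

    item_scores: one row per examinee, one column per item.
    """
    first = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r = pearson_r(first, second)
    # Spearman-Brown: project the half-length correlation to full test length.
    return r, 2 * r / (1 + r)
```

The validity estimate mentioned in the same abstract (correlating midterm scores with final examination scores) could reuse `pearson_r` directly on the two score lists.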

16.
Two parallel versions of a Test of Science Investigation Skills were developed to assess students' application of science investigation skills in biology and physics contexts. Repeated pilot testing and critical appraisal were used to ensure the validity of the tests and their equivalence. Both versions of the test were administered to 112 Year 10 science students. The results indicated a satisfactory level of test reliability, the test set in a physics context proved to be significantly more difficult than the test set in a biology context, and mean scores for male and female students were not significantly different.

17.
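The English II test of the postgraduate entrance examination is a fairly influential exam, and the College English Test Band 4 (CET-4) is currently the largest language proficiency test in China. Based on an analysis of texts and item types, this study compares the characteristics of the reading tests in the two exams. The materials are the reading comprehension sections of three CET-4 papers (December 2010, June 2011, December 2011) and three postgraduate entrance English II papers (2010, 2011, 2012), six papers in total containing 27 reading passages. The results indicate that the English II reading test is more difficult than the CET-4 reading test, while its reliability and validity are lower than those of the CET-4 items.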

18.
Individuals with an aptitude for interpreting spatial information (high mental rotation ability: HMRA) typically master anatomy with more ease, and more quickly, than those with low mental rotation ability (LMRA). This article explores how visual attention differs with time limits on spatial reasoning tests. Participants were sorted into two groups based on their mental rotation ability scores and their eye movements were collected during these tests. Analysis of salience during testing revealed similarities between MRA groups in untimed conditions but significant differences between the groups in the timed one. Question‐by‐question analyses demonstrate that HMRA individuals were more consistent across the two timing conditions (κ = 0.25) than LMRA individuals (κ = 0.013). It is clear that the groups respond to time limits differently and their apprehension of images during spatial problem solving differs significantly. Without time restrictions, salience analysis suggests LMRA individuals attended to similar aspects of the images as HMRA individuals and their test scores rose concomitantly. Under timed conditions, however, LMRA individuals diverge from HMRA attention patterns, adopting inflexible approaches to visual search and attaining lower test scores. With this in mind, anatomical educators may wish to revisit some evaluations and teaching approaches in their own practice. Although examinations need to evaluate understanding of anatomical relationships, the addition of time limits may induce an unforeseen interaction of spatial reasoning and anatomical knowledge. Anat Sci Educ 10: 528–537. © 2017 American Association of Anatomists.
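The κ values reported above are agreement coefficients. The abstract does not define which kappa was used or how responses were coded; the sketch below assumes Cohen's kappa computed over paired categorical codes (for example, one code per question under each timing condition), and the function name is hypothetical.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical codes."""
    n = len(a)
    # Observed proportion of exact agreement.
    observed = sum(u == v for u, v in zip(a, b)) / n
    # Agreement expected by chance from the two marginal distributions.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / n ** 2
    return (observed - expected) / (1 - expected)
```

A value near 0 means agreement no better than chance, and a value of 1 means perfect agreement, which is the scale on which 0.25 and 0.013 should be read.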

19.
Martin. Assessing Writing, 2009, 14(2): 88-115
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.

20.
The comparison of scores from linguistically different tests is a twofold matter: the adaptation of tests and the comparison of scores. These 2 aspects of measurement invariance intersect at the need to guarantee the psychometric equivalence between the original and adapted versions. In this study, the authors examined comparability in 2 stages. First, they conducted a thorough study of progressive factorial invariance through which they defined an anchor test. Second, they defined an observed score-equated function to establish equivalences between the original test and the adapted test; they used a design of common item nonequivalent groups for this purpose.
