期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Training metacognition in the classroom: the influence of incentives and feedback on exam predictions

Tyler M. Miller Lisa Geraci 《Metacognition and Learning》2011,6(3):303-314

In two semester-long studies, we examined whether college students could improve their ability to accurately predict their own exam performance across multiple exams. We tested whether providing concrete feedback and incentives (i.e., extra credit) for accuracy would improve predictions by improving students’ metacognition, or awareness of their own knowledge. Students’ predictions were almost always higher than the grade they earned and this was particularly true for low-performing students. Experiment 1 demonstrated that providing incentives but minimal feedback failed to show improvement in students’ metacognition or performance. However, Experiment 2 showed that when feedback was made more concrete, metacognition improved for low performing students although exam scores did not improve across exams, suggesting that feedback and incentives influenced metacognitive monitoring but not control. 相似文献

2.

Effects of Strategy Training and Incentives on Students’ Performance,Confidence, and Calibration

Antonio P. Gutierrez Gregory Schraw 《Journal of Experimental Education》2015,83(3):386-404

This study examined the effect of strategy instruction and incentives on performance, confidence, and calibration accuracy. Individuals (N = 107) in randomly assigned treatment groups received a multicomponent strategy instruction intervention, financial incentives for high performance, or both. The authors predicted that incentives would improve performance, while strategy instruction would improve performance, confidence, and calibration accuracy as a result of better monitoring and self-regulation of learning. The authors compared pre- and posttest items and 20 new posttest-only items. They found significant effects for strategy training on performance, confidence, and calibration accuracy, as well as the interaction between strategy training and time on calibration accuracy. Incentives improved performance and calibration accuracy, either directly, or through an interaction with strategy training. Implications for future research are discussed. 相似文献

3.

What moderates the accuracy of ease of learning judgments?

Andreas Jemstedt Veit Kubik Fredrik U. Jönsson 《Metacognition and Learning》2017,12(3):337-355

When people begin to study new material, they may first judge how difficult it will be to learn. Surprisingly, these ease of learning (EOL) judgments have received little attention by metacognitive researchers so far. The aim of this study was to systematically investigate how well EOL judgments can predict actual learning, and what factors may moderate their relative accuracy. In three experiments, undergraduate psychology students made EOL judgments on, then studied, and were tested on, lists of word-pairs (e.g., sun – warm). In Experiment 1, the Goodman-Kruskal gamma (G) correlations showed that EOL judgments were accurate (G = .74) when items varied enough in difficulty to allow for proper discrimination between them, but were less accurate (G = .21) when variation was smaller. Furthermore, in Experiment 1 and 3, we showed that the relative accuracy was reliably higher when the EOL judgments were correlated with a binary criterion (i.e., if an item was recalled or not on a test), compared with a trials-to-learn criterion (i.e., how many study and test trials were needed to recall an item). In addition, Experiments 2 and 3 indicate other factors to be non-influential for EOL accuracy, such as the task used to measure the EOL judgments, and whether items were judged sequentially (i.e., one item at a time in isolation from the other items) or simultaneously (i.e., each item was judged while having access to all other items). To conclude, EOL judgments can be highly accurate (G = .74) and may thus be of strategic importance for learning. Further avenues for research are discussed. 相似文献

4.

A Method for Maintaining Scale Stability in the Presence of Test Speededness

James A. Wollack Allan S. Cohen Craig S. Wells 《Journal of Educational Measurement》2003,40(4):307-330

Administering tests under time constraints may result in poorly estimated item parameters, particularly for items at the end of the test (Douglas, Kim, Habing, & Gao, 1998; Oshima, 1994). Bolt, Cohen, and Wollack (2002) developed an item response theory mixture model to identify a latent group of examinees for whom a test is overly speeded, and found that item parameter estimates for end-of-test items in the nonspeeded group were similar to estimates for those same items when administered earlier in the test. In this study, we used the Bolt et al. (2002) method to study the effect of removing speeded examinees on the stability of a score scale over an II-year period. Results indicated that using only the nonspeeded examinees for equating and estimating item parameters provided a more unidimensional scale, smaller effects of item parameter drift (including fewer drifting items), and less scale drift (i.e., bias) and variability (i.e., root mean squared errors) when compared to the total group of examinees. 相似文献

5.

Parameter Recovery in the Graded Response Model Using MULTILOG

Steve P. Reise Jiayuan Yu 《Journal of Educational Measurement》1990,27(2):133-144

The graded response model can be used to describe test-taking behavior when item responses are classified into ordered categories. In this study, parameter recovery in the graded response model was investigated using the MULTILOG computer program under default conditions. Based on items having five response categories, 36 simulated data sets were generated that varied on true θ distribution, true item discrimination distribution, and calibration sample size. The findings suggest, first, the correlations between the true and estimated parameters were consistently greater than 0.85 with sample sizes of at least 500. Second, the root mean square error differences between true and estimated parameters were comparable with results from binary data parameter recovery studies. Of special note was the finding that the calibration sample size had little influence on the recovery of the true ability parameter but did influence item-parameter recovery. Therefore, it appeared that item-parameter estimation error, due to small calibration samples, did not result in poor person-parameter estimation. It was concluded that at least 500 examinees are needed to achieve an adequate calibration under the graded model. 相似文献

6.

Scaffolding cognitive and metacognitive processes in low verbal ability learners: Use of diagrams in computer-based training environments

Cuevas Haydee M. Fiore Stephen M. Oser Randall L. 《Instructional Science》2002,30(6):433-464

This study investigated how instructionalstrategies can support learners' knowledgeacquisition and metacomprehension of complexsystems in a computer-based trainingenvironment, and how individual characteristicsinteract with these manipulations. Incorporating diagrams into the trainingfacilitated performance on measures ofintegrative knowledge (i.e., the integrationand application of task-relevant knowledge),but had no significant effect on measures ofdeclarative knowledge (i.e., mastery of basicfactual knowledge). Diagrams additionallyfacilitated the development of accurate mentalmodels (as measured via a card sorting task)and significantly improved the instructionalefficiency of the training (i.e., higher levelof performance was achieved with less mentaleffort). Finally, diagrams effectivelyscaffolded participants' metacognition,improving their metacomprehension accuracy(i.e., their ability to accurately monitortheir comprehension). These beneficial effectsof diagrams on learners' cognitive andmetacognitive processes were found to bestrongest for participants with low verbalability. Results are discussed in terms ofimplications for the design of adaptivelearning systems. 相似文献

7.

Asymptotic Standard Errors for Item Response Theory True Score Equating of Polytomous Items

下载免费PDF全文

Cheow Cher Wong 《Journal of Educational Measurement》2015,52(1):106-120

Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean‐mean, mean‐sigma, random‐groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model. 相似文献

8.

OBTAINING MAXIMUM LIKELIHOOD TRAIT ESTIMATES FROM NUMBER-CORRECT SCORES FOR THE THREE-PARAMETER LOGISTIC MODEL

WENDY M. YEN 《Journal of Educational Measurement》1984,21(2):93-111

A procedure is presented for obtaining maximum likelihood trait estimates from number-correct (NC) scores for the three-parameter logistic model. The procedure produces an NC score to trait estimate conversion table, which can be used when the hand scoring of tests is desired or when item response pattern (IP) scoring is unacceptable for other (e.g., political) reasons. Simulated data are produced for four 20-item and four 40-item tests of varying difficulties. These data indicate that the NC scoring procedure produces trait estimates that are tau-equivalent to the IP trait estimates (i.e., they are expected to have the same mean for all groups of examinees), but the NC trait estimates have higher standard errors of measurement than IP trait estimates. Data for six real achievement tests verify that the NC trait estimates are quite similar to the IP trait estimates but have higher empirical standard errors than IP trait estimates, particularly for low-scoring examinees. Analyses in the estimated true score metric confirm the conclusions made in the trait metric. 相似文献

9.

The Effects of Dimensionality on Equating the Law School Admission Test

Gregory Camilli Ming-mei Wang Jacqueline Fesq 《Journal of Educational Measurement》1995,32(1):79-96

Using factor analysis, we conducted an assessment of multidimensionality for 6 forms of the Law School Admission Test (LSAT) and found 2 subgroups of items or factors for each of the 6 forms. The main conclusion of the factor analysis component of this study was that the LSAT appears to measure 2 different reasoning abilities: inductive and deductive. The technique of N. J. Dorans & N. M. Kingston (1985) was used to examine the effect of dimensionality on equating. We began by calibrating (with item response theory [IRT] methods) all items on a form to obtain Set I of estimated IRT item parameters. Next, the test was divided into 2 homogeneous subgroups of items, each having been determined to represent a different ability (i.e., inductive or deductive reasoning). The items within these subgroups were then recalibrated separately to obtain item parameter estimates, and then combined into Set II. The estimated item parameters and true-score equating tables for Sets I and II corresponded closely. 相似文献

10.

Effects of Processing Time on Comprehension and Calibration in Print and Digital Mediums

Lauren M. Singer Trakhman Patricia A. Alexander Lisa E. Berkowitz 《Journal of Experimental Education》2019,87(1):101-115

This study explored the effects of processing texts in print or digitally on readers' comprehension, processing time, and calibration. Eighty-six undergraduates read print and digital versions of book excerpts about childhood ailments presented in counterbalanced order. Comprehension was tested at three levels (i.e., main idea, key points, and other relevant information). Direct comparisons between print and digital reading demonstrated a significant advantage for reading in print on students' recall of key points and other relevant information but not the main idea. When processing time was added as a mediator variable, it significantly affected the relation between medium and comprehension for all question levels. In terms of calibration, students read more quickly and judged their performance higher when engaged digitally, although their actual performance was much better when reading in print. Implications of these findings for subsequent research are considered. 相似文献

11.

Completion of partially worked-out examples as a generation strategy for improving monitoring accuracy

Martine Baars Sandra Visser Tamara van Gog Anique de Bruin Fred Paas 《Contemporary educational psychology》2013

Students’ Judgments of Learning (JOLs) are often inaccurate: students often overestimate their future test performance. Because of the consequences that JOL inaccuracy can have for regulating study activities, an important question is how JOL accuracy can be improved. When learning texts, JOL accuracy has been shown to improve through ‘generation strategies’, such as generating keywords, summaries, or concept maps. This study investigated whether JOL accuracy can also be improved by means of a generation strategy (i.e., completing blank steps in the examples) when learning to solve problems through worked example study. Secondary education students of 14–15 years old (cf. USA 9th grade) either studied worked examples or completed partially worked examples and gave JOLs. It was found that completion of worked examples resulted in underestimation of future test performance. It seems that completing partially worked-out examples made students less confident about future performance than studying fully worked examples. However, this did not lead to better regulation of study. 相似文献

12.

An NCME Instructional Module on Using Differential Step Functioning to Refine the Analysis of DIF in Polytomous Items

Randall D. Penfield Karina Gattamorta Ruth A. Childs 《Educational Measurement》2009,28(1):38-49

Traditional methods for examining differential item functioning (DIF) in polytomously scored test items yield a single item‐level index of DIF and thus provide no information concerning which score levels are implicated in the DIF effect. To address this limitation of DIF methodology, the framework of differential step functioning (DSF) has recently been proposed, whereby measurement invariance is examined within each step underlying the polytomous response variable. The examination of DSF can provide valuable information concerning the nature of the DIF effect (i.e., is the DIF an item‐level effect or an effect isolated to specific score levels), the location of the DIF effect (i.e., precisely which score levels are manifesting the DIF effect), and the potential causes of a DIF effect (i.e., what properties of the item stem or task are potentially biasing). This article presents a didactic overview of the DSF framework and provides specific guidance and recommendations on how DSF can be used to enhance the examination of DIF in polytomous items. An example with real testing data is presented to illustrate the comprehensive information provided by a DSF analysis. 相似文献

13.

Teachers' Ability to Estimate Item Difficulty: A Test of the Assumptions in the Angoff Standard Setting Method

James C. Impara Barbara S. Plake 《Journal of Educational Measurement》1998,35(1):69-81

The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed. 相似文献

14.

Differential Item Functioning Effect Size From the Multigroup Confirmatory Factor Analysis for a Meta-Analysis: A Simulation Study

Sung Eun Park Soyeon Ahn Cengiz Zopluoglu 《Educational and psychological measurement》2021,81(1):182

This study presents a new approach to synthesizing differential item functioning (DIF) effect size: First, using correlation matrices from each study, we perform a multigroup confirmatory factor analysis (MGCFA) that examines measurement invariance of a test item between two subgroups (i.e., focal and reference groups). Then we synthesize, across the studies, the differences in the estimated factor loadings between the two subgroups, resulting in a meta-analytic summary of the MGCFA effect sizes (MGCFA-ES). The performance of this new approach was examined using a Monte Carlo simulation, where we created 108 conditions by four factors: (1) three levels of item difficulty, (2) four magnitudes of DIF, (3) three levels of sample size, and (4) three types of correlation matrix (tetrachoric, adjusted Pearson, and Pearson). Results indicate that when MGCFA is fitted to tetrachoric correlation matrices, the meta-analytic summary of the MGCFA-ES performed best in terms of bias and mean square error values, 95% confidence interval coverages, empirical standard errors, Type I error rates, and statistical power; and reasonably well with adjusted Pearson correlation matrices. In addition, when tetrachoric correlation matrices are used, a meta-analytic summary of the MGCFA-ES performed well, particularly, under the condition that a high difficulty item with a large DIF was administered to a large sample size. Our result offers an option for synthesizing the magnitude of DIF on a flagged item across studies in practice. 相似文献

15.

A Multidimensional Scaling Study of College Students' Perception of Test Item Formats

《教育实用测度》2013,26(2):123-136

College students use information about upcoming tests, including the item formats to be used, to guide their study strategies and allocation of effort, but little is known about how students perceive item formats. In this study, college students rated the dissimilarity of pairs of common item formats (true/false, multiple choice, essay, fill-in-the-blank, matching, short answer, analogy, and arrangement). A multidimensional scaling model with individual differences (INDSCAL) was fit to the data of 11 1 students and suggested that they were using two dimensions to distinguish among these formats. One dimension separated supply from selection items, and the formats' positions on the dimension were related to ratings of difficulty, review time allocated, objectivity, and recognition (as opposed to recall) required. The second dimension ordered item formats from those with few options from which to choose (e.g., true/false) or brief responses (e.g., fill-in-the-blank), to those with many options from which to choose (e.g., matching) or long responses (e.g., essay). These student perceptions are likely to mediate the impact of classroom evaluation on student study strategies and allocation of effort. 相似文献

16.

Scaling Performance Assessments: A Comparison of One-Parameter and Two-Parameter Partial Credit Models

Anne R. Fitzpatrick Valerie B. Link Wendy M. Yen George R. Burket Kyoko Ito Robert C. Sykes 《Journal of Educational Measurement》1996,33(3):291-314

In one study, parameters were estimated for constructed-response (CR) items in 8 tests from 4 operational testing programs using the l-parameter and 2- parameter partial credit (IPPC and 2PPC) models. Where multiple-choice (MC) items were present, these models were combined with the 1-parameter and 3-parameter logistic (IPL and 3PL) models, respectively. We found that item fit was better when the 2PPC model was used alone or with the 3PL model. Also, the slopes of the CR and MC items were found to differ substantially. In a second study, item parameter estimates produced using the IPL-IPPC and 3PL-2PPC model combinations were evaluated for fit to simulated data generated using true parameters known to fit one model combination or ttle other. The results suggested that the more flexible 3PL-2PPC model combination would produce better item fit than the IPL-1PPC combination. 相似文献

17.

A Comparison of Item Calibration Procedures in the Presence of Test Speededness

Youngsuk Suh Sun‐Joo Cho James A. Wollack 《Journal of Educational Measurement》2012,49(3):285-311

In the presence of test speededness, the parameter estimates of item response theory models can be poorly estimated due to conditional dependencies among items, particularly for end‐of‐test items (i.e., speeded items). This article conducted a systematic comparison of five‐item calibration procedures—a two‐parameter logistic (2PL) model, a one‐dimensional mixture model, a two‐step strategy (a combination of the one‐dimensional mixture and the 2PL), a two‐dimensional mixture model, and a hybrid model‐–by examining how sample size, percentage of speeded examinees, percentage of missing responses, and way of scoring missing responses (incorrect vs. omitted) affect the item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one‐dimensional mixture model, the two‐step strategy, and the two‐dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses with incorrect scoring. Real data analysis further described the similarities and differences between the five procedures. 相似文献

18.

A Comparison of Six Methods for Combining Multiple IRT Item Parameter Estimates

Robert L. McKinley 《Journal of Educational Measurement》1988,25(3):233-246

Six procedures for combining sets of IRT item parameter estimates obtained from different samples were evaluated using real and simulated response data. In the simulated data analyses, true item and person parameters were used to generate response data for three different-sized samples. Each sample was calibrated separately to obtain three sets of item parameter estimates for each item. The six procedures for combining multiple estimates were each applied, and the results were evaluated by comparing the true and estimated item characteristic curves. For the real data, the two best methods from the simulation data analyses were applied to three different-sized samples and the resulting estimated item characteristic curves were compared to the curves obtained when the three samples were combined and calibrated simultaneously. The results support the use of covariance matrix-weighted averaging and a procedure that involves sample-size-weighted averaging of estimated item characteristic curves at the center of the ability distribution 相似文献

19.

Explaining calibration accuracy in classroom contexts: the effects of incentives, reflection, and explanatory style

Douglas J. Hacker Linda Bol Kamilla Bahbahani 《Metacognition and Learning》2008,3(2):101-121

A 2 × 2 quasi-experimental design was used to investigate the impact of extrinsic incentives and reflection on students’ calibration of exam performance. We further examined the relationships among attributional style, performance, and calibration judgments. Participants were 137 college students enrolled in an educational psychology course. Results differed as a function of exam performance. Higher-performing students were very accurate in their calibration and did not show significant improvements across a semester-length course. Attributional style did not significantly contribute to their calibration judgments. Lower-performing students, however, were less accurate in their calibration, and students in the incentives condition showed significant increases in calibration. Beyond exam scores, attributional style constructs were significant predictors of calibration judgments for these students. The constructs targeting study and social variables accounted for most of the additional explained variance. The qualitative data also revealed differences by performance level in open-ended explanations for calibration judgments. 相似文献

20.

Comparing student science performance between hands-on and traditional item types: A many-facet Rasch analysis

《Studies in Educational Evaluation》2021

This study aimed to compare student science performance between hands-on and traditional item types by investigating the item type effect and the interaction effect between item type and science content domain. In Shanghai, China, 2404 ninth-graders from six urban junior high schools took part in the study. The partial credit many-facet Rasch measurement analysis was used to examine the instrument's quality and investigate the item type effect and the interaction effect. The results showed that the traditional item type was significantly more difficult for participants than the hands-on item type, exhibiting a moderate-to-large effect size. Moderate or large interaction effects of an item type with a specific content domain on student science performance were also detected. Students performed better on some science content domains with a particular item type (either hands-on or traditional). Implications for assessment developers and science instructors were also discussed. 相似文献