Similar literature
 20 similar documents found
1.
As access to and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer‐based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper‐based test (PBT) administration formats. In such situations, evidence that scores obtained from different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we looked at the invariance of test structure and item functioning across test administration modes and across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial‐level invariance across SES subgroups, and moderate support of invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.

2.
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats—the figural response (FR) and constructed response (CR) formats used in a K–12 computerized science test. The item response theory (IRT) information function and confirmatory factor analysis (CFA) were employed to address the research questions. It was found that the FR items were similar to the multiple-choice (MC) items in providing information and efficiency, whereas the CR items provided noticeably more information than the MC items but tended to provide less information per minute. The CFA suggested that the innovative formats and the MC format measure similar constructs. Innovations in computerized item formats are reviewed, and the merits as well as challenges of implementing the innovative formats are discussed.

3.
Multiple traits of language proficiency as well as test method effects were concurrently analyzed to investigate interrelations of construct validity, convergent validity, and discriminant validity using multitrait-multimethod (MTMM) matrices. A total of 585 test takers' scores were derived from the field test of the Pearson Test of English Academic. An MTMM confirmatory factor analysis model was parameterized using 4 traits and 3 assessment methods. The 4 traits included listening, reading, speaking, and integrated skills, while the 3 methods included prescribed multiple-choice responses, constructed responses, and summarized responses. The trait factor loadings were systematically greater than those of methods, providing evidence that the indicators were strongly related to their latent constructs, after adjusting for the method effects. The results showed robust convergent validity, moderate discriminant validity, and insignificant method effects. Implications are discussed.

4.
This study examines the influence of processing strategies, and the associated metacomponents that determine when to apply them, on the construct validity of a verbal reasoning test. Three strategies for solving verbal analogy items were examined: a rule-oriented strategy, an association strategy, and a partial rule strategy. Construct validity was studied in two separate stages: construct representation and nomothetic span. For construct representation, evidence was obtained that all three strategies, and their related metacomponents, are associated with performance on analogy items. For nomothetic span, the current study found that all three strategies contribute to individual differences in verbal reasoning and to the predictive validity of the test. The results of this study also point to the utility of metacomponents as constructs for describing and understanding test performance. Implications of the results for test development and theories of aptitude are elaborated.

5.
The current push for principals to be accountable for student outcomes has led to a renewed interest in the role of leadership in instructional improvement. This article describes the development and validation of a survey that focused on organisational management features that are likely to bring about improvement in instruction. The development of the instrument involved a multistage approach that included: identifying key organisational management features important to instructional leadership and an effective school, based on research and theoretical underpinnings; clearly articulating key constructs; and modifying, adapting and developing items to assess those constructs. After pilot testing, the survey was administered to 216 teachers selected from four high schools in Western Australia. To ensure that the survey was reliable, we used the construct validity framework of Trochim and Donnelly [Trochim W. M. and J. P. Donnelly. 2006. The Research Methods Knowledge Base. 3rd ed. Cincinnati, OH: Atomic Dog]. Analysis of the data provided evidence to support the reliability and validity of the questionnaire in terms of both translation and criterion validity. The development of this survey provides principals with an economical and psychometrically sound tool that can be used as part of a process involving critical self-reflection.

6.
Background: Validity theory has evolved significantly over the past 30 years in response to the increased use of assessments across scientific, social and educational settings. The overarching trajectory of this evolution reflects a shift from a purely quantitative, positivistic approach to a conception of validity reliant on the interpretation of multiple evidence sources integrated into validity arguments. Moreover, within contemporary validity, interpretation has been emphasised as a central process; however, despite this emphasis, there have been few explicit articulations of specific interpretive methodologies applicable to the practice of validation.

Purpose: To link contemporary theoretical foundations in validity to practical methods and structures to help guide the collection and analysis of interpretive validity evidence. By building upon existing validity theory, this paper aims to provide greater clarity on the practice of validation and contribute toward the larger developing framework for the validation of educational assessments.

Source of evidence: An interdisciplinary, integrative review of over 60 research articles and sources related to the theory and practice of educational validation and interpretive inquiry approaches. Sources include literature from the fields of educational assessment and more broadly social scientific research.

Main argument: As assessments in education increasingly aim to measure complex constructs that are value-laden and socially dependent, validity theory must keep pace and evolve in ways that address the inherent complexities associated with contemporary educational assessment. Through this paper, I assert that a greater understanding of interpretive methodologies represents one of the most promising areas for development of validation theory and practice. Specifically, I argue that dialectic, hermeneutic and transgressive forms of inquiry can be integrated within current argument-based structures for the collection, analysis and representation of validity evidence in several useful ways.

Conclusions: Interpretive inquiry processes, namely dialectic, hermeneutic and transgressive forms of interpretation, serve to expand validation practice to include diverse evidences for the generation of multiple-perspective validity arguments. The paper concludes with specific implications for future research and practice within the field of interpretive validity theory.

7.
Environmental field days offer a distinct opportunity to connect students with science and the environment. The literature on field days, informed by research on field trips, provides a framework for best practices. If there are best practices, however, then presence or lack of the practice should have a discernable impact on the outcomes of the field day and should be measurable and some of them should be observable. The Delphi process was modified to ground the theory and to end the Delphi using a subset of the panel in a face‐to‐face meeting to move from consensus to instrument construction. This paper describes the process and shows a summary of the findings of each of the rounds of the Delphi. The use of a Delphi method for determining consensus around the validity of the theory emerging from the research and literature on field days was appropriate and shows that this type of testing against grounded theory can prove useful for building measures to test the emergent theoretical constructs. Modifying the Delphi to focus on the theoretical constructs emerging from the initial analysis allowed the process to function as a true Delphi and eliminated the long process of construct identification.

8.
This paper describes the development and validation of an item bank designed for students to assess their own achievements across an undergraduate-degree programme in seven generic competences (i.e., problem-solving skills, critical-thinking skills, creative-thinking skills, ethical decision-making skills, effective communication skills, social interaction skills and global perspective). The Rasch modelling approach was adopted for instrument development and validation. A total of 425 items were developed. The content validity of these items was examined via six focus group interviews with target students, and the construct validity was verified against data collected from a large student sample (N = 1151). A matrix design was adopted to assemble the items in 26 test forms, which were distributed at random in each administration session. The results demonstrated that the item bank had high reliability and good construct validity. Cross-sectional comparisons of Years 1–4 students revealed patterns of changes over the years. Correlation analyses shed light on the relationships between the constructs. Implications are drawn to inform future efforts to develop the instrument, and suggestions are made regarding ways to use the instrument to enhance the teaching and learning of generic skills.

9.
The aims of this study are to: (a) assess if cognitive self-concept (competence) and affective self-concept in mathematics and science are different constructs, (b) evaluate the construct validity of self-concept in the context of conflation and separation, and (c) test if the relationships among cognitive and affective variables are invariant across gender. The data for this study were obtained from the Trends in International Mathematics and Science Study 2007 database. Data about 2,687 out of 4,099 eighth grade Saudi students were subject to various analyses. The variables used in this study were mathematics and science self-concepts, and mathematics and science subject value as part of the Students Background Questionnaire. The relationships among constructs were examined with the use of SPSS16 and the structural equation modeling software, AMOS16. The results demonstrated that subject value and self-concept were different constructs. Also, the results demonstrated that cognitive and affective self-concepts were independent, but strongly related constructs, and the structure of the construct was clearer when self-concept was separated into cognitive and affective components than when it was conflated. The relationships among cognitive, affective, and subject value in mathematics and science were invariant across gender. However, their relationships with achievement were not invariant across gender.

10.
Entirely predictable examinations are ones for which the questions are known in advance. Some assessments are designed this way, but in public examinations, predictability is subtler. Students familiarise themselves with the requirements broadly: likely topics that will come up, question formats and how to maximise their marks. If students can predict what they have to do, they can memorise performances, such as essays, and restrict their learning to fit only with examination requirements. The danger is that this focus could undermine curriculum aims. Further, examinations that are overly predictable might produce results that do not generalise to other performances or have predictive validity. This paper presents part of a broader project investigating whether the Higher Level Irish Leaving Certificate (LC) examinations were too predictable. Here, the development of a rating scale for students’ views of examination predictability is described. Data were collected from 1002 Irish LC students taking higher level examinations in biology (n = 536), English (n = 749) and geography (n = 387). Students’ views on predictability of the examination could be grouped consistently across subject areas into three factors: valuable learning, predictability and narrowing of the curriculum. Belief that narrowing of the curriculum was a good examination preparation tactic had a negative relationship with examination scores and perceived learning value of examinations was positively associated with students’ scores in biology and English. These findings indicate that the scoring system rewards students who believe they must study the discipline broadly.

11.
The purpose of this article was twofold. The first purpose was to test the validity of the Teachers’ Sense of Self-Efficacy Scale (TSES) in five settings—Canada, Cyprus, Korea, Singapore, and the United States. The second purpose was, by extension, to establish the importance of the teacher self-efficacy construct across diverse teaching conditions. Multi-group confirmatory factor analysis was used to better understand the measurement invariance of the scale across countries, after which the relationship between the TSES, its three factors, and job satisfaction was explored. The TSES showed convincing evidence of reliability and measurement invariance across the five countries, and the relationship between the TSES and job satisfaction was similar across settings. The study provides general evidence that teachers’ self-efficacy is a valid construct across culturally diverse settings and specific evidence that teachers’ self-efficacy showed a similar relationship with teachers’ job satisfaction in five contrasting settings.

12.
Validity evidence based on test content is critical to meaningful interpretation of test scores. Within high-stakes testing and accountability frameworks, content-related validity evidence is typically gathered via alignment studies, with panels of experts providing qualitative judgments on the degree to which test items align with the representative content standards. Various summary statistics are then calculated (e.g., categorical concurrence, balance of representation) to aid in decision-making. In this paper, we propose an alternative approach for gathering content-related validity evidence that capitalizes on the overlap in vocabulary used in test items and the corresponding content standards, which we define as textual congruence. We use a text-based, machine learning model, specifically topic modeling, to identify clusters of related content within the standards. This model then serves as the basis from which items are evaluated. We illustrate our method by building a model from the Next Generation Science Standards, with textual congruence evaluated against items within the Oregon statewide alternate assessment. We discuss the utility of this approach as a source of triangulating and diagnostic information and show how visualizations can be used to evaluate the overall coverage of the content standards across the test items.

13.
The Moral Competence Test (MCT) was designed over 30 years ago to provide a resource for educators interested in conducting cross-cultural studies of moral development and education. Since its origin, it has been translated into at least 30 languages and used in hundreds of studies. However, few studies provide evidence to support the use of the test in the US. The test’s designer identified three criteria for evaluating the construct validity of the test and its primary scores: whether correlations of stage scores reflect a simplex structure, whether ratings follow the theoretical order of stages, and whether the test differentiates preferences and structures of reasoning. We use these criteria and evidence of criterion and content validity to assess the validity of the MCT. We present results from two US samples (n = 772). Results analyzing the test author’s criteria support the semantic validity of the test; however, evidence of criterion validity raises questions about the C-score as a measure of moral competence. After controlling for stage preferences, the C-score was negatively related to democratic attitudes and positively related to dogmatism.

14.
15.
The purposes of this study were to (a) test the hypothesized factor structure of the Student-Teacher Relationship Scale (STRS; Pianta, 2001) for 308 African American (AA) and European American (EA) children using confirmatory factor analysis (CFA) and (b) examine the measurement invariance of the factor structure across AA and EA children. CFA of the hypothesized three-factor model with correlated latent factors did not yield an optimal model fit. Parameter estimates obtained from CFA identified items with low factor loadings and R² values, suggesting that content revision is required for those items on the STRS. Deletion of two items from the scale yielded a good model fit, suggesting that the remaining 26 items reliably and validly measure the constructs for the whole sample. Tests for configural invariance, however, revealed that the underlying constructs may differ for AA and EA groups. Subsequent exploratory factor analyses (EFAs) for AA and EA children were carried out to investigate the comparability of the measurement model of the STRS across the groups. The results of EFAs provided evidence suggesting differential factor models of the STRS across AA and EA groups. This study provides implications for construct validity research and substantive research using the STRS given that the STRS is extensively used in intervention and research in early childhood education.

16.
The aim of this work is to analyze the dimensional structure of the Spanish version of the School and College Ability Test, employed in the process for the identification of students with high intellectual abilities. This test measures verbal and mathematical (or quantitative) abilities at three levels of difficulty: elementary (3rd, 4th, and 5th years in Primary school), intermediate (6th year in Primary school plus the 1st and 2nd years of Compulsory Secondary School or ESO), and advanced (3rd and 4th years of ESO plus the 1st and 2nd years of bachillerato – equivalent to High school). For each level there are two forms, X and Y. The research was undertaken with the results obtained from the application carried out for the validation and norming of the Spanish version of the test, for which a representative sample of students from Navarre at these levels was taken. This study assessed the possible unidimensionality of the test, that is, the simplicity or complexity of its structure, as an essential aspect of construct validity. To this end, results from classic factorial techniques and from non-parametric methods based on item response theory were triangulated.

17.
Although there have been numerous studies investigating the predictive validity of early assessment, observed predictive validity coefficients across studies are not stable. A validity generalization study was conducted in order to answer the question of whether the relationship between early assessment of children and later achievement is generalizable or situation-specific. This study examined 716 predictive correlation coefficients from 44 studies using Hierarchical Linear Modeling (HLM). The findings of this study revealed that predictive validity of early assessment is not generalizable. Additional analyses indicated that predictive validity differs across assessments as a function of test type, specific construct being assessed, length of prediction, and administration procedures. The most impressive finding in this study was the variability of effect sizes across different test administration types. In particular, tests that were scored through ratings were found to be most effective. These findings suggest that instead of addressing a broad predictive validity between a test and a criterion measure, it is necessary to understand early assessment procedures as a whole system by including considerations of various variables related to testing conditions.

18.
Validity is a central principle of assessment relating to the appropriateness of the uses and interpretations of test results. Usually, one of the inferences that we wish to make is that the score reflects the extent of a student’s learning in a given domain. Thus, it is important to establish that the assessment tasks elicit performances that reflect the intended constructs. This research explored the use of three methods for evaluating whether there are threats to validity in relation to the constructs elicited in international A level geography examinations: (a) Rasch analysis; (b) analysis of processes expected and apparent when students answer questions; and (c) qualitative analysis of responses to items identified as potentially problematic. The results provided strong evidence to support validity with regard to the elicitation of constructs although one question part was identified as a threat to validity. Strengths and weaknesses of the methods can be identified.

19.
20.
The construct validity of the Family Involvement Questionnaire–Short Form (FIQ‐SF) was examined in an independent sample of ethnically and linguistically diverse low‐income families (N = 498) enrolled in an urban Head Start program in the Southeast. A series of exploratory and confirmatory factor analyses replicated the three‐factor structure identified in initial validation studies with Northeast samples: home‐school conferencing, home‐based involvement, and school‐based involvement. Findings from multiple group confirmatory factor analyses provided evidence that the three‐factor structure was invariant across family ethnicity. Multivariate analyses of variance also confirmed hypothesized mean differences on FIQ‐SF dimension scores across family demographic variables such as caregiver ethnicity, primary home language, caregiver education, employment, and marital status. Findings replicate and extend prior construct validity evidence to support the use of the FIQ‐SF by early childhood education programs such as Head Start serving diverse families from low‐income backgrounds. Implications for future research, practice, and policy applications in early childhood are discussed.
