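This article introduces the development and application of the Stanford test series in American elementary and secondary education. It focuses on three tests, the Stanford Achievement Test, the Stanford Diagnostic Reading and Mathematics Tests, and the Stanford English Language Proficiency Test, describing the content of each along with the related technical and design principles.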

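This study compares the item quality of manually assembled and automatically assembled HSK (Level 4) test papers (hereafter the manual paper and the automatic paper) in terms of difficulty, reliability, discrimination, and construct validity. The results show that both the automatic and the manual paper have good item quality. In terms of model fit for construct validity, the automatic paper largely avoids testing reading comprehension ability in Writing Part 1, and its fit indices are better than those of the manual paper. The results indicate that computerized automatic test assembly was successful: the automatic paper can accurately measure examinees' ability to use Chinese and can be used in operational examinations.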

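Using structural equation modeling (SEM), this study examines how teacher characteristics affect the academic achievement of junior high school students by analyzing the relationships between the latent and observed variables in the model. With variables evaluated from the teacher's perspective, the factors through which teacher characteristics influence achievement were grouped into 5 latent variables (teacher background, teacher affect, teacher attitude, teacher creativity in teaching, and teacher expectations) and 15 observed variables. A questionnaire survey was conducted, reliability was analyzed with AMOS 17.0 and SPSS 17.0, and the hypothesized model was tested with SEM. Overall, the model fit was good and all 5 research hypotheses were supported. Finally, based on the results, suggestions are offered for improving the academic achievement of junior high school students.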

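Studying effective learning among master's students is important for improving the quality of master's-level education. On the basis of theoretical analysis and empirical research, this paper hypothesizes that the factors influencing master's students' effective learning comprise 5 latent variables (learning motivation, learning attitude, learning environment (atmosphere), learning engagement, and learning gains) and 37 observed variables. SPSS and structural equation modeling (SEM) were used to carry out exploratory factor analysis and confirmatory analysis on the 5 dimensions and 37 indicator items. The results show that the 5 latent variables and 15 observed variables fit well and are sufficient to describe effective learning among master's students. On this basis, targeted, differentiated guidance is proposed for promoting effective learning among master's students.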

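The selection of reading passages, the design of multiple-choice items, and the match between the number of passages and the number of items are basic factors affecting the reliability and validity of reading comprehension tests. Different combinations of passage and item numbers also affect measurement error and reliability differently. Based on operational data from the Chinese Proficiency Test (HSK), this study randomly selected 500 examinees as the sample and used a two-facet random nested design from generalizability theory, s×(i:p), to analyze the sources and structure of error in the HSK reading comprehension test and to examine whether the numbers of passages and items are reasonably matched. The results show that increasing either the number of passages or the number of items improves measurement precision, but increasing the number of passages raises the generalizability coefficient (Eρ²) more effectively than increasing the number of items; the current combination of passage and item numbers in the HSK reading comprehension test satisfies the principles of error control and the requirements for reliability.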
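The comparison between adding passages and adding items can be illustrated with the standard relative-decision formula for an s×(i:p) design, Eρ² = σ²(s) / [σ²(s) + σ²(sp)/n_p + σ²(si:p,e)/(n_p·n_i)]. The sketch below is only an illustration: the function name and the variance components are made-up assumptions, not estimates from the HSK data.

```python
def g_coefficient(var_s, var_sp, var_si_p, n_p, n_i):
    """Generalizability coefficient (relative decisions) for an s x (i:p) design.

    var_s    -- universe-score variance of examinees (s)
    var_sp   -- examinee-by-passage interaction variance
    var_si_p -- residual variance (examinee x item-within-passage, plus error)
    n_p      -- number of passages
    n_i      -- number of items per passage
    """
    relative_error = var_sp / n_p + var_si_p / (n_p * n_i)
    return var_s / (var_s + relative_error)


# Hypothetical variance components, chosen only for illustration.
VAR_S, VAR_SP, VAR_SI_P = 0.02, 0.01, 0.18

baseline = g_coefficient(VAR_S, VAR_SP, VAR_SI_P, n_p=4, n_i=5)        # 20 items
more_passages = g_coefficient(VAR_S, VAR_SP, VAR_SI_P, n_p=8, n_i=5)   # 40 items
more_items = g_coefficient(VAR_S, VAR_SP, VAR_SI_P, n_p=4, n_i=10)     # 40 items

print(f"baseline      : {baseline:.3f}")
print(f"more passages : {more_passages:.3f}")
print(f"more items    : {more_items:.3f}")
```

Both alternatives double the total number of items, but only adding passages shrinks the σ²(sp)/n_p term, which is why it raises Eρ² more whenever the examinee-by-passage variance is nonzero.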

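Preparing for the exam according to the syllabus. Where one curriculum standard coexists with multiple textbook versions, the practice and experience of the curriculum-reform regions show that senior-year history review should be based on the General Senior High School History Curriculum Standards, the Examination Guidelines, and the textbooks. The Curriculum Standards are the programmatic document; the Examination Guidelines are an important guide for review; the history textbooks are the main material for learning history. In sum, the knowledge points tested in the 2008 national and provincial and municipal college entrance examination papers basically correspond to the test points specified in the Examination Guidelines, but a few items went beyond any particular textbook version, reflecting the fact that the examination sets items according to the Curriculum Standards and the Examination Guidelines rather than any one version. Examples are item 1 of the 2008 Hainan paper, item 10 of the Shandong comprehensive liberal arts paper, and item 24 of the Ningxia comprehensive liberal arts paper. Candidates should therefore also compare the different textbook versions during review, draw on each of them, and pay particular attention to the knowledge they share.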

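The politics items in the 2006 college entrance examination comprehensive paper were designed in accordance with the Examination Guidelines. The knowledge structure of the content tested was relatively reasonable, and the proportions of the three knowledge modules within the subject remained basically stable. The item-type structure was the same as in 2005, keeping a reasonable ratio of roughly 5:5 between objective and subjective items, with 12 multiple-choice items in Paper I and 5 items in Paper II. A reduced reading load gave students more time to think. The format of the multiple-choice items changed from mainly item sets to a combination of item sets and stand-alone items, which widened coverage and the sampling ratio within a limited number of items. In the design of individual items, the craft of multiple-choice item writing improved. The paper as a whole tested questions at different levels ("what," "why," and "how") fairly comprehensively, reflecting the requirement of all-round development and improving the reliability of the examination.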

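Testlet items are common in passage-based language tests. Because they may violate the local independence assumption, applying traditional item response theory can lead to a range of errors. After discussing three improved models (the polytomous model, the testlet model, and the bifactor model), this paper uses the testlet model and the independence model to examine and analyze items from a Chinese proficiency test. The results show that, overall, dependence among testlet items in the Chinese proficiency test is not high; the testlet model is suitable for the passage-based listening and passage-based reading items; the independence model and the testlet model give similar estimates of item difficulty but clearly different estimates of discrimination; and the two models agree closely in their estimates of person ability but differ greatly in the standard errors of those estimates.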

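Paperless examinations are examinations taken on a computer: the computer draws items from a pre-built item bank to assemble a paper, each examinee has one machine and one paper, and answers are entered by keyboard in response to the items shown on screen. Computer courses have been examined in this way for many years, and because the format is fair and objective, easy to organize, and quick to score, it is attracting increasing attention. 1. Item classification and the item bank. Items in paperless examinations fall into two broad categories, text items and non-text items. Text items are described in words and are also the most widely used kind in paper-based examinations today; non-text items are multimedia items such as graphic, image, audio, and audiovisual items. Text items are further divided into objective and subjective items: common objective item types are true/false, single-choice, multiple-choice, and select-to-fill items; common subjective item types are term definitions, fill-in-the-blank items, and essay questions. The main fields of each item in the bank are: item number, item text, answer, subject, chapter number, knowledge-point number, knowledge level (generally three levels: recall, comprehension, and analysis/application), item type, and the numbers of the item's related items. Similar items are items whose content is very close; related items are pairs in which the text of item A contains the answer to item B. When assembling a paper, the computer should avoid placing similar or related items on the same paper. 2. Hardware for paperless examinations. The hardware for paperless examinations is the computer, generally a microcomputer network, since a network delivers papers and collects answers more conveniently and quickly than stand-alone machines. Any computer that can run a Chinese-character operating system can be used for paperless examinations with text items; if using one equipped with W...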
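The item-bank record and the rule against placing similar or related items on one paper can be sketched as follows. This is a minimal illustration, not the system the abstract describes: the field names, the `similar_ids` field, the `assemble_paper` function, and the sample bank are all assumptions, and the greedy random draw stands in for whatever assembly strategy the original software used.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: str
    stem: str
    answer: str
    subject: str
    chapter: str
    knowledge_point: str
    level: str        # "recall", "comprehension", or "analysis/application"
    item_type: str    # e.g. "true_false", "single_choice"
    related_ids: set = field(default_factory=set)  # one stem reveals the other's answer
    similar_ids: set = field(default_factory=set)  # near-duplicate content (assumed field)


def assemble_paper(bank, n_items, rng):
    """Greedily draw n_items so that no two chosen items are similar or related."""
    pool = list(bank)
    rng.shuffle(pool)
    chosen, chosen_ids, blocked = [], set(), set()
    for item in pool:
        linked = item.related_ids | item.similar_ids
        # Check both directions, since links may be recorded on only one item.
        if item.item_id in blocked or linked & chosen_ids:
            continue
        chosen.append(item)
        chosen_ids.add(item.item_id)
        blocked |= linked
        if len(chosen) == n_items:
            return chosen
    raise ValueError("could not draw enough non-conflicting items")


def make_item(item_id, related=(), similar=()):
    """Build a placeholder item; only the links matter for this demo."""
    return Item(item_id, f"stem of {item_id}", "A", "computing", "ch1", "kp1",
                "recall", "single_choice", set(related), set(similar))


bank = [
    make_item("Q1", related={"Q2"}),   # link recorded on Q1 only
    make_item("Q2"),
    make_item("Q3", similar={"Q4"}),
    make_item("Q4", similar={"Q3"}),
]

paper = assemble_paper(bank, 2, random.Random(0))
print([item.item_id for item in paper])
```

A greedy draw is simple but can fail on a bank where a valid paper exists, so a production system would likely retry or use a constraint solver.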

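Taking the classical poetry reading items in the National Paper (2018) and the Shanghai Paper (2017) of the college entrance examination in Chinese as examples, this paper explores how classical poetry reading items in large-scale examinations can assess core competencies in the subject. The classical poetry reading items in current college entrance examination papers cannot meet the need to assess and reflect the development of students' core competencies and to promote practical change in teaching and learning. Based on the "new curriculum standards" and assessment theory, classical poetry reading items that promote the development of students' core competencies should be integrative, situated, task-driven, question-triggering, and capable of providing effective feedback.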

Reading comprehension is difficult to measure because it is a multifaceted construct influenced by a variety of cognitive, social and affective variables. There are also many distinct reasons for measuring reading comprehension such as the evaluation of instructional programs, the ordering of students by ability, and the diagnosis of reading difficulties. In this article we suggest that appropriate measures of reading comprehension depend on the fit between the purposes for testing and the properties of individual tests. Three test properties that we identify are statistical: stability of individual differences, consistency of an individual's scores across testing occasions, and sensitivity of the criterion variable to treatment or growth. Three other properties are conceptual in nature and pertain to test validity: nomothetic span, construct representation, and penetration. Each of these test properties and purposes is described, as well as the perils of mismatches among them. Comparative research on reading tests, construction of tests from theories and models, and use of testing portfolios can all improve the effective measurement of reading.

This study examined a theoretical model hypothesizing that reading strategies mediate the effects of intrinsic reading motivation, reading fluency, and vocabulary knowledge on reading comprehension. Using path analytic methods, we tested the direct and indirect effects specified in the hypothesized model in a sample of 1105 fifth-graders. In addition to standardized tests and questionnaires, we administered a performance test to assess students' proficiency in the application of three reading strategies. The overall fit of the model to the data was good. Both cognitive (fluency and vocabulary) and motivational (intrinsic reading motivation) variables had an indirect effect on reading comprehension through their influence on reading strategies. Reading strategies had a unique effect on reading comprehension and partially mediated the effects that cognitive and motivational variables had on fifth-graders' reading achievements.

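To enrich research on the construct validity of tests for English majors, this article uses process analysis in administering a simulated test of the cloze section of the Test for English Majors, Band 4 (TEM-4), to participants, and obtains information and data through retrospective reports and a checklist of reading and test-taking strategies. The results show that the TEM-4 cloze section assesses participants' reading ability at the word and sentence levels well, but some test-taking techniques influence how participants think during the test; the design of the TEM-4 cloze section therefore still needs improvement to achieve higher construct validity.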

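Drawing on the theoretical framework of the component model of reading, Nation's construct of vocabulary size testing, and structural linguists' definition of syntactic knowledge, this study examines how vocabulary size and syntactic knowledge predict scores on a second-language reading comprehension test, and whether this prediction is moderated by second-language proficiency. The results show that participants' vocabulary size and syntactic knowledge are both linearly correlated with their reading comprehension scores; for both high-proficiency and low-proficiency participants, syntactic knowledge predicts reading comprehension scores more strongly than vocabulary size does.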

The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees’ test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model‐data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided best data‐model fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.

Recent investigations challenge the construct validity of sustained silent reading tests. Performance of two groups of post‐secondary students (i.e. struggling and non‐struggling) on a sustained silent reading test and two types of cloze test (i.e. maze and open‐ended) was compared in order to identify the test format that contributes greater variance in reading comprehension. One hundred participants were recruited from students enrolled in a preparatory course for a high‐stakes statewide reading examination. Our results suggest that all three measures have good concurrent validity. There was no evidence that open‐ended cloze performance was more related to verbal ability than any other reading measure. Maze performance did the best job at discriminating between our struggling and non‐struggling readers. Implications for reading comprehension assessment in post‐secondary‐aged adults are discussed.

Conventional methods of differentiating reading disability (RD) caused by deficits in decoding skills or comprehension from poor reading performance caused by inconsistent attention associated with attention-deficit/hyperactivity disorder (ADHD) have produced equivocal results. This study presents a model of differential diagnosis of attentional problems and RD that differs from these conventional approaches. The new diagnostic procedure uses intraindividual differences seen in the performance of at-risk learners on tasks related to reading that vary in their sensitivity to the sustained attention required for successful performance. The hypothesis is that children with inconsistent attention would perform more poorly on tests that require sustained attention, such as listening comprehension, than on tests that are more tolerant of inattention, such as reading comprehension. Such differences would not be seen in the test scores of children who have only RD, because their performance is determined more by the difficulty level of the reading tests than by the degree of sensitivity of the task to attention. The validity of this new model was evaluated by determining the capability of the differences seen in the scores of tests that differ in their sensitivity to sustained attention to predict the degree of inconsistency in sustained attention as measured by a continuous performance test. The data obtained from 39 children who are at risk for RD suggest that this is a viable model.

Using a New Statistical Model for Testlets to Score TOEFL
Standard item response theory (IRT) models fit to examination responses ignore the fact that sets of items (testlets) often are matched with a single common stimulus (e.g., a reading comprehension passage). In this setting, all items given to an examinee are unlikely to be conditionally independent (given examinee proficiency). Models that assume conditional independence will overestimate the precision with which examinee proficiency is measured. Overstatement of precision may lead to inaccurate inferences as well as prematurely ended examinations in which the stopping rule is based on the estimated standard error of examinee proficiency (e.g., an adaptive test). The standard three parameter IRT model was modified to include an additional random effect for items nested within the same testlet (Wainer, Bradlow, & Du, 2000). This parameter, γ, characterizes the amount of local dependence in a testlet.
We fit 86 TOEFL testlets (50 reading comprehension and 36 listening comprehension) with the new model, and obtained a value for the variance of γ for each testlet. We compared the standard parameters (discrimination (a), difficulty (b) and guessing (c)) with what is obtained through traditional modeling. We found that difficulties were well estimated either way, but estimates of both a and c were biased if conditional independence is incorrectly assumed. Of greater import, we found that test information was substantially over-estimated when conditional independence was incorrectly assumed.
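To make the role of γ concrete, the sketch below writes the response probability in one common form of the three-parameter testlet model, P = c + (1 − c)·logistic(a(θ − b − γ)), where γ is the examinee's random effect for the testlet containing the item. The function name and the numeric values are illustrative assumptions, and the exact parameterization used in the paper may differ.

```python
import math


def p_correct(theta, a, b, c, gamma=0.0):
    """Probability of a correct response under a 3PL model with a testlet effect.

    theta -- examinee proficiency
    a, b, c -- item discrimination, difficulty, and guessing parameters
    gamma -- examinee-specific effect for the item's testlet (0 gives the plain 3PL)
    """
    z = a * (theta - b - gamma)
    return c + (1.0 - c) / (1.0 + math.exp(-z))


# Illustrative values only.
standard = p_correct(theta=0.5, a=1.2, b=0.0, c=0.2)
with_dependence = p_correct(theta=0.5, a=1.2, b=0.0, c=0.2, gamma=0.4)

print(f"plain 3PL        : {standard:.3f}")
print(f"with gamma = 0.4 : {with_dependence:.3f}")
```

When the variance of γ within a testlet is zero, every examinee has γ = 0 and the model collapses to the standard 3PL; larger variance means responses within the testlet share more examinee-specific variation than θ alone explains.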


The present study compared the performance of six cognitive diagnostic models (CDMs) to explore inter-skill relationships in a reading comprehension test. To this end, item responses of 21,642 test-takers to a high-stakes reading comprehension test were analyzed. The models were compared in terms of model fit at both test and item levels, classification consistency and accuracy, and proportion of skill mastery profiles. The results showed that the G-DINA performed the best and the C-RUM, NC-RUM, and ACDM showed the closest affinity to the G-DINA. In terms of some criteria, the DINA showed comparable performance to the G-DINA. The test-level results were corroborated by the item-level model comparison, where DINA, DINO, and ACDM variously fit some of the items. The results of the study suggested that relationships among the subskills of reading comprehension might be a combination of compensatory and non-compensatory. Therefore, it is suggested that the choice of the CDM be carried out at item level rather than test level.
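The contrast between non-compensatory and compensatory skill relationships can be seen in the two simplest models named above. In the usual formulation, DINA requires every attribute the item's Q-matrix row calls for, while DINO requires at least one. The sketch below uses hypothetical slip and guess values and a made-up three-attribute item; it is not drawn from the study's data.

```python
def dina_eta(alpha, q):
    """1 if the examinee has mastered ALL attributes the item requires, else 0."""
    return int(all(a == 1 for a, r in zip(alpha, q) if r == 1))


def dino_eta(alpha, q):
    """1 if the examinee has mastered AT LEAST ONE required attribute, else 0."""
    return int(any(a == 1 for a, r in zip(alpha, q) if r == 1))


def p_correct(eta, slip, guess):
    """Success probability: 1 - slip when eta is 1, guess otherwise."""
    return 1.0 - slip if eta == 1 else guess


# Hypothetical item requiring attributes 1 and 2 (not 3).
q_row = (1, 1, 0)
# Hypothetical examinee who has mastered attributes 1 and 3 only.
alpha = (1, 0, 1)
SLIP, GUESS = 0.1, 0.2

p_dina = p_correct(dina_eta(alpha, q_row), SLIP, GUESS)
p_dino = p_correct(dino_eta(alpha, q_row), SLIP, GUESS)

print(f"DINA (non-compensatory): {p_dina:.2f}")
print(f"DINO (compensatory)    : {p_dino:.2f}")
```

Under DINA the missing second attribute drops this examinee to the guessing level, whereas under DINO mastery of the first attribute compensates for it, which is the distinction behind choosing a model item by item.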

The present study is predicated on the logic of interrelated functional information processing components as an approach to understanding reading and its difficulties in preadolescent readers. The structural equation modelling (and its variants) involved these three latent components: (a) orthographic/phonological component, (b) morphological component, and (c) sentence and paragraph comprehension component. These components were subserved by a total of ten measurable tasks, all administered on-line via the microcomputer under laboratory conditions with reaction time measures as indices of mental representation of word knowledge and sentence/paragraph comprehension. The latent dependent component of reading performance was subserved by standardized vocabulary and reading comprehension tests. The total sample consisted of 298 children in grades 4, 5, and 6. Maximum likelihood analyses using LISREL show that the data in general do not disconfirm the proposed model for grade 4 readers. The three-component model, with some variables set free, provides a reasonable fit for the grade 5 data but less claim could be made about the goodness of fit for grade 6. The results show the mutually reinforcing and mutually facilitating effects of multilevels and multicomponents of reading. Word structure and word knowledge are particularly predictive of reading. The present Phase 1 work would be validated in a follow-up of another cohort of readers and would also lead to the systematic training of some of the components with poor readers. This study was supported by a grant (No. 410-86-0048) from the Social Sciences and Humanities Research Council of Canada.

