Similar literature
 20 similar documents found (search time: 46 ms)
1.
To evaluate the effectiveness of information retrieval systems, evaluation programs such as TREC offer a rigorous methodology as well as benchmark collections. Whatever the evaluation collection used, effectiveness is generally considered globally, averaging the results over a set of information needs. As a result, the variability of system performance is hidden, because the similarities and differences from one system to another are averaged out. Moreover, the topics on which a given system succeeds or fails are left unknown. In this paper we propose an approach based on data analysis methods (correspondence analysis and clustering) to discover correlations between systems and to find trends in topic/system correlations. We show that it is possible to cluster topics and systems according to system performance on these topics, some system clusters being better on some topics. Finally, we propose a new method for identifying complementary systems based on their performance, which can be applied, for example, in the case of repeated queries. We define a system profile based on the similarity of the sets of TREC topics on which systems achieve similar levels of performance. We show that this method is effective when using the TREC ad hoc collection.

2.
This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour and the best performance can be reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.

3.
黎楠, 杜永萍, 何明. 《情报工程》, 2015, 1(3): 090-097
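LDA topic models can identify latent topic information in large document collections. This paper proposes a method for building an inventor interest topic model based on LDA: each inventor's patent data are merged so that patent information is partitioned by inventor, turning the standard three-layer document-topic-word LDA model into an inventor-topic-word interest model over patent data. The model discovers inventors' topics, and the similarity between topic distributions in the model is used to make personalized recommendations for inventors. Experiments on a collected real-world patent data set show that the method achieves higher accuracy and better recommendations than the traditional vector space model and hidden Markov model approaches.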
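The recommendation step described above, ranking inventors by the similarity of their topic distributions, can be sketched as follows. This is a minimal illustration only: the abstract does not say which similarity measure the authors use, so cosine similarity is an assumption, and the function names and the three-topic distributions are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def recommend_similar_inventors(target, inventor_topics, top_k=2):
    """Rank the other inventors by topic-distribution similarity to `target`."""
    target_dist = inventor_topics[target]
    scored = [
        (name, cosine_similarity(target_dist, dist))
        for name, dist in inventor_topics.items()
        if name != target
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical inventor-topic distributions (assumed data, not from the paper).
inventor_topics = {
    "inventor_a": [0.7, 0.2, 0.1],
    "inventor_b": [0.6, 0.3, 0.1],
    "inventor_c": [0.1, 0.1, 0.8],
}

recommendations = recommend_similar_inventors("inventor_a", inventor_topics)
```

In this toy data, `inventor_b` ranks ahead of `inventor_c` for `inventor_a` because their distributions concentrate on the same topic. A divergence-based measure could be substituted for cosine similarity without changing the ranking logic.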

4.
Research topics and research communities are not disconnected from each other: communities and topics are interwoven and co-evolving. Yet scientometric evaluations of topics and communities have been conducted independently and synchronically, with researchers often relying on homogeneous units of analysis, such as authors, journals, institutions, or topics. New methods are therefore warranted that examine the dynamic relationship between topics and communities. This paper examines how research topics are mixed and matched in evolving research communities by using a hybrid approach that integrates both topic identification and community detection techniques. Using a data set of information retrieval (IR) publications, two layers of enriched information are constructed and contrasted: one is the communities detected through the topology of the coauthorship network, and the other is the topics of the communities detected through the topic model. We find evidence to support the assumption that IR communities and topics are interwoven and co-evolving, and that topics can be used to understand the dynamics of community structures. We recommend the use of the hybrid approach to study the dynamic interactions of topics and communities.

5.
Traditional pooling-based information retrieval (IR) test collections typically have \(n= 50\)–100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata’s three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While the previous work of Sakai incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported previously by Sakai. Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai: as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.

6.
Exploiting Hierarchy in Text Categorization   Cited by: 4 (self-citations: 3, citations by others: 1)
With the recent dramatic increase in electronic access to documents, text categorization (the task of assigning topics to a given document) has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into meta-topics, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.

7.
Customer magazines blur the boundaries between journalistic reporting and organizational information. On the one hand, customer magazines are intended to communicate the interests, brands, products, and services of an organization. On the other hand, their topics, style, and layout resemble those of journalistic publications, from which readers expect independent and objective reporting. While customer magazines are distributed in high numbers throughout different industries and play an increasingly important role in the media landscape, they have hardly been the focus of researchers to date. It is therefore quite unclear how editorial decisions are made within these publications. This study investigated the relevance of journalistic news factors for topic selection in customer magazines and the extent to which these factors differ from those of journalistic publications. We conducted a quantitative survey of customer magazines’ editors-in-chief in Germany (N = 143). We compared their responses on the relevance of news factors to the findings of a survey of senior journalists. The findings revealed clear differences in the use of news factors between the two groups.

8.
祝娜, 王芳. 《图书情报工作》, 2016, 60(5): 101-109
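[Purpose/significance] Scientific and technological innovation requires rapidly discovering the paths along which key knowledge in a given technical field is derived and evolves, and exploring future trends in knowledge innovation; dynamic visualization of knowledge evolution paths is therefore needed. [Method/process] Starting from topic association and taking the 3D printing field as an example, the study identifies technological innovation topics with LDA, analyzes them in detail by stage, measures the association strength inside and outside topic clusters, and identifies the evolutionary capability and evolution type of topics at different life-cycle stages. [Result/conclusion] Experimental results show that the method constructs time-series-based knowledge evolution paths from the perspective of topic association, enriches the theoretical research methods of knowledge management and informetrics, and in practice helps detect knowledge of technological innovation.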

9.
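[Purpose/significance] Topic ranking is not only a fundamental problem in information retrieval and information organization research but also an important part of library subject services. Ranking the research topics of a discipline effectively helps researchers and research management departments grasp the state of research in the discipline, position research directions accurately, and make research decisions quickly. [Method/process] A priority ranking algorithm for disciplinary research topics is proposed based on trend analysis. First, after topic extraction, each research topic is assigned by research level to one of four subclasses (impoverished topics, hot topics, cold topics, and overheated topics) according to its publication trend and citation trend. Then the topic terms within each subclass are ranked by priority. [Result/conclusion] Experiments in the field of information science show that the proposed priority ranking algorithm presents the development level of research topics in a discipline comprehensively, at fine granularity, and in depth, and that the method offers a new perspective for dynamic intelligence analysis along the time dimension.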

10.
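[Purpose/significance] As an important component of prediction in the science of science, predicting the popularity of disciplinary topics aims to reveal research frontiers and development trends, help scholars find frontier research questions, and support research management agencies in making sound funding decisions. [Research design/methodology] The paper proposes a journal-impact-factor-based indicator of disciplinary topic popularity (TP-JIF) and builds an LSTM-based topic popularity prediction model (TPP-LSTM). Taking LIS data as an example, it extracts and computes popularity sequences for disciplinary topics by time slicing, and examines the model's errors for time series of different lengths. [Conclusions/findings] Compared with RBF-SVM, Linear-SVM, KNN, and Naive Bayesian models, TPP-LSTM effectively captures the characteristics of topic popularity time series; prediction is comparatively good when the time series length is 4 years. [Originality/value] The proposed indicator effectively captures differences in the influence of different academic journals on a discipline and avoids the drawbacks of computing popularity from frequency alone; the prediction model effectively captures how disciplinary topics change over time, reduces prediction errors, and predicts well.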

11.
In this study, MatrixSim, a new method for detecting the evolution paths of research topics based on matrix similarity, was proposed. In the analysis of research topic evolution with the help of co-word networks, in contrast to traditional methods of topic evolution path detection, such as cosine similarity and edge similarity, MatrixSim is based on the local community structure of topic communities in co-word networks and considers the similarity of research topics in both nodes and edges, that is, words and inter-word relations. Using the library and information science field as an example, two sets of experiments were designed for topic similarity detection and subject-specific research topic evolution analysis to evaluate and verify the performance of MatrixSim in detecting the evolution paths of research topics and its validity and feasibility in research topic evolution analysis. The results confirm that MatrixSim performs well in detecting the evolution paths of research topics. It can correlate important research topics, help describe the research development process in scientific fields, reveal the internal evolutionary features of research topics, and thus discover and track the research frontiers in scientific fields. This study provides significant methodological support for researchers conducting prospective research activities.

12.
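[Purpose/significance] To reveal the research directions and evolution of the information science field over the past 15 years, and to understand the dynamic changes in the popularity of research topics. [Method/process] Based on the Dynamic Topic Model, and using six core journals with relatively high impact factors in information science in China and abroad as the data set, the study analyzes the evolution of information science research topics in China and abroad. It compares similarities and differences in topic content and popularity from the macro dimension of topic popularity and the micro perspective of word changes, with the aim of providing a reference for information science research in China. [Result/conclusion] The results show that information science research in China emphasizes practical application, whereas research abroad emphasizes innovation in techniques and methods; the content covered by the same research topic differs markedly across periods, so its popularity changes over time; compared with China, research topics abroad show stronger continuity and progression, with smaller changes in popularity.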

13.
Several studies have reported on metrics for measuring the influence of scientific topics from different perspectives; however, current ranking methods ignore the reinforcing effect of other academic entities on topic influence. In this paper, we developed an effective topic ranking model, 4ERRank, by modeling the influence transfer mechanism among all academic entities in a complex academic network using a four-layer network design that incorporates the strengthening effect of multiple entities on topic influence. The PageRank algorithm is utilized to calculate the initial influence of topics, papers, authors, and journals in a homogeneous network, whereas the HITS algorithm is utilized to express the mutual reinforcement between topics, papers, authors, and journals in a heterogeneous network, iteratively calculating the final topic influence value. Using data from a specific interdisciplinary domain, social media, we applied the 4ERRank model to the 19,527 topics that met the inclusion criteria. The experimental results demonstrate that the 4ERRank model can successfully synthesize the performance of classic co-word metrics and effectively reflect highly cited topics. This study enriches the methodology for assessing topic impact and contributes to the development of future topic-based retrieval and prediction tasks.
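The initial-influence step mentioned above relies on PageRank over a homogeneous network. A minimal power-iteration sketch is given below to make that step concrete. It is not the authors' implementation: the damping factor, the iteration count, the handling of dangling nodes, and the tiny topic graph are all assumptions for illustration, and the HITS-based reinforcement across entity types is not shown.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    graph: dict mapping each node to the list of nodes it links to.
    Every link target must also appear as a key of `graph`.
    Returns a dict mapping each node to its score (scores sum to 1).
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical topic-to-topic link graph (assumed data, not from the paper).
topic_graph = {
    "t1": ["t2", "t3"],
    "t2": ["t3"],
    "t3": ["t1"],
    "t4": ["t3"],
}

topic_scores = pagerank(topic_graph)
most_influential = max(topic_scores, key=topic_scores.get)
```

In this toy graph `t3` receives links from three other topics and ends up with the highest score, while `t4`, which nothing links to, keeps only the teleportation share.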

14.
15.
The objective of this study was to evaluate the HealthInsite topic query technique, which uses a dynamic database search to assign resources to a topic. It is an alternative to the explicit classification technique, which relies on the classification of each resource using a predefined classification scheme. We performed a recall-precision analysis on all topics within the broad topic area of Child Health. Recall and precision errors were checked to determine which part of the information retrieval process was at fault. We then compared the topic query technique with the explicit classification technique. The results show errors or problems at every stage of the information retrieval process. This has initiated a review of all the tools used in the process, from indexing guidelines to the search engine. While many errors could be corrected, there were still features of the explicit classification technique that could not be achieved by the topic query technique. In conclusion, the topic query technique has the advantage of flexibility, but close co-operation between the different information retrieval specialists is needed to get the best results. The HealthInsite topic navigation structure should be regarded as an organized set of predefined searches rather than a full classified listing.

16.
An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical where only a portion of the documents may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 dataset, and (2) tweet retrieval using the TREC Microblog ’11 dataset. Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart due to its advantage of adapting the number of topics, improving results by up to 5.6% of MAP compared to the parametric version.
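The improvements above are reported in MAP (mean average precision). For readers unfamiliar with the measure, the sketch below shows the standard, uninterpolated definition. The function names and the two toy queries are hypothetical; official TREC evaluations use dedicated tooling rather than code like this.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_doc_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two queries (assumed data for illustration).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]

map_score = mean_average_precision(runs)
```

Re-ranking a result list so that relevant documents move toward the top raises average precision for that query, which is the effect the feedback method described above aims for.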

17.
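[Purpose/significance] Approaching the problem from the perspective of knowledge topics, this study builds a comprehensive course knowledge system to address the duplication of knowledge points across courses and the "knowledge island" problem in existing curriculum design and teaching, so that professional knowledge services can be delivered effectively. [Method/process] Taking the core courses of the clinical medicine major as the research object, and based on medical education data such as a medical subject headings thesaurus, electronic textbooks, and electronic lesson plans, the study mines knowledge topics in courses with an LDA model and uses association analysis to reveal fine-grained associations between courses, between knowledge topics, and between courses and knowledge topics, thereby constructing a knowledge topic map of clinical medicine courses. [Result/conclusion] The study constructs a domain knowledge graph from the perspective of the professional curriculum system and knowledge topics, which helps teaching administrators, teachers, and students grasp the professional knowledge system, carry out knowledge-oriented teaching activities, and advance knowledge organization and services in the medical field and the development of smart medical education.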

18.
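[Purpose/significance] Reviewing and summarizing the course of information science in China over the past 20 years is important for understanding how the field has developed, and can provide reference and guidance for subsequent research. [Method/process] Taking the articles published in volumes 1-11 of 《情报学进展》 as the research object, the study uses content analysis to categorize the articles' subject areas and topics and to summarize the characteristics of each subject area; on this basis it predicts development trends of information science in China over a coming period in three respects: theoretical research, paradigms and methods, and applied practice. [Result/conclusion] The analysis finds that basic theory of information science, information resources and their management, and emerging information technologies are the main subject areas of the articles in 《情报学进展》, and that each subject area shows distinct characteristics. In the future, an intelligent information science will integrate multiple disciplines, be oriented toward scientific discovery, serve national economic development and national defense security, and provide intellectual support for building new types of national think tanks.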

19.
20.
[Purpose/significance] This paper proposes methods for identifying core research topics and visualizing their evolution paths, in order to provide a reference for research on topic evolution analysis within a field; this helps reveal the evolution characteristics and development patterns of core topics. [Method/process] The LDA model is used for topic recognition, and multidimensional scaling analysis and visualization techniques are combined to map the LDA topic recognition results to two-dimensional space. A topic similarity algorithm is used to detect associations between topics in adjacent time periods, and a new visual display method is proposed. Cross-evolution paths of different types of research topics are constructed to reveal the dynamic changes of core topics and secondary topics during evolution. [Result/conclusion] Taking the medical and health information field in China as an example, the results show that the core research topics in this field mainly include electronic health records and Internet medical treatment. Among them, core topics such as health management and smart medical treatment show a good development trend.
