Similar Literature
 Found 20 similar documents; search took 453 ms
1.
This paper investigates the effectiveness of using MeSH® in PubMed through its automatic query expansion process, Automatic Term Mapping (ATM). We run Boolean searches based on a collection of 55 topics and about 160,000 MEDLINE® citations used in the 2006 and 2007 TREC Genomics Tracks. For each topic, we first automatically construct a query by selecting keywords from the question. Next, each query is expanded by ATM, which assigns different search tags to terms in the query. Three search tags, [MeSH Terms], [Text Words], and [All Fields], are chosen for study after expansion because they all make use of the MeSH field of indexed MEDLINE citations. Furthermore, we characterize the two different mechanisms by which the MeSH field is used. Retrieval results using MeSH after expansion are compared to those based solely on the words in MEDLINE titles and abstracts. The aggregate retrieval performance is assessed using both F-measure and mean rank precision. Experimental results suggest that query expansion using MeSH in PubMed can generally improve retrieval performance, but the improvement may not affect end PubMed users in realistic situations.
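
The F-measure named in this abstract can be illustrated with a minimal sketch. It assumes the usual set-based definition (a weighted harmonic mean of precision and recall). The abstract does not state the weighting used, so `beta=1.0` is an assumption, and the function name and the citation IDs are hypothetical.

```python
def f_measure(retrieved, relevant, beta=1.0):
    """Set-based F-measure for one topic.

    retrieved: IDs returned by a Boolean search (hypothetical data).
    relevant:  IDs judged relevant for the topic (hypothetical data).
    beta:      recall weight; 1.0 gives the balanced F1 (an assumption).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        # No relevant document retrieved: precision and recall are both 0.
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: 4 citations retrieved, 3 relevant, 2 in common.
score = f_measure(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(round(score, 4))
```

Averaging this score over all topics would give one aggregate figure per search strategy, which is one plausible way to compare MeSH-expanded queries against title-and-abstract-only queries.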

2.
Linked topics in science and technology (LTSTs) can provide new avenues for technological innovation and are a key step in the transition from basic to applied research. This paper proposes a science and technology semantic linkage integration model for discovering LTSTs. In particular, the integrative model fuses the term co-occurrence networks of basic and applied research, which makes the topic networks more complete by enhancing their semantic characteristics. It is found that link prediction can further reinforce the semantic association of topic terms in networks between basic and applied topics. Simple fusion explicitly links the topic terms, which can be used as automatic seed marking for subsequent link prediction to identify implicit links between topic terms. Furthermore, an application to the field of gene-engineered vaccines shows that newly predicted implicit relations can effectively identify LTSTs. The results also show that implicit semantic recognition of LTSTs can be enhanced through simple fusion, while the recognition of LTSTs can be improved through link prediction. Therefore, the proposed model can help experts identify LTSTs that cannot be recognized through simple fusion.

3.
In the patent domain, significant efforts are invested to assist researchers in formulating better queries, preferably via automated query expansion. Currently, automatic query expansion in patent search is mostly limited to computing co-occurring terms for the searchable features of the invention. Additional query terms are extracted automatically from patent documents based on entropy measures. Learning synonyms in the patent domain for automatic query expansion has been a difficult task. No dedicated sources providing synonyms for the patent domain, such as patent-domain-specific lexica or thesauri, are available. In this paper we focus on the highly professional search setting of patent examiners. In particular, we use query logs to learn synonyms for the patent domain. For automatic query expansion, we create term networks based on the query logs specifically for several USPTO patent classes. Experiments show good performance in automatic query expansion using these automatically generated term networks. Specifically, the performance of the learned term networks increases as more query logs become available for a specific US patent class.

4.
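[Purpose/Significance] Accurately identifying research fronts is a strategic need at the national level, and bibliometrics, as a quantitative research method, is widely used for detecting scientific topics and identifying research fronts. [Method/Process] This paper reviews the development and the methodological models of research-front topic detection, introduces the concept of the global micro-level model, and describes in detail the topic creation method used by the SciVal module, including direct-citation clustering of documents, keyword-based topic naming, and the topic prominence algorithm used to select research fronts. It then empirically analyzes the characteristics of the 96,000 topics created by SciVal and of the top 1% of topics selected as research fronts. [Results/Conclusions] The global micro-level model can identify all topics across the whole of science in a single pass, but disciplines differ in how they are represented among research fronts, so topic prominence cannot simply be equated with importance. The number of papers in a topic is moderately correlated with the topic's rank. Automatically extracted keyword terms name and describe topics in terms of subject-field level and distinctiveness. Trend analysis of graphene-related front topics can be used to find key nodes and emerging topics.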

5.
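[Purpose/Significance] Research on topic-based semantic similarity computation in the medical domain is currently scarce and is not yet sufficient to reveal semantic-level relationships between topics. This paper proposes a method for computing semantic similarity between topics, so that relationships between topics can be judged from a semantic perspective, providing a reference for topic novelty assessment and topic association research. [Method/Process] Taking the MeSH thesaurus as the basis for semantic computation, and drawing on an analysis of the thesaurus structure and existing research, the method measures semantic similarity between topics along three dimensions: entry terms, semantic distance, and annotations. An empirical study uses PubMed literature on stem cells from 2011 to 2014. [Results/Conclusions] A set of general validation descriptor pairs confirms the validity of the three proposed dimensions. Computing semantic similarity between topics shows that the relatively novel topic in the stem cell field during 2011-2014 was stem cell research involving minors. Future work should also incorporate statistics-based topic similarity in order to reveal relationships between topics more comprehensively and to discover novel research topics at the semantic level.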

6.
System Performance and Natural Language Expression of Information Needs   Cited by: 1 (self-citations: 0, citations by others: 1)
Consider information retrieval systems that respond to a query (a natural language statement of a topic, an information need) with an ordered list of 1000 documents from the document collection. From the responses to queries that all express the same topic, one can discern how the words associated with a topic result in particular system behavior. From what is discerned from different topics, one can hypothesize abstract topic factors that influence system performance. An example of such a factor is the specificity of the topic's primary keyword. This paper shows that statements about the effect of abstract topic factors on system performance can be supported empirically. A combination of statistical methods is applied to system responses from NIST's Text REtrieval Conference. We analyze each topic using a measure of irrelevant-document exclusion computed for each response and a measure of dissimilarity between relevant-document return orders computed for each pair of responses. We formulate topic factors through graphical comparison of measurements for different topics. Finally, we propose for each topic a four-dimensional summarization that we use to select topic comparisons likely to depict topic factors clearly.

7.
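A key step in automatic text summarization is to determine the main idea of a document and extract the sentences that reflect it. Building on a discussion and analysis of clustering algorithms such as k-means and k-medoids, and taking into account the practical requirements of text summarization and the characteristics of documents themselves, this paper proposes a clustering-based method for extracting key sentences. Experimental results show that, while improving clustering accuracy, the new method avoids missing topics more effectively than other clustering algorithms, reflects the main ideas of the full text more comprehensively, and yields summaries that are both broad in coverage and focused on the key points.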

8.
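[Purpose/Significance] Based on topic association similarity, this paper reveals the processes of topic convergence and variation, identifies interdisciplinary topics and their crossing patterns, and summarizes the evolution trends and evolution path patterns of disciplinary topics. [Method/Process] High-frequency subject terms are obtained from research papers in information science, a co-word matrix of the subject terms is constructed, a network community evolution analysis tool is used to generate a topic evolution network graph for the discipline, and the topic evolution process is analyzed in combination with indicator data. [Results/Conclusions] Overall, although the research topics of information science change repeatedly, the core topics persist throughout. Expansion, contraction, and merging are the most common kinds of change in research topics; splitting is rare; and both birth and death of topics occur. Three specific community evolution trajectories run clearly through the whole period with relatively stable activity, reflecting three classes of core research topics. The evolution paths of these three classes exhibit three evolution patterns: sublimation and absorption, co-integration and renewal, and radiative advancement. The results show that the multi-pattern identification method for topic evolution paths based on topic association can present the forms of disciplinary topic evolution at the macro level and analyze patterns of topic crossing at the micro level. Combining the two can reveal the inheritance or innovation of disciplinary topics and predict the development direction of interdisciplinary topics.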

9.
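[Purpose/Significance] To address the immaturity of existing research on fully automatic weak signal identification, this paper proposes a fully automatic weak signal identification method based on a fused LDA-BERT model. [Method/Process] The unsupervised LDA topic model is used to classify a text dataset into topics. A two-layer filter function over topics and terms is constructed to extract early warning signals from the topic classification results. The weakness of a topic is evaluated with three measures: closeness centrality, topic weight, and topic autocorrelation. Weak signals are then extracted based on the normalized frequency and probability of terms within topics. Finally, the BERT deep learning model is used to expand the context of the weak signals and their similar words at the semantic level. [Results/Conclusions] Taking the resurgence of the epidemic in early January 2021 as an example, the system model is validated on a dataset of social media news from the three months before the outbreak. Experimental results show that the method can effectively detect relevant weak signals and uncover how they gradually strengthen over time. In addition, while achieving fully automatic weak signal identification, the fused model also offers stronger interpretability of results than a single model.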

10.
One important reason for the use of field categorization in bibliometrics is the necessity to make the citation impact of papers published in different scientific fields comparable with each other. Raw citations are normalized by using field-categorization schemes to achieve comparable citation scores. There are different approaches to field categorization available. They can be broadly classified as intellectual and algorithmic approaches. A paper-based algorithmically constructed classification system (ACCS) that is based on citation relations has been proposed. Using a few ACCS field-specific clusters, we investigate the discriminatory power of the ACCS. The micro study focusses on the topic ‘overall water splitting’ and related topics. The first part of the study investigates intellectually whether the ACCS is able to identify papers on overall water splitting reliably and validly. Next, we compare the ACCS with (1) a paper-based intellectual (INSPEC) classification and (2) a journal-based intellectual classification (Web of Science, WoS, subject categories). In the last part of our case study, we compare the average number of citations in selected ACCS clusters (on overall water splitting and related topics) with the average citation count of publications in WoS subject categories related to these clusters. The results of this micro study question the discriminatory power of the ACCS. We recommend larger follow-up studies on broad datasets.

11.
Search engine results are often biased towards a certain aspect of a query or towards a certain meaning for ambiguous query terms. Diversification of search results offers a way to supply the user with a better balanced result set, increasing the probability that a user finds at least one document suiting her information need. In this paper, we present a reranking approach based on minimizing variance of Web search results to improve topic coverage in the top-k results. We investigate two different document representations as the basis for reranking: smoothed language models and topic models derived by latent Dirichlet allocation. To evaluate our approach we selected 240 queries from Wikipedia disambiguation pages. This provides us with ambiguous queries together with a community-generated balanced representation of their (sub)topics. For these queries we crawled two major commercial search engines. In addition, we present a new evaluation strategy based on Kullback-Leibler divergence and Wikipedia. We evaluate this method using the TREC sub-topic evaluation on the one hand, and manually annotated query results on the other hand. Our results show that minimizing variance in search results by reranking relevant pages significantly improves topic coverage in the top-k results with respect to Wikipedia, and gives a good overview of the overall search result. Moreover, latent topic models achieve competitive diversification with significantly less reranking. Finally, our evaluation reveals that our automatic evaluation strategy using Kullback-Leibler divergence correlates well with α-nDCG scores used in manual evaluation efforts.
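
As a rough illustration of the Kullback-Leibler divergence this abstract builds its evaluation on, the sketch below compares a reference distribution over subtopics with the subtopic distribution of a result list. The abstract does not say which direction of the divergence is used or how zero probabilities are handled, so the direction `D(p || q)`, the `eps` floor, the function name, and all numbers are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for two discrete distributions given as dicts.

    p:   reference distribution over subtopics (hypothetical data).
    q:   distribution observed in a result list (hypothetical data).
    eps: floor applied to q to avoid division by zero (an assumption).
    """
    total = 0.0
    for topic in set(p) | set(q):
        p_t = p.get(topic, 0.0)
        if p_t == 0.0:
            # Terms with p(t) = 0 contribute nothing to the sum.
            continue
        q_t = max(q.get(topic, 0.0), eps)
        total += p_t * math.log(p_t / q_t)
    return total


# Hypothetical ambiguous query with two senses.
reference = {"sense_a": 0.5, "sense_b": 0.5}   # balanced reference
biased = {"sense_a": 0.9, "sense_b": 0.1}      # results skewed to one sense
balanced = {"sense_a": 0.5, "sense_b": 0.5}    # results after reranking

print(kl_divergence(reference, biased) > kl_divergence(reference, balanced))
```

Under these assumptions, a lower divergence from the reference indicates a result list whose subtopic mix is closer to the balanced representation.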

12.
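A Research Hotspot Analysis Method Based on Subject Term Frequency and the g-Index   Cited by: 12 (self-citations: 1, citations by others: 11)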
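This paper presents a new method for analyzing research hotspots based on subject term frequency and the g-index. The method simultaneously measures subject term frequency and the quality of the papers on that topic, and it selects the number of subject terms naturally. Taking informetrics as an example, the paper empirically analyzes the relationship between a subject term's g-index and both its frequency and the quality of the papers on that topic. A theoretical derivation proves that the g-index of a subject term is approximately equal to the average number of citations of the papers in the subject term's g-core.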
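
The g-index of a subject term can be sketched as follows, assuming the standard definition: the largest g such that the g most-cited papers carrying the term have at least g² citations in total. How the paper itself collects papers per subject term is not described in the abstract, so the function names and the citation counts below are hypothetical.

```python
def g_index(citations):
    """Largest g such that the top-g papers together have >= g*g citations."""
    ranked = sorted(citations, reverse=True)
    cumulative = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        cumulative += count
        if cumulative >= rank * rank:
            g = rank
    return g


def g_core_mean(citations):
    """Average citations of the papers inside the g-core (top-g papers)."""
    g = g_index(citations)
    if g == 0:
        return 0.0
    ranked = sorted(citations, reverse=True)
    return sum(ranked[:g]) / g


# Hypothetical citation counts of the papers indexed with one subject term.
term_citations = [10, 5, 3, 2, 1]
print(g_index(term_citations))      # 4, since 10+5+3+2 = 20 >= 16 but 21 < 25
print(g_core_mean(term_citations))  # 5.0
```

Computing both values for each subject term makes it possible to check, on real data, how closely the g-index tracks the g-core average that the abstract refers to.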

13.
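This article first introduces the architecture and functions of the Chinese scientific and technical vocabulary system. It then lays out the overall approach to research on automatic term-assignment indexing and implements the corresponding system functions, including format conversion of the indexing knowledge base, algorithm implementation, and system implementation, and it collects a corpus for testing. Finally, it analyzes the results of automatic term-assignment indexing, summarizes the characteristics and shortcomings of the study, and outlines plans for future work.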

14.
An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical, where only a portion of a document may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 dataset, and (2) tweet retrieval using the TREC Microblog ’11 dataset. Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart due to its advantage of adapting the number of topics, improving results by up to 5.6% of MAP compared to the parametric version.

15.
To evaluate Information Retrieval Systems on their effectiveness, evaluation programs such as TREC offer a rigorous methodology as well as benchmark collections. Whatever the evaluation collection used, effectiveness is generally considered globally, averaging the results over a set of information needs. As a result, the variability of system performance is hidden, as the similarities and differences from one system to another are averaged. Moreover, the topics on which a given system succeeds or fails are left unknown. In this paper we propose an approach based on data analysis methods (correspondence analysis and clustering) to discover correlations between systems and to find trends in topic/system correlations. We show that it is possible to cluster topics and systems according to system performance on these topics, some system clusters being better on some topics. Finally, we propose a new method to consider complementary systems based on their performance, which can be applied, for example, in the case of repeated queries. We consider the system profile based on the similarity of the set of TREC topics on which systems achieve similar levels of performance. We show that this method is effective when using the TREC ad hoc collection.

16.
Topic emergence detection aids in pinpointing prominent topics within a given domain, providing practical insights to all interested parties on where to focus limited resources. This paper employs the network-based topic evolution approach to overcome limitations in text-based topic evolution, providing prospective topic emergence prediction capabilities by representing emergent topics by their ancestors. A descendant-aware clustering algorithm is proposed to generate non-exhaustive and overlapping clusters, utilizing the pace of collaborations and structural similarities between topics with iterative edge removal and addition processes. Over 100 datasets, each specific to a research topic, were extracted from the Microsoft Academic Graph dataset for the experiments, where the proposed algorithm consistently outperformed existing clustering algorithms in generating clusters with a higher likelihood of being ancestors to an emergent topic up to three years in the future. Regression-based cluster filtering using five structural cluster features and topic cluster qualities showed that the prediction performance can be enhanced by automatically classifying undesirable clusters from previously known data. The results showed that the proposed algorithm can enhance topic emergence predictions on a wide range of research domains regardless of their maturity, popularity, and magnitude without having access to the data in the predicted year, paving the way for prospective predictions of emergent topics.

17.
Several studies have reported on metrics for measuring the influence of scientific topics from different perspectives; however, current ranking methods ignore the reinforcing effect of other academic entities on topic influence. In this paper, we developed an effective topic ranking model, 4EFRRank, by modeling the influence transfer mechanism among all academic entities in a complex academic network using a four-layer network design that incorporates the strengthening effect of multiple entities on topic influence. The PageRank algorithm is utilized to calculate the initial influence of topics, papers, authors, and journals in a homogeneous network, whereas the HITS algorithm is utilized to express the mutual reinforcement between topics, papers, authors, and journals in a heterogeneous network, iteratively calculating the final topic influence value. Based on a specific interdisciplinary domain, social media data, we applied the 4ERRank model to the 19,527 topics included in the criteria. The experimental results demonstrate that the 4ERRank model can successfully synthesize the performance of classic co-word metrics and effectively reflect high citation topics. This study enriches the methodology for assessing topic impact and contributes to the development of future topic-based retrieval and prediction tasks.
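
The PageRank step mentioned in this abstract can be illustrated with a generic power-iteration sketch. This is textbook PageRank on a single homogeneous network, not the authors' four-layer model or its HITS stage. The damping factor of 0.85, the fixed iteration count, the dangling-node handling, and the toy topic graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on a directed graph.

    links: dict mapping each node to the list of nodes it points to
           (a hypothetical topic-to-topic network here).
    Returns a dict of scores that sum to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its score evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical network: topics a and b both point to c, and c points to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(scores, key=scores.get))
```

In a setting like the one described, scores of this kind would only serve as initial influence values; the mutual reinforcement between topics, papers, authors, and journals is a separate step that this sketch does not attempt.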

18.
Communication Monographs, 2012, 79(4): 471-496
This research stresses the need to examine the relationship between topic avoidance and relational correlates (e.g., satisfaction and emotional closeness) from a message production theoretical perspective. Our approach, strategic topic avoidance, offers additional explanatory capabilities, as the strategies with which interactants in close relationships avoid topics may be associated with perceptions of the relationship (after accounting for topic avoidance frequency). Moreover, relational correlates may also vary by the combination of overall topic avoidance frequency and certain topic avoidance strategies. The current research, therefore, assessed individuals' topic avoidance frequency levels and the frequency of using topic avoidance strategies in relation to satisfaction and closeness across three different relational types (i.e., significant others, mother–young-adult, and father–young-adult relationships). Results suggested that avoiding certain topics, such as current relational concerns, predicted levels of satisfaction and closeness across relationship types; however, cross-relational differences also emerged. Strategies employed to avoid topics accounted for additional variance in satisfaction and closeness for relationships with significant others and mothers but not fathers. Analyses also demonstrated that overall topic avoidance frequency interacted with topic avoidance strategy use.

19.
Observations from a unique investigation of failure analysis of Information Retrieval research engines held in 2003 are presented. The Reliable Information Access Workshop invited seven leading IR research groups to supply both their systems and their experts to an effort to analyze why their systems fail on some topics and whether the failures are due to system flaws, approach flaws, or the topic itself. There were surprising results from this cross-system failure analysis. One is that despite systems retrieving very different documents, the major cause of failure for any particular topic was almost always the same across all systems. Another is that relationships between aspects of a topic are not especially important for state-of-the-art systems; the systems are failing at a much more basic level where the top-retrieved documents are not reflecting some aspect at all. The investigatory framework and the lessons learned can serve as a model for needed future research in this area.

20.
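[Purpose/Significance] Using deep learning techniques, this paper proposes measures that combine temporal and spatial characteristics (speed, coverage, and circuitousness) to quantify the evolution of scholars' research topics, thereby providing a quantitative basis for content-based evaluation of scholars. [Method/Process] A three-dimensional indicator framework is proposed, in which speed reflects the average rate at which an author changes research topics, coverage reflects the breadth of topics covered by an author's research, and circuitousness reflects how winding an author's research path is. An empirical study is conducted on computer science authors in the Microsoft Academic dataset, and the relationship between the three measures of topic evolution and scholars' academic impact and productivity is examined. [Results/Conclusions] The empirical results show that the relationship of coverage to total citations and total publications is monotonically decreasing, which indicates that authors who focus more deeply on specific research topics have higher output and impact. The speed and circuitousness of an author's topic evolution both have an inverted U-shaped relationship, first increasing and then decreasing, with total citations and total publications. The proposed multidimensional indicator framework not only enriches, in theory, scientometrics' understanding of how scholars' research topics shift and evolve and of the underlying mechanisms, but also, combined with deep learning models, offers an approach to solving the problem.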
