Similar literature
20 similar articles found (search time: 140 ms)
1.
The application of word sense disambiguation (WSD) techniques to information retrieval (IR) has yet to provide convincing retrieval results. Major obstacles to effective WSD in IR include coverage and granularity problems of word sense inventories, sparsity of document context, and limited information provided by short queries. In this paper, to alleviate these issues, we propose the construction of latent context models for terms using latent Dirichlet allocation. We propose building one latent context per word, using a well-principled representation of local context based on word features. In particular, context words are weighted using a decaying function according to their distance to the target word, which is learnt from data in an unsupervised manner. The resulting latent features are used to discriminate word contexts, so as to constrain the query’s semantic scope. Consistent and substantial improvements, including on difficult queries, are observed on TREC test collections, and the technique combines well with blind relevance feedback. Compared to traditional topic modeling, WSD and positional indexing techniques, the proposed retrieval model is more effective and scales well on large-scale collections.
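
As a rough illustration of the distance-weighted local context described in item 1, the sketch below weights each context word by its distance from the target word. The exponential form, the `decay` rate, the `window` size, and the name `weighted_context` are all assumptions made for illustration; the paper learns its decay function from data rather than fixing it.

```python
from collections import defaultdict

def weighted_context(tokens, target_index, window=5, decay=0.5):
    """Return {word: weight} for words within `window` positions of the target.

    Hypothetical weighting: weight = decay ** (distance - 1), so adjacent
    words get weight 1.0 and each extra step multiplies the weight by `decay`.
    """
    weights = defaultdict(float)
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue  # the target word is not part of its own context
        distance = abs(i - target_index)
        weights[tokens[i]] += decay ** (distance - 1)
    return dict(weights)

tokens = "the river bank was flooded after the storm".split()
context = weighted_context(tokens, tokens.index("bank"), window=3)
```

Vectors of this kind could then serve as the per-occurrence features fed to a topic model; that step is omitted here.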

2.
We first present in this paper an analytical view of heuristic retrieval constraints which yields simple tests to determine whether a retrieval function satisfies the constraints or not. We then review empirical findings on word frequency distributions and the central role played by burstiness in this context. This leads us to propose a formal definition of burstiness which can be used to characterize probability distributions with respect to this phenomenon. We then introduce the family of information-based IR models, which naturally captures heuristic retrieval constraints when the underlying probability distribution is bursty, and propose a new IR model within this family, based on the log-logistic distribution. The experiments we conduct on several collections illustrate the good behavior of the log-logistic IR model: it significantly outperforms the Jelinek-Mercer and Dirichlet prior language models on most collections we have used, with both short and long queries, in terms of both MAP and precision at 10 documents. It also compares favorably to BM25 and has similar performance to classical DFR models such as InL2 and PL2.

3.
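Implementation and Evaluation of XML Document-Level Retrieval Based on Field-Weighted Term Frequency   Total citations: 1, self-citations: 0, citations by others: 1
Using the BM25F model, XML document-level retrieval with weighted term frequencies over multiple fields (elements) is implemented experimentally on the INEX 04 data set. The structure of an XML document does carry a certain amount of semantic information, and exploiting it can improve retrieval performance. 2 tables. 1 figure. 16 references.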
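
To make the field-weighting idea in item 3 concrete, here is a minimal sketch of a simplified BM25F scorer. It uses a single length-normalization parameter `b` for all fields and a common smoothed IDF; the field names, weights, and statistics below are invented for illustration and are not the settings used in the paper.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one document whose fields are token lists, e.g. {"title": [...]}."""
    score = 0.0
    for term in query_terms:
        # Merge per-field term frequencies into one weighted pseudo-frequency.
        weighted_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            weighted_tf += field_weights[field] * tf / norm
        if weighted_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturate once, after combining fields, as BM25F does.
        score += idf * weighted_tf / (k1 + weighted_tf)
    return score

doc = {"title": ["xml", "retrieval"],
       "body": ["field", "weighted", "retrieval", "of", "xml", "documents"]}
avg_len = {"title": 2.0, "body": 6.0}   # hypothetical collection averages
df = {"xml": 10, "retrieval": 20}       # hypothetical document frequencies
boosted = bm25f_score(["xml"], doc, {"title": 3.0, "body": 1.0}, avg_len, df, 100)
flat = bm25f_score(["xml"], doc, {"title": 1.0, "body": 1.0}, avg_len, df, 100)
```

With the title field weighted three times as heavily as the body, a title match raises the score relative to the unweighted case, which is the effect field weighting is meant to capture.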

4.
Pattern indexing is an attempt at combining standardized and free indexing. In contrast to prevailing indexing methods, notably precoordinated ones, pattern indexing also takes into consideration the terminological and information retrieval habits in certain disciplines of science. It is based on patterns consisting of subject categories reflecting the conceptual and methodological framework of a given discipline. These categories provide structured sets of standardized subject headings. To allow for flexibility and adequacy, these headings may be complemented by free indexing terms. Pattern indexing is intended to mend opaque catalog structures and terminological uncertainties of topical subject headings in common precoordinated indexing practice. Pattern indexing is discussed in the context of literary scholarship.

5.
An Improved Cosine Vector Measure Text Retrieval Model   Total citations: 2, self-citations: 1, citations by others: 1
付永贵 《图书情报工作》2011,55(19):115-119
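To accommodate users' differing requirements for index terms, an improved cosine vector measure method (ICVMM) text retrieval model is proposed. The model divides index terms into main index terms and feature index terms, and revises the initial weights of the feature index terms in texts and queries according to the relevance probability values of those terms in the set of texts relevant to the query. A worked example compares the retrieval efficiency of the traditional cosine vector measure method (TCVMM) text retrieval model with that of the ICVMM model, showing that the results of the ICVMM model are closer to users' needs.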
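
For reference, the traditional cosine measure that item 5 starts from can be sketched as follows. This shows only the standard cosine similarity over sparse term-weight vectors; the ICVMM reweighting of feature index terms is not reproduced, since the abstract does not give its formula. The example terms and weights are invented.

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    shared = set(doc_weights) & set(query_weights)
    dot = sum(doc_weights[t] * query_weights[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)

query = {"retrieval": 1.0, "model": 1.0}
same = cosine_similarity({"retrieval": 2.0, "model": 2.0}, query)
unrelated = cosine_similarity({"library": 1.0}, query)
```

Because cosine similarity ignores vector length, doubling every weight in a document leaves its score unchanged; only a relative change in weights, such as the reweighting the paper proposes, alters the ranking.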

6.
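The biggest changes in the move from document retrieval to information retrieval are, first, the shift in organization from the document unit to the information unit as the basis, and second, the development from manual classification, subject indexing, and author indexing, through machine extraction and indexing of subject terms and free terms, to full-text indexing and even hypertext retrieval. Network technology, hypermedia technology, and intelligent technology are the key drivers of this change. Teaching the subject as a discipline requires creating a practice-oriented teaching method led by CAI courseware and establishing a basic framework for the information retrieval curriculum. 4 references.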

7.
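Research on Information Retrieval Technology Based on the Concept Space Approach   Total citations: 14, self-citations: 0, citations by others: 14
Thesaurus construction is important in information retrieval systems as a way of solving the vocabulary difference problem. The concept space approach uses a computer to build a concept semantic network (a thesaurus) automatically and performs concept retrieval on that basis. Terms serve as the nodes of the semantic network, and the association weight between two terms is computed from their co-occurrence rate in a given document collection; its magnitude represents their similarity. At retrieval time, the system uses artificial intelligence methods to activate terms or concepts related to the entry term and offers the user interactive suggestions for search terms. The approach consists of four stages: collection of documents and object lists, object filtering and automatic indexing, co-occurrence analysis, and associative retrieval. The approach is mostly used in English-language retrieval systems, but it also offers important lessons for information retrieval systems in China.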
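
The co-occurrence analysis and term suggestion stages described in item 7 can be sketched as below. The weight used here, the fraction of documents containing term `a` that also contain term `b`, is a deliberately simple stand-in chosen for illustration; the abstract does not specify the actual weighting function, and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_weights(documents):
    """Map each ordered term pair (a, b) to docs(a and b) / docs(a)."""
    doc_count = Counter()
    pair_count = Counter()
    for tokens in documents:
        terms = set(tokens)  # count each term once per document
        doc_count.update(terms)
        pair_count.update(permutations(terms, 2))
    return {(a, b): n / doc_count[a] for (a, b), n in pair_count.items()}

def suggest_terms(weights, entry_term, top_k=3):
    """Return the terms most strongly associated with the entry term."""
    related = [(b, w) for (a, b), w in weights.items() if a == entry_term]
    related.sort(key=lambda item: (-item[1], item[0]))
    return related[:top_k]

documents = [["retrieval", "index", "thesaurus"],
             ["retrieval", "thesaurus"],
             ["index", "query"]]
weights = cooccurrence_weights(documents)
suggestions = suggest_terms(weights, "retrieval")
```

The weights are asymmetric by design: in this toy collection "query" always appears with "index", but "index" appears with "query" only half the time.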

8.
In this paper, which treats Swedish full text retrieval, the problem of morphological variation of query terms in the document database is studied. The Swedish CLEF 2003 test collection was used, and the effects of combination of indexing strategies with query terms on retrieval effectiveness were studied. Four of the seven tested combinations involved indexing strategies that used normalization, a form of conflation. All of these four combinations employed compound splitting, both during indexing and at query phase. SWETWOL, a morphological analyzer for the Swedish language, was used for normalization and compound splitting. A fifth combination used stemming, while a sixth attempted to group related terms by right hand truncation of query terms. The truncation was performed by a search expert. These six combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database. The truncation combination, the four combinations based on normalization, and the stemming combination all outperformed the baseline. Truncation had the best performance. The main conclusion of the paper is that truncation, normalization and stemming enhanced retrieval effectiveness in comparison to the baseline. Further, normalization and stemming were not far below truncation.

9.
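Enterprise Expert Search Based on Document Weight Merging*   Total citations: 2, self-citations: 0, citations by others: 2
To address the identification and retrieval of expertise among enterprise experts, the document weight merging method is applied to the TREC W3C data set to implement expert search within an enterprise, and is compared with the expert profile method. The results show that, with the same BM25 model, the document weight merging method holds a consistent advantage.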

10.
Vocabulary incompatibilities arise when the terms used to index a document collection are largely unknown, or at least not well-known to the users who eventually search the collection. No matter how comprehensive or well-structured the indexing vocabulary, it is of little use if it is not used effectively in query formulation. This paper demonstrates that techniques for mapping user queries into the controlled indexing vocabulary have the potential to radically improve document retrieval performance. We also show how the use of controlled indexing vocabulary can be employed to achieve performance gains for collection selection. Finally, we demonstrate the potential benefit of combining these two techniques in an interactive retrieval environment. Given a user query, our evaluation approach simulates the human user's choice of terms for query augmentation given a list of controlled vocabulary terms suggested by a system. This strategy lets us evaluate interactive strategies without the need for human subjects.

11.
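Addressing three key problems of personalized search, namely user information collection, dynamic updating of the user information base, and the personalized retrieval algorithm, this paper tentatively proposes an Ajax-based scheme for tracking user behavior, a strategy for dynamically updating the user behavior information base on a per-session basis, and a vector space retrieval model that incorporates user profiles. On this basis, an experimental personalized search engine system is designed and implemented.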

12.
黄名选 《图书情报工作》2011,55(15):110-113
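To address the term mismatch problem in information retrieval systems, a local feedback query expansion algorithm is proposed that is based on association rule mining under a correlation-interest framework. The basic idea of the query expansion, the expansion algorithm model, and the method for computing the weights of expansion terms are discussed. The main feature of the algorithm is that it evaluates association rules with a support-confidence-correlation-interest framework, which avoids generating negatively correlated, spurious, and uninteresting rules and improves the quality of the expansion terms drawn from the association rules. Experimental results show that the algorithm effectively improves information retrieval performance and has high practical value and good prospects for wider adoption.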
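
The rule measures named in item 12 can be illustrated with the standard definitions of support and confidence. The abstract does not define its correlation or interest measures, so the sketch below substitutes lift as a stand-in correlation measure; that substitution, the function name, and the toy data are assumptions.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each transaction is a set of terms, e.g. the terms of one feedback document.
    Lift is used here as an assumed correlation measure: values above 1 indicate
    positive correlation, values below 1 indicate negative correlation.
    """
    n = len(transactions)
    if n == 0:
        return 0.0, 0.0, 0.0
    with_a = sum(1 for t in transactions if antecedent in t)
    with_c = sum(1 for t in transactions if consequent in t)
    with_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = with_both / n
    confidence = with_both / with_a if with_a else 0.0
    lift = confidence / (with_c / n) if with_c else 0.0
    return support, confidence, lift

feedback_docs = [{"retrieval", "query"},
                 {"retrieval", "query", "index"},
                 {"retrieval"},
                 {"index"}]
support, confidence, lift = rule_measures(feedback_docs, "retrieval", "query")
```

A rule whose lift falls below 1 would be the kind of negatively correlated rule the abstract says the framework filters out, so its consequent would not be used as an expansion term.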

13.
宋芸芳 《图书馆建设》2012,(3):52-54,57
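Coordinate indexing selects two or more terms with a formal logical relationship from a thesaurus and combines them, according to specific rules, into a string of indexing terms, in order to meet the need for multi-level, multi-access-point retrieval of documents. Concept coordination is the key step in document indexing. According to the logical relationship between the participating subject terms, concept coordination falls into three basic types: intersecting coordination, qualifying coordination, and linking coordination. In practical coordinate indexing work, catalogers should avoid confusion in the composition of search terms caused by unfamiliarity with a new thesaurus, avoid overly broad, missed, and incorrect indexing caused by errors in converting subject concepts, and avoid off-topic indexing caused by failing to follow the specificity rule, so as to reduce coordinate indexing errors.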

14.
15.
In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model “fits” the user’s information need (query model). On the other hand, in statistics, goodness of fit tests are well established techniques for assessing the assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed in the various documents of the collection, we actually want to know whether the occurrences of the query terms are more frequently distributed by chance in a particular document. This can be quantified by the so-called goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given the null hypothesis that there is no association between the query terms q and the document d irrespective of any chance occurrences, we perform a Chi-square goodness of fit test for assessing this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula is based on ranking the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical OKAPI term frequency weighting formula but falling below KL-divergence from the language modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking about retrieval, offering the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests, modeling the data in various ways by estimating or assigning any arbitrary theoretical distribution for terms.
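
One way to picture the ranking idea in item 15 is the sketch below, which compares the observed frequency of each query term in a document with the frequency expected if the term were spread evenly over the collection. The expected-count model, the choice to count only terms that occur more often than expected, and all names and numbers are assumptions; the abstract does not state the paper's actual retrieval formula.

```python
def chi_square_score(query_terms, doc_tokens, collection_freq, collection_len):
    """Sum (observed - expected)^2 / expected over over-represented query terms."""
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        # Expected count under the null hypothesis of no query-document association.
        expected = doc_len * collection_freq.get(term, 0) / collection_len
        if expected == 0.0:
            continue  # term unseen in the collection, or empty document
        observed = doc_tokens.count(term)
        if observed > expected:  # assumption: reward only over-representation
            score += (observed - expected) ** 2 / expected
    return score

collection_freq = {"retrieval": 100}  # hypothetical collection statistics
collection_len = 10000
on_topic = ["retrieval"] * 3 + ["filler"] * 7
off_topic = ["filler"] * 10
score_on = chi_square_score(["retrieval"], on_topic, collection_freq, collection_len)
score_off = chi_square_score(["retrieval"], off_topic, collection_freq, collection_len)
```

Documents would then be ranked by descending score, so that a document in which the query terms depart most strongly from chance occurrence comes first.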

16.
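To address the failure of traditional information retrieval models to meet user needs well, a knowledge retrieval model based on a domain ontology is proposed after an analysis of existing related research. A domain ontology is constructed, documents are semantically annotated, and concepts are extracted from query requests and semantically expanded, yielding semantic index terms that serve as the knowledge representation of documents and user requests. A computational model of the semantic relationships between terms in the domain ontology is then studied. Taking into account the intrinsic relationship between semantic similarity and semantic relatedness, a correlation coefficient is given to measure the degree of match between the retrieval target and the candidates. Finally, the proposed model is validated, and the results show a significant improvement in retrieval performance.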

17.
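Concept-Compatible Conversion Between Indexing Languages Based on Parallel Document Databases   Total citations: 3, self-citations: 0, citations by others: 3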
张雪英 《情报学报》2005,24(2):161-168
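The RST model proposed in this paper is a concept semantic similarity measurement model based on parallel document databases, suitable for automatic compatible conversion between the concepts of different indexing languages. The RST model is built on rough set theory and some basic theory of indexing languages, and can explicitly define the semantic relationships and degrees of similarity between concepts. Experiments show that the RST model clearly outperforms two existing methods and can be widely applied in integrated retrieval systems for all kinds of electronic document databases and search engines, thereby enabling effective cross-database retrieval with a single indexing language.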

18.
Document length is widely recognized as an important factor for adjusting retrieval systems. Many models tend to favor the retrieval of either short or long documents and, thus, a length-based correction needs to be applied for avoiding any length bias. In Language Modeling for Information Retrieval, smoothing methods are applied to move probability mass from document terms to unseen words, which is often dependent upon document length. In this article, we perform an in-depth study of this behavior, characterized by the document length retrieval trends, of three popular smoothing methods across a number of factors, and its impact on the length of documents retrieved and retrieval performance. First, we theoretically analyze the Jelinek–Mercer, Dirichlet prior and two-stage smoothing strategies and, then, conduct an empirical analysis. In our analysis we show how Dirichlet prior smoothing caters for document length more appropriately than Jelinek–Mercer smoothing which leads to its superior retrieval performance. In a follow-up analysis, we posit that length-based priors can be used to offset any bias in the length retrieval trends stemming from the retrieval formula derived by the smoothing technique. We show that the performance of Jelinek–Mercer smoothing can be significantly improved by using such a prior, which provides a natural and simple alternative to decouple the query and document modeling roles of smoothing. With the analysis of retrieval behavior conducted in this article, it is possible to understand why the Dirichlet Prior smoothing performs better than the Jelinek–Mercer, and why the performance of the Jelinek–Mercer method is improved by including a length-based prior.
Leif Azzopardi
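
The length behavior discussed in item 18 can be seen directly from the two standard smoothing formulas. The sketch below uses the usual textbook forms of Jelinek–Mercer and Dirichlet prior smoothing; the parameter values `lam=0.5` and `mu=1000.0` and the example numbers are illustrative choices, not the settings studied in the article.

```python
def jelinek_mercer(tf, doc_len, p_collection, lam=0.5):
    """P(w|d) = (1 - lam) * tf/|d| + lam * P(w|C)."""
    p_doc = tf / doc_len if doc_len else 0.0
    return (1.0 - lam) * p_doc + lam * p_collection

def dirichlet_prior(tf, doc_len, p_collection, mu=1000.0):
    """P(w|d) = (tf + mu * P(w|C)) / (|d| + mu)."""
    return (tf + mu * p_collection) / (doc_len + mu)

p_c = 0.01  # hypothetical collection probability of an unseen word
# Probability assigned to a word that does not occur in the document (tf = 0).
jm_short = jelinek_mercer(0, 100, p_c)
jm_long = jelinek_mercer(0, 10000, p_c)
dir_short = dirichlet_prior(0, 100, p_c)
dir_long = dirichlet_prior(0, 10000, p_c)
```

Jelinek–Mercer gives an unseen word the same probability whatever the document length, whereas Dirichlet prior smoothing gives it less in longer documents, so the amount of smoothing adapts to how much evidence the document itself provides.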

19.
Operational multimodal information retrieval systems have to deal with increasingly complex document collections and queries that are composed of a large set of textual and non-textual modalities such as ratings, prices, timestamps, geographical coordinates, etc. The resulting combinatorial explosion of modality combinations makes it intractable to treat each modality individually and to obtain suitable training data. As a consequence, instead of finding and training new models for each individual modality or combination of modalities, it is crucial to establish unified models, and fuse their outputs in a robust way. Since the most popular weighting schemes for textual retrieval have in the past generalized well to many retrieval tasks, we demonstrate how they can be adapted to be used with non-textual modalities, which is a first step towards finding such a unified model. We demonstrate that the popular weighting scheme BM25 is suitable to be used for multimodal IR systems and analyze the underlying assumptions of the BM25 formula with respect to merging modalities under the so-called raw-score merging hypothesis, which requires no training. We establish a multimodal baseline for two multimodal test collections, show how modalities differ with respect to their contribution to relevance and the difficulty of treating modalities with overlapping information. Our experiments demonstrate that our multimodal baseline with no training achieves a significantly higher retrieval effectiveness than using just the textual modality for the social book search 2016 collection and lies in the range of a trained multimodal approach using the optimal linear combination of the modality scores.

20.
杨秀丹  李皓 《图书情报工作》2012,(19):95-100,127
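A field study of user context is conducted on a physics information retrieval system and, drawing on the cognitive viewpoint in information science, the cognitive elements of information retrieval systems are analyzed. On this basis, a cognitive information retrieval system model is designed, mainly by adding cognitive elements at the information indexing stage and at the information retrieval and matching stage. Finally, the construction process and the components of the cognitive information retrieval system model are described.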
