首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We compare support vector machines (SVMs) to Rocchio, Ide regular and Ide dec-hi algorithms in information retrieval (IR) of text documents using relevancy feedback. It is assumed a preliminary search finds a set of documents that the user marks as relevant or not and then feedback iterations commence. Particular attention is paid to IR searches where the number of relevant documents in the database is low and the preliminary set of documents used to start the search has few relevant documents. Experiments show that if inverse document frequency (IDF) weighting is not used because one is unwilling to pay the time penalty needed to obtain these features, then SVMs are better whether using term-frequency (TF) or binary weighting. SVM performance is marginally better than Ide dec-hi if TF-IDF weighting is used and there is a reasonable number of relevant documents found in the preliminary search. If the preliminary search is so poor that one has to search through many documents to find at least one relevant document, then SVM is preferred.  相似文献   

2.
This paper proposes a method to improve retrieval performance of the vector space model (VSM) in part by utilizing user-supplied information of those documents that are relevant to the query in question. In addition to the user's relevance feedback information, information such as original document similarities is incorporated into the retrieval model, which is built by using a sequence of linear transformations. High-dimensional and sparse vectors are then reduced by singular value decomposition (SVD) and transformed into a low-dimensional vector space, namely the space representing the latent semantic meanings of words. The method has been tested with two test collections, the Medline collection and the Cranfield collection. In order to train the model, multiple partitions are created for each collection. Improvement of average precision of the averages over all partitions, compared with the latent semantic indexing (LSI) model, are 20.57% (Medline) and 22.23% (Cranfield) for the two training data sets, and 0.47% (Medline) and 4.78% (Cranfield) for the test data, respectively. The proposed method provides an approach that makes it possible to preserve user-supplied relevance information for the long term in the system in order to use it later.  相似文献   

3.
陈立华 《现代情报》2010,30(3):26-28,31
潜在语义分析是自然语言使用于情报检索系统的理论基础,以此理论建构的空间向量模型是评判检索系统性能优良与否的知识工具。阐述了潜在语义标引(LSI)的基本内容、LSI下影响自然语言检索查准率的因素及向量空间模型检索软件的运行机制。此评述对网络化的情报检索技术的发展起到了一定的参考作用。  相似文献   

4.
马巍 《情报科学》2006,24(7):1066-1068
本文介绍了用以词为基础的概念学习法来自动扩展提问式的算法,该算法通过学习出现在当前提问中的概念描述词来逐词扩展提问。实验表明,与传统的向量空间检索模型及相关反馈算法相比,本算法能大大提高查全率和查准率。该方法可用于数字图书馆和WWW等的检索中。  相似文献   

5.
Typical pseudo-relevance feedback methods assume the top-retrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method to select novel pseudo-relevant documents based on Lavrenko’s relevance model approach. The main idea is to use overlapping clusters to find dominant documents for the initial retrieval set, and to repeatedly use these documents to emphasize the core topics of a query.  相似文献   

6.
7.
The relevance feedback process uses information obtained from a user about a set of initially retrieved documents to improve subsequent search formulations and retrieval performance. In extended Boolean models, the relevance feedback implies not only that new query terms must be identified and re-weighted, but also that the terms must be connected with Boolean And/Or operators properly. Salton et al. proposed a relevance feedback method, called DNF (disjunctive normal form) method, for a well established extended Boolean model. However, this method mainly focuses on generating Boolean queries but does not concern about re-weighting query terms. Also, this method has some problems in generating reformulated Boolean queries. In this study, we investigate the problems of the DNF method and propose a relevance feedback method using hierarchical clustering techniques to solve those problems. We also propose a neural network model in which the term weights used in extended Boolean queries can be adjusted by the users’ relevance feedbacks.  相似文献   

8.
Contextual document clustering is a novel approach which uses information theoretic measures to cluster semantically related documents bound together by an implicit set of concepts or themes of narrow specificity. It facilitates cluster-based retrieval by assessing the similarity between a query and the cluster themes’ probability distribution. In this paper, we assess a relevance feedback mechanism, based on query refinement, that modifies the query’s probability distribution using a small number of documents that have been judged relevant to the query. We demonstrate that by providing only one relevance judgment, a performance improvement of 33% was obtained.  相似文献   

9.
It has been known that retrieval effectiveness can be significantly improved by combining multiple evidence from different query or document representations, or multiple retrieval techniques. In this paper, we combine multiple evidence from different relevance feedback methods, and investigate various aspects of the combination. We first generate multiple query vectors for a given information problem in a fully automatic way by expanding an initial query vector with various relevance feedback methods. We then perform retrieval runs for the multiple query vectors, and combine the retrieval results. Experimental results show that combining the evidence of different relevance feedback methods can lead to substantial improvements of retrieval effectiveness.  相似文献   

10.
Latent Semantic Indexing (LSI) uses the singular value decomposition to reduce noisy dimensions and improve the performance of text retrieval systems. Preliminary results have shown modest improvements in retrieval accuracy and recall, but these have mainly explored small collections. In this paper we investigate text retrieval on a larger document collection (TREC) and focus on distribution of word norm (magnitude). Our results indicate the inadequacy of word representations in LSI space on large collections. We emphasize the query expansion interpretation of LSI and propose an LSI term normalization that achieves better performance on larger collections.  相似文献   

11.
Information filtering (IF) systems usually filter data items by correlating a set of terms representing the user’s interest (a user profile) with similar sets of terms representing the data items. Many techniques can be employed for constructing user profiles automatically, but they usually yield large sets of term. Various dimensionality-reduction techniques can be applied in order to reduce the number of terms in a user profile. We describe a new terms selection technique including a dimensionality-reduction mechanism which is based on the analysis of a trained artificial neural network (ANN) model. Its novel feature is the identification of an optimal set of terms that can classify correctly data items that are relevant to a user. The proposed technique was compared with the classical Rocchio algorithm. We found that when using all the distinct terms in the training set to train an ANN, the Rocchio algorithm outperforms the ANN based filtering system, but after applying the new dimensionality-reduction technique, leaving only an optimal set of terms, the improved ANN technique outperformed both the original ANN and the Rocchio algorithm.  相似文献   

12.
Interdocument similarities are the fundamental information source required in cluster-based retrieval, which is an advanced retrieval approach that significantly improves performance during information retrieval (IR). An effective similarity metric is query-sensitive similarity, which was introduced by Tombros and Rijsbergen as method to more directly satisfy the cluster hypothesis that forms the basis of cluster-based retrieval. Although this method is reported to be effective, existing applications of query-specific similarity are still limited to vector space models wherein there is no connection to probabilistic approaches. We suggest a probabilistic framework that defines query-sensitive similarity based on probabilistic co-relevance, where the similarity between two documents is proportional to the probability that they are both co-relevant to a specific given query. We further simplify the proposed co-relevance-based similarity by decomposing it into two separate relevance models. We then formulate all the requisite components for the proposed similarity metric in terms of scoring functions used by language modeling methods. Experimental results obtained using standard TREC test collections consistently showed that the proposed query-sensitive similarity measure performs better than term-based similarity and existing query-sensitive similarity in the context of Voorhees’ nearest neighbor test (NNT).  相似文献   

13.
Recent studies suggest that significant improvement in information retrieval performance can be achieved by combining multiple representations of an information need. The paper presents a genetic approach that combines the results from multiple query evaluations. The genetic algorithm aims to optimise the overall relevance estimate by exploring different directions of the document space. We investigate ways to improve the effectiveness of the genetic exploration by combining appropriate techniques and heuristics known in genetic theory or in the IR field. Indeed, the approach uses a niching technique to solve the relevance multimodality problem, a relevance feedback technique to perform genetic transformations on query formulations and evolution heuristics in order to improve the convergence conditions of the genetic process. The effectiveness of the global approach is demonstrated by comparing the retrieval results obtained by both genetic multiple query evaluation and classical single query evaluation performed on a subset of TREC-4 using the Mercure IRS. Moreover, experimental results show the positive effect of the various techniques integrated to our genetic algorithm model.  相似文献   

14.
提出了一种结合颜色和形状特征的图像检索方法,针对传统基于内容图像检索不能很好满足用户需求的问题,提出了一种基于支持向量机(SVM)的相关反馈算法来捕捉用户的检索意图。实验结果证明,算法能发挥用户在检索过程中的作用,具有较好检索性能。  相似文献   

15.
Searching for relevant material that satisfies the information need of a user, within a large document collection is a critical activity for web search engines. Query Expansion techniques are widely used by search engines for the disambiguation of user’s information need and for improving the information retrieval (IR) performance. Knowledge-based, corpus-based and relevance feedback, are the main QE techniques, that employ different approaches for expanding the user query with synonyms of the search terms (word synonymy) in order to bring more relevant documents and for filtering documents that contain search terms but with a different meaning (also known as word polysemy problem) than the user intended. This work, surveys existing query expansion techniques, highlights their strengths and limitations and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that the application of knowledge or corpus-based query expansion techniques on the results of the relevance feedback step improves the information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.  相似文献   

16.
搜索引擎中相关性反馈技术   总被引:10,自引:1,他引:10  
As an important component of search engines, the relevance feedback system is very effective for improving the performance of search engines. This paper firstly reviews the history of relevance feedback technology in the past 30 years, then introduces 2 major methods in relevance feedback, i. e. term reweighting and query expansion, and discusses the relevance feedback technologies based on vector space model and statistical ranking model.  相似文献   

17.
In this paper, a new robust relevance model is proposed that can be applied to both pseudo and true relevance feedback in the language-modeling framework for document retrieval. There are at least three main differences between our new relevance model and other relevance models. The proposed model brings back the original query into the relevance model by treating it as a short, special document, in addition to a number of top-ranked documents returned from the first round retrieval for pseudo feedback, or a number of relevant documents for true relevance feedback. Second, instead of using a uniform prior as in the original relevance model proposed by Lavrenko and Croft, documents are assigned with different priors according to their lengths (in terms) and ranks in the first round retrieval. Third, the probability of a term in the relevance model is further adjusted by its probability in a background language model. In both pseudo and true relevance cases, we have compared the performance of our model to that of the two baselines: the original relevance model and a linear combination model. Our experimental results show that the proposed new model outperforms both of the two baselines in terms of mean average precision.  相似文献   

18.
Determining requirements when searching for and retrieving relevant information suited to a user’s needs has become increasingly important and difficult, partly due to the explosive growth of electronic documents. The vector space model (VSM) is a popular method in retrieval procedures. However, the weakness in traditional VSM is that the indexing vocabulary changes whenever changes occur in the document set, or the indexing vocabulary selection algorithms, or parameters of the algorithms, or if wording evolution occurs. The major objective of this research is to design a method to solve the afore-mentioned problems for patent retrieval. The proposed method utilizes the special characteristics of the patent documents, the International Patent Classification (IPC) codes, to generate the indexing vocabulary for presenting all the patent documents. The advantage of the generated indexing vocabulary is that it remains unchanged, even if the document sets, selection algorithms, and parameters are changed, or if wording evolution occurs. Comparison of the proposed method with two traditional methods (entropy and chi-square) in manual and automatic evaluations is presented to verify the feasibility and validity. The results also indicate that the IPC-based indexing vocabulary selection method achieves a higher accuracy and is more satisfactory.  相似文献   

19.
基于潜在语义空间的文本检索问题研究   总被引:6,自引:0,他引:6  
本文针对潜在语义空间模型下文本检索存在的问题进行了研究;对LSI/SVD方法的理论及问题进行探讨,对文本一特征项矩阵加权方式进行了改进,对文本的聚类提出了k—means截尾算法,对原始向量空间降秩后的语义空间维数的确定提出了解决方法。  相似文献   

20.
Two probabilistic approaches to cross-lingual retrieval are in wide use today, those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries––one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus.We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号