首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This paper describes our novel retrieval model that is based on contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes into account of the document contexts instead of implicitly using the document contexts to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques as found in language models to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of the parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show that whether the model obeys the probability ranking principle. Our model is promising as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with aggregate relevance principle were effective in combining the estimated preferences, and (b) that estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.  相似文献   

2.
This paper presents an investigation about how to automatically formulate effective queries using full or partial relevance information (i.e., the terms that are in relevant documents) in the context of relevance feedback (RF). The effects of adding relevance information in the RF environment are studied via controlled experiments. The conditions of these controlled experiments are formalized into a set of assumptions that form the framework of our study. This framework is called idealized relevance feedback (IRF) framework. In our IRF settings, we confirm the previous findings of relevance feedback studies. In addition, our experiments show that better retrieval effectiveness can be obtained when (i) we normalize the term weights by their ranks, (ii) we select weighted terms in the top K retrieved documents, (iii) we include terms in the initial title queries, and (iv) we use the best query sizes for each topic instead of the average best query size where they produce at most five percentage points improvement in the mean average precision (MAP) value. We have also achieved a new level of retrieval effectiveness which is about 55–60% MAP instead of 40+% in the previous findings. This new level of retrieval effectiveness was found to be similar to a level using a TREC ad hoc test collection that is about double the number of documents in the TREC-3 test collection used in previous works.  相似文献   

3.
In ad hoc querying of document collections, current approaches to ranking primarily rely on identifying the documents that contain the query terms. Methods such as query expansion, based on thesaural information or automatic feedback, are used to add further terms, and can yield significant though usually small gains in effectiveness. Another approach to adding terms, which we investigate in this paper, is to use natural language technology to annotate - and thus disambiguate - key terms by the concept they represent. Using biomedical research documents, we quantify the potential benefits of tagging users’ targeted concepts in queries and documents in domain-specific information retrieval. Our experiments, based on the TREC Genomics track data, both on passage and full-text retrieval, found no evidence that automatic concept recognition in general is of significant value for this task. Moreover, the issues raised by these results suggest that it is difficult for such disambiguation to be effective.  相似文献   

4.
It has been known that retrieval effectiveness can be significantly improved by combining multiple evidence from different query or document representations, or multiple retrieval techniques. In this paper, we combine multiple evidence from different relevance feedback methods, and investigate various aspects of the combination. We first generate multiple query vectors for a given information problem in a fully automatic way by expanding an initial query vector with various relevance feedback methods. We then perform retrieval runs for the multiple query vectors, and combine the retrieval results. Experimental results show that combining the evidence of different relevance feedback methods can lead to substantial improvements of retrieval effectiveness.  相似文献   

5.
The quality of feedback documents is crucial to the effectiveness of query expansion (QE) in ad hoc retrieval. Recently, machine learning methods have been adopted to tackle this issue by training classifiers from feedback documents. However, the lack of proper training data has prevented these methods from selecting good feedback documents. In this paper, we propose a new method, called AdapCOT, which applies co-training in an adaptive manner to select feedback documents for boosting QE’s effectiveness. Co-training is an effective technique for classification over limited training data, which is particularly suitable for selecting feedback documents. The proposed AdapCOT method makes use of a small set of training documents, and labels the feedback documents according to their quality through an iterative process. Two exclusive sets of term-based features are selected to train the classifiers. Finally, QE is performed on the labeled positive documents. Our extensive experiments show that the proposed method improves QE’s effectiveness, and outperforms strong baselines on various standard TREC collections.  相似文献   

6.
The classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to the effective performance of the probabilistic models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution based on the Polya Urn scheme, which can also be considered as a hierarchical Bayesian model, is a more appropriate generative model than the traditional multinomial distribution for text documents. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency of repetitive word occurrences, the new probabilistic model based on this distribution is able to model the concavity of the score function more effectively. To avoid the empirical tuning of retrieval parameters, we design several parameter estimation algorithms to automatically set model parameters. Additionally, we propose a pseudo-relevance feedback algorithm based on the mixture modeling of the Dirichlet compound multinomial distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform the existing language modeling systems on several TREC retrieval tasks. The main objective of this research is to develop an effective probabilistic model based on the DCM distribution. A secondary objective is to provide a thorough understanding of the probabilistic retrieval model by a theoretical understanding of various text distribution assumptions.  相似文献   

7.
In this paper we present a new algorithm for relevance feedback (RF) in information retrieval. Unlike conventional RF algorithms which use the top ranked documents for feedback, our proposed algorithm is a kind of active feedback algorithm which actively chooses documents for the user to judge. The objectives are (a) to increase the number of judged relevant documents and (b) to increase the diversity of judged documents during the RF process. The algorithm uses document-contexts by splitting the retrieval list into sub-lists according to the query term patterns that exist in the top ranked documents. Query term patterns include a single query term, a pair of query terms that occur in a phrase and query terms that occur in proximity. The algorithm is an iterative algorithm which takes one document for feedback in each of the iterations. We experiment with the algorithm using the TREC-6, -7, -8, -2005 and GOV2 data collections and we simulate user feedback using the TREC relevance judgements. From the experimental results, we show that our proposed split-list algorithm is better than the conventional RF algorithm and that our algorithm is more reliable than a similar algorithm using maximal marginal relevance.  相似文献   

8.
The problem of results merging in distributed information retrieval environments has gained significant attention the last years. Two generic approaches have been introduced in research. The first approach aims at estimating the relevance of the documents returned from the remote collections through ad hoc methodologies (such as weighted score merging, regression etc.) while the other is based on downloading all the documents locally, completely or partially, in order to calculate their relevance. Both approaches have advantages and disadvantages. Download methodologies are more effective but they pose a significant overhead on the process in terms of time and bandwidth. Approaches that rely solely on estimation on the other hand, usually depend on document relevance scores being reported by the remote collections in order to achieve maximum performance. In addition to that, regression algorithms, which have proved to be more effective than weighted scores merging algorithms, need a significant number of overlap documents in order to function effectively, practically requiring multiple interactions with the remote collections. The new algorithm that is introduced is based on adaptively downloading a limited, selected number of documents from the remote collections and estimating the relevance of the rest through regression methodologies. Thus it reconciles the above two approaches, combining their strengths, while minimizing their drawbacks, achieving the limited time and bandwidth overhead of the estimation approaches and the increased effectiveness of the download. The proposed algorithm is tested in a variety of settings and its performance is found to be significantly better than the former, while approximating that of the latter.  相似文献   

9.
The relevance feedback process uses information derived from an initially retrieved set of documents to improve subsequent search formulations and retrieval output. In a Boolean query environment this implies that new query terms must be identified and Boolean operators must be chosen automatically to connect the various query terms. In this study two recently proposed automatic methods for relevance feedback of Boolean queries are evaluated and conclusions are drawn concerning the use of effective feedback methods in a Boolean query environment.  相似文献   

10.
Term classifications and thesauri can be used for many purposes in automatic information retrieval. Normally a thesaurus is generated manually by subject experts: alternatively, the associations between the terms can be obtained automatically by using the occurrence characteristics of the terms across the documents of a collection. A third possibility consists in taking into account user relevance assessments of certain documents with respect to certain queries in order to build term classes designed to retrieve the relevant documents and simultaneously to reject the nonrelevant documents. This last strategy, known as pseudoclassification, produces a user-dependent term classification.A number of pseudoclassification studies are summarized in the present report, and conclusions are reached concerning the effectiveness and feasibility of constructing term classifications based on human relevance assessments.  相似文献   

11.
This paper presents a study of relevance feedback in a cross-language information retrieval environment. We have performed an experiment in which Portuguese speakers are asked to judge the relevance of English documents; documents hand-translated to Portuguese and documents automatically translated to Portuguese. The goals of the experiment were to answer two questions (i) how well can native Portuguese searchers recognise relevant documents written in English, compared to documents that are hand translated and automatically translated to Portuguese; and (ii) what is the impact of misjudged documents on the performance improvement that can be achieved by relevance feedback. Surprisingly, the results show that machine translation is as effective as hand translation in aiding users to assess relevance in the experiment. In addition, the impact of misjudged documents on the performance of RF is overall just moderate, and varies greatly for different query topics.  相似文献   

12.
This paper is an interim report on our efforts at NIST to construct an information discovery tool through the fusion of hypertext and information retrieval (IR) technologies. The tool works by parsing a contiguous document base into smaller documents and inserting semantic links between these documents using document–document similarity measures based on IR techniques. The focus of the paper is a case study in which domain experts evaluate the utility of the tool in the performance of information discovery tasks on a large, dynamic procedural manual. The results of the case study are discussed, and their implications for the design of large-scale automatic hypertext generation systems are described.  相似文献   

13.
14.
Recent studies suggest that significant improvement in information retrieval performance can be achieved by combining multiple representations of an information need. The paper presents a genetic approach that combines the results from multiple query evaluations. The genetic algorithm aims to optimise the overall relevance estimate by exploring different directions of the document space. We investigate ways to improve the effectiveness of the genetic exploration by combining appropriate techniques and heuristics known in genetic theory or in the IR field. Indeed, the approach uses a niching technique to solve the relevance multimodality problem, a relevance feedback technique to perform genetic transformations on query formulations and evolution heuristics in order to improve the convergence conditions of the genetic process. The effectiveness of the global approach is demonstrated by comparing the retrieval results obtained by both genetic multiple query evaluation and classical single query evaluation performed on a subset of TREC-4 using the Mercure IRS. Moreover, experimental results show the positive effect of the various techniques integrated to our genetic algorithm model.  相似文献   

15.
Document retrieval systems based on probabilistic or fuzzy logic considerations may order documents for retrieval. Users then examine the ordered documents until deciding to stop, based on the estimate that the highest ranked unretrieved document will be most economically not retrieved. We propose an expected precision measure useful in estimating the performance expected if yet unretrieved documents were to be retrieved, providing information that may result in more economical stopping decisions. An expected precision graph, comparing expected precision versus document rank, may graphically display the relative expected precision of retrieved and unretrieved documents and may be used as a stopping aid for online searching of text data bases. The effectiveness of relevance feedback may be examined as a search progresses. Expected precision values may also be used as a cutoff for systems consistent with probabilistic models operating in batch modes. Techniques are given for computing the best expected precision obtainable and the expected precision of subject neutral documents.  相似文献   

16.
Measuring effectiveness of information retrieval (IR) systems is essential for research and development and for monitoring search quality in dynamic environments. In this study, we employ new methods for automatic ranking of retrieval systems. In these methods, we merge the retrieval results of multiple systems using various data fusion algorithms, use the top-ranked documents in the merged result as the “(pseudo) relevant documents,” and employ these documents to evaluate and rank the systems. Experiments using Text REtrieval Conference (TREC) data provide statistically significant strong correlations with human-based assessments of the same systems. We hypothesize that the selection of systems that would return documents different from the majority could eliminate the ordinary systems from data fusion and provide better discrimination among the documents and systems. This could improve the effectiveness of automatic ranking. Based on this intuition, we introduce a new method for the selection of systems to be used for data fusion. For this purpose, we use the bias concept that measures the deviation of a system from the norm or majority and employ the systems with higher bias in the data fusion process. This approach provides even higher correlations with the human-based results. We demonstrate that our approach outperforms the previously proposed automatic ranking methods.  相似文献   

17.
Searching for relevant material that satisfies the information need of a user, within a large document collection is a critical activity for web search engines. Query Expansion techniques are widely used by search engines for the disambiguation of user’s information need and for improving the information retrieval (IR) performance. Knowledge-based, corpus-based and relevance feedback, are the main QE techniques, that employ different approaches for expanding the user query with synonyms of the search terms (word synonymy) in order to bring more relevant documents and for filtering documents that contain search terms but with a different meaning (also known as word polysemy problem) than the user intended. This work, surveys existing query expansion techniques, highlights their strengths and limitations and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that the application of knowledge or corpus-based query expansion techniques on the results of the relevance feedback step improves the information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.  相似文献   

18.
Due to the unknown system structure of the froth flotation process and frequent fluctuations in production conditions, design of control strategy is a challenging problem. As a result, manual operation is still widely applied in practice by observing froth image features. However, since the manual observation is subjective and the production conditions are time-varying, the manual operation cannot make decisions quickly and accurately. In this paper, a data-driven-based adaptive fuzzy neural network control strategy is developed to implement the automatic control of the antimony flotation process. The strategy is composed of fuzzy neural network (FNN) controllers, a data-driven model, and an on-line adaptive algorithm. The FNN is constructed to derive the control laws of the reagent dosages. The parameters of the FNN controllers are tuned by gradient descent algorithm. To obtain the real-time error feedback information, the data-driven model is established, which integrates the long short term memory (LSTM) network and radial basis function neural network (RBFNN). The LSTM network is utilized as a primary model, and the RBFNN is used as an error compensation model. To handle the challenges of the frequent fluctuations in the production conditions, the on-line adaptive algorithm is proposed to tune the parameters of the FNN controllers. Simulations and experiments are carried out in a real-world antimony flotation plant in China. The results demonstrate that the proposed adaptive fuzzy neural network control strategy produces better control performance than the other two existing methods.  相似文献   

19.
The classification of documents from a bibliographic database is a task that is linked to processes of information retrieval based on partial matching. A method is described of vectorizing reference documents from LISA which permits their topological organization using Kohonen's algorithm. As an example a map is generated of 202 documents from LISA, and an analysis is made of the possibilities of this type of neural network with respect to the development of information retrieval systems based on graphical browsing.  相似文献   

20.
This paper reports our experimental investigation into the use of more realistic concepts as opposed to simple keywords for document retrieval, and reinforcement learning for improving document representations to help the retrieval of useful documents for relevant queries. The framework used for achieving this was based on the theory of Formal Concept Analysis (FCA) and Lattice Theory. Features or concepts of each document (and query), formulated according to FCA, are represented in a separate concept lattice and are weighted separately with respect to the individual documents they present. The document retrieval process is viewed as a continuous conversation between queries and documents, during which documents are allowed to learn a set of significant concepts to help their retrieval. The learning strategy used was based on relevance feedback information that makes the similarity of relevant documents stronger and non-relevant documents weaker. Test results obtained on the Cranfield collection show a significant increase in average precisions as the system learns from experience.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号