20 similar documents found; search time: 765 ms
1.
Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines, which are based on the ranked retrieval paradigm, do not, however, consider the additional information and rich annotations provided by the structure of XML documents and their element names. This article presents the XXL search engine that supports relevance ranking on XML data. XXL is particularly geared for path queries with wildcards that can span multiple XML collections and contain both exact-match and semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve search efficiency and effectiveness. XXL is fully implemented as a suite of Java classes and servlets. Experiments in the context of the INEX benchmark demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.
2.
Jovan Pehcevski James A. Thom Anne-Marie Vercoustre Vladimir Naumovski 《Information Retrieval》2010,13(5):568-600
Entity ranking has recently emerged as a research field that aims at retrieving entities as answers to a query. Unlike entity
extraction where the goal is to tag names of entities in documents, entity ranking is primarily focused on returning a ranked
list of relevant entity names for the query. Many approaches to entity ranking have been proposed, and most of them were evaluated
on the INEX Wikipedia test collection. In this paper, we describe a system we developed for ranking Wikipedia entities in
answer to a query. The entity ranking approach implemented in our system utilises the known categories, the link structure
of Wikipedia, as well as the link co-occurrences with the entity examples (when provided) to retrieve relevant entities as
answers to the query. We also extend our entity ranking approach by utilising the knowledge of predicted classes of topic
difficulty. To predict the topic difficulty, we generate a classifier that uses features extracted from an INEX topic definition
to classify the topic into an experimentally pre-determined class. This knowledge is then utilised to dynamically set the
optimal values for the retrieval parameters of our entity ranking system. Our experiments demonstrate that the use of categories
and the link structure of Wikipedia can significantly improve entity ranking effectiveness, and that topic difficulty prediction
is a promising approach that could also be exploited to further improve the entity ranking performance.
3.
Precision prediction based on ranked list coherence
We introduce a statistical measure of the coherence of a list of documents called the clarity score. Starting with a document list ranked by the query-likelihood retrieval model, we demonstrate the score's relationship to query ambiguity with respect to the collection. We also show that the clarity score is correlated with the average precision of a query and lay the groundwork for useful predictions by discussing a method of setting decision thresholds automatically. We then show that passage-based clarity scores correlate with average-precision measures of ranked lists of passages, where a passage is judged relevant if it contains correct answer text, which extends the basic method to passage-based systems. Next, we introduce variants of document-based clarity scores to improve the robustness, applicability, and predictive ability of clarity scores. In particular, we introduce the ranked list clarity score that can be computed with only a ranked list of documents, and the weighted clarity score where query terms contribute more than other terms. Finally, we show an approach to predicting queries that perform poorly on query expansion that uses techniques expanding on the ideas presented earlier.
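The clarity score described above can be sketched as the KL divergence between a query language model, estimated from the top-ranked documents, and the collection language model. The uniform mixture of document models below is a simplifying assumption; the original approach weights documents by their query likelihood.

```python
from collections import Counter
from math import log2

def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) between a query model estimated from the
    top-ranked documents and the collection model. A coherent (unambiguous)
    ranked list yields a higher score than a diffuse one."""
    # Collection language model (maximum likelihood).
    coll_counts = Counter(w for d in collection_docs for w in d)
    coll_total = sum(coll_counts.values())
    p_coll = {w: c / coll_total for w, c in coll_counts.items()}

    # Query model: uniform mixture of the top documents' ML models
    # (a simplification of the query-likelihood-weighted mixture).
    p_query = Counter()
    for d in top_docs:
        counts = Counter(d)
        for w, c in counts.items():
            p_query[w] += (c / len(d)) / len(top_docs)

    # KL(query || collection); terms absent from the collection are skipped.
    return sum(p * log2(p / p_coll[w]) for w, p in p_query.items() if w in p_coll)
```

A focused ranked list concentrated on a few terms scores higher than one that mirrors the collection distribution, which is the basis for predicting query ambiguity.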
4.
Jacques Savoy 《Information Retrieval》2007,10(6):509-529
This paper reports on the underlying IR problems encountered when indexing and searching with the Bulgarian language. For
this language we propose a general light stemmer and demonstrate that it can be quite effective, producing significantly better
MAP (around + 34%) than an approach not applying stemming. We implement the GL2 model derived from the Divergence from Randomness paradigm and find its retrieval effectiveness better than other probabilistic, vector-space and language models. The resulting
MAP is found to be about 50% better than the classical tf-idf approach. Moreover, increasing the query size enhances the MAP by around 10% (from T to TD). In order to compare the retrieval
effectiveness of our suggested stopword list and the light stemmer developed for the Bulgarian language, we conduct a set
of experiments on another stopword list and also a more complex and aggressive stemmer. Results tend to indicate that there
is no statistically significant difference between these variants and our suggested approach. This paper evaluates other indexing
strategies such as 4-gram indexing and indexing based on the automatic decompounding of compound words. Finally, we analyze
certain queries to discover why we obtained poor results, when indexing Bulgarian documents using the suggested word-based
approach.
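As an illustration of what a "light" stemmer does, the sketch below strips a single inflectional suffix (definite articles and plural endings) from a Bulgarian word, longest match first. The suffix list here is an illustrative assumption, not Savoy's actual rule set.

```python
def light_stem_bg(word, min_stem=3):
    """Illustrative light stemmer for Bulgarian: remove one inflectional
    suffix (longest match first), keeping at least `min_stem` characters.
    The suffix inventory below is a simplified, hypothetical example."""
    suffixes = ["ията", "ите", "ът", "та", "то", "те", "ия", "и", "а", "я", "е", "о"]
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: len(word) - len(suf)]
    return word
```

Light stemming of this kind conflates singular, plural, and definite forms into one index term without attempting full morphological analysis, which is what drives the MAP improvement reported above.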
5.
In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model
“fits” the user’s information need (query model). On the other hand in statistics, goodness of fit tests are well established
techniques for assessing the assumptions about the underlying distribution of a data set. Supposing that the query terms are
randomly distributed in the various documents of the collection, we actually want to know whether the occurrences of the query
terms are more frequently distributed by chance in a particular document. This can be quantified by the so-called goodness
of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given
the null hypothesis that there is no association between the query terms q and the document d irrespective of any chance occurrences, we perform a Chi-square goodness of fit test for assessing this hypothesis and calculate
the corresponding Chi-square values. Our retrieval formula is based on ranking the documents in the collection according to
these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and
5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, steadily outperforming the classical
Okapi term-frequency weighting formula, though it remains below the KL-divergence language-modeling approach. Despite this, we believe
that the technique is an important non-parametric way of thinking about retrieval, offering the possibility to try simple alternative
retrieval formulas within the framework of goodness-of-fit statistical tests, modeling the data in various ways by estimating or
assigning arbitrary theoretical term distributions.
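The ranking idea above can be sketched as follows: compare the observed query-term counts in a document against the counts expected by chance from collection-wide frequencies. The signed variant used here (so that documents with more occurrences than chance rank above those with fewer) is an illustrative simplification, not the paper's exact formula.

```python
from collections import Counter

def chi_square_scores(query_terms, docs):
    """Score each document by a signed chi-square-style statistic over the
    query terms: (observed - expected)^2 / expected, with expected counts
    derived from collection-wide term frequencies and document length."""
    collection = Counter(w for d in docs for w in d)
    coll_len = sum(collection.values())
    scores = []
    for d in docs:
        observed = Counter(d)
        chi2 = 0.0
        for t in query_terms:
            expected = len(d) * collection[t] / coll_len
            if expected > 0:
                sign = 1.0 if observed[t] >= expected else -1.0
                chi2 += sign * (observed[t] - expected) ** 2 / expected
        scores.append(chi2)
    return scores
```

Documents in which the query terms occur far more often than chance predicts receive the highest scores and are ranked first.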
6.
An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical where only a portion of the documents may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 dataset, and (2) tweet retrieval using the TREC Microblog ’11 dataset. 
Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart due to its advantage of adapting the number of topics, improving results by up to 5.6% of MAP compared to the parametric version.
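The core re-ranking step can be sketched as follows, assuming the document-topic distributions have already been obtained from a topic model (e.g., LDA) run over the top-ranked documents; the names `doc_topics` and `fb_depth` are illustrative.

```python
def rerank_by_topics(doc_topics, initial_ranking, fb_depth=2):
    """Re-rank an initial ranking using document-topic distributions.
    The query's topic profile is the mean of the top fb_depth documents'
    distributions (treated as pseudo-relevant); each document is then
    re-scored by the dot product of its distribution with that profile."""
    top = initial_ranking[:fb_depth]
    k = len(doc_topics[top[0]])
    profile = [sum(doc_topics[d][t] for d in top) / len(top) for t in range(k)]

    def score(d):
        return sum(p * q for p, q in zip(doc_topics[d], profile))

    return sorted(initial_ranking, key=score, reverse=True)
```

Documents dominated by the topics of the pseudo-relevant set move up the list, while multi-topical documents whose relevant portion is small move down.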
7.
Multilingual information retrieval is generally understood to mean the retrieval of relevant information in multiple target
languages in response to a user query in a single source language. In a multilingual federated search environment, different
information sources contain documents in different languages. A general search strategy in multilingual federated search environments
is to translate the user query to each language of the information sources and run a monolingual search in each information
source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists from the information
sources that are in different languages. This is known as the results merging problem for multilingual information retrieval.
Previous research has shown that the simple approach of normalizing source-specific document scores is not effective. On the
other hand, a more effective merging method was proposed that downloads and translates all retrieved documents into the source
language and generates the final ranked list by running a monolingual search in the search client. The latter method is more
effective but is associated with a large amount of online communication and computation costs. This paper proposes an effective
and efficient approach for the results merging task of multilingual ranked lists. Particularly, it downloads only a small
number of documents from the individual ranked lists of each user query to calculate comparable document scores by utilizing
both the query-based translation method and the document-based translation method. Then, query-specific and source-specific
transformation models can be trained for individual ranked lists by using the information of these downloaded documents. These
transformation models are used to estimate comparable document scores for all retrieved documents and thus the documents can
be sorted into a final ranked list. This merging approach is efficient as only a subset of the retrieved documents are downloaded
and translated online. Furthermore, an extensive set of experiments on the Cross-Language Evaluation Forum (CLEF) data has demonstrated the effectiveness of the query-specific and source-specific results merging algorithm against other
alternatives. The new research in this paper proposes different variants of the query-specific and source-specific results
merging algorithm with different transformation models. This paper also provides thorough experimental results as well as
detailed analysis. All of the work substantially extends the preliminary research in (Si and Callan, in: Peters (ed.) Results
of the cross-language evaluation forum-CLEF 2005, 2005).
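The query-specific, source-specific transformation step described above can be sketched as a per-source calibration fitted on the few downloaded-and-translated documents; modeling the transformation as linear is an illustrative assumption.

```python
def fit_linear(xs, ys):
    """Least-squares fit y ≈ a*x + b from a handful of (source score,
    comparable score) pairs obtained for the downloaded documents."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def merge(ranked_lists, calibration):
    """ranked_lists: {source: [(doc_id, score), ...]};
    calibration: {source: ([source_scores], [comparable_scores])} from the
    small downloaded sample. Returns one list sorted by comparable score."""
    merged = []
    for src, docs in ranked_lists.items():
        a, b = fit_linear(*calibration[src])
        merged += [(doc, a * s + b) for doc, s in docs]
    return sorted(merged, key=lambda t: t[1], reverse=True)
```

Only the calibration sample needs to be downloaded and translated online; every other retrieved document is mapped onto the comparable scale by the fitted transformation, which is what makes the approach efficient.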
8.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short)
parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of
these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model
estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard
passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the
document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.
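A minimal sketch of a homogeneity-controlled passage language model: the passage model is interpolated with the containing document's model, with the interpolation weight tied to an estimated document homogeneity. The specific mixing rule below is an assumption for illustration, not the paper's estimator.

```python
from collections import Counter

def passage_lm(passage, document, homogeneity):
    """Passage language model interpolated with its containing document's
    model. `homogeneity` in [0, 1] estimates how homogeneous the document
    is; the more homogeneous, the more the passage borrows probability
    mass from the whole document."""
    p_psg = Counter(passage)
    p_doc = Counter(document)
    vocab = set(p_psg) | set(p_doc)
    return {
        w: (1 - homogeneity) * p_psg[w] / len(passage)
           + homogeneity * p_doc[w] / len(document)
        for w in vocab
    }
```

With homogeneity 0 the passage stands alone; with homogeneity 1 it is smoothed fully toward the document, recovering document-level retrieval as a limiting case.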
9.
Large-scale retrieval systems are often implemented as a cascading sequence of phases—a first filtering step, in which a large set of candidate documents are extracted using a simple technique such as Boolean matching and/or static document scores; and then one or more ranking steps, in which the pool of documents retrieved by the filter is scored more precisely using dozens or perhaps hundreds of different features. The documents returned to the user are then taken from the head of the final ranked list. Here we examine methods for measuring the quality of filtering and preliminary ranking stages, and show how to use these measurements to tune the overall performance of the system. Standard top-weighted metrics used for overall system evaluation are not appropriate for assessing filtering stages, since the output is a set of documents, rather than an ordered sequence of documents. Instead, we use an approach in which a quality score is computed based on the discrepancy between filtered and full evaluation. Unlike previous approaches, our methods do not require relevance judgments, and thus can be used with virtually any query set. We show that this quality score directly correlates with actual differences in measured effectiveness when relevance judgments are available. Since the quality score does not require relevance judgments, it can be used to identify queries that perform particularly poorly for a given filter. Using these methods, we explore a wide range of filtering options using thousands of queries, categorize the relative merits of the different approaches, and identify useful parameter combinations.
10.
James C. French Allison L. Powell Fredric Gey Natalia Perelman 《Information Retrieval》2002,5(4):323-351
Vocabulary incompatibilities arise when the terms used to index a document collection are largely unknown, or at least not well-known to the users who eventually search the collection. No matter how comprehensive or well-structured the indexing vocabulary, it is of little use if it is not used effectively in query formulation. This paper demonstrates that techniques for mapping user queries into the controlled indexing vocabulary have the potential to radically improve document retrieval performance. We also show how the use of controlled indexing vocabulary can be employed to achieve performance gains for collection selection. Finally, we demonstrate the potential benefit of combining these two techniques in an interactive retrieval environment. Given a user query, our evaluation approach simulates the human user's choice of terms for query augmentation given a list of controlled vocabulary terms suggested by a system. This strategy lets us evaluate interactive strategies without the need for human subjects.
11.
Fernando Martínez-Santiago L. Alfonso Ureña-López Maite Martín-Valdivia 《Information Retrieval》2006,9(1):71-93
A usual strategy to implement CLIR (Cross-Language Information Retrieval) systems is the so-called query translation approach.
The user query is translated for each language present in the multilingual collection in order to compute an independent monolingual
information retrieval process per language. Thus, this approach divides documents according to language. In this way, we obtain
as many different collections as languages. After searching in these corpora and obtaining a result list per language, we
must merge them in order to provide a single list of retrieved articles.
In this paper, we propose an approach to obtain a single list of relevant documents for CLIR systems driven by query translation.
This approach, which we call 2-step RSV (RSV: Retrieval Status Value), is based on re-indexing the retrieved documents
according to the query vocabulary, and it performs noticeably better than traditional methods.
The proposed method requires query vocabulary alignment: for a given query word, we must know its translation or translations
into the other languages. Because this is not always possible, we have investigated a mixed model, which is applied in order
to deal with queries with only partial word-level alignment. The results show that even in this scenario, 2-step RSV
performs better than traditional merging methods.
12.
Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.
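The AST-based representation can be illustrated with Python's own `ast` module (the paper's dataset is Java source, so this is an analogy, not the paper's implementation): documents are fingerprinted by node-type n-grams, which makes the comparison robust to identifier renaming, a common plagiarism disguise.

```python
import ast

def ast_fingerprint(source, n=2):
    """Set of AST node-type n-grams from a breadth-first walk of the tree.
    Identifier names are ignored, so renaming variables or functions does
    not change the fingerprint."""
    types = [type(node).__name__ for node in ast.walk(ast.parse(source))]
    return {tuple(types[i:i + n]) for i in range(len(types) - n + 1)}

def similarity(src_a, src_b):
    """Jaccard similarity between two AST fingerprints."""
    fa, fb = ast_fingerprint(src_a), ast_fingerprint(src_b)
    return len(fa & fb) / len(fa | fb)
```

A bag-of-words model would treat a renamed copy as a different document; the AST fingerprint scores it as identical, while structurally unrelated code scores low.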
13.
Oren Kurland 《Information Retrieval》2009,12(4):437-460
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based
re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using
information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking
of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent
clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing
method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking
approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further
exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering
algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to
using single documents to this end.
14.
Patent prior art search is a type of search in the patent domain that retrieves documents describing work previously carried out related to a patent application. The goal of this search is to check whether the idea in the patent application is novel. Vocabulary mismatch is one of the main problems of patent retrieval, resulting in low retrievability of similar documents for a given patent application. In this paper we show how the term distribution of the cited documents in an initially retrieved ranked list can be used to address the vocabulary mismatch. We propose a query modeling estimation method which utilizes the citation links in a pseudo-relevance feedback set. We first build a topic-dependent citation graph, starting from the initially retrieved set of feedback documents and following the citation links of feedback documents to expand the set. We identify the important documents in the topic-dependent citation graph using a citation analysis measure. We then use the term distribution of the documents in the citation graph to estimate a query model by identifying the distinguishing terms and their respective weights, and use these terms to expand our original query. We use the CLEF-IP 2011 collection to evaluate the effectiveness of our query modeling approach for prior art search. We also study the influence of different parameters on the performance of the proposed method. The experimental results demonstrate that the proposed approach significantly improves recall over a state-of-the-art baseline which uses the link-based structure of the citation graph but not the term distribution of the cited documents.
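The query-modeling step can be sketched as follows. Using the in-graph cited-by count as the document importance measure is a simplification standing in for the citation analysis measure used in the paper, and all data structures here are illustrative.

```python
from collections import Counter

def expansion_terms(feedback, citations, texts, k=3):
    """Estimate expansion terms from the term distribution of documents in
    a topic-dependent citation graph.
    feedback: seed (pseudo-relevant) doc ids;
    citations: {doc: [cited doc ids]}; texts: {doc: [terms]}."""
    # Expand the feedback set with the documents it cites.
    graph = set(feedback)
    for d in feedback:
        graph.update(citations.get(d, []))
    # Importance: 1 + number of in-graph documents citing the document
    # (a stand-in for a fuller citation analysis measure).
    importance = {d: 1 for d in graph}
    for d in graph:
        for cited in citations.get(d, []):
            if cited in graph:
                importance[cited] += 1
    # Importance-weighted term distribution; top-k terms expand the query.
    weights = Counter()
    for d in graph:
        for t in texts.get(d, []):
            weights[t] += importance[d]
    return [t for t, _ in weights.most_common(k)]
```

Terms from heavily cited in-graph documents dominate the expanded query, which is how the citation structure helps bridge the vocabulary mismatch between an application and related prior art.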
15.
A solid research path towards new information retrieval models is to further develop the theory behind existing models. A profound understanding of these models is therefore essential. In this paper, we revisit probability ranking principle (PRP)-based models, probability of relevance (PR) models, and language models, finding conceptual differences in their definition and interrelationships. The probabilistic model of the PRP has not been explicitly defined previously, but doing so leads to the formulation of two actual principles with different objectives. First, the belief probability ranking principle (BPRP), which considers uncertain relevance between known documents and the current query, and second, the popularity probability ranking principle (PPRP), which considers the probability of relevance of documents among multiple queries with the same features. Our analysis shows how some of the discussed PR models implement the BPRP or the PPRP while others do not. However, for some models the parameter estimation is challenging. Finally, language models are often presented as related to PR models. However, we find that language models differ from PR models in every aspect of a probabilistic model and the effectiveness of language models cannot be explained by the PRP.
16.
Travis Gagie Aleksi Hartikainen Kalle Karhu Juha Kärkkäinen Gonzalo Navarro Simon J. Puglisi Jouni Sirén 《Information Retrieval》2017,20(3):253-291
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
17.
Latent semantic indexing and its applications
Latent Semantic Indexing (LSI) is a concept-based approach to document retrieval. Unlike traditional retrieval methods that match the words of the user's query against the words of documents, it returns results according to the semantic association between documents and the query. This paper describes one way of implementing latent semantic indexing for document retrieval, providing a new avenue for document retrieval.
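A minimal LSI sketch: truncated SVD of the term-document matrix, with the query folded into the concept space and documents ranked by cosine similarity there. The matrix orientation and fold-in rule follow common practice and are assumptions, not tied to the implementation described in the paper.

```python
import numpy as np

def lsi_search(term_doc, query_vec, k=2):
    """Rank documents by cosine similarity to the query in a k-dimensional
    latent (concept) space obtained via truncated SVD of the term-document
    matrix (rows = terms, columns = documents)."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    uk, sk, vtk = u[:, :k], s[:k], vt[:k]
    docs_k = (np.diag(sk) @ vtk).T            # documents in concept space
    q_k = query_vec @ uk                      # fold the query in
    sims = docs_k @ q_k / (
        np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims)                  # best document first
```

Because co-occurring terms share latent dimensions, a query for one term can retrieve documents that contain only its near-synonyms: in the toy data below, the query "car" ranks the document containing only "auto" above the unrelated "flower" document.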
18.
A 2-tuple linguistic information retrieval model
An information retrieval model based on 2-tuple linguistic representation is proposed. The model comprises three parts: the representation of documents, the representation of query statements, and the matching of documents and queries. Compared with traditional information retrieval models based on exact matching of query keywords, the proposed model better accommodates the flexibility of users' query requirements.
19.
An improved cosine vector measure model for text retrieval