Found 20 similar documents (search time: 31 ms)
1.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper,
we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph
of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively
propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the
simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures,
our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms
in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine
applications.
ChengXiang Zhai
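The abstract above describes smoothing by iteratively propagating term counts through a graph of documents, so that counts can arrive from remotely related documents. The abstract does not give the update rule, so the following is only a minimal sketch under assumptions: the function name `propagate_counts`, the mixing weight `alpha`, the `neighbors` structure, and the linear update are all hypothetical and are not taken from the paper.

```python
def propagate_counts(counts, neighbors, alpha=0.5, iterations=10):
    """Illustrative count propagation over a document graph (assumed update rule).

    counts:    {doc: {term: count}} original term counts
    neighbors: {doc: {other_doc: weight}}, weights per doc assumed to sum to 1
    alpha:     share of each document's counts taken from its neighbors (assumption)
    """
    current = {doc: dict(terms) for doc, terms in counts.items()}
    for _ in range(iterations):
        updated = {}
        for doc, original in counts.items():
            nbrs = neighbors.get(doc, {})
            if not nbrs:
                # Isolated document: nothing to propagate, keep original counts.
                updated[doc] = dict(original)
                continue
            # Keep a (1 - alpha) share of the document's own original counts.
            mixed = {term: (1 - alpha) * c for term, c in original.items()}
            # Add an alpha share of the neighbors' current (already smoothed) counts.
            for other, weight in nbrs.items():
                for term, c in current[other].items():
                    mixed[term] = mixed.get(term, 0.0) + alpha * weight * c
            updated[doc] = mixed
        current = updated
    return current
```

With a chain of documents A, B, C in which only A contains a term, one iteration moves some of that term's count to B, and a second iteration moves it on to C. This is the sense in which repeated propagation can "fill in" a missing query term from a document that is not a direct neighbor.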
2.
Fernando Diaz 《Information Retrieval》2007,10(6):531-562
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have
similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective
by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score
regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique
to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived
from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval
algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance
feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue
for the adoption of score regularization as a general design principle or post-processing step for information retrieval systems.
Fernando Diaz
3.
Query structuring and expansion with two-stage term dependence for Japanese web retrieval (cited by: 1; self-citations: 1; citations by others: 0)
In this paper, we propose a new term dependence model for information retrieval, which is based on a theoretical framework
using Markov random fields. We assume two types of dependencies of terms given in a query: (i) long-range dependencies that
may appear for instance within a passage or a sentence in a target document, and (ii) short-range dependencies that may appear
for instance within a compound word in a target document. Based on this assumption, our two-stage term dependence model captures
both long-range and short-range term dependencies differently, when more than one compound word appears in a query. We also
investigate how query structuring with term dependence can improve the performance of query expansion using a relevance model.
The relevance model is constructed using the retrieval results of the structured query with term dependence to expand the
query. We show that our term dependence model works well, particularly when using query structuring with compound words, through
experiments using a 100-gigabyte test collection of web documents mostly written in Japanese. We also show that the performance
of the relevance model can be significantly improved by using the structured query with our term dependence model.
Koji Eguchi
4.
Xiaojun Wan 《Information Retrieval》2008,11(1):25-49
In recent years graph-ranking based algorithms have been proposed for single document summarization and generic multi-document
summarization. The algorithms make use of the “votings” or “recommendations” between sentences to evaluate the importance
of the sentences in the documents. This study aims to differentiate the cross-document and within-document relationships between
sentences for generic multi-document summarization and adapt the graph-ranking based algorithm for topic-focused summarization.
The contributions of this study are two-fold: (1) For generic multi-document summarization, we apply the graph-based ranking
algorithm based on each kind of sentence relationship and explore their relative importance for summarization performance.
(2) For topic-focused multi-document summarization, we propose to integrate the relevance of the sentences to the specified
topic into the graph-ranking based method. Each individual kind of sentence relationship is also differentiated and investigated
in the algorithm. Experimental results on DUC 2002–DUC 2005 data demonstrate the great importance of the cross-document relationships
between sentences for both generic and topic-focused multi-document summarizations. Even the approach based only on the cross-document
relationships can perform better than or at least as well as the approaches based on both kinds of relationships between sentences.
Xiaojun Wan
5.
Oren Kurland 《Information Retrieval》2009,12(4):437-460
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based
re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using
information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking
of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent
clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing
method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking
approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further
exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering
algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to
using single documents to this end.
Oren Kurland
6.
On rank-based effectiveness measures and optimization (cited by: 1; self-citations: 0; citations by others: 1)
Many current retrieval models and scoring functions contain free parameters which need to be set—ideally, optimized. The process
of optimization normally involves some training corpus of the usual document-query-relevance judgement type, and some choice
of measure that is to be optimized. The paper proposes a way to think about the process of exploring the space of parameter
values, and how moving around in this space might be expected to affect different measures. One result, concerning local optima,
is demonstrated for a range of rank-based evaluation measures.
Hugo Zaragoza
7.
Jennifer S. Milligan 《Archival Science》2007,7(4):359-367
Curious Archives examines the creation of the museum of archives, the Musée de l’Histoire de France, at the Imperial Archives
of France under the direction of Leon de Laborde, 1858–1867. This museum was intended as a crucial tool for publicizing the
Archives and educating the public, but also represented a break from the Archives’ role as administrative storehouse both
in practice and in the popular imagination. The museum’s conception and reception reveal conflicts around the Archives’ mission
and contents, particularly regarding public interest, the potential dangers of public curiosity, and the nature of documentary
and historical knowledge in nineteenth-century France.
Jennifer S. Milligan
8.
Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR’ed
to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving
Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different
correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error
rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available,
then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction
with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to
be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction
can minimize the need for morphologically sensitive error correction.
Kareem Darwish
9.
Query Expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between
the original query terms and a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC
Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting
the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect
to two system design issues: IR framework and knowledge source used for expansion. We present a query expansion framework
that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness
of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome
of these experiments, we discuss the success factors required for effective query expansion with respect to various sources
of term expansion, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and
domain-independent ontologies and databases. Our results show that choice of document ranking algorithm is the most important
factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find
that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over a baseline
system.
Nicola Stokes
10.
Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document
set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents
are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents
an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of approximately
complete judgments must become invalid. This paper shows that the judgment sets produced by traditional pooling when the pools
are too small relative to the total document set size can be biased in that they favor relevant documents that contain topic
title words. This phenomenon is wholly dependent on the collection size and does not depend on the number of relevant documents
for a given topic. We show that the AQUAINT test collection constructed in the recent TREC 2005 workshop exhibits this biased
relevance set; it is likely that the test collections based on the much larger GOV2 document set also exhibit the bias. The
paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable
test collections to be built.
Ellen Voorhees
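The pooling process described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming each run is a ranked list of document ids and the pool is the union of the top-ranked documents of every run; the function names and the `depth` parameter are hypothetical and not taken from the paper.

```python
def build_pool(runs, depth):
    """Return the set of documents to be judged: top-`depth` docs of each run."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool


def is_relevant(doc, pool, judged_relevant):
    """Documents outside the pool are unjudged and assumed nonrelevant."""
    return doc in pool and doc in judged_relevant


def pool_fraction(pool, collection_size):
    """Share of the whole collection that the pool covers."""
    return len(pool) / collection_size
```

For a fixed `depth` and a fixed set of runs, `len(pool)` stays bounded while `collection_size` grows, so `pool_fraction` shrinks. That is the situation in which the abstract argues the assumption of approximately complete judgments eventually breaks down.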
11.
Jacob Soll 《Archival Science》2007,7(4):331-342
This article examines the archival methods developed by Colbert to train his son in state administration. Based on Colbert’s
correspondence with his son, it reveals the practices Colbert thought necessary to collect and manage information in his state
encyclopedic archive during the last half of the 17th century.
Jacob Soll
12.
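To put an end to the large copyright trade deficit, both Chinese government agencies and publishing houses have been striving
to enter the international publication market. The article analyzes the background of the going-global strategy, and sums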
up the performance of both Chinese administrations and publishers.
Qing Fang (Corresponding author)
13.
14.
Norbert Fuhr 《Information Retrieval》2008,11(3):251-265
The classical Probability Ranking Principle (PRP) forms the theoretical basis for probabilistic Information Retrieval (IR)
models, which have dominated IR theory for about 20 years. However, the assumptions underlying the PRP often do not hold,
and its view is too narrow for interactive information retrieval (IIR). In this article, a new theoretical framework for interactive
retrieval is proposed: The basic idea is that during IIR, a user moves between situations. In each situation, the system presents
to the user a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation.
Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering
of the choices can be derived: the PRP for IIR. The relationship of this rule to the classical PRP is described, and issues
of further research are pointed out.
Norbert Fuhr
15.
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: Instead of retrieving whole documents, document components that are exhaustive to the information need while at the same time being as specific as possible should be retrieved. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on the definition of a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric with the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella Kazai
16.
Result merging methods in distributed information retrieval with overlapping databases (cited by: 5; self-citations: 0; citations by others: 5)
In distributed information retrieval systems, document overlaps occur frequently among different component databases. This
paper presents an experimental investigation and evaluation of a group of result merging methods including the shadow document
method and the multi-evidence method in the environment of overlapping databases. We assume, with the exception of resultant
document lists (either with rankings or scores), no extra information about retrieval servers and text databases is available,
which is the usual case for many applications on the Internet and the Web.
The experimental results show that the shadow document method and the multi-evidence method are the two best methods when
overlap is high, while Round-robin is the best for low overlap. The experiments also show that [0,1] linear normalization
is a better option than linear regression normalization for result merging in a heterogeneous environment.
Sally McClean
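Two of the baseline techniques named in the abstract above, Round-robin merging and [0,1] linear normalization, are simple enough to sketch. The following is a minimal illustration, assuming each server returns a ranked list of document ids (or raw scores); the function names, the duplicate-handling rule, and the treatment of equal scores are assumptions, and the shadow document and multi-evidence methods are not shown because the abstract does not describe them.

```python
from itertools import zip_longest


def normalize_01(scores):
    """Linearly rescale raw scores so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case (all scores equal): mapping to 1.0 is an assumption.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def round_robin_merge(result_lists):
    """Interleave ranked lists rank by rank, keeping the first copy of each doc."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for doc in tier:
            # None is the padding zip_longest adds for shorter lists.
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

Round-robin needs only the rankings, which fits the abstract's assumption that nothing beyond the result lists is known about the servers. Score-based merging first needs the scores from different servers to be made comparable, which is the role normalization plays.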
17.
Due to the heavy use of gene synonyms in biomedical text, people have tried many query expansion techniques using synonyms
in order to improve performance in biomedical information retrieval. However, mixed results have been reported. The main challenge
is that it is not trivial to assign appropriate weights to the added gene synonyms in the expanded query; under-weighting
of synonyms would not bring much benefit, while overweighting some unreliable synonyms can hurt performance significantly.
So far, there has been no systematic evaluation of various synonym query expansion strategies for biomedical text. In this
work, we propose two different strategies to extend a standard language modeling approach for gene synonym query expansion
and conduct a systematic evaluation of these methods on all the available TREC biomedical text collections for ad hoc document
retrieval. Our experimental results show that synonym expansion can significantly improve the retrieval accuracy. However, different
query types require different synonym expansion methods, and appropriate weighting of gene names and synonym terms is critical
for improving performance.
Chengxiang Zhai
18.
Fotis Lazarinis Jesús Vilares John Tait Efthimis N. Efthimiadis 《Information Retrieval》2009,12(3):230-250
With increasing numbers of non-English-language web searchers, the problems of efficient handling of non-English Web
documents and user queries are becoming major issues for search engines. The main aim of this review paper is to make researchers
aware of the existing problems in monolingual non-English Web retrieval by providing an overview of open issues. A significant
number of papers are reviewed and the research issues investigated in these studies are categorized in order to identify the
research questions and solutions proposed in these papers. Further research is proposed at the end of each section.
Efthimis N. Efthimiadis
19.
Modeling context through domain ontologies (cited by: 1; self-citations: 0; citations by others: 1)
Nathalie Hernandez Josiane Mothe Claude Chrisment Daniel Egret 《Information Retrieval》2007,10(2):143-172
Traditional information retrieval systems aim at satisfying most users for most of their searches, leaving aside the context
in which the search takes place. We propose to model two main aspects of context: The themes of the user's information need
and the specific data the user is looking for to achieve the task that has motivated his search. Both aspects are modeled
by means of ontologies. Documents are semantically indexed according to the context representation and the user accesses information
by browsing the ontologies. The model has been applied to a case study that has shown the added value of such a semantic representation
of context.
Daniel Egret
20.
Through a reading of the archived letters of Henry Garnet (1555–1606), Superior of the Jesuit order in England and suspected
Gunpowder plotter, this article investigates the nature of the archive in relation to narrative theory. Figuring the archive
as one of a number of narrating voices accrued by the individual record, I argue that models of communication such as those
put forward by Roman Jakobson, Wayne C. Booth and Seymour Chatman afford useful insights into the ways in which power is inscribed
and reinscribed in the record through successive acts of reading and rewriting.
Paul Wake is a Senior Lecturer in English Literature at Manchester Metropolitan University. He is the author of Conrad’s Marlow (2007), editor, with Simon Malpas, of The Routledge Companion to Critical Theory (2006), and he has published articles on narrative theory and postmodernism.
Paul Wake