Found 20 similar documents; search took 41 ms.
1.
Multilingual information retrieval is generally understood to mean the retrieval of relevant information in multiple target
languages in response to a user query in a single source language. In a multilingual federated search environment, different
information sources contain documents in different languages. A general search strategy in multilingual federated search environments
is to translate the user query to each language of the information sources and run a monolingual search in each information
source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists from the information
sources that are in different languages. This is known as the results merging problem for multilingual information retrieval.
Previous research has shown that the simple approach of normalizing source-specific document scores is not effective. In contrast, a more effective merging method downloads and translates all retrieved documents into the source language and generates the final ranked list by running a monolingual search in the search client. The latter method is more effective but incurs large online communication and computation costs. This paper proposes an effective
and efficient approach for the results merging task of multilingual ranked lists. Particularly, it downloads only a small
number of documents from the individual ranked lists of each user query to calculate comparable document scores by utilizing
both the query-based translation method and the document-based translation method. Then, query-specific and source-specific
transformation models can be trained for individual ranked lists by using the information of these downloaded documents. These
transformation models are used to estimate comparable document scores for all retrieved documents and thus the documents can
be sorted into a final ranked list. This merging approach is efficient as only a subset of the retrieved documents are downloaded
and translated online. Furthermore, an extensive set of experiments on the Cross-Language Evaluation Forum (CLEF) data has demonstrated the effectiveness of the query-specific and source-specific results merging algorithm against other
alternatives. The new research in this paper proposes different variants of the query-specific and source-specific results
merging algorithm with different transformation models. This paper also provides thorough experimental results as well as
detailed analysis. All of the work substantially extends the preliminary research in (Si and Callan, in: Peters (ed.) Results
of the cross-language evaluation forum-CLEF 2005, 2005).
Hao Yuan
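The merging step described above can be illustrated with a minimal sketch: fit a per-source transformation from source-specific scores to comparable scores using the few downloaded-and-translated documents, then map every retrieved document into a single ranked list. The linear model form and all names below are illustrative assumptions, not the paper's exact formulation.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b from paired training points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var if var else 0.0
    return a, my - a * mx

def merge(ranked_lists, training):
    """ranked_lists: {source: [(doc_id, source_score), ...]}.
    training: {source: [(source_score, comparable_score), ...]} obtained by
    re-scoring a small number of downloaded documents per source.
    Returns one list sorted by estimated comparable score."""
    merged = []
    for source, docs in ranked_lists.items():
        xs = [s for s, _ in training[source]]
        ys = [c for _, c in training[source]]
        a, b = fit_linear(xs, ys)
        merged.extend((doc, a * s + b) for doc, s in docs)
    return sorted(merged, key=lambda t: -t[1])
```

Only the training pairs require online download and translation; all other documents are mapped through the fitted transformation.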
2.
Oren Kurland 《Information Retrieval》2009,12(4):437-460
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based
re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using
information induced from these clusters (often called query-specific clusters) for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking
of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent
clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing
method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking
approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further
exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering
algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to
using single documents to this end.
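One of the roles of clusters mentioned above, smoothing of document language models, can be sketched as follows: the query likelihood of a top-retrieved document is computed from its own language model interpolated with the model of its query-specific cluster. The interpolation form and parameter names are assumptions for illustration.

```python
from collections import Counter

def unigram_lm(texts):
    """Maximum-likelihood unigram language model over whitespace tokens."""
    counts = Counter()
    for t in texts:
        counts.update(t.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def cluster_smoothed_score(query, doc, cluster_docs, lam=0.7, eps=1e-9):
    """Query likelihood under the document LM interpolated with the LM of
    the document's query-specific cluster; eps guards against zeros."""
    p_doc, p_clu = unigram_lm([doc]), unigram_lm(cluster_docs)
    score = 1.0
    for w in query.split():
        score *= lam * p_doc.get(w, 0.0) + (1 - lam) * p_clu.get(w, 0.0) + eps
    return score
```

A document missing a query term can still score well if its cluster provides that term, which is the context effect the abstract attributes to query-specific clusters.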
3.
Fernando Diaz 《Information Retrieval》2007,10(6):531-562
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have
similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective
by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score
regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique
to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived
from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval
algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance
feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue
for the adoption of score regularization as a general design principle or post-processing step for information retrieval systems.
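The score-adjustment objective can be sketched as a label-propagation-style iteration over a document-similarity graph: each score is repeatedly pulled toward its neighbours' scores while staying anchored to the initial retrieval score. This particular update rule is an assumed illustration of the regularization idea, not the paper's exact algorithm.

```python
def regularize_scores(init_scores, sim, alpha=0.5, iters=50):
    """Smooth retrieval scores so topically related documents receive
    similar scores. sim is a row-stochastic similarity matrix (list of
    lists); alpha trades off graph smoothness against the initial scores."""
    s = list(init_scores)
    for _ in range(iters):
        s = [alpha * sum(sim[i][j] * s[j] for j in range(len(s)))
             + (1 - alpha) * init_scores[i]
             for i in range(len(s))]
    return s
```

Because the update only touches scores, it can be applied to rankings from any initial retrieval system, as the abstract notes.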
4.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper,
we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph
of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively
propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the
simple collection-based smoothing method. Compared with other smoothing methods that also exploit local corpus structures,
our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms
in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine
applications.
ChengXiang Zhai
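The iterative count propagation described above can be sketched as follows: each pass mixes a fraction of every neighbour's current (already propagated) term counts into a document, so counts can eventually reach remotely related documents. The propagation weight, neighbour structure, and iteration count are illustrative assumptions.

```python
from collections import Counter

def propagate_counts(counts, neighbors, weight=0.2, iters=3):
    """counts: {doc_id: Counter of term counts}; neighbors: {doc_id: [ids]}.
    Repeatedly mix a fraction of each neighbour's propagated counts into a
    document before the usual LM smoothing is applied to the result."""
    current = {d: Counter(c) for d, c in counts.items()}
    for _ in range(iters):
        updated = {}
        for d, c in current.items():
            mixed = Counter(c)
            for n in neighbors.get(d, []):
                for term, cnt in current[n].items():
                    mixed[term] += weight * cnt
            updated[d] = mixed
        current = updated
    return current
```

After two iterations a term can reach a document two hops away in the graph, which is the "filling in" of missing query terms the abstract highlights.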
5.
Query structuring and expansion with two-stage term dependence for Japanese web retrieval
In this paper, we propose a new term dependence model for information retrieval, which is based on a theoretical framework
using Markov random fields. We assume two types of dependencies of terms given in a query: (i) long-range dependencies that
may appear for instance within a passage or a sentence in a target document, and (ii) short-range dependencies that may appear
for instance within a compound word in a target document. Based on this assumption, our two-stage term dependence model captures
both long-range and short-range term dependencies differently, when more than one compound word appears in a query. We also
investigate how query structuring with term dependence can improve the performance of query expansion using a relevance model.
The relevance model is constructed using the retrieval results of the structured query with term dependence to expand the
query. We show that our term dependence model works well, particularly when using query structuring with compound words, through
experiments using a 100-gigabyte test collection of web documents mostly written in Japanese. We also show that the performance
of the relevance model can be significantly improved by using the structured query with our term dependence model.
Koji Eguchi
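A rough sketch of the two-stage structuring idea: short-range dependencies within a compound word become exact-phrase features, while a long-range dependency spanning the whole query becomes an unordered-window feature alongside plain unigrams. Indri-style operator syntax and the window size are assumptions here; the paper's actual query structures may differ.

```python
def structure_query(compounds):
    """compounds: list of compound words, each a list of constituent terms.
    Builds a structured query with unigram, exact-phrase (#1), and
    unordered-window (#uwN) features, Indri-style."""
    terms = [t for c in compounds for t in c]
    parts = [" ".join(terms)]                                   # unigrams
    parts += ["#1(%s)" % " ".join(c) for c in compounds if len(c) > 1]
    if len(compounds) > 1:                                      # long-range
        parts.append("#uw%d(%s)" % (4 * len(terms), " ".join(terms)))
    return "#combine(%s)" % " ".join(parts)
```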
6.
Due to the heavy use of gene synonyms in biomedical text, people have tried many query expansion techniques using synonyms
in order to improve performance in biomedical information retrieval. However, mixed results have been reported. The main challenge
is that it is not trivial to assign appropriate weights to the added gene synonyms in the expanded query; under-weighting
of synonyms would not bring much benefit, while overweighting some unreliable synonyms can hurt performance significantly.
So far, there has been no systematic evaluation of various synonym query expansion strategies for biomedical text. In this
work, we propose two different strategies to extend a standard language modeling approach for gene synonym query expansion
and conduct a systematic evaluation of these methods on all the available TREC biomedical text collections for ad hoc document
retrieval. Our experiment results show that synonym expansion can significantly improve the retrieval accuracy. However, different
query types require different synonym expansion methods, and appropriate weighting of gene names and synonym terms is critical
for improving performance.
Chengxiang Zhai
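The weighting issue described above can be made concrete with a small sketch: each original gene name keeps most of its probability mass, and its synonyms share a fixed fraction of it, yielding a weighted query model. The specific weighting scheme is an assumption for illustration, not one of the paper's two proposed strategies.

```python
def expand_query(query_terms, synonyms, syn_weight=0.3):
    """Return {term: weight} summing to 1: each original term keeps
    (1 - syn_weight) of its share, and its synonyms split syn_weight.
    Terms without synonyms keep their full share."""
    weights = {}
    for t in query_terms:
        syns = synonyms.get(t, [])
        keep = 1.0 if not syns else 1.0 - syn_weight
        weights[t] = weights.get(t, 0.0) + keep / len(query_terms)
        for s in syns:
            weights[s] = weights.get(s, 0.0) + syn_weight / (len(syns) * len(query_terms))
    return weights
```

Raising syn_weight toward 1 reproduces the over-weighting failure mode the abstract warns about; setting it near 0 reproduces under-weighting.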
7.
Diego Reforgiato Recupero 《Information Retrieval》2007,10(6):563-579
Text document clustering provides an effective and intuitive navigation mechanism to organize a large amount of retrieval
results by grouping documents in a small number of meaningful classes. Many well-known methods of text clustering make use
of a long list of words as vector space which is often unsatisfactory for a couple of reasons: first, it keeps the dimensionality
of the data very high, and second, it ignores important relationships between terms like synonyms or antonyms. Our unsupervised
method solves both problems by using ANNIE and WordNet lexical categories and WordNet ontology in order to create a well structured
document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step
we have chosen bisecting k-means and the Multipole tree (a modified version of the Antipole tree data structure) for their accuracy and speed, respectively.
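The dimensionality-reduction step described above can be sketched by mapping each document onto a vector of lexical-category counts rather than raw word counts, so the vector length equals the number of categories instead of the vocabulary size. The category map here stands in for WordNet lexical categories and is a hypothetical example.

```python
def category_vector(doc_tokens, categories):
    """Map a token list onto a unit-normalized vector of lexical-category
    counts. categories: {word: category_name}; words outside the map are
    ignored. The result's dimensionality is the number of categories."""
    cats = sorted(set(categories.values()))
    idx = {c: i for i, c in enumerate(cats)}
    v = [0.0] * len(cats)
    for w in doc_tokens:
        if w in categories:
            v[idx[categories[w]]] += 1.0
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else v
```

These low-dimensional vectors can then be fed to any common clustering algorithm, such as the bisecting k-means mentioned in the abstract.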
8.
Deciphering the diplomatic archives of fifteenth-century Italy 总被引:1,自引:1,他引:0
Paul Marcus Dover 《Archival Science》2007,7(4):297-316
This article examines the repercussions of the explosion of paper documents generated by new developments in diplomatic practice
in Italian city-states between 1450 and 1500. With the proliferation of resident ambassadors whose daily duties centered around
writing and receiving letters and other documents, a flood of written material was produced. The management and archiving
of all this material triggered the formation of new institutions, of new methods of working, and of new personnel. Though
the results of the efforts at archiving were often fitful and incomplete, the governments of the Italian peninsula henceforth
sought to collect, control and preserve diplomatic documents so that they could be referenced in the future.
Paul M. Dover is Assistant Professor of History at Kennesaw State University in Kennesaw, Georgia. He has published several articles on the political and intellectual history of Renaissance Italy. He is currently writing a book on ambassadors and the culture of diplomacy in fifteenth-century Italy. He holds a PhD from Yale University.
9.
Massih R. Amini Anastasios Tombros Nicolas Usunier Mounia Lalmas 《Information Retrieval》2007,10(3):233-255
Documents formatted in eXtensible Markup Language (XML) are available in collections of various document types. In this paper,
we present an approach for the summarisation of XML documents. The novelty of this approach lies in that it is based on features
not only from the content of documents, but also from their logical structure. We follow a machine learning, sentence extraction-based
summarisation technique. To find which features are more effective for producing summaries, this approach views sentence extraction
as an ordering task. We evaluated our summarisation model using the INEX and SUMMAC datasets. The results demonstrate that
the inclusion of features from the logical structure of documents increases the effectiveness of the summariser, and that
the learnable system is also effective and well-suited to the task of summarisation in the context of XML documents. Our approach
is generic, and is therefore applicable, apart from entire documents, to elements of varying granularity within the XML tree.
We view these results as a step towards the intelligent summarisation of XML documents.
10.
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: instead of retrieving whole documents, the system should retrieve document components that are exhaustive to the information need while at the same time being as specific as possible. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on the definition of a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric with the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella Kazai
11.
Andy Weissberg 《Publishing Research Quarterly》2008,24(4):255-260
This article analyzes current industry practices toward the identification of digital book content. It highlights key technology
trends, workflow considerations and supply chain behaviors, and examines the implications of these trends and behaviors on
the production, discoverability, purchasing and consumption of digital book products.
12.
We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every
keystroke display those completions of the last query word that would lead to the best hits, and also display the best such
hits. The following problem is at the core of this feature: for a fixed document collection, given a set D of documents, and an alphabetical range W of words, compute the set of all word-in-document pairs (w, d) from the collection such that w ∈ W and d ∈ D. We present a new data structure with the help of which such autocompletion queries can be processed, on the average, in
time linear in the input plus output size, independent of the size of the underlying document collection. At the same time,
our data structure uses no more space than an inverted index. Actual query processing times on a large test collection correlate
almost perfectly with our theoretical bound.
Ingmar Weber
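The core query defined above, reporting all word-in-document pairs (w, d) with w in an alphabetical range W and d in a document set D, can be sketched over a plain inverted index with a binary search on the sorted vocabulary. This simple version is not the paper's space-efficient data structure, just an executable statement of the problem.

```python
import bisect

def autocomplete_matches(index, sorted_words, D, prefix):
    """index: {word: set(doc_ids)}; sorted_words: the vocabulary, sorted.
    W is the range of all words completing `prefix`; return every
    (word, doc) pair with word in W and doc in D."""
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + "\uffff")
    return [(w, d) for w in sorted_words[lo:hi] for d in index[w] & D]
```

With every keystroke, D is the hit set of the query so far and prefix is the partially typed last word; the returned pairs give both the completions and the documents to display.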
13.
Query Expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between the original query terms and the terms of a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC
Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting
the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect
to two system design issues: IR framework and knowledge source used for expansion. We present a query expansion framework
that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness
of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome
of these experiments, we discuss the success factors required for effective query expansion with respect to various sources
of term expansion, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and
domain-independent ontologies and databases. Our results show that choice of document ranking algorithm is the most important
factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find
that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over a baseline
system.
Nicola Stokes
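One of the expansion sources compared above, pseudo-relevance feedback, can be sketched minimally: treat the top-ranked documents as relevant and append their most frequent non-query terms to the query. Term selection by raw frequency is an assumed simplification of the methods actually evaluated.

```python
from collections import Counter

def prf_expand(query, feedback_docs, k=5):
    """Pseudo-relevance feedback sketch: take the k most frequent
    non-query terms from the (assumed relevant) top-ranked documents
    and append them to the query."""
    counts = Counter()
    for d in feedback_docs:
        counts.update(w for w in d.split() if w not in query)
    return list(query) + [w for w, _ in counts.most_common(k)]
```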
14.
We present software that generates phrase-based concordances in real-time based on Internet searching. When a user enters
a string of words for which he wants to find concordances, the system sends this string as a query to a search engine and
obtains search results for the string. The concordances are extracted by performing statistical analysis on search results
and then fed back to the user. Unlike existing tools, this concordance consultation tool is language-independent, so concordances
can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed
that concordances can be obtained more effectively than by using a search engine directly.
Yuichiro Ishii
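The statistical-analysis step described above can be sketched language-independently: scan the snippets returned for the query phrase and rank the word sequences that most frequently follow it. Whitespace tokenization and frequency ranking are illustrative simplifications of the tool's actual analysis.

```python
from collections import Counter

def concordances(snippets, phrase, window=2, top=5):
    """Collect the `window` tokens immediately following `phrase` in each
    snippet and rank the continuations by frequency. The range bound
    guarantees at least one token follows each match."""
    counts = Counter()
    p = phrase.split()
    for s in snippets:
        toks = s.split()
        for i in range(len(toks) - len(p)):
            if toks[i:i + len(p)] == p:
                counts[" ".join(toks[i + len(p):i + len(p) + window])] += 1
    return [c for c, _ in counts.most_common(top)]
```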
15.
In retrieving medical free text, users are often interested in answers pertinent to certain scenarios that correspond to common
tasks performed in medical practice, e.g., treatment or diagnosis of a disease. A major challenge in handling such queries is that scenario terms in the query (e.g., treatment) are often too general to match specialized terms in relevant documents (e.g., chemotherapy). In this paper, we propose a knowledge-based query expansion method that exploits the UMLS knowledge source to append the
original query with additional terms that are specifically relevant to the query's scenario(s). We compared the proposed method
with traditional statistical expansion that expands terms which are statistically correlated but not necessarily scenario
specific. Our study on two standard testbeds shows that the knowledge-based method, by providing scenario-specific expansion,
yields notable improvements over the statistical method in terms of average precision-recall. On the OHSUMED testbed, for
example, the improvement is more than 5% averaging over all scenario-specific queries studied and about 10% for queries that
mention certain scenarios, such as treatment of a disease and differential diagnosis of a symptom/disease.
Wesley W. Chu
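The scenario-specific expansion described above can be sketched as a lookup that replaces a general scenario term's coverage with specialized terms drawn from a knowledge source. The mapping below is a hypothetical stand-in for UMLS-derived relations, not actual UMLS content.

```python
def scenario_expand(query, scenario_terms):
    """Append, for each general scenario term in the query (e.g.
    'treatment'), the specialized terms the knowledge map associates
    with it, preserving the original query terms first."""
    expanded = list(query)
    for t in query:
        expanded += [s for s in scenario_terms.get(t, []) if s not in expanded]
    return expanded
```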
16.
Filip Boudrez 《Archival Science》2007,7(2):179-193
This paper gives an overview of the archival issues that relate to digitally signed documents. First, by way of introduction,
the advanced digital signature is presented briefly. In the second part, a number of problems are discussed that present themselves
when a digital signature is used as a proof of authenticity and integrity for digital documents in general. In particular,
the paper also investigates whether it makes sense for the archivist to digitally sign all electronic records under
his or her management. Problems relating to the (medium) long-term archiving of digitally signed documents are dealt with
in the third part. After an overview of the sticking points for long-term validation (“Archival issues”) a number of possible
solutions are discussed (“Solutions for long-term archiving”).
17.
Fotis Lazarinis Jesús Vilares John Tait Efthimis N. Efthimiadis 《Information Retrieval》2009,12(3):230-250
With increasing numbers of non-English-language web searchers, the problems of efficiently handling non-English Web
documents and user queries are becoming major issues for search engines. The main aim of this review paper is to make researchers
aware of the existing problems in monolingual non-English Web retrieval by providing an overview of open issues. A significant
number of papers are reviewed and the research issues investigated in these studies are categorized in order to identify the
research questions and solutions proposed in these papers. Further research is proposed at the end of each section.
18.
To put an end to the large copyright trade deficit, both Chinese government agencies and publishing houses have been striving
to enter the international publication market. The article analyzes the background of the going-global strategy and sums
up the performance of both Chinese administrations and publishers.
Qing Fang (corresponding author)
19.
Beatrice S. Bartlett 《Archival Science》2007,7(4):369-390
This article describes the first half century of the Communist government’s supervision and management of the central-government
archives of the last two dynasties. Immediately with the Communist ascent to power in 1949, the new government took great
interest in assembling and protecting the country’s archival documents, readying the Ming-Qing archives for access to scholars,
and preparing for publication of selected materials. By the 1980s Beijing’s Number One Historical Archives, in charge of the
largest holding of Ming-Qing documents, had become the first Chinese authority to complete a full sorting and preliminary
catalogues for such a collection. Moreover, to facilitate searches, an attempt has recently begun to create a subject-heading
system for these and other holdings in the country. In the first half century’s final decades, foreign researchers were admitted
for the first time and tours and international exchanges began to take place.
20.
Document length is widely recognized as an important factor for adjusting retrieval systems. Many models tend to favor the
retrieval of either short or long documents and, thus, a length-based correction needs to be applied for avoiding any length
bias. In Language Modeling for Information Retrieval, smoothing methods are applied to move probability mass from document
terms to unseen words, which is often dependent upon document length. In this article, we perform an in-depth study of this
behavior, characterized by the document length retrieval trends, of three popular smoothing methods across a number of factors,
and its impact on the length of documents retrieved and retrieval performance. First, we theoretically analyze the Jelinek–Mercer,
Dirichlet prior and two-stage smoothing strategies and, then, conduct an empirical analysis. In our analysis we show how Dirichlet
prior smoothing caters for document length more appropriately than Jelinek–Mercer smoothing which leads to its superior retrieval
performance. In a follow up analysis, we posit that length-based priors can be used to offset any bias in the length retrieval
trends stemming from the retrieval formula derived by the smoothing technique. We show that the performance of Jelinek–Mercer
smoothing can be significantly improved by using such a prior, which provides a natural and simple alternative to decouple
the query and document modeling roles of smoothing. With the analysis of retrieval behavior conducted in this article, it
is possible to understand why Dirichlet prior smoothing performs better than Jelinek–Mercer smoothing, and why the performance
of the Jelinek–Mercer method is improved by including a length-based prior.
Leif Azzopardi
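The two smoothing strategies contrasted above have standard closed forms, which make the length dependence visible: Jelinek–Mercer interpolates with a fixed weight, while the Dirichlet prior's effective interpolation weight depends on document length. The scoring functions below are the standard formulations (variable names are ours); coll must supply a collection probability for every query word.

```python
import math

def jm_score(query, doc, coll, lam=0.1):
    """Jelinek-Mercer: p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C).
    doc: {word: count}; coll: {word: collection probability}."""
    dl = sum(doc.values())
    return sum(math.log((1 - lam) * doc.get(w, 0) / dl + lam * coll[w])
               for w in query)

def dirichlet_score(query, doc, coll, mu=2000):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu).
    The collection model's weight mu / (|d| + mu) shrinks as |d| grows,
    so longer documents rely more on their own statistics."""
    dl = sum(doc.values())
    return sum(math.log((doc.get(w, 0) + mu * coll[w]) / (dl + mu))
               for w in query)
```

The length-dependent weight in the Dirichlet form is exactly the behavior the article credits for its more appropriate handling of document length.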