1.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper,
we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph
of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively
propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the
simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures,
our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms
in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine
applications.
ChengXiang Zhai
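For context, the collection-based smoothing that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration that assumes Jelinek-Mercer interpolation; the abstract does not say which collection-based method was used, and the function name, parameter names, and the weight `lam` are hypothetical. It does not implement the graph-based count propagation proposed in the paper.

```python
from collections import Counter


def jm_smoothed_model(doc_tokens, collection_counts, collection_len, lam=0.5):
    """Return a function giving p(term | document), smoothed with the collection.

    Assumption: Jelinek-Mercer interpolation,
        p(w|d) = (1 - lam) * c(w, d) / |d| + lam * c(w, C) / |C|
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def prob(term):
        # Maximum-likelihood estimate from the document alone
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        # Background estimate from the whole collection
        p_coll = collection_counts.get(term, 0) / collection_len
        return (1 - lam) * p_doc + lam * p_coll

    return prob


# Toy collection of 8 tokens; the document never mentions "d"
collection = {"a": 2, "b": 1, "c": 1, "d": 4}
p = jm_smoothed_model(["a", "a", "b", "c"], collection, collection_len=8)
print(p("a"), p("d"))
```

A query term missing from the document ("d" here) still receives non-zero probability from the collection model, which is the effect that smoothing is meant to achieve.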
2.
Panagiotis Symeonidis Alexandros Nanopoulos Apostolos N. Papadopoulos Yannis Manolopoulos 《Information Retrieval》2008,11(1):51-75
Collaborative Filtering (CF) Systems have been studied extensively for more than a decade to confront the “information overload”
problem. Nearest-neighbor CF is based either on similarities between users or between items, to form a neighborhood of users
or items, respectively. Recent research has tried to combine the two aforementioned approaches to improve effectiveness. Traditional
clustering approaches (k-means or hierarchical clustering) have also been used to speed up the recommendation process. In this paper, we use biclustering
to disclose this duality between users and items, by grouping them in both dimensions simultaneously. We propose a novel nearest-biclusters
algorithm, which uses a new similarity measure that achieves partial matching of users’ preferences. We apply nearest-biclusters
in combination with two different types of biclustering algorithms—Bimax and xMotif—for constant and coherent biclustering,
respectively. Extensive performance evaluation results on three real-life data sets are provided, which show that the proposed
method substantially improves the performance of the CF process.
Yannis Manolopoulos
3.
To put an end to the large copyright trade deficit, both Chinese government agencies and publishing houses have been striving
to enter the international publication market. The article analyzes the background of the going-global strategy and sums
up the performance of both Chinese administrations and publishers.
Qing Fang (Corresponding author)
4.
Jacob Soll 《Archival Science》2007,7(4):331-342
This article examines the archival methods developed by Colbert to train his son in state administration. Based on Colbert’s
correspondence with his son, it reveals the practices Colbert thought necessary to collect and manage information in his state
encyclopedic archive during the last half of the 17th century.
Jacob Soll
5.
Andy Weissberg 《Publishing Research Quarterly》2008,24(4):255-260
This article analyzes current industry practices toward the identification of digital book content. It highlights key technology
trends, workflow considerations and supply chain behaviors, and examines the implications of these trends and behaviors on
the production, discoverability, purchasing and consumption of digital book products.
Andy Weissberg
6.
There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article,
we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering
are captured by different metric families. These formal constraints are validated in an experiment involving human assessments,
and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously
belong to more than one cluster. As BCubed cannot be directly applied to this task, we propose a modified version of BCubed
that avoids the problems found with other metrics.
Felisa Verdejo
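The standard BCubed precision and recall for non-overlapping clusterings can be sketched as below. This is an illustrative implementation of the commonly used definition, not the modified overlapping-clustering version proposed in the article; the function name and the dictionary-based input format are assumptions.

```python
from collections import Counter


def bcubed(clusters, categories):
    """Return (precision, recall) averaged over all items.

    clusters:   dict mapping item -> system cluster label
    categories: dict mapping item -> gold category label
    Assumes hard clustering: every item has exactly one label of each kind.
    """
    items = list(clusters)
    cluster_sizes = Counter(clusters[i] for i in items)
    category_sizes = Counter(categories[i] for i in items)
    # Number of items sharing both a given cluster and a given category
    same_both = Counter((clusters[i], categories[i]) for i in items)

    precision = recall = 0.0
    for i in items:
        shared = same_both[(clusters[i], categories[i])]
        # Share of the item's cluster that has the item's category
        precision += shared / cluster_sizes[clusters[i]]
        # Share of the item's category that landed in the item's cluster
        recall += shared / category_sizes[categories[i]]
    n = len(items)
    return precision / n, recall / n


system = {"a": 1, "b": 1, "c": 1, "d": 2}
gold = {"a": "x", "b": "x", "c": "y", "d": "y"}
print(bcubed(system, gold))
```

Because the score is computed per item and then averaged, large and small clusters contribute in proportion to the number of items they contain.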
7.
Diego Reforgiato Recupero 《Information Retrieval》2007,10(6):563-579
Text document clustering provides an effective and intuitive navigation mechanism to organize a large amount of retrieval
results by grouping documents in a small number of meaningful classes. Many well-known methods of text clustering make use
of a long list of words as vector space which is often unsatisfactory for a couple of reasons: first, it keeps the dimensionality
of the data very high, and second, it ignores important relationships between terms like synonyms or antonyms. Our unsupervised
method solves both problems by using ANNIE and WordNet lexical categories and WordNet ontology in order to create a well structured
document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step
we have chosen the bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for, respectively, their accuracy and
speed.
Diego Reforgiato Recupero
8.
Sandeep Chaufla 《Publishing Research Quarterly》2008,24(3):187-201
A review and analysis of the rules and regulations including the tax aspects of making an investment in India is presented.
The full range from Foreign Direct Investment to different forms of doing business with specific examples from the publishing
industry is explored to help understand current policies and regulations.
Sandeep Chaufla
9.
A summary overview of the children’s and young adult publishing industry in China with a focus on the size of the market,
ten major publishing houses, copyright and trends. Special emphasis has been placed on specific transactions for the sale of
translation rights from German language publishers to China and minimal activities of German rights sold to Chinese publishers.
Jing Bartz
10.
11.
A comparison of analyses of the Scottish publishing industry carried out in 1992, 2002 and 2007 underscores the fragility
of the sector within a small country within the English-language community. A number of indices reveal either stability or
stagnation and the picture emerges of the remarkable tenacity of publishing in Scotland. Although there is already a significant
and vital element of state support for publishing in Scotland, further intervention will be necessary to ensure fulfilment
of its potential.
Alistair McCleery
12.
13.
Xiaojun Wan 《Information Retrieval》2008,11(1):25-49
In recent years graph-ranking based algorithms have been proposed for single document summarization and generic multi-document
summarization. The algorithms make use of the “votings” or “recommendations” between sentences to evaluate the importance
of the sentences in the documents. This study aims to differentiate the cross-document and within-document relationships between
sentences for generic multi-document summarization and adapt the graph-ranking based algorithm for topic-focused summarization.
The contributions of this study are two-fold: (1) For generic multi-document summarization, we apply the graph-based ranking
algorithm based on each kind of sentence relationship and explore their relative importance for summarization performance.
(2) For topic-focused multi-document summarization, we propose to integrate the relevance of the sentences to the specified
topic into the graph-ranking based method. Each individual kind of sentence relationship is also differentiated and investigated
in the algorithm. Experimental results on DUC 2002–DUC 2005 data demonstrate the great importance of the cross-document relationships
between sentences for both generic and topic-focused multi-document summarizations. Even the approach based only on the cross-document
relationships can perform better than or at least as well as the approaches based on both kinds of relationships between sentences.
Xiaojun Wan
14.
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: instead of retrieving whole documents, document components that are exhaustive to the information need while at the same time being as specific as possible should be retrieved. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on the definition of a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric with the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella Kazai
15.
On rank-based effectiveness measures and optimization (total citations: 1; self-citations: 0; citations by others: 1)
Many current retrieval models and scoring functions contain free parameters which need to be set—ideally, optimized. The process
of optimization normally involves some training corpus of the usual document-query-relevance judgement type, and some choice
of measure that is to be optimized. The paper proposes a way to think about the process of exploring the space of parameter
values, and how moving around in this space might be expected to affect different measures. One result, concerning local optima,
is demonstrated for a range of rank-based evaluation measures.
Hugo Zaragoza
16.
Query Expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between
the original query terms and a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC
Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting
the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect
to two system design issues: IR framework and knowledge source used for expansion. We present a query expansion framework
that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness
of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome
of these experiments, we discuss the success factors required for effective query expansion with respect to various sources
of term expansion, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and
domain-independent ontologies and databases. Our results show that choice of document ranking algorithm is the most important
factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find
that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over a baseline
system.
Nicola Stokes
17.
Oren Kurland 《Information Retrieval》2009,12(4):437-460
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based
re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using
information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking
of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent
clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing
method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking
approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further
exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering
algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to
using single documents to this end.
Oren Kurland
18.
We present software that generates phrase-based concordances in real-time based on Internet searching. When a user enters
a string of words for which he wants to find concordances, the system sends this string as a query to a search engine and
obtains search results for the string. The concordances are extracted by performing statistical analysis on search results
and then fed back to the user. Unlike existing tools, this concordance consultation tool is language-independent, so concordances
can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed
that concordances can be obtained more effectively than by only using a search engine directly.
Yuichiro Ishii
19.
Fernando Diaz 《Information Retrieval》2007,10(6):531-562
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have
similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective
by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score
regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique
to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived
from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval
algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance
feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue
for the adoption of score regularization as a general design principle or post-processing step for information retrieval systems.
Fernando Diaz
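The idea that related documents should receive similar scores can be illustrated with a simple iterative smoothing over a document affinity graph. This is a hedged sketch, not the algorithm from the article: the update rule (interpolating each original score with the affinity-weighted average of neighbor scores), the row normalization, and the names `regularize_scores`, `affinity`, and `alpha` are all assumptions made for illustration.

```python
def regularize_scores(scores, affinity, alpha=0.5, iterations=50):
    """Smooth retrieval scores over a document affinity graph.

    scores:   list of initial retrieval scores, one per document
    affinity: square matrix (list of lists) of non-negative similarities
    alpha:    weight given to neighbors; 0 leaves the scores unchanged
    Assumed update: f <- (1 - alpha) * scores + alpha * W f,
    where W is the row-normalized affinity matrix.
    """
    n = len(scores)
    # Row-normalize so each document's neighbor weights sum to 1
    weights = []
    for row in affinity:
        total = sum(row)
        weights.append([w / total if total > 0 else 0.0 for w in row])

    current = list(scores)
    for _ in range(iterations):
        current = [
            (1 - alpha) * scores[i]
            + alpha * sum(weights[i][j] * current[j] for j in range(n))
            for i in range(n)
        ]
    return current


# Documents 0 and 1 are related; document 2 has no neighbors
initial = [1.0, 0.0, 0.0]
graph = [[0, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
print(regularize_scores(initial, graph))
```

In this toy run the two related documents move toward each other (the gap between them shrinks from 1 to about 1/3), while the isolated document keeps its score. The procedure only needs scores and a similarity graph, which matches the abstract's point that the technique can be applied to the output of an arbitrary initial retrieval.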
20.
This article analyses the extent to which archival exemptions for historical, scientific and statistical research in privacy
legislation support preservation in selected European Union countries, and comparable aspects of Australian, American and
Canadian law within a legal, ethical and digital archival perspective. The authors recommend that the further processing of
personal data under data protection law be given a wider scope of interpretation for archival preservation purposes in both
the public and private sector, coupled with the use of researcher and archival codes in relation to access to personal data.
They also recommend early appraisal and integration of privacy with freedom of information and archival regimes.
Malcolm Todd