Similar documents
 Found 20 similar documents (search time: 31 ms)
1.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper, we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine applications.
ChengXiang Zhai
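
The abstract does not say which collection-based smoothing method serves as the baseline. As a minimal sketch, assuming Dirichlet prior smoothing (one common choice), the baseline idea of backing off to the whole collection could look like the following. The function name, the toy documents, and the value of `mu` are illustrative assumptions; this is not the graph-propagation method the paper proposes.

```python
from collections import Counter


def dirichlet_smoothed_lm(doc_tokens, collection_tokens, mu=2000.0):
    """Return a function p(w | d) smoothed with the collection model.

    Assumption: Dirichlet prior smoothing,
    p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu).
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    coll_len = sum(coll_counts.values())
    doc_len = len(doc_tokens)

    def prob(word):
        p_coll = coll_counts[word] / coll_len  # background (collection) probability
        return (doc_counts[word] + mu * p_coll) / (doc_len + mu)

    return prob


# Hypothetical toy collection of two documents.
docs = [["graph", "smoothing", "model"], ["query", "model", "retrieval"]]
collection = [t for d in docs for t in d]
p = dirichlet_smoothed_lm(docs[0], collection, mu=10.0)
vocab = sorted(set(collection))
total = sum(p(w) for w in vocab)
```

A query term missing from the first document ("query") still receives non-zero probability, which is the effect smoothing is meant to achieve.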

2.
Collaborative Filtering (CF) systems have been studied extensively for more than a decade to confront the “information overload” problem. Nearest-neighbor CF is based on similarities either between users or between items, to form a neighborhood of users or items, respectively. Recent research has tried to combine the two aforementioned approaches to improve effectiveness. Traditional clustering approaches (k-means or hierarchical clustering) have also been used to speed up the recommendation process. In this paper, we use biclustering to disclose this duality between users and items, by grouping them in both dimensions simultaneously. We propose a novel nearest-biclusters algorithm, which uses a new similarity measure that achieves partial matching of users’ preferences. We apply nearest-biclusters in combination with two different types of biclustering algorithms, Bimax and xMotif, for constant and coherent biclustering, respectively. Extensive performance evaluation results on three real-life data sets are provided, which show that the proposed method substantially improves the performance of the CF process.
Yannis Manolopoulos
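
For orientation, the plain user-based nearest-neighbor CF that the abstract takes as its starting point can be sketched as below. This is a generic baseline under stated assumptions (cosine similarity, a similarity-weighted average, hypothetical users and ratings); it is not the nearest-biclusters algorithm, whose similarity measure the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings of item."""
    neighbors = [
        (cosine(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = [(s, r) for s, r in neighbors[:k] if s > 0]
    if not top:
        return None  # no usable neighbor has rated the item
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"a": 5, "b": 4},
    "bob": {"a": 5, "b": 4, "c": 5},
    "carol": {"a": 1, "b": 5, "c": 2},
}
prediction = predict(ratings, "alice", "c")
```

Biclustering, as described in the abstract, would instead group users and items simultaneously before neighbors are chosen.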

3.
To put an end to the large copyright trade deficit, both Chinese government agencies and publishing houses have been striving to enter the international publication market. The article analyzes the background of the going-global strategy and sums up the performance of both Chinese administrations and publishers.
Qing Fang (Corresponding author)

4.
This article examines the archival methods developed by Colbert to train his son in state administration. Based on Colbert’s correspondence with his son, it reveals the practices Colbert thought necessary to collect and manage information in his state encyclopedic archive during the last half of the 17th century.
Jacob Soll

5.
This article analyzes current industry practices toward the identification of digital book content. It highlights key technology trends, workflow considerations and supply chain behaviors, and examines the implications of these trends and behaviors on the production, discoverability, purchasing and consumption of digital book products.
Andy Weissberg

6.
There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment involving human assessments, and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously belong to more than one cluster. As BCubed cannot be directly applied to this task, we propose a modified version of BCubed that avoids the problems found with other metrics.
Felisa Verdejo
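
As a rough illustration of the metric named in the abstract, standard BCubed precision and recall for non-overlapping clusterings can be computed per item and then averaged. The sketch below assumes each item has exactly one cluster and one gold category; the item names and labels are hypothetical, and the modified version for overlapping clusterings proposed in the article is not reproduced here.

```python
def bcubed(clusters, categories):
    """Return (precision, recall) averaged over items.

    clusters and categories each map an item to a single label.
    """
    items = list(clusters)
    precision = recall = 0.0
    for i in items:
        same_cluster = [j for j in items if clusters[j] == clusters[i]]
        same_category = [j for j in items if categories[j] == categories[i]]
        # Items sharing both the cluster and the category of item i (including i).
        correct = sum(1 for j in same_cluster if categories[j] == categories[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_category)
    n = len(items)
    return precision / n, recall / n


# Hypothetical system output and gold labels for four documents.
clusters = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
categories = {"d1": "x", "d2": "x", "d3": "y", "d4": "y"}
p, r = bcubed(clusters, categories)
```

For this toy input the per-item averages work out to a precision of 2/3 and a recall of 3/4.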

7.
Text document clustering provides an effective and intuitive navigation mechanism to organize a large amount of retrieval results by grouping documents in a small number of meaningful classes. Many well-known methods of text clustering make use of a long list of words as vector space, which is often unsatisfactory for a couple of reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms like synonyms or antonyms. Our unsupervised method solves both problems by using ANNIE and WordNet lexical categories and WordNet ontology in order to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.
Diego Reforgiato Recupero

8.
A review and analysis of the rules and regulations including the tax aspects of making an investment in India is presented. The full range from Foreign Direct Investment to different forms of doing business with specific examples from the publishing industry is explored to help understand current policies and regulations.
Sandeep Chaufla

9.
A summary overview of the children’s and young adult publishing industry in China, with a focus on the size of the market, ten major publishing houses, copyright and trends. Special emphasis has been placed on specific transactions for the sale of translation rights from German-language publishers to China and on the minimal activities of German rights sold to Chinese publishers.
Jing Bartz

10.
11.
A comparison of analyses of the Scottish publishing industry carried out in 1992, 2002 and 2007 underscores the fragility of the sector within a small country within the English-language community. A number of indices reveal either stability or stagnation and the picture emerges of the remarkable tenacity of publishing in Scotland. Although there is already a significant and vital element of state support for publishing in Scotland, further intervention will be necessary to ensure fulfilment of its potential.
Alistair McCleery

12.
13.
In recent years graph-ranking based algorithms have been proposed for single document summarization and generic multi-document summarization. The algorithms make use of the “votings” or “recommendations” between sentences to evaluate the importance of the sentences in the documents. This study aims to differentiate the cross-document and within-document relationships between sentences for generic multi-document summarization and adapt the graph-ranking based algorithm for topic-focused summarization. The contributions of this study are two-fold: (1) For generic multi-document summarization, we apply the graph-based ranking algorithm based on each kind of sentence relationship and explore their relative importance for summarization performance. (2) For topic-focused multi-document summarization, we propose to integrate the relevance of the sentences to the specified topic into the graph-ranking based method. Each individual kind of sentence relationship is also differentiated and investigated in the algorithm. Experimental results on DUC 2002–DUC 2005 data demonstrate the great importance of the cross-document relationships between sentences for both generic and topic-focused multi-document summarizations. Even the approach based only on the cross-document relationships can perform better than or at least as well as the approaches based on both kinds of relationships between sentences.
Xiaojun Wan
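
The “voting” between sentences mentioned in the abstract is commonly realized as a PageRank-style iteration over a weighted sentence graph. The following is a generic sketch under that assumption: the similarity weights are hypothetical, the damping factor is a conventional default, and the sketch does not separate cross-document from within-document edges as the study does.

```python
def rank_sentences(similarity, damping=0.85, iterations=50):
    """PageRank-style scores on a weighted sentence graph.

    similarity[i][j] is a non-negative edge weight; the diagonal is ignored.
    """
    n = len(similarity)
    # Total outgoing weight per sentence, used to normalize its votes.
    out_sum = [
        sum(w for j, w in enumerate(row) if j != i)
        for i, row in enumerate(similarity)
    ]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            votes = sum(
                scores[i] * similarity[i][j] / out_sum[i]
                for i in range(n)
                if i != j and out_sum[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * votes)
        scores = new_scores
    return scores


# Hypothetical pairwise sentence similarities for three sentences.
similarity = [
    [0.0, 0.6, 0.5],
    [0.6, 0.0, 0.1],
    [0.5, 0.1, 0.0],
]
scores = rank_sentences(similarity)
best = max(range(len(scores)), key=scores.__getitem__)
```

One way to explore the abstract's distinction would be to build two such matrices, one per relationship type, and rank with each separately or with a weighted combination.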

14.
Evaluating the effectiveness of content-oriented XML retrieval methods (total citations: 1; self-citations: 0; citations by others: 1)
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: instead of retrieving whole documents, document components that are exhaustive to the information need while at the same time being as specific as possible should be retrieved. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on the definition of a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric with the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella Kazai

15.
On rank-based effectiveness measures and optimization (total citations: 1; self-citations: 0; citations by others: 1)
Many current retrieval models and scoring functions contain free parameters which need to be set—ideally, optimized. The process of optimization normally involves some training corpus of the usual document-query-relevance judgement type, and some choice of measure that is to be optimized. The paper proposes a way to think about the process of exploring the space of parameter values, and how moving around in this space might be expected to affect different measures. One result, concerning local optima, is demonstrated for a range of rank-based evaluation measures.
Hugo Zaragoza
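
One property that makes this kind of optimization awkward is that a rank-based measure depends only on the ordering of documents, so it stays constant while a parameter moves without changing the ranking and then jumps when two documents swap. The sketch below illustrates this with average precision and a hypothetical one-parameter scoring function; the features `a` and `b`, the documents, and the weights are all assumptions, and the sketch does not reproduce the paper's result on local optima.

```python
def average_precision(ranked_relevance):
    """Average precision of a ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def ap_for_weight(w, docs):
    """Rank docs by the hypothetical score w * a + (1 - w) * b, then compute AP."""
    ranked = sorted(docs, key=lambda d: w * d["a"] + (1 - w) * d["b"], reverse=True)
    return average_precision([d["rel"] for d in ranked])


# Hypothetical feature values and relevance judgements for one query.
docs = [
    {"a": 0.9, "b": 0.1, "rel": 1},
    {"a": 0.2, "b": 0.8, "rel": 0},
    {"a": 0.5, "b": 0.4, "rel": 1},
]
ap_low = ap_for_weight(0.1, docs)
ap_low_nearby = ap_for_weight(0.2, docs)  # same ranking as w = 0.1
ap_high = ap_for_weight(0.9, docs)        # relevant documents move to the top
```

Because the measure is flat between ranking changes, gradient-based search gets no signal from it directly, which is one reason the choice of optimization procedure matters.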

16.
Exploring criteria for successful query expansion in the genomic domain (total citations: 1; self-citations: 0; citations by others: 1)
Query Expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between the original query terms and a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect to two system design issues: IR framework and knowledge source used for expansion. We present a query expansion framework that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome of these experiments, we discuss the success factors required for effective query expansion with respect to various sources of term expansion, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and domain-independent ontologies and databases. Our results show that choice of document ranking algorithm is the most important factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over a baseline system.
Nicola Stokes

17.
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to using single documents to this end.
Oren Kurland

18.
We present software that generates phrase-based concordances in real-time based on Internet searching. When a user enters a string of words for which he wants to find concordances, the system sends this string as a query to a search engine and obtains search results for the string. The concordances are extracted by performing statistical analysis on search results and then fed back to the user. Unlike existing tools, this concordance consultation tool is language-independent, so concordances can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed that concordances can be obtained more effectively than by only using a search engine directly.
Yuichiro Ishii

19.
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue for the adoption of score regularization as a general design principle or post-processing step for information retrieval systems.
Fernando Diaz
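
The idea of pulling the scores of related documents toward each other can be sketched with a simple iterative graph-smoothing update. This is a generic illustration, not the algorithm from the paper: the update rule, the `alpha` trade-off, the affinity matrix, and the initial scores are all assumptions made for the example.

```python
def regularize_scores(scores, affinity, alpha=0.5, iterations=100):
    """Blend each initial score with the affinity-weighted mean of its neighbors.

    Repeats f[i] = (1 - alpha) * scores[i] + alpha * mean_neighbor(f, i).
    """
    n = len(scores)
    row_sums = [sum(row) for row in affinity]
    smoothed = list(scores)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if row_sums[i] > 0:
                neighbor_mean = (
                    sum(affinity[i][j] * smoothed[j] for j in range(n)) / row_sums[i]
                )
            else:
                neighbor_mean = smoothed[i]  # isolated document keeps its score
            updated.append((1 - alpha) * scores[i] + alpha * neighbor_mean)
        smoothed = updated
    return smoothed


# Hypothetical initial retrieval scores; documents 0 and 1 are topically related.
initial = [1.0, 0.0, 0.2]
affinity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
smoothed = regularize_scores(initial, affinity)
```

In this toy case the two related documents end up closer in score than they started, while the isolated document is left unchanged.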

20.
This article analyses the extent to which archival exemptions for historical, scientific and statistical research in privacy legislation support preservation in selected European Union countries, and comparable aspects of Australian, American and Canadian law within a legal, ethical and digital archival perspective. The authors recommend that the further processing of personal data under data protection law be given a wider scope of interpretation for archival preservation purposes in both the public and private sector, coupled with the use of researcher and archival codes in relation to access to personal data. They also recommend early appraisal and integration of privacy with freedom of information and archival regimes.
Malcolm Todd
