Similar documents
1.
Automatic document classification can be used to organize documents in a digital library, construct on-line directories, improve the precision of web searching, or support interactions between users and search engines. In this paper we explore how linkage information inherent to different document collections can be used to enhance the effectiveness of classification algorithms. We have experimented with three link-based bibliometric measures, co-citation, bibliographic coupling and Amsler, on three different document collections: a digital library of computer science papers, a web directory and an on-line encyclopedia. Results show that both hyperlink and citation information can be used to learn reliable and effective classifiers based on a kNN classifier. In one of the test collections used, we obtained improvements of up to 69.8% in macro-averaged F1 over the traditional text-based kNN classifier, considered as the baseline measure in our experiments. We also present alternative ways of combining bibliometric-based classifiers with text-based classifiers. Finally, we conducted studies to analyze the situations in which the bibliometric-based classifiers failed and show that in such cases it is hard to reach consensus regarding the correct classes, even for human judges.
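
The three link-based measures named in this abstract can be sketched as set overlaps over a citation graph. This is only an illustration: the Jaccard-style normalization, the function names, and the toy graph are assumptions, since the abstract does not give the exact formulas the authors used.

```python
def jaccard(a, b):
    """Overlap of two sets, normalized by the size of their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_similarities(cites, x, y):
    """Compare documents x and y; `cites` maps each document id to the set of ids it cites."""
    def parents(d):
        # Documents that cite d (in-links).
        return {src for src, targets in cites.items() if d in targets}

    def children(d):
        # Documents that d cites (out-links).
        return cites.get(d, set())

    px, py = parents(x), parents(y)
    cx, cy = children(x), children(y)
    return {
        # Co-citation: x and y are cited by the same documents.
        "co_citation": jaccard(px, py),
        # Bibliographic coupling: x and y cite the same documents.
        "bib_coupling": jaccard(cx, cy),
        # Amsler: overlap of in-links and out-links taken together.
        "amsler": jaccard(px | cx, py | cy),
    }


# Hypothetical toy graph: "r" cites both "a" and "b"; "a" and "b" both cite "c".
cites = {"a": {"c", "d"}, "b": {"c", "e"}, "r": {"a", "b"}, "s": {"a"}}
sims = link_similarities(cites, "a", "b")
```

A kNN classifier of the kind the abstract mentions could use any of these scores in place of a text similarity when picking the nearest labeled neighbors.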

2.
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
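
The three queries defined in this abstract can be pinned down with a brute-force baseline. The sketch below scans every document on every query, so it is not the compressed index the authors describe; it only fixes the expected answers. The function names and the toy collection are assumptions made for illustration.

```python
from collections import Counter


def occurrence_counts(docs, pattern):
    """Map document index -> number of (possibly overlapping) occurrences of pattern."""
    counts = Counter()
    for doc_id, text in enumerate(docs):
        hits = sum(
            text.startswith(pattern, i)
            for i in range(len(text) - len(pattern) + 1)
        )
        if hits:
            counts[doc_id] = hits
    return counts


def document_listing(docs, pattern):
    """All documents where the pattern appears."""
    return sorted(occurrence_counts(docs, pattern))


def document_counting(docs, pattern):
    """Number of documents where the pattern appears."""
    return len(occurrence_counts(docs, pattern))


def top_k(docs, pattern, k):
    """The k documents where the pattern appears most often."""
    return [doc_id for doc_id, _ in occurrence_counts(docs, pattern).most_common(k)]


# Hypothetical toy collection.
docs = ["abracadabra", "cadabra", "xyz"]
listing = document_listing(docs, "abra")
count = document_counting(docs, "abra")
best = top_k(docs, "abra", 1)
```

Each query here costs time proportional to the total collection size, which is the cost a dedicated index is meant to avoid.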

3.
Document clustering of scientific texts using citation contexts   (Cited by: 3; self-citations: 0; cited by others: 3)
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. 
More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.

4.
We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally, we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.
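
The role of k1 in this abstract can be seen in a minimal sketch of BM25 term saturation. Two simplifications are assumptions made here and may differ from the paper's setup: document-length normalization is left out, and the idf uses the common smoothed form with +1 inside the logarithm. The function names are hypothetical.

```python
import math


def bm25_saturation(tf, k1):
    """BM25 term-frequency component without length normalization (assumes k1 > 0).

    Grows with tf but never exceeds k1 + 1.
    """
    return tf * (k1 + 1) / (tf + k1)


def bm25_weight(tf, df, n_docs, k1):
    """Saturated tf combined with a smoothed idf (one common variant, assumed here)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * bm25_saturation(tf, k1)


# With a small k1 the weight flattens after a few occurrences;
# with a larger k1 repeated terms keep adding weight for longer.
low_k1 = [bm25_saturation(tf, 1.2) for tf in (1, 5, 10)]
high_k1 = [bm25_saturation(tf, 20.0) for tf in (1, 5, 10)]
```

A higher k1 makes the feature behave more like raw tf, which is one way to read the abstract's finding that clustering prefers larger values than typical retrieval settings.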

5.
Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of approximately complete judgments must become invalid. This paper shows that the judgment sets produced by traditional pooling when the pools are too small relative to the total document set size can be biased in that they favor relevant documents that contain topic title words. This phenomenon is wholly dependent on the collection size and does not depend on the number of relevant documents for a given topic. We show that the AQUAINT test collection constructed in the recent TREC 2005 workshop exhibits this biased relevance set; it is likely that the test collections based on the much larger GOV2 document set also exhibit the bias. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
Ellen Voorhees

6.
Search facilitated with agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) that was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one by (A) considering all the relevant and (B) only the highly relevant documents relevant, respectively. Single linkage (SL) was the worst method. It created many tiny clusters, and, consequently, searches enabled with it had high precision and low recall. The complete linkage (CL), average linkage (AL), and Ward's methods (WM) returned reasonably-sized clusters typically of 18–32 documents. Their recall (A: 27–52%, B: 50–82%) and precision (A: 83–90%, B: 18–21%) were higher than and comparable to those of the SL clusters, respectively. The AL and WM clustering had 1–8% better effectiveness than nearest neighbor searching (NN), and SL and CL were 1–9% less efficient than NN. However, the differences were statistically insignificant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clustering offer better retrieval ability than NN. Assessment B renders the AL and WM clustering better than NN, when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered like collections in English, provided that documents are appropriately preprocessed.

7.
Streaming data poses a variety of new and interesting challenges for information retrieval and text analysis. Unlike static document collections, which are typically analyzed and indexed off-line to support ad-hoc queries, streaming data often must be analyzed on the fly and acted on as the data passes through the analysis system. Speech is one example of streaming data that is a challenge to exploit, yet has significant potential to provide value in a knowledge management system. We are specifically interested in techniques that analyze streaming data and automatically find collateral information, or information that clarifies, expands, and generally enhances the value of the streaming data. We present a system that analyzes a data stream and automatically finds documents related to the current topic of discussion in the data stream. Experimental results show that the system generates result lists with an average precision at 10 hits of better than 60%. We also present a hit-list re-ranking technique based on named entity analysis and automatic text categorization that can improve the search results by 6%–12%.

8.
Text document clustering provides an effective and intuitive navigation mechanism to organize a large amount of retrieval results by grouping documents in a small number of meaningful classes. Many well-known methods of text clustering make use of a long list of words as vector space, which is often unsatisfactory for a couple of reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms like synonyms or antonyms. Our unsupervised method solves both problems by using ANNIE and WordNet lexical categories and WordNet ontology in order to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen the bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.
Diego Reforgiato Recupero

9.
The infrastructure of a typical search engine can be used to calculate and resolve persistent document identifiers: strings that can uniquely identify and locate a document on the Internet without reference to its original location (URL). Bookmarking a document using such an identifier allows its retrieval even if the document's URL, and, in many cases, its contents change. Web client applications can offer facilities for users to bookmark a page by reference to a search engine and the persistent identifier instead of the original URL. The identifiers are calculated using a global Internet term index; a document's unique identifier consists of a word or word combination that occurs uniquely in the specific document. We use a genetic algorithm to locate a minimal unique document identifier: the shortest word or word combination that will locate the document. We tested our approach by implementing tools for indexing a document collection, calculating the persistent identifiers, performing queries, and distributing the computation and storage load among many computers.

10.
Automatic detection of source code plagiarism is an important research field both for the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top-ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST-based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST-based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.

11.
Cluster-based and passage-based document retrieval paradigms were shown to be effective. While the former are based on utilizing query-related corpus context manifested in clusters of similar documents, the latter address the fact that a document can be relevant even if only a very small part of it contains query-pertaining information. Hence, cluster-based approaches could be viewed as based on “expanding” the document representation, while passage-based approaches can be thought of as utilizing a “contracted” document representation. We present a study of the relative benefits of using each of these two approaches, and of the potential merits of their integration. To that end, we devise two methods that integrate whole-document-based, cluster-based and passage-based information. The methods are applied for the re-ranking task, that is, re-ordering documents in an initially retrieved list so as to improve precision at the very top ranks. Extensive empirical evaluation attests to the potential merits of integrating these information types. Specifically, the resultant performance substantially exceeds that of the initial ranking and is often better than that of a state-of-the-art pseudo-feedback-based query expansion approach.

12.
On Collection Size and Retrieval Effectiveness   (Cited by: 3; self-citations: 0; cited by others: 3)
The relationship between collection size and retrieval effectiveness is particularly important in the context of Web search. We investigate it first analytically and then experimentally, using samples and subsets of test collections. Different retrieval systems vary in how the score assigned to an individual document in a sample collection relates to the score it receives in the full collection; we identify four cases. We apply signal detection (SD) theory to retrieval from samples, taking into account the four cases and using a variety of shapes for relevant and irrelevant distributions. We note that the SD model subsumes several earlier hypotheses about the causes of the decreased precision in samples. We also discuss other models which contribute to an understanding of the phenomenon, particularly relating to the effects of discreteness. Different models provide complementary insights. Extensive use is made of test data, some from official submissions to the TREC-6 VLC track and some new, to illustrate the effects and test hypotheses. We empirically confirm predictions, based on SD theory, that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant. SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves. We plot OC curves of this type for a real retrieval system and query set and show that curves for sample collections are similar but not identical to the curve for the full collection.
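
The three effectiveness measures compared in this abstract (P@n, R-precision and average precision) have standard definitions, sketched below for a single query. The function names and the toy ranking are assumptions made for illustration; nothing here reproduces the paper's data.

```python
def precision_at(ranking, relevant, n):
    """P@n: fraction of the top n retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(ranking, relevant, len(relevant)) if relevant else 0.0


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document retrieved."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked list and relevance judgments for one query.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p_at_2 = precision_at(ranking, relevant, 2)
r_prec = r_precision(ranking, relevant)
avg_prec = average_precision(ranking, relevant)
```

P@n uses a fixed cutoff, while R-precision and average precision scale with the number of relevant documents. That difference is consistent with the abstract's prediction that only P@n declines when moving to a sample collection.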

13.
Documents formatted in eXtensible Markup Language (XML) are available in collections of various document types. In this paper, we present an approach for the summarisation of XML documents. The novelty of this approach lies in that it is based on features not only from the content of documents, but also from their logical structure. We follow a machine learning, sentence extraction-based summarisation technique. To find which features are more effective for producing summaries, this approach views sentence extraction as an ordering task. We evaluated our summarisation model using the INEX and SUMMAC datasets. The results demonstrate that the inclusion of features from the logical structure of documents increases the effectiveness of the summariser, and that the learnable system is also effective and well-suited to the task of summarisation in the context of XML documents. Our approach is generic, and is therefore applicable, apart from entire documents, to elements of varying granularity within the XML tree. We view these results as a step towards the intelligent summarisation of XML documents.
Mounia Lalmas

14.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.

15.
In libraries and archives, efforts to document underrepresented communities and diversify collections can be fraught with political tension. We explore an interdepartmental collaboration to create and preserve a digital collection documenting the Urban Native Relocation Program of the mid- to late-twentieth century in Grand Rapids, Michigan. Involving the Grand Valley State University Libraries, the Kutsche Office of Local History, and the university's Native American Advisory Board, the project serves as a model not just for collaborative collection development but also for community engagement and outreach. We find that process is as important as product in developing collaborative digital collections.

16.
Document clustering offers the potential of supporting users in interactive retrieval, especially when users have problems in specifying their information need precisely. In this paper, we present a theoretical foundation for optimum document clustering. The key idea is to base cluster analysis and evaluation on a set of queries, by defining documents as being similar if they are relevant to the same queries. Three components are essential within our optimum clustering framework, OCF: (1) a set of queries, (2) a probabilistic retrieval method, and (3) a document similarity metric. After introducing an appropriate validity measure, we define optimum clustering with respect to the estimates of the relevance probability for the query-document pairs under consideration. Moreover, we show that well-known clustering methods are implicitly based on the three components, but that they use heuristic design decisions for some of them. We argue that with our framework more targeted research for developing better document clustering methods becomes possible. Experimental results demonstrate the potential of our considerations.

17.
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to using single documents to this end.
Oren Kurland

18.
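The Identifier Structure of Document Systems   (Cited by: 1; self-citations: 0; cited by others: 1)
This paper explains why document identifiers are necessary for organizing documents and what identifier elements contribute to that organization, analyzes the properties of the set of identifier elements, and derives a model of the identifier structure.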

19.
In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c′′”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.

20.
This study is intended to answer the following questions: Have university libraries in Taiwan acquired books recommended by selection aids? Can university library collections in Taiwan sufficiently fill users' requests? Do libraries of different collection sizes process users' requests in an unbiased and efficient manner? This study employed interviews and document delivery tests. Librarians from 15 university libraries were interviewed to investigate their thoughts on ILL services. Samples from Choice and Bowker's Global Books in Print were searched for through the online catalogs of 148 university libraries. This study concludes that resource sharing will not operate in an equitable way if most user requests are only directed towards a small percentage of Taiwan's university libraries. Without an adequate quantity of volumes in the library collections of any region or country, resource sharing cannot be realized.
