Similar literature
Found 20 similar documents (search time: 643 ms)
1.
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to using single documents to this end.
Oren Kurland
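To make the idea of cluster-based smoothing in abstract 1 concrete, here is a minimal sketch rather than the authors' method. It assumes a simple linear interpolation of three unigram models (the document, its query-specific cluster, and the whole initial list) and ranks by query log-likelihood. The function names, the weights `lam` and `beta`, and the use of the initial list as the background model are all assumptions made for illustration.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def rerank(query, docs, clusters, lam=0.6, beta=0.5):
    """Re-rank an initial list by query likelihood under smoothed models.

    docs:     {doc_id: [tokens]} for the initially retrieved list
    clusters: {doc_id: [doc_ids]} members of the cluster assigned to doc_id
    lam/beta: hypothetical interpolation weights (assumptions, not tuned)
    """
    # Background model estimated from the whole initial list (an assumption).
    list_lm = mle([w for toks in docs.values() for w in toks])
    scores = {}
    for doc_id, tokens in docs.items():
        doc_lm = mle(tokens)
        cluster_lm = mle([w for m in clusters[doc_id] for w in docs[m]])
        log_p = 0.0
        for w in query:
            p = lam * doc_lm.get(w, 0.0) + (1 - lam) * (
                beta * cluster_lm.get(w, 0.0)
                + (1 - beta) * list_lm.get(w, 0.0)
            )
            # A query word unseen in the whole list gives probability zero.
            log_p += math.log(p) if p > 0 else float("-inf")
        scores[doc_id] = log_p
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    toy_docs = {
        "d1": ["jaguar", "car"],
        "d2": ["jaguar", "cat"],
        "d3": ["cat", "feline"],
    }
    toy_clusters = {"d1": ["d1"], "d2": ["d2", "d3"], "d3": ["d2", "d3"]}
    print(rerank(["jaguar", "feline"], toy_docs, toy_clusters))
```

In this toy list, d1 and d2 match the query equally well on their own text, but d2 shares a cluster with d3, which contains "feline", so the cluster context lifts d2 above d1.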

2.
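Research on text clustering based on a combined IIG and LSI feature extraction method   Total citations: 8, self-citations: 0, citations by others: 8
This paper performs effective automatic text clustering using a feature extraction method that combines an improved information gain (IIG) feature selection method with latent semantic indexing (LSI). A total of 250 texts were drawn from a corpus. First, text feature vectors were constructed using the vector space model and the improved information gain feature selection method and were clustered with the C-means method; the precision, recall, and F-measure of the clustering results reached 0.82, 0.88, and 0.83, respectively. On this basis, latent semantic indexing was applied to the best feature selection result and the singular value decomposition was truncated. With K = 40 singular values retained, the precision, recall, and F-measure of the clustering results reached 0.95, 0.57, and 0.78, respectively, substantially improving clustering precision while effectively reducing dimensionality.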

3.
Automatic document classification can be used to organize documents in a digital library, construct on-line directories, improve the precision of web searching, or help the interactions between user and search engines. In this paper we explore how linkage information inherent to different document collections can be used to enhance the effectiveness of classification algorithms. We have experimented with three link-based bibliometric measures, co-citation, bibliographic coupling and Amsler, on three different document collections: a digital library of computer science papers, a web directory and an on-line encyclopedia. Results show that both hyperlink and citation information can be used to learn reliable and effective classifiers based on a kNN classifier. In one of the test collections used, we obtained improvements of up to 69.8% in macro-averaged F1 over the traditional text-based kNN classifier, considered as the baseline measure in our experiments. We also present alternative ways of combining bibliometric-based classifiers with text-based classifiers. Finally, we conducted studies to analyze the situations in which the bibliometric-based classifiers failed and show that in such cases it is hard to reach consensus regarding the correct classes, even for human judges.
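The three link-based measures named in abstract 3 can be sketched over a toy citation graph. This is an illustration, not the paper's implementation: it assumes a Jaccard-style normalization (shared links divided by all links of either document), and the function names and the `(citing, cited)` pair input format are hypothetical.

```python
from collections import defaultdict


def build_link_sets(citations):
    """citations: iterable of (citing, cited) pairs."""
    cited_by = defaultdict(set)  # documents that cite the key
    cites = defaultdict(set)     # documents that the key cites
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cited_by, cites


def jaccard(a, b):
    """Shared elements over all elements; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cocitation(x, y, cited_by):
    """Overlap between the documents citing x and those citing y."""
    return jaccard(cited_by.get(x, set()), cited_by.get(y, set()))


def bibliographic_coupling(x, y, cites):
    """Overlap between the references of x and the references of y."""
    return jaccard(cites.get(x, set()), cites.get(y, set()))


def amsler(x, y, cited_by, cites):
    """Overlap between all neighbours (citing or cited) of x and of y."""
    nx = cited_by.get(x, set()) | cites.get(x, set())
    ny = cited_by.get(y, set()) | cites.get(y, set())
    return jaccard(nx, ny)


if __name__ == "__main__":
    toy = [("p1", "a"), ("p1", "b"), ("p2", "a"),
           ("a", "r1"), ("a", "r2"), ("b", "r2"), ("b", "r3")]
    cited_by, cites = build_link_sets(toy)
    print(cocitation("a", "b", cited_by))
    print(bibliographic_coupling("a", "b", cites))
    print(amsler("a", "b", cited_by, cites))
```

A kNN classifier like the one the abstract mentions could use any of these scores as its similarity function in place of a text-based similarity.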

4.
Cluster-based and passage-based document retrieval paradigms were shown to be effective. While the former are based on utilizing query-related corpus context manifested in clusters of similar documents, the latter address the fact that a document can be relevant even if only a very small part of it contains query-pertaining information. Hence, cluster-based approaches could be viewed as based on “expanding” the document representation, while passage-based approaches can be thought of as utilizing a “contracted” document representation. We present a study of the relative benefits of using each of these two approaches, and of the potential merits of their integration. To that end, we devise two methods that integrate whole-document-based, cluster-based and passage-based information. The methods are applied for the re-ranking task, that is, re-ordering documents in an initially retrieved list so as to improve precision at the very top ranks. Extensive empirical evaluation attests to the potential merits of integrating these information types. Specifically, the resultant performance substantially transcends that of the initial ranking and is often better than that of a state-of-the-art pseudo-feedback-based query expansion approach.

5.
The history of the creation and development of the VINITI RAS “Geography” reference journal from 1954 to 2008 is considered. The changes in retrofunds and dynamics of the distribution of the overall quantity of documents in the reference journal/database have been followed in relation to the changes in the content contained in the issues during the period of time under consideration. The document information flow of the “Geography” database during 1991–2008 was analyzed statistically.

6.
Using late medieval examples from Switzerland, this paper argues that the emergence of formally organized archives around 1500 was part of an important shift in how documents could be deployed. However, this shift was not away from an oral and toward a literate culture, as argued in some earlier studies, but rather away from seeing documents as testimony that reminded a community about past authoritative actors, and toward relating the texts of documents to other texts, that is, to contexts. This shift took place largely through the appropriation of methods for using and organizing written material that had been developed in the realms of scholastic theology and liturgy, and applying them to secular lordship and administration. These methods provided new models for organizing collections of parchments and papers into connected archives and gave rise to new forms of text collection such as reorganized versions of law books (Spiegel, Coutumiers) containing new search tools such as tables of contents (capitulationes) and indices (abecedaria). Individual charters and scattered legal norms were also organized into textus-glossae structures in larger and smaller administrative units. In the Swiss case, the contextualization of legal texts was accompanied by an increased attribution of authority to ‘custom’ in general, because the community-oriented attribution of meaning found in earlier use was lost. Ultimately, recasting individual documents as part of larger textual contexts increased the power of rulers and ushered in an age of lawyers and of archives.

7.
Document clustering of scientific texts using citation contexts   Total citations: 3, self-citations: 0, citations by others: 3
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. 
More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.

8.
The paper focuses on the convergence of Finnish research and education in archival science with information science in general and in records management with information management in particular. Two issues influencing this development are: the convergence of professionals who previously worked in the archival and library sectors and in information management and services; and the widespread, extensive growth in the use of digital technology to manage internal and external organizational information. At the level of society, the opportunities provided by digital technology to manage heritage information in memory organizations like archives, libraries and museums are tremendous, and the role of documentary heritage at the global, European and national levels is well recognized. These developments are changing the information and operating environments of memory organizations and public and private enterprises. These changes, in turn, are generating new requirements in archival science and records management education and research. This paper focuses on the implications of these changes for the planning, implementation and further development of an information studies curriculum. This curriculum development is considered crucial in order to respond to the new demands, and is also implicitly linked to the emerging Finnish information society. This article is based on Huotari, M.-L. and Valtonen, M.R., “Integrating Records and Archives Management with Information Studies in Finland”, in L. Ashcroft (ed.), Continuity, Culture, Competition—the Future of Library and Information Studies Education, Proceedings of the 4th British-Nordic Conference on Library and Information Studies 21–23 March 2001, Dublin, Ireland, pp. 249–254 (Dublin: MCB UP Limited, 2002).

9.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record. Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car-ads and another one for obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.

10.
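This paper studies an incremental Chinese text clustering method based on the dynamic adaptive self-organizing map neural network (DASOM). The method only needs to process updated data, which speeds up clustering, and it can automatically extract the SOM clustering results. The DASOM model has a dynamic structure, and numerical experiments show that the method is effective for incremental clustering of Chinese text.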

11.
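Document clustering is an effective way to organize documents; in information processing it is widely used for the automatic discovery of unknown topics, with good results. This paper proposes a lightweight clustering algorithm. By reducing the number of index terms of the original documents, the algorithm can process a large number of small documents and group them into several thousand clusters, or, by changing specific parameters, reduce the number of clusters to a few dozen. Theoretical analysis and practical application show that the algorithm improves the efficiency of processing high-dimensional data and large numbers of small documents.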

12.
Streaming data poses a variety of new and interesting challenges for information retrieval and text analysis. Unlike static document collections, which are typically analyzed and indexed off-line to support ad-hoc queries, streaming data often must be analyzed on the fly and acted on as the data passes through the analysis system. Speech is one example of streaming data that is a challenge to exploit, yet has significant potential to provide value in a knowledge management system. We are specifically interested in techniques that analyze streaming data and automatically find collateral information, or information that clarifies, expands, and generally enhances the value of the streaming data. We present a system that analyzes a data stream and automatically finds documents related to the current topic of discussion in the data stream. Experimental results show that the system generates result lists with an average precision at 10 hits of better than 60%. We also present a hit-list re-ranking technique based on named entity analysis and automatic text categorization that can improve the search results by 6%–12%.

13.
As the volume and variety of information sources continue to grow, it becomes increasingly difficult to obtain information that accurately matches user information needs. A number of factors affect information retrieval effectiveness (the accuracy of matching user information needs against the retrieved information). First, users often do not present search queries in the form that optimally represents their information need. Second, the measure of a document’s relevance is often highly subjective between different users. Third, information sources might contain heterogeneous documents in multiple formats, and the representation of documents is not unified. This paper discusses an approach for improving information retrieval effectiveness from document databases. It is proposed that retrieval effectiveness can be improved by applying computational intelligence techniques for modelling information needs, through interactive reinforcement learning. The method combines qualitative (subjective) user relevance feedback with quantitative (algorithmic) measures of the relevance of retrieved documents. An information retrieval system is developed whose retrieval effectiveness is evaluated using traditional precision and recall.

14.
Web search queries are often ambiguous or faceted, and the task of identifying the major underlying senses and facets of queries has received much attention in recent years. We refer to this task as query subtopic mining. In this paper, we propose to use surrounding text of query terms in top retrieved documents to mine subtopics and rank them. We first extract text fragments containing query terms from different parts of documents. Then we group similar text fragments into clusters and generate a readable subtopic for each cluster. Based on the cluster and the language model trained from a query log, we calculate three features and combine them into a relevance score for each subtopic. Subtopics are finally ranked by balancing relevance and novelty. Our evaluation experiments with the NTCIR-9 INTENT Chinese Subtopic Mining test collection show that our method significantly outperforms a query log based method proposed by Radlinski et al. (2010) and a search result clustering based method proposed by Zeng et al. (2004) in terms of precision, I-rec, D-nDCG and D#-nDCG, the official evaluation metrics used at the NTCIR-9 INTENT task. Moreover, our generated subtopics are significantly more readable than those generated by the search result clustering method.

15.
Based on previous findings and theoretical considerations, it was suggested that bibliographic coupling could be combined with a cluster method to provide a method for science mapping, complementary to the prevailing co-citation cluster analytical method. The complete link cluster method was on theoretical grounds assumed to provide a suitable cluster method for this purpose. The objective of the study was to evaluate the proposed method's capability to identify coherent research themes. Applying a large multidisciplinary test bed comprising more than 600,000 articles and 17 million references, the proposed method was tested in accordance with two lines of mapping. In the first line of mapping, all significant (strong) links connecting ‘core documents’ (strongly and frequently coupled documents) in clusters with any other core document were mapped. This resulted in a depiction of all significant artificially broken links between core documents in a cluster and core documents extrinsic to that cluster. The second line of mapping involved the application of links between clusters only. They were used to successively merge clusters on two subsequent levels of fusion, where the first generation of clusters were considered objects for a second clustering, and the second generation of clusters gave rise to a final cluster fusion. Changes of cluster composition on the three levels were evaluated with regard to several variables. Findings showed that the proposed method could provide valid depictions of current research, though some severe restrictions would adhere to its application.

16.
The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) due to the corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach with respect to applying clustering algorithms directly on the text corpus, and using state-of-the-art co-clustering and topic modeling methods.

17.
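Using an improved information gain feature selection method, texts were effectively clustered automatically. A total of 250 texts were drawn from a corpus; text feature vectors were constructed using the vector space model and information gain feature dimensionality reduction, and were finally clustered with the C-means method. The precision, recall, and F-measure of the clustering results reached 0.82, 0.88, and 0.83, respectively.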

18.
This paper gives the results of the scientometric analysis of foreign publications by Kazakh authors that were reflected in the SCOPUS DB in 1991–2008. The publication activity is expressed in 3883 documents, the citation index of which is 10,132. The average share of Kazakh publications in the total worldwide flow is equal to 0.017%. The citation rate of publications was revealed to have significantly grown since the 1996–2000 period. It is shown that most articles were written in English and published in periodical editions. The main themes of publications are represented by physics and chemistry. The leading foreign partners of Kazakhstan in the scientific sphere were determined. Kazakh-Russian scientific cooperation is developing most fruitfully.

19.
Objective: The aim of this project was to validate search filters for systematic reviews, intervention studies, and observational studies translated from Ovid MEDLINE and Embase syntax and used for searches in PubMed and Embase.com during the development of evidence summaries supporting first aid guidelines. We aimed to achieve a balance among recall, specificity, precision, and number needed to read (NNR). Methods: Reference gold standards were constructed per study type derived from existing evidence summaries. Search filter performance was assessed through retrospective searches and measurement of relative recall, specificity, precision, and NNR when using the translated search filters. Where necessary, search filters were optimized. Adapted filters were validated in separate validation gold standards. Results: Search filters for systematic reviews and observational studies reached recall of ≥85% in both PubMed and Embase. Corresponding specificities for systematic review filters were ≥96% in both databases, with a precision of 9.7% (NNR 10) in PubMed and 5.4% (NNR 19) in Embase. For observational study filters, specificity, precision, and NNR were 68%, 2%, and 51 in PubMed and 47%, 0.8%, and 123 in Embase, respectively. These filters were considered sufficiently effective. Search filters for intervention studies reached a recall of 85% and 83% in PubMed and Embase, respectively. Optimization led to recall of ≥95% with specificity, precision, and NNR of 49%, 1.3%, and 79 in PubMed and 56%, 0.74%, and 136 in Embase, respectively. Conclusions: We report validated filters to search for systematic reviews, observational studies, and intervention studies in guideline projects in PubMed and Embase.com.
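The four measures reported in abstract 19 can all be derived from the counts of relevant and irrelevant records that a filter retrieves or misses. The sketch below is only an illustration: the function name is hypothetical, it assumes every denominator is non-zero, and the counts in the usage example are invented, not taken from the study. It uses the usual definition of NNR as the reciprocal of precision.

```python
def filter_metrics(tp, fp, fn, tn):
    """Search-filter performance from confusion counts.

    tp: relevant records retrieved      fp: irrelevant records retrieved
    fn: relevant records missed         tn: irrelevant records excluded
    Assumes all denominators are non-zero.
    """
    recall = tp / (tp + fn)        # share of relevant records found
    specificity = tn / (tn + fp)   # share of irrelevant records excluded
    precision = tp / (tp + fp)     # share of retrieved records that are relevant
    nnr = 1 / precision            # records to read per relevant record found
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "nnr": nnr,
    }


if __name__ == "__main__":
    # Invented counts for illustration only.
    for name, value in filter_metrics(tp=90, fp=810, fn=10, tn=19190).items():
        print(f"{name}: {value:.4f}")
```

Because the abstract's precision values are rounded, taking their reciprocal only approximates the reported NNR figures (for example, 1 / 0.097 is about 10.3, against the reported NNR of 10).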

20.
The problem of finding documents written in a language that the searcher cannot read is perhaps the most challenging application of cross-language information retrieval technology. In interactive applications, that task involves at least two steps: (1) the machine locates promising documents in a collection that is larger than the searcher could scan, and (2) the searcher recognizes documents relevant to their intended use from among those nominated by the machine. This article presents the results of experiments designed to explore three techniques for supporting interactive relevance assessment: (1) full machine translation, (2) rapid term-by-term translation, and (3) focused phrase translation. Machine translation was found to better support this task than term-by-term translation, and focused phrase translation further improved recall without an adverse effect on precision. The article concludes with an assessment of the strengths and weaknesses of the evaluation framework used in this study and some remarks on implications of these results for future evaluation campaigns.
