Similar Documents (20 results)
1.
Most existing research on applying machine learning techniques to document summarization explores either classification models or learning-to-rank models. This paper presents our recent study on how to apply a different family of models, namely regression models, to query-focused multi-document summarization. We use Support Vector Regression (SVR) to estimate the importance of a sentence in the document set to be summarized from a set of pre-defined features. To learn the regression models, we propose several methods for constructing "pseudo" training data by assigning each sentence a "nearly true" importance score calculated from the human summaries provided for the corresponding document set. A series of evaluations on the DUC data sets is conducted to examine the efficiency and robustness of the proposed approaches. When compared with classification models and ranking models, regression models are consistently preferable.
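As an illustration of the regression step, the hedged sketch below trains an SVR model on pseudo-labeled sentence feature vectors and scores new sentences; the feature choices and the pseudo-scores are placeholders, not the paper's actual feature set or scoring scheme.

```python
# Minimal sketch (assumptions: features and pseudo-scores are placeholders,
# not the paper's actual features or scoring method).
import numpy as np
from sklearn.svm import SVR

# Each row is one sentence described by pre-defined features,
# e.g. [sentence length, position in document, query-term overlap].
X_train = np.array([[12, 0.0, 0.4],
                    [25, 0.3, 0.1],
                    [18, 0.9, 0.7]], dtype=float)
# "Nearly true" importance scores derived from the human reference summaries,
# e.g. similarity of each sentence to those summaries.
y_train = np.array([0.62, 0.15, 0.80])

model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
model.fit(X_train, y_train)

# Score unseen sentences and pick the top ones for the summary.
X_new = np.array([[20, 0.5, 0.6], [10, 0.1, 0.2]], dtype=float)
scores = model.predict(X_new)
top = np.argsort(-scores)   # sentence indices ranked by predicted importance
print(scores, top)
```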

2.
This research has investigated the feasibility of using a distance measure, called the Bayesian distance, for automatic sequential document classification. It has been shown that by observing the variation of this distance measure as keywords are extracted sequentially from a document, the occurrence of noisy keywords may be detected. This property of the distance measure has been utilized to design a sequential classification algorithm that works in two phases. In the first phase, keywords extracted from a document are partitioned into two groups: the good keyword group and the noisy keyword group. In the second phase, these two groups of keywords are analyzed separately to assign primary and secondary classes to a document. The algorithm has been applied to several databases of documents and very encouraging results have been obtained.
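The paper's Bayesian distance is not reproduced here; the sketch below only illustrates the two-phase idea under a stated assumption, using a naive-Bayes log-posterior margin as a stand-in score: keywords that shrink the margin as they arrive are set aside as noisy.

```python
# Hedged illustration only: a naive-Bayes log-posterior margin stands in for the
# paper's Bayesian distance; probabilities below are made-up toy values.
import math
from collections import defaultdict

p_word = {
    "sports":   defaultdict(lambda: 1e-3, {"game": 0.08, "team": 0.07, "score": 0.05}),
    "politics": defaultdict(lambda: 1e-3, {"vote": 0.08, "law": 0.06, "game": 0.01}),
}
prior = {"sports": 0.5, "politics": 0.5}

def margin(keywords):
    """Log-posterior gap between best and second-best class for a keyword sequence."""
    logp = {c: math.log(prior[c]) + sum(math.log(p_word[c][w]) for w in keywords)
            for c in prior}
    best, second = sorted(logp.values(), reverse=True)[:2]
    return best - second

# Phase 1: extract keywords sequentially; a keyword that shrinks the margin
# is treated as noisy and set aside.
good, noisy, seen = [], [], []
for w in ["game", "team", "vote", "score"]:
    before = margin(seen) if seen else 0.0
    seen.append(w)
    (good if margin(seen) >= before else noisy).append(w)

# Phase 2: the good group drives the primary class; the noisy group could be
# analyzed separately to suggest a secondary class.
print(good, noisy)
```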

3.
We study several machine learning algorithms for cross-language patent retrieval and classification. In contrast with most other studies involving machine learning for cross-language information retrieval, which basically use learning techniques for monolingual sub-tasks, our learning algorithms exploit bilingual training documents and learn a semantic representation from them. We study Japanese–English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method for finding correlated linear relationships between two sets of variables in kernel-defined feature spaces. The results are quite encouraging and are significantly better than those obtained by other state-of-the-art methods. We also investigate learning algorithms for cross-language document classification. These learning algorithms are based on KCCA and Support Vector Machines (SVM). In particular, we study two ways of combining KCCA and SVM and find that one particular combination, called SVM_2k, achieves better results than the other learning algorithms on both bilingual and monolingual test documents.
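A minimal sketch of one common regularized KCCA formulation follows; the kernels, regularization, and preprocessing used in the paper are not reproduced, so treat the function and parameter choices as illustrative assumptions. Paired cross-language documents give two kernel matrices, and a generalized eigenproblem yields dual directions whose projections form a shared semantic space.

```python
# Hedged sketch of regularized kernel CCA on precomputed kernel matrices
# (one standard formulation; not necessarily the exact variant in the paper).
import numpy as np
from scipy.linalg import eigh

def center_kernel(K):
    """Double-center a kernel matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, reg=0.1, n_components=2):
    """Dual directions (alpha, beta) maximizing correlation between the two views."""
    n = Kx.shape[0]
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + reg * np.eye(n), Z],
                  [Z, Ky @ Ky + reg * np.eye(n)]])
    vals, vecs = eigh(A, B)                  # generalized symmetric eigenproblem
    order = np.argsort(-vals)[:n_components]
    return vecs[:n, order], vecs[n:, order]

# Toy example: 5 paired "documents" in two views (e.g. Japanese/English features).
rng = np.random.default_rng(0)
Xj, Xe = rng.normal(size=(5, 4)), rng.normal(size=(5, 6))
Kj, Ke = center_kernel(Xj @ Xj.T), center_kernel(Xe @ Xe.T)
alpha, beta = kcca(Kj, Ke)
# Projections into the shared space; cross-language retrieval can match documents here.
proj_j, proj_e = Kj @ alpha, Ke @ beta
print(proj_j.shape, proj_e.shape)
```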

4.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. It is one of the most important methods for organizing and making use of the gigantic amounts of information that exist in unstructured textual form, and it is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words, where the words (terms) are cut off from their finer context, i.e., their location in a sentence or in a document. Only the broader context of the document is used, with some type of term frequency information, in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meaning of words and the semantic connections between words, documents, and even classes are obviously important, since methods that capture semantics generally reach better classification performance. Several surveys have been published to analyze diverse approaches to traditional text classification. Most of these surveys cover the application of different semantic term-relatedness methods in text classification to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. In order to fill this gap, we undertake a comprehensive discussion of semantic text classification versus traditional text classification. This survey explores past and recent advancements in semantic text classification and attempts to organize existing approaches under five fundamental categories: domain-knowledge-based approaches, corpus-based approaches, deep-learning-based approaches, word/character-sequence-enhanced approaches, and linguistically enriched approaches. Furthermore, this survey highlights the advantages of semantic text classification algorithms over traditional text classification algorithms.

5.
This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories using deep learning. The model architecture consists of two main modules: document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. The acquired terminology, integrated into the ontology, extends the semantically rich document representation with in-depth coverage of concepts, thereby capturing the whole conceptualization involved in the documents. The semantically rich representations obtained from the first module serve as input to the document classification module, which aims at finding the most appropriate category for each document through deep learning. Three different deep learning networks, each belonging to a different category of machine learning techniques, are used for ontological document classification with a real-life ontology. Multiple simulations are carried out with various deep neural network configurations, and our findings reveal that a feedforward network with three hidden layers of 1024 neurons obtains the highest document classification performance on the INFUSE dataset. The performance in terms of F1 score is further increased by almost five percentage points, to 78.10%, for the same network configuration when the relevant terminology integrated into the ontology is applied to enrich the document representation. Furthermore, we conducted a comparative performance evaluation using various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers.
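For reference, a hedged sketch of a feedforward classifier with three hidden layers of 1024 neurons is given below using scikit-learn; the random input features merely stand in for the semantically enriched document vectors, and nothing here reflects the paper's actual INFUSE preprocessing.

```python
# Hedged sketch: three hidden layers of 1024 neurons, as in the reported best
# configuration; the random feature matrix stands in for enriched document vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 300))          # placeholder document representations
y = rng.integers(0, 4, size=400)         # placeholder category labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(1024, 1024, 1024),
                    activation="relu", max_iter=50, random_state=0)
clf.fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```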

6.
鲁国轩  杨冠灿  宋欣 《情报科学》2022,40(9):154-158
[Purpose/Significance] As an interdisciplinary field spanning the humanities, social sciences, and computer technology, digital humanities is developing rapidly while still facing problems such as an unclear definition of its scope and a lack of dedicated journals, which makes literature collection difficult. A suitable identification and classification model is therefore needed to build a digital humanities literature repository and support digital humanities research. [Method/Process] We analyze the scope of the digital humanities discipline, summarize the characteristics of digital humanities literature, and, on the basis of manually read and annotated samples, build machine learning models to automatically identify and classify digital humanities literature. [Result/Conclusion] We propose a machine-learning-based model for identifying and classifying digital humanities literature, which achieves good identification performance on digital humanities literature in the library and information science field. [Innovation/Limitation] Applying machine learning algorithms to the classification of digital humanities literature copes well with complex vocabulary and small data volumes; further research could use more complex models such as deep learning and achieve multi-class classification of digital humanities literature across different fields.

7.
This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, studies of the same issue for ranking appear to be insufficient. As pointed out in this paper, it is inappropriate to directly extend techniques for classification to ranking, and the development of novel techniques is sorely needed. In this paper, we study the development of such techniques. To begin with, we propose the concept of "pairwise preference consistency" (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into consideration the ordinal relationship between documents as well as the hierarchical structure on queries and documents, which are both unique properties of ranking. We then select a subset of the original training documents by maximizing the PPC of the selected subset, and we further propose an efficient solution to the maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that with the subset of training data selected by our approach, the performance of the learned ranking model can be significantly improved.
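The paper's exact PPC definition is not reproduced here; the sketch below uses one plausible, explicitly assumed reading: the fraction of within-query document pairs whose ordering by relevance label agrees with their ordering by a reference ranking score, which a greedy selection then tries to maximize.

```python
# Hedged sketch: an assumed, simplified stand-in for pairwise preference
# consistency (PPC), NOT the paper's exact definition.
from itertools import combinations

# Each training instance: (query_id, relevance_label, reference_score).
data = [("q1", 2, 0.9), ("q1", 1, 0.4), ("q1", 0, 0.6),
        ("q2", 1, 0.7), ("q2", 0, 0.2)]

def ppc(instances):
    """Fraction of within-query pairs where label order agrees with score order."""
    agree = total = 0
    for (qa, la, sa), (qb, lb, sb) in combinations(instances, 2):
        if qa != qb or la == lb:
            continue                     # only ordered pairs within the same query
        total += 1
        agree += ((la > lb) == (sa > sb))
    return agree / total if total else 0.0

# Greedy selection: repeatedly drop the instance whose removal most improves PPC.
subset = list(data)
while len(subset) > 3:
    best = max(range(len(subset)),
               key=lambda i: ppc(subset[:i] + subset[i + 1:]))
    candidate = subset[:best] + subset[best + 1:]
    if ppc(candidate) <= ppc(subset):
        break
    subset = candidate

print(ppc(data), ppc(subset), subset)
```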

8.
Automatic text classification is the problem of automatically assigning predefined categories to free-text documents, thus requiring less manual labor than traditional classification methods. When we apply binary classification to multi-class text classification, we usually use the one-against-the-rest method. In this method, if a document belongs to a particular category, the document is regarded as a positive example of that category; otherwise, the document is regarded as a negative example. Finally, each category has a positive data set and a negative data set. However, this one-against-the-rest method has a problem: the documents of a negative data set are not labeled manually, while those of a positive data set are labeled by humans. Therefore, the negative data set probably includes a lot of noisy data. In this paper, we propose applying the sliding window technique and a revised EM (Expectation Maximization) algorithm to binary text classification to solve this problem. As a result, we can improve binary text classification by extracting potentially noisy documents from the negative data set using the sliding window technique and removing truly noisy documents using the revised EM algorithm. The results of our experiments show that our method achieves better performance than the original one-against-the-rest method on all the data sets and with all the classifiers used in the experiments.
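The sliding-window and revised-EM details are in the paper; the sketch below only illustrates the underlying idea under stated assumptions: build one-against-the-rest label sets, then flag negative documents that a probabilistic classifier scores as likely positives and drop them before retraining.

```python
# Hedged sketch of the one-against-the-rest setup plus a simple noisy-negative
# filter; the paper's sliding-window and revised-EM steps are not reproduced.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["stock market rally", "election vote result", "football match score",
        "market shares fall", "parliament passes law", "team wins title"]
labels = ["finance", "politics", "sports", "finance", "politics", "sports"]
target = "finance"                    # build one-against-the-rest sets for this class

X = TfidfVectorizer().fit_transform(docs)
y = [1 if l == target else 0 for l in labels]   # positives vs. "the rest"

clf = MultinomialNB().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Negative documents that look strongly positive are treated as noisy and removed
# (the 0.5 threshold is an illustrative assumption); then the classifier is retrained.
keep = [i for i in range(len(docs)) if y[i] == 1 or proba[i] < 0.5]
clf = MultinomialNB().fit(X[keep], [y[i] for i in keep])
print(keep)
```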

9.
Decisions in thesaurus construction and use
A thesaurus and an ontology provide a set of structured terms, phrases, and metadata, often in a hierarchical arrangement, that may be used to index, search, and mine documents. We describe the decisions that should be made when including a term, deciding whether a term should be subdivided into its subclasses, or determining which of several possible sets of subclasses should be used. Based on retrospective measurements or estimates of future performance when using thesaurus terms in document ordering, decisions are made so as to maximize performance. These decisions may be used in the automatic construction of a thesaurus. The evaluation of an existing thesaurus is described, consistent with the decision criteria developed here. These kinds of user-focused decision-theoretic techniques may be applied to other hierarchical applications, such as faceted classification systems used in information architecture or the use of hierarchical terms in "breadcrumb navigation".

10.
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree, and (3) the hierarchy obtained by the algorithm explicitly reveals the collection structure. We confirm these features, and thus the algorithm's feasibility, through clustering experiments on two collections of Japanese documents containing 83,099 and 14,701 documents, respectively. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge for each topic by browsing the Japanese articles and their English translations. We also discuss techniques for presenting a large tree-formed hierarchy on a computer screen.
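As a rough illustration of building a classification-tree-style hierarchy directly from term frequency vectors, the sketch below uses recursive bisecting k-means as a stand-in; the paper's actual algorithm and its memory behavior are not reproduced.

```python
# Hedged sketch: a topic hierarchy built by recursively bisecting k-means on
# term-frequency vectors (a stand-in, not the paper's algorithm).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = ["yen exchange rate rises", "bank cuts interest rate",
        "earthquake hits coastal city", "typhoon approaches coast",
        "new smartphone released", "chip maker expands factory"]
X = CountVectorizer().fit_transform(docs).toarray()

def build_tree(indices, min_size=2):
    """Recursively split a set of document indices into a binary topic tree."""
    if len(indices) <= min_size:
        return {"docs": list(indices)}
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left = [i for i, l in zip(indices, labels) if l == 0]
    right = [i for i, l in zip(indices, labels) if l == 1]
    if not left or not right:
        return {"docs": list(indices)}
    return {"left": build_tree(left), "right": build_tree(right)}

tree = build_tree(list(range(len(docs))))
print(tree)   # each leaf groups documents sharing a topic; inner nodes form the hierarchy
```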

11.
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms have been proposed to perform transductive classification on text collections represented in a vector space model. However, the use of these algorithms is often unfeasible in practical applications due to the independence assumption among instances or terms and to their other drawbacks. Network-based algorithms have emerged to avoid the drawbacks of vector-space-model algorithms and to improve transductive classification. Networks are mostly used for label propagation, in which some labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful for representing text collections as networks and performing label propagation. The generation of this type of network avoids requirements such as collections with hyperlinks or citations, the computation of similarities among all texts in the collection, and the setup of a number of parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and the connections are given by the occurrences of terms in documents. Label propagation is performed from documents to terms and then from terms to documents, iteratively. Nevertheless, instead of using terms just as a means of label propagation, in this article we propose using the bipartite network structure to define the relevance scores of terms for classes through an optimization process and then propagating these relevance scores to define labels for unlabeled documents. The new document labels are used to redefine the relevance scores of terms, which consequently redefine the labels of unlabeled documents, in an iterative process. We demonstrate that the proposed approach surpasses algorithms for transductive classification based on the vector space model or on networks. Moreover, we demonstrate that the proposed algorithm effectively makes use of unlabeled documents to improve classification and is faster than other transductive algorithms.
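A minimal sketch of the basic document-term label propagation loop follows; the normalizations, the optimization of term relevance scores, and the stopping criterion here are simplifying assumptions, not the article's exact procedure.

```python
# Hedged sketch: iterative label propagation over a document-term bipartite
# network; normalization and iteration count are illustrative assumptions.
import numpy as np

# Rows = documents, columns = terms; entries = term occurrences in documents.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1],
              [0, 1, 1, 2]], dtype=float)
n_docs, n_terms = X.shape
n_classes = 2

# Class scores per document; only the first and third documents are labeled.
F_docs = np.zeros((n_docs, n_classes))
F_docs[0, 0] = 1.0        # doc 0 -> class 0
F_docs[2, 1] = 1.0        # doc 2 -> class 1
labeled = [0, 2]

row_norm = X / X.sum(axis=1, keepdims=True)   # doc -> term weights
col_norm = X / X.sum(axis=0, keepdims=True)   # term -> doc weights

for _ in range(10):
    F_terms = col_norm.T @ F_docs             # propagate document scores to terms
    F_docs = row_norm @ F_terms               # propagate term scores back to documents
    F_docs[labeled] = 0.0                     # clamp the labeled documents
    F_docs[0, 0] = 1.0
    F_docs[2, 1] = 1.0

print(F_docs.argmax(axis=1))                  # predicted class per document
```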

12.
Transfer learning utilizes labeled data available from a related domain (the source domain) to achieve effective knowledge transfer to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationships existing among the documents. In this paper, we propose a novel cross-domain document classification approach called the Link-Bridged Topic model (LBT). LBT consists of two key steps. First, LBT utilizes an auxiliary link network to discover the direct or indirect co-citation relationships among documents by embedding the background knowledge into a graph kernel. The mined co-citation relationships are leveraged to bridge the gap across different domains. Second, LBT simultaneously combines the content information and link structures into a unified latent topic model. The model is based on the assumption that the documents of the source and target domains share some common topics from the point of view of both content information and link structure. By mapping the data of both domains into the latent topic space, LBT encodes knowledge about domain commonality and difference as shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as with the content and link statistics. The shared topics then act as the bridge to facilitate knowledge transfer from the source to the target domain. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.

13.
Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, play a crucial role in enriching latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead, they have multiple aspects, and different documents may discuss different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document. In this work, we devise the following research questions: (1) Can we confirm that entities have multiple aspects, with different aspects reflected in different documents? (2) Can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the documents? (3) Does this novel representation improve algorithm performance in downstream applications? (4) What is a reasonable number of aspects per entity? To answer these questions, we model each entity using multiple aspects (entity facets), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of the facets of the corresponding entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics. Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact on downstream applications, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents and demonstrate that different sets of documents indeed reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM's parameters and find that a small number of facets better captures entity-specific topics, which confirms the intuition that, on average, an entity has a small number of facets reflected in documents.

14.
A fast algorithm is described for comparing the lists of terms representing documents in automatic classification experiments. The speed of the procedure arises from the fact that all of the non-zero-valued coefficients for a given document are identified together, using an inverted file of the terms in the document collection. The complexity and running time of the algorithm are compared with those of previously described procedures.
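A small sketch of the inverted-file idea is given below: instead of comparing a document against every other document, shared-term counts are accumulated only for documents that co-occur in some posting list, so only non-zero coefficients are ever generated. The choice of matching coefficient here (a raw overlap count) is an illustrative assumption.

```python
# Hedged sketch: using an inverted file so that, for a given document, only the
# documents with non-zero overlap are ever touched (overlap count stands in for
# whatever matching coefficient is actually used).
from collections import defaultdict

docs = {
    "d1": ["retrieval", "index", "query"],
    "d2": ["index", "ranking", "query"],
    "d3": ["parsing", "grammar"],
}

# Build the inverted file: term -> documents containing it.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for t in terms:
        inverted[t].add(doc_id)

def nonzero_overlaps(doc_id):
    """Shared-term counts with every document that overlaps doc_id at all."""
    counts = defaultdict(int)
    for t in docs[doc_id]:
        for other in inverted[t]:
            if other != doc_id:
                counts[other] += 1
    return dict(counts)

print(nonzero_overlaps("d1"))    # {'d2': 2}; d3 is never visited
```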

15.
Contextual feature selection for text classification
We present a simple approach for the classification of "noisy" documents using bigrams and named entities. The approach combines conventional feature selection with a contextual approach to filter out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578, and 20 Newsgroups. We find a significant improvement on our collection and the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.

16.
李锦兰 《现代情报》2010,30(12):44-46,50
Based on a study of the concept, types, and characteristics of online local documents, this paper proposes applying the principles and methods of bibliographic control to online local documents, and discusses in detail bibliographic control methods such as the evaluation, selection, description, and cataloging of online local documents. Building a database of online local documents is one of the end products of implementing bibliographic control over them.

17.
A new concept of a bipolar query against collections of textual documents, i.e., in the context of information retrieval (IR), is introduced using recent developments in bipolar information modeling and bipolar database queries. Specifically, a particular approach to bipolar queries with an explicit "and possibly" type of aggregation operator is used. Effective and efficient processing of such bipolar queries using standard IR data structures is briefly discussed. The proposed bipolar queries combine the flexibility provided by fuzzy logic with a more sophisticated representation of user preferences and intentions. This combination can make the search of vast resources of textual documents, notably those available via the Internet, more intelligent.
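One common way to read an "and possibly" bipolar condition is "satisfy the required condition C, and possibly also the desired condition P"; the fuzzy sketch below implements that reading with one standard formulation. This is an assumption for illustration; the paper's exact operator and its IR integration are not reproduced.

```python
# Hedged sketch of a fuzzy "C and possibly P" evaluation over a small document set
# (one standard formulation of the "and possibly" operator; membership degrees
# are made-up illustrative numbers).
docs = {
    # doc_id: (degree of matching required condition C, desired condition P)
    "d1": (0.9, 0.8),
    "d2": (0.7, 0.1),
    "d3": (0.2, 0.9),
}

def and_possibly(scores):
    """score(t) = min(C(t), max(P(t), 1 - max_s min(C(s), P(s))))."""
    # How feasible it is to satisfy C and P together within this collection.
    feasibility = max(min(c, p) for c, p in scores.values())
    return {d: min(c, max(p, 1.0 - feasibility)) for d, (c, p) in scores.items()}

ranked = sorted(and_possibly(docs).items(), key=lambda kv: -kv[1])
print(ranked)   # d1 satisfies both conditions and is ranked first
```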

18.
Quantitative analysis of and countermeasures for online document delivery services: the case of Nanchang University
Based on the online document delivery data of Nanchang University Library for 2006-2007, this paper presents a statistical analysis from four perspectives: the volume of document delivery requests, the types of requested documents, the disciplinary distribution of requested documents, and their languages. The results show that the volume of document requests at Chinese libraries is growing markedly; journal articles account for the largest share of requested documents, followed by dissertations and then conference papers; and the School of Materials Science submitted the most requests, followed by the School of Life Sciences. Finally, the paper points out four problems in online document delivery and offers targeted suggestions and countermeasures for improvement.

19.
This paper reports our experimental investigation into the use of more realistic concepts, as opposed to simple keywords, for document retrieval, and into reinforcement learning for improving document representations to help the retrieval of useful documents for relevant queries. The framework used for achieving this was based on the theory of Formal Concept Analysis (FCA) and Lattice Theory. Features or concepts of each document (and query), formulated according to FCA, are represented in a separate concept lattice and are weighted separately with respect to the individual documents they represent. The document retrieval process is viewed as a continuous conversation between queries and documents, during which documents are allowed to learn a set of significant concepts to help their retrieval. The learning strategy used was based on relevance feedback information that strengthens the similarity of relevant documents and weakens that of non-relevant documents. Test results obtained on the Cranfield collection show a significant increase in average precision as the system learns from experience.
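The sketch below illustrates only the feedback idea in its simplest form, as an assumption rather than the paper's FCA-based formulation: concept weights of a document are nudged up when the document is judged relevant to a query and down otherwise, so future query-document similarity shifts accordingly.

```python
# Hedged sketch: reinforcement-style weight updates for document concepts driven
# by relevance feedback (a simplification; the paper's FCA machinery is omitted).
doc_concepts = {"d1": {"aerodynamics": 0.5, "wing": 0.5},
                "d2": {"aerodynamics": 0.5, "engine": 0.5}}

def similarity(query_concepts, doc_id):
    """Sum of weights of the document's concepts that appear in the query."""
    return sum(w for c, w in doc_concepts[doc_id].items() if c in query_concepts)

def feedback(query_concepts, doc_id, relevant, lr=0.1):
    """Strengthen shared concepts for relevant docs, weaken them otherwise."""
    for c in doc_concepts[doc_id]:
        if c in query_concepts:
            delta = lr if relevant else -lr
            doc_concepts[doc_id][c] = min(1.0, max(0.0, doc_concepts[doc_id][c] + delta))

query = {"aerodynamics", "wing"}
print(similarity(query, "d1"), similarity(query, "d2"))
feedback(query, "d1", relevant=True)      # user marks d1 relevant
feedback(query, "d2", relevant=False)     # user marks d2 non-relevant
print(similarity(query, "d1"), similarity(query, "d2"))
```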

20.
With the increase of information on the Web, it is difficult to quickly find desired information among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. A genre or style is another view of a document, different from its subject or topic, and genre is also a criterion by which to classify documents. In this paper, we suggest multiple sets of features for classifying the genres of web documents. The basic set of features, which has been proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of occurrences of a certain word, etc. However, web documents differ from plain textual documents in that they have URLs and contain HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for automatic genre classification of web documents.
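A small sketch of URL- and HTML-derived genre features is shown below; the specific features chosen here are illustrative assumptions, not the paper's feature set.

```python
# Hedged sketch: a few URL- and HTML-tag-based features that could feed a genre
# classifier (feature choices are illustrative, not the paper's actual set).
import re
from urllib.parse import urlparse

def genre_features(url, html):
    text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
    words = text.split()
    path = urlparse(url).path
    return {
        "url_depth": path.count("/"),             # e.g. deep article pages vs. home pages
        "url_has_date": bool(re.search(r"/20\d{2}/", url)),
        "num_links": html.lower().count("<a "),
        "num_images": html.lower().count("<img"),
        "num_forms": html.lower().count("<form"),
        "num_words": len(words),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
    }

sample = "<html><body><h1>News</h1><p>Storm hits coast.</p><a href='/more'>more</a></body></html>"
print(genre_features("https://example.com/2023/05/storm-report", sample))
```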
