首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, the supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficultly generated because a labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then it automatically learns text classifier by using bootstrapping and feature projection techniques. The results of experiments showed that the proposed method achieved reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.  相似文献   

2.
The vector space model (VSM) is a textual representation method that is widely used in documents classification. However, it remains to be a space-challenging problem. One attempt to alleviate the space problem is by using dimensionality reduction techniques, however, such techniques have deficiencies such as losing some important information. In this paper, we propose a novel text classification method that neither uses VSM nor dimensionality reduction techniques. The proposed method is a space efficient method that utilizes the first order Markov model for hierarchical Arabic text classification. For each category and sub-category, a Markov chain model is prepared based on the neighboring characters sequences. The prepared models are then used for scoring documents for classification purposes. For evaluation, we used a hierarchical Arabic text data collection that contains 11,191 documents that belong to eight topics distributed into 3-levels. The experimental results show that the Markov chains based method significantly outperforms the baseline system that employs the latent semantic indexing (LSI) method. That is, the proposed method enhances the F1-measure by 3.47%. The novelty of this work lies on the idea of decomposing words into sequences of characters, which found to be a promising approach in terms of space and accuracy. Based on our best knowledge, this is the first attempt to conduct research for hierarchical Arabic text classification with such relatively large data collection.  相似文献   

3.
In text categorization, it is quite often that the numbers of documents in different categories are different, i.e., the class distribution is imbalanced. We propose a unique approach to improve text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples of rare classes (categories with relatively small amount of training data) by using global semantic information of classes represented by probabilistic topic models. In this way, the numbers of samples in different categories can become more balanced and the performance of text categorization can be improved using this transformed data set. Indeed, the proposed method is different from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the documents in rare classes. Such re-sampling methods can cause overfitting. Another benefit of our approach is the effective handling of noisy samples. Since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed methods can achieve better performance under class imbalance and is more tolerant to noisy samples.  相似文献   

4.
This study employs our proposed semi-supervised clustering method called Constrained-PLSA to cluster tagged documents with a small amount of labeled documents and uses two data sets for system performance evaluations. The first data set is a document set whose boundaries among the clusters are not clear; while the second one has clear boundaries among clusters. This study employs abstracts of papers and the tags annotated by users to cluster documents. Four combinations of tags and words are used for feature representations. The experimental results indicate that almost all of the methods can benefit from tags. However, unsupervised learning methods fail to function properly in the data set with noisy information, but Constrained-PLSA functions properly. In many real applications, background knowledge is ready, making it appropriate to employ background knowledge in the clustering process to make the learning more fast and effective.  相似文献   

5.
The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to categorization unbalancedness and feature sparsity in social text collection, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model, describing categories by distributions, handling the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.  相似文献   

6.
Language modeling is an effective and theoretically attractive probabilistic framework for text information retrieval. The basic idea of this approach is to estimate a language model of a given document (or document set), and then do retrieval or classification based on this model. A common language modeling approach assumes the data D is generated from a mixture of several language models. The core problem is to find the maximum likelihood estimation of one language model mixture, given the fixed mixture weights and the other language model mixture. The EM algorithm is usually used to find the solution.  相似文献   

7.
Text categorization pertains to the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., written in the same language) during the learning of the text categorization model and category assignment (or prediction) for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate and organize into categories documents in one language and subsequently archive documents in different languages into existing categories, which necessitate cross-lingual text categorization (CLTC). Specifically, cross-lingual text categorization deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents in a different language (e.g., L2). Motivated by the significance of this demand, this study aims to design a CLTC technique with two different category assignment methods, namely, individual- and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation results demonstrate the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.  相似文献   

8.
Transfer learning utilizes labeled data available from some related domain (source domain) for achieving effective knowledge transformation to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationship existing among the documents. In this paper, we propose a novel cross-domain document classification approach called Link-Bridged Topic model (LBT). LBT consists of two key steps. Firstly, LBT utilizes an auxiliary link network to discover the direct or indirect co-citation relationship among documents by embedding the background knowledge into a graph kernel. The mined co-citation relationship is leveraged to bridge the gap across different domains. Secondly, LBT simultaneously combines the content information and link structures into a unified latent topic model. The model is based on an assumption that the documents of source and target domains share some common topics from the point of view of both content information and link structure. By mapping both domains data into the latent topic spaces, LBT encodes the knowledge about domain commonality and difference as the shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as content and link statistics. Then the shared topics act as the bridge to facilitate knowledge transfer from the source to the target domains. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.  相似文献   

9.
As text documents are explosively increasing in the Internet, the process of hierarchical document clustering has been proven to be useful for grouping similar documents for versatile applications. However, most document clustering methods still suffer from challenges in dealing with the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we will present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach, which uses fuzzy association rule mining algorithm to improve the clustering accuracy of Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, the key terms will be extracted from the document set, and each document is pre-processed into the designated representation for the following mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly-related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, these documents will be clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance based on Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only absolutely retains the merits of FIHC, but also improves the accuracy quality of FIHC.  相似文献   

10.
Text classification is an important research topic in natural language processing (NLP), and Graph Neural Networks (GNNs) have recently been applied in this task. However, in existing graph-based models, text graphs constructed by rules are not real graph data and introduce massive noise. More importantly, for fixed corpus-level graph structure, these models cannot sufficiently exploit the labeled and unlabeled information of nodes. Meanwhile, contrastive learning has been developed as an effective method in graph domain to fully utilize the information of nodes. Therefore, we propose a new graph-based model for text classification named CGA2TC, which introduces contrastive learning with an adaptive augmentation strategy into obtaining more robust node representation. First, we explore word co-occurrence and document word relationships to construct a text graph. Then, we design an adaptive augmentation strategy for the text graph with noise to generate two contrastive views that effectively solve the noise problem and preserve essential structure. Specifically, we design noise-based and centrality-based augmentation strategies on the topological structure of text graph to disturb the unimportant connections and thus highlight the relatively important edges. As for the labeled nodes, we take the nodes with same label as multiple positive samples and assign them to anchor node, while we employ consistency training on unlabeled nodes to constrain model predictions. Finally, to reduce the resource consumption of contrastive learning, we adopt a random sample method to select some nodes to calculate contrastive loss. The experimental results on several benchmark datasets can demonstrate the effectiveness of CGA2TC on the text classification task.  相似文献   

11.
With the increase of information on the Web, it is difficult to find desired information quickly out of the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has been focused on a subject or a topic of a document. A genre or a style is another view of a document different from a subject or a topic. The genre is also a criterion to classify documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, which have been proposed in the previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of a certain word, etc. However, web documents are different from textual documents in that they contain URL and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URL and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features, and to discuss their characteristics. Finally, we conclude which is an appropriate set of features in automatic genre classification of web documents.  相似文献   

12.
The quality of feedback documents is crucial to the effectiveness of query expansion (QE) in ad hoc retrieval. Recently, machine learning methods have been adopted to tackle this issue by training classifiers from feedback documents. However, the lack of proper training data has prevented these methods from selecting good feedback documents. In this paper, we propose a new method, called AdapCOT, which applies co-training in an adaptive manner to select feedback documents for boosting QE’s effectiveness. Co-training is an effective technique for classification over limited training data, which is particularly suitable for selecting feedback documents. The proposed AdapCOT method makes use of a small set of training documents, and labels the feedback documents according to their quality through an iterative process. Two exclusive sets of term-based features are selected to train the classifiers. Finally, QE is performed on the labeled positive documents. Our extensive experiments show that the proposed method improves QE’s effectiveness, and outperforms strong baselines on various standard TREC collections.  相似文献   

13.
提出一种基于向量空间模型的科技文献自动分类方法.首先构建分类代表文本,计算单文本与分类代表文本之间的相似度;然后利用训练集计算出分类归属度,以判断某一文本分类的可能性,实现科技文献的自动分类.以3个学科的文献为例,验证该方法的可用性.  相似文献   

14.
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms to perform transductive classification considering text collections represented in a vector space model have been proposed. However, the use of these algorithms is unfeasible in practical applications due to the independence assumption among instances or terms and the drawbacks of these algorithms. Network-based algorithms come up to avoid the drawbacks of the algorithms based on vector space model and to improve transductive classification. Networks are mostly used for label propagation, in which some labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful to represent text collections as networks and perform label propagation. The generation of this type of network avoids requirements such as collections with hyperlinks or citations, computation of similarities among all texts in the collection, as well as the setup of a number of parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and the connections are given by the occurrences of terms in documents. The label propagation is performed from documents to terms and then from terms to documents iteratively. Nevertheless, instead of using terms just as means of label propagation, in this article we propose the use of the bipartite network structure to define the relevance scores of terms for classes through an optimization process and then propagate these relevance scores to define labels for unlabeled documents. The new document labels are used to redefine the relevance scores of terms which consequently redefine the labels of unlabeled documents in an iterative process. We demonstrated that the proposed approach surpasses the algorithms for transductive classification based on vector space model or networks. Moreover, we demonstrated that the proposed algorithm effectively makes use of unlabeled documents to improve classification and it is faster than other transductive algorithms.  相似文献   

15.
Text categorization is an important research area and has been receiving much attention due to the growth of the on-line information and of Internet. Automated text categorization is generally cast as a multi-class classification problem. Much of previous work focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, the training time and scaling are also important concerns. On the other hand, other techniques naturally extensible to handle multi-class classification are generally not as accurate as SVM. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflects the inherent similarity from the data. While most of the previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. By using generalized singular value decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.  相似文献   

16.
This research has investigated the feasibility of using a distance measure, called the Bayesian distance, for automatic sequential document classification. It has been shown that by observing the variation of this distance measure as keywords are extracted sequentially from a document, the occurrence of noisy keywords may be detected. This property of the distance measure has been utilized to design a sequential classification algorithm which works in two phases. In the first phase keywords extracted from a document are partitioned into two groups—the good keyword group and the noisy keyword group. In the second phase these two groups of keywords are analyzed separately to assign primary and secondary classes to a document. The algorithm has been applied to several data bases of documents and very encouraging results have been obtained.  相似文献   

17.
Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. a block of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are respectively employed to segment text documents and web pages into blocks. Then, each block is assigned with a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve the retrieval performances over baseline approaches. Document block is validated to be a better unit than the whole document in the manifold-ranking process.  相似文献   

18.
One of the most important problems in information retrieval is determining the order of documents in the answer returned to the user. Many methods and algorithms for document ordering have been proposed. The method introduced in this paper differs from them especially in that it uses a probabilistic model of a document set. In this model documents are regarded as states of a Markov chain, where transition probabilities are directly proportional to similarities between documents. Steady-state probabilities reflect similarities of particular documents to the whole answer set. If documents are ordered according to these probabilities, at the top of a list there will be documents that are the best representatives of the set, and at the bottom those which are the worst representatives. The method was tested against databases INSPEC and Networked Computer Science Technical Reference Library (NCSTRL). Test results are positive. Values of the Kendall rank correlation coefficient indicate high similarity between rankings generated by the proposed method and rankings produced by experts. Results are comparable with rankings generated by the vector model using standard weighting schema tf·idf.  相似文献   

19.
针对图书、期刊论文等数字文献文本特征较少而导致特征向量语义表达不够准确、分类效果差的问题,本文提出一种基于特征语义扩展的数字文献分类方法。该方法首先利用TF-IDF方法获取对数字文献文本表示能力较强、具有较高TF-IDF值的核心特征词;其次分别借助知网(Hownet)语义词典以及开放知识库维基百科(Wikipedia)对核心特征词集进行语义概念的扩展,以构建维度较低、语义丰富的概念向量空间;最后采用MaxEnt、SVM等多种算法构造分类器实现对数字文献的自动分类。实验结果表明:相比传统基于特征选择的短文本分类方法,该方法能有效地实现对短文本特征的语义扩展,提高数字文献分类的分类性能。  相似文献   

20.
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. This method is mainly based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the container document. The weight is based on how spread is the word in a document and not only on its rate of occurrence. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples. For these examples, we obtained an average recall of 46% and an average precision of 64%.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号